├── .DS_Store
├── .gitignore
├── Assets
│   ├── readwise-python_1.png
│   ├── readwise-python_2.png
│   ├── readwise-python_3.png
│   ├── readwise-python_4.png
│   ├── readwise-python_5.png
│   └── readwise-python_6.png
├── README.md
├── readwise-GET.py
├── readwise-GET_install.py
└── readwiseMetadata.py.default

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolevanderhoeven/readwise2directory/972f91d1a79ccc1d7b4bb359cc50a51ea364595e/.DS_Store
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | 
 5 | # C extensions
 6 | *.so
 7 | 
 8 | # Distribution / packaging
 9 | bin/
10 | build/
11 | develop-eggs/
12 | dist/
13 | eggs/
14 | lib/
15 | lib64/
16 | parts/
17 | sdist/
18 | var/
19 | *.egg-info/
20 | .installed.cfg
21 | *.egg
22 | 
23 | # Installer logs
24 | pip-log.txt
25 | pip-delete-this-directory.txt
26 | 
27 | # Unit test / coverage reports
28 | .tox/
29 | .coverage
30 | .cache
31 | nosetests.xml
32 | coverage.xml
33 | 
34 | # Translations
35 | *.mo
36 | 
37 | # Mr Developer
38 | .mr.developer.cfg
39 | .project
40 | .pydevproject
41 | 
42 | # Rope
43 | .ropeproject
44 | 
45 | # Django stuff:
46 | *.log
47 | *.pot
48 | 
49 | # Sphinx documentation
50 | docs/_build/
51 | 
52 | # Metadata with access token
53 | readwiseMetadata.py
54 | 
55 | # readWise Categories
56 | readwiseCategories/
57 | 
58 | # Extra log
59 | readwiseGET_scheduled.log
60 | 
--------------------------------------------------------------------------------
/Assets/readwise-python_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolevanderhoeven/readwise2directory/972f91d1a79ccc1d7b4bb359cc50a51ea364595e/Assets/readwise-python_1.png
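The `.gitignore` above excludes `readwiseMetadata.py` because it holds your personal Readwise access token. As a purely illustrative sketch of that file (not the actual contents of `readwiseMetadata.py.default` — the names `token`, `email`, `pwd`, `chromedriverDirectory`, `sourceDirectory` and `fetchTagsBoolean` appear in `readwise-GET.py` below, the rest come from the README, and every value here is a placeholder):

```python
# Hypothetical sketch of readwiseMetadata.py -- NOT the file shipped in this
# repo; rename readwiseMetadata.py.default and fill in your own values.
token = "YOUR_READWISE_ACCESS_TOKEN"   # required, from https://readwise.io/access_token
targetDirectory = "/path/to/vault"     # required, where markdown notes are written
sourceDirectory = "/path/to/readwise2directory"  # folder containing readwise-GET.py
dateFrom = ""                          # optional, "YYYY-MM-DD"; empty = use last script run
fetchTagsBoolean = False               # optional, True scrapes tags via Selenium
email = ""                             # optional, Readwise login used for tag fetching
pwd = ""                               # optional, Readwise password
chromedriverDirectory = ""             # optional, path to the chromedriver binary
highlightLimitToFetchTags = 10         # optional, recommended default per the README
```

Keeping all credentials in this one ignored module is what lets the rest of the scripts be committed safely.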
--------------------------------------------------------------------------------
/Assets/readwise-python_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolevanderhoeven/readwise2directory/972f91d1a79ccc1d7b4bb359cc50a51ea364595e/Assets/readwise-python_2.png
--------------------------------------------------------------------------------
/Assets/readwise-python_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolevanderhoeven/readwise2directory/972f91d1a79ccc1d7b4bb359cc50a51ea364595e/Assets/readwise-python_3.png
--------------------------------------------------------------------------------
/Assets/readwise-python_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolevanderhoeven/readwise2directory/972f91d1a79ccc1d7b4bb359cc50a51ea364595e/Assets/readwise-python_4.png
--------------------------------------------------------------------------------
/Assets/readwise-python_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolevanderhoeven/readwise2directory/972f91d1a79ccc1d7b4bb359cc50a51ea364595e/Assets/readwise-python_5.png
--------------------------------------------------------------------------------
/Assets/readwise-python_6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolevanderhoeven/readwise2directory/972f91d1a79ccc1d7b4bb359cc50a51ea364595e/Assets/readwise-python_6.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Description
 2 | 
 3 | Fetch new books and highlights from Readwise and print the results as markdown files in a chosen directory (e.g. an Obsidian vault).
 4 | 
 5 | I'm a huge fan of [Readwise](https://readwise.io/) and [Obsidian](https://obsidian.md/), and I hope this is helpful to others like me who wanted something a bit different from the basic markdown export (beta).
 6 | 
 7 | # Features
 8 | 
 9 | - Fetch all or a subset of new books and highlights from Readwise (via their [API](https://readwise.io/api_deets))
10 | - Filter by a custom `date from` or the `last script run` date
11 | - Group and sort highlights by book/article/podcast/tweet
12 | - Create new markdown notes or append to existing ones in a chosen directory (e.g. an Obsidian vault)
13 | - Filenames are formatted using [slugify](https://docs.djangoproject.com/en/3.1/ref/utils/)
14 | - Highlights with the 'discard' tag are removed
15 | - Books with no highlights are ignored
16 | - Markdown notes are formatted as:
17 |     - Book metadata - in YAML format
18 |         - Title
19 |         - Author
20 |         - Number of highlights
21 |         - Last updated date - formatted as "YYMMDD dddd" in wikilinks
22 |         - Readwise URL
23 |     - Title - as a heading 1
24 |     - Highlight data
25 |         - Text
26 |         - Block reference ID - using the Readwise highlight ID as the unique block reference
27 |         - Note
28 |         - Tags - optional
29 |         - References (e.g.
original URL)
30 |         - Date - formatted as "YYMMDD dddd" in wikilinks
31 | - Store book and highlight data in JSON files for easy retrieval and manipulation
32 | - Print outputs to the console and store them in a log file for troubleshooting
33 | 
34 | # Screenshots
35 | 
36 | ##### Markdown note with book metadata as YAML frontmatter
37 | ![](Assets/readwise-python_1.png)
38 | 
39 | ##### Cover images with hyperlinks to their source URLs in Readwise
40 | ![](Assets/readwise-python_6.png)
41 | 
42 | ##### Highlight data with Readwise highlight IDs as unique block references
43 | ![](Assets/readwise-python_2.png)
44 | 
45 | ##### Markdown note with headings (h1-h5) from Readwise
46 | ![](Assets/readwise-python_3.png)
47 | 
48 | ##### Graph view of results
49 | ![](Assets/readwise-python_4.png)
50 | 
51 | ##### Log file of outputs
52 | ![](Assets/readwise-python_5.png)
53 | 
54 | # Installation
55 | 
56 | - Clone this repo or download the ZIP folder and move it to a chosen directory - this will serve as the `sourceDirectory` for running the scripts
57 | - Make sure the `readwiseCategories` folder is in the same directory as the `readwise-GET.py` script. This will store your JSON files.
58 | - Configure the `readwiseMetadata.py` file:
59 |     - Required
60 |         - Rename `readwiseMetadata.py.default` to `readwiseMetadata.py`.
61 |         - Add your token from https://readwise.io/access_token
62 |         - Specify a valid `targetDirectory` path for your markdown notes (e.g. Dropbox folder, Obsidian vault).
63 |     - Optional
64 |         - Customise the request query string - add a `dateFrom` (formatted as "YYYY-MM-DD"); otherwise the `last script run` date will be used (if available), or all highlights will be fetched
65 |         - Add your `email` and `password`
66 |         - Specify a `chromedriverDirectory` - instructions [here](https://chromedriver.chromium.org/)
67 |         - Update the `highlightLimitToFetchTags` - the default is 10 (recommended)
68 |         - Specify a valid `downloadsDirectory`
69 | - Install the Python modules specified in `readwise-GET_install.py` via [pip](https://packaging.python.org/tutorials/installing-packages/)
70 | - Open the terminal or command prompt and navigate to the `sourceDirectory` (i.e. the downloaded folder) - e.g. `cd C:/Users/johnsmith/Downloads/readwise2directory-main`
71 | - Run the `readwise-GET.py` script
72 |     - `py readwise-GET.py` (on Windows) or `python3.9 readwise-GET.py` (on Mac)
73 |     - Note: it takes roughly 3 minutes to process ~1300 books, ~6200 highlights and ~2900 tags
74 | 
75 | # Disclaimers
76 | 
77 | - This is NOT an official plugin or integration, so please use it mindfully.
78 | - This is my first real contribution on GitHub, so I'm open to any and all feedback.
79 | 
80 | # Requirements
81 | - A Readwise account and a valid access token (https://readwise.io/access_token)
82 | - Python 3.9.0+ (https://www.python.org/downloads/)
83 | 
84 | # Contributions
85 | 
86 | [![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/paypalme/nicrivard)
87 | 
88 | If you like this plugin, please consider donating; I really appreciate any and all support!
❤️ 89 | -------------------------------------------------------------------------------- /readwise-GET.py: -------------------------------------------------------------------------------- 1 | ############################## 2 | ### Import python packages ### 3 | ############################## 4 | 5 | import requests, os, io, sys, shutil, django, json, time 6 | #from datetime import datetime 7 | import datetime 8 | from itertools import groupby 9 | from operator import itemgetter 10 | from unidecode import unidecode 11 | from pathvalidate import ValidationError, validate_filepath 12 | from pathlib import Path 13 | from django.utils.text import slugify 14 | from json import JSONEncoder 15 | from json.decoder import JSONDecodeError 16 | import pandas as pd 17 | import numpy as np 18 | from selenium import webdriver 19 | from selenium.webdriver.common.by import By 20 | from selenium.common.exceptions import TimeoutException, NoSuchElementException 21 | from selenium.webdriver.common.keys import Keys 22 | from selenium.webdriver.chrome.options import Options 23 | from selenium.webdriver.support.ui import WebDriverWait 24 | from selenium.webdriver.support import expected_conditions as EC 25 | 26 | ########################## 27 | ### Log script outputs ### 28 | ########################## 29 | 30 | old_stdout = sys.stdout 31 | 32 | old_cwd = os.getcwd() 33 | 34 | startTime = datetime.datetime.now() 35 | 36 | def logDateTimeOutput(message): 37 | log_file = open('readwiseGET.log', 'a') 38 | sys.stdout = log_file 39 | now = datetime.datetime.now() 40 | print(now.strftime("%Y-%m-%dT%H:%M:%SZ") + " " + str(message)) 41 | sys.stdout = old_stdout 42 | log_file.close() 43 | 44 | logDateTimeOutput('Script started') 45 | 46 | ######################## 47 | ### Create functions ### 48 | ######################## 49 | 50 | # Check if a directory variable is defined and formatted correctly 51 | # If TRUE, add a new system path for that. If FALSE, do nothing. 
52 | def insertPath(directory):
53 |     if directory == "" or directory is None:
54 |         return
55 |     else:
56 |         try:
57 |             validate_filepath(directory); sys.path.insert(1, directory)
58 |         except ValidationError as e:
59 |             logDateTimeOutput(e)
60 | 
61 | # Check if a 'dateFrom' variable is defined and formatted correctly
62 | # If TRUE, convert to UTC format. If FALSE, default to dateLastScriptRun
63 | def convertDateFromToUtcFormat(dateFrom):
64 |     if dateFrom == "" or dateFrom is None:
65 |         lastScriptRunDateMatchingString = ' Script complete'
66 |         try:
67 |             for line in reversed(list(open('readwiseGET.log', 'r').readlines())):
68 |                 if lastScriptRunDateMatchingString in line:
69 |                     dateLastScriptRun = str(line.replace(lastScriptRunDateMatchingString, '')).rstrip("\n")
70 |                     dateFrom = dateLastScriptRun
71 |                     message = 'Last successful script run = "' + str(dateFrom) + '" used as dateFrom in query string'
72 |                     logDateTimeOutput(message)
73 |                     print(message)
74 |                     return dateLastScriptRun
75 |         except IOError:
76 |             logDateTimeOutput('Failed to read readwiseGET.log file')
77 |     elif dateFrom != "" and dateFrom is not None:
78 |         try:
79 |             dateFrom = datetime.datetime.strptime(dateFrom, '%Y-%m-%d')
80 |             dateFrom = dateFrom.strftime("%Y-%m-%dT%H:%M:%SZ")
81 |             message = 'Date from = "' + str(dateFrom) + '" from readwiseMetadata used in query string'
82 |             logDateTimeOutput(message)
83 |             print(message)
84 |             return dateFrom
85 |         except ValueError:
86 |             logDateTimeOutput("Incorrect date format. It should be 'YYYY-MM-DD'")
87 |     else:
88 |         message = 'No dateFrom variable defined in readwiseMetadata or readwiseGET.log.
Fetching all readwise highlights' 89 | logDateTimeOutput(message) 90 | print(message) 91 | 92 | def replaceNoneInListOfDict(listOfDicts): 93 | for i in range(len(listOfDicts)): 94 | for k, v in iter(listOfDicts[i].items()): 95 | if k == 'location' and v is None: 96 | listOfDicts[i][k] = 0 97 | if k == 'location_type' and v == 'none': 98 | listOfDicts[i][k] = 'custom' 99 | 100 | ###################################################### 101 | ### Manipulating book and highlight data with JSON ### 102 | ###################################################### 103 | 104 | # Load JSON file into list of categories objects 105 | def loadBookDataFromJsonToObject(): 106 | for i in range(len(categoriesObjectNames)): 107 | try: 108 | with open(sourceDirectory + "/readwiseCategories/" + categoriesObjectNames[i] + ".json", 'r') as infile: 109 | try: 110 | categoriesObject[i] = json.load(infile) # list of categories objects with up-to-date data loaded from JSON files 111 | message = str(len(categoriesObject[i])) + ' books loaded from ' + str(categoriesObjectNames[i]) + '.json' 112 | logDateTimeOutput(message) 113 | except JSONDecodeError: 114 | categoriesObject[i] = [] 115 | except FileNotFoundError: 116 | categoriesObject[i] = [] 117 | 118 | # Check if 'book_id' exists already. 
If no, append book data to the relevant category object 119 | def appendBookDataToObject(): 120 | newBooksCounter = 0 121 | updatedBooksCounter = 0 122 | totalNumberOfBooks = len(booksListResultsSort) 123 | print('totalNumber of Books = ' + str(totalNumberOfBooks)) 124 | for key, value in booksListResultsGroup: # key = 'category' 125 | old_newBooksCounter = newBooksCounter 126 | old_updatedBooksCounter = updatedBooksCounter 127 | for data in value: 128 | book_id = str(data['id']) 129 | title = unidecode(data['title']) 130 | if(str(data['author']) == "None"): 131 | author = " " 132 | else: 133 | author = unidecode(data['author']) 134 | source = data['category'] 135 | num_highlights = data['num_highlights'] 136 | updated = data['updated'] 137 | cover_image_url = data['cover_image_url'] 138 | url = data['highlights_url'] 139 | source_url = data['source_url'] 140 | highlights = [] 141 | values = { "book_id" : book_id, "title" : title, "author" : author, "source" : source, "url" : url, "cover_image_url" : cover_image_url, "source_url" : source_url, "num_highlights" : num_highlights, "updated" : updated, "highlights" : highlights } 142 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 143 | print('title = ' + title) 144 | if not any(d["book_id"] == book_id for d in categoriesObject[indexCategory]): 145 | categoriesObject[indexCategory].append(values) 146 | newBooksCounter += 1 147 | print(str((newBooksCounter + updatedBooksCounter)) + '/' + str(len(booksListResultsSort)) + ' books added or updated') 148 | print('New title = ' + title) 149 | else: 150 | indexBook = list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(book_id) 151 | categoriesObject[indexCategory][indexBook]['book_id'] = book_id 152 | categoriesObject[indexCategory][indexBook]['title'] = title 153 | categoriesObject[indexCategory][indexBook]['author'] = author 154 | 
categoriesObject[indexCategory][indexBook]['source'] = source 155 | categoriesObject[indexCategory][indexBook]['num_highlights'] = num_highlights 156 | categoriesObject[indexCategory][indexBook]['updated'] = updated 157 | categoriesObject[indexCategory][indexBook]['cover_image_url'] = cover_image_url 158 | categoriesObject[indexCategory][indexBook]['url'] = url 159 | categoriesObject[indexCategory][indexBook]['source_url'] = source_url 160 | updatedBooksCounter += 1 161 | print(str((newBooksCounter + updatedBooksCounter)) + '/' + str(len(booksListResultsSort)) + ' books added or updated') 162 | new_newBooksCounter = newBooksCounter 163 | new_updatedBooksCounter = updatedBooksCounter 164 | message = str(new_newBooksCounter - old_newBooksCounter) + ' new books added and ' + str(new_updatedBooksCounter - old_updatedBooksCounter) + ' updated in ' + str(categoriesObjectNames[indexCategory]) + ' object' 165 | logDateTimeOutput(message) 166 | 167 | # Check if 'highlight_id' exists already. If no, append highlight data to the relevant 'book_id' within the category object 168 | def appendHighlightDataToObject(): 169 | newHighlightsCounter = 0 170 | updatedHighlightsCounter = 0 171 | for key, value in highlightsListResultsGroup: # key = 'book_id' 172 | listCategories = [item for category in categoriesObject for item in category] 173 | if any(d.get('book_id') == str(key) for d in listCategories): # Check if the 'book_id' from the grouped highlights exists. 
174 | index = list(map(itemgetter('book_id'), listCategories)).index(str(key)) 175 | source = listCategories[index]['source'] # Get the 'category' of the corresponding 'book_id' from the grouped highlights 176 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 177 | indexBook = list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(str(key)) # Identify which position the 'book_id' corresponds to within the category object 178 | for data in value: 179 | id = str(data['id']) 180 | note = unidecode(data['note']) 181 | location = str(data['location']) 182 | location_type = data['location_type'] 183 | book_id = str(data['book_id']) 184 | url = str(data['url']) 185 | highlighted_at = str(data['highlighted_at']) 186 | updated = str(data['updated']) 187 | text = unidecode(data['text']) 188 | tags = [] 189 | # If the source is a book, add 10 hours to the highlighted at date to account for timezone difference between me and AKST (Amazon's highlighted at timezone) 190 | # if (str(source) == 'books'): 191 | # highlighted_at = str(data['highlighted_at']) # 2021-02-06T04:56:00Z 192 | # highlighted_at = datetime.datetime.strptime(highlighted_at, "%Y-%m-%dT%H:%M:%SZ") 193 | # highlighted_at = highlighted_at + datetime.timedelta(hours=10) 194 | # highlighted_at = highlighted_at.strftime("%Y-%m-%dT%H:%M:%SZ") 195 | # print(' appendHighlightDataToObject highlighted_at =', highlighted_at) 196 | # highlight = { "id" : id, "text" : text, "note" : note, "tags" : tags, "location" : location, "location_type" : location_type, "url" : url, "highlighted_at" : highlighted_at, "updated" : updated } 197 | if not any(d["id"] == id for d in categoriesObject[indexCategory][indexBook]['highlights']): 198 | highlight = { "id" : id, "text" : text, "note" : note, "tags" : tags, "location" : location, "location_type" : location_type, "url" : url, "highlighted_at" : highlighted_at, "updated" : 
updated } 199 | categoriesObject[indexCategory][indexBook]['highlights'].append(highlight) 200 | sorted(categoriesObject[indexCategory][indexBook]['highlights'], key = itemgetter('location')) 201 | newHighlightsCounter += 1 202 | listOfBookIdsToUpdateMarkdownNotes.append([str(key), str(source)]) 203 | print(str((newHighlightsCounter + updatedHighlightsCounter)) + '/' + str(len(highlightsListResultsSort)) + ' highlights added or updated') 204 | else: 205 | indexHighlight = list(map(itemgetter('id'), categoriesObject[indexCategory][indexBook]['highlights'])).index(id) # Should be the same as 'data' 206 | tags = categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['tags'] 207 | highlight = { "id" : id, "text" : text, "note" : note, "tags" : tags, "location" : location, "location_type" : location_type, "url" : url, "highlighted_at" : highlighted_at, "updated" : updated } 208 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight] = highlight 209 | sorted(categoriesObject[indexCategory][indexBook]['highlights'], key = itemgetter('location')) 210 | updatedHighlightsCounter += 1 211 | listOfBookIdsToUpdateMarkdownNotes.append([str(key), str(source)]) 212 | print(str((newHighlightsCounter + updatedHighlightsCounter)) + '/' + str(len(highlightsListResultsSort)) + ' highlights added or updated') 213 | message = str(newHighlightsCounter) + ' new highlights added and ' + str(updatedHighlightsCounter) + ' updated (excl tags)' # '.json' 214 | logDateTimeOutput(message) 215 | 216 | def appendTagsToHighlightObject(list_highlights): 217 | if fetchTagsBoolean is False: 218 | return 219 | else: 220 | if len(list_highlights) == 0: 221 | return 222 | else: 223 | # Open new Chrome window via Selenium 224 | print('Opening new Chrome browser window...') 225 | options = webdriver.ChromeOptions() 226 | options.add_argument('--ignore-certificate-errors') 227 | options.add_argument('--incognito') 228 | options.add_argument('--headless') 229 | 
options.add_argument('--log-level=3') # to stop logging 230 | options.add_argument("start-maximized") 231 | driver = webdriver.Chrome(chromedriverDirectory, options=options) 232 | # driver = webdriver.Chrome(chromedriverDirectory) 233 | driver.get('https://readwise.io/accounts/login') 234 | print('Logging into readwise using credentials provided in readwiseMetadata') 235 | # Input email as username from readwiseMetadata 236 | username = driver.find_element_by_xpath("//*[@id='id_login']") 237 | username.clear() 238 | username.send_keys(email) # from 'readwiseMetadata' 239 | # Input password from readwiseMetadata 240 | password = driver.find_element_by_xpath("//*[@id='id_password']") 241 | password.clear() 242 | password.send_keys(pwd) # from 'readwiseMetadata' 243 | # Click login button 244 | driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div/div/div/form/div[3]/button").click() 245 | print('Log-in successful! Fetching tags...') 246 | # Loop through new highlights 247 | updatedTagsCounter = 0 248 | newOrUpdatedTagsProgressCounter = 0 249 | for i in range(len(list_highlights)): # key = 'book_id' 250 | listCategories = [item for category in categoriesObject for item in category] 251 | key = str(list_highlights[i]['book_id']) 252 | id = str(list_highlights[i]['id']) 253 | index = list(map(itemgetter('book_id'), listCategories)).index(key) 254 | source = listCategories[index]['source'] # Get the 'category' of the corresponding 'book_id' from the grouped highlights 255 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 256 | indexBook = list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(str(key)) # Identify which position the 'book_id' corresponds to within the category object 257 | bookLastUpdated = categoriesObject[indexCategory][indexBook]['updated'] 258 | indexHighlight = list(map(itemgetter('id'), 
categoriesObject[indexCategory][indexBook]['highlights'])).index(id)
259 |             # highlights = categoriesObject[indexCategory][indexBook]['highlights']
260 |             book_id = categoriesObject[indexCategory][indexBook]['book_id']
261 |             bookReviewUrl = 'https://readwise.io/bookreview/' + book_id
262 |             # Open new tab in Chrome window
263 |             driver.find_element_by_tag_name('body').send_keys(Keys.COMMAND + 't')
264 |             # driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 't')
265 |             driver.get(bookReviewUrl)
266 |             # Loop through tags and append to highlight object within corresponding book object
267 |             try:
268 |                 xPathHighlightId = "//*[@id=\'highlight" + id + "\']"
269 |                 highlightIdBlock = WebDriverWait(driver, 10).until(
270 |                     EC.presence_of_element_located((By.XPATH, xPathHighlightId))
271 |                 )
272 |                 tagLinks = highlightIdBlock.find_elements_by_class_name("tag-link") # Get tags within 'highlight id' block
273 |                 # Load original tags (if they exist)
274 |                 originalTags = categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['tags']
275 |                 originalTags = sorted(originalTags)
276 |                 originalTagsCounter = len(originalTags)
277 |                 # Skip highlights with no tags
278 |                 if tagLinks == []:
279 |                     continue
280 |                 newTags = []
281 |                 for tag in tagLinks:
282 |                     originalHref = tag.get_attribute("href") # e.g. https://readwise.io/tags/
283 |                     trimHref = originalHref.replace('https://readwise.io/tags/', '')
284 |                     if trimHref == 'readwise': # skip the automatic 'readwise' tag
285 |                         continue
286 | newTags.append(trimHref) 287 | newTags = sorted(newTags) 288 | newTagsCounter = len(newTags) 289 | if originalTags == newTags: 290 | pass 291 | else: 292 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['tags'] = newTags 293 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['updated'] = bookLastUpdated 294 | updatedTagsCounter += abs((newTagsCounter - originalTagsCounter)) 295 | newOrUpdatedTagsProgressCounter += 1 296 | listOfBookIdsToUpdateMarkdownNotes.append([str(key), str(source)]) 297 | print(str(newOrUpdatedTagsProgressCounter) + '/' + str(len(list_highlights)) + ' highlights updated with tags') 298 | except: 299 | message = 'Error looping through tags in highlight id block "' + str(id) + '". Book id: "' + str(book_id) + '". Book URL: "' + str(bookReviewUrl) + '". File: "' \ 300 | + str(categoriesObjectNames[indexCategory]) + '.json". Book location: "' + str(indexBook) + '". Highlight location: "' + str(indexHighlight) + '".' 
301 | logDateTimeOutput(message) 302 | pass 303 | driver.quit() 304 | try: 305 | message = str(updatedTagsCounter) + ' tags added or updated to ' + str(len(list_highlights)) + ' highlights in ' + str(categoriesObjectNames[indexCategory]) + ' object' 306 | logDateTimeOutput(message) 307 | except UnboundLocalError: 308 | message = 'No tags to add or update' 309 | logDateTimeOutput(message) 310 | 311 | def appendUpdatedHighlightsToObject(): 312 | listOfBookIdsFromBooksList = [] 313 | listOfBookIdsFromHighlightsList = [] 314 | listofBookIdsWithMissingHighlights = [] 315 | for i in range(len(booksListResultsSort)): 316 | listOfBookIdsFromBooksList.append(str(booksListResultsSort[i]['id'])) 317 | for i in range(len(highlightsListResultsSort)): 318 | listOfBookIdsFromHighlightsList.append(str(highlightsListResultsSort[i]['book_id'])) 319 | listOfBookIdsFromHighlightsList = list(dict.fromkeys(listOfBookIdsFromHighlightsList)) # Remove duplicates 320 | for i in range(len(listOfBookIdsFromBooksList)): 321 | if listOfBookIdsFromBooksList[i] not in listOfBookIdsFromHighlightsList: 322 | listofBookIdsWithMissingHighlights.append(str(listOfBookIdsFromBooksList[i])) 323 | else: 324 | pass 325 | for i in range(len(listofBookIdsWithMissingHighlights)): 326 | missingHighlightsListQueryString = { 327 | "page_size": 1000, # 1000 items per page - maximum 328 | "page": 1, # Page 1 >> build for loop to cycle through pages and stop when complete 329 | "book_id": listofBookIdsWithMissingHighlights[i], 330 | } 331 | # Trigger GET request with missingHighlightsListQueryString 332 | missingHighlightsList = requests.get( 333 | url="https://readwise.io/api/v2/highlights/", 334 | headers={"Authorization": "Token " + token}, # token imported from readwiseAccessToken file 335 | params=missingHighlightsListQueryString # query string object 336 | ) 337 | # Convert response into JSON object 338 | try: 339 | missingHighlightsListJson = missingHighlightsList.json() # type(missingHighlightsListJson) = 
'dictionary' 340 | except ValueError: 341 | message = 'Response content from missingHighlightsList request is not valid JSON' 342 | logDateTimeOutput(message) 343 | print(message) # Originally from https://github.com/psf/requests/issues/4908#issuecomment-627486125 344 | break 345 | # JSONDecodeError: Expecting value: line 1 column 1 (char 0) specifically happens with an empty string (i.e. empty response content) 346 | try: 347 | # Create dictionary of missingHighlightsListJson['results'] 348 | missingHighlightsListResults = missingHighlightsListJson['results'] # type(highlightsListResults) = 'list' 349 | except NameError: 350 | message = 'Cannot extract results from empty JSON for missingHighlightsList request' 351 | logDateTimeOutput(message) 352 | print(message) 353 | break 354 | # Loop through pagination using 'next' property from GET response 355 | try: 356 | additionalLoopCounter = 0 357 | while missingHighlightsListJson['next']: 358 | additionalLoopCounter += 1 359 | print('Fetching additional missing highlight data from readwise... (page ' + str(additionalLoopCounter) + ')') 360 | missingHighlightsList = requests.get( 361 | url=missingHighlightsListJson['next'], # keep same query parameters from booksListQueryString object 362 | headers={"Authorization": "Token " + token}, # token imported from readwiseAccessToken file 363 | ) 364 | try: 365 | print('Converting additional missing highlight data returned into JSON... (page ' + str(additionalLoopCounter) + ')') 366 | missingHighlightsListJson = missingHighlightsList.json() # type(missingHighlightsListJson) = 'dictionary' 367 | except ValueError: 368 | message = 'Response content from additional missingHighlightsList request is not valid JSON' 369 | logDateTimeOutput(message) 370 | print(message) # Originally from https://github.com/psf/requests/issues/4908#issuecomment-627486125 371 | # JSONDecodeError: Expecting value: line 1 column 1 (char 0) specifically happens with an empty string (i.e. 
empty response content) 372 | break 373 | try: 374 | missingHighlightsListResults.extend(missingHighlightsListJson['results']) 375 | except NameError: 376 | message = 'Cannot extract results from empty JSON for additional missingHighlightsList request' 377 | logDateTimeOutput(message) 378 | print(message) 379 | break 380 | except NameError: 381 | message = 'Cannot loop through pagination from empty response' 382 | logDateTimeOutput(message) 383 | print(message) 384 | break 385 | # Replace 'location': None and 'location_type': 'none' values in list of dictionaries 386 | replaceNoneInListOfDict(missingHighlightsListResults) 387 | # Sort highlightsListResults data by 'book_id' key and 'location' 388 | missingHighlightsListResultsSort = sorted(missingHighlightsListResults, key = itemgetter('location')) 389 | newMissingHighlightsCounter = 0 390 | updatedMissingHighlightsCounter = 0 391 | if len(missingHighlightsListResults) == 0: 392 | break 393 | else: 394 | try: 395 | for j in range(len(missingHighlightsListResultsSort)): 396 | listCategories = [item for category in categoriesObject for item in category] 397 | book_id = str(missingHighlightsListResultsSort[j]['book_id']) 398 | index = list(map(itemgetter('book_id'), listCategories)).index(book_id) 399 | source = listCategories[index]['source'] # Get the 'category' of the corresponding 'book_id' from the grouped highlights 400 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 401 | indexBook = list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(str(book_id)) # Identify which position the 'book_id' corresponds to within the category object 402 | bookLastUpdated = categoriesObject[indexCategory][indexBook]['updated'] 403 | id = str(missingHighlightsListResultsSort[j]['id']) 404 | note = unidecode(missingHighlightsListResultsSort[j]['note']) 405 | location = 
str(missingHighlightsListResultsSort[j]['location'])
406 |                 location_type = missingHighlightsListResultsSort[j]['location_type']
407 |                 url = str(missingHighlightsListResultsSort[j]['url'])
408 |                 # If the source is a book, add 10 hours to the highlighted at date to account for timezone difference between me and AKST (Amazon's highlighted at timezone)
409 |                 highlighted_at = str(missingHighlightsListResultsSort[j]['highlighted_at']) # 2021-02-06T04:56:00Z
410 |                 if (str(source) == 'books'):
411 |                     highlighted_at = datetime.datetime.strptime(highlighted_at, "%Y-%m-%dT%H:%M:%SZ")
412 |                     highlighted_at = highlighted_at + datetime.timedelta(hours=10)
413 |                     highlighted_at = highlighted_at.strftime("%Y-%m-%dT%H:%M:%SZ")
414 |                     print('appendUpdatedHighlightsToObject highlighted_at =', highlighted_at)
415 | 
416 |                 updated = str(missingHighlightsListResultsSort[j]['updated'])
417 |                 text = unidecode(missingHighlightsListResultsSort[j]['text'])
418 |                 tags = []
419 |                 highlight = { "id" : id, "text" : text, "note" : note, "tags" : tags, "location" : location, "location_type" : location_type, "url" : url, "highlighted_at" : highlighted_at, "updated" : updated }
420 |                 if not any(d["id"] == id for d in categoriesObject[indexCategory][indexBook]['highlights']):
421 |                     categoriesObject[indexCategory][indexBook]['highlights'].append(highlight)
422 |                     sorted(categoriesObject[indexCategory][indexBook]['highlights'], key = itemgetter('location'))
423 |                     newMissingHighlightsCounter += 1
424 |                     listOfBookIdsToUpdateMarkdownNotes.append([str(book_id), str(source)])
425 |                     print(str((newMissingHighlightsCounter + updatedMissingHighlightsCounter)) + '/' + str(len(missingHighlightsListResultsSort)) + ' missing highlights added or updated')
426 |                 else:
427 |                     indexHighlight = list(map(itemgetter('id'), categoriesObject[indexCategory][indexBook]['highlights'])).index(id)
428 |                     if categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['text'] ==
text: 429 | if categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['note'] == note: 430 | pass 431 | else: 432 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['note'] = note 433 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['updated'] = bookLastUpdated 434 | else: 435 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['text'] = text 436 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['updated'] = bookLastUpdated 437 | if categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['note'] == note: 438 | pass 439 | else: 440 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['note'] = note 441 | categoriesObject[indexCategory][indexBook]['highlights'].sort(key = itemgetter('location')) # sort in place; a bare sorted() call discards its result 442 | updatedMissingHighlightsCounter += 1 443 | listOfBookIdsToUpdateMarkdownNotes.append([str(book_id), str(source)]) 444 | print(str((newMissingHighlightsCounter + updatedMissingHighlightsCounter)) + '/' + str(len(missingHighlightsListResultsSort)) + ' missing highlights added or updated in ' \ 445 | + str(categoriesObjectNames[indexCategory]) + ' object') 446 | except ValueError: 447 | pass 448 | try: 449 | message = str(updatedMissingHighlightsCounter) + ' highlights updated (incl tags) in ' + str(categoriesObjectNames[indexCategory]) + ' object' # '.json' 450 | logDateTimeOutput(message) 451 | appendHighlightsToListForFetchingTags(missingHighlightsListToFetchTagsFor, missingHighlightsListResultsSort) 452 | appendHighlightsToListForFetchingTags(allHighlightsToFetchTagsFor, missingHighlightsListResultsSort) 453 | # appendTagsToHighlightObject(missingHighlightsListResultsSort) 454 | except UnboundLocalError: 455 | message = 'No missing highlights (incl tags) to update' 456 | logDateTimeOutput(message) 457 | 458 | def appendBookAndHighlightObjectToJson(): 459 | for i in range(len(categoriesObjectNames)): 460 | try: 461
| with open(os.path.join(sourceDirectory, "readwiseCategories", categoriesObjectNames[i] + ".json"), 'w') as outfile: 462 | json.dump(categoriesObject[i], outfile, indent=4) 463 | except FileNotFoundError: # raised when the readwiseCategories directory does not exist yet 464 | os.makedirs(os.path.join(sourceDirectory, "readwiseCategories"), exist_ok=True) # mode 'x' would fail for the same reason, so create the directory and retry 465 | with open(os.path.join(sourceDirectory, "readwiseCategories", categoriesObjectNames[i] + ".json"), 'w') as outfile: json.dump(categoriesObject[i], outfile, indent=4) 466 | 467 | def replaceNoneInListOfDict(listOfDicts): 468 | for i in range(len(listOfDicts)): 469 | for k, v in iter(listOfDicts[i].items()): 470 | if k == 'location' and v is None: 471 | listOfDicts[i][k] = 0 472 | if k == 'location_type' and v == 'none': 473 | listOfDicts[i][k] = 'custom' 474 | if k == 'highlighted_at' and v is None: 475 | listOfDicts[i][k] = str(v) 476 | 477 | def removeHighlightsWithDiscardTag(): 478 | listCategories = list(categoriesObject) 479 | highlightsWithDiscardTagCounter = 0 480 | for i in range(len(listCategories)): 481 | for k in range(len(listCategories[i])): 482 | book_id = str(listCategories[i][k]['book_id']) 483 | source = str(listCategories[i][k]['source']) 484 | originalNumberOfhighlights = listCategories[i][k]['num_highlights'] 485 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 486 | indexBook = list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(str(book_id)) # Identify which position the 'book_id' corresponds to within the category object 487 | originalListOfHighlights = listCategories[i][k]['highlights'].copy() 488 | newListOfHighlights = categoriesObject[indexCategory][indexBook]['highlights'].copy() 489 | for n in range(len(originalListOfHighlights)): 490 | try: 491 | if any('discard' in s for s in listCategories[i][k]['highlights'][n]['tags']): 492 | id = listCategories[i][k]['highlights'][n]['id'] 493 | indexHighlight = list(map(itemgetter('id'), newListOfHighlights)).index(str(id)) 494 | newListOfHighlights.pop(indexHighlight)
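As a self-contained illustration of the normalisation performed by `replaceNoneInListOfDict` above (the sample dictionary is hypothetical, and the helper below is a minimal mirror of the function, not the script's actual code):

```python
# Minimal mirror of replaceNoneInListOfDict, with hypothetical input,
# showing the three normalisation rules in isolation.
def replace_none_in_list_of_dict(list_of_dicts):
    for d in list_of_dicts:
        if d.get('location') is None:
            d['location'] = 0               # missing locations become 0 so sorting works
        if d.get('location_type') == 'none':
            d['location_type'] = 'custom'
        if d.get('highlighted_at') is None:
            d['highlighted_at'] = str(None)  # becomes the literal string 'None'

sample = [{'location': None, 'location_type': 'none', 'highlighted_at': None}]
replace_none_in_list_of_dict(sample)
print(sample[0])  # {'location': 0, 'location_type': 'custom', 'highlighted_at': 'None'}
```

This is why the script can later sort highlights with `itemgetter('location')` without a `TypeError` from comparing `None` to integers.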
495 | # listCategories[i][k]['highlights'].pop(n) # Remove highlight with 'discard' tag from list 496 | highlightsWithDiscardTagCounter += 1 497 | except IndexError: 498 | continue 499 | categoriesObject[indexCategory][indexBook]['highlights'] = newListOfHighlights 500 | newNumberOfhighlights = len(newListOfHighlights) 501 | categoriesObject[indexCategory][indexBook]['num_highlights'] = newNumberOfhighlights 502 | if str(originalNumberOfhighlights - newNumberOfhighlights) == '0': 503 | pass 504 | else: 505 | print(str(originalNumberOfhighlights - newNumberOfhighlights) + ' highlights removed from ' + str(listCategories[i][k]['book_id'])) 506 | message = str(highlightsWithDiscardTagCounter) + ' highlights discarded' 507 | logDateTimeOutput(message) 508 | print(message) 509 | 510 | def appendHashtagToTags(): 511 | listCategories = list(categoriesObject) 512 | tagsWithNoHashtag = 0 513 | for i in range(len(listCategories)): 514 | for k in range(len(listCategories[i])): 515 | book_id = str(listCategories[i][k]['book_id']) 516 | source = str(listCategories[i][k]['source']) 517 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 518 | indexBook = list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(str(book_id)) # Identify which position the 'book_id' corresponds to within the category object 519 | for n in range(len(listCategories[i][k]['highlights'])): 520 | id = listCategories[i][k]['highlights'][n]['id'] 521 | indexHighlight = list(map(itemgetter('id'), categoriesObject[indexCategory][indexBook]['highlights'])).index(str(id)) 522 | for t in range(len(listCategories[i][k]['highlights'][n]['tags'])): 523 | tag = str(listCategories[i][k]['highlights'][n]['tags'][t]) 524 | positionTag = listCategories[i][k]['highlights'][n]['tags'].index(tag) # Should be the same as 't' 525 | if listCategories[i][k]['highlights'][n]['tags'][t].startswith('#'): 526 | pass 527 | 
else: 528 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['tags'][positionTag] = '#' + \ 529 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['tags'][positionTag] 530 | # listCategories[i][k]['highlights'][n]['tags'][t] = '#' + listCategories[i][k]['highlights'][n]['tags'][t] 531 | tagsWithNoHashtag += 1 532 | message = str(tagsWithNoHashtag) + ' tags updated with hashtags' 533 | print(message) 534 | 535 | # Set boolean value to determine if tags should be fetched (default = True) 536 | # If any of the optional input variables in readwiseMetadata are blank or missing, set boolean to False 537 | fetchTagsBoolean = True 538 | 539 | def fetchTagsTrueOrFalse(fetchTagsBoolean, inputVariable): 540 | if fetchTagsBoolean is False: 541 | return False 542 | elif inputVariable == "" or inputVariable is None: 543 | return False 544 | else: 545 | return True 546 | 547 | ################################################ 548 | ### Load CSV export into dataframe and lists ### 549 | ################################################ 550 | 551 | def latest_download_file(): 552 | path = sourceDirectory 553 | files = sorted(os.listdir(path), key=lambda f: os.path.getmtime(os.path.join(path, f))) # list the download directory itself (not the cwd) and resolve modification times against full paths 554 | newest = files[-1] 555 | return newest 556 | 557 | def download_wait(): 558 | seconds = 0 559 | dl_wait = True 560 | while dl_wait and seconds < 20: 561 | time.sleep(1) 562 | dl_wait = False 563 | for fname in os.listdir(sourceDirectory): 564 | if fname.endswith('.crdownload'): 565 | dl_wait = True 566 | seconds += 1 567 | newest_file = latest_download_file() 568 | return newest_file 569 | 570 | ####### V2.0 ####### 571 | 572 | # Use Selenium to export CSV extract of highlight data, and save in sourceDirectory 573 | def downloadCsvExport(latestDownloadedFileName): # with_ublock=False, chromedriverDirectory=None 574 | if fetchTagsBoolean is False: 575 | return 576 | else: 577 | # Open new Chrome window via Selenium 578 | print('Opening new Chrome browser 
window...') 579 | options = webdriver.ChromeOptions() 580 | options.add_argument("--headless") 581 | options.add_argument("window-size=1920,1080") 582 | options.add_argument("--log-level=3") # to stop logging 583 | options.add_argument("--silent") 584 | options.add_argument("--disable-logging") 585 | options.add_argument("--disable-blink-features=AutomationControlled") 586 | options.add_experimental_option('prefs', { 587 | # "download.default_directory": downloadsDirectory, # Set own Download path 588 | "download.prompt_for_download": False, # Do not ask for download at runtime 589 | "download.directory_upgrade": True, # Also needed to suppress download prompt 590 | "w3c": False, # allows selenium to accept cookies with a non-int64 'expiry' value 591 | }) 592 | options.add_experimental_option("excludeSwitches", ["enable-logging", "enable-automation"]) # standalone experimental option, not a pref; removes the 'DevTools listening' log message and helps prevent Cloudflare from detecting ChromeDriver as a bot (duplicate dict keys would silently override each other) 593 | options.add_experimental_option("useAutomationExtension", False) 594 | 595 | driver = webdriver.Chrome( 596 | executable_path=chromedriverDirectory, 597 | options=options, 598 | ) 599 | driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command') 600 | params = {'behavior': 'allow', 'downloadPath': sourceDirectory} 601 | driver.execute_cdp_cmd('Page.setDownloadBehavior', params) 602 | driver.get('https://readwise.io/accounts/login') 603 | print('Logging into readwise using credentials provided in readwiseMetadata') 604 | # Input email as username from readwiseMetadata 605 | WebDriverWait(driver, 10).until( 606 | EC.presence_of_element_located((By.XPATH, "//*[@id='id_login']"))) 607 | username = driver.find_element_by_xpath("//*[@id='id_login']") 608 | username.clear() 609 | username.send_keys(email) # from 'readwiseMetadata' 610 | # Input password from readwiseMetadata 611 | WebDriverWait(driver, 10).until( 612 | EC.presence_of_element_located((By.XPATH, "//*[@id='id_password']"))) 613 | password = 
driver.find_element_by_xpath("//*[@id='id_password']") 614 | password.clear() 615 | password.send_keys(pwd) # from 'readwiseMetadata' 616 | # Click login button 617 | WebDriverWait(driver, 10).until( 618 | EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/div/div/div/div/div/div/form/div[3]/button"))) 619 | driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div/div/div/form/div[3]/button").click() 620 | print('Log-in successful! Redirecting to export page...') 621 | driver.get('https://readwise.io/export') 622 | # Click export CSV button 623 | WebDriverWait(driver, 10).until( 624 | EC.presence_of_element_located((By.XPATH, "//*[@id='MiscApp']/div/div[3]/div/div[1]/div/div[2]/div/button"))) 625 | driver.find_element_by_xpath("//*[@id='MiscApp']/div/div[3]/div/div[1]/div/div[2]/div/button").click() 626 | print('Redirect successful! Waiting for CSV export...') 627 | dlFilename = download_wait() 628 | # rename the downloaded file 629 | shutil.move(dlFilename, os.path.join(sourceDirectory, latestDownloadedFileName)) 630 | message = str(latestDownloadedFileName) + ' successfully added to ' + str(sourceDirectory) 631 | logDateTimeOutput(message) 632 | print(message) 633 | print('Closing Chrome browser window...') 634 | driver.quit() 635 | 636 | # Clean-up list values 637 | def cleanUpListValues(listFromCsv, replacementCharacter): 638 | for i in range(len(listFromCsv)): 639 | if(str(listFromCsv[i]) == "nan"): 640 | listFromCsv[i] = str(replacementCharacter) 641 | else: 642 | listFromCsv[i] = unidecode(str(listFromCsv[i])) 643 | 644 | # Make book titles valid filenames via Django 645 | def convertTitleToValidFilename(listToConvert): 646 | for i in range(len(listToConvert)): 647 | listToConvert[i] = slugify(listToConvert[i]) 648 | # listToConvert[i] = get_valid_filename_django(listToConvert[i]) 649 | 650 | # Convert all book titles to lowercase 651 | def toLowercase(listToConvert): 652 | for i in range(len(listToConvert)): 653 | listToConvert[i] = 
listToConvert[i].lower() 654 | 655 | # Replace empty CSV cells of 'Tags' with "" 656 | def replaceEmptyTagCells(list_Tags): 657 | for i in range(len(list_Tags)): 658 | if(str(list_Tags[i]) == "nan"): 659 | list_Tags[i] = "" 660 | else: 661 | list_Tags[i] = list_Tags[i].replace(',', ' ') 662 | 663 | # Normalise date strings e.g. 2020-01-01T12:59:59Z >> 2020-01-01 12:59:59 664 | def dateStringNormaliser(dateString): 665 | for i in range(len(dateString)): 666 | dateString[i] = dateString[i].replace('T', ' ')[0 : 19] 667 | 668 | # Create empty lists to fill data from CSV 669 | list_Highlight = [] 670 | list_BookTitle = [] 671 | list_BookAuthor = [] 672 | list_AmazonBookId = [] 673 | list_Note = [] 674 | list_Color = [] 675 | list_Tags = [] 676 | list_LocationType = [] 677 | list_Location = [] 678 | list_HighlightedAt = [] 679 | 680 | # Create additional lists to supplement ones provided in the CSV export 681 | list_ReadwiseBookId = [] # 'Readwise Book ID' 682 | list_Source = [] # 'Source' # e.g. 
Articles 683 | list_Url = [] # 'Url' 684 | list_NumberOfHighlights = [] # 'Number of Highlights' 685 | list_UpdatedAt = [] # 'Updated at' 686 | list_HighlightId = [] # 'Highlight ID' 687 | 688 | # Fill newly-created lists with empty values to aid with index matching 689 | def fillListWithEmptyCharacters(listToGetRangeFrom, listToFill): 690 | for i in range(len(listToGetRangeFrom)): 691 | listToFill.append("") 692 | 693 | # Create lists to add extracted highlight data from API calls 694 | # Then we can compare these lists to to those from the CSV export to retrieve highlight id's, book id's, and highlight tags 695 | list_extractedHighlightId = [] 696 | list_extractedHighlightText = [] 697 | list_extractedHighlightTags = [] 698 | list_extractedHighlightBookId = [] 699 | list_extractedHighlightBookTitle = [] 700 | list_extractedHighlightBookAuthor = [] 701 | list_extractedHighlightLocation = [] 702 | list_extractedHighlightedAt = [] 703 | 704 | # Create lists to collect fallouts e.g. no highlight id retrieved from highlight text, highlight text has duplicate values 705 | list_noMatchingHighlightIdFromText = [] 706 | list_noMatchingBookIdFromTitle = [] 707 | list_duplicateHighlightTextValues = [] 708 | 709 | # Fill empty lists with values from highlights list of dictionaries 710 | def fillListsWithHighlightData(listToFill): 711 | for j in range(len(listToFill)): 712 | for k, v in iter(listToFill[j].items()): 713 | if k == 'text': 714 | list_extractedHighlightText[j] = str(v) 715 | if k == 'id': 716 | list_extractedHighlightId[j] = str(v) 717 | if k == 'location': 718 | list_extractedHighlightLocation[j] = str(v) 719 | if k == 'highlighted_at': 720 | list_extractedHighlightedAt[j] = str(v) 721 | if k == 'book_id': 722 | list_extractedHighlightBookId[j] = str(v) 723 | 724 | # Clean-up extracted list values 725 | def cleanUpExtractedListValues(listFromJson): 726 | for i in range(len(listFromJson)): 727 | listFromJson[i] = unidecode(str(listFromJson[i])) 728 | 729 | # Mark 
duplicate values e.g. AirrQuotes 730 | def checkForDuplicates(listToGetRangeFrom, listToCheckDuplicateValues): 731 | for i in range(len(listToGetRangeFrom)): 732 | if listToCheckDuplicateValues.count(listToCheckDuplicateValues[i]) > 1: 733 | list_duplicateHighlightTextValues[i] = 'Duplicate value' 734 | 735 | # Fetch highlight id, book id, and tags from 'highlight text' or 'highlighted at' (if there are duplicates) 736 | def fetchTagsFromCsvData(list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, list_Location, list_HighlightedAt, \ 737 | list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId, list_extractedHighlightTags, list_extractedHighlightText, list_extractedHighlightId, \ 738 | list_extractedHighlightLocation, list_extractedHighlightedAt, list_extractedHighlightBookId, list_noMatchingHighlightIdFromText, list_duplicateHighlightTextValues): 739 | textMatch = 0 740 | noMatch = 0 741 | tagsFromTextMatch = 0 742 | totalNumberOfTags = sum(1 for x in list_Tags if x != '') 743 | for i in range(len(list_extractedHighlightText)): 744 | try: 745 | if list_duplicateHighlightTextValues[i] == 'Duplicate value': 746 | if list_extractedHighlightedAt[i] in list_HighlightedAt: 747 | index1 = list_HighlightedAt.index(list_extractedHighlightedAt[i]) 748 | list_HighlightId[index1] = str(list_extractedHighlightId[i]) 749 | list_ReadwiseBookId[index1] = str(list_extractedHighlightBookId[i]) 750 | list_duplicateHighlightTextValues[i] = "" 751 | if(str(list_Tags[index1]) == ""): 752 | list_extractedHighlightTags[i] = "" 753 | else: 754 | list_extractedHighlightTags[i] = str(list_Tags[index1]) 755 | tagsFromTextMatch += 1 756 | textMatch += 1 757 | print(str(textMatch) + '/' + str(len(list_extractedHighlightText)) + ' highlights matched with ' \ 758 | + str(tagsFromTextMatch) + '/' + str(totalNumberOfTags) + ' tags') 759 | else: 760 | noMatch += 1 761 | message = 
str(list_extractedHighlightId[i]) + ' from ' + str(list_extractedHighlightBookId[i]) + ' not matched as it is a duplicate' 762 | print(message) 763 | pass 764 | else: 765 | if list_extractedHighlightText[i] in list_Highlight: 766 | index2 = list_Highlight.index(list_extractedHighlightText[i]) 767 | list_HighlightId[index2] = str(list_extractedHighlightId[i]) 768 | list_ReadwiseBookId[index2] = str(list_extractedHighlightBookId[i]) 769 | if(str(list_Tags[index2]) == ""): 770 | list_extractedHighlightTags[i] = "" 771 | else: 772 | list_extractedHighlightTags[i] = str(list_Tags[index2]) 773 | tagsFromTextMatch += 1 774 | textMatch += 1 775 | print(str(textMatch) + '/' + str(len(list_extractedHighlightText)) + ' highlights matched with ' \ 776 | + str(tagsFromTextMatch) + '/' + str(totalNumberOfTags) + ' tags') 777 | else: 778 | try: 779 | list_noMatchingHighlightIdFromText[i] = 'No highlight text match' 780 | except IndexError: 781 | return 782 | except IndexError: 783 | return 784 | message = str(textMatch) + '/' + str(len(list_extractedHighlightText)) + ' highlights matched with ' \ 785 | + str(tagsFromTextMatch) + '/' + str(totalNumberOfTags) + ' tags' 786 | logDateTimeOutput(message) 787 | 788 | def appendTagsFromCsvToCategoriesObject(list_highlights, list_ExtractedTags): 789 | tagsFromCsvCounter = 0 790 | totalNumberOfTags = sum(1 for x in list_ExtractedTags if x != '') 791 | for i in range(len(list_highlights)): # key = 'book_id' 792 | listCategories = [item for category in categoriesObject for item in category] 793 | key = str(list_highlights[i]['book_id']) 794 | id = str(list_highlights[i]['id']) 795 | index = list(map(itemgetter('book_id'), listCategories)).index(key) 796 | source = listCategories[index]['source'] # Get the 'category' of the corresponding 'book_id' from the grouped highlights 797 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 798 | indexBook = 
list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(str(key)) # Identify which position the 'book_id' corresponds to within the category object 799 | bookLastUpdated = categoriesObject[indexCategory][indexBook]['updated'] 800 | indexHighlight = list(map(itemgetter('id'), categoriesObject[indexCategory][indexBook]['highlights'])).index(id) 801 | # highlights = categoriesObject[indexCategory][indexBook]['highlights'] 802 | book_id = categoriesObject[indexCategory][indexBook]['book_id'] 803 | bookReviewUrl = 'https://readwise.io/bookreview/' + book_id 804 | indexTags = list_extractedHighlightId.index(id) 805 | if str(list_ExtractedTags[indexTags]) == '' or str(list_ExtractedTags[indexTags]) == 'nan': 806 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['tags'] = [] 807 | else: 808 | tagsArray = str(list_ExtractedTags[indexTags]).split() 809 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['tags'] = tagsArray 810 | categoriesObject[indexCategory][indexBook]['highlights'][indexHighlight]['updated'] = bookLastUpdated 811 | tagsFromCsvCounter += 1 812 | print(str(tagsFromCsvCounter) + '/' + str(totalNumberOfTags) + ' tags added or updated from the CSV export') 813 | message = str(tagsFromCsvCounter) + '/' + str(totalNumberOfTags) + ' tags added or updated from the CSV export' 814 | logDateTimeOutput(message) 815 | 816 | def runFetchCsvData(): 817 | readwiseCsvExportFileName = 'readwise-data.csv' 818 | downloadCsvExport(readwiseCsvExportFileName) 819 | readwiseCsvExportPath = os.path.join(sourceDirectory, readwiseCsvExportFileName) 820 | df = pd.read_csv(readwiseCsvExportPath) 821 | # Insert complete path to the CSV file and optional variables 822 | # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html 823 | df = df.sort_values(by=['Highlighted at'], ascending=True) # sort_values returns a new DataFrame, so assign it back 824 | # Insert the name of the column as a string in brackets 825 | list_Highlight = list(df['Highlight']) 826 | 
list_BookTitle = list(df['Book Title']) 827 | list_BookAuthor = list(df['Book Author']) 828 | list_AmazonBookId = list(df['Amazon Book ID']) 829 | list_Note = list(df['Note']) 830 | list_Color = list(df['Color']) 831 | list_Tags = list(df['Tags']) 832 | list_LocationType = list(df['Location Type']) 833 | list_Location = list(df['Location']) 834 | list_HighlightedAt = list(df['Highlighted at']) 835 | cleanUpListValues(list_Highlight, " ") 836 | cleanUpListValues(list_BookAuthor, " ") 837 | cleanUpListValues(list_Note, " ") 838 | cleanUpListValues(list_Location, "0") 839 | convertTitleToValidFilename(list_BookTitle) 840 | toLowercase(list_BookTitle) 841 | replaceEmptyTagCells(list_Tags) 842 | dateStringNormaliser(list_HighlightedAt) 843 | fillListWithEmptyCharacters(list_HighlightedAt, list_ReadwiseBookId) 844 | fillListWithEmptyCharacters(list_HighlightedAt, list_Source) 845 | fillListWithEmptyCharacters(list_HighlightedAt, list_Url) 846 | fillListWithEmptyCharacters(list_HighlightedAt, list_NumberOfHighlights) 847 | fillListWithEmptyCharacters(list_HighlightedAt, list_UpdatedAt) 848 | fillListWithEmptyCharacters(list_HighlightedAt, list_HighlightId) 849 | return list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, list_Location, list_HighlightedAt, \ 850 | list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId 851 | 852 | def runExtractDataFromApi(list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, list_Location, list_HighlightedAt, \ 853 | list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId): 854 | allHighlightsToFetchTagsForSortByDate = sorted(allHighlightsToFetchTagsFor, key = itemgetter('highlighted_at')) 855 | fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_extractedHighlightTags) 856 | 
fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_extractedHighlightText) 857 | fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_extractedHighlightId) 858 | fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_extractedHighlightLocation) 859 | fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_extractedHighlightedAt) 860 | fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_extractedHighlightBookId) 861 | fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_noMatchingHighlightIdFromText) 862 | fillListWithEmptyCharacters(allHighlightsToFetchTagsForSortByDate, list_duplicateHighlightTextValues) 863 | fillListsWithHighlightData(allHighlightsToFetchTagsForSortByDate) 864 | cleanUpExtractedListValues(list_extractedHighlightText) 865 | cleanUpExtractedListValues(list_extractedHighlightId) 866 | cleanUpExtractedListValues(list_extractedHighlightLocation) 867 | cleanUpExtractedListValues(list_extractedHighlightedAt) 868 | cleanUpExtractedListValues(list_extractedHighlightBookId) 869 | dateStringNormaliser(list_extractedHighlightedAt) 870 | checkForDuplicates(list_extractedHighlightText, list_extractedHighlightText) 871 | return allHighlightsToFetchTagsForSortByDate, list_extractedHighlightTags, list_extractedHighlightText, list_extractedHighlightId, list_extractedHighlightLocation, \ 872 | list_extractedHighlightedAt, list_extractedHighlightBookId, list_noMatchingHighlightIdFromText, list_duplicateHighlightTextValues 873 | 874 | def runFetchTagsFromCsvData(list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, list_Location, list_HighlightedAt, \ 875 | list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId, list_extractedHighlightTags, list_extractedHighlightText, list_extractedHighlightId, \ 876 | list_extractedHighlightLocation, 
list_extractedHighlightedAt, list_extractedHighlightBookId, list_noMatchingHighlightIdFromText, list_duplicateHighlightTextValues): 877 | fetchTagsFromCsvData(list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, list_Location, list_HighlightedAt, \ 878 | list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId, list_extractedHighlightTags, list_extractedHighlightText, list_extractedHighlightId, \ 879 | list_extractedHighlightLocation, list_extractedHighlightedAt, list_extractedHighlightBookId, list_noMatchingHighlightIdFromText, list_duplicateHighlightTextValues) 880 | appendTagsFromCsvToCategoriesObject(allHighlightsToFetchTagsFor, list_extractedHighlightTags) 881 | 882 | ############################################################### 883 | ### Create markdown notes with updated books and highlights ### 884 | ############################################################### 885 | 886 | # Create function for generating new markdown notes 887 | # Change working directory to desired path set in the readwiseMetadata file 888 | # Append book metadata at the top of the note e.g. title, author, source, readwise url 889 | # Append all highlights separated by "---" beneath the book metadata 890 | 891 | def createMarkdownNote(listOfBookIdsToUpdateMarkdownNotes): 892 | # for x in listOfBookIdsToUpdateMarkdownNotes: 893 | # print("listOfBookIdsToUpdateMarkdownNotes " + str(x)) 894 | booksWithNoHighlights = 0 895 | booksWithHeadings = 0 896 | if os.path.exists(targetDirectory): 897 | os.chdir(targetDirectory) 898 | else: 899 | print('Error! The target directory does not exist or is incorrect') 900 | # Match the 'book_id' to the correct category dictionary e.g. 
books, articles 901 | # Retrieve 'book_id' metadata from the dictionary 902 | listCategories = list(categoriesObject) 903 | # listCategories = [item for category in categoriesObject for item in category] 904 | listOfBookIdsToUpdateMarkdownNotes.sort() 905 | listOfBookIdsToUpdateMarkdownNotes = list(listOfBookIdsToUpdateMarkdownNotes for listOfBookIdsToUpdateMarkdownNotes,_ in groupby(listOfBookIdsToUpdateMarkdownNotes)) 906 | if len(listOfBookIdsToUpdateMarkdownNotes) != 0: 907 | for bookData in range(len(listOfBookIdsToUpdateMarkdownNotes)): # type(listOfBookIdsToUpdateMarkdownNotes[bookData]) = list 908 | key = str(listOfBookIdsToUpdateMarkdownNotes[bookData][0]) 909 | source = str(listOfBookIdsToUpdateMarkdownNotes[bookData][1]) 910 | indexCategory = categoriesObjectNames.index(source) # Identify which position the 'category' corresponds to within the list of category objects 911 | indexBook = list(map(itemgetter('book_id'), categoriesObject[indexCategory])).index(str(key)) # Identify which position the 'book_id' corresponds to within the category object 912 | yamlData = [] 913 | titleBlock = [] 914 | commentData = [] 915 | yamlData.append("---" + "\n") 916 | # Add title to yamlData and titleBlock 917 | title = unidecode(categoriesObject[indexCategory][indexBook]['title']).replace('"', '\'') 918 | yamlData.append("Title: " + "\"" + str(title) + "\"" + "\n") 919 | titleBlock.append("# " + str(title) + "\n") 920 | if(str(categoriesObject[indexCategory][indexBook]['author']) == "None"): 921 | author = " " 922 | yamlData.append("Author: " + str(author) + "\n") 923 | else: 924 | author = unidecode(categoriesObject[indexCategory][indexBook]['author']).replace('"', '\'') 925 | yamlData.append("Author: " + "\"" + str(author) + "\"" + "\n") 926 | source = categoriesObject[indexCategory][indexBook]['source'] 927 | yamlData.append("Tags: " + "[" + "readwise2directory" + ", TVZ, " + str(source) + "]" + "\n") 928 | num_highlights = 
categoriesObject[indexCategory][indexBook]['num_highlights'] 929 | yamlData.append("Highlights: " + str(num_highlights) + "\n") 930 | lastUpdated = datetime.datetime.strptime(categoriesObject[indexCategory][indexBook]['updated'][0:10], '%Y-%m-%d').strftime("%Y-%m-%d") 931 | yamlData.append("Updated: " + "[[" + str(lastUpdated) + "]]" + "\n") 932 | # Add readwise url to yamlData and titleBlock 933 | url = str(categoriesObject[indexCategory][indexBook]['url']) 934 | yamlData.append("Readwise URL: " + str(url) + "\n") 935 | titleBlock.append("[Readwise URL](" + str(url) + ")") 936 | book_id = str(categoriesObject[indexCategory][indexBook]['book_id']) 937 | yamlData.append("Readwise ID: " + str(book_id) + "\n") 938 | # Add source URL (if exists) to yamlData and titleBlock 939 | try: 940 | source_url = str(categoriesObject[indexCategory][indexBook]['source_url']) 941 | if source_url.lower() == "none" or source_url.lower() == "null" or source_url == "": 942 | print('no source URL found') 943 | #continue # Commented this out because otherwise, books don't get markdown notes generated from them. 
944 | else: 945 | yamlData.append("Source URL: " + str(source_url) + "\n") 946 | titleBlock.append(" | " + "[Source URL](" + str(source_url) + ")"+ "\n\n") 947 | except KeyError: # a missing 'source_url' dict key raises KeyError, not NameError 948 | pass # fall through so the closing '---' below is still appended 949 | yamlData.append("---" + "\n\n") 950 | titleBlock.append("---" + "\n") 951 | 952 | # Add comment with tags 953 | commentData.append("%%\n") 954 | commentData.append("Last Updated: [[" + str(lastUpdated) + "]]\n") 955 | commentData.append("%%" + "\n") 956 | # Add cover image URL if exists 957 | try: 958 | cover_image_url = str(categoriesObject[indexCategory][indexBook]['cover_image_url']) 959 | titleBlock.append("![](" + cover_image_url + ")" + "\n\n") 960 | titleBlock.append("---" + "\n") 961 | except KeyError: # books without a cover image should still get a note 962 | pass 963 | #fileName = slugify(title) 964 | fileName = title # Removed slugify to preserve case and spaces 965 | # fileName = get_valid_filename_django(title) 966 | yamlData = "".join(yamlData) 967 | commentData = "".join(commentData) 968 | titleBlock = "".join(titleBlock) 969 | # Ignore books with no highlights 970 | if str(num_highlights) == '0': 971 | booksWithNoHighlights += 1 972 | pass 973 | else: 974 | # Change directory according to source 975 | if str(source) == 'tweets': 976 | sourceOutputDir = 'Tweet' 977 | if str(source) == 'articles': 978 | sourceOutputDir = 'Article' 979 | if str(source) == 'books': 980 | sourceOutputDir = 'Book' 981 | if str(source) == 'podcasts': 982 | sourceOutputDir = 'Podcast' 983 | if str(source) == 'supplementals': 984 | sourceOutputDir = 'Supplemental' 985 | os.chdir(os.path.join(targetDirectory, sourceOutputDir)) 986 | with open(fileName + ".md", 'w') as newFile: # Warning: this will overwrite all content within the readwise note. 
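For orientation, the front matter assembled into `yamlData` above renders as a YAML block at the top of each note. A self-contained sketch with hypothetical book metadata (not taken from a real Readwise record):

```python
# Sketch of the YAML front matter built by createMarkdownNote,
# using hypothetical title/author/ID values.
yamlData = []
yamlData.append("---" + "\n")
yamlData.append("Title: " + "\"" + "Example Book" + "\"" + "\n")
yamlData.append("Author: " + "\"" + "Jane Doe" + "\"" + "\n")
yamlData.append("Tags: [readwise2directory, TVZ, books]" + "\n")
yamlData.append("Highlights: 2" + "\n")
yamlData.append("Readwise ID: 12345" + "\n")
yamlData.append("---" + "\n\n")
print("".join(yamlData))
```

The quoted title and author keep YAML valid when the metadata itself contains colons or quotes, which is why the script replaces `"` with `'` before wrapping.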
987 | print(yamlData, file=newFile) 988 | print(commentData, file=newFile) 989 | print(titleBlock, file=newFile) 990 | # Append highlights to the file beneath the 'book_id' metadata 991 | for n in range(len(categoriesObject[indexCategory][indexBook]['highlights'])): 992 | highlightData = [] 993 | id = str(categoriesObject[indexCategory][indexBook]['highlights'][n]['id']) 994 | note = unidecode(categoriesObject[indexCategory][indexBook]['highlights'][n]['note']) 995 | location = str(categoriesObject[indexCategory][indexBook]['highlights'][n]['location']) 996 | location_type = categoriesObject[indexCategory][indexBook]['highlights'][n]['location_type'] 997 | tagsArray = categoriesObject[indexCategory][indexBook]['highlights'][n]['tags'] 998 | text = unidecode(categoriesObject[indexCategory][indexBook]['highlights'][n]['text']) 999 | if "__" in text: 1000 | text = text.replace("__", "==") 1001 | # Add # for h1-h5 headings 1002 | listOfHeadings = ['#h1', '#h2', '#h3', '#h4', '#h5'] 1003 | if any(item in tagsArray for item in listOfHeadings): 1004 | if any('#h1' in s for s in tagsArray): 1005 | highlightData.append("## " + text + "\n" + " ^" + id + "\n\n") 1006 | booksWithHeadings += 1 1007 | elif any('#h2' in s for s in tagsArray): 1008 | highlightData.append("### " + text + "\n" + " ^" + id + "\n\n") 1009 | booksWithHeadings += 1 1010 | elif any('#h3' in s for s in tagsArray): 1011 | highlightData.append("#### " + text + "\n" + " ^" + id + "\n\n") 1012 | booksWithHeadings += 1 1013 | elif any('#h4' in s for s in tagsArray): 1014 | highlightData.append("##### " + text + "\n" + " ^" + id + "\n\n") 1015 | booksWithHeadings += 1 1016 | elif any('#h5' in s for s in tagsArray): 1017 | highlightData.append("###### " + text + "\n" + " ^" + id + "\n\n") 1018 | booksWithHeadings += 1 1019 | else: 1020 | # Pre-pend a "> " character to any text with line breaks 1021 | # Or pre-pend a "> \" if line is empty 1022 | # This is to fix the issue where the block-reference doesn't 
pick-up parent items 1023 | if "\n" in text: 1024 | textNew = [] 1025 | textSplit = text.split("\n") # type(highlight['text']) = 'list' 1026 | for s in range(len(textSplit)): 1027 | if textSplit[s] == '': 1028 | x = ("> \\" + textSplit[s]) 1029 | else: 1030 | x = ("> " + textSplit[s]) 1031 | textNew.append(x) 1032 | textNew = "\n".join(textNew) 1033 | highlightData.append(textNew + "\n\n" + "^" + id + "\n\n") 1034 | else: 1035 | highlightData.append(text + " ^" + id + "\n\n") 1036 | if note == [] or note == "": 1037 | pass 1038 | else: 1039 | highlightData.append("**Note:** " + str(note) + "\n") 1040 | if tagsArray == [] or tagsArray == "": 1041 | pass 1042 | else: 1043 | tags = " ".join(str(v) for v in tagsArray) 1044 | highlightData.append("**Tags:** " + str(tags) + "\n") 1045 | if str(categoriesObject[indexCategory][indexBook]['highlights'][n]['url']) == "None": 1046 | pass 1047 | else: 1048 | url = str(categoriesObject[indexCategory][indexBook]['highlights'][n]['url']) 1049 | highlightData.append("**References:** " + str(url) + "\n") 1050 | if source == "podcasts" and str(url) != "None": 1051 | # Append 'embed/' after the 'airr.io/' string and before the '/quote/' string 1052 | airrQuoteMatchingPattern = 'airr.io/' 1053 | airrQuoteEmbedText = 'embed/' 1054 | if airrQuoteMatchingPattern in url: # Check if url is an AirrQuote 1055 | i = url.find(airrQuoteMatchingPattern) # Find index of matching pattern 1056 | podcastUrl = url[:i + len(airrQuoteMatchingPattern)] + airrQuoteEmbedText + url[i + len(airrQuoteMatchingPattern):] 1057 | else: 1058 | podcastUrl = url 1059 | iFrameWithPodcastUrl = '' 1060 | highlightData.append(iFrameWithPodcastUrl + "\n") 1061 | """ 1062 | highlighted_at = datetime.datetime.strptime(categoriesObject[indexCategory][indexBook]['highlights'][n]['highlighted_at'][0:10], '%Y-%m-%d').strftime("%y%m%d %A") # Trim the UTC date field and re-format 1063 | updated =
datetime.datetime.strptime(categoriesObject[indexCategory][indexBook]['highlights'][n]['updated'][0:10], '%Y-%m-%d').strftime("%y%m%d %A") # Trim the UTC date field and re-format 1064 | if highlighted_at == updated: 1065 | date = updated 1066 | highlightData.append("**Date:** " + "[[" + str(date) + "]]" + "\n") 1067 | else: 1068 | date = highlighted_at 1069 | highlightData.append("**Date:** " + "[[" + str(date) + "]]" + "\n") 1070 | """ 1071 | highlightData.append("\n" + "---" + "\n") 1072 | highlightData = "".join(highlightData) 1073 | print(highlightData, file=newFile) 1074 | print(' - "' + str(title) + '"') 1075 | os.chdir(sourceDirectory) # Revert to original directory with script 1076 | if str(booksWithHeadings) == '0': 1077 | pass 1078 | else: 1079 | print(str(booksWithHeadings) + ' highlights converted into headings') 1080 | if str(booksWithNoHighlights) == '0': 1081 | pass 1082 | else: 1083 | print(str(booksWithNoHighlights) + ' books ignored as they contained no highlights') 1084 | differenceMarkdownNoteAmount = newMarkdownNoteAmount - originalMarkdownNoteAmount 1085 | message = str(differenceMarkdownNoteAmount) + ' new markdown notes created and ' + str(len(listOfBookIdsToUpdateMarkdownNotes) - differenceMarkdownNoteAmount) + ' markdown notes updated' 1086 | # message = str(len(listOfBookIdsToUpdateMarkdownNotes)) + ' new markdown notes created' 1087 | logDateTimeOutput(message) 1088 | print(message) 1089 | 1090 | ########################################################## 1091 | ### Calculate the number of new markdown notes created ### 1092 | ########################################################## 1093 | 1094 | def numberOfMarkdownNotes(): 1095 | counter = 0 1096 | listCategories = list(categoriesObject) 1097 | for i in range(len(listCategories)): 1098 | counter += len(listCategories[i]) 1099 | return counter 1100 | 1101 | ####################################################### 1102 | ### Import variables from file in another directory ### 1103 | 
####################################################### 1104 | 1105 | # Import all variables from readwiseMetadata file 1106 | print('Importing variables from readwiseMetadata...') 1107 | from readwiseMetadata import token, targetDirectory, dateFrom, email, pwd, chromedriverDirectory, highlightLimitToFetchTags 1108 | # from readwiseMetadata import * 1109 | 1110 | # Check dateFrom variable 1111 | print('Checking if a valid dateFrom variable is defined in readwiseMetadata...') 1112 | dateFrom = convertDateFromToUtcFormat(dateFrom) 1113 | 1114 | # Check targetDirectory variable is valid 1115 | print('Checking if a valid targetDirectory variable is defined in readwiseMetadata...') 1116 | insertPath(targetDirectory) 1117 | 1118 | abspath = os.path.realpath(__file__) # Create absolute path for this file 1119 | 1120 | # Create sourceDirectory variable from absolute path for this file 1121 | print('Creating sourceDirectory variable from absolute path for this file...') 1122 | sourceDirectory = os.path.dirname(abspath) # Create variable defining the directory name 1123 | # sourceDirectory = os.getcwd() 1124 | print(str(sourceDirectory) + ' directory variable defined') 1125 | 1126 | # Function to check if any of the optional input variables in readwiseMetadata are blank or missing 1127 | # If blank or missing, set boolean value to False and no tags will be fetched 1128 | fetchTagsBoolean = fetchTagsTrueOrFalse(fetchTagsBoolean, email) 1129 | fetchTagsBoolean = fetchTagsTrueOrFalse(fetchTagsBoolean, pwd) 1130 | fetchTagsBoolean = fetchTagsTrueOrFalse(fetchTagsBoolean, chromedriverDirectory) 1131 | fetchTagsBoolean = fetchTagsTrueOrFalse(fetchTagsBoolean, highlightLimitToFetchTags) 1132 | 1133 | ###################################### 1134 | ### Load book data from JSON files ### 1135 | ###################################### 1136 | 1137 | articles = {} 1138 | books = {} 1139 | podcasts = {} 1140 | supplementals = {} 1141 | tweets = {} 1142 | 1143 | categoriesObject = [articles, 
books, podcasts, supplementals, tweets] # type(categoriesObject[0]) = 'dictionary' 1144 | 1145 | categoriesObjectNames = ["articles", "books", "podcasts", "supplementals", "tweets"] # type(categoriesObjectNames[0]) = 'string' 1146 | 1147 | # Load existing readwise data from JSON files into categoriesObject 1148 | print('Loading data from JSON files in readwiseCategories to categoriesObject...') 1149 | loadBookDataFromJsonToObject() 1150 | 1151 | originalMarkdownNoteAmount = numberOfMarkdownNotes() # Sum the original number of books in each dictionary 1152 | 1153 | ################## 1154 | ### Books LIST ### 1155 | ################## 1156 | 1157 | # Readwise REST API information = 'https://readwise.io/api_deets' 1158 | # Readwise endpoint = 'https://readwise.io/api/v2/books/' 1159 | 1160 | booksListQueryString = { 1161 | "page_size": 1000, # 1000 items per page - maximum 1162 | "page": 1, # Page 1 >> build for loop to cycle through pages and stop when complete 1163 | "updated__gt": dateFrom, # if no date provided, it will default to dateLastScriptRun 1164 | } 1165 | 1166 | # Trigger GET request with booksListQueryString 1167 | print('Fetching book data from readwise...') 1168 | booksList = requests.get( 1169 | url="https://readwise.io/api/v2/books/", # endpoint provided by https://readwise.io/api_deets 1170 | headers={"Authorization": "Token " + token}, # token imported from readwiseMetadata file 1171 | params=booksListQueryString # query string object 1172 | ) 1173 | print("Here's the response: " + str(booksList.content)) 1174 | 1175 | # Convert response into JSON object 1176 | try: 1177 | print('Converting readwise book data returned into JSON...') 1178 | booksListJson = booksList.json() # type(booksListJson) >> 'dict' https://docs.python.org/3/tutorial/datastructures.html#dictionaries 1179 | except ValueError: 1180 | message = 'Response content from booksList request is not valid JSON' 1181 | logDateTimeOutput(message) 1182 | print(message) # Originally from
https://github.com/psf/requests/issues/4908#issuecomment-627486125 1183 | # JSONDecodeError: Expecting value: line 1 column 1 (char 0) specifically happens with an empty string (i.e. empty response content) 1184 | 1185 | try: 1186 | # Create new object of booksListJson['results'] 1187 | booksListResults = booksListJson['results'] # type(booksListResults) = 'list' 1188 | except NameError: 1189 | message = 'Cannot extract results from empty JSON for booksList request' 1190 | logDateTimeOutput(message) 1191 | print(message) 1192 | 1193 | # Loop through pagination using 'next' property from GET response 1194 | try: 1195 | additionalLoopCounter = 0 1196 | while booksListJson['next']: 1197 | additionalLoopCounter += 1 1198 | print('Fetching additional book data from readwise... (page ' + str(additionalLoopCounter) + ')') 1199 | booksList = requests.get( 1200 | url=booksListJson['next'], # keep same query parameters from booksListQueryString object 1201 | headers={"Authorization": "Token " + token}, # token imported from readwiseMetadata file 1202 | ) 1203 | try: 1204 | print('Converting additional readwise book data returned into JSON... (page ' + str(additionalLoopCounter) + ')') 1205 | booksListJson = booksList.json() # type(booksListJson) = 'dictionary' 1206 | except ValueError: 1207 | message = 'Response content from additional booksList request is not valid JSON' 1208 | logDateTimeOutput(message) 1209 | print(message) # Originally from https://github.com/psf/requests/issues/4908#issuecomment-627486125 1210 | # JSONDecodeError: Expecting value: line 1 column 1 (char 0) specifically happens with an empty string (i.e.
empty response content) 1211 | break 1212 | try: 1213 | # Extend booksListResults with booksListJson['results'] 1214 | booksListResults.extend(booksListJson['results']) # type(booksListResults) = 'list' 1215 | except NameError: 1216 | message = 'Cannot extract results from empty JSON for additional booksList request' 1217 | logDateTimeOutput(message) 1218 | print(message) 1219 | break 1220 | except NameError: 1221 | message = 'Cannot loop through pagination from empty response' 1222 | logDateTimeOutput(message) 1223 | print(message) 1224 | 1225 | # Sort booksListResults data by 'category' key 1226 | print('Sorting readwise book data by category...') 1227 | booksListResultsSort = sorted(booksListResults, key = itemgetter('category')) # e.g. 'category' = 'books' 1228 | 1229 | # Group booksListResults data by 'category' key 1230 | print('Grouping readwise book data by category...') 1231 | booksListResultsGroup = groupby(booksListResultsSort, key = itemgetter('category')) 1232 | 1233 | # Append new books to categoriesObject, or update existing book data 1234 | print('Appending readwise book data returned to categoriesObject...') 1235 | appendBookDataToObject() 1236 | 1237 | ####################### 1238 | ### Highlights LIST ### 1239 | ####################### 1240 | 1241 | # Readwise REST API information = 'https://readwise.io/api_deets' 1242 | # Readwise endpoint = 'https://readwise.io/api/v2/highlights/' 1243 | 1244 | # Create highlightsList query string: 1245 | highlightsListQueryString = { 1246 | "page_size": 1000, # 1000 items per page - maximum 1247 | "page": 1, # Page 1 >> build for loop to cycle through pages and stop when complete 1248 | "highlighted_at__gt": dateFrom, 1249 | } 1250 | 1251 | # Trigger GET request with highlightsListQueryString 1252 | print('Fetching highlight data from readwise...') 1253 | highlightsList = requests.get( 1254 | url="https://readwise.io/api/v2/highlights/", 1255 | headers={"Authorization": "Token " + token}, # token imported from
readwiseMetadata file 1256 | params=highlightsListQueryString # query string object 1257 | ) 1258 | 1259 | # Convert response into JSON object 1260 | try: 1261 | print('Converting readwise highlight data returned into JSON...') 1262 | highlightsListJson = highlightsList.json() # type(highlightsListJson) = 'dictionary' 1263 | except ValueError: 1264 | message = 'Response content is not valid JSON' 1265 | logDateTimeOutput(message) 1266 | print(message) # Originally from https://github.com/psf/requests/issues/4908#issuecomment-627486125 1267 | # JSONDecodeError: Expecting value: line 1 column 1 (char 0) specifically happens with an empty string (i.e. empty response content) 1268 | 1269 | try: 1270 | # Create list from highlightsListJson['results'] 1271 | highlightsListResults = highlightsListJson['results'] # type(highlightsListResults) = 'list' 1272 | except NameError: 1273 | message = 'Cannot extract results from empty JSON' 1274 | logDateTimeOutput(message) 1275 | print(message) 1276 | 1277 | try: 1278 | # Loop through pagination using 'next' property from GET response 1279 | additionalLoopCounter = 0 1280 | while highlightsListJson['next']: 1281 | additionalLoopCounter += 1 1282 | print('Fetching additional highlight data from readwise... (page ' + str(additionalLoopCounter) + ')') 1283 | highlightsList = requests.get( 1284 | url=highlightsListJson['next'], # keep same query parameters from highlightsListQueryString object 1285 | headers={"Authorization": "Token " + token}, # token imported from readwiseMetadata file 1286 | ) 1287 | # Convert response into JSON object 1288 | try: 1289 | print('Converting additional readwise highlight data returned into JSON...
(page ' + str(additionalLoopCounter) + ')') 1290 | highlightsListJson = highlightsList.json() # type(highlightsListJson) = 'dictionary' 1291 | except ValueError: 1292 | message = 'Response content is not valid JSON' 1293 | logDateTimeOutput(message) 1294 | print(message) # Originally from https://github.com/psf/requests/issues/4908#issuecomment-627486125 1295 | # JSONDecodeError: Expecting value: line 1 column 1 (char 0) specifically happens with an empty string (i.e. empty response content) 1296 | break 1297 | try: 1298 | # Extend highlightsListResults with highlightsListJson['results'] 1299 | highlightsListResults.extend(highlightsListJson['results']) # type(highlightsListResults) = 'list' 1300 | except NameError: 1301 | message = 'Cannot extract results from empty JSON' 1302 | logDateTimeOutput(message) 1303 | print(message) 1304 | break 1305 | except NameError: 1306 | message = 'Cannot loop through pagination from empty response' 1307 | logDateTimeOutput(message) 1308 | print(message) 1309 | 1310 | # Replace None values in "location" and "location_type" fields, as these would otherwise block highlight data sorting and grouping 1311 | replaceNoneInListOfDict(highlightsListResults) 1312 | 1313 | # Sort highlightsListResults data by 'book_id' key and 'location' 1314 | print('Sorting readwise highlight data by book_id and location...') 1315 | highlightsListResultsSort = sorted(highlightsListResults, key = itemgetter('book_id', 'location')) 1316 | 1317 | # Group highlightsListResultsSort data by 'book_id' key 1318 | print('Grouping readwise highlight data by book_id...') 1319 | highlightsListResultsGroup = groupby(highlightsListResultsSort, key = itemgetter('book_id')) 1320 | 1321 | listOfBookIdsToUpdateMarkdownNotes = [] # Append 'book ids' to loop through when creating new or updating existing markdown notes 1322 | 1323 | # Append new highlights to categoriesObject, or update existing highlight data 1324 | print('Appending readwise highlight data returned to categoriesObject...') 1325 | 
appendHighlightDataToObject() 1326 | 1327 | allHighlightsToFetchTagsFor = [] # Append values from 'highlightsListResultsSort' and 'missingHighlightsListResultsSort' into this list 1328 | missingHighlightsListToFetchTagsFor = [] # Append values from 'missingHighlightsListResultsSort' into this list 1329 | allHighlightsToFetchTagsForSortByDate = [] 1330 | 1331 | def appendHighlightsToListForFetchingTags(originalList, highlightsListToAppend): 1332 | # allHighlightsToFetchTagsFor = [allHighlightsToFetchTagsFor.append(highlightsListToAppend) 1333 | for i in range(len(highlightsListToAppend)): 1334 | originalList.append(highlightsListToAppend[i]) 1335 | 1336 | appendHighlightsToListForFetchingTags(allHighlightsToFetchTagsFor, highlightsListResultsSort) 1337 | 1338 | print('Appending updated highlight data to categoriesObject...') 1339 | appendUpdatedHighlightsToObject() 1340 | 1341 | ######################################################### 1342 | ### Fetch tags individually or in bulk via CSV export ### 1343 | ######################################################### 1344 | 1345 | # appendTagsToHighlightObject(highlightsListResultsSort) 1346 | 1347 | # If num of highlights in 'highlightsListResultsSort' is greater than limit specified in 'highlightLimitToFetchTags', fetch tags via CSV export 1348 | # Otherwise web scrape tags individually via Selenium 1349 | def fetchTagsIndividuallyOrInBulk(): 1350 | if fetchTagsBoolean is True: 1351 | try: 1352 | if len(allHighlightsToFetchTagsFor) > highlightLimitToFetchTags: 1353 | message = 'Fetching tags for ' + str(len(allHighlightsToFetchTagsFor)) + ' highlights in bulk via CSV export...' 
1354 | logDateTimeOutput(message) 1355 | print(message) 1356 | list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, list_Location, \ 1357 | list_HighlightedAt, list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId = runFetchCsvData() 1358 | # runFetchCsvData() 1359 | allHighlightsToFetchTagsForSortByDate, list_extractedHighlightTags, list_extractedHighlightText, list_extractedHighlightId, list_extractedHighlightLocation, \ 1360 | list_extractedHighlightedAt, list_extractedHighlightBookId, list_noMatchingHighlightIdFromText, list_duplicateHighlightTextValues = \ 1361 | runExtractDataFromApi(list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, \ 1362 | list_Location, list_HighlightedAt, list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId) 1363 | # runFetchTagsFromCsvData() 1364 | runFetchTagsFromCsvData(list_Highlight, list_BookTitle, list_BookAuthor, list_AmazonBookId, list_Note, list_Color, list_Tags, list_LocationType, \ 1365 | list_Location, list_HighlightedAt, list_ReadwiseBookId, list_Source, list_Url, list_NumberOfHighlights, list_UpdatedAt, list_HighlightId, \ 1366 | list_extractedHighlightTags, list_extractedHighlightText, list_extractedHighlightId, list_extractedHighlightLocation, list_extractedHighlightedAt, \ 1367 | list_extractedHighlightBookId, list_noMatchingHighlightIdFromText, list_duplicateHighlightTextValues) 1368 | elif len(allHighlightsToFetchTagsFor) <= highlightLimitToFetchTags: 1369 | message = 'Fetching tags for ' + str(len(allHighlightsToFetchTagsFor)) + ' highlights individually...' 
1370 | logDateTimeOutput(message) 1371 | print(message) 1372 | appendTagsToHighlightObject(highlightsListResultsSort) 1373 | appendTagsToHighlightObject(missingHighlightsListToFetchTagsFor) 1374 | else: 1375 | message = 'Error trying to determine whether to fetch tags individually or in bulk' 1376 | logDateTimeOutput(message) 1377 | print(message) 1378 | except (OSError, ValueError): 1379 | return 1380 | else: 1381 | return 1382 | 1383 | if fetchTagsBoolean is True: 1384 | fetchTagsIndividuallyOrInBulk() # Function to determine whether to fetch tags individually or in bulk 1385 | removeHighlightsWithDiscardTag() # Function to remove highlights from categoriesObject which contain 'discard' tag 1386 | appendHashtagToTags() # Function to append a hashtag to the start of every tag (if they are missing) 1387 | else: 1388 | message = 'No tags fetched as one of the input variables required in readwiseMetadata is blank or invalid' 1389 | logDateTimeOutput(message) 1390 | print(message) 1391 | 1392 | # Export books with updated highlights to JSON files 1393 | appendBookAndHighlightObjectToJson() 1394 | 1395 | ############################ 1396 | ### Create markdown note ### 1397 | ############################ 1398 | 1399 | newMarkdownNoteAmount = numberOfMarkdownNotes() # Sum the new number of books in each dictionary 1400 | 1401 | print('Creating or updating markdown notes...') 1402 | 1403 | createMarkdownNote(listOfBookIdsToUpdateMarkdownNotes) 1404 | 1405 | ############################################### 1406 | ### Print script completion time to console ### 1407 | ############################################### 1408 | 1409 | os.chdir(sourceDirectory) 1410 | 1411 | message = 'Script complete' 1412 | logDateTimeOutput(message) 1413 | print(message) 1414 | -------------------------------------------------------------------------------- /readwise-GET_install.py: -------------------------------------------------------------------------------- 1 | ############################ 
2 | ### Install dependencies ### 3 | ############################ 4 | 5 | # Instructions for installing Python modules here https://docs.python.org/3/installing/index.html 6 | # These are shell commands to run in a terminal, not Python statements, so do not execute this file directly. If using Mac, use python3.9 -m pip install; if using Windows, use py -3.9 -m pip install 7 | 8 | # Mac 9 | 10 | # !/usr/bin/env python3.9 11 | 12 | python3.9 -m pip install requests 13 | python3.9 -m pip install Django 14 | python3.9 -m pip install Unidecode 15 | python3.9 -m pip install pathvalidate 16 | python3.9 -m pip install pandas 17 | python3.9 -m pip install chromedriver 18 | python3.9 -m pip install selenium 19 | 20 | """ 21 | # Windows 22 | 23 | # !/usr/bin/env py 24 | 25 | py -3.9 -m pip install requests 26 | py -3.9 -m pip install Django 27 | py -3.9 -m pip install Unidecode 28 | py -3.9 -m pip install pathvalidate 29 | py -3.9 -m pip install pandas 30 | py -3.9 -m pip install chromedriver 31 | py -3.9 -m pip install selenium 32 | """ 33 | -------------------------------------------------------------------------------- /readwiseMetadata.py.default: -------------------------------------------------------------------------------- 1 | ############################# 2 | ### Readwise Access Token ### 3 | ############################# 4 | 5 | token = "" # ENTER YOUR TOKEN HERE 6 | # Retrieve from https://readwise.io/access_token 7 | # e.g. "abc123dEf45Gh6" 8 | 9 | ########################################################################### 10 | ### Specify target directory for new markdown notes i.e. Obsidian vault ### 11 | ########################################################################### 12 | 13 | targetDirectory = "" # ENTER VALID DIRECTORY PATH HERE 14 | # e.g. "/Users/johnsmith/Dropbox/Obsidian/Vault" on Mac or "\\Users\\johnsmith\\Dropbox\\Obsidian\\Vault" on Windows 15 | 16 | ################################################## 17 | ### Specify query string parameters (optional) ### 18 | ################################################## 19 | 20 | dateFrom = "" # "YYYY-MM-DD" format only.
Get highlights AFTER this date only. 21 | # If set to "" or None, the script will default to 'last successful script run' date from readwiseGET.log (if exists), or it will fetch all readwise resources 22 | # e.g. "2020-01-01" 23 | 24 | ######################################### 25 | ### Data for fetching tags (optional) ### 26 | ######################################### 27 | 28 | # Readwise API endpoints seem to exclude tags, so I've added functionality to fetch tags from new or updated highlights. 29 | # Note: this uses Selenium to web scrape data from your readwise profile. Please use with caution! 30 | # If any of these variables are set to "" or None, no tags will be fetched. 31 | 32 | email = "" # ENTER YOUR EMAIL HERE 33 | # e.g. "johnsmith@gmail.com" 34 | 35 | pwd = "" # ENTER YOUR PASSWORD HERE 36 | # e.g. "J0HNSM1TH_312" 37 | 38 | chromedriverDirectory = "" # ENTER VALID PATH TO CHROMEDRIVER 39 | # e.g. "/Users/johnsmith/Downloads/chromedriver" on Mac or "\\Users\\johnsmith\\Downloads\\chromedriver.exe" on Windows 40 | # Read more here https://chromedriver.chromium.org/ 41 | 42 | highlightLimitToFetchTags = 10 # ENTER NUMBER HERE 43 | # Specify an integer limit (I recommend 10 for speed) to determine whether to fetch tags individually or in bulk via CSV export 44 | # If the number of highlights returned is <= this limit, fetch tags individually; if greater, fetch tags in bulk via a CSV export 45 | --------------------------------------------------------------------------------
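Two of the per-highlight text transformations that `readwise-GET.py` performs inline — prefixing multi-line highlight text with `> ` so Obsidian block references pick up every line, and inserting `embed/` into AirrQuote URLs for podcast embeds — can be sketched as standalone helpers. The function names below are illustrative only; the script itself does this logic inline rather than through named functions.

```python
AIRR_PATTERN = 'airr.io/'
AIRR_EMBED = 'embed/'

def blockquote(text: str) -> str:
    """Prefix every line with '> ' (or '> \\' for empty lines) so that an
    Obsidian block reference covers the whole multi-line highlight."""
    lines = []
    for line in text.split("\n"):
        # Empty lines get a trailing backslash so the quote block is not broken
        lines.append("> \\" if line == "" else "> " + line)
    return "\n".join(lines)

def airr_embed_url(url: str) -> str:
    """Insert 'embed/' directly after 'airr.io/' so the AirrQuote URL points
    at the embeddable player; non-Airr URLs pass through unchanged."""
    i = url.find(AIRR_PATTERN)
    if i == -1:
        return url
    cut = i + len(AIRR_PATTERN)
    return url[:cut] + AIRR_EMBED + url[cut:]
```

For example, `airr_embed_url("https://www.airr.io/quote/abc123")` yields `"https://www.airr.io/embed/quote/abc123"`, which the script then wraps in an iframe snippet appended beneath the highlight.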