├── .gitignore
├── LICENSE
├── README.md
├── chromedriver.exe
├── gbd.py
├── img.png
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------

google-books-downloader crash.log

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Hayk Aprikyan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Google Books Downloader
An open-source utility to scrape Google Books

## How to use

### Installing

**Step 0:** Python is required to run this utility.
If you are new to programming, make sure to download and install the latest version of [Python](https://www.python.org/downloads) before proceeding.

**Step 1:** Download/clone the code to your local machine. Install the required dependencies listed in requirements.txt

(hint: run `pip3 install -r [install-directory]/requirements.txt` from your console)

**Step 2:** Run gbd.py and go get 'em books.

(hint: run `python [install-directory]/gbd.py` from your console to run Google Books Downloader)

### Instructions

To download material from Google Books, **it needs to have full or snippet view**. If a book has neither, i.e. it can't be viewed on Google Books, then the utility can't (and practically, no one can) download the book.

![Book examples](img.png)

**Step 1:** After running the utility, you'll be asked whether to download from a URL or, in case you have *previously* downloaded the book, from a backup. To download from a URL, type "Yes".

If you do have a backup file, type "No", then input the address of the backup and proceed to step 4.

**Step 2:** Enter the URL of the book you want to download.

In this step, Google Books Downloader will browse the book and fetch its pages, so you can take a short break while it does the job for you.

**Step 3:** After a couple of minutes it'll be done processing the book. You are encouraged to back up the progress made so far. This lets you skip the previous step and save time if you ever download the same book again.

Type Yes to create a backup; otherwise, type No.

**Step 4:** Type the numbers of the pages you'd like to download. Note that if a page you selected is not available in the preview, it will simply be skipped.

**Step 5:** We're almost done at this step.
Just enter where to save the book pages; leave it blank and press Enter to save them in your current directory.

Congrats! The book pages will be saved as images in the specified location in the highest quality available. You can now read them or combine them into a PDF using online tools or desktop apps (I usually use Nitro or this [website](https://smallpdf.com/jpg-to-pdf) as an alternative).

---------

DISCLAIMER OF RESPONSIBILITY: The code is provided as-is without any further warranty. It is designed solely for legal usage; the author(s) of the code are not responsible for any illegal actions done by anyone using this code. Read the License for more information.

--------------------------------------------------------------------------------
/chromedriver.exe:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aprikyan/google-books-downloader/6a0c272e708c895746df0238813a8a7c3c19f093/chromedriver.exe

--------------------------------------------------------------------------------
/gbd.py:
--------------------------------------------------------------------------------
import os
import traceback
import regex as re
import requests
from time import sleep
import tempfile
from seleniumwire import webdriver
from progressbar import progressbar as bar
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

print("""
Google Books Downloader by @aprikyan, 2020.
. . . . . . . . . . .
""")

def get_book_url():
    """
    Asks user for the URL, takes it,
    removes irrelevant suffixes, adds others,
    returns two URLs to extract data and links.
    """
    url = input("""
Step 1: Paste the URL of the book preview to be downloaded.
(e.g.
https://books.google.com/books?id=buc0AAAAMAAJ&printsec=frontcover&sa=X&ved=2ahUKEwj-y8T4r5vrAhWKLewKHaIQBnYQ6AEwAXoECAQQAg#v=onepage&f=false)

Your input: """)

    if re.findall(r"id=[A-Za-z0-9]+", url):
        id_part = re.findall(r"id=[A-Za-z0-9]+", url)[-1]
    else:
        print("Invalid input. Please try again.")
        # Return the retry's result; falling through would leave `id_part` undefined.
        return get_book_url()

    return (f"https://books.google.com/books?{id_part}&pg=1&hl=en#v=onepage&q&f=false",
            f"https://books.google.com/books?{id_part}&pg=1&hl=en&f=false&output=embed&source=gbs_embed")

def get_book_data(url):
    """
    Inspects the opened book, returns a
    `str` with its title and author.
    """
    driver.get(url)
    driver.refresh()
    sleep(2)
    title = driver.find_element(By.CLASS_NAME, "gb-volume-title").text
    author = driver.find_element(By.CLASS_NAME, "addmd").text

    return f"{title} (b{author[1:]})"

def capture_requests(url):
    """
    Scrolls through the whole book,
    returns the requests driver made.
    """
    driver.get(url)
    driver.refresh()
    sleep(2)
    checkpoint = ""

    # Keep scrolling until the displayed page element stops changing.
    while checkpoint != driver.find_element(By.CLASS_NAME, "pageImageDisplay"):
        checkpoint = driver.find_element(By.CLASS_NAME, "pageImageDisplay")
        checkpoint.click()
        # scrolling ~25 pages
        for i in range(25):
            html = driver.find_element(By.TAG_NAME, "body")
            html.click()
            html.send_keys(Keys.SPACE)
        sleep(2)

    return str(driver.requests)

def extract_urls(requests):
    """
    Takes driver's requests as an input,
    returns a `dict` of page image URLs.
    """
    urls = re.findall(r"url='(https:\/\/[^']+content[^']+pg=[A-Z]+([0-9]+)[^']+)(&w=[0-9]+)'", requests)

    # Replace the captured width parameter with a very large one.
    return {int(url[1]): url[0] + "&w=69420" for url in urls}

def save_backup():
    """
    Asks user whether to backup the available
    image URLs for later use. Does so, if yes.
    """
    save = input("""
Would you like to save a backup file (type Yes or No)?
Your input: """).upper()

    if save == "YES" or save == "Y":
        with open(f"Backup of {book_data}.txt", "w") as f:
            f.write(str(all_pages))
        print(f"Successfully backed up the book in \"Backup of {book_data}.txt\"!")

    elif save != "NO":
        print("Invalid input. Please try again.")
        save_backup()

def select_pages(user_input, all_pages):
    """
    Takes the range of pages user specified
    and image URLs of all pages available,
    returns a `dict` with selected pages only.
    """
    ranges = user_input.replace(" ", "").split(",")
    page_numbers = []

    if "all" in ranges:
        return all_pages
    while "odd" in ranges:
        page_numbers.extend([i for i in all_pages.items() if i[0] % 2])
        ranges.remove("odd")
    while "even" in ranges:
        page_numbers.extend([i for i in all_pages.items() if i[0] % 2 == 0])
        ranges.remove("even")
    for segment in ranges:
        if "-" in segment:
            a, b = segment.split("-")
            page_numbers.extend([i for i in all_pages.items() if int(a) <= i[0] <= int(b)])
        elif int(segment) in all_pages.keys():
            page_numbers.append((int(segment), all_pages[int(segment)]))

    # set() drops pages selected more than once.
    return dict(set(page_numbers))

def get_cookie(url):
    """
    Driver needs to behave like a real
    user to GET page images. This function
    returns a cookie to bribe Google with.
    """
    cookies = []
    driver.get(url)
    driver.refresh()

    for request in driver.requests:
        if request.headers:
            if "Cookie" in request.headers.keys():
                cookies.append(request.headers["Cookie"])
    if len(cookies) == 0:
        cookies = driver.get_cookies()

    return cookies[0]

def download_imgs(pages, cookie, directory):
    """
    Takes the `dict` of pages to download,
    the cookie to use and the directory
    to save to, and then does the magic.
    """

    # `get_cookie` returns either a raw Cookie header string (captured from a
    # request) or a cookie dict (from `driver.get_cookies()`); handle both.
    cookie_header = f"NID={cookie['value']}" if isinstance(cookie, dict) else cookie

    headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30",
               "cookie": cookie_header, }

    for number, url in bar(pages.items()):
        response = requests.get(url, headers=headers, stream=True)
        response.raise_for_status()  # Check for HTTP request errors

        with open(os.path.join(directory, f"page{number}.png"), 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

def step1():
    global book_data, all_pages

    from_url = input("""
Would you like to download a book from URL? Type No if you have a backup, otherwise type Yes.

Your input: """).upper()

    if from_url == "YES" or from_url == "Y":
        data_url, pages_url = get_book_url()
        book_data = get_book_data(data_url)
        print(f"\nWe will now process the pages of \"{book_data}\" one by one. Sit back and relax, as this may take some time, depending on the number of its pages.\n")
        reqs = capture_requests(pages_url)
        all_pages = extract_urls(reqs)
        print("""Now that most of the job is done (yahoo!), it is highly recommended to back up the current progress we have made, so as not to lose it if an error happens to be thrown afterward.
Also, if you would like to download another segment of this book later, the backup will be used then to save your precious time.""")
        save_backup()

    elif from_url == "NO":
        backup = input("""
Enter the location of the backup file.
(e.g. C:/Users/User/Downloads/Backup_of_booktitle.txt)

Your input: """)

        try:
            # Strip the "Backup of " prefix and the ".txt" extension.
            book_data = os.path.basename(backup)[10:-4]
            # literal_eval parses the dict literal written by save_backup()
            # without executing arbitrary code, unlike eval().
            from ast import literal_eval
            with open(backup) as f:
                all_pages = literal_eval(f.read())
        except Exception:
            print("Invalid input. Please try again.")
            step1()

    else:
        print("Invalid input. Please try again.")
        step1()

def step2():
    global selected_pages, cookie

    selection = input("""
Step 2: Specify the pages to be downloaded. You may use combinations of:
- **all**: download all pages available
- exact numbers (e.g. 5, 3, 16)
- ranges (e.g. 11-13, 1-100)
- keywords odd and/or even, to download odd or even pages respectively
- commas to separate the tokens
Your input may look like "1, 10-50, odd, 603".
Note that only pages available for preview will be downloaded.

Your input: """)

    try:
        selected_pages = select_pages(selection, all_pages)

    except Exception:
        print("Invalid input. Please try again.")
        step2()

    # it's a surprise tool that will help us later
    cookie = get_cookie(list(all_pages.items())[0][1])

def step3():
    main_directory = input("""
Step 3 (optional): Specify the location to download the book pages to (a new folder will be created in that directory).
ENTER to save them right here.

Your input: """)

    if main_directory == "":
        # A blank answer means "save them right here", i.e. the current directory.
        # (tempfile.TemporaryDirectory() returns an object, not a path string,
        # so it cannot be passed to os.path.join.)
        main_directory = os.getcwd()

    try:
        new_directory = os.path.join(main_directory, book_data)
        if not os.path.exists(new_directory):
            os.mkdir(new_directory)
    except Exception:
        try:
            # Fall back to saving directly into the chosen directory.
            new_directory = main_directory
            if not os.path.exists(new_directory):
                os.mkdir(new_directory)
        except Exception:
            print(f"Invalid input \"{main_directory}\". Please try again, or leave the input blank to use the current directory.")
            # Return so the failed attempt does not go on to download.
            return step3()

    print(f"\nWe will now download all {len(selected_pages)} pages you selected. This will take a minute or two.\n")
    print(f"\nDownload folder is: {new_directory}\n")
    download_imgs(selected_pages, cookie, new_directory)

if __name__ == "__main__":
    global driver

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--log-level=-1")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])
    chrome_options.add_experimental_option("prefs", {"safebrowsing.enabled": True})
    driver = webdriver.Chrome("chromedriver.exe", options=chrome_options)

    try:
        step1()
        step2()
        step3()

    except Exception as e:
        with open("google-books-downloader_crash.log", "w") as log:
            log.write(traceback.format_exc())
        print(f"""
Something went wrong :/

Please make sure that:
- you are connected to the Internet
- the book you are trying to download has preview
- you entered a valid URL of a Google Books book
- your inputs follow the expected format
- you have permission to save/create files in this and the download directories

If it still repeats and you think this is an error, please report it on github.com/aprikyan/google-books-downloader.
When reporting, do not forget to attach the following file to the issue:
{os.path.join(os.getcwd(), "google-books-downloader_crash.log")}
""")

    else:
        print(f"""
The selected pages were successfully downloaded into the "{book_data}" folder!

Note that for your convenience the pages are saved as images. If you would like to combine them in a PDF (or another format), it might be done using specialized websites and apps.""")

# combining in PDF involves asking about its DPI, size, etc, and
# it would take much time and RAM, so it's better to leave it to user

--------------------------------------------------------------------------------
/img.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aprikyan/google-books-downloader/6a0c272e708c895746df0238813a8a7c3c19f093/img.png

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
regex
progressbar2
selenium==4.0.0
selenium-wire>=5.1.0
requests

--------------------------------------------------------------------------------
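
The page-selection syntax that gbd.py accepts in its Step 2 prompt (`all`, `odd`, `even`, single numbers, ranges, separated by commas) may be easier to follow with a small standalone sketch. The function name `parse_selection` and the sample `pages` dict with its `example.invalid` URLs are hypothetical and exist only for illustration; the logic is modelled on `select_pages` but is a simplified rewrite, not the author's code.

```python
def parse_selection(user_input, all_pages):
    """Return the subset of `all_pages` (page number -> URL) named by `user_input`."""
    tokens = [t for t in user_input.replace(" ", "").split(",") if t]
    if "all" in tokens:
        return dict(all_pages)

    wanted = set()
    for token in tokens:
        if token == "odd":
            wanted.update(n for n in all_pages if n % 2 == 1)
        elif token == "even":
            wanted.update(n for n in all_pages if n % 2 == 0)
        elif "-" in token:
            # A range such as "4-6" is inclusive on both ends.
            start, end = (int(part) for part in token.split("-"))
            wanted.update(n for n in all_pages if start <= n <= end)
        elif int(token) in all_pages:
            wanted.add(int(token))

    # Pages that are not available in the preview are silently skipped.
    return {n: all_pages[n] for n in sorted(wanted)}


# Hypothetical preview in which only pages 1-6 and 10 are available.
pages = {n: f"https://example.invalid/page{n}" for n in (1, 2, 3, 4, 5, 6, 10)}
selection = parse_selection("1, 4-6, even, 603", pages)
print(sorted(selection))
```

Here page 603 is requested but absent from the hypothetical preview, so it is dropped, which matches the note in the prompt that only pages available for preview are downloaded. Collecting page numbers in a set also keeps a page from being selected twice when tokens overlap.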