├── reqs.txt
├── media
    ├── bull.png
    └── demo.gif
├── LICENSE
├── pdf.py
├── .gitignore
└── README.md


/reqs.txt:
--------------------------------------------------------------------------------
requests

--------------------------------------------------------------------------------
/media/bull.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/therealAJ/bulk-pdf/HEAD/media/bull.png

--------------------------------------------------------------------------------
/media/demo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/therealAJ/bulk-pdf/HEAD/media/demo.gif

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Alex Jordache

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/pdf.py:
--------------------------------------------------------------------------------
import re
import sys
import subprocess
import requests
import argparse
from urllib.request import urlopen

# Command-line arguments: the page to scan, the download directory,
# and an optional base URL for relative (locally hosted) PDF links.
parser = argparse.ArgumentParser()
parser.add_argument('website_url')
parser.add_argument('file_path')
parser.add_argument('relative_url', nargs='?', default='none')
args = parser.parse_args()

# URL of the page whose HTML is parsed
url_link = str(args.website_url)
# Base URL that wget prepends to relative links
wget_link = str(args.relative_url)
# Directory the PDFs are downloaded into
sys_path = str(args.file_path)

website = urlopen(url_link)

# Decode the page into a string
html = website.read().decode('utf-8')

# Find all non-hosted (absolute URL) and local (relative URL) PDF links.
# Raw strings keep \S intact, and \. matches a literal dot before "pdf".
links = re.findall(r'"(https?://\S*?\.pdf)"', html)
local_links = re.findall(r'"([^."=]*\.pdf)', html)

# Download non-hosted PDFs.
# Passing an argument list avoids handing scraped links to a shell.
for link in links:
    subprocess.run(['wget', link, '-P', sys_path])

# Download local/hosted PDFs, falling back to the page URL as the base
base_url = url_link if wget_link == 'none' else wget_link
for link in local_links:
    subprocess.run(['wget', '%s/%s' % (base_url, link), '-P', sys_path])

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject


#OSX
.DS_Store

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Bulk PDF downloader

*Will work for local and non-hosted PDFs*

> A CLI for downloading external PDFs (for lazy people like me)

## Demo

![Alt text](https://raw.githubusercontent.com/therealAJ/bulk-pdf/master/media/demo.gif)
## Motivation
One day I was downloading what felt like millions of PDF packages of CS notes. A couple of minutes in, I got really tired of right-clicking `Save Link As`. So I decided to build this :)


## Requirements

`argparse`

`urllib`

`requests`

`wget`

`python >= 3.5`

`argparse` and `urllib` are part of the Python 3 standard library. `wget` is the command-line tool and must be available on your `PATH`.

## Install

```sh
$ git clone https://github.com/therealAJ/bulk-pdf
$ cd bulk-pdf
$ pip install -r reqs.txt
```

## Usage

```sh
$ python pdf.py <website-url> <path> [OPTIONAL-BASE-URL-FOR-HOSTED-PDFS]
```

Downloads discovered PDFs to the specified `path`.
For example:

```sh
$ python pdf.py https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/ ~/Desktop/Lectures
# downloads all PDFs to your `~/Desktop/Lectures` folder
```

### Notes

Webpages host PDFs in different ways. Some use an absolute URL, like `https://hosting-site.com/hello.pdf`; others host locally and have `href` attributes that look something like `/courses/cs101/hello.pdf`. The second case is the reason I added the optional argument for the URL of the parent hosting site.

You may run into a URL that looks like `https://site.com/lectures.html`. More often than not, this is where you want to pass the full URL so the entire webpage gets parsed, and then pass `https://site.com` as the optional parameter for the `wget` requests.

So, an example would look like:
```sh
$ python pdf.py https://site.com/lectures.html ~/Desktop/Test https://site.com
```
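The two cases in these notes can be sketched in Python. This is only an illustration under a few assumptions, not the code in `pdf.py`: the helper name `find_pdf_links` and the sample HTML are made up for the example, the patterns are stricter variants of the script's regexes, and `urljoin` stands in for the script's plain string concatenation. Nothing is downloaded; the sketch only prints the URLs that would be passed to `wget`.

```python
import re
from urllib.parse import urljoin


def find_pdf_links(html, base_url):
    """Return a full URL for every quoted PDF link in html (hypothetical helper)."""
    # Case 1: absolute links such as "https://hosting-site.com/hello.pdf"
    absolute = re.findall(r'"(https?://\S*?\.pdf)"', html)
    # Case 2: relative links such as "/courses/cs101/hello.pdf"
    relative = re.findall(r'"([^."=]*\.pdf)"', html)
    # Relative links only become downloadable once joined to the base URL
    return absolute + [urljoin(base_url, link) for link in relative]


# Made-up page content containing one link of each kind
sample_html = (
    '<a href="https://hosting-site.com/hello.pdf">absolute</a>'
    '<a href="/courses/cs101/hello.pdf">relative</a>'
)

for url in find_pdf_links(sample_html, "https://site.com"):
    print(url)
```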
Hopefully I haven't confused you :) File an issue if you run into anything of interest. PRs welcome :)

--------------------------------------------------------------------------------