├── reqs.txt
├── media
│   ├── bull.png
│   └── demo.gif
├── LICENSE
├── pdf.py
├── .gitignore
└── README.md


/reqs.txt:
--------------------------------------------------------------------------------
1 | requests
2 | 


--------------------------------------------------------------------------------
/media/bull.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/therealAJ/bulk-pdf/HEAD/media/bull.png


--------------------------------------------------------------------------------
/media/demo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/therealAJ/bulk-pdf/HEAD/media/demo.gif


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2017 Alex Jordache
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/pdf.py:
--------------------------------------------------------------------------------
 1 | import re
 2 | import sys
 3 | import subprocess
 4 | import requests
 5 | import argparse
 6 | from urllib.request import urlopen
 7 | 
 8 | '''
 9 | argparse to enable finicky command-line args
10 | '''
11 | parser = argparse.ArgumentParser()
12 | parser.add_argument('website_url')
13 | parser.add_argument('file_path')
14 | parser.add_argument('relative_url', nargs='?', default='none')
15 | args = parser.parse_args()
16 | 
17 | # URL of the page to parse for PDF links
18 | url_link = str(args.website_url)
19 | # Optional base URL used to resolve relative PDF links
20 | wget_link = str(args.relative_url)
21 | # Destination directory for downloads
22 | sys_path = str(args.file_path)
23 | 
24 | website = urlopen(url_link)
25 | 
26 | # Decode the page into a string
27 | html = website.read().decode('utf-8')
28 | 
29 | # Find absolute (externally hosted) and relative (locally hosted) PDF links
30 | links = re.findall(r'"(https?://\S*?\.pdf)"', html)
31 | local_links = re.findall(r'"([^."=]*\.pdf)', html)
32 | 
33 | # Download absolute PDF links; the argument list avoids shell interpolation
34 | for link in links:
35 | 	subprocess.run(['wget', link, '-P', sys_path])
36 | 
37 | # Download relative PDF links, resolved against the base URL
38 | base_url = url_link if wget_link == 'none' else wget_link
39 | for link in local_links:
40 | 	subprocess.run(['wget', '%s/%s' % (base_url, link), '-P', sys_path])


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | 
 6 | # C extensions
 7 | *.so
 8 | 
 9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | 
27 | # PyInstaller
28 | #  Usually these files are written by a python script from a template
29 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 | 
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 | 
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 | 
48 | # Translations
49 | *.mo
50 | *.pot
51 | 
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 | 
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 | 
60 | # Scrapy stuff:
61 | .scrapy
62 | 
63 | # Sphinx documentation
64 | docs/_build/
65 | 
66 | # PyBuilder
67 | target/
68 | 
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 | 
72 | # pyenv
73 | .python-version
74 | 
75 | # celery beat schedule file
76 | celerybeat-schedule
77 | 
78 | # dotenv
79 | .env
80 | 
81 | # virtualenv
82 | venv/
83 | ENV/
84 | 
85 | # Spyder project settings
86 | .spyderproject
87 | 
88 | # Rope project settings
89 | .ropeproject
90 | 
91 | 
92 | #OSX
93 | .DS_Store
94 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | <h3 font-size='20px' align='center'>Bulk PDF downloader</h3>
 2 | <p align="center">
 3 |   <img src="/media/bull.png"/>
 4 |   <br>
 5 |   <br>
 6 |   <i>* works for local and non-hosted PDFs *</i>
 7 |   
 8 |   <a><img src="https://img.shields.io/badge/License-MIT-blue.svg"></a>
 9 |   <a><img src="https://img.shields.io/badge/python-3.5%2C%203.6-39CCCC.svg"></a>
10 | </p>
11 | 
12 | > a CLI for downloading external PDFs (for lazy people like me)
13 | 
14 | <br>
15 | 
16 | ## Demo
17 | 
18 | ![Alt text](https://raw.githubusercontent.com/therealAJ/bulk-pdf/master/media/demo.gif)
19 | 
20 | <br>
21 | 
22 | ## Motivation
23 | One day I was downloading what felt like millions of PDF packages of CS notes. A couple of minutes in, I got really tired of right-clicking `Save Link As`. So I decided to build this :)
24 | 
25 | 
26 | ## Requirements
27 | 
28 | `argparse` (standard library)
29 | <br>
30 | `urllib` (standard library)
31 | <br>
32 | `requests`
33 | <br>
34 | `wget` (command-line tool, must be on your `PATH`)
35 | <br>
36 | `python >= 3.5`
37 | 
38 | ## Install
39 | 
40 | ```sh
41 | $ git clone https://github.com/therealAJ/bulk-pdf
42 | $ cd bulk-pdf
43 | $ pip install -r reqs.txt
44 | ```
45 | 
46 | ## Usage
47 | 
48 | ```sh
49 | $ python pdf.py <URL-CONTAINING-DESIRED-PDFS> <dest/which/you/want/to/download/to> [OPTIONAL-BASE-URL-FOR-HOSTED-PDFS]
50 | ```
51 | 
52 | Downloads every PDF found on the page to the destination directory you specify.
53 | 
54 | <br>
55 | 
56 | For example:
57 | 
58 | ```sh
59 | $ python pdf.py https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/ ~/Desktop/Lectures
60 | # downloads all PDFs to your `~/Desktop/Lectures` folder
61 | ```
62 | 
63 | ### Notes
64 | 
65 | Webpages host PDFs in different ways. Some link with an absolute URL, like `https://hosting-site.com/hello.pdf`; others host locally and use a relative path in the `href` attribute, something like `/courses/cs101/hello.pdf`. The second case is the reason I added the optional base URL of the parent hosting site.
66 | <br>
67 | <br>
68 | You may run into a URL that looks like `https://site.com/lectures.html`. More often than not, you want to pass the full URL as the page to parse and then pass `https://site.com` as the optional base URL for the `wget` requests.
69 | 
70 | So, an example would look like: 
71 | ```sh
72 | $ python pdf.py https://site.com/lectures.html ~/Desktop/Test https://site.com
73 | ```
74 | <br>
75 | <br>
76 | Hopefully I haven't confused you :) File an issue if you run into anything of interest. PRs welcome :) 
77 | 
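
The Notes above describe two kinds of PDF links and how the optional base URL comes into play. The following is a minimal sketch of that resolution step, not the script itself: the patterns mirror the ones in `pdf.py` (with the dot before `pdf` escaped, which is an assumption of this sketch), while `pdf_urls` and `sample_html` are hypothetical names made up for illustration. Nothing is downloaded; the function only returns the URLs that would be handed to `wget`.

```python
import re

# Patterns modeled on pdf.py; escaping the dot before "pdf" is an assumption.
ABSOLUTE_PDF = re.compile(r'"(https?://\S*?\.pdf)"')
RELATIVE_PDF = re.compile(r'"([^."=]*\.pdf)')


def pdf_urls(html, page_url, base_url=None):
    """Return the URLs that would be passed to wget (hypothetical helper)."""
    # Without a base URL, relative links are joined onto the page URL itself.
    base = base_url if base_url is not None else page_url
    absolute = ABSOLUTE_PDF.findall(html)
    relative = ['%s/%s' % (base, link) for link in RELATIVE_PDF.findall(html)]
    return absolute + relative


# Hypothetical page source containing one link of each kind.
sample_html = (
    '<a href="https://hosting-site.com/hello.pdf">external</a>'
    '<a href="courses/cs101/notes.pdf">local</a>'
)

urls = pdf_urls(sample_html, 'https://site.com/lectures.html', 'https://site.com')
for url in urls:
    print(url)
```

Leaving out the third argument joins the relative link onto `https://site.com/lectures.html`, which is unlikely to be a valid location. That is the situation the optional base URL is meant to handle.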


--------------------------------------------------------------------------------