├── README.md
├── logo
│   ├── logo.png
│   └── pdfLogo.png
└── src
    └── PdfDownloader.py

/README.md:
--------------------------------------------------------------------------------
# PDF Downloader

![logo](logo/logo.png)

- An innovative web scraping solution to save time.
- Instantly download all necessary PDF files from a webpage.

**Libraries:**

Here's a list of additional modules you might have to install:

- BeautifulSoup4 4.9.1
- lxml 4.5.1
- wget 3.2
- requests 2.22.0

**How to Use:**

- Download the Python script and run it from your terminal:

```
python3 PdfDownloader.py
```

- When prompted, enter your course link.
- Your files will be downloaded to the folder you run the script from.

**Future Work:**

- Deploy with Docker?
- Currently, web pages whose URLs end with .html are not parsed correctly. The script needs to be extended to handle them.

**Contribute:**

Feel free to create a pull request if you:

- Have any ideas to improve the code.
- Can think of more use cases for different university-specific websites.

**Read More:**

- [Medium Article](https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48)

--------------------------------------------------------------------------------
/logo/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nhammad/PDFDownloader/24e185a50d79baf901bfbfba340afb9d6df7a32e/logo/logo.png
--------------------------------------------------------------------------------
/logo/pdfLogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nhammad/PDFDownloader/24e185a50d79baf901bfbfba340afb9d6df7a32e/logo/pdfLogo.png
--------------------------------------------------------------------------------
/src/PdfDownloader.py:
--------------------------------------------------------------------------------
import sys
from urllib.parse import urlparse
from urllib.request import urlopen

import wget
from bs4 import BeautifulSoup as bs


def check_validity(my_url):
    # Make sure the URL is reachable before trying to scrape it.
    try:
        urlopen(my_url)
        print("Valid URL")
    except IOError:
        print("Invalid URL")
        sys.exit()


def get_pdfs(my_url):
    links = []
    html = urlopen(my_url).read()
    html_page = bs(html, features="lxml")
    og_url = html_page.find("meta", property="og:url")
    base = urlparse(my_url)
    print("base", base)
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        # Skip anchors without an href attribute and non-PDF links.
        if current_link and current_link.endswith('.pdf'):
            if og_url:
                print("currentLink", current_link)
                links.append(og_url["content"] + current_link)
            else:
                links.append(base.scheme + "://" + base.netloc + current_link)

    for link in links:
        try:
            wget.download(link)
        except Exception:
            print("\n\nUnable to download a file\n")
        print('\n')


def main():
    my_url = input("Enter Link: ")
    check_validity(my_url)
    get_pdfs(my_url)


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
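A note on the link-building step: the script constructs absolute PDF URLs by string concatenation (`og:url` or `scheme://netloc` plus the raw `href`), which can produce broken URLs when a page lives in a subdirectory and uses relative links; this is one likely cause of the `.html` parsing issue listed under Future Work. Below is a minimal sketch of a more robust alternative using the standard library's `urllib.parse.urljoin`; the function name `resolve_pdf_links` and the example URLs are hypothetical, not part of the original script.

```python
from urllib.parse import urljoin

def resolve_pdf_links(page_url, hrefs):
    """Resolve anchor hrefs against the page URL, keeping only PDF links."""
    # urljoin handles relative paths, root-relative paths, and full URLs alike,
    # so no manual scheme/netloc stitching is needed.
    return [urljoin(page_url, h) for h in hrefs
            if h and h.lower().endswith(".pdf")]

print(resolve_pdf_links(
    "https://example.edu/course/index.html",
    ["notes/week1.pdf", "/shared/syllabus.pdf", None, "slides.ppt"],
))
# → ['https://example.edu/course/notes/week1.pdf',
#    'https://example.edu/shared/syllabus.pdf']
```

Dropping this in place of the `og_url`/`base` branching in `get_pdfs` would also make the `None`-href guard unnecessary, since the filter handles it.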