├── README.md
├── logo
│   ├── logo.png
│   └── pdfLogo.png
└── src
    └── PdfDownloader.py

/README.md:
--------------------------------------------------------------------------------
# PDF Downloader

![logo](logo/logo.png)

- An innovative web scraping solution to save time.
- Instantly download all necessary PDF files from a webpage.

**Libraries:**

Here's a list of additional modules you might have to install:

- BeautifulSoup4 4.9.1
- lxml 4.5.1
- wget 3.2
- requests 2.22.0

**How to Use:**

- Download the Python script and run it from your terminal:

```
python3 PdfDownloader.py
```

- When prompted, enter your course link.
- Your files will be downloaded to the folder you run the script from.

**Future Work:**

- Deploy with Docker?
- Currently, web pages whose URLs end with .html are not parsed correctly. The script needs to be extended to handle them.

**Contribute:**

Feel free to create a pull request if you:

- Have any ideas to improve the code.
- Can think of more use cases for different university-specific websites.

**Read More:**

- [Medium Article](https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48)

--------------------------------------------------------------------------------
/logo/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nhammad/PDFDownloader/24e185a50d79baf901bfbfba340afb9d6df7a32e/logo/logo.png
--------------------------------------------------------------------------------
/logo/pdfLogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nhammad/PDFDownloader/24e185a50d79baf901bfbfba340afb9d6df7a32e/logo/pdfLogo.png
--------------------------------------------------------------------------------
/src/PdfDownloader.py:
--------------------------------------------------------------------------------
import sys
from urllib.parse import urlparse
from urllib.request import urlopen

import wget
from bs4 import BeautifulSoup as bs


def check_validity(my_url):
    # Make sure the URL is reachable before trying to scrape it.
    try:
        urlopen(my_url)
        print("Valid URL")
    except IOError:
        print("Invalid URL")
        sys.exit()


def get_pdfs(my_url):
    links = []
    html = urlopen(my_url).read()
    html_page = bs(html, features="lxml")
    og_url = html_page.find("meta", property="og:url")
    base = urlparse(my_url)
    print("base", base)
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        # Skip anchors without an href attribute and non-PDF links.
        if current_link and current_link.endswith('.pdf'):
            if og_url:
                print("currentLink", current_link)
                links.append(og_url["content"] + current_link)
            else:
                links.append(base.scheme + "://" + base.netloc + current_link)

    for link in links:
        try:
            wget.download(link)
        except Exception:
            print("\n\nUnable to download a file\n")
        print('\n')


def main():
    my_url = input("Enter Link: ")
    check_validity(my_url)
    get_pdfs(my_url)


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
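A note on the link-building step: the script constructs absolute PDF URLs by string concatenation (`og:url` or `scheme://netloc` plus the raw `href`), which can produce broken URLs when a page lives in a subdirectory and uses relative links; this is one likely cause of the `.html` parsing issue listed under Future Work. Below is a minimal sketch of a more robust alternative using the standard library's `urllib.parse.urljoin`; the function name `resolve_pdf_links` and the example URLs are hypothetical, not part of the original script.

```python
from urllib.parse import urljoin

def resolve_pdf_links(page_url, hrefs):
    """Resolve anchor hrefs against the page URL, keeping only PDF links."""
    # urljoin handles relative paths, root-relative paths, and full URLs alike,
    # so no manual scheme/netloc stitching is needed.
    return [urljoin(page_url, h) for h in hrefs
            if h and h.lower().endswith(".pdf")]

print(resolve_pdf_links(
    "https://example.edu/course/index.html",
    ["notes/week1.pdf", "/shared/syllabus.pdf", None, "slides.ppt"],
))
# → ['https://example.edu/course/notes/week1.pdf',
#    'https://example.edu/shared/syllabus.pdf']
```

Dropping this in place of the `og_url`/`base` branching in `get_pdfs` would also make the `None`-href guard unnecessary, since the filter handles it.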