├── README.md
├── logo
│   ├── logo.png
│   └── pdfLogo.png
└── src
    └── PdfDownloader.py
/README.md:
--------------------------------------------------------------------------------
1 | # PDF Downloader
2 |
3 |
4 |
5 |
6 |
7 | - An innovative web-scraping solution to save time.
8 | - Instantly download all necessary PDF files from a webpage.
9 |
10 | **Libraries:**
11 |
12 | Here's a list of additional modules you may need to install.
13 |
14 | - BeautifulSoup4 4.9.1
15 | - lxml 4.5.1
16 | - wget 3.2
17 | - requests 2.22.0
18 |
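The list above can be captured in a `requirements.txt` so the dependencies install in one step (a sketch; the pins mirror the versions listed, and newer releases may also work):

```
beautifulsoup4==4.9.1
lxml==4.5.1
wget==3.2
requests==2.22.0
```

Install with `pip install -r requirements.txt`.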
19 | **How to Use:**
20 |
21 | - Download the Python script and run it from your terminal:
22 |
23 | ```
24 | python3 PdfDownloader.py
25 | ```
26 | - When prompted, enter your course link.
27 |
28 | - Your files will be downloaded to the folder from which you run the script.
29 |
30 |
31 | **Future Work:**
32 |
33 | - Deploy with Docker?
34 | - Currently, web pages whose URLs end in .html are not parsed correctly; the script needs to be extended to handle them.
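One possible direction for the .html issue (a sketch, not part of the current script): the standard library's `urllib.parse.urljoin` resolves an href against the full page URL, including pages whose URLs end in .html, which could replace the manual scheme/netloc join. The URLs below are hypothetical examples:

```python
from urllib.parse import urljoin

# urljoin resolves relative hrefs against the page's directory,
# and root-relative hrefs against the site root.
page = "https://example.edu/courses/cs101/index.html"

print(urljoin(page, "notes/lecture1.pdf"))
# https://example.edu/courses/cs101/notes/lecture1.pdf

print(urljoin(page, "/files/syllabus.pdf"))
# https://example.edu/files/syllabus.pdf
```

Absolute hrefs are passed through unchanged, so the same call handles every link shape.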
35 |
36 |
37 | **Contribute:**
38 |
39 | Feel free to create a pull request if you:
40 |
41 | - Have ideas to improve the code.
42 | - Can think of more use cases for other university-specific websites.
43 |
44 | **Read More:**
45 |
46 | - [Medium Article](https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48)
47 |
48 |
--------------------------------------------------------------------------------
/logo/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nhammad/PDFDownloader/24e185a50d79baf901bfbfba340afb9d6df7a32e/logo/logo.png
--------------------------------------------------------------------------------
/logo/pdfLogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nhammad/PDFDownloader/24e185a50d79baf901bfbfba340afb9d6df7a32e/logo/pdfLogo.png
--------------------------------------------------------------------------------
/src/PdfDownloader.py:
--------------------------------------------------------------------------------
1 | import sys
2 |
3 | import wget
4 | from bs4 import BeautifulSoup as bs
5 | from urllib.parse import urlparse
6 | from urllib.request import urlopen
7 |
10 | def check_validity(my_url):
11 |     # Exit early if the URL cannot be opened at all.
12 |     try:
13 |         urlopen(my_url)
14 |         print("Valid URL")
15 |     except IOError:
16 |         print("Invalid URL")
17 |         sys.exit()
17 |
18 |
19 | def get_pdfs(my_url):
20 |     links = []
21 |     html = urlopen(my_url).read()
22 |     html_page = bs(html, features="lxml")
23 |     og_url = html_page.find("meta", property="og:url")
24 |     base = urlparse(my_url)
25 |     for link in html_page.find_all('a'):
26 |         current_link = link.get('href')
27 |         # Skip anchors without an href and links that are not PDFs.
28 |         if current_link is None or not current_link.endswith('.pdf'):
29 |             continue
30 |         if current_link.startswith('http'):
31 |             links.append(current_link)  # already absolute
32 |         elif og_url:
33 |             links.append(og_url["content"] + current_link)
34 |         else:
35 |             links.append(base.scheme + "://" + base.netloc + current_link)
36 |
37 |     for link in links:
38 |         try:
39 |             wget.download(link)
40 |         except Exception:
41 |             print("\nUnable to download:", link, "\n")
41 |
42 |
43 | def main():
44 |     my_url = input("Enter Link: ")
45 |     check_validity(my_url)
46 |     get_pdfs(my_url)
47 |
48 |
49 | if __name__ == "__main__":
50 |     main()
50 |
--------------------------------------------------------------------------------