├── .gitignore
├── LICENSE.txt
├── MANIFEST.in
├── README.md
├── README.rst
├── ctdl
│   ├── __init__.py
│   ├── ctdl.py
│   ├── downloader.py
│   ├── gui.py
│   ├── gui_downloader.py
│   └── utils.py
├── icon.gif
├── icon.png
├── requirements.txt
└── setup.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | *.pyc
3 | /dist/
4 | /*.egg
5 | /*.egg-info
6 | .idea/
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Nikhil Kumar Singh
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
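Before the project files, one orientation note: ctdl's core mechanic (implemented in `ctdl/ctdl.py` further down in this dump) is turning a topic and a file type into a search-engine query string using a `filetype:` filter, optionally restricted with `site:`. A minimal, self-contained sketch of that construction — the helper name `build_search_query` is mine, not the project's; the format strings mirror those in `search()`:

```python
def build_search_query(query, file_type='pdf', site=''):
    # Mirrors the format strings used in ctdl/ctdl.py's search():
    # an optional site: restriction plus a filetype: filter.
    if site == '':
        return "filetype:{0} {1}".format(file_type, query)
    return "site:{0} filetype:{1} {2}".format(site, file_type, query)

print(build_search_query('python tutorial'))
# -> filetype:pdf python tutorial
print(build_search_query('health', file_type='ppt', site='who.int'))
# -> site:who.int filetype:ppt health
```

The resulting string is sent as the `q` parameter of a Google or DuckDuckGo search request, and the result links are then scraped and validated.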
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.rst
2 | include icon.png
3 | include icon.gif
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![PyPI](https://img.shields.io/badge/PyPi-v1.5-f39f37.svg)](https://pypi.python.org/pypi/ctdl)
2 | [![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/nikhilkumarsingh/content-downloader/blob/master/LICENSE.txt)
3 | 
4 | # content-downloader
5 | 
6 | **content-downloader**, a.k.a. **ctdl**, is a Python package with a **command line utility** and a **desktop GUI** to download files on any topic in bulk!
7 | 
8 | ![](https://media.giphy.com/media/3oKIPlt7APHqWuVl3q/giphy.gif)
9 | 
10 | ![](https://media.giphy.com/media/xUPGcIvGpH3KvEmlnG/giphy.gif)
11 | 
12 | ## Features
13 | 
14 | - ctdl can be used as a command line utility as well as a desktop GUI.
15 | 
16 | - ctdl fetches file links related to a search query from **Google Search**.
17 | 
18 | - Files can be downloaded in parallel using multithreading.
19 | 
20 | - ctdl is compatible with both Python 2 and Python 3.
21 | 
22 | ## Installation
23 | 
24 | - To install content-downloader, simply run:
25 | 
26 | ```
27 | $ pip install ctdl
28 | ```
29 | 
30 | - There seem to be some issues with parallel progress bars in tqdm which have
31 |   been resolved in this [pull request](https://github.com/tqdm/tqdm/pull/385).
Until this pull is merged, please use my patch by running this command: 32 | 33 | ``` 34 | $ pip install -U git+https://github.com/nikhilkumarsingh/tqdm 35 | ``` 36 | 37 | ## Desktop GUI usage 38 | 39 | To use **ctdl** desktop GUI, open terminal and run this command: 40 | 41 | ``` 42 | $ ctdl-gui 43 | ``` 44 | 45 | ## Command line usage 46 | 47 | ``` 48 | $ ctdl [-h] [-f FILE_TYPE] [-l LIMIT] [-d DIRECTORY] [-p] [-a] [-t] 49 | [-minfs MIN_FILE_SIZE] [-maxfs MAX_FILE_SIZE] [-nr] 50 | [query] 51 | ``` 52 | Optional arguments are: 53 | 54 | - -f FILE_TYPE : set the file type. (can take values like ppt, pdf, xml, etc.) 55 | 56 | Default value: pdf 57 | 58 | - -l LIMIT : specify the number of files to download. 59 | 60 | Default value: 10 61 | 62 | - -d DIRECTORY : specify the directory where files will be stored. 63 | 64 | Default: A directory with same name as the search query in the current directory. 65 | 66 | - -p : for parallel downloading. 67 | 68 | - -minfs MIN_FILE_SIZE : specify minimum file size to download in Kilobytes (KB). 69 | 70 | Default: 0 71 | 72 | - -maxfs MAX_FILE_SIZE : specify maximum file size to download in Kilobytes (KB). 73 | 74 | Default: -1 (represents no maximum file size) 75 | 76 | - -nr : prevent download redirects. 77 | 78 | Default: False 79 | 80 | ## Examples 81 | 82 | - To get list of available filetypes: 83 | 84 | ``` 85 | $ ctdl -a 86 | ``` 87 | 88 | - To get list of potential high threat filetypes: 89 | 90 | ``` 91 | $ ctdl -t 92 | ``` 93 | 94 | - To download pdf files on topic 'python': 95 | 96 | ``` 97 | $ ctdl python 98 | ``` 99 | This is the default behaviour which will download 10 pdf files in a folder named 'python' in current directory. 
100 | 
101 | - To download 3 ppt files on 'health':
102 | 
103 | ```
104 | $ ctdl -f ppt -l 3 health
105 | ```
106 | 
107 | - To explicitly specify the download folder:
108 | 
109 | ```
110 | $ ctdl -d /home/nikhil/Desktop/ml-pdfs machine-learning
111 | ```
112 | 
113 | - To download files in parallel:
114 | ```
115 | $ ctdl -f pdf -p python
116 | ```
117 | 
118 | - To search for and download, in parallel, 10 files in PDF format
119 |   matching the query "python algorithm", without allowing any URL
120 |   redirects, and with the file size restricted to between
121 |   10,000 KB (10 MB) and 100,000 KB (100 MB):
122 | ```
123 | $ ctdl -f pdf -l 10 -minfs 10000 -maxfs 100000 -nr -p "python algorithm"
124 | ```
125 | 
126 | ## Usage in Python files
127 | 
128 | ```python
129 | from ctdl import ctdl
130 | 
131 | ctdl.download_content(
132 |     file_type = 'ppt',
133 |     limit = 5,
134 |     directory = '/home/nikhil/Desktop/ml-pdfs',
135 |     query = 'machine learning using python')
136 | ```
137 | 
138 | ## TODO
139 | 
140 | - [X] Prompt user before downloading potentially threatening files
141 | 
142 | - [X] Create ctdl GUI
143 | 
144 | - [ ] Implement unit testing
145 | 
146 | - [ ] Use the DuckDuckGo API as an option
147 | 
148 | ## Want to contribute?
149 | 
150 | - Clone the repository
151 | 
152 | ```
153 | $ git clone http://github.com/nikhilkumarsingh/content-downloader
154 | ```
155 | 
156 | - Install dependencies
157 | ```
158 | $ pip install -r requirements.txt
159 | ```
160 | 
161 | **Note:** There seem to be some issues with the current version of tqdm.
If you do not get
162 | the expected progress bar behaviour, try this patch:
163 | 
164 | ```
165 | $ pip uninstall tqdm
166 | $ pip install git+https://github.com/nikhilkumarsingh/tqdm
167 | ```
168 | 
169 | - In ctdl/ctdl.py, remove the `.` prefix from `.downloader` and `.utils` for
170 |   the following imports, so it changes from:
171 | ```python
172 | from .downloader import download_series, download_parallel
173 | from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
174 | ```
175 | to:
176 | ```python
177 | from downloader import download_series, download_parallel
178 | from utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
179 | ```
180 | 
181 | - Run the python file directly with `python ctdl/ctdl.py ___` (instead of `ctdl ___`)
182 | 
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | |PyPI| |license|
2 | 
3 | content-downloader
4 | ==================
5 | 
6 | **content-downloader**, a.k.a. **ctdl**, is a Python package with a **command
7 | line utility** and a **desktop GUI** to download files on any topic in
8 | bulk!
9 | 
10 | .. figure:: https://media.giphy.com/media/3oKIPlt7APHqWuVl3q/giphy.gif
11 |    :alt:
12 | 
13 | .. figure:: https://media.giphy.com/media/xUPGcIvGpH3KvEmlnG/giphy.gif
14 |    :alt:
15 | 
16 | Features
17 | --------
18 | 
19 | - ctdl can be used as a command line utility as well as a desktop GUI.
20 | 
21 | - ctdl fetches file links related to a search query from **Google
22 |   Search**.
23 | 
24 | - Files can be downloaded in parallel using multithreading.
25 | 
26 | - ctdl is compatible with both Python 2 and Python 3.
27 | 
28 | Installation
29 | ------------
30 | 
31 | - To install content-downloader, simply run:
32 | 
33 |   ``$ pip install ctdl``
34 | 
35 | - There seem to be some issues with parallel progress bars in tqdm
36 |   which have been resolved in this
37 |   `pull request <https://github.com/tqdm/tqdm/pull/385>`__.
Until this pull is 38 | merged, please use my patch by running this command: 39 | 40 | ``$ pip install -U git+https://github.com/nikhilkumarsingh/tqdm`` 41 | 42 | Desktop GUI usage 43 | ----------------- 44 | 45 | To use **ctdl** desktop GUI, open terminal and run this command: 46 | 47 | :: 48 | 49 | $ ctdl-gui 50 | 51 | Command line usage 52 | ------------------ 53 | 54 | :: 55 | 56 | $ ctdl [-h] [-f FILE_TYPE] [-l LIMIT] [-d DIRECTORY] [-p] [-a] [-t] 57 | [-minfs MIN_FILE_SIZE] [-maxfs MAX_FILE_SIZE] [-nr] 58 | [query] 59 | 60 | Optional arguments are: 61 | 62 | - -f FILE\_TYPE : set the file type. (can take values like ppt, pdf, 63 | xml, etc.) 64 | 65 | :: 66 | 67 | Default value: pdf 68 | 69 | - -l LIMIT : specify the number of files to download. 70 | 71 | :: 72 | 73 | Default value: 10 74 | 75 | - -d DIRECTORY : specify the directory where files will be stored. 76 | 77 | :: 78 | 79 | Default: A directory with same name as the search query in the current directory. 80 | 81 | - -p : for parallel downloading. 82 | 83 | - -minfs MIN\_FILE\_SIZE : specify minimum file size to download in 84 | Kilobytes (KB). 85 | 86 | :: 87 | 88 | Default: 0 89 | 90 | - -maxfs MAX\_FILE\_SIZE : specify maximum file size to download in 91 | Kilobytes (KB). 92 | 93 | :: 94 | 95 | Default: -1 (represents no maximum file size) 96 | 97 | - -nr : prevent download redirects. 98 | 99 | :: 100 | 101 | Default: False 102 | 103 | Examples 104 | -------- 105 | 106 | - To get list of available filetypes: 107 | 108 | ``$ ctdl -a`` 109 | 110 | - To get list of potential high threat filetypes: 111 | 112 | ``$ ctdl -t`` 113 | 114 | - To download pdf files on topic 'python': 115 | 116 | ``$ ctdl python`` This is the default behaviour which will download 10 117 | pdf files in a folder named 'python' in current directory. 
118 | 
119 | - To download 3 ppt files on 'health':
120 | 
121 |   ``$ ctdl -f ppt -l 3 health``
122 | 
123 | - To explicitly specify the download folder:
124 | 
125 |   ``$ ctdl -d /home/nikhil/Desktop/ml-pdfs machine-learning``
126 | 
127 | - To download files in parallel: ``$ ctdl -f pdf -p python``
128 | 
129 | - To search for and download, in parallel, 10 files in PDF format
130 |   matching the query "python algorithm", without allowing any
131 |   URL redirects, and with the file size restricted to between
132 |   10,000 KB (10 MB) and 100,000 KB (100 MB):
133 | 
134 |   ``$ ctdl -f pdf -l 10 -minfs 10000 -maxfs 100000 -nr -p "python algorithm"``
135 | 
136 | Usage in Python files
137 | ---------------------
138 | 
139 | .. code:: python
140 | 
141 |     from ctdl import ctdl
142 | 
143 |     ctdl.download_content(
144 |         file_type = 'ppt',
145 |         limit = 5,
146 |         directory = '/home/nikhil/Desktop/ml-pdfs',
147 |         query = 'machine learning using python')
148 | 
149 | TODO
150 | ----
151 | 
152 | - [X] Prompt user before downloading potentially threatening files
153 | 
154 | - [X] Create ctdl GUI
155 | 
156 | - [ ] Implement unit testing
157 | 
158 | - [ ] Use the DuckDuckGo API as an option
159 | 
160 | Want to contribute?
161 | -------------------
162 | 
163 | - Clone the repository
164 | 
165 |   ``$ git clone http://github.com/nikhilkumarsingh/content-downloader``
166 | 
167 | - Install dependencies: ``$ pip install -r requirements.txt``
168 | 
169 | **Note:** There seem to be some issues with the current version of tqdm.
If
170 | you do not get the expected progress bar behaviour, try this patch:
171 | 
172 | ::
173 | 
174 |     $ pip uninstall tqdm
175 |     $ pip install git+https://github.com/nikhilkumarsingh/tqdm
176 | 
177 | - In ctdl/ctdl.py, remove the ``.`` prefix from ``.downloader`` and
178 |   ``.utils`` for the following imports, so it changes from:
179 | 
180 |   .. code:: python
181 | 
182 |       from .downloader import download_series, download_parallel
183 |       from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
184 | 
185 |   to:
186 | 
187 |   .. code:: python
188 | 
189 |       from downloader import download_series, download_parallel
190 |       from utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
191 | 
192 | - Run the python file directly with ``python ctdl/ctdl.py ___`` (instead of
193 |   ``ctdl ___``)
194 | 
195 | .. |PyPI| image:: https://img.shields.io/badge/PyPi-v1.5-f39f37.svg
196 |    :target: https://pypi.python.org/pypi/ctdl
197 | .. |license| image:: https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000
198 |    :target: https://github.com/nikhilkumarsingh/content-downloader/blob/master/LICENSE.txt
--------------------------------------------------------------------------------
/ctdl/__init__.py:
--------------------------------------------------------------------------------
1 | from .ctdl import *
--------------------------------------------------------------------------------
/ctdl/ctdl.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import argparse
3 | import requests
4 | import urllib
5 | try:
6 |     from urllib.request import urlopen
7 |     from urllib.error import HTTPError
8 | except ImportError:
9 |     from urllib2 import urlopen
10 |     from urllib2 import HTTPError
11 | from requests.packages.urllib3.util.retry import Retry
12 | from requests.adapters import HTTPAdapter
13 | from bs4 import BeautifulSoup
14 | from .downloader import download_series, download_parallel
15 | from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS, DEFAULTS
16 | 
17 | 
18 | s = requests.Session()
19 | # Max retries and back-off
strategy so all requests to http:// sleep before retrying
20 | retries = Retry(total=5,
21 |                 backoff_factor=0.1,
22 |                 status_forcelist=[500, 502, 503, 504])
23 | s.mount('http://', HTTPAdapter(max_retries=retries))
24 | 
25 | 
26 | def get_google_links(limit, params, headers):
27 |     """
28 |     function to fetch links equal to limit
29 | 
30 |     every Google search result page has a start index.
31 |     every page contains 10 search results.
32 |     """
33 |     links = []
34 |     for start_index in range(0, limit, 10):
35 |         params['start'] = start_index
36 |         resp = s.get("https://www.google.com/search", params = params, headers = headers)
37 |         page_links = scrape_links(resp.content, engine = 'g')
38 |         links.extend(page_links)
39 |     return links[:limit]
40 | 
41 | 
42 | def get_duckduckgo_links(limit, params, headers):
43 |     """
44 |     function to fetch links equal to limit
45 | 
46 |     duckduckgo pagination is not static, so there is a limit on
47 |     the maximum number of links that can be scraped
48 |     """
49 |     resp = s.get('https://duckduckgo.com/html', params = params, headers = headers)
50 |     links = scrape_links(resp.content, engine = 'd')
51 |     return links[:limit]
52 | 
53 | 
54 | def scrape_links(html, engine):
55 |     """
56 |     function to scrape file links from html response
57 |     """
58 |     soup = BeautifulSoup(html, 'lxml')
59 |     links = []
60 | 
61 |     if engine == 'd':
62 |         results = soup.findAll('a', {'class': 'result__a'})
63 |         for result in results:
64 |             link = result.get('href')[15:]
65 |             link = link.replace('/blob/', '/raw/')
66 |             links.append(link)
67 | 
68 |     elif engine == 'g':
69 |         results = soup.findAll('h3', {'class': 'r'})
70 |         for result in results:
71 |             link = result.a['href'][7:].split('&')[0]
72 |             link = link.replace('/blob/', '/raw/')
73 |             links.append(link)
74 | 
75 |     return links
76 | 
77 | 
78 | def get_url_nofollow(url):
79 |     """
80 |     function to get the return code of a url
81 | 
82 |     Credits:
http://blog.jasonantman.com/2013/06/python-script-to-check-a-list-of-urls-for-return-code-and-final-return-code-if-redirected/
86 |     """
87 |     try:
88 |         response = urlopen(url)
89 |         code = response.getcode()
90 |         return code
91 |     except HTTPError as e:
92 |         return e.code
93 |     except Exception:
94 |         return 0
95 | 
96 | 
97 | def validate_links(links):
98 |     """
99 |     function to validate urls based on http(s) prefix and return code
100 |     """
101 |     valid_links = []
102 |     for link in links:
103 |         if link.startswith("http://") or link.startswith("https://"):
104 |             valid_links.append(link)
105 | 
106 |     if not valid_links:
107 |         print("No files found.")
108 |         sys.exit(0)
109 | 
110 |     # checking valid urls for return code
111 |     urls = {}
112 |     for link in valid_links:
113 |         if 'github.com' in link and '/blob/' in link:
114 |             link = link.replace('/blob/', '/raw/')
115 |         urls[link] = {'code': get_url_nofollow(link)}
116 | 
117 | 
118 |     # printing valid urls with return code 200
119 |     available_urls = []
120 |     for url in urls:
121 |         print("code: %d\turl: %s" % (urls[url]['code'], url))
122 |         if urls[url]['code'] != 0:
123 |             available_urls.append(url)
124 | 
125 |     return available_urls
126 | 
127 | 
128 | def search(query, engine='g', site="", file_type = 'pdf', limit = 10):
129 |     """
130 |     main function to search for links and return valid ones
131 |     """
132 |     if site == "":
133 |         search_query = "filetype:{0} {1}".format(file_type, query)
134 |     else:
135 |         search_query = "site:{0} filetype:{1} {2}".format(site, file_type, query)
136 | 
137 |     headers = {
138 |         'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) \
139 |                        Gecko/20100101 Firefox/53.0'
140 |     }
141 |     if engine == "g":
142 |         params = {
143 |             'q': search_query,
144 |             'start': 0,
145 |         }
146 |         links = get_google_links(limit, params, headers)
147 | 
148 |     elif engine == "d":
149 |         params = {
150 |             'q': search_query,
151 |         }
152 |         links = get_duckduckgo_links(limit, params, headers)
153 |     else:
154 |         print("Wrong search engine selected!")
155 |
sys.exit()
156 | 
157 |     valid_links = validate_links(links)
158 |     return valid_links
159 | 
160 | 
161 | def check_threats(**args):
162 |     """
163 |     function to check input filetype against threat extensions list
164 |     """
165 |     is_high_threat = False
166 |     for val in THREAT_EXTENSIONS.values():
167 |         if type(val) == list:
168 |             for el in val:
169 |                 if args['file_type'] == el:
170 |                     is_high_threat = True
171 |                     break
172 |         else:
173 |             if args['file_type'] == val:
174 |                 is_high_threat = True
175 |                 break
176 |     return is_high_threat
177 | 
178 | 
179 | def validate_args(**args):
180 |     """
181 |     function to check that the input query is not None
182 |     and set missing arguments to their default values
183 |     """
184 |     if not args['query']:
185 |         print("\nMissing required query argument.")
186 |         sys.exit()
187 | 
188 |     for key in DEFAULTS:
189 |         if key not in args:
190 |             args[key] = DEFAULTS[key]
191 | 
192 |     return args
193 | 
194 | 
195 | def download_content(**args):
196 |     """
197 |     main function to fetch links and download them
198 |     """
199 |     args = validate_args(**args)
200 | 
201 |     if not args['directory']:
202 |         args['directory'] = args['query'].replace(' ', '-')
203 | 
204 |     print("Downloading {0} {1} files on topic {2} from {3} and saving to directory: {4}"
205 |           .format(args['limit'], args['file_type'], args['query'], args['website'], args['directory']))
206 | 
207 |     links = search(args['query'], args['engine'], args['website'], args['file_type'], args['limit'])
208 | 
209 |     if args['parallel']:
210 |         download_parallel(links, args['directory'], args['min_file_size'], args['max_file_size'], args['no_redirects'])
211 |     else:
212 |         download_series(links, args['directory'], args['min_file_size'], args['max_file_size'], args['no_redirects'])
213 | 
214 | 
215 | def show_filetypes(extensions):
216 |     """
217 |     function to show valid file extensions
218 |     """
219 |     for item in extensions.items():
220 |         val = item[1]
221 |         if type(item[1]) == list:
222 |             val = ", ".join(str(x) for x in
item[1])
224 |         print("{0:4}: {1}".format(val, item[0]))
225 | 
226 | 
227 | def main():
228 |     parser = argparse.ArgumentParser(description = "Content Downloader",
229 |                                      epilog = "Now download files on any topic in bulk!")
230 | 
231 |     # defining arguments for parser object
232 |     parser.add_argument("query", type = str, default = None, nargs = '?',
233 |                         help = "Specify the query.")
234 | 
235 |     parser.add_argument("-f", "--file_type", type = str, default = 'pdf',
236 |                         help = "Specify the extension of files to download.")
237 | 
238 |     parser.add_argument("-l", "--limit", type = int, default = 10,
239 |                         help = "Limit the number of search results (in multiples of 10).")
240 | 
241 |     parser.add_argument("-d", "--directory", type = str, default = None,
242 |                         help = "Specify the directory where files will be stored.")
243 | 
244 |     parser.add_argument("-p", "--parallel", action = 'store_true', default = False,
245 |                         help = "For parallel downloading.")
246 | 
247 |     parser.add_argument("-e", "--engine", type = str, default = "g",
248 |                         help = "Specify search engine\nduckduckgo: 'd'\ngoogle: 'g'")
249 | 
250 |     parser.add_argument("-a", "--available", action = 'store_true',
251 |                         help = "Get list of all available filetypes.")
252 | 
253 |     parser.add_argument("-w", "--website", type = str, default = "",
254 |                         help = "Specify a particular website to download content from.")
255 | 
256 |     parser.add_argument("-t", "--threats", action = 'store_true',
257 |                         help = "Get list of all common virus carrier filetypes.")
258 | 
259 |     parser.add_argument("-minfs", "--min-file-size", type = int, default = 0,
260 |                         help = "Specify minimum file size to download in Kilobytes (KB).")
261 | 
262 |     parser.add_argument("-maxfs", "--max-file-size", type = int, default = -1,
263 |                         help = "Specify maximum file size to download in Kilobytes (KB).")
264 | 
265 |     parser.add_argument("-nr", "--no-redirects", action = 'store_true', default = False,
266 |                         help = "Prevent download redirects.")
267 | 
268 |     args = parser.parse_args()
269 |     args_dict = vars(args)
270 | 
271 |     if args.available:
272 |         show_filetypes(FILE_EXTENSIONS)
273 |         return
274 | 
275 |     if args.threats:
276 |         show_filetypes(THREAT_EXTENSIONS)
277 |         return
278 | 
279 |     high_threat = check_threats(**args_dict)
280 | 
281 |     if high_threat:
282 |         def prompt(message, errormessage, isvalid, isexit):
283 |             res = None
284 |             while res is None:
285 |                 res = input(str(message) + ': ')
286 |                 if isexit(res):
287 |                     sys.exit()
288 |                 if not isvalid(res):
289 |                     print(str(errormessage))
290 |                     res = None
291 |             return res
292 |         prompt(
293 |             message = "WARNING: Downloading this file type may expose you to a heightened security risk.\nPress 'y' to proceed or 'n' to exit",
294 |             errormessage = "Error: Invalid option provided.",
295 |             isvalid = lambda x: x == 'y',
296 |             isexit = lambda x: x == 'n'
297 |         )
298 | 
299 |     download_content(**args_dict)
300 | 
301 | 
302 | if __name__ == "__main__":
303 |     main()
--------------------------------------------------------------------------------
/ctdl/downloader.py:
--------------------------------------------------------------------------------
1 | import os
2 | import threading
3 | import requests
4 | from requests.packages.urllib3.util.retry import Retry
5 | from requests.adapters import HTTPAdapter
6 | from tqdm import tqdm, trange
7 | 
8 | chunk_size = 1024
9 | main_iter = None
10 | yellow_color = "\033[93m"
11 | blue_color = "\033[94m"
12 | 
13 | # modes -> s: series | p: parallel
14 | 
15 | s = requests.Session()
16 | # Max retries and back-off strategy so all requests to http:// sleep before retrying
17 | retries = Retry(total = 5,
18 |                 backoff_factor = 0.1,
19 |                 status_forcelist = [500, 502, 503, 504])
20 | s.mount('http://', HTTPAdapter(max_retries = retries))
21 | 
22 | def download(url, directory, min_file_size = 0, max_file_size = -1,
23 |              no_redirects = 
False, pos = 0, mode = 's'):
24 |     global main_iter
25 | 
26 |     file_name = url.split('/')[-1]
27 |     file_address = directory + '/' + file_name
28 |     is_redirects = not no_redirects
29 | 
30 |     resp = s.get(url, stream = True, allow_redirects = is_redirects)
31 | 
32 |     if resp.status_code != 200:
33 |         # ignore this file since the server returned an invalid response
34 |         return
35 | 
36 |     try:
37 |         total_size = int(resp.headers['content-length'])
38 |     except KeyError:
39 |         total_size = len(resp.content)
40 | 
41 |     total_chunks = total_size / chunk_size
42 | 
43 |     if total_chunks < min_file_size:
44 |         # ignore this file since its size is less than min_file_size
45 |         return
46 |     elif max_file_size != -1 and total_chunks > max_file_size:
47 |         # ignore this file since its size is greater than max_file_size
48 |         return
49 | 
50 |     file_iterable = resp.iter_content(chunk_size = chunk_size)
51 | 
52 |     tqdm_iter = tqdm(iterable = file_iterable, total = total_chunks,
53 |                      unit = 'KB', position = pos, desc = blue_color + file_name, leave = False)
54 | 
55 |     with open(file_address, 'wb') as f:
56 |         for data in tqdm_iter:
57 |             f.write(data)
58 | 
59 |     if mode == 'p':
60 |         main_iter.update(1)
61 | 
62 | 
63 | def download_parallel(urls, directory, min_file_size, max_file_size, no_redirects):
64 |     global main_iter
65 | 
66 |     # create directory to save files
67 |     if not os.path.exists(directory):
68 |         os.makedirs(directory)
69 | 
70 |     # overall progress bar
71 |     main_iter = trange(len(urls), position = 1, desc = yellow_color + "Overall")
72 | 
73 |     # empty list to store threads
74 |     threads = []
75 | 
76 |     # creating threads
77 |     for idx, url in enumerate(urls):
78 |         t = threading.Thread(
79 |             target = download,
80 |             kwargs = {
81 |                 'url': url,
82 |                 'directory': directory,
83 |                 'pos': 2*idx+3,
84 |                 'mode': 'p',
85 |                 'min_file_size': min_file_size,
86 |                 'max_file_size': max_file_size,
87 |                 'no_redirects': no_redirects
88 |             }
89 |         )
90 |         threads.append(t)
91 | 
92 |     # start all threads
93 |     for t in threads:
94 | t.start() 95 | 96 | # wait until all threads terminate 97 | for t in threads[::-1]: 98 | t.join() 99 | 100 | main_iter.close() 101 | 102 | print("\n\nDownload complete.") 103 | 104 | 105 | def download_series(urls, directory, min_file_size, max_file_size, no_redirects): 106 | 107 | # create directory to save files 108 | if not os.path.exists(directory): 109 | os.makedirs(directory) 110 | 111 | # download files one by one 112 | for url in urls: 113 | download(url, directory, min_file_size, max_file_size, no_redirects) 114 | 115 | print("Download complete.") 116 | -------------------------------------------------------------------------------- /ctdl/gui.py: -------------------------------------------------------------------------------- 1 | import os 2 | import threading 3 | import requests 4 | from requests.packages.urllib3.util.retry import Retry 5 | from requests.adapters import HTTPAdapter 6 | from tqdm import tqdm, trange 7 | 8 | try: 9 | from Tkinter import * 10 | except : 11 | from tkinter import * 12 | 13 | try: 14 | import ttk 15 | except : 16 | from tkinter import ttk 17 | 18 | try: 19 | from tkinter.filedialog import * 20 | from tkinter.messagebox import * 21 | except : 22 | from tkFileDialog import * 23 | from tkMessageBox import * 24 | 25 | from .gui_downloader import * 26 | from .ctdl import * 27 | from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS 28 | 29 | cur_dir = os.getcwd() 30 | 31 | # icon and title 32 | root = Tk() 33 | root.wm_title("ctdl") 34 | try: 35 | img = PhotoImage(file = "icon.png") 36 | root.tk.call('wm', 'iconphoto', root._w, img) 37 | except: 38 | pass 39 | 40 | row = Frame() 41 | links = [] 42 | 43 | default_text = {'file_type' : 'pdf', 'query' : 'python', 44 | 'min_file_size' : 0, 'max_file_size' : -1, 'limit' : 10} 45 | 46 | fields = 'Search query', 'Min Allowed File Size', 'Max Allowed File Size', 'Download Directory', 'Limit' 47 | 48 | args = { 'parallel' : False, 'file_type' : 'pdf', 'threats' : False, 49 | 
'no_redirects' : False, 'available' : False, 'query' : 'python', 50 | 'min_file_size' : 0, 'max_file_size' : -1, 'directory' : None, 'limit' : 10} 51 | 52 | 53 | def search_function(root1, q, s, f, l, o='g'): 54 | """ 55 | function to get links 56 | """ 57 | global links 58 | links = search(q, o, s, f, l) 59 | root1.destroy() 60 | root1.quit() 61 | 62 | 63 | def task(ft): 64 | """ 65 | to create loading progress bar 66 | """ 67 | ft.pack(expand = True, fill = BOTH, side = TOP) 68 | pb_hD = ttk.Progressbar(ft, orient = 'horizontal', mode = 'indeterminate') 69 | pb_hD.pack(expand = True, fill = BOTH, side = TOP) 70 | pb_hD.start(50) 71 | ft.mainloop() 72 | 73 | 74 | def download_content_gui(**args): 75 | """ 76 | function to fetch links and download them 77 | """ 78 | global row 79 | 80 | if not args ['directory']: 81 | args ['directory'] = args ['query'].replace(' ', '-') 82 | 83 | root1 = Frame(root) 84 | t1 = threading.Thread(target = search_function, args = (root1, 85 | args['query'], args['website'], args['file_type'], args['limit'],args['option'])) 86 | t1.start() 87 | task(root1) 88 | t1.join() 89 | 90 | #new frame for progress bar 91 | row = Frame(root) 92 | row.pack() 93 | if args['parallel']: 94 | download_parallel_gui(row, links, args['directory'], args['min_file_size'], 95 | args['max_file_size'], args['no_redirects']) 96 | else: 97 | download_series_gui(row, links, args['directory'], args['min_file_size'], 98 | args['max_file_size'], args['no_redirects']) 99 | 100 | 101 | class makeform: 102 | """ 103 | to makre the main form of gui 104 | """ 105 | global args 106 | def __init__(self, root): 107 | 108 | 109 | # label search query 110 | self.row0 = Frame(root) 111 | self.lab0 = Label(self.row0, width = 25, text = fields [0], anchor = 'w') 112 | self.entry_query = Entry(self.row0) 113 | self.entry_query.insert(0, 'python') 114 | self.entry_query.bind('', self.on_entry_click) 115 | self.entry_query.bind('', lambda event, 116 | a = "query" : 
self.on_focusout(event, a)) 117 | 118 | self.entry_query.config(fg = 'grey') 119 | self.row0.pack(side = TOP, fill = X, padx = 5, pady = 5) 120 | self.lab0.pack(side = LEFT) 121 | self.entry_query.pack(side = RIGHT, expand = YES, fill = X) 122 | 123 | 124 | 125 | # label min_file_size 126 | self.row1 = Frame(root) 127 | self.lab1 = Label(self.row1, width = 25, text = fields [1], anchor = 'w') 128 | self.entry_min = Entry(self.row1) 129 | self.entry_min.insert(0, '0') 130 | self.entry_min.bind('', self.on_entry_click) 131 | self.entry_min.bind('', lambda event, 132 | a = "min_file_size": self.on_focusout( event, a)) 133 | 134 | self.entry_min.config(fg = 'grey') 135 | self.row1.pack(side = TOP, fill = X, padx = 5, pady = 5) 136 | self.lab1.pack(side = LEFT) 137 | self.entry_min.pack(side = RIGHT, expand = YES, fill = X) 138 | 139 | 140 | # label max_file_size 141 | self.row2 = Frame(root) 142 | self.lab2 = Label(self.row2, width = 25, text = fields [2], anchor = 'w') 143 | self.entry_max = Entry(self.row2) 144 | self.entry_max.insert(0, '-1') 145 | self.entry_max.bind('', self.on_entry_click) 146 | self.entry_max.bind('', lambda event, 147 | a = "max_file_size": self.on_focusout(event, a)) 148 | 149 | self.entry_max.config(fg = 'grey') 150 | self.row2.pack(side = TOP, fill = X, padx = 5, pady = 5) 151 | self.lab2.pack(side = LEFT) 152 | self.entry_max.pack(side = RIGHT, expand = YES, fill = X) 153 | 154 | 155 | # label choose directory 156 | self.dir_text = StringVar() 157 | self.dir_text.set('Choose Directory') 158 | self.row3 = Frame(root) 159 | self.lab3 = Label(self.row3, width = 25, text = fields [3], anchor = 'w') 160 | self.entry_dir = Button(self.row3, textvariable = self.dir_text, command = self.ask_dir) 161 | self.row3.pack(side = TOP, fill = X, padx = 5, pady = 5) 162 | self.lab3.pack(side = LEFT ) 163 | self.entry_dir.pack(side = RIGHT, expand = YES, fill = X) 164 | self.dir_opt = options = {} 165 | options ['mustexist'] = False 166 | options ['parent'] 
= root 167 | options ['title'] = 'Choose Directory' 168 | 169 | 170 | # label download limit 171 | self.row4 = Frame(root) 172 | self.lab4 = Label(self.row4, width = 25, text = fields[4], anchor = 'w') 173 | self.entry_limit = Entry(self.row4) 174 | self.entry_limit.insert(0, '10') 175 | self.entry_limit.bind('', self.on_entry_click) 176 | self.entry_limit.bind('', lambda event, 177 | a = "limit" : self.on_focusout(event, a)) 178 | self.entry_limit.config(fg = 'grey') 179 | self.row4.pack(side = TOP, fill = X, padx = 5, pady = 5) 180 | self.lab4.pack(side = LEFT) 181 | self.entry_limit.pack(side = RIGHT, expand = YES, fill = X) 182 | 183 | # specify website 184 | self.row8 = Frame(root) 185 | self.lab8 = Label(self.row8, width = 25, text = "Specify Website", anchor = 'w') 186 | self.entry_website = Entry(self.row8) 187 | self.row8.pack(side = TOP, fill = X, padx = 5, pady = 5) 188 | self.lab8.pack(side = LEFT) 189 | self.entry_website.pack(side = RIGHT, expand = YES, fill = X) 190 | 191 | self.row9 = Frame(root) 192 | self.engine = StringVar() 193 | self.engine.set("g") 194 | Radiobutton(self.row9, text="Google", variable=self.engine, value="g").pack(anchor=W) 195 | Radiobutton(self.row9, text="DuckDuckGo", variable=self.engine, value="d").pack(anchor=W) 196 | self.row9.pack(side = TOP, fill = X, padx = 5, pady = 5) 197 | 198 | # all entries for dropdown menu 199 | self.choiceVar = StringVar() 200 | self.choices = [] 201 | for val in THREAT_EXTENSIONS.values(): 202 | if type(val) == list: 203 | for el in val: 204 | self.choices.append(el) 205 | else: 206 | self.choices.append(val) 207 | 208 | for val in FILE_EXTENSIONS.values(): 209 | if type(val) == list: 210 | for el in val: 211 | self.choices.append(el) 212 | else: 213 | self.choices.append(val) 214 | self.choiceVar.set('pdf') 215 | 216 | 217 | # dropdown box 218 | self.row5 = Frame(root) 219 | self.lab = Label(self.row5, width = 25, 220 | text = "File Type", anchor = 'w') 221 | self.optionmenu = 
ttk.Combobox(self.row5, 222 | textvariable = self.choiceVar, values = self.choices) 223 | 224 | self.row5.pack(side = TOP, fill = X, padx = 5, pady = 5) 225 | self.lab.pack(side = LEFT) 226 | self.optionmenu.pack(side = RIGHT, expand = YES, fill = X) 227 | 228 | 229 | # toggle box for parallel downloading 230 | # and toggle redirects 231 | self.row6 = Frame(root) 232 | self.p = BooleanVar() 233 | Checkbutton(self.row6, text = "parallel downloading", 234 | variable = self.p).pack(side = LEFT) 235 | 236 | self.t = BooleanVar() 237 | 238 | Checkbutton(self.row6, text = "toggle redirects", 239 | variable = self.t).pack(side = LEFT) 240 | 241 | self.row6.pack(side = TOP, fill = X, padx = 5, pady = 5) 242 | 243 | 244 | # download button 245 | self.row7 = Frame(root) 246 | self.search_button = Button(self.row7, width = 15, text = "Download", anchor = 'w') 247 | self.search_button.bind('<Button-1>', self.click_download) 248 | self.row7.pack(side = TOP, fill = X, padx = 5, pady = 5) 249 | self.search_button.pack(side = LEFT) 250 | 251 | 252 | # clear button 253 | self.clear = Button(self.row7, width = 15, text = "Clear / Cancel", anchor = 'w') 254 | self.clear.bind('<Button-1>', self.clear_fun) 255 | self.clear.pack(side = RIGHT) 256 | 257 | 258 | def click_download(self, event): 259 | """ 260 | event for download button 261 | """ 262 | args ['parallel'] = self.p.get() 263 | args ['file_type'] = self.optionmenu.get() 264 | args ['no_redirects'] = self.t.get() 265 | args ['query'] = self.entry_query.get() 266 | args ['min_file_size'] = int(self.entry_min.get()) 267 | args ['max_file_size'] = int(self.entry_max.get()) 268 | args ['limit'] = int(self.entry_limit.get()) 269 | args ['website'] = self.entry_website.get() 270 | args ['option'] = self.engine.get() 271 | print(args) 272 | if self.check_threat():  # only proceed if user accepted the threat warning 273 | download_content_gui(**args) 274 | 275 | 276 | def on_entry_click(self, event): 277 | """ 278 | function that gets called whenever entry is clicked 279 | """ 280 | if
event.widget.config('fg') [4] == 'grey': 281 | event.widget.delete(0, "end") # delete all the text in the entry 282 | event.widget.insert(0, '') # insert blank for user input 283 | event.widget.config(fg = 'black') 284 | 285 | 286 | def on_focusout(self, event, a): 287 | """ 288 | function that gets called when the entry loses focus 289 | """ 290 | if event.widget.get() == '': 291 | event.widget.insert(0, default_text[a]) 292 | event.widget.config(fg = 'grey') 293 | 294 | 295 | def check_threat(self): 296 | """ 297 | function to check input filetype against threat extensions list 298 | """ 299 | is_high_threat = False 300 | for val in THREAT_EXTENSIONS.values(): 301 | if type(val) == list: 302 | for el in val: 303 | if self.optionmenu.get() == el: 304 | is_high_threat = True 305 | break 306 | else: 307 | if self.optionmenu.get() == val: 308 | is_high_threat = True 309 | break 310 | 311 | if is_high_threat: 312 | is_high_threat = not askokcancel('FILE TYPE', 'WARNING: Downloading this ' 313 | 'file type may expose you to a heightened security risk.\nPress ' 314 | '"OK" to proceed or "CANCEL" to exit') 315 | return not is_high_threat 316 | 317 | def ask_dir(self): 318 | """ 319 | dialogue box for choosing directory 320 | """ 321 | args ['directory'] = askdirectory(**self.dir_opt) 322 | self.dir_text.set(args ['directory']) 323 | 324 | 325 | def clear_fun(self, event): 326 | global row 327 | row.destroy() 328 | 329 | 330 | def main(): 331 | """ 332 | main function 333 | """ 334 | s = ttk.Style() 335 | s.theme_use('clam') 336 | ents = makeform(root) 337 | root.mainloop() 338 | 339 | 340 | if __name__ == "__main__": 341 | main() -------------------------------------------------------------------------------- /ctdl/gui_downloader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import threading 3 | import requests 4 | from requests.packages.urllib3.util.retry import Retry 5 | from requests.adapters
import HTTPAdapter 6 | 7 | try: 8 | from Tkinter import * 9 | except ImportError: 10 | from tkinter import * 11 | 12 | try: 13 | import ttk 14 | except ImportError: 15 | from tkinter import ttk 16 | 17 | chunk_size = 1024 18 | parallel = False 19 | exit_flag = 0 20 | file_name = [] 21 | total_chunks = [] 22 | i_max = [] 23 | 24 | s = requests.Session() 25 | # Max retries and back-off strategy so all requests to http:// sleep before retrying 26 | retries = Retry(total = 5, 27 | backoff_factor = 0.1, 28 | status_forcelist = [500, 502, 503, 504]) 29 | s.mount('http://', HTTPAdapter(max_retries = retries)) 30 | 31 | 32 | def download(urls, directory, idx, min_file_size = 0, max_file_size = -1, 33 | no_redirects = False, pos = 0, mode = 's'): 34 | """ 35 | download function for serial download 36 | """ 37 | global main_it 38 | global exit_flag 39 | global total_chunks 40 | global file_name 41 | global i_max 42 | 43 | # loop in single thread to serialize downloads 44 | for url in urls: 45 | file_name[idx] = url.split('/')[-1] 46 | file_address = directory + '/' + file_name[idx] 47 | is_redirects = not no_redirects 48 | 49 | resp = s.get(url, stream = True, allow_redirects = is_redirects) 50 | if not resp.status_code == 200: 51 | # ignore this file since server returns invalid response 52 | continue 53 | try: 54 | total_size = int(resp.headers['content-length']) 55 | except KeyError: 56 | total_size = len(resp.content) 57 | 58 | total_chunks[idx] = total_size // chunk_size  # integer division on both Python 2 and 3 59 | if total_chunks[idx] < min_file_size: 60 | # ignore this file since file size is smaller than min_file_size 61 | continue 62 | elif max_file_size != -1 and total_chunks[idx] > max_file_size: 63 | # ignore this file since file size is greater than max_file_size 64 | continue 65 | 66 | file_iterable = resp.iter_content(chunk_size = chunk_size) 67 | with open(file_address, 'wb') as f: 68 | for sno, data in enumerate(file_iterable): 69 | i_max[idx] = sno + 1 70 | f.write(data) 71 | 72 | exit_flag += 1 73 | 74 | 75 | def
download_parallel(url, directory, idx, min_file_size = 0, max_file_size = -1, 76 | no_redirects = False, pos = 0, mode = 's'): 77 | """ 78 | download function to download in parallel 79 | """ 80 | global main_it 81 | global exit_flag 82 | global total_chunks 83 | global file_name 84 | global i_max 85 | 86 | file_name[idx] = url.split('/')[-1] 87 | file_address = directory + '/' + file_name[idx] 88 | is_redirects = not no_redirects 89 | 90 | resp = s.get(url, stream = True, allow_redirects = is_redirects) 91 | if not resp.status_code == 200: 92 | # ignore this file since server returns invalid response 93 | exit_flag += 1 94 | return 95 | try: 96 | total_size = int(resp.headers['content-length']) 97 | except KeyError: 98 | total_size = len(resp.content) 99 | 100 | total_chunks[idx] = total_size // chunk_size  # integer division on both Python 2 and 3 101 | if total_chunks[idx] < min_file_size: 102 | # ignore this file since file size is smaller than min_file_size 103 | exit_flag += 1 104 | return 105 | elif max_file_size != -1 and total_chunks[idx] > max_file_size: 106 | # ignore this file since file size is greater than max_file_size 107 | exit_flag += 1 108 | return 109 | 110 | file_iterable = resp.iter_content(chunk_size = chunk_size) 111 | with open(file_address, 'wb') as f: 112 | for sno, data in enumerate(file_iterable): 113 | i_max[idx] = sno + 1 114 | f.write(data) 115 | 116 | exit_flag += 1 117 | 118 | 119 | 120 | class myThread(threading.Thread): 121 | """ 122 | custom thread to run download thread 123 | """ 124 | def __init__(self, url, directory, idx, min_file_size, max_file_size, no_redirects): 125 | threading.Thread.__init__(self) 126 | self.idx = idx 127 | self.url = url 128 | self.directory = directory 129 | self.min_file_size = min_file_size 130 | self.max_file_size = max_file_size 131 | self.no_redirects = no_redirects 132 | 133 | 134 | def run(self): 135 | """ 136 | function called when thread is started 137 | """ 138 | global parallel 139 | 140 | if parallel: 141 | download_parallel(self.url,
self.directory, self.idx, 142 | self.min_file_size, self.max_file_size, self.no_redirects) 143 | else: 144 | download(self.url, self.directory, self.idx, 145 | self.min_file_size, self.max_file_size, self.no_redirects) 146 | 147 | 148 | class progress_class(): 149 | """ 150 | custom class for progress bar 151 | """ 152 | def __init__(self, frame, url, directory, min_file_size, max_file_size, no_redirects): 153 | global i_max 154 | global file_name 155 | global parallel 156 | 157 | self.url = url 158 | self.directory = directory 159 | self.min_file_size = min_file_size 160 | self.max_file_size = max_file_size 161 | self.no_redirects = no_redirects 162 | self.frame = frame 163 | 164 | self.progress = [] 165 | self.str = [] 166 | self.label = [] 167 | self.bytes = [] 168 | self.maxbytes = [] 169 | self.thread = [] 170 | 171 | if parallel: 172 | self.length = len(self.url) 173 | else: 174 | # to serialize just make a single thread 175 | self.length = 1 176 | 177 | # for parallel downloading 178 | for self.i in range(0, self.length): 179 | file_name.append("") 180 | i_max.append(0) 181 | total_chunks.append(0) 182 | 183 | # initialize progressbar 184 | self.progress.append(ttk.Progressbar(frame, orient="horizontal", 185 | length=300, mode="determinate")) 186 | self.progress[self.i].pack() 187 | self.str.append(StringVar()) 188 | self.label.append(Label(frame, textvariable=self.str[self.i], width=40)) 189 | self.label[self.i].pack() 190 | self.progress[self.i]["value"] = 0 191 | self.bytes.append(0) 192 | self.maxbytes.append(0) 193 | 194 | # start thread 195 | self.start() 196 | 197 | 198 | def start(self): 199 | """ 200 | function to initialize thread for downloading 201 | """ 202 | global parallel 203 | for self.i in range(0, self.length): 204 | if parallel: 205 | self.thread.append(myThread(self.url[self.i], self.directory, self.i, 206 | self.min_file_size, self.max_file_size, self.no_redirects)) 207 | else: 208 | # if not parallel whole url list is passed 209 |
self.thread.append(myThread(self.url, self.directory, self.i, self.min_file_size, 210 | self.max_file_size, self.no_redirects)) 211 | self.progress[self.i]["value"] = 0 212 | self.bytes[self.i] = 0 213 | self.thread[self.i].start() 214 | 215 | self.read_bytes() 216 | 217 | 218 | def read_bytes(self): 219 | """ 220 | reading bytes; update progress bar every 10 ms 221 | """ 222 | global exit_flag 223 | 224 | for self.i in range(0, self.length): 225 | self.bytes[self.i] = i_max[self.i] 226 | self.maxbytes[self.i] = total_chunks[self.i] 227 | self.progress[self.i]["maximum"] = total_chunks[self.i] 228 | self.progress[self.i]["value"] = self.bytes[self.i] 229 | self.str[self.i].set(file_name[self.i] + " " + str(self.bytes[self.i]) 230 | + " KB / " + str(int(self.maxbytes[self.i] + 1)) + " KB") 231 | 232 | if exit_flag == self.length: 233 | exit_flag = 0 234 | self.frame.destroy() 235 | else: 236 | self.frame.after(10, self.read_bytes) 237 | 238 | 239 | def download_parallel_gui(root, urls, directory, min_file_size, max_file_size, no_redirects): 240 | """ 241 | called when parallel downloading is true 242 | """ 243 | global parallel 244 | 245 | # create directory to save files 246 | if not os.path.exists(directory): 247 | os.makedirs(directory) 248 | parallel = True 249 | app = progress_class(root, urls, directory, min_file_size, max_file_size, no_redirects) 250 | 251 | 252 | 253 | 254 | def download_series_gui(frame, urls, directory, min_file_size, max_file_size, no_redirects): 255 | """ 256 | called when user wants serial downloading 257 | """ 258 | 259 | # create directory to save files 260 | if not os.path.exists(directory): 261 | os.makedirs(directory) 262 | app = progress_class(frame, urls, directory, min_file_size, max_file_size, no_redirects) 263 | 264 | 265 | -------------------------------------------------------------------------------- /ctdl/utils.py: -------------------------------------------------------------------------------- 1 | DEFAULTS = { 'file_type':
'pdf', 2 | 'limit': 10, 3 | 'directory': None, 4 | 'parallel': False, 5 | 'min_file_size': 0, 6 | 'max_file_size': -1, 7 | 'no_redirects': False 8 | } 9 | 10 | 11 | FILE_EXTENSIONS = { 'Adobe Flash': 'swf', 12 | 'Adobe Portable Document Format': 'pdf', 13 | 'Adobe PostScript': 'ps', 14 | 'Autodesk Design Web Format': 'dwf', 15 | 'Google Earth': 'kml', 16 | 'XML': 'xml', 17 | 'Microsoft PowerPoint': 'ppt', 18 | 'Microsoft Excel': 'xls', 19 | 'Microsoft Word': 'doc', 20 | 'GPS eXchange Format': 'gpx', 21 | 'Hancom Hanword': 'hwp', 22 | 'HTML': 'html', 23 | 'OpenOffice presentation': 'odp', 24 | 'OpenOffice spreadsheet': 'ods', 25 | 'OpenOffice text': 'odt', 26 | 'Rich Text Format': 'rtf', 27 | 'Scalable Vector Graphics': 'svg', 28 | 'TeX/LaTeX': 'tex', 29 | 'Text': 'txt', 30 | 'Basic source code': 'bas', 31 | 'C source code': 'c', 32 | 'C++ source code': 'cpp', 33 | 'C# source code': 'cs', 34 | 'Java source code': 'java', 35 | 'Perl source code': 'pl', 36 | 'Python source code': 'py', 37 | 'Wireless Markup Language': 'wml'} 38 | 39 | THREAT_EXTENSIONS = { 40 | 'Executable files': ['exe','com'], 41 | 'Program information file': 'pif', 42 | 'Screensaver file': 'scr', 43 | 'Visual Basic script': 'vbs', 44 | 'Shell scrap file': 'shs', 45 | 'Microsoft Compiled HTML Help': 'chm', 46 | 'Batch file': 'bat' 47 | } -------------------------------------------------------------------------------- /icon.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coding-blocks/content-downloader/36d37122e22cc4155dd82d629f0f0c9bac638bf7/icon.gif -------------------------------------------------------------------------------- /icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coding-blocks/content-downloader/36d37122e22cc4155dd82d629f0f0c9bac638bf7/icon.png --------------------------------------------------------------------------------
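The extension tables in `ctdl/utils.py` mix single extensions (`'bat'`) with lists (`['exe','com']`), and both `gui.py`'s dropdown setup and `check_threat` flatten them with near-identical inline loops. A minimal sketch of a reusable helper for that pattern — the `flatten_extensions` name is hypothetical, not part of the package:

```python
def flatten_extensions(mapping):
    # Flatten {description: extension-or-list-of-extensions} into one
    # flat list, mirroring the loops in gui.py over THREAT_EXTENSIONS
    # and FILE_EXTENSIONS.
    exts = []
    for val in mapping.values():
        if isinstance(val, list):
            exts.extend(val)   # e.g. 'Executable files': ['exe', 'com']
        else:
            exts.append(val)   # e.g. 'Batch file': 'bat'
    return exts
```

With such a helper, the dropdown choices become `flatten_extensions(THREAT_EXTENSIONS) + flatten_extensions(FILE_EXTENSIONS)`, and the threat check reduces to a membership test.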
/requirements.txt: -------------------------------------------------------------------------------- 1 | requests>=2.5.0 2 | lxml 3 | bs4 4 | tqdm 5 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from setuptools import setup 3 | 4 | def readme(): 5 | try: 6 | with open('README.rst') as f: 7 | return f.read() 8 | except IOError: 9 | pass 10 | 11 | setup(name = 'ctdl', 12 | version = '1.5.0', 13 | classifiers = [ 14 | 'Development Status :: 4 - Beta', 15 | 'License :: OSI Approved :: MIT License', 16 | 'Programming Language :: Python', 17 | 'Programming Language :: Python :: 2', 18 | 'Programming Language :: Python :: 2.6', 19 | 'Programming Language :: Python :: 2.7', 20 | 'Programming Language :: Python :: 3', 21 | 'Programming Language :: Python :: 3.3', 22 | 'Programming Language :: Python :: 3.4', 23 | 'Programming Language :: Python :: 3.5', 24 | 'Topic :: Internet', 25 | ], 26 | keywords = 'content downloader bulk files', 27 | description = 'Bulk file downloader on any topic.', 28 | long_description = readme(), 29 | url = 'https://github.com/nikhilkumarsingh/content-downloader', 30 | author = 'Nikhil Kumar Singh', 31 | author_email = 'nikhilksingh97@gmail.com', 32 | license = 'MIT', 33 | packages = ['ctdl'], 34 | install_requires = ['requests', 'bs4', 'lxml', 'tqdm'], 35 | dependency_links = ['git+https://github.com/nikhilkumarsingh/tqdm'], 36 | include_package_data = True, 37 | entry_points={ 38 | 'console_scripts': [ 39 | 'ctdl = ctdl.ctdl:main', 40 | 'ctdl-gui = ctdl.gui:main', 41 | ], 42 | }, 43 | package_data={'': ['icon.png']}, 44 | zip_safe = False) 45 | --------------------------------------------------------------------------------
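The module-level session in `ctdl/gui_downloader.py` mounts a retrying `HTTPAdapter` so transient server errors back off and retry instead of failing the download outright. The same pattern can be sketched as a standalone factory; this is a hedged sketch, not the package's API — `make_retry_session` is a hypothetical name, and mounting `https://` is an addition (the original mounts `http://` only):

```python
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

def make_retry_session(total=5, backoff_factor=0.1):
    # Retry transient 5xx responses up to `total` times, sleeping
    # backoff_factor * (2 ** (retry - 1)) seconds between attempts,
    # mirroring the Retry config in ctdl/gui_downloader.py.
    retries = Retry(total=total,
                    backoff_factor=backoff_factor,
                    status_forcelist=[500, 502, 503, 504])
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('http://', adapter)
    session.mount('https://', adapter)  # addition: original covers http:// only
    return session
```

Any `session.get(...)` made through the returned session then inherits the retry policy, which is why the downloader funnels every request through the shared `s` object.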