├── .gitignore
├── LICENSE.txt
├── MANIFEST.in
├── README.md
├── README.rst
├── ctdl
│   ├── __init__.py
│   ├── ctdl.py
│   ├── downloader.py
│   ├── gui.py
│   ├── gui_downloader.py
│   └── utils.py
├── icon.gif
├── icon.png
├── requirements.txt
└── setup.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | *.pyc
3 | /dist/
4 | /*.egg
5 | /*.egg-info
6 | .idea/
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Nikhil Kumar Singh
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
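Before the project files, one orientation note: ctdl's core mechanic (implemented in `ctdl/ctdl.py` further down in this dump) is turning a topic and a file type into a search-engine query string using a `filetype:` filter, optionally restricted with `site:`. A minimal, self-contained sketch of that construction — the helper name `build_search_query` is mine, not the project's; the format strings mirror those in `search()`:

```python
def build_search_query(query, file_type='pdf', site=''):
    # Mirrors the format strings used in ctdl/ctdl.py's search():
    # an optional site: restriction plus a filetype: filter.
    if site == '':
        return "filetype:{0} {1}".format(file_type, query)
    return "site:{0} filetype:{1} {2}".format(site, file_type, query)

print(build_search_query('python tutorial'))
# -> filetype:pdf python tutorial
print(build_search_query('health', file_type='ppt', site='who.int'))
# -> site:who.int filetype:ppt health
```

The resulting string is sent as the `q` parameter of a Google or DuckDuckGo search request, and the result links are then scraped and validated.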
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.rst
2 | include icon.png
3 | include icon.gif
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![PyPI](https://img.shields.io/badge/PyPi-v1.5-f39f37.svg)](https://pypi.python.org/pypi/ctdl)
2 | [![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/nikhilkumarsingh/content-downloader/blob/master/LICENSE.txt)
3 | 
4 | # content-downloader
5 | 
6 | **content-downloader**, a.k.a. **ctdl**, is a Python package with a **command line utility** and a **desktop GUI** to download files on any topic in bulk!
7 | 
8 | ![](https://media.giphy.com/media/3oKIPlt7APHqWuVl3q/giphy.gif)
9 | 
10 | ![](https://media.giphy.com/media/xUPGcIvGpH3KvEmlnG/giphy.gif)
11 | 
12 | ## Features
13 | 
14 | - ctdl can be used as a command line utility as well as a desktop GUI.
15 | 
16 | - ctdl fetches file links related to a search query from **Google Search**.
17 | 
18 | - Files can be downloaded in parallel using multithreading.
19 | 
20 | - ctdl is compatible with both Python 2 and Python 3.
21 | 
22 | ## Installation
23 | 
24 | - To install content-downloader, simply run:
25 | 
26 | ```
27 | $ pip install ctdl
28 | ```
29 | 
30 | - There seem to be some issues with parallel progress bars in tqdm which have
31 |   been resolved in this [pull request](https://github.com/tqdm/tqdm/pull/385).
Until this pull is merged, please use my patch by running this command: 32 | 33 | ``` 34 | $ pip install -U git+https://github.com/nikhilkumarsingh/tqdm 35 | ``` 36 | 37 | ## Desktop GUI usage 38 | 39 | To use **ctdl** desktop GUI, open terminal and run this command: 40 | 41 | ``` 42 | $ ctdl-gui 43 | ``` 44 | 45 | ## Command line usage 46 | 47 | ``` 48 | $ ctdl [-h] [-f FILE_TYPE] [-l LIMIT] [-d DIRECTORY] [-p] [-a] [-t] 49 | [-minfs MIN_FILE_SIZE] [-maxfs MAX_FILE_SIZE] [-nr] 50 | [query] 51 | ``` 52 | Optional arguments are: 53 | 54 | - -f FILE_TYPE : set the file type. (can take values like ppt, pdf, xml, etc.) 55 | 56 | Default value: pdf 57 | 58 | - -l LIMIT : specify the number of files to download. 59 | 60 | Default value: 10 61 | 62 | - -d DIRECTORY : specify the directory where files will be stored. 63 | 64 | Default: A directory with same name as the search query in the current directory. 65 | 66 | - -p : for parallel downloading. 67 | 68 | - -minfs MIN_FILE_SIZE : specify minimum file size to download in Kilobytes (KB). 69 | 70 | Default: 0 71 | 72 | - -maxfs MAX_FILE_SIZE : specify maximum file size to download in Kilobytes (KB). 73 | 74 | Default: -1 (represents no maximum file size) 75 | 76 | - -nr : prevent download redirects. 77 | 78 | Default: False 79 | 80 | ## Examples 81 | 82 | - To get list of available filetypes: 83 | 84 | ``` 85 | $ ctdl -a 86 | ``` 87 | 88 | - To get list of potential high threat filetypes: 89 | 90 | ``` 91 | $ ctdl -t 92 | ``` 93 | 94 | - To download pdf files on topic 'python': 95 | 96 | ``` 97 | $ ctdl python 98 | ``` 99 | This is the default behaviour which will download 10 pdf files in a folder named 'python' in current directory. 
100 | 
101 | - To download 3 ppt files on 'health':
102 | 
103 | ```
104 | $ ctdl -f ppt -l 3 health
105 | ```
106 | 
107 | - To explicitly specify the download folder:
108 | 
109 | ```
110 | $ ctdl -d /home/nikhil/Desktop/ml-pdfs machine-learning
111 | ```
112 | 
113 | - To download files in parallel:
114 | ```
115 | $ ctdl -f pdf -p python
116 | ```
117 | 
118 | - To search for and download, in parallel, 10 files in PDF format
119 |   matching the query "python algorithm", without allowing any URL
120 |   redirects, and with the file size restricted to between
121 |   10,000 KB (10 MB) and 100,000 KB (100 MB):
122 | ```
123 | $ ctdl -f pdf -l 10 -minfs 10000 -maxfs 100000 -nr -p "python algorithm"
124 | ```
125 | 
126 | ## Usage in Python files
127 | 
128 | ```python
129 | from ctdl import ctdl
130 | 
131 | ctdl.download_content(
132 |     file_type = 'ppt',
133 |     limit = 5,
134 |     directory = '/home/nikhil/Desktop/ml-pdfs',
135 |     query = 'machine learning using python')
136 | ```
137 | 
138 | ## TODO
139 | 
140 | - [X] Prompt user before downloading potentially threatening files
141 | 
142 | - [X] Create ctdl GUI
143 | 
144 | - [ ] Implement unit testing
145 | 
146 | - [ ] Use the DuckDuckGo API as an option
147 | 
148 | ## Want to contribute?
149 | 
150 | - Clone the repository
151 | 
152 | ```
153 | $ git clone http://github.com/nikhilkumarsingh/content-downloader
154 | ```
155 | 
156 | - Install dependencies
157 | ```
158 | $ pip install -r requirements.txt
159 | ```
160 | 
161 | **Note:** There seem to be some issues with the current version of tqdm.
If you do not get
162 | the expected progress bar behaviour, try this patch:
163 | 
164 | ```
165 | $ pip uninstall tqdm
166 | $ pip install git+https://github.com/nikhilkumarsingh/tqdm
167 | ```
168 | 
169 | - In ctdl/ctdl.py, remove the `.` prefix from `.downloader` and `.utils` for
170 |   the following imports, so it changes from:
171 | ```python
172 | from .downloader import download_series, download_parallel
173 | from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
174 | ```
175 | to:
176 | ```python
177 | from downloader import download_series, download_parallel
178 | from utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
179 | ```
180 | 
181 | - Run the python file directly with `python ctdl/ctdl.py ___` (instead of `ctdl ___`)
182 | 
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | |PyPI| |license|
2 | 
3 | content-downloader
4 | ==================
5 | 
6 | **content-downloader**, a.k.a. **ctdl**, is a Python package with a **command
7 | line utility** and a **desktop GUI** to download files on any topic in
8 | bulk!
9 | 
10 | .. figure:: https://media.giphy.com/media/3oKIPlt7APHqWuVl3q/giphy.gif
11 |    :alt:
12 | 
13 | .. figure:: https://media.giphy.com/media/xUPGcIvGpH3KvEmlnG/giphy.gif
14 |    :alt:
15 | 
16 | Features
17 | --------
18 | 
19 | - ctdl can be used as a command line utility as well as a desktop GUI.
20 | 
21 | - ctdl fetches file links related to a search query from **Google
22 |   Search**.
23 | 
24 | - Files can be downloaded in parallel using multithreading.
25 | 
26 | - ctdl is compatible with both Python 2 and Python 3.
27 | 
28 | Installation
29 | ------------
30 | 
31 | - To install content-downloader, simply run:
32 | 
33 |   ``$ pip install ctdl``
34 | 
35 | - There seem to be some issues with parallel progress bars in tqdm
36 |   which have been resolved in this
37 |   `pull request <https://github.com/tqdm/tqdm/pull/385>`__.
Until this pull is 38 | merged, please use my patch by running this command: 39 | 40 | ``$ pip install -U git+https://github.com/nikhilkumarsingh/tqdm`` 41 | 42 | Desktop GUI usage 43 | ----------------- 44 | 45 | To use **ctdl** desktop GUI, open terminal and run this command: 46 | 47 | :: 48 | 49 | $ ctdl-gui 50 | 51 | Command line usage 52 | ------------------ 53 | 54 | :: 55 | 56 | $ ctdl [-h] [-f FILE_TYPE] [-l LIMIT] [-d DIRECTORY] [-p] [-a] [-t] 57 | [-minfs MIN_FILE_SIZE] [-maxfs MAX_FILE_SIZE] [-nr] 58 | [query] 59 | 60 | Optional arguments are: 61 | 62 | - -f FILE\_TYPE : set the file type. (can take values like ppt, pdf, 63 | xml, etc.) 64 | 65 | :: 66 | 67 | Default value: pdf 68 | 69 | - -l LIMIT : specify the number of files to download. 70 | 71 | :: 72 | 73 | Default value: 10 74 | 75 | - -d DIRECTORY : specify the directory where files will be stored. 76 | 77 | :: 78 | 79 | Default: A directory with same name as the search query in the current directory. 80 | 81 | - -p : for parallel downloading. 82 | 83 | - -minfs MIN\_FILE\_SIZE : specify minimum file size to download in 84 | Kilobytes (KB). 85 | 86 | :: 87 | 88 | Default: 0 89 | 90 | - -maxfs MAX\_FILE\_SIZE : specify maximum file size to download in 91 | Kilobytes (KB). 92 | 93 | :: 94 | 95 | Default: -1 (represents no maximum file size) 96 | 97 | - -nr : prevent download redirects. 98 | 99 | :: 100 | 101 | Default: False 102 | 103 | Examples 104 | -------- 105 | 106 | - To get list of available filetypes: 107 | 108 | ``$ ctdl -a`` 109 | 110 | - To get list of potential high threat filetypes: 111 | 112 | ``$ ctdl -t`` 113 | 114 | - To download pdf files on topic 'python': 115 | 116 | ``$ ctdl python`` This is the default behaviour which will download 10 117 | pdf files in a folder named 'python' in current directory. 
118 | 
119 | - To download 3 ppt files on 'health':
120 | 
121 |   ``$ ctdl -f ppt -l 3 health``
122 | 
123 | - To explicitly specify the download folder:
124 | 
125 |   ``$ ctdl -d /home/nikhil/Desktop/ml-pdfs machine-learning``
126 | 
127 | - To download files in parallel: ``$ ctdl -f pdf -p python``
128 | 
129 | - To search for and download, in parallel, 10 files in PDF format
130 |   matching the query "python algorithm", without allowing any
131 |   URL redirects, and with the file size restricted to between
132 |   10,000 KB (10 MB) and 100,000 KB (100 MB):
133 | 
134 |   ``$ ctdl -f pdf -l 10 -minfs 10000 -maxfs 100000 -nr -p "python algorithm"``
135 | 
136 | Usage in Python files
137 | ---------------------
138 | 
139 | .. code:: python
140 | 
141 |     from ctdl import ctdl
142 | 
143 |     ctdl.download_content(
144 |         file_type = 'ppt',
145 |         limit = 5,
146 |         directory = '/home/nikhil/Desktop/ml-pdfs',
147 |         query = 'machine learning using python')
148 | 
149 | TODO
150 | ----
151 | 
152 | - [X] Prompt user before downloading potentially threatening files
153 | 
154 | - [X] Create ctdl GUI
155 | 
156 | - [ ] Implement unit testing
157 | 
158 | - [ ] Use the DuckDuckGo API as an option
159 | 
160 | Want to contribute?
161 | -------------------
162 | 
163 | - Clone the repository
164 | 
165 |   ``$ git clone http://github.com/nikhilkumarsingh/content-downloader``
166 | 
167 | - Install dependencies: ``$ pip install -r requirements.txt``
168 | 
169 | **Note:** There seem to be some issues with the current version of tqdm.
If
170 | you do not get the expected progress bar behaviour, try this patch:
171 | 
172 | ::
173 | 
174 |     $ pip uninstall tqdm
175 |     $ pip install git+https://github.com/nikhilkumarsingh/tqdm
176 | 
177 | - In ctdl/ctdl.py, remove the ``.`` prefix from ``.downloader`` and
178 |   ``.utils`` for the following imports, so it changes from:
179 | 
180 |   .. code:: python
181 | 
182 |       from .downloader import download_series, download_parallel
183 |       from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
184 | 
185 |   to:
186 | 
187 |   .. code:: python
188 | 
189 |       from downloader import download_series, download_parallel
190 |       from utils import FILE_EXTENSIONS, THREAT_EXTENSIONS
191 | 
192 | - Run the python file directly with ``python ctdl/ctdl.py ___`` (instead of
193 |   ``ctdl ___``)
194 | 
195 | .. |PyPI| image:: https://img.shields.io/badge/PyPi-v1.5-f39f37.svg
196 |    :target: https://pypi.python.org/pypi/ctdl
197 | .. |license| image:: https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000
198 |    :target: https://github.com/nikhilkumarsingh/content-downloader/blob/master/LICENSE.txt
--------------------------------------------------------------------------------
/ctdl/__init__.py:
--------------------------------------------------------------------------------
1 | from .ctdl import *
--------------------------------------------------------------------------------
/ctdl/ctdl.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import argparse
3 | import requests
4 | import urllib
5 | try:
6 |     from urllib.request import urlopen
7 |     from urllib.error import HTTPError
8 | except ImportError:
9 |     from urllib2 import urlopen
10 |     from urllib2 import HTTPError
11 | from requests.packages.urllib3.util.retry import Retry
12 | from requests.adapters import HTTPAdapter
13 | from bs4 import BeautifulSoup
14 | from .downloader import download_series, download_parallel
15 | from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS, DEFAULTS
16 | 
17 | 
18 | s = requests.Session()
19 | # Max retries and back-off
strategy so all requests to http:// sleep before retrying
20 | retries = Retry(total=5,
21 |                 backoff_factor=0.1,
22 |                 status_forcelist=[500, 502, 503, 504])
23 | s.mount('http://', HTTPAdapter(max_retries=retries))
24 | 
25 | 
26 | def get_google_links(limit, params, headers):
27 |     """
28 |     function to fetch links equal to limit
29 | 
30 |     every Google search result page has a start index.
31 |     every page contains 10 search results.
32 |     """
33 |     links = []
34 |     for start_index in range(0, limit, 10):
35 |         params['start'] = start_index
36 |         resp = s.get("https://www.google.com/search", params = params, headers = headers)
37 |         page_links = scrape_links(resp.content, engine = 'g')
38 |         links.extend(page_links)
39 |     return links[:limit]
40 | 
41 | 
42 | def get_duckduckgo_links(limit, params, headers):
43 |     """
44 |     function to fetch links equal to limit
45 | 
46 |     duckduckgo pagination is not static, so there is a limit on
47 |     the maximum number of links that can be scraped
48 |     """
49 |     resp = s.get('https://duckduckgo.com/html', params = params, headers = headers)
50 |     links = scrape_links(resp.content, engine = 'd')
51 |     return links[:limit]
52 | 
53 | 
54 | def scrape_links(html, engine):
55 |     """
56 |     function to scrape file links from html response
57 |     """
58 |     soup = BeautifulSoup(html, 'lxml')
59 |     links = []
60 | 
61 |     if engine == 'd':
62 |         results = soup.findAll('a', {'class': 'result__a'})
63 |         for result in results:
64 |             link = result.get('href')[15:]
65 |             link = link.replace('/blob/', '/raw/')
66 |             links.append(link)
67 | 
68 |     elif engine == 'g':
69 |         results = soup.findAll('h3', {'class': 'r'})
70 |         for result in results:
71 |             link = result.a['href'][7:].split('&')[0]
72 |             link = link.replace('/blob/', '/raw/')
73 |             links.append(link)
74 | 
75 |     return links
76 | 
77 | 
78 | def get_url_nofollow(url):
79 |     """
80 |     function to get the return code of a url
81 | 
82 |     Credits:
http://blog.jasonantman.com/2013/06/python-script-to-check-a-list-of-urls-for-return-code-and-final-return-code-if-redirected/
86 |     """
87 |     try:
88 |         response = urlopen(url)
89 |         code = response.getcode()
90 |         return code
91 |     except HTTPError as e:
92 |         return e.code
93 |     except Exception:
94 |         return 0
95 | 
96 | 
97 | def validate_links(links):
98 |     """
99 |     function to validate urls based on http(s) prefix and return code
100 |     """
101 |     valid_links = []
102 |     for link in links:
103 |         if link.startswith("http://") or link.startswith("https://"):
104 |             valid_links.append(link)
105 | 
106 |     if not valid_links:
107 |         print("No files found.")
108 |         sys.exit(0)
109 | 
110 |     # checking valid urls for return code
111 |     urls = {}
112 |     for link in valid_links:
113 |         if 'github.com' in link and '/blob/' in link:
114 |             link = link.replace('/blob/', '/raw/')
115 |         urls[link] = {'code': get_url_nofollow(link)}
116 | 
117 | 
118 |     # printing valid urls with return code 200
119 |     available_urls = []
120 |     for url in urls:
121 |         print("code: %d\turl: %s" % (urls[url]['code'], url))
122 |         if urls[url]['code'] != 0:
123 |             available_urls.append(url)
124 | 
125 |     return available_urls
126 | 
127 | 
128 | def search(query, engine='g', site="", file_type = 'pdf', limit = 10):
129 |     """
130 |     main function to search for links and return valid ones
131 |     """
132 |     if site == "":
133 |         search_query = "filetype:{0} {1}".format(file_type, query)
134 |     else:
135 |         search_query = "site:{0} filetype:{1} {2}".format(site, file_type, query)
136 | 
137 |     headers = {
138 |         'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) \
139 |                        Gecko/20100101 Firefox/53.0'
140 |     }
141 |     if engine == "g":
142 |         params = {
143 |             'q': search_query,
144 |             'start': 0,
145 |         }
146 |         links = get_google_links(limit, params, headers)
147 | 
148 |     elif engine == "d":
149 |         params = {
150 |             'q': search_query,
151 |         }
152 |         links = get_duckduckgo_links(limit, params, headers)
153 |     else:
154 |         print("Wrong search engine selected!")
155 |
sys.exit()
156 | 
157 |     valid_links = validate_links(links)
158 |     return valid_links
159 | 
160 | 
161 | def check_threats(**args):
162 |     """
163 |     function to check input filetype against threat extensions list
164 |     """
165 |     is_high_threat = False
166 |     for val in THREAT_EXTENSIONS.values():
167 |         if type(val) == list:
168 |             for el in val:
169 |                 if args['file_type'] == el:
170 |                     is_high_threat = True
171 |                     break
172 |         else:
173 |             if args['file_type'] == val:
174 |                 is_high_threat = True
175 |                 break
176 |     return is_high_threat
177 | 
178 | 
179 | def validate_args(**args):
180 |     """
181 |     function to check that the input query is not None
182 |     and set missing arguments to their default values
183 |     """
184 |     if not args['query']:
185 |         print("\nMissing required query argument.")
186 |         sys.exit()
187 | 
188 |     for key in DEFAULTS:
189 |         if key not in args:
190 |             args[key] = DEFAULTS[key]
191 | 
192 |     return args
193 | 
194 | 
195 | def download_content(**args):
196 |     """
197 |     main function to fetch links and download them
198 |     """
199 |     args = validate_args(**args)
200 | 
201 |     if not args['directory']:
202 |         args['directory'] = args['query'].replace(' ', '-')
203 | 
204 |     print("Downloading {0} {1} files on topic {2} from {3} and saving to directory: {4}"
205 |           .format(args['limit'], args['file_type'], args['query'], args['website'], args['directory']))
206 | 
207 |     links = search(args['query'], args['engine'], args['website'], args['file_type'], args['limit'])
208 | 
209 |     if args['parallel']:
210 |         download_parallel(links, args['directory'], args['min_file_size'], args['max_file_size'], args['no_redirects'])
211 |     else:
212 |         download_series(links, args['directory'], args['min_file_size'], args['max_file_size'], args['no_redirects'])
213 | 
214 | 
215 | def show_filetypes(extensions):
216 |     """
217 |     function to show valid file extensions
218 |     """
219 |     for item in extensions.items():
220 |         val = item[1]
221 |         if type(item[1]) == list:
222 |             val = ", ".join(str(x) for x in
item[1])
224 |         print("{0:4}: {1}".format(val, item[0]))
225 | 
226 | 
227 | def main():
228 |     parser = argparse.ArgumentParser(description = "Content Downloader",
229 |                                      epilog = "Now download files on any topic in bulk!")
230 | 
231 |     # defining arguments for parser object
232 |     parser.add_argument("query", type = str, default = None, nargs = '?',
233 |                         help = "Specify the query.")
234 | 
235 |     parser.add_argument("-f", "--file_type", type = str, default = 'pdf',
236 |                         help = "Specify the extension of files to download.")
237 | 
238 |     parser.add_argument("-l", "--limit", type = int, default = 10,
239 |                         help = "Limit the number of search results (in multiples of 10).")
240 | 
241 |     parser.add_argument("-d", "--directory", type = str, default = None,
242 |                         help = "Specify the directory where files will be stored.")
243 | 
244 |     parser.add_argument("-p", "--parallel", action = 'store_true', default = False,
245 |                         help = "For parallel downloading.")
246 | 
247 |     parser.add_argument("-e", "--engine", type = str, default = "g",
248 |                         help = "Specify search engine\nduckduckgo: 'd'\ngoogle: 'g'")
249 | 
250 |     parser.add_argument("-a", "--available", action = 'store_true',
251 |                         help = "Get list of all available filetypes.")
252 | 
253 |     parser.add_argument("-w", "--website", type = str, default = "",
254 |                         help = "Specify a particular website to download content from.")
255 | 
256 |     parser.add_argument("-t", "--threats", action = 'store_true',
257 |                         help = "Get list of all common virus carrier filetypes.")
258 | 
259 |     parser.add_argument("-minfs", "--min-file-size", type = int, default = 0,
260 |                         help = "Specify minimum file size to download in Kilobytes (KB).")
261 | 
262 |     parser.add_argument("-maxfs", "--max-file-size", type = int, default = -1,
263 |                         help = "Specify maximum file size to download in Kilobytes (KB).")
264 | 
265 |     parser.add_argument("-nr", "--no-redirects", action = 'store_true', default = False,
266 |                         help = "Prevent download redirects.")
267 | 
268 |     args = parser.parse_args()
269 |     args_dict = vars(args)
270 | 
271 |     if args.available:
272 |         show_filetypes(FILE_EXTENSIONS)
273 |         return
274 | 
275 |     if args.threats:
276 |         show_filetypes(THREAT_EXTENSIONS)
277 |         return
278 | 
279 |     high_threat = check_threats(**args_dict)
280 | 
281 |     if high_threat:
282 |         def prompt(message, errormessage, isvalid, isexit):
283 |             res = None
284 |             while res is None:
285 |                 res = input(str(message) + ': ')
286 |                 if isexit(res):
287 |                     sys.exit()
288 |                 if not isvalid(res):
289 |                     print(str(errormessage))
290 |                     res = None
291 |             return res
292 |         prompt(
293 |             message = "WARNING: Downloading this file type may expose you to a heightened security risk.\nPress 'y' to proceed or 'n' to exit",
294 |             errormessage = "Error: Invalid option provided.",
295 |             isvalid = lambda x: x == 'y',
296 |             isexit = lambda x: x == 'n'
297 |         )
298 | 
299 |     download_content(**args_dict)
300 | 
301 | 
302 | if __name__ == "__main__":
303 |     main()
--------------------------------------------------------------------------------
/ctdl/downloader.py:
--------------------------------------------------------------------------------
1 | import os
2 | import threading
3 | import requests
4 | from requests.packages.urllib3.util.retry import Retry
5 | from requests.adapters import HTTPAdapter
6 | from tqdm import tqdm, trange
7 | 
8 | chunk_size = 1024
9 | main_iter = None
10 | yellow_color = "\033[93m"
11 | blue_color = "\033[94m"
12 | 
13 | # modes -> s: series | p: parallel
14 | 
15 | s = requests.Session()
16 | # Max retries and back-off strategy so all requests to http:// sleep before retrying
17 | retries = Retry(total = 5,
18 |                 backoff_factor = 0.1,
19 |                 status_forcelist = [500, 502, 503, 504])
20 | s.mount('http://', HTTPAdapter(max_retries = retries))
21 | 
22 | def download(url, directory, min_file_size = 0, max_file_size = -1,
23 |              no_redirects = 
False, pos = 0, mode = 's'):
24 |     global main_iter
25 | 
26 |     file_name = url.split('/')[-1]
27 |     file_address = directory + '/' + file_name
28 |     is_redirects = not no_redirects
29 | 
30 |     resp = s.get(url, stream = True, allow_redirects = is_redirects)
31 | 
32 |     if resp.status_code != 200:
33 |         # ignore this file since the server returned an invalid response
34 |         return
35 | 
36 |     try:
37 |         total_size = int(resp.headers['content-length'])
38 |     except KeyError:
39 |         total_size = len(resp.content)
40 | 
41 |     total_chunks = total_size / chunk_size
42 | 
43 |     if total_chunks < min_file_size:
44 |         # ignore this file since its size is less than min_file_size
45 |         return
46 |     elif max_file_size != -1 and total_chunks > max_file_size:
47 |         # ignore this file since its size is greater than max_file_size
48 |         return
49 | 
50 |     file_iterable = resp.iter_content(chunk_size = chunk_size)
51 | 
52 |     tqdm_iter = tqdm(iterable = file_iterable, total = total_chunks,
53 |                      unit = 'KB', position = pos, desc = blue_color + file_name, leave = False)
54 | 
55 |     with open(file_address, 'wb') as f:
56 |         for data in tqdm_iter:
57 |             f.write(data)
58 | 
59 |     if mode == 'p':
60 |         main_iter.update(1)
61 | 
62 | 
63 | def download_parallel(urls, directory, min_file_size, max_file_size, no_redirects):
64 |     global main_iter
65 | 
66 |     # create directory to save files
67 |     if not os.path.exists(directory):
68 |         os.makedirs(directory)
69 | 
70 |     # overall progress bar
71 |     main_iter = trange(len(urls), position = 1, desc = yellow_color + "Overall")
72 | 
73 |     # empty list to store threads
74 |     threads = []
75 | 
76 |     # creating threads
77 |     for idx, url in enumerate(urls):
78 |         t = threading.Thread(
79 |             target = download,
80 |             kwargs = {
81 |                 'url': url,
82 |                 'directory': directory,
83 |                 'pos': 2*idx+3,
84 |                 'mode': 'p',
85 |                 'min_file_size': min_file_size,
86 |                 'max_file_size': max_file_size,
87 |                 'no_redirects': no_redirects
88 |             }
89 |         )
90 |         threads.append(t)
91 | 
92 |     # start all threads
93 |     for t in threads:
94 | t.start() 95 | 96 | # wait until all threads terminate 97 | for t in threads[::-1]: 98 | t.join() 99 | 100 | main_iter.close() 101 | 102 | print("\n\nDownload complete.") 103 | 104 | 105 | def download_series(urls, directory, min_file_size, max_file_size, no_redirects): 106 | 107 | # create directory to save files 108 | if not os.path.exists(directory): 109 | os.makedirs(directory) 110 | 111 | # download files one by one 112 | for url in urls: 113 | download(url, directory, min_file_size, max_file_size, no_redirects) 114 | 115 | print("Download complete.") 116 | -------------------------------------------------------------------------------- /ctdl/gui.py: -------------------------------------------------------------------------------- 1 | import os 2 | import threading 3 | import requests 4 | from requests.packages.urllib3.util.retry import Retry 5 | from requests.adapters import HTTPAdapter 6 | from tqdm import tqdm, trange 7 | 8 | try: 9 | from Tkinter import * 10 | except : 11 | from tkinter import * 12 | 13 | try: 14 | import ttk 15 | except : 16 | from tkinter import ttk 17 | 18 | try: 19 | from tkinter.filedialog import * 20 | from tkinter.messagebox import * 21 | except : 22 | from tkFileDialog import * 23 | from tkMessageBox import * 24 | 25 | from .gui_downloader import * 26 | from .ctdl import * 27 | from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS 28 | 29 | cur_dir = os.getcwd() 30 | 31 | # icon and title 32 | root = Tk() 33 | root.wm_title("ctdl") 34 | try: 35 | img = PhotoImage(file = "icon.png") 36 | root.tk.call('wm', 'iconphoto', root._w, img) 37 | except: 38 | pass 39 | 40 | row = Frame() 41 | links = [] 42 | 43 | default_text = {'file_type' : 'pdf', 'query' : 'python', 44 | 'min_file_size' : 0, 'max_file_size' : -1, 'limit' : 10} 45 | 46 | fields = 'Search query', 'Min Allowed File Size', 'Max Allowed File Size', 'Download Directory', 'Limit' 47 | 48 | args = { 'parallel' : False, 'file_type' : 'pdf', 'threats' : False, 49 | 
'no_redirects' : False, 'available' : False, 'query' : 'python', 50 | 'min_file_size' : 0, 'max_file_size' : -1, 'directory' : None, 'limit' : 10} 51 | 52 | 53 | def search_function(root1, q, s, f, l, o='g'): 54 | """ 55 | function to get links 56 | """ 57 | global links 58 | links = search(q, o, s, f, l) 59 | root1.destroy() 60 | root1.quit() 61 | 62 | 63 | def task(ft): 64 | """ 65 | to create loading progress bar 66 | """ 67 | ft.pack(expand = True, fill = BOTH, side = TOP) 68 | pb_hD = ttk.Progressbar(ft, orient = 'horizontal', mode = 'indeterminate') 69 | pb_hD.pack(expand = True, fill = BOTH, side = TOP) 70 | pb_hD.start(50) 71 | ft.mainloop() 72 | 73 | 74 | def download_content_gui(**args): 75 | """ 76 | function to fetch links and download them 77 | """ 78 | global row 79 | 80 | if not args ['directory']: 81 | args ['directory'] = args ['query'].replace(' ', '-') 82 | 83 | root1 = Frame(root) 84 | t1 = threading.Thread(target = search_function, args = (root1, 85 | args['query'], args['website'], args['file_type'], args['limit'],args['option'])) 86 | t1.start() 87 | task(root1) 88 | t1.join() 89 | 90 | #new frame for progress bar 91 | row = Frame(root) 92 | row.pack() 93 | if args['parallel']: 94 | download_parallel_gui(row, links, args['directory'], args['min_file_size'], 95 | args['max_file_size'], args['no_redirects']) 96 | else: 97 | download_series_gui(row, links, args['directory'], args['min_file_size'], 98 | args['max_file_size'], args['no_redirects']) 99 | 100 | 101 | class makeform: 102 | """ 103 | to makre the main form of gui 104 | """ 105 | global args 106 | def __init__(self, root): 107 | 108 | 109 | # label search query 110 | self.row0 = Frame(root) 111 | self.lab0 = Label(self.row0, width = 25, text = fields [0], anchor = 'w') 112 | self.entry_query = Entry(self.row0) 113 | self.entry_query.insert(0, 'python') 114 | self.entry_query.bind('', self.on_entry_click) 115 | self.entry_query.bind('', lambda event, 116 | a = "query" : 
self.on_focusout(event, a)) 117 | 118 | self.entry_query.config(fg = 'grey') 119 | self.row0.pack(side = TOP, fill = X, padx = 5, pady = 5) 120 | self.lab0.pack(side = LEFT) 121 | self.entry_query.pack(side = RIGHT, expand = YES, fill = X) 122 | 123 | 124 | 125 | # label min_file_size 126 | self.row1 = Frame(root) 127 | self.lab1 = Label(self.row1, width = 25, text = fields [1], anchor = 'w') 128 | self.entry_min = Entry(self.row1) 129 | self.entry_min.insert(0, '0') 130 | self.entry_min.bind('', self.on_entry_click) 131 | self.entry_min.bind('', lambda event, 132 | a = "min_file_size": self.on_focusout( event, a)) 133 | 134 | self.entry_min.config(fg = 'grey') 135 | self.row1.pack(side = TOP, fill = X, padx = 5, pady = 5) 136 | self.lab1.pack(side = LEFT) 137 | self.entry_min.pack(side = RIGHT, expand = YES, fill = X) 138 | 139 | 140 | # label max_file_size 141 | self.row2 = Frame(root) 142 | self.lab2 = Label(self.row2, width = 25, text = fields [2], anchor = 'w') 143 | self.entry_max = Entry(self.row2) 144 | self.entry_max.insert(0, '-1') 145 | self.entry_max.bind('', self.on_entry_click) 146 | self.entry_max.bind('', lambda event, 147 | a = "max_file_size": self.on_focusout(event, a)) 148 | 149 | self.entry_max.config(fg = 'grey') 150 | self.row2.pack(side = TOP, fill = X, padx = 5, pady = 5) 151 | self.lab2.pack(side = LEFT) 152 | self.entry_max.pack(side = RIGHT, expand = YES, fill = X) 153 | 154 | 155 | # label choose directory 156 | self.dir_text = StringVar() 157 | self.dir_text.set('Choose Directory') 158 | self.row3 = Frame(root) 159 | self.lab3 = Label(self.row3, width = 25, text = fields [3], anchor = 'w') 160 | self.entry_dir = Button(self.row3, textvariable = self.dir_text, command = self.ask_dir) 161 | self.row3.pack(side = TOP, fill = X, padx = 5, pady = 5) 162 | self.lab3.pack(side = LEFT ) 163 | self.entry_dir.pack(side = RIGHT, expand = YES, fill = X) 164 | self.dir_opt = options = {} 165 | options ['mustexist'] = False 166 | options ['parent'] 
= root 167 | options ['title'] = 'Choose Directory' 168 | 169 | 170 | # label download limit 171 | self.row4 = Frame(root) 172 | self.lab4 = Label(self.row4, width = 25, text = fields[4], anchor = 'w') 173 | self.entry_limit = Entry(self.row4) 174 | self.entry_limit.insert(0, '10') 175 | self.entry_limit.bind('', self.on_entry_click) 176 | self.entry_limit.bind('', lambda event, 177 | a = "limit" : self.on_focusout(event, a)) 178 | self.entry_limit.config(fg = 'grey') 179 | self.row4.pack(side = TOP, fill = X, padx = 5, pady = 5) 180 | self.lab4.pack(side = LEFT) 181 | self.entry_limit.pack(side = RIGHT, expand = YES, fill = X) 182 | 183 | # specify website 184 | self.row8 = Frame(root) 185 | self.lab8 = Label(self.row8, width = 25, text = "Specify Website", anchor = 'w') 186 | self.entry_website = Entry(self.row8) 187 | self.row8.pack(side = TOP, fill = X, padx = 5, pady = 5) 188 | self.lab8.pack(side = LEFT) 189 | self.entry_website.pack(side = RIGHT, expand = YES, fill = X) 190 | 191 | self.row9 = Frame(root) 192 | self.engine = StringVar() 193 | self.engine.set("g") 194 | Radiobutton(self.row9, text="Google", variable=self.engine, value="g").pack(anchor=W) 195 | Radiobutton(self.row9, text="DuckDuckGo", variable=self.engine, value="d").pack(anchor=W) 196 | self.row9.pack(side = TOP, fill = X, padx = 5, pady = 5) 197 | 198 | # all entries for dropdown menu 199 | self.choiceVar = StringVar() 200 | self.choices = [] 201 | for val in THREAT_EXTENSIONS.values(): 202 | if type(val) == list: 203 | for el in val: 204 | self.choices.append(el) 205 | else: 206 | self.choices.append(val) 207 | 208 | for val in FILE_EXTENSIONS.values(): 209 | if type(val) == list: 210 | for el in val: 211 | self.choices.append(el) 212 | else: 213 | self.choices.append(val) 214 | self.choiceVar.set('pdf') 215 | 216 | 217 | # dropdown box 218 | self.row5 = Frame(root) 219 | self.lab = Label(self.row5, width = 25, 220 | text = "File Type", anchor = 'w') 221 | self.optionmenu = 
ttk.Combobox(self.row5, 222 | textvariable = self.choiceVar, values = self.choices) 223 | 224 | self.row5.pack(side = TOP, fill = X, padx = 5, pady = 5) 225 | self.lab.pack(side = LEFT) 226 | self.optionmenu.pack(side = RIGHT, expand = YES, fill = X) 227 | 228 | 229 | # toggle box for parallel downloading 230 | # and toggle redirects 231 | self.row6 = Frame(root) 232 | self.p = BooleanVar() 233 | Checkbutton(self.row6, text = "parallel downloading", 234 | variable = self.p).pack(side = LEFT) 235 | 236 | self.t = BooleanVar() 237 | 238 | Checkbutton(self.row6, text = "toggle redirects", 239 | variable = self.t).pack(side = LEFT) 240 | 241 | self.row6.pack(side = TOP, fill = X, padx = 5, pady = 5) 242 | 243 | 244 | # download button 245 | self.row7 = Frame(root) 246 | self.search_button = Button(self.row7, width = 15, text = "Download", anchor = 'w') 247 | self.search_button.bind('<Button-1>', self.click_download) 248 | self.row7.pack(side = TOP, fill = X, padx = 5, pady = 5) 249 | self.search_button.pack(side = LEFT) 250 | 251 | 252 | # clear button 253 | self.clear = Button(self.row7, width = 15, text = "Clear / Cancel", anchor = 'w') 254 | self.clear.bind('<Button-1>', self.clear_fun) 255 | self.clear.pack(side = RIGHT) 256 | 257 | 258 | def click_download(self, event): 259 | """ 260 | event for download button 261 | """ 262 | args ['parallel'] = self.p.get() 263 | args ['file_type'] = self.optionmenu.get() 264 | args ['no_redirects'] = self.t.get() 265 | args ['query'] = self.entry_query.get() 266 | args ['min_file_size'] = int(self.entry_min.get()) 267 | args ['max_file_size'] = int(self.entry_max.get()) 268 | args ['limit'] = int(self.entry_limit.get()) 269 | args ['website'] = self.entry_website.get() 270 | args ['option'] = self.engine.get() 271 | print(args) 272 | if self.check_threat():  # only proceed if user accepted the threat warning 273 | download_content_gui(**args) 274 | 275 | 276 | def on_entry_click(self, event): 277 | """ 278 | function that gets called whenever entry is clicked 279 | """ 280 | if
event.widget.config('fg') [4] == 'grey': 281 | event.widget.delete(0, "end") # delete all the text in the entry 282 | event.widget.insert(0, '') # insert blank for user input 283 | event.widget.config(fg = 'black') 284 | 285 | 286 | def on_focusout(self, event, a): 287 | """ 288 | function that gets called when the entry loses focus 289 | """ 290 | if event.widget.get() == '': 291 | event.widget.insert(0, default_text[a]) 292 | event.widget.config(fg = 'grey') 293 | 294 | 295 | def check_threat(self): 296 | """ 297 | function to check input filetype against threat extensions list 298 | """ 299 | is_high_threat = False 300 | for val in THREAT_EXTENSIONS.values(): 301 | if type(val) == list: 302 | for el in val: 303 | if self.optionmenu.get() == el: 304 | is_high_threat = True 305 | break 306 | else: 307 | if self.optionmenu.get() == val: 308 | is_high_threat = True 309 | break 310 | 311 | if is_high_threat: 312 | is_high_threat = not askokcancel('FILE TYPE', 'WARNING: Downloading this ' 313 | 'file type may expose you to a heightened security risk.\nPress ' 314 | '"OK" to proceed or "CANCEL" to exit') 315 | return not is_high_threat 316 | 317 | def ask_dir(self): 318 | """ 319 | dialogue box for choosing directory 320 | """ 321 | args ['directory'] = askdirectory(**self.dir_opt) 322 | self.dir_text.set(args ['directory']) 323 | 324 | 325 | def clear_fun(self, event): 326 | global row 327 | row.destroy() 328 | 329 | 330 | def main(): 331 | """ 332 | main function 333 | """ 334 | s = ttk.Style() 335 | s.theme_use('clam') 336 | ents = makeform(root) 337 | root.mainloop() 338 | 339 | 340 | if __name__ == "__main__": 341 | main() -------------------------------------------------------------------------------- /ctdl/gui_downloader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import threading 3 | import requests 4 | from requests.packages.urllib3.util.retry import Retry 5 | from requests.adapters
import HTTPAdapter 6 | 7 | try: 8 | from Tkinter import * 9 | except ImportError: 10 | from tkinter import * 11 | 12 | try: 13 | import ttk 14 | except ImportError: 15 | from tkinter import ttk 16 | 17 | chunk_size = 1024 18 | parallel = False 19 | exit_flag = 0 20 | file_name = [] 21 | total_chunks = [] 22 | i_max = [] 23 | 24 | s = requests.Session() 25 | # Max retries and back-off strategy so all requests to http:// sleep before retrying 26 | retries = Retry(total = 5, 27 | backoff_factor = 0.1, 28 | status_forcelist = [500, 502, 503, 504]) 29 | s.mount('http://', HTTPAdapter(max_retries = retries)) 30 | 31 | 32 | def download(urls, directory, idx, min_file_size = 0, max_file_size = -1, 33 | no_redirects = False, pos = 0, mode = 's'): 34 | """ 35 | download function for serial download 36 | """ 37 | global main_it 38 | global exit_flag 39 | global total_chunks 40 | global file_name 41 | global i_max 42 | 43 | # loop in single thread to serialize downloads 44 | for url in urls: 45 | file_name[idx] = url.split('/')[-1] 46 | file_address = directory + '/' + file_name[idx] 47 | is_redirects = not no_redirects 48 | 49 | resp = s.get(url, stream = True, allow_redirects = is_redirects) 50 | if not resp.status_code == 200: 51 | # ignore this file since server returns invalid response 52 | continue 53 | try: 54 | total_size = int(resp.headers['content-length']) 55 | except KeyError: 56 | total_size = len(resp.content) 57 | 58 | total_chunks[idx] = total_size // chunk_size  # integer division on both Python 2 and 3 59 | if total_chunks[idx] < min_file_size: 60 | # ignore this file since file size is smaller than min_file_size 61 | continue 62 | elif max_file_size != -1 and total_chunks[idx] > max_file_size: 63 | # ignore this file since file size is greater than max_file_size 64 | continue 65 | 66 | file_iterable = resp.iter_content(chunk_size = chunk_size) 67 | with open(file_address, 'wb') as f: 68 | for sno, data in enumerate(file_iterable): 69 | i_max[idx] = sno + 1 70 | f.write(data) 71 | 72 | exit_flag += 1 73 | 74 | 75 | def
download_parallel(url, directory, idx, min_file_size = 0, max_file_size = -1, 76 | no_redirects = False, pos = 0, mode = 's'): 77 | """ 78 | download function to download in parallel 79 | """ 80 | global main_it 81 | global exit_flag 82 | global total_chunks 83 | global file_name 84 | global i_max 85 | 86 | file_name[idx] = url.split('/')[-1] 87 | file_address = directory + '/' + file_name[idx] 88 | is_redirects = not no_redirects 89 | 90 | resp = s.get(url, stream = True, allow_redirects = is_redirects) 91 | if not resp.status_code == 200: 92 | # ignore this file since server returns invalid response 93 | exit_flag += 1 94 | return 95 | try: 96 | total_size = int(resp.headers['content-length']) 97 | except KeyError: 98 | total_size = len(resp.content) 99 | 100 | total_chunks[idx] = total_size // chunk_size  # integer division on both Python 2 and 3 101 | if total_chunks[idx] < min_file_size: 102 | # ignore this file since file size is smaller than min_file_size 103 | exit_flag += 1 104 | return 105 | elif max_file_size != -1 and total_chunks[idx] > max_file_size: 106 | # ignore this file since file size is greater than max_file_size 107 | exit_flag += 1 108 | return 109 | 110 | file_iterable = resp.iter_content(chunk_size = chunk_size) 111 | with open(file_address, 'wb') as f: 112 | for sno, data in enumerate(file_iterable): 113 | i_max[idx] = sno + 1 114 | f.write(data) 115 | 116 | exit_flag += 1 117 | 118 | 119 | 120 | class myThread(threading.Thread): 121 | """ 122 | custom thread to run download thread 123 | """ 124 | def __init__(self, url, directory, idx, min_file_size, max_file_size, no_redirects): 125 | threading.Thread.__init__(self) 126 | self.idx = idx 127 | self.url = url 128 | self.directory = directory 129 | self.min_file_size = min_file_size 130 | self.max_file_size = max_file_size 131 | self.no_redirects = no_redirects 132 | 133 | 134 | def run(self): 135 | """ 136 | function called when thread is started 137 | """ 138 | global parallel 139 | 140 | if parallel: 141 | download_parallel(self.url,
self.directory, self.idx, 142 | self.min_file_size, self.max_file_size, self.no_redirects) 143 | else: 144 | download(self.url, self.directory, self.idx, 145 | self.min_file_size, self.max_file_size, self.no_redirects) 146 | 147 | 148 | class progress_class(): 149 | """ 150 | custom class for progress bar 151 | """ 152 | def __init__(self, frame, url, directory, min_file_size, max_file_size, no_redirects): 153 | global i_max 154 | global file_name 155 | global parallel 156 | 157 | self.url = url 158 | self.directory = directory 159 | self.min_file_size = min_file_size 160 | self.max_file_size = max_file_size 161 | self.no_redirects = no_redirects 162 | self.frame = frame 163 | 164 | self.progress = [] 165 | self.str = [] 166 | self.label = [] 167 | self.bytes = [] 168 | self.maxbytes = [] 169 | self.thread = [] 170 | 171 | if parallel: 172 | self.length = len(self.url) 173 | else: 174 | # to serialize just make a single thread 175 | self.length = 1 176 | 177 | # for parallel downloading 178 | for self.i in range(0, self.length): 179 | file_name.append("") 180 | i_max.append(0) 181 | total_chunks.append(0) 182 | 183 | # initialize progressbar 184 | self.progress.append(ttk.Progressbar(frame, orient="horizontal", 185 | length=300, mode="determinate")) 186 | self.progress[self.i].pack() 187 | self.str.append(StringVar()) 188 | self.label.append(Label(frame, textvariable=self.str[self.i], width=40)) 189 | self.label[self.i].pack() 190 | self.progress[self.i]["value"] = 0 191 | self.bytes.append(0) 192 | self.maxbytes.append(0) 193 | 194 | # start thread 195 | self.start() 196 | 197 | 198 | def start(self): 199 | """ 200 | function to initialize thread for downloading 201 | """ 202 | global parallel 203 | for self.i in range(0, self.length): 204 | if parallel: 205 | self.thread.append(myThread(self.url[self.i], self.directory, self.i, 206 | self.min_file_size, self.max_file_size, self.no_redirects)) 207 | else: 208 | # if not parallel whole url list is passed 209 |
self.thread.append(myThread(self.url, self.directory, self.i, self.min_file_size, 210 | self.max_file_size, self.no_redirects)) 211 | self.progress[self.i]["value"] = 0 212 | self.bytes[self.i] = 0 213 | self.thread[self.i].start() 214 | 215 | self.read_bytes() 216 | 217 | 218 | def read_bytes(self): 219 | """ 220 | reading bytes; update progress bar every 10 ms 221 | """ 222 | global exit_flag 223 | 224 | for self.i in range(0, self.length): 225 | self.bytes[self.i] = i_max[self.i] 226 | self.maxbytes[self.i] = total_chunks[self.i] 227 | self.progress[self.i]["maximum"] = total_chunks[self.i] 228 | self.progress[self.i]["value"] = self.bytes[self.i] 229 | self.str[self.i].set(file_name[self.i] + " " + str(self.bytes[self.i]) 230 | + " KB / " + str(int(self.maxbytes[self.i] + 1)) + " KB") 231 | 232 | if exit_flag == self.length: 233 | exit_flag = 0 234 | self.frame.destroy() 235 | else: 236 | self.frame.after(10, self.read_bytes) 237 | 238 | 239 | def download_parallel_gui(root, urls, directory, min_file_size, max_file_size, no_redirects): 240 | """ 241 | called when parallel downloading is true 242 | """ 243 | global parallel 244 | 245 | # create directory to save files 246 | if not os.path.exists(directory): 247 | os.makedirs(directory) 248 | parallel = True 249 | app = progress_class(root, urls, directory, min_file_size, max_file_size, no_redirects) 250 | 251 | 252 | 253 | 254 | def download_series_gui(frame, urls, directory, min_file_size, max_file_size, no_redirects): 255 | """ 256 | called when user wants serial downloading 257 | """ 258 | 259 | # create directory to save files 260 | if not os.path.exists(directory): 261 | os.makedirs(directory) 262 | app = progress_class(frame, urls, directory, min_file_size, max_file_size, no_redirects) 263 | 264 | 265 | -------------------------------------------------------------------------------- /ctdl/utils.py: -------------------------------------------------------------------------------- 1 | DEFAULTS = { 'file_type':
'pdf', 2 | 'limit': 10, 3 | 'directory': None, 4 | 'parallel': False, 5 | 'min_file_size': 0, 6 | 'max_file_size': -1, 7 | 'no_redirects': False 8 | } 9 | 10 | 11 | FILE_EXTENSIONS = { 'Adobe Flash': 'swf', 12 | 'Adobe Portable Document Format': 'pdf', 13 | 'Adobe PostScript': 'ps', 14 | 'Autodesk Design Web Format': 'dwf', 15 | 'Google Earth': 'kml', 16 | 'XML': 'xml', 17 | 'Microsoft PowerPoint': 'ppt', 18 | 'Microsoft Excel': 'xls', 19 | 'Microsoft Word': 'doc', 20 | 'GPS eXchange Format': 'gpx', 21 | 'Hancom Hanword': 'hwp', 22 | 'HTML': 'html', 23 | 'OpenOffice presentation': 'odp', 24 | 'OpenOffice spreadsheet': 'ods', 25 | 'OpenOffice text': 'odt', 26 | 'Rich Text Format': 'rtf', 27 | 'Scalable Vector Graphics': 'svg', 28 | 'TeX/LaTeX': 'tex', 29 | 'Text': 'txt', 30 | 'Basic source code': 'bas', 31 | 'C source code': 'c', 32 | 'C++ source code': 'cpp', 33 | 'C# source code': 'cs', 34 | 'Java source code': 'java', 35 | 'Perl source code': 'pl', 36 | 'Python source code': 'py', 37 | 'Wireless Markup Language': 'wml'} 38 | 39 | THREAT_EXTENSIONS = { 40 | 'Executable files': ['exe','com'], 41 | 'Program information file': 'pif', 42 | 'Screensaver file': 'scr', 43 | 'Visual Basic script': 'vbs', 44 | 'Shell scrap file': 'shs', 45 | 'Microsoft Compiled HTML Help': 'chm', 46 | 'Batch file': 'bat' 47 | } -------------------------------------------------------------------------------- /icon.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coding-blocks/content-downloader/36d37122e22cc4155dd82d629f0f0c9bac638bf7/icon.gif -------------------------------------------------------------------------------- /icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coding-blocks/content-downloader/36d37122e22cc4155dd82d629f0f0c9bac638bf7/icon.png --------------------------------------------------------------------------------
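The extension tables in `ctdl/utils.py` mix single extensions (`'bat'`) with lists (`['exe','com']`), and both `gui.py`'s dropdown setup and `check_threat` flatten them with near-identical inline loops. A minimal sketch of a reusable helper for that pattern — the `flatten_extensions` name is hypothetical, not part of the package:

```python
def flatten_extensions(mapping):
    # Flatten {description: extension-or-list-of-extensions} into one
    # flat list, mirroring the loops in gui.py over THREAT_EXTENSIONS
    # and FILE_EXTENSIONS.
    exts = []
    for val in mapping.values():
        if isinstance(val, list):
            exts.extend(val)   # e.g. 'Executable files': ['exe', 'com']
        else:
            exts.append(val)   # e.g. 'Batch file': 'bat'
    return exts
```

With such a helper, the dropdown choices become `flatten_extensions(THREAT_EXTENSIONS) + flatten_extensions(FILE_EXTENSIONS)`, and the threat check reduces to a membership test.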
/requirements.txt: -------------------------------------------------------------------------------- 1 | requests>=2.5.0 2 | lxml 3 | bs4 4 | tqdm 5 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from setuptools import setup 3 | 4 | def readme(): 5 | try: 6 | with open('README.rst') as f: 7 | return f.read() 8 | except IOError: 9 | pass 10 | 11 | setup(name = 'ctdl', 12 | version = '1.5.0', 13 | classifiers = [ 14 | 'Development Status :: 4 - Beta', 15 | 'License :: OSI Approved :: MIT License', 16 | 'Programming Language :: Python', 17 | 'Programming Language :: Python :: 2', 18 | 'Programming Language :: Python :: 2.6', 19 | 'Programming Language :: Python :: 2.7', 20 | 'Programming Language :: Python :: 3', 21 | 'Programming Language :: Python :: 3.3', 22 | 'Programming Language :: Python :: 3.4', 23 | 'Programming Language :: Python :: 3.5', 24 | 'Topic :: Internet', 25 | ], 26 | keywords = 'content downloader bulk files', 27 | description = 'Bulk file downloader on any topic.', 28 | long_description = readme(), 29 | url = 'https://github.com/nikhilkumarsingh/content-downloader', 30 | author = 'Nikhil Kumar Singh', 31 | author_email = 'nikhilksingh97@gmail.com', 32 | license = 'MIT', 33 | packages = ['ctdl'], 34 | install_requires = ['requests', 'bs4', 'lxml', 'tqdm'], 35 | dependency_links = ['git+https://github.com/nikhilkumarsingh/tqdm'], 36 | include_package_data = True, 37 | entry_points={ 38 | 'console_scripts': [ 39 | 'ctdl = ctdl.ctdl:main', 40 | 'ctdl-gui = ctdl.gui:main', 41 | ], 42 | }, 43 | package_data={'': ['icon.png']}, 44 | zip_safe = False) 45 | --------------------------------------------------------------------------------
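The module-level session in `ctdl/gui_downloader.py` mounts a retrying `HTTPAdapter` so transient server errors back off and retry instead of failing the download outright. The same pattern can be sketched as a standalone factory; this is a hedged sketch, not the package's API — `make_retry_session` is a hypothetical name, and mounting `https://` is an addition (the original mounts `http://` only):

```python
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

def make_retry_session(total=5, backoff_factor=0.1):
    # Retry transient 5xx responses up to `total` times, sleeping
    # backoff_factor * (2 ** (retry - 1)) seconds between attempts,
    # mirroring the Retry config in ctdl/gui_downloader.py.
    retries = Retry(total=total,
                    backoff_factor=backoff_factor,
                    status_forcelist=[500, 502, 503, 504])
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('http://', adapter)
    session.mount('https://', adapter)  # addition: original covers http:// only
    return session
```

Any `session.get(...)` made through the returned session then inherits the retry policy, which is why the downloader funnels every request through the shared `s` object.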