├── README.md ├── README_CN.md ├── nhentai-master.exe ├── nhentai-master.py ├── proxies.json ├── sites.txt └── sites_url.txt /README.md: -------------------------------------------------------------------------------- 1 | nhentai-master 2 | =============== 3 | The main code for this project comes from [tumblr-crawler](https://github.com/dixudx/tumblr-crawler). 4 | That project was built to easily grab pictures and videos from Tumblr, and I modified it into this doujinshi downloader. 5 | 6 | 7 | This is a [Python](https://www.python.org) script that lets you easily download photos from nhentai.net. 8 | 9 | ## 中文版教程请[移步这里](./README_CN.md) 10 | 11 | 12 | ### Use sites.txt 13 | 14 | Open the file `sites.txt` in a text editor and add the galleries you want to download, separated by comma/space/tab/newline (multiple lines are fine). Use only the numeric gallery ID, without the `https://nhentai.net/g/` prefix. For example, if you want to download `https://nhentai.net/g/263184/` and `https://nhentai.net/g/263183/`, compose the file like this: 15 | 16 | ``` 17 | 263184 18 | 263183 19 | ``` 20 | 21 | Then save the file and run `python nhentai-master.py`, 22 | or just double-click `nhentai-master.exe`. 23 | 24 | 25 | ### Use sites_url.txt 26 | 27 | If you search for, say, Akane Shinjou and Rikka Takarada and want to download all of the search results, copy the URL from your browser's address bar into this file, for example: 28 | 29 | ``` 30 | https://nhentai.net/search/?q=akane+shinjou++rikka+takarada 31 | ``` 32 | 33 | Then save the file and run `python nhentai-master.py`, 34 | or just double-click `nhentai-master.exe`. -------------------------------------------------------------------------------- /README_CN.md: -------------------------------------------------------------------------------- 1 | nhentai-master 2 | =============== 3 | 4 | 这个项目主要代码来自于[tumblr-crawler](https://github.com/dixudx/tumblr-crawler). 5 | 那个项目用于便捷获取tumblr中的图片和视频,我把它修改成了这个本子下载器 6 | 7 | 8 | 这是一个[Python](https://www.python.org)的脚本,配置运行后下载nhentai中的本子 9 | 或者可以直接使用编译好的.exe文件 10 | 11 | ## 配置和运行 12 | 13 | 有两种方式来指定你要下载的本子,一是编辑`sites.txt`填入本子编号,二是编辑`sites_url.txt`填入搜索结果页地址(也可以直接用命令行参数指定).
14 | 15 | ### 第一种方法:编辑sites.txt文件 16 | 17 | 打开文件`sites.txt`,把你想要下载的本子编号写进去,以逗号/空格/Tab/回车分隔,可以多行,不需要`https://nhentai.net/g/`前缀.例如,如果你要下载`https://nhentai.net/g/263184/`,`https://nhentai.net/g/263183/`,这个文件应该是这样的: 18 | 19 | ``` 20 | 263184 21 | 263183 22 | ``` 23 | 24 | 然后保存文件,在终端(terminal)里面运行`python nhentai-master.py` 25 | 或者直接点击运行nhentai-master.exe 26 | 27 | 28 | ### 第二种方法:编辑sites_url.txt文件 29 | 当你搜索akane shinjou和rikka takarada并且需要下载所有的搜索结果时,复制浏览器地址栏中的网址并写入该文件,例如: 30 | 31 | ``` 32 | https://nhentai.net/search/?q=akane+shinjou++rikka+takarada 33 | ``` 34 | 35 | 然后保存文件,在终端(terminal)里面运行`python nhentai-master.py` 36 | 或者直接点击运行nhentai-master.exe 37 | 38 | 39 | 40 | 41 | -------------------------------------------------------------------------------- /nhentai-master.exe: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/2311596178/nhentai-master/6229899326056d68070a1ab0c5bd15c24f9f3c8e/nhentai-master.exe -------------------------------------------------------------------------------- /nhentai-master.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import os 4 | import sys 5 | import requests 6 | import xmltodict 7 | from six.moves import queue as Queue 8 | from threading import Thread 9 | import re 10 | import json 11 | import ssl 12 | import urllib.request 13 | import urllib.error 14 | import urllib.parse 15 | from bs4 import BeautifulSoup 16 | 17 | 18 | # Request timeout in seconds 19 | TIMEOUT = 10 20 | 21 | # Retry times 22 | RETRY = 5 23 | 24 | # Medium index number to start from 25 | START = 0 26 | 27 | # Number of photos/videos per page 28 | MEDIA_NUM = 50 29 | 30 | # Number of concurrent download threads 31 | THREADS = 20 32 | 33 | 34 | 35 | def video_hd_match(): 36 | hd_pattern = re.compile(r'.*"hdUrl":("([^\s,]*)"|false),') 37 | 38 | def match(video_player): 39 | hd_match = hd_pattern.match(video_player) 40 | try: 41 | if hd_match is not None and hd_match.group(1) != 'false': 42 | return hd_match.group(2).replace('\\', '') 43 | except: 44 | return None 45 | return match 46 | 47 | 48 | def video_default_match(): 49 | default_pattern = re.compile(r'.*src="(\S*)" ', re.DOTALL) 50 | 51 | def match(video_player): 52 | default_match = default_pattern.match(video_player) 53 | if default_match is not None: 54 | try: 55 | return default_match.group(1) 56 | except: 57 | return None 58 | return match 59 | 60 | 61 | class DownloadWorker(Thread): 62 | def __init__(self, queue, proxies=None): 63 | Thread.__init__(self) 64 | self.queue = queue 65 | self.proxies = proxies 66 | self._register_regex_match_rules() 67 | 68 | def run(self): 69 | print("DownloadWorker.run()") 70 | while True: 71 | medium_type, post, target_folder = self.queue.get() 72 | self.download(medium_type, post, target_folder) 73 | self.queue.task_done() 74 | 75 | def download(self, medium_type, post, target_folder): 76 | try: 77 | medium_url = post 78 | 79 | print("medium_url=" + medium_url) 80 | if medium_url is not None: 81 | self._download(medium_type, medium_url, target_folder) 82 | except TypeError: 83 | print("TypeError") 84 | pass 85 | 86 | # can register different regex match rules 87 | def _register_regex_match_rules(self): 88 | # will iterate all the rules 89 | # the first matched result will be returned 90 | self.regex_rules = [video_hd_match(), video_default_match()] 91 | 92 | def _handle_medium_url(self, medium_type, post): 93 | try: 94 | if medium_type == "photo": 95 | return
post["photo-url"][0]["#text"] 96 | 97 | if medium_type == "video": 98 | video_player = post["video-player"][1]["#text"] 99 | for regex_rule in self.regex_rules: 100 | matched_url = regex_rule(video_player) 101 | if matched_url is not None: 102 | return matched_url 103 | else: 104 | raise Exception 105 | except: 106 | raise TypeError("Unable to find the right url for downloading. " 107 | "Please open a new issue on " 108 | "https://github.com/dixudx/tumblr-crawler/" 109 | "issues/new attached with below information:\n\n" 110 | "%s" % post) 111 | 112 | def _download(self, medium_type, medium_url, target_folder): 113 | medium_name = medium_url.split("/")[-1].split("?")[0] 114 | if medium_type == "video": 115 | if not medium_name.startswith("tumblr"): 116 | medium_name = "_".join([medium_url.split("/")[-2], 117 | medium_name]) 118 | 119 | medium_name += ".mp4" 120 | medium_url = 'https://vt.tumblr.com/' + medium_name 121 | 122 | file_path = os.path.join(target_folder, medium_name) 123 | if not os.path.isfile(file_path): 124 | print("Downloading %s from %s.\n" % (medium_name, 125 | medium_url)) 126 | retry_times = 0 127 | while retry_times < RETRY: 128 | try: 129 | resp = requests.get(medium_url, 130 | stream=True, 131 | proxies=self.proxies, 132 | timeout=TIMEOUT) 133 | if resp.status_code == 403: 134 | retry_times = RETRY 135 | print("Access Denied when retrieve %s.\n" % medium_url) 136 | raise Exception("Access Denied") 137 | with open(file_path, 'wb') as fh: 138 | for chunk in resp.iter_content(chunk_size=1024): 139 | fh.write(chunk) 140 | break 141 | except: 142 | # try again 143 | pass 144 | retry_times += 1 145 | else: 146 | try: 147 | os.remove(file_path) 148 | except OSError: 149 | pass 150 | print("Failed to retrieve %s from %s.\n" % (medium_type, 151 | medium_url)) 152 | 153 | 154 | def getHtml(site): 155 | # ssl._create_default_https_context = ssl._create_unverified_context 156 | headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} 157 | url = "https://nhentai.net/g/{0}/1/" 158 | urlhtml = url.format(site) 159 | 160 | req = urllib.request.Request(url=urlhtml, headers=headers) 161 | data = urllib.request.urlopen(req).read() 162 | soup = BeautifulSoup(data, "html.parser") 163 | 164 | section = soup.find_all('section', class_='pagination') 165 | tag = section[0] 166 | 167 | tag2 = tag.find_all('a')[3] 168 | page_url = tag2.get('href') 169 | max_page = page_url.split("/")[3] 170 | 171 | img_class = soup.find_all("img", class_="fit-horizontal") 172 | imgurl = img_class[0].get("src") 173 | imgpath = imgurl.split("/")[4] 174 | 175 | return max_page, imgpath 176 | 177 | 178 | def getUrlHtml(site): 179 | # ssl._create_default_https_context = ssl._create_unverified_context 180 | headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} 181 | urlhtml = site 182 | 183 | req = urllib.request.Request(url=urlhtml, headers=headers) 184 | data = urllib.request.urlopen(req).read() 185 | soup = BeautifulSoup(data, "html.parser") 186 | 187 | section = soup.find_all('div', class_='gallery') 188 | sitelist = [] 189 | for tag in section: 190 | ahref = tag.find_all('a')[0] 191 | href = ahref.get('href') 192 | href = href.split("/")[2] 193 | sitelist.append(href) 194 | return sitelist 195 | 196 | 197 | class CrawlerScheduler(object): 198 | 199 | def __init__(self, sites, proxies=None): 200 | self.sites = sites 201 | self.proxies = proxies 202 | self.queue = Queue.Queue() 203 | self.scheduling() 204 | 205 | def 
scheduling(self): 206 | # create workers 207 | for x in range(THREADS): 208 | worker = DownloadWorker(self.queue, 209 | proxies=self.proxies) 210 | # Setting daemon to True will let the main thread exit 211 | # even though the workers are blocking 212 | worker.daemon = True 213 | worker.start() 214 | 215 | for site in self.sites: 216 | print("site=" + site) 217 | self.download_media(site) 218 | 219 | def download_media(self, site): 220 | self.download_photos(site) 221 | # self.download_videos(site) 222 | 223 | def download_videos(self, site): 224 | self._download_media(site, "video", START) 225 | # wait for the queue to finish processing all the tasks from one 226 | # single site 227 | self.queue.join() 228 | print("Finished downloading all the videos from %s" % site) 229 | 230 | def download_photos(self, site): 231 | self._download_media(site, "photo", START) 232 | # wait for the queue to finish processing all the tasks from one 233 | # single site 234 | self.queue.join() 235 | print("Finished downloading all the photos from %s" % site) 236 | 237 | def _download_media(self, site, medium_type, start): 238 | current_folder = os.getcwd() 239 | target_folder = os.path.join(current_folder, site) 240 | if not os.path.isdir(target_folder): 241 | os.mkdir(target_folder) 242 | maxPage, imgpath = getHtml(site) 243 | base_url = "https://i.nhentai.net/galleries/" + imgpath + "/{0}.jpg" 244 | num = int(maxPage) 245 | print(base_url) 246 | for i in range(num): 247 | photo_path = base_url.format(i + 1) 248 | self.queue.put((medium_type, photo_path, target_folder)) 249 | # while True: 250 | # media_url = base_url.format(site, medium_type, MEDIA_NUM, start) 251 | # response = requests.get(media_url, 252 | # proxies=self.proxies) 253 | # if response.status_code == 404: 254 | # print("Site %s does not exist" % site) 255 | # break 256 | # 257 | # print("media_url=" + media_url) 258 | # try: 259 | # xml_cleaned = re.sub(u'[^\x20-\x7f]+', 260 | # u'', response.content.decode('utf-8')) 261 | # data = xmltodict.parse(xml_cleaned) 262 | # posts = data["tumblr"]["posts"]["post"] 263 | # 264 | # for post in posts: 265 | # try: 266 | # # if post has photoset, walk into photoset for each photo 267 | # 268 | # photoset = post["photoset"]["photo"] 269 | # 270 | # for photo in photoset: 271 | # 272 | # self.queue.put((medium_type, photo, target_folder)) 273 | # except: 274 | # # select the largest resolution 275 | # # usually in the first element 276 | # self.queue.put((medium_type, post, target_folder)) 277 | # start += MEDIA_NUM 278 | # except KeyError: 279 | # break 280 | # except UnicodeDecodeError: 281 | # print("Cannot decode response data from URL %s" % media_url) 282 | # continue 283 | # except: 284 | # print("Unknown xml-vulnerabilities from URL %s" % media_url) 285 | # continue 286 | 287 | 288 | def usage(): 289 | print("1. Please create the file sites.txt under this same directory.\n" 290 | "2. In sites.txt, specify the nhentai gallery IDs you want to download, " 291 | "separated by comma/space/tab/CR. Multiple lines are accepted.\n" 292 | "3. 
Save the file and retry.\n\n" 293 | "Sample file content:\n263184,263183\n\n" 294 | "Or use command line options:\n\n" 295 | "Sample:\npython nhentai-master.py 263184,263183\n\n\n") 296 | print(u"未找到sites.txt文件,请创建.\n" 297 | u"请在文件中指定nhentai本子编号,并以 逗号/空格/Tab/回车 分割,支持多行.\n" 298 | u"保存文件并重试.\n\n" 299 | u"例子: 263184,263183\n\n" 300 | u"或者直接使用命令行参数指定本子编号\n" 301 | u"例子: python nhentai-master.py 263184,263183") 302 | 303 | 304 | def illegal_json(): 305 | print("Illegal JSON format in file 'proxies.json'.\n" 306 | "Please make sure the file contains a valid JSON object.\n" 307 | "And go to http://jsonlint.com/ for validation.\n\n\n") 308 | print(u"文件proxies.json格式非法.\n" 309 | u"请确保文件内容是合法的JSON对象.\n" 310 | u"然后去 http://jsonlint.com/ 进行验证.") 311 | 312 | 313 | def parse_sites(filename): 314 | with open(filename, "r") as f: 315 | raw_sites = f.read().rstrip().lstrip() 316 | 317 | raw_sites = raw_sites.replace("\t", ",") \ 318 | .replace("\r", ",") \ 319 | .replace("\n", ",") \ 320 | .replace(" ", ",") 321 | raw_sites = raw_sites.split(",") 322 | 323 | sites = list() 324 | for raw_site in raw_sites: 325 | site = raw_site.lstrip().rstrip() 326 | if site: 327 | sites.append(site) 328 | return sites 329 | 330 | 331 | if __name__ == "__main__": 332 | cur_dir = os.path.dirname(os.path.realpath(__file__)) 333 | sites = None 334 | sitesUrl = None 335 | 336 | proxies = None 337 | proxy_path = os.path.join(cur_dir, "proxies.json") 338 | if os.path.exists(proxy_path): 339 | with open(proxy_path, "r") as fj: 340 | try: 341 | proxies = json.load(fj) 342 | if proxies is not None and len(proxies) > 0: 343 | print("You are using proxies.\n%s" % proxies) 344 | except: 345 | illegal_json() 346 | sys.exit(1) 347 | 348 | if len(sys.argv) < 2: 349 | # check the sites file 350 | filename = os.path.join(cur_dir, "sites.txt") 351 | urlFileName = os.path.join(cur_dir, "sites_url.txt") 352 | if os.path.exists(filename): 353 | sites = parse_sites(filename) 354 | else: 355 | usage() 356 | sys.exit(1) 357 | if os.path.exists(urlFileName): 358 | sitesUrl = parse_sites(urlFileName) 359 | else: 360 | # sites_url.txt is optional; fall back to an empty list 361 | sitesUrl = [] 362 | else: 363 | sites = [s for s in sys.argv[1].split(",") if not s.startswith("http")] 364 | sitesUrl = [s for s in sys.argv[1].split(",") if s.startswith("http")] 365 | 366 | for site in sitesUrl: 367 | print("site=" + site) 368 | sites2 = getUrlHtml(site) 369 | sites = sites + sites2 370 | 371 | if len(sites) == 0 or sites[0] == "": 372 | usage() 373 | sys.exit(1) 374 | NUM = 0 375 | print(sites) 376 | CrawlerScheduler(sites, proxies=proxies) 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | -------------------------------------------------------------------------------- /proxies.json: -------------------------------------------------------------------------------- 1 | { 2 | } 3 | -------------------------------------------------------------------------------- /sites.txt: -------------------------------------------------------------------------------- 1 | 263184 2 | 263183 -------------------------------------------------------------------------------- /sites_url.txt: -------------------------------------------------------------------------------- 1 | https://nhentai.net/search/?q=akane+shinjou++rikka+takarada --------------------------------------------------------------------------------
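
Both READMEs above say that entries in `sites.txt` may be separated by commas, spaces, tabs, or newlines. The sketch below mirrors what `parse_sites()` in `nhentai-master.py` does with such a file; the raw string is a made-up sample, not content from the repository.

```
# Sketch of the separator normalization performed by parse_sites() above.
# raw_sites stands in for the contents of sites.txt (hypothetical sample IDs).
raw_sites = "263184, 263183\n263182\t263181"
for sep in ("\t", "\r", "\n", " "):
    raw_sites = raw_sites.replace(sep, ",")
ids = [part.strip() for part in raw_sites.split(",") if part.strip()]
print(ids)  # ['263184', '263183', '263182', '263181']
```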
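
The crawler above turns each gallery ID into direct image URLs via `getHtml()` (which returns the page count and the internal media id) and expands each search URL from `sites_url.txt` into gallery IDs via `getUrlHtml()`. A minimal usage sketch, assuming the two helpers defined above are available in the current session (the hyphen in `nhentai-master.py` prevents a plain `import`) and that the network and the sample gallery `263184` are reachable:

```
# Gallery ID -> (page count, internal media id), then the image URL pattern
# that CrawlerScheduler._download_media() builds from them.
max_page, media_id = getHtml("263184")
base_url = "https://i.nhentai.net/galleries/" + media_id + "/{0}.jpg"
print(max_page, base_url.format(1))

# Search-results URL (the sample line from sites_url.txt) -> gallery IDs,
# one per gallery shown on that results page.
ids = getUrlHtml("https://nhentai.net/search/?q=akane+shinjou++rikka+takarada")
print(ids)
```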
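
The `__main__` block above loads `proxies.json` (shipped as an empty object) and passes the parsed value straight to `requests.get(..., proxies=...)`, so a filled-in file should be a requests-style mapping from protocol to proxy URL. A minimal sketch of such a configuration; the address and port are placeholders, not values from the repository:

```
import json

# Hypothetical proxies.json content; json.load() in __main__ expects a mapping
# like this, and requests accepts it directly through its `proxies` argument.
sample_proxies = {
    "http": "http://127.0.0.1:1080",   # placeholder proxy address and port
    "https": "http://127.0.0.1:1080",
}
print(json.dumps(sample_proxies, indent=4))
```

Note that only the image downloads made with `requests` honor this setting; the page scraping in `getHtml()` and `getUrlHtml()` goes through `urllib.request` and ignores `proxies.json`.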