├── README.md ├── README_CN.md ├── nhentai-master.exe ├── nhentai-master.py ├── proxies.json ├── sites.txt └── sites_url.txt /README.md: -------------------------------------------------------------------------------- 1 | nhentai-master 2 | =============== 3 | The main code for this project comes from [tumblr-crawler](https://github.com/dixudx/tumblr-crawler). 4 | That project was built to easily grab pictures and videos from Tumblr, and I modified it into this doujinshi downloader. 5 | 6 | 7 | This is a [Python](https://www.python.org) script that lets you easily download photos from nhentai.net. 8 | 9 | ## 中文版教程请[移步这里](./README_CN.md) 10 | 11 | 12 | ### Use sites.txt 13 | 14 | Open the file `sites.txt` in a text editor and add the galleries you want to download, separated by comma/space/tab/newline (multiple lines are fine). Use only the numeric gallery ID, without the `https://nhentai.net/g/` prefix. For example, if you want to download `https://nhentai.net/g/263184/` and `https://nhentai.net/g/263183/`, compose the file like this: 15 | 16 | ``` 17 | 263184 18 | 263183 19 | ``` 20 | 21 | Then save the file and run `python nhentai-master.py`, 22 | or just double-click `nhentai-master.exe`. 23 | 24 | 25 | ### Use sites_url.txt 26 | 27 | If you search for, say, Akane Shinjou and Rikka Takarada and want to download all of the search results, copy the URL from your browser's address bar into this file, for example: 28 | 29 | ``` 30 | https://nhentai.net/search/?q=akane+shinjou++rikka+takarada 31 | ``` 32 | 33 | Then save the file and run `python nhentai-master.py`, 34 | or just double-click `nhentai-master.exe`. -------------------------------------------------------------------------------- /README_CN.md: -------------------------------------------------------------------------------- 1 | nhentai-master 2 | =============== 3 | 4 | 这个项目主要代码来自于[tumblr-crawler](https://github.com/dixudx/tumblr-crawler). 5 | 那个项目用于便捷获取tumblr中的图片和视频,我把它修改成了这个本子下载器 6 | 7 | 8 | 这是一个[Python](https://www.python.org)的脚本,配置运行后下载nhentai中的本子 9 | 或者可以直接使用编译好的.exe文件 10 | 11 | ## 配置和运行 12 | 13 | 有两种方式来指定你要下载的本子,一是编辑`sites.txt`填入本子编号,二是编辑`sites_url.txt`填入搜索结果页地址(也可以直接用命令行参数指定).
14 | 15 | ### 第一种方法:编辑sites.txt文件 16 | 17 | 打开文件`sites.txt`,把你想要下载的本子编号写进去,以逗号/空格/Tab/回车分隔,可以多行,不需要`https://nhentai.net/g/`前缀.例如,如果你要下载`https://nhentai.net/g/263184/`,`https://nhentai.net/g/263183/`,这个文件应该是这样的: 18 | 19 | ``` 20 | 263184 21 | 263183 22 | ``` 23 | 24 | 然后保存文件,在终端(terminal)里面运行`python nhentai-master.py` 25 | 或者直接点击运行nhentai-master.exe 26 | 27 | 28 | ### 第二种方法:编辑sites_url.txt文件 29 | 当你搜索akane shinjou和rikka takarada并且需要下载所有的搜索结果时,复制浏览器地址栏中的网址并写入该文件,例如: 30 | 31 | ``` 32 | https://nhentai.net/search/?q=akane+shinjou++rikka+takarada 33 | ``` 34 | 35 | 然后保存文件,在终端(terminal)里面运行`python nhentai-master.py` 36 | 或者直接点击运行nhentai-master.exe 37 | 38 | 39 | 40 | 41 | -------------------------------------------------------------------------------- /nhentai-master.exe: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/2311596178/nhentai-master/6229899326056d68070a1ab0c5bd15c24f9f3c8e/nhentai-master.exe -------------------------------------------------------------------------------- /nhentai-master.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import os 4 | import sys 5 | import requests 6 | import xmltodict 7 | from six.moves import queue as Queue 8 | from threading import Thread 9 | import re 10 | import json 11 | import ssl 12 | import urllib.request 13 | import urllib.error 14 | import urllib.parse 15 | from bs4 import BeautifulSoup 16 | 17 | 18 | # Request timeout in seconds 19 | TIMEOUT = 10 20 | 21 | # Retry times 22 | RETRY = 5 23 | 24 | # Medium index number to start from 25 | START = 0 26 | 27 | # Number of photos/videos per page 28 | MEDIA_NUM = 50 29 | 30 | # Number of concurrent download threads 31 | THREADS = 20 32 | 33 | 34 | 35 | def video_hd_match(): 36 | hd_pattern = re.compile(r'.*"hdUrl":("([^\s,]*)"|false),') 37 | 38 | def match(video_player): 39 | hd_match = hd_pattern.match(video_player) 40 | try: 41 | if hd_match is not None and hd_match.group(1) != 'false': 42 | return hd_match.group(2).replace('\\', '') 43 | except: 44 | return None 45 | return match 46 | 47 | 48 | def video_default_match(): 49 | default_pattern = re.compile(r'.*src="(\S*)" ', re.DOTALL) 50 | 51 | def match(video_player): 52 | default_match = default_pattern.match(video_player) 53 | if default_match is not None: 54 | try: 55 | return default_match.group(1) 56 | except: 57 | return None 58 | return match 59 | 60 | 61 | class DownloadWorker(Thread): 62 | def __init__(self, queue, proxies=None): 63 | Thread.__init__(self) 64 | self.queue = queue 65 | self.proxies = proxies 66 | self._register_regex_match_rules() 67 | 68 | def run(self): 69 | print("DownloadWorker.run()") 70 | while True: 71 | medium_type, post, target_folder = self.queue.get() 72 | self.download(medium_type, post, target_folder) 73 | self.queue.task_done() 74 | 75 | def download(self, medium_type, post, target_folder): 76 | try: 77 | medium_url = post 78 | 79 | print("medium_url=" + medium_url) 80 | if medium_url is not None: 81 | self._download(medium_type, medium_url, target_folder) 82 | except TypeError: 83 | print("TypeError") 84 | pass 85 | 86 | # can register different regex match rules 87 | def _register_regex_match_rules(self): 88 | # will iterate all the rules 89 | # the first matched result will be returned 90 | self.regex_rules = [video_hd_match(), video_default_match()] 91 | 92 | def _handle_medium_url(self, medium_type, post): 93 | try: 94 | if medium_type == "photo": 95 | return
post["photo-url"][0]["#text"] 96 | 97 | if medium_type == "video": 98 | video_player = post["video-player"][1]["#text"] 99 | for regex_rule in self.regex_rules: 100 | matched_url = regex_rule(video_player) 101 | if matched_url is not None: 102 | return matched_url 103 | else: 104 | raise Exception 105 | except: 106 | raise TypeError("Unable to find the right url for downloading. " 107 | "Please open a new issue on " 108 | "https://github.com/dixudx/tumblr-crawler/" 109 | "issues/new attached with below information:\n\n" 110 | "%s" % post) 111 | 112 | def _download(self, medium_type, medium_url, target_folder): 113 | medium_name = medium_url.split("/")[-1].split("?")[0] 114 | if medium_type == "video": 115 | if not medium_name.startswith("tumblr"): 116 | medium_name = "_".join([medium_url.split("/")[-2], 117 | medium_name]) 118 | 119 | medium_name += ".mp4" 120 | medium_url = 'https://vt.tumblr.com/' + medium_name 121 | 122 | file_path = os.path.join(target_folder, medium_name) 123 | if not os.path.isfile(file_path): 124 | print("Downloading %s from %s.\n" % (medium_name, 125 | medium_url)) 126 | retry_times = 0 127 | while retry_times < RETRY: 128 | try: 129 | resp = requests.get(medium_url, 130 | stream=True, 131 | proxies=self.proxies, 132 | timeout=TIMEOUT) 133 | if resp.status_code == 403: 134 | retry_times = RETRY 135 | print("Access Denied when retrieve %s.\n" % medium_url) 136 | raise Exception("Access Denied") 137 | with open(file_path, 'wb') as fh: 138 | for chunk in resp.iter_content(chunk_size=1024): 139 | fh.write(chunk) 140 | break 141 | except: 142 | # try again 143 | pass 144 | retry_times += 1 145 | else: 146 | try: 147 | os.remove(file_path) 148 | except OSError: 149 | pass 150 | print("Failed to retrieve %s from %s.\n" % (medium_type, 151 | medium_url)) 152 | 153 | 154 | def getHtml(site): 155 | # ssl._create_default_https_context = ssl._create_unverified_context 156 | headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} 157 | url = "https://nhentai.net/g/{0}/1/" 158 | urlhtml = url.format(site) 159 | 160 | req = urllib.request.Request(url=urlhtml, headers=headers) 161 | data = urllib.request.urlopen(req).read() 162 | soup = BeautifulSoup(data, "html.parser") 163 | 164 | section = soup.find_all('section', class_='pagination') 165 | tag = section[0] 166 | 167 | tag2 = tag.find_all('a')[3] 168 | page_url = tag2.get('href') 169 | max_page = page_url.split("/")[3] 170 | 171 | img_class = soup.find_all("img", class_="fit-horizontal") 172 | imgurl = img_class[0].get("src") 173 | imgpath = imgurl.split("/")[4] 174 | 175 | return max_page, imgpath 176 | 177 | 178 | def getUrlHtml(site): 179 | # ssl._create_default_https_context = ssl._create_unverified_context 180 | headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} 181 | urlhtml = site 182 | 183 | req = urllib.request.Request(url=urlhtml, headers=headers) 184 | data = urllib.request.urlopen(req).read() 185 | soup = BeautifulSoup(data, "html.parser") 186 | 187 | section = soup.find_all('div', class_='gallery') 188 | sitelist = [] 189 | for tag in section: 190 | ahref = tag.find_all('a')[0] 191 | href = ahref.get('href') 192 | href = href.split("/")[2] 193 | sitelist.append(href) 194 | return sitelist 195 | 196 | 197 | class CrawlerScheduler(object): 198 | 199 | def __init__(self, sites, proxies=None): 200 | self.sites = sites 201 | self.proxies = proxies 202 | self.queue = Queue.Queue() 203 | self.scheduling() 204 | 205 | def 
scheduling(self): 206 | # create workers 207 | for x in range(THREADS): 208 | worker = DownloadWorker(self.queue, 209 | proxies=self.proxies) 210 | # Setting daemon to True will let the main thread exit 211 | # even though the workers are blocking 212 | worker.daemon = True 213 | worker.start() 214 | 215 | for site in self.sites: 216 | print("site=" + site) 217 | self.download_media(site) 218 | 219 | def download_media(self, site): 220 | self.download_photos(site) 221 | # self.download_videos(site) 222 | 223 | def download_videos(self, site): 224 | self._download_media(site, "video", START) 225 | # wait for the queue to finish processing all the tasks from one 226 | # single site 227 | self.queue.join() 228 | print("Finished downloading all the videos from %s" % site) 229 | 230 | def download_photos(self, site): 231 | self._download_media(site, "photo", START) 232 | # wait for the queue to finish processing all the tasks from one 233 | # single site 234 | self.queue.join() 235 | print("Finished downloading all the photos from %s" % site) 236 | 237 | def _download_media(self, site, medium_type, start): 238 | current_folder = os.getcwd() 239 | target_folder = os.path.join(current_folder, site) 240 | if not os.path.isdir(target_folder): 241 | os.mkdir(target_folder) 242 | maxPage, imgpath = getHtml(site) 243 | base_url = "https://i.nhentai.net/galleries/" + imgpath + "/{0}.jpg" 244 | num = int(maxPage) 245 | print(base_url) 246 | for i in range(num): 247 | photo_path = base_url.format(i + 1) 248 | self.queue.put((medium_type, photo_path, target_folder)) 249 | # while True: 250 | # media_url = base_url.format(site, medium_type, MEDIA_NUM, start) 251 | # response = requests.get(media_url, 252 | # proxies=self.proxies) 253 | # if response.status_code == 404: 254 | # print("Site %s does not exist" % site) 255 | # break 256 | # 257 | # print("media_url=" + media_url) 258 | # try: 259 | # xml_cleaned = re.sub(u'[^\x20-\x7f]+', 260 | # u'', response.content.decode('utf-8')) 261 | # data = xmltodict.parse(xml_cleaned) 262 | # posts = data["tumblr"]["posts"]["post"] 263 | # 264 | # for post in posts: 265 | # try: 266 | # # if post has photoset, walk into photoset for each photo 267 | # 268 | # photoset = post["photoset"]["photo"] 269 | # 270 | # for photo in photoset: 271 | # 272 | # self.queue.put((medium_type, photo, target_folder)) 273 | # except: 274 | # # select the largest resolution 275 | # # usually in the first element 276 | # self.queue.put((medium_type, post, target_folder)) 277 | # start += MEDIA_NUM 278 | # except KeyError: 279 | # break 280 | # except UnicodeDecodeError: 281 | # print("Cannot decode response data from URL %s" % media_url) 282 | # continue 283 | # except: 284 | # print("Unknown xml-vulnerabilities from URL %s" % media_url) 285 | # continue 286 | 287 | 288 | def usage(): 289 | print("1. Please create the file sites.txt under this same directory.\n" 290 | "2. In sites.txt, specify the nhentai gallery IDs you want to download, " 291 | "separated by comma/space/tab/CR. Multiple lines are accepted.\n" 292 | "3. 
Save the file and retry.\n\n" 293 | "Sample file content:\n263184,263183\n\n" 294 | "Or use command line options:\n\n" 295 | "Sample:\npython nhentai-master.py 263184,263183\n\n\n") 296 | print(u"未找到sites.txt文件,请创建.\n" 297 | u"请在文件中指定nhentai本子编号,并以 逗号/空格/Tab/回车 分割,支持多行.\n" 298 | u"保存文件并重试.\n\n" 299 | u"例子: 263184,263183\n\n" 300 | u"或者直接使用命令行参数指定本子编号\n" 301 | u"例子: python nhentai-master.py 263184,263183") 302 | 303 | 304 | def illegal_json(): 305 | print("Illegal JSON format in file 'proxies.json'.\n" 306 | "Please make sure the file contains a valid JSON object.\n" 307 | "And go to http://jsonlint.com/ for validation.\n\n\n") 308 | print(u"文件proxies.json格式非法.\n" 309 | u"请确保文件内容是合法的JSON对象.\n" 310 | u"然后去 http://jsonlint.com/ 进行验证.") 311 | 312 | 313 | def parse_sites(filename): 314 | with open(filename, "r") as f: 315 | raw_sites = f.read().rstrip().lstrip() 316 | 317 | raw_sites = raw_sites.replace("\t", ",") \ 318 | .replace("\r", ",") \ 319 | .replace("\n", ",") \ 320 | .replace(" ", ",") 321 | raw_sites = raw_sites.split(",") 322 | 323 | sites = list() 324 | for raw_site in raw_sites: 325 | site = raw_site.lstrip().rstrip() 326 | if site: 327 | sites.append(site) 328 | return sites 329 | 330 | 331 | if __name__ == "__main__": 332 | cur_dir = os.path.dirname(os.path.realpath(__file__)) 333 | sites = None 334 | sitesUrl = None 335 | 336 | proxies = None 337 | proxy_path = os.path.join(cur_dir, "proxies.json") 338 | if os.path.exists(proxy_path): 339 | with open(proxy_path, "r") as fj: 340 | try: 341 | proxies = json.load(fj) 342 | if proxies is not None and len(proxies) > 0: 343 | print("You are using proxies.\n%s" % proxies) 344 | except: 345 | illegal_json() 346 | sys.exit(1) 347 | 348 | if len(sys.argv) < 2: 349 | # check the sites file 350 | filename = os.path.join(cur_dir, "sites.txt") 351 | urlFileName = os.path.join(cur_dir, "sites_url.txt") 352 | if os.path.exists(filename): 353 | sites = parse_sites(filename) 354 | else: 355 | usage() 356 | sys.exit(1) 357 | if os.path.exists(urlFileName): 358 | sitesUrl = parse_sites(urlFileName) 359 | else: 360 | # sites_url.txt is optional; fall back to an empty list 361 | sitesUrl = [] 362 | else: 363 | sites = [s for s in sys.argv[1].split(",") if not s.startswith("http")] 364 | sitesUrl = [s for s in sys.argv[1].split(",") if s.startswith("http")] 365 | 366 | for site in sitesUrl: 367 | print("site=" + site) 368 | sites2 = getUrlHtml(site) 369 | sites = sites + sites2 370 | 371 | if len(sites) == 0 or sites[0] == "": 372 | usage() 373 | sys.exit(1) 374 | NUM = 0 375 | print(sites) 376 | CrawlerScheduler(sites, proxies=proxies) 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | -------------------------------------------------------------------------------- /proxies.json: -------------------------------------------------------------------------------- 1 | { 2 | } 3 | -------------------------------------------------------------------------------- /sites.txt: -------------------------------------------------------------------------------- 1 | 263184 2 | 263183 -------------------------------------------------------------------------------- /sites_url.txt: -------------------------------------------------------------------------------- 1 | https://nhentai.net/search/?q=akane+shinjou++rikka+takarada --------------------------------------------------------------------------------
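
Both READMEs above say that entries in `sites.txt` may be separated by commas, spaces, tabs, or newlines. The sketch below mirrors what `parse_sites()` in `nhentai-master.py` does with such a file; the raw string is a made-up sample, not content from the repository.

```
# Sketch of the separator normalization performed by parse_sites() above.
# raw_sites stands in for the contents of sites.txt (hypothetical sample IDs).
raw_sites = "263184, 263183\n263182\t263181"
for sep in ("\t", "\r", "\n", " "):
    raw_sites = raw_sites.replace(sep, ",")
ids = [part.strip() for part in raw_sites.split(",") if part.strip()]
print(ids)  # ['263184', '263183', '263182', '263181']
```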
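
The crawler above turns each gallery ID into direct image URLs via `getHtml()` (which returns the page count and the internal media id) and expands each search URL from `sites_url.txt` into gallery IDs via `getUrlHtml()`. A minimal usage sketch, assuming the two helpers defined above are available in the current session (the hyphen in `nhentai-master.py` prevents a plain `import`) and that the network and the sample gallery `263184` are reachable:

```
# Gallery ID -> (page count, internal media id), then the image URL pattern
# that CrawlerScheduler._download_media() builds from them.
max_page, media_id = getHtml("263184")
base_url = "https://i.nhentai.net/galleries/" + media_id + "/{0}.jpg"
print(max_page, base_url.format(1))

# Search-results URL (the sample line from sites_url.txt) -> gallery IDs,
# one per gallery shown on that results page.
ids = getUrlHtml("https://nhentai.net/search/?q=akane+shinjou++rikka+takarada")
print(ids)
```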
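
The `__main__` block above loads `proxies.json` (shipped as an empty object) and passes the parsed value straight to `requests.get(..., proxies=...)`, so a filled-in file should be a requests-style mapping from protocol to proxy URL. A minimal sketch of such a configuration; the address and port are placeholders, not values from the repository:

```
import json

# Hypothetical proxies.json content; json.load() in __main__ expects a mapping
# like this, and requests accepts it directly through its `proxies` argument.
sample_proxies = {
    "http": "http://127.0.0.1:1080",   # placeholder proxy address and port
    "https": "http://127.0.0.1:1080",
}
print(json.dumps(sample_proxies, indent=4))
```

Note that only the image downloads made with `requests` honor this setting; the page scraping in `getHtml()` and `getUrlHtml()` goes through `urllib.request` and ignores `proxies.json`.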