├── LiberationSerif-BoldItalic.ttf
├── README.md
├── httpscreenshot.py
├── install-dependencies.sh
├── masshttp.sh
├── requirements.txt
└── screenshotClustering
    ├── cluster.py
    └── popup.js

/LiberationSerif-BoldItalic.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/foxglovesec/httpscreenshot/c0f38700e5c3c105270de623299f91612a4483e5/LiberationSerif-BoldItalic.ttf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# httpscreenshot

### Installation on Ubuntu

#### Via Script

Run the `install-dependencies.sh` script as root.

This script has been tested on Ubuntu 14.04.

#### Manually

    apt-get install swig swig2.0 libssl-dev python-dev python-pip
    pip install -r requirements.txt

If you run into: 'module' object has no attribute 'PhantomJS', then `pip install selenium` (or `pip install --upgrade selenium`).

If installing on Kali Linux, PhantomJS might not be in the repositories; you can download it from https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.8-linux-x86_64.tar.bz2 and symlink it into `/usr/bin` like so:

    sudo ln -s /path/to/phantomjs /usr/bin/phantomjs

## README and Use Cases

HTTPScreenshot is a tool for grabbing screenshots and HTML of large numbers of websites. The goal is for it to be both thorough and fast, goals which can sometimes oppose each other.

Before getting into documentation - this is what I USUALLY use for options if I want to screenshot a bunch of sites:

    ./httpscreenshot.py -i <gnmap file> -p -w 40 -a -vH

Notice there are a ton of worker threads (40). This can be problematic, so I make up for failures that may have been caused by too many threads with a second run:

    ./httpscreenshot.py -i <gnmap file> -p -w 5 -a -vH

YMMV

The options are as follows:

    -h, --help            show this help message and exit
    -l LIST, --list LIST  List of input URLs
    -i INPUT, --input INPUT
                          nmap gnmap output file
    -p, --headless        Run in headless mode (using phantomjs)
    -w WORKERS, --workers WORKERS
                          number of threads
    -t TIMEOUT, --timeout TIMEOUT
                          time to wait for page load before killing the browser
    -v, --verbose         turn on verbose debugging
    -a, --autodetect      Automatically detect if listening services are HTTP or
                          HTTPS. Ignores NMAP service detection and URL schemes.
    -vH, --vhosts         Attempt to scrape hostnames from SSL certificates and
                          add these to the URL queue
    -dB DNS_BRUTE, --dns_brute DNS_BRUTE
                          Specify a DNS subdomain wordlist for bruteforcing on
                          wildcard SSL certs
    -r RETRIES, --retries RETRIES
                          Number of retries if a URL fails or times out
    -tG, --trygui         Try to fetch the page with Firefox when headless fails
    -sF, --smartfetch     Enables smart fetching to reduce network traffic; also
                          increases speed if certain conditions are met.
    -pX PROXY, --proxy PROXY
                          SOCKS5 Proxy in host:port format

Some of the above options have non-obvious use cases, so the following provides some more detail:

-l, --list -> Takes as input a file with a simple list of input URLs in the format "http(s)://\<host\>"

-i, --input -> Takes a gnmap file as input. This includes masscan gnmap output.

-p, --headless -> I find myself using this option more and more. By default the script "drives" Firefox. As the number of threads increases this becomes really ugly - 20 or 30 Firefox windows open at once. This option uses "phantomjs", which doesn't have a GUI but will still do a decent job parsing JavaScript.

-w, --workers -> The number of threads to use. Increase for more speed. The list of input URLs is automatically shuffled to avoid hammering IP addresses that are close to each other, when possible. If you add too many threads, you might start seeing timeouts in responses - adjust for your network and machine.

-t, --timeout -> How long to wait for a response from the server before calling it quits.

-v, --verbose -> Will spit out some extra debugging output.

-a, --autodetect -> Without this option enabled, HTTPScreenshot will behave as follows:

> If a LIST of URLs is specified as input, sites with scheme "http://" are treated as non-SSL and sites with scheme "https://" are treated as SSL-enabled.

> For GNMAP input the script will scrape the input and try to use any SSL detection performed by nmap. Unfortunately this is unreliable; nmap doesn't always like to tell you that something is SSL-enabled. Further, masscan doesn't do any version or service detection.

> The -a or --autodetect option throws away all SSL hints from the input file and tries to detect on its own.
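Under the hood the check is simple: if the service will hand over an SSL certificate, the URL is treated as HTTPS; otherwise it falls back to plain HTTP. A rough sketch of that idea is below (illustrative only - the helper name and target are made up; the script's own version lives in `autodetectRequest()` in `httpscreenshot.py` and adds an alarm-based timeout and retry handling):

    import ssl

    def guess_scheme(host, port):
        # Hypothetical helper: if the port serves up a certificate, call it
        # HTTPS; any failure means we assume plain HTTP instead.
        try:
            ssl.get_server_certificate((host, port))
            return 'https'
        except Exception:
            return 'http'

    print(guess_scheme('192.168.1.30', 8443))  # e.g. prints 'https' if a cert came back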
-vH, --vhosts -> Often when visiting websites by their IP address (e.g.: https://192.168.1.30), we will receive a different page than expected, or an error. This is because the site is expecting a certain "virtual host" or hostname instead of the IP address; sometimes a single HTTP server will respond with many different pages for different hostnames.

> For plaintext "http" websites, we can use reverse DNS, BING reverse IP search, etc. to try to find the hostnames associated with an IP address. This is not currently a feature in HTTPScreenshot, but may be implemented later.

> For SSL-enabled "https" sites, this can be a little easier. The SSL certificate will provide us with a hint at the domain name in the CN field. In the "subject alt names" field of the certificate, when it exists, we may get a whole list of other domain names potentially associated with this IP. Often these are in the form "\*.google.com" (a wildcard certificate), but sometimes they will be linked to a single hostname only, like "www.google.com".

> The -vH or --vhosts flag will, for each SSL-enabled website, extract the hostnames from the CN and subject alt names fields and add them to the list of URLs to be screenshotted. For wildcard certificates, the "\*." part of the name is dropped.

-dB, --dns_brute -> Must be used with -vH for it to make sense. This flag specifies a file containing a list of potential subdomains. For any wildcard certificate, e.g.: "\*.google.com", HTTPScreenshot will try to bruteforce valid subdomains and add them to the list of URLs to be screenshotted. A rough sketch of the certificate scraping and wildcard bruteforcing is shown after these option descriptions.

-r, --retries -> Sometimes Firefox or PhantomJS times out when fetching a page. This could be due to a number of factors - sometimes you just have too many threads going, a network hiccup, etc. This specifies the number of times to "retry" a given host when it fails.

-tG, --trygui -> Upon failure to fetch the page with the headless browser (PhantomJS), will pop open Firefox and try again.

-sF, --smartfetch -> Enables smart fetching to reduce network traffic; also increases speed if certain conditions are met.
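For reference, the certificate scraping and wildcard brute-forcing behind -vH and -dB can be sketched roughly as follows. This is a simplified illustration rather than the script's exact code - the helper names are made up - but the approach (fetch the PEM certificate, parse it with M2Crypto, regex out the CN and DNS names, then resolve wordlist candidates for wildcard entries) mirrors what `doGet()` in `httpscreenshot.py` does:

    import re
    import socket
    import ssl
    import M2Crypto

    def scrape_cert_hostnames(host, port=443):
        # Fetch the certificate the server presents and pull hostnames out of
        # the subject CN and any subjectAltName DNS entries.
        pem = ssl.get_server_certificate((host, port))
        x509 = M2Crypto.X509.load_cert_string(pem)
        names = re.findall(r"CN=([^\s]+)", x509.get_subject().as_text())
        try:
            altNames = x509.get_ext('subjectAltName').get_value()
            names.extend(re.findall(r"DNS:([^,]*)", altNames))
        except LookupError:
            pass  # certificate has no subjectAltName extension
        return names

    def expand_wildcards(names, subdomains):
        # For wildcard entries like *.example.com, try each word from the -dB
        # wordlist and keep only the candidates that actually resolve in DNS.
        hosts = []
        for name in names:
            if name.startswith('*.'):
                base = name[2:]
                hosts.append(base)
                for sub in subdomains:
                    candidate = sub.strip() + '.' + base
                    try:
                        socket.gethostbyname(candidate)
                        hosts.append(candidate)
                    except socket.error:
                        pass
            else:
                hosts.append(name)
        return hosts

Every hostname recovered this way is queued as an additional https:// URL to be screenshotted.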
102 | 103 | -------------------------------------------------------------------------------- /httpscreenshot.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | ''' 4 | Installation on Ubuntu: 5 | apt-get install python-requests python-m2crypto phantomjs 6 | If you run into: 'module' object has no attribute 'PhantomJS' 7 | then pip install selenium (or pip install --upgrade selenium) 8 | ''' 9 | 10 | from selenium import webdriver 11 | from urlparse import urlparse 12 | from random import shuffle 13 | from PIL import Image 14 | from PIL import ImageDraw 15 | from PIL import ImageFont 16 | import multiprocessing 17 | import Queue 18 | import argparse 19 | import sys 20 | import traceback 21 | import os.path 22 | import ssl 23 | import M2Crypto 24 | import re 25 | import time 26 | import signal 27 | import shutil 28 | import hashlib 29 | 30 | try: 31 | from urllib.parse import quote 32 | except: 33 | from urllib import quote 34 | 35 | try: 36 | import requesocks as requests 37 | except: 38 | print "requesocks library not found - proxy support will not be available" 39 | import requests 40 | 41 | reload(sys) 42 | sys.setdefaultencoding("utf8") 43 | 44 | 45 | def timeoutFn(func, args=(), kwargs={}, timeout_duration=1, default=None): 46 | import signal 47 | 48 | class TimeoutError(Exception): 49 | pass 50 | 51 | def handler(signum, frame): 52 | raise TimeoutError() 53 | 54 | # set the timeout handler 55 | signal.signal(signal.SIGALRM, handler) 56 | signal.alarm(timeout_duration) 57 | try: 58 | result = func(*args, **kwargs) 59 | except TimeoutError as exc: 60 | result = default 61 | finally: 62 | signal.alarm(0) 63 | 64 | return result 65 | 66 | 67 | def addUrlsForService(host, urlList, servicesList, scheme): 68 | if(servicesList == None or servicesList == []): 69 | return 70 | for service in servicesList: 71 | state = service.findPreviousSibling("state") 72 | if(state != None and state != [] and state['state'] == 'open'): 73 | urlList.append(scheme+host+':'+str(service.parent['portid'])) 74 | 75 | 76 | def detectFileType(inFile): 77 | #Check to see if file is of type gnmap 78 | firstLine = inFile.readline() 79 | secondLine = inFile.readline() 80 | thirdLine = inFile.readline() 81 | 82 | #Be polite and reset the file pointer 83 | inFile.seek(0) 84 | 85 | if ((firstLine.find('nmap') != -1 or firstLine.find('Masscan') != -1) and thirdLine.find('Host:') != -1): 86 | #Looks like a gnmap file - this wont be true for other nmap output types 87 | #Check to see if -sV flag was used, if not, warn 88 | if(firstLine.find('-sV') != -1 or firstLine.find('-A') != -1): 89 | return 'gnmap' 90 | else: 91 | print("Nmap version detection not used! Discovery module may miss some hosts!") 92 | return 'gnmap' 93 | else: 94 | return None 95 | 96 | 97 | def parseGnmap(inFile, autodetect): 98 | ''' 99 | Parse a gnmap file into a dictionary. The dictionary key is the ip address or hostname. 100 | Each key item is a list of ports and whether or not that port is https/ssl. 
For example: 101 | >>> targets 102 | {'127.0.0.1': [[443, True], [8080, False]]} 103 | ''' 104 | targets = {} 105 | for hostLine in inFile: 106 | currentTarget = [] 107 | #Pull out the IP address (or hostnames) and HTTP service ports 108 | fields = hostLine.split(' ') 109 | ip = fields[1] #not going to regex match this with ip address b/c could be a hostname 110 | for item in fields: 111 | #Make sure we have an open port with an http type service on it 112 | if (item.find('http') != -1 or autodetect) and re.findall('\d+/open',item): 113 | port = None 114 | https = False 115 | ''' 116 | nmap has a bunch of ways to list HTTP like services, for example: 117 | 8089/open/tcp//ssl|http 118 | 8000/closed/tcp//http-alt/// 119 | 8008/closed/tcp//http/// 120 | 8080/closed/tcp//http-proxy// 121 | 443/open/tcp//ssl|https?/// 122 | 8089/open/tcp//ssl|http 123 | Since we want to detect them all, let's just match on the word http 124 | and make special cases for things containing https and ssl when we 125 | construct the URLs. 126 | ''' 127 | port = item.split('/')[0] 128 | 129 | if item.find('https') != -1 or item.find('ssl') != -1: 130 | https = True 131 | #Add the current service item to the currentTarget list for this host 132 | currentTarget.append([port,https]) 133 | 134 | if(len(currentTarget) > 0): 135 | targets[ip] = currentTarget 136 | return targets 137 | 138 | 139 | def setupBrowserProfile(headless,proxy): 140 | browser = None 141 | if(proxy is not None): 142 | service_args=['--ignore-ssl-errors=true','--ssl-protocol=tlsv1','--proxy='+proxy,'--proxy-type=socks5'] 143 | else: 144 | service_args=['--ignore-ssl-errors=true','--ssl-protocol=tlsv1'] 145 | 146 | while(browser is None): 147 | try: 148 | if(not headless): 149 | fp = webdriver.FirefoxProfile() 150 | fp.set_preference("webdriver.accept.untrusted.certs",True) 151 | fp.set_preference("security.enable_java", False) 152 | fp.set_preference("webdriver.load.strategy", "fast"); 153 | if(proxy is not None): 154 | proxyItems = proxy.split(":") 155 | fp.set_preference("network.proxy.socks",proxyItems[0]) 156 | fp.set_preference("network.proxy.socks_port",int(proxyItems[1])) 157 | fp.set_preference("network.proxy.type",1) 158 | browser = webdriver.Firefox(fp) 159 | else: 160 | browser = webdriver.PhantomJS(service_args=service_args, executable_path="phantomjs") 161 | except Exception as e: 162 | print e 163 | time.sleep(1) 164 | continue 165 | return browser 166 | 167 | 168 | def writeImage(text, filename, fontsize=40, width=1024, height=200): 169 | image = Image.new("RGBA", (width,height), (255,255,255)) 170 | draw = ImageDraw.Draw(image) 171 | if (os.path.exists("/usr/share/httpscreenshot/LiberationSerif-BoldItalic.ttf")): 172 | font_path = "/usr/share/httpscreenshot/LiberationSerif-BoldItalic.ttf" 173 | else: 174 | font_path = os.path.dirname(os.path.realpath(__file__))+"/LiberationSerif-BoldItalic.ttf" 175 | font = ImageFont.truetype(font_path, fontsize) 176 | draw.text((10, 0), text, (0,0,0), font=font) 177 | image.save(filename) 178 | 179 | 180 | def worker(urlQueue, tout, debug, headless, doProfile, vhosts, subs, extraHosts, tryGUIOnFail, smartFetch,proxy): 181 | if(debug): 182 | print '[*] Starting worker' 183 | 184 | browser = None 185 | try: 186 | browser = setupBrowserProfile(headless,proxy) 187 | 188 | except: 189 | print "[-] Oh no! 
Couldn't create the browser, Selenium blew up" 190 | exc_type, exc_value, exc_traceback = sys.exc_info() 191 | lines = traceback.format_exception(exc_type, exc_value, exc_traceback) 192 | print ''.join('!! ' + line for line in lines) 193 | return 194 | 195 | while True: 196 | #Try to get a URL from the Queue 197 | if urlQueue.qsize() > 0: 198 | try: 199 | curUrl = urlQueue.get(timeout=tout) 200 | except Queue.Empty: 201 | continue 202 | print '[+] '+str(urlQueue.qsize())+' URLs remaining' 203 | screenshotName = quote(curUrl[0], safe='') 204 | if(debug): 205 | print '[+] Got URL: '+curUrl[0] 206 | print '[+] screenshotName: '+screenshotName 207 | if(os.path.exists(screenshotName+".png")): 208 | if(debug): 209 | print "[-] Screenshot already exists, skipping" 210 | continue 211 | else: 212 | if(debug): 213 | print'[-] URL queue is empty, quitting.' 214 | browser.quit() 215 | return 216 | 217 | try: 218 | if(doProfile): 219 | [resp,curUrl] = autodetectRequest(curUrl, timeout=tout, vhosts=vhosts, urlQueue=urlQueue, subs=subs, extraHosts=extraHosts,proxy=proxy) 220 | else: 221 | resp = doGet(curUrl, verify=False, timeout=tout, vhosts=vhosts, urlQueue=urlQueue, subs=subs, extraHosts=extraHosts,proxy=proxy) 222 | if(resp is not None and resp.status_code == 401): 223 | print curUrl[0]+" Requires HTTP Basic Auth" 224 | f = open(screenshotName+".html",'w') 225 | f.write(resp.headers.get('www-authenticate','NONE')) 226 | f.write('Basic Auth') 227 | f.close() 228 | writeImage(resp.headers.get('www-authenticate','NO WWW-AUTHENTICATE HEADER'),screenshotName+".png") 229 | continue 230 | 231 | elif(resp is not None): 232 | if(resp.text is not None): 233 | resp_hash = hashlib.md5(resp.text).hexdigest() 234 | else: 235 | resp_hash = None 236 | 237 | if smartFetch and resp_hash is not None and resp_hash in hash_basket: 238 | #We have this exact same page already, copy it instead of grabbing it again 239 | print "[+] Pre-fetch matches previously imaged service, no need to do it again!" 
240 | shutil.copy2(hash_basket[resp_hash]+".html",screenshotName+".html") 241 | shutil.copy2(hash_basket[resp_hash]+".png",screenshotName+".png") 242 | else: 243 | if smartFetch: 244 | hash_basket[resp_hash] = screenshotName 245 | 246 | 247 | browser.set_window_size(1024, 768) 248 | browser.set_page_load_timeout((tout)) 249 | old_url = browser.current_url 250 | browser.get(curUrl[0].strip()) 251 | if(browser.current_url == old_url): 252 | print "[-] Error fetching in browser but successfully fetched with Requests: "+curUrl[0] 253 | if(headless): 254 | if(debug): 255 | print "[+] Trying with sslv3 instead of TLS - known phantomjs bug: "+curUrl[0] 256 | browser2 = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'], executable_path="phantomjs") 257 | old_url = browser2.current_url 258 | browser2.get(curUrl[0].strip()) 259 | if(browser2.current_url == old_url): 260 | if(debug): 261 | print "[-] Didn't work with SSLv3 either..."+curUrl[0] 262 | browser2.close() 263 | else: 264 | print '[+] Saving: '+screenshotName 265 | html_source = browser2.page_source 266 | f = open(screenshotName+".html",'w') 267 | f.write(html_source) 268 | f.close() 269 | browser2.save_screenshot(screenshotName+".png") 270 | browser2.close() 271 | continue 272 | 273 | if(tryGUIOnFail and headless): 274 | print "[+] Attempting to fetch with FireFox: "+curUrl[0] 275 | browser2 = setupBrowserProfile(False,proxy) 276 | old_url = browser2.current_url 277 | browser2.get(curUrl[0].strip()) 278 | if(browser2.current_url == old_url): 279 | print "[-] Error fetching in GUI browser as well..."+curUrl[0] 280 | browser2.close() 281 | continue 282 | else: 283 | print '[+] Saving: '+screenshotName 284 | html_source = browser2.page_source 285 | f = open(screenshotName+".html",'w') 286 | f.write(html_source) 287 | f.close() 288 | browser2.save_screenshot(screenshotName+".png") 289 | browser2.close() 290 | continue 291 | else: 292 | continue 293 | 294 | print '[+] Saving: '+screenshotName 295 | html_source = browser.page_source 296 | f = open(screenshotName+".html",'w') 297 | f.write(html_source) 298 | f.close() 299 | browser.save_screenshot(screenshotName+".png") 300 | 301 | except Exception as e: 302 | print e 303 | print '[-] Something bad happened with URL: '+curUrl[0] 304 | if(curUrl[2] > 0): 305 | curUrl[2] = curUrl[2] - 1; 306 | urlQueue.put(curUrl) 307 | if(debug): 308 | exc_type, exc_value, exc_traceback = sys.exc_info() 309 | lines = traceback.format_exception(exc_type, exc_value, exc_traceback) 310 | print ''.join('!! ' + line for line in lines) 311 | browser.quit() 312 | browser = setupBrowserProfile(headless,proxy) 313 | continue 314 | 315 | 316 | def doGet(*args, **kwargs): 317 | url = args[0] 318 | doVhosts = kwargs.pop('vhosts' ,None) 319 | urlQueue = kwargs.pop('urlQueue' ,None) 320 | subs = kwargs.pop('subs' ,None) 321 | extraHosts = kwargs.pop('extraHosts',None) 322 | proxy = kwargs.pop('proxy',None) 323 | 324 | kwargs['allow_redirects'] = False 325 | session = requests.session() 326 | if(proxy is not None): 327 | session.proxies={'http':'socks5://'+proxy,'https':'socks5://'+proxy} 328 | resp = session.get(url[0],**kwargs) 329 | 330 | #If we have an https URL and we are configured to scrape hosts from the cert... 
331 | if(url[0].find('https') != -1 and url[1] == True): 332 | #Pull hostnames from cert, add as additional URLs and flag as not to pull certs 333 | host = urlparse(url[0]).hostname 334 | port = urlparse(url[0]).port 335 | if(port is None): 336 | port = 443 337 | names = [] 338 | try: 339 | cert = ssl.get_server_certificate((host,port),ssl_version=ssl.PROTOCOL_SSLv23) 340 | x509 = M2Crypto.X509.load_cert_string(cert.decode('string_escape')) 341 | subjText = x509.get_subject().as_text() 342 | names = re.findall("CN=([^\s]+)",subjText) 343 | altNames = x509.get_ext('subjectAltName').get_value() 344 | names.extend(re.findall("DNS:([^,]*)",altNames)) 345 | except: 346 | pass 347 | 348 | for name in names: 349 | if(name.find('*.') != -1): 350 | for sub in subs: 351 | try: 352 | sub = sub.strip() 353 | hostname = name.replace('*.',sub+'.') 354 | if(hostname not in extraHosts): 355 | extraHosts[hostname] = 1 356 | address = socket.gethostbyname(hostname) 357 | urlQueue.put(['https://'+hostname+':'+str(port),False,url[2]]) 358 | print '[+] Discovered subdomain '+address 359 | except: 360 | pass 361 | name = name.replace('*.','') 362 | if(name not in extraHosts): 363 | extraHosts[name] = 1 364 | urlQueue.put(['https://'+name+':'+str(port),False,url[2]]) 365 | print '[+] Added host '+name 366 | else: 367 | if (name not in extraHosts): 368 | extraHosts[name] = 1 369 | urlQueue.put(['https://'+name+':'+str(port),False,url[2]]) 370 | print '[+] Added host '+name 371 | return resp 372 | else: 373 | return resp 374 | 375 | 376 | def autodetectRequest(url, timeout, vhosts=False, urlQueue=None, subs=None, extraHosts=None,proxy=None): 377 | '''Takes a URL, ignores the scheme. Detect if the host/port is actually an HTTP or HTTPS 378 | server''' 379 | resp = None 380 | host = urlparse(url[0]).hostname 381 | port = urlparse(url[0]).port 382 | 383 | if(port is None): 384 | if('https' in url[0]): 385 | port = 443 386 | else: 387 | port = 80 388 | 389 | try: 390 | #cert = ssl.get_server_certificate((host,port)) 391 | 392 | cert = timeoutFn(ssl.get_server_certificate,kwargs={'addr':(host,port),'ssl_version':ssl.PROTOCOL_SSLv23},timeout_duration=3) 393 | 394 | if(cert is not None): 395 | if('https' not in url[0]): 396 | url[0] = url[0].replace('http','https') 397 | #print 'Got cert, changing to HTTPS '+url[0] 398 | 399 | else: 400 | url[0] = url[0].replace('https','http') 401 | #print 'Changing to HTTP '+url[0] 402 | 403 | 404 | except Exception as e: 405 | url[0] = url[0].replace('https','http') 406 | #print 'Changing to HTTP '+url[0] 407 | try: 408 | resp = doGet(url,verify=False, timeout=timeout, vhosts=vhosts, urlQueue=urlQueue, subs=subs, extraHosts=extraHosts, proxy=proxy) 409 | except Exception as e: 410 | print 'HTTP GET Error: '+str(e) 411 | print url[0] 412 | 413 | return [resp,url] 414 | 415 | 416 | def sslError(e): 417 | if('the handshake operation timed out' in str(e) or 'unknown protocol' in str(e) or 'Connection reset by peer' in str(e) or 'EOF occurred in violation of protocol' in str(e)): 418 | return True 419 | else: 420 | return False 421 | 422 | def signal_handler(signal, frame): 423 | print "[-] Ctrl-C received! Killing Thread(s)..." 
424 | os._exit(0) 425 | signal.signal(signal.SIGINT, signal_handler) 426 | 427 | if __name__ == '__main__': 428 | parser = argparse.ArgumentParser() 429 | 430 | parser.add_argument("-l","--list",help='List of input URLs') 431 | parser.add_argument("-i","--input",help='nmap gnmap output file') 432 | parser.add_argument("-p","--headless",action='store_true',default=False,help='Run in headless mode (using phantomjs)') 433 | parser.add_argument("-w","--workers",default=1,type=int,help='number of threads') 434 | parser.add_argument("-t","--timeout",type=int,default=10,help='time to wait for pageload before killing the browser') 435 | parser.add_argument("-v","--verbose",action='store_true',default=False,help='turn on verbose debugging') 436 | parser.add_argument("-a","--autodetect",action='store_true',default=False,help='Automatically detect if listening services are HTTP or HTTPS. Ignores NMAP service detction and URL schemes.') 437 | parser.add_argument("-vH","--vhosts",action='store_true',default=False,help='Attempt to scrape hostnames from SSL certificates and add these to the URL queue') 438 | parser.add_argument("-dB","--dns_brute",help='Specify a DNS subdomain wordlist for bruteforcing on wildcard SSL certs') 439 | parser.add_argument("-uL","--uri_list",help='Specify a list of URIs to fetch in addition to the root') 440 | parser.add_argument("-r","--retries",type=int,default=0,help='Number of retries if a URL fails or timesout') 441 | parser.add_argument("-tG","--trygui",action='store_true',default=False,help='Try to fetch the page with FireFox when headless fails') 442 | parser.add_argument("-sF","--smartfetch",action='store_true',default=False,help='Enables smart fetching to reduce network traffic, also increases speed if certain conditions are met.') 443 | parser.add_argument("-pX","--proxy",default=None,help='SOCKS5 Proxy in host:port format') 444 | 445 | 446 | args = parser.parse_args() 447 | 448 | if(len(sys.argv) < 2): 449 | parser.print_help() 450 | sys.exit(0) 451 | 452 | 453 | #read in the URI list if specificed 454 | uris = [''] 455 | if(args.uri_list != None): 456 | uris = open(args.uri_list,'r').readlines() 457 | uris.append('') 458 | 459 | if(args.input is not None): 460 | inFile = open(args.input,'r') 461 | if(detectFileType(inFile) == 'gnmap'): 462 | hosts = parseGnmap(inFile,args.autodetect) 463 | urls = [] 464 | for host,ports in hosts.items(): 465 | for port in ports: 466 | for uri in uris: 467 | url = '' 468 | if port[1] == True: 469 | url = ['https://'+host+':'+port[0]+uri.strip(),args.vhosts,args.retries] 470 | else: 471 | url = ['http://'+host+':'+port[0]+uri.strip(),args.vhosts,args.retries] 472 | urls.append(url) 473 | else: 474 | print 'Invalid input file - must be Nmap GNMAP' 475 | 476 | elif (args.list is not None): 477 | f = open(args.list,'r') 478 | lst = f.readlines() 479 | urls = [] 480 | for url in lst: 481 | urls.append([url.strip(),args.vhosts,args.retries]) 482 | else: 483 | print "No input specified" 484 | sys.exit(0) 485 | 486 | 487 | #shuffle the url list 488 | shuffle(urls) 489 | 490 | #read in the subdomain bruteforce list if specificed 491 | subs = [] 492 | if(args.dns_brute != None): 493 | subs = open(args.dns_brute,'r').readlines() 494 | 495 | #Fire up the workers 496 | urlQueue = multiprocessing.Queue() 497 | manager = multiprocessing.Manager() 498 | hostsDict = manager.dict() 499 | workers = [] 500 | hash_basket = {} 501 | 502 | for i in range(args.workers): 503 | p = multiprocessing.Process(target=worker, args=(urlQueue, args.timeout, 
args.verbose, args.headless, args.autodetect, args.vhosts, subs, hostsDict, args.trygui, args.smartfetch,args.proxy)) 504 | workers.append(p) 505 | p.start() 506 | 507 | for url in urls: 508 | urlQueue.put(url) 509 | 510 | for p in workers: 511 | p.join() 512 | 513 | -------------------------------------------------------------------------------- /install-dependencies.sh: -------------------------------------------------------------------------------- 1 | # Installation Script - tested on an ubuntu/trusty64 vagrant box 2 | 3 | # Show all commands being run 4 | #set -x 5 | 6 | # Error out if one fails 7 | set -e 8 | 9 | apt-get install -y swig swig2.0 libssl-dev python-dev 10 | 11 | # Newer version in PyPI 12 | #apt-get install -y python-requests 13 | 14 | # Newer version in PyPI 15 | #apt-get install -y python-m2crypto 16 | 17 | # Installing pillow from PIP for the latest 18 | #apt-get install -y python-pil 19 | 20 | # Install pip and install pytnon requirements through it 21 | apt-get install -y python-pip 22 | pip install -r requirements.txt 23 | 24 | # This binary is distributed with the code base, version is 25 | # more recent then the one in the ubuntu repo (1.9.1 vs 1.9.0) 26 | #apt-get install -y phantomjs 27 | 28 | # Grab the latest of phantomjs it directly from the source 29 | wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.8-linux-x86_64.tar.bz2 30 | 31 | phantom_md5sum=`md5sum phantomjs-1.9.8-linux-x86_64.tar.bz2 | cut -d' ' -f1` 32 | checksum="4ea7aa79e45fbc487a63ef4788a18ef7" 33 | 34 | if [ "$phantom_md5sum" != "$checksum" ] 35 | then 36 | echo "phantomjs checksum mismatch" 37 | exit 254 38 | fi 39 | 40 | tar xvf phantomjs-1.9.8-linux-x86_64.tar.bz2 41 | mv phantomjs-1.9.8-linux-x86_64/bin/phantomjs /usr/bin/phantomjs 42 | 43 | -------------------------------------------------------------------------------- /masshttp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | /root/masscan/bin/masscan -p80,443 -iL networks.txt -oG http.gnmap --rate 100000 4 | mkdir httpscreenshots 5 | cd httpscreenshots 6 | python ~/tools/httpscreenshot.py -i ../http.gnmap -p -t 30 -w 50 -a -vH -r 1 7 | python ~/tools/httpscreenshot.py -i ../http.gnmap -p -t 10 -w 10 -a -vH 8 | cd .. 
9 | python screenshotClustering/cluster.py -d httpscreenshots/ 10 | 11 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | m2crypto 2 | requests 3 | selenium 4 | beautifulsoup4 5 | pillow 6 | requesocks 7 | -------------------------------------------------------------------------------- /screenshotClustering/cluster.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import os 3 | import sys 4 | import argparse 5 | from collections import OrderedDict 6 | from collections import defaultdict 7 | import re 8 | import time 9 | from bs4 import BeautifulSoup 10 | 11 | try: 12 | from urllib.parse import quote,unquote 13 | except: 14 | from urllib import quote,unquote 15 | 16 | def addAttrToBag(attrName,url,link,wordBags,soup): 17 | for tag in soup.findAll('',{attrName:True}): 18 | if(isinstance(tag[attrName],str) or isinstance(tag[attrName],unicode)): 19 | tagStr = tag[attrName].encode('utf-8').strip() 20 | elif(isinstance(tag[attrName],list)): 21 | tagStr = tag[attrName][0].encode('utf-8').strip() 22 | else: 23 | print '[-] Strange tag type detected - '+str(type(tag[attrName])) 24 | tagStr = 'XXXXXXXXX' 25 | 26 | if(tagStr != ''): 27 | if(link): 28 | tagStr = linkStrip(tagStr) 29 | if(tagStr in wordBags[url]): 30 | wordBags[url][tagStr] += 1 31 | else: 32 | wordBags[url][tagStr] = 1 33 | 34 | def addTagToBag(tagName,url,link,wordBags,soup): 35 | for tag in soup.findAll(tagName): 36 | if(tag is not None): 37 | tagStr = tag.string 38 | if(link): 39 | tagStr = linkStrip(tagStr) 40 | if(tagStr in wordBags[url]): 41 | wordBags[url][tagStr] += 1 42 | else: 43 | wordBags[url][tagStr] = 1 44 | 45 | def linkStrip(linkStr): 46 | if(linkStr.find('/') != -1): 47 | linkStr = linkStr[linkStr.rfind('/'):] 48 | return linkStr 49 | 50 | def createWordBags(htmlList): 51 | wordBags={} 52 | 53 | for f in htmlList: 54 | htmlContent = open(f,'r').read() 55 | wordBags[f]={} 56 | soup = BeautifulSoup(htmlContent) 57 | addAttrToBag('name',f,False,wordBags,soup) 58 | addAttrToBag('href',f,True,wordBags,soup) 59 | addAttrToBag('src',f,True,wordBags,soup) 60 | addAttrToBag('id',f,False,wordBags,soup) 61 | addAttrToBag('class',f,False,wordBags,soup) 62 | addTagToBag('title',f,False,wordBags,soup) 63 | addTagToBag('h1',f,False,wordBags,soup) 64 | 65 | return wordBags 66 | 67 | def getNumWords(wordBag): 68 | count = 0 69 | for value in wordBag.values(): 70 | count = count+value 71 | return count 72 | 73 | def computeScore(wordBag1,wordBag2,debug=0): 74 | commonWords = 0 75 | wordBag1Length = getNumWords(wordBag1) 76 | wordBag2Length = getNumWords(wordBag2) 77 | 78 | 79 | if(len(wordBag1) == 0 and len(wordBag2) == 0): 80 | if debug: 81 | print 'Both have no words - return true' 82 | return 1 83 | elif (len(wordBag1) == 0 or len(wordBag2) == 0): 84 | if debug: 85 | print 'One has no words - return false' 86 | return 0 87 | 88 | for word in wordBag1.keys(): 89 | commonWords = commonWords+min(wordBag1[word],wordBag2.get(word,0)) 90 | 91 | score = (float(commonWords)/float(wordBag1Length)*(float(commonWords)/float(wordBag2Length))) 92 | 93 | if debug: 94 | print "Common Words: "+str(commonWords) 95 | print "WordBag1 Length: "+str(wordBag1Length) 96 | print "WordBag2 Length: "+str(wordBag2Length) 97 | print score 98 | 99 | return score 100 | 101 | def createClusters(wordBags,threshold): 102 | clusterData = {} 103 | i = 0 104 | siteList = 
wordBags.keys()
    for i in range(0,len(siteList)):
        clusterData[siteList[i]] = [threshold, i]

    for i in range(0,len(siteList)):
        for j in range(i+1,len(siteList)):
            score = computeScore(wordBags[siteList[i]],wordBags[siteList[j]])
            if (clusterData[siteList[i]][0] <= threshold and score > clusterData[siteList[i]][0]):
                clusterData[siteList[i]][1] = i
                clusterData[siteList[i]][0] = score
            if (clusterData[siteList[j]][0] <= threshold and score > clusterData[siteList[j]][0]):
                clusterData[siteList[j]][1] = i
                clusterData[siteList[j]][0] = score

    return clusterData

def getScopeHtml(scopeFile):
    if scopeFile is None:
        return None
    scope = open(scopeFile,'r')
    scopeText = 'Scope:'
    for line in scope.readlines():
        scopeText = scopeText + line+''
    return scopeText

def renderClusterHtml(clust,width,height,scopeFile=None):
    html = ''
    scopeHtml = getScopeHtml(scopeFile)
    header = '''
Web Application Catalog

Web Application Catalog
    '''
    if(scopeHtml is not None):
        header = header+scopeHtml
    header = header + '''
Catalog:
    '''
    html = html+''

    for cluster,siteList in clust.iteritems():
        html=html+''
        screenshotName = quote(siteList[0][0:-4], safe='./')
        html=html+''
        for site in siteList:
            screenshotName = quote(site[0:-5], safe='./')
            html=html+''
        html=html+''
        html=html+''+unquote(unquote(screenshotName[2:]).decode("utf-8")).decode("utf-8")+'
' 156 | footer = '' 157 | 158 | return [header,html,footer] 159 | def printJS(): 160 | js = """ 161 | function popUp(e,src) 162 | { 163 | x = e.clientX; 164 | y = e.clientY; 165 | 166 | var img = document.createElement("img"); 167 | img.src = src; 168 | img.setAttribute("class","popUp"); 169 | img.setAttribute("style","position:fixed;left:"+(x+15)+";top:"+0+";background-color:white"); 170 | //img.setAttribute("onmouseout","clearPopup(event)") 171 | // This next line will just add it to the tag 172 | document.body.appendChild(img); 173 | } 174 | 175 | function clearPopup() 176 | { 177 | var popUps = document.getElementsByClassName('popUp'); 178 | while(popUps[0]) { 179 | popUps[0].parentNode.removeChild(popUps[0]); 180 | } 181 | } 182 | """ 183 | 184 | f = open('popup.js','w') 185 | f.write(js) 186 | f.close() 187 | 188 | def doCluster(htmlList): 189 | siteWordBags = createWordBags(htmlList) 190 | clusterData = createClusters(siteWordBags,0.6) 191 | 192 | clusterDict = {} 193 | for site,data in clusterData.iteritems(): 194 | if data[1] in clusterDict: 195 | clusterDict[data[1]].append(site) 196 | else: 197 | clusterDict[data[1]]=[site] 198 | return clusterDict 199 | 200 | 201 | '''For a diff report we want 3 sections: 202 | 1. New sites 203 | 2. Removed sites 204 | 2. Changed sites 205 | ''' 206 | def doDiff(htmlList,diffList): 207 | '''Find new sites - this is easy just find any html filenames that are present in diffDir 208 | and not htmlList''' 209 | newList=[] 210 | for newItem in diffList: 211 | found = False 212 | newItemName = newItem[newItem.rfind('/')+1:] 213 | for oldItem in htmlList: 214 | oldItemName = oldItem[oldItem.rfind('/')+1:] 215 | if(oldItemName == newItemName): 216 | found = True 217 | break; 218 | if(not found): 219 | newList.append(newItem) 220 | 221 | '''Now find items that were in the previous scan but not the new''' 222 | oldList=[] 223 | for oldItem in htmlList: 224 | found = False 225 | oldItemName = oldItem[oldItem.rfind('/')+1:] 226 | for newItem in diffList: 227 | newItemName = newItem[newItem.rfind('/')+1:] 228 | if(newItemName == oldItemName): 229 | found = True 230 | break; 231 | if(not found): 232 | oldList.append(oldItem) 233 | 234 | '''Now find items that changed between the two scans''' 235 | changedList=[] 236 | oldPath = htmlList[0][:htmlList[0].rfind('/')+1] 237 | newPath = diffList[0][:diffList[0].rfind('/')+1] 238 | 239 | for newItem in diffList: 240 | newItemName = newItem[newItem.rfind('/')+1:] 241 | oldItem = oldPath+newItemName 242 | if(os.path.isfile(oldItem)): 243 | compare = [newItem,oldItem] 244 | wordBags = createWordBags(compare) 245 | score = computeScore(wordBags[newItem],wordBags[oldItem]) 246 | if(score < 0.6): 247 | changedList.append(newItem) 248 | 249 | return [newList,oldList,changedList] 250 | 251 | if __name__ == '__main__': 252 | parser = argparse.ArgumentParser() 253 | parser.add_argument("-d","--dir",help='Directory containing HTML files') 254 | parser.add_argument("-dF","--diff",default=None,help='Directory containing HTML files from a previous run to diff against') 255 | parser.add_argument("-t","--thumbsize",default='200x200',help='Thumbnail dimensions (e.g: 200x200).') 256 | parser.add_argument("-o","--output",default='clusters.html',help='Specify the HTML output filename') 257 | parser.add_argument("-s","--scope",default=None,help='Specify a scope file to include in the HTML output report') 258 | 259 | args = parser.parse_args() 260 | #create a list of images 261 | path = args.dir 262 | 263 | if(path is None): 264 | 
        parser.print_help()
        sys.exit(0)

    htmlList = []
    htmlRegex = re.compile('.*html.*')
    for fileName in os.listdir(path):
        if(htmlRegex.match(fileName)):
            htmlList.append(path+fileName)

    n = len(htmlList)


    width = int(args.thumbsize[0:args.thumbsize.find('x')])
    height = int(args.thumbsize[args.thumbsize.rfind('x')+1:])

    html = ''
    if(args.diff is not None):
        diffList = []
        for fileName in os.listdir(args.diff):
            if(htmlRegex.match(fileName)):
                diffList.append(args.diff+fileName)

        lists = doDiff(htmlList,diffList)

        newClusterDict = doCluster(lists[0])
        removedClusterDict = doCluster(lists[1])
        changedClusterDict = doCluster(lists[2])

        htmlList = renderClusterHtml(newClusterDict,width,height,scopeFile=args.scope)
        newClusterTable = htmlList[1]

        htmlList = renderClusterHtml(removedClusterDict,width,height,scopeFile=args.scope)
        oldClusterTable = htmlList[1]

        htmlList = renderClusterHtml(changedClusterDict,width,height,scopeFile=args.scope)
        changedClusterTable = htmlList[1]

        html = htmlList[0]
        html = html+"New Websites"
        html = html+newClusterTable
        html = html+"Deleted Websites"
        html = html+oldClusterTable
        html = html+"Changed Websites
" 307 | html = html+changedClusterTable 308 | html = html+htmlList[2] 309 | 310 | else: 311 | clusterDict = doCluster(htmlList) 312 | htmlList = renderClusterHtml(clusterDict,width,height,scopeFile=args.scope) 313 | html = htmlList[0]+htmlList[1]+htmlList[2] 314 | 315 | f = open(args.output,'w') 316 | f.write(html) 317 | printJS() 318 | 319 | -------------------------------------------------------------------------------- /screenshotClustering/popup.js: -------------------------------------------------------------------------------- 1 | 2 | function popUp(e,src) 3 | { 4 | x = e.clientX; 5 | y = e.clientY; 6 | 7 | var img = document.createElement("img"); 8 | img.src = src; 9 | img.setAttribute("class","popUp"); 10 | img.setAttribute("style","position:fixed;left:"+(x+15)+";top:"+0+";background-color:white"); 11 | //img.setAttribute("onmouseout","clearPopup(event)") 12 | // This next line will just add it to the tag 13 | document.body.appendChild(img); 14 | } 15 | 16 | function clearPopup() 17 | { 18 | var popUps = document.getElementsByClassName('popUp'); 19 | while(popUps[0]) { 20 | popUps[0].parentNode.removeChild(popUps[0]); 21 | } 22 | } 23 | --------------------------------------------------------------------------------