├── LiberationSerif-BoldItalic.ttf
├── README.md
├── httpscreenshot.py
├── install-dependencies.sh
├── masshttp.sh
├── requirements.txt
└── screenshotClustering
    ├── cluster.py
    └── popup.js

/LiberationSerif-BoldItalic.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/foxglovesec/httpscreenshot/c0f38700e5c3c105270de623299f91612a4483e5/LiberationSerif-BoldItalic.ttf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# httpscreenshot

### Installation on Ubuntu

#### Via Script

Run the `install-dependencies.sh` script as root.

This script has been tested on Ubuntu 14.04.

#### Manually

    apt-get install swig swig2.0 libssl-dev python-dev python-pip
    pip install -r requirements.txt

If you run into: 'module' object has no attribute 'PhantomJS', then `pip install selenium` (or `pip install --upgrade selenium`).

If installing on Kali Linux, PhantomJS might not be in the repositories; you can download it from https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.8-linux-x86_64.tar.bz2 and symlink it into `/usr/bin` like so:

    sudo ln -s /path/to/phantomjs /usr/bin/phantomjs

## README and Use Cases

HTTPScreenshot is a tool for grabbing screenshots and HTML of large numbers of websites. The goal is for it to be both thorough and fast, goals which can sometimes oppose each other.

Before getting into documentation - this is what I USUALLY use for options if I want to screenshot a bunch of sites:

    ./httpscreenshot.py -i <gnmap file> -p -w 40 -a -vH

Notice there are a ton of worker threads (40). This can be problematic, so I make up for failures that may have been caused by too many threads with a second run:

    ./httpscreenshot.py -i <gnmap file> -p -w 5 -a -vH

YMMV

The options are as follows:

    -h, --help            show this help message and exit
    -l LIST, --list LIST  List of input URLs
    -i INPUT, --input INPUT
                          nmap gnmap output file
    -p, --headless        Run in headless mode (using phantomjs)
    -w WORKERS, --workers WORKERS
                          number of threads
    -t TIMEOUT, --timeout TIMEOUT
                          time to wait for page load before killing the browser
    -v, --verbose         turn on verbose debugging
    -a, --autodetect      Automatically detect if listening services are HTTP or
                          HTTPS. Ignores NMAP service detection and URL schemes.
    -vH, --vhosts         Attempt to scrape hostnames from SSL certificates and
                          add these to the URL queue
    -dB DNS_BRUTE, --dns_brute DNS_BRUTE
                          Specify a DNS subdomain wordlist for bruteforcing on
                          wildcard SSL certs
    -r RETRIES, --retries RETRIES
                          Number of retries if a URL fails or times out
    -tG, --trygui         Try to fetch the page with Firefox when headless fails
    -sF, --smartfetch     Enables smart fetching to reduce network traffic; also
                          increases speed if certain conditions are met.
    -pX PROXY, --proxy PROXY
                          SOCKS5 Proxy in host:port format

Some of the above options have non-obvious use cases, so the following provides some more detail:

-l, --list -> Takes as input a file with a simple list of input URLs in the format "http(s)://\<host\>"

-i, --input -> Takes a gnmap file as input. This includes masscan gnmap output.

-p, --headless -> I find myself using this option more and more. By default the script "drives" Firefox. As the number of threads increases this becomes really ugly - 20 or 30 Firefox windows open at once. This option uses "phantomjs", which doesn't have a GUI but will still do a decent job parsing JavaScript.

-w, --workers -> The number of threads to use. Increase for more speed. The list of input URLs is automatically shuffled to avoid hammering IP addresses that are close to each other, when possible. If you add too many threads, you might start seeing timeouts in responses - adjust for your network and machine.

-t, --timeout -> How long to wait for a response from the server before calling it quits.

-v, --verbose -> Will spit out some extra debugging output.

-a, --autodetect -> Without this option enabled, HTTPScreenshot will behave as follows:

> If a LIST of URLs is specified as input, sites with scheme "http://" are treated as non-SSL and sites with scheme "https://" are treated as SSL-enabled.

> For GNMAP input the script will scrape the input and try to use any SSL detection performed by nmap. Unfortunately this is unreliable; nmap doesn't always like to tell you that something is SSL-enabled. Further, masscan doesn't do any version or service detection.

> The -a or --autodetect option throws away all SSL hints from the input file and tries to detect on its own.
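Under the hood the check is simple: if the service will hand over an SSL certificate, the URL is treated as HTTPS; otherwise it falls back to plain HTTP. A rough sketch of that idea is below (illustrative only - the helper name and target are made up; the script's own version lives in `autodetectRequest()` in `httpscreenshot.py` and adds an alarm-based timeout and retry handling):

    import ssl

    def guess_scheme(host, port):
        # Hypothetical helper: if the port serves up a certificate, call it
        # HTTPS; any failure means we assume plain HTTP instead.
        try:
            ssl.get_server_certificate((host, port))
            return 'https'
        except Exception:
            return 'http'

    print(guess_scheme('192.168.1.30', 8443))  # e.g. prints 'https' if a cert came back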
-vH, --vhosts -> Often when visiting websites by their IP address (e.g.: https://192.168.1.30), we will receive a different page than expected, or an error. This is because the site is expecting a certain "virtual host" or hostname instead of the IP address; sometimes a single HTTP server will respond with many different pages for different hostnames.

> For plaintext "http" websites, we can use reverse DNS, BING reverse IP search, etc. to try to find the hostnames associated with an IP address. This is not currently a feature in HTTPScreenshot, but may be implemented later.

> For SSL-enabled "https" sites, this can be a little easier. The SSL certificate will provide us with a hint at the domain name in the CN field. In the "subject alt names" field of the certificate, when it exists, we may get a whole list of other domain names potentially associated with this IP. Often these are in the form "\*.google.com" (a wildcard certificate), but sometimes they will be linked to a single hostname only, like "www.google.com".

> The -vH or --vhosts flag will, for each SSL-enabled website, extract the hostnames from the CN and subject alt names fields and add them to the list of URLs to be screenshotted. For wildcard certificates, the "\*." part of the name is dropped.

-dB, --dns_brute -> Must be used with -vH for it to make sense. This flag specifies a file containing a list of potential subdomains. For any wildcard certificate, e.g.: "\*.google.com", HTTPScreenshot will try to bruteforce valid subdomains and add them to the list of URLs to be screenshotted. A rough sketch of the certificate scraping and wildcard bruteforcing is shown after these option descriptions.

-r, --retries -> Sometimes Firefox or PhantomJS times out when fetching a page. This could be due to a number of factors - sometimes you just have too many threads going, a network hiccup, etc. This specifies the number of times to "retry" a given host when it fails.

-tG, --trygui -> Upon failure to fetch the page with the headless browser (PhantomJS), will pop open Firefox and try again.

-sF, --smartfetch -> Enables smart fetching to reduce network traffic; also increases speed if certain conditions are met.
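For reference, the certificate scraping and wildcard brute-forcing behind -vH and -dB can be sketched roughly as follows. This is a simplified illustration rather than the script's exact code - the helper names are made up - but the approach (fetch the PEM certificate, parse it with M2Crypto, regex out the CN and DNS names, then resolve wordlist candidates for wildcard entries) mirrors what `doGet()` in `httpscreenshot.py` does:

    import re
    import socket
    import ssl
    import M2Crypto

    def scrape_cert_hostnames(host, port=443):
        # Fetch the certificate the server presents and pull hostnames out of
        # the subject CN and any subjectAltName DNS entries.
        pem = ssl.get_server_certificate((host, port))
        x509 = M2Crypto.X509.load_cert_string(pem)
        names = re.findall(r"CN=([^\s]+)", x509.get_subject().as_text())
        try:
            altNames = x509.get_ext('subjectAltName').get_value()
            names.extend(re.findall(r"DNS:([^,]*)", altNames))
        except LookupError:
            pass  # certificate has no subjectAltName extension
        return names

    def expand_wildcards(names, subdomains):
        # For wildcard entries like *.example.com, try each word from the -dB
        # wordlist and keep only the candidates that actually resolve in DNS.
        hosts = []
        for name in names:
            if name.startswith('*.'):
                base = name[2:]
                hosts.append(base)
                for sub in subdomains:
                    candidate = sub.strip() + '.' + base
                    try:
                        socket.gethostbyname(candidate)
                        hosts.append(candidate)
                    except socket.error:
                        pass
            else:
                hosts.append(name)
        return hosts

Every hostname recovered this way is queued as an additional https:// URL to be screenshotted.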
102 | 103 | -------------------------------------------------------------------------------- /httpscreenshot.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | ''' 4 | Installation on Ubuntu: 5 | apt-get install python-requests python-m2crypto phantomjs 6 | If you run into: 'module' object has no attribute 'PhantomJS' 7 | then pip install selenium (or pip install --upgrade selenium) 8 | ''' 9 | 10 | from selenium import webdriver 11 | from urlparse import urlparse 12 | from random import shuffle 13 | from PIL import Image 14 | from PIL import ImageDraw 15 | from PIL import ImageFont 16 | import multiprocessing 17 | import Queue 18 | import argparse 19 | import sys 20 | import traceback 21 | import os.path 22 | import ssl 23 | import M2Crypto 24 | import re 25 | import time 26 | import signal 27 | import shutil 28 | import hashlib 29 | 30 | try: 31 | from urllib.parse import quote 32 | except: 33 | from urllib import quote 34 | 35 | try: 36 | import requesocks as requests 37 | except: 38 | print "requesocks library not found - proxy support will not be available" 39 | import requests 40 | 41 | reload(sys) 42 | sys.setdefaultencoding("utf8") 43 | 44 | 45 | def timeoutFn(func, args=(), kwargs={}, timeout_duration=1, default=None): 46 | import signal 47 | 48 | class TimeoutError(Exception): 49 | pass 50 | 51 | def handler(signum, frame): 52 | raise TimeoutError() 53 | 54 | # set the timeout handler 55 | signal.signal(signal.SIGALRM, handler) 56 | signal.alarm(timeout_duration) 57 | try: 58 | result = func(*args, **kwargs) 59 | except TimeoutError as exc: 60 | result = default 61 | finally: 62 | signal.alarm(0) 63 | 64 | return result 65 | 66 | 67 | def addUrlsForService(host, urlList, servicesList, scheme): 68 | if(servicesList == None or servicesList == []): 69 | return 70 | for service in servicesList: 71 | state = service.findPreviousSibling("state") 72 | if(state != None and state != [] and state['state'] == 'open'): 73 | urlList.append(scheme+host+':'+str(service.parent['portid'])) 74 | 75 | 76 | def detectFileType(inFile): 77 | #Check to see if file is of type gnmap 78 | firstLine = inFile.readline() 79 | secondLine = inFile.readline() 80 | thirdLine = inFile.readline() 81 | 82 | #Be polite and reset the file pointer 83 | inFile.seek(0) 84 | 85 | if ((firstLine.find('nmap') != -1 or firstLine.find('Masscan') != -1) and thirdLine.find('Host:') != -1): 86 | #Looks like a gnmap file - this wont be true for other nmap output types 87 | #Check to see if -sV flag was used, if not, warn 88 | if(firstLine.find('-sV') != -1 or firstLine.find('-A') != -1): 89 | return 'gnmap' 90 | else: 91 | print("Nmap version detection not used! Discovery module may miss some hosts!") 92 | return 'gnmap' 93 | else: 94 | return None 95 | 96 | 97 | def parseGnmap(inFile, autodetect): 98 | ''' 99 | Parse a gnmap file into a dictionary. The dictionary key is the ip address or hostname. 100 | Each key item is a list of ports and whether or not that port is https/ssl. 
For example: 101 | >>> targets 102 | {'127.0.0.1': [[443, True], [8080, False]]} 103 | ''' 104 | targets = {} 105 | for hostLine in inFile: 106 | currentTarget = [] 107 | #Pull out the IP address (or hostnames) and HTTP service ports 108 | fields = hostLine.split(' ') 109 | ip = fields[1] #not going to regex match this with ip address b/c could be a hostname 110 | for item in fields: 111 | #Make sure we have an open port with an http type service on it 112 | if (item.find('http') != -1 or autodetect) and re.findall('\d+/open',item): 113 | port = None 114 | https = False 115 | ''' 116 | nmap has a bunch of ways to list HTTP like services, for example: 117 | 8089/open/tcp//ssl|http 118 | 8000/closed/tcp//http-alt/// 119 | 8008/closed/tcp//http/// 120 | 8080/closed/tcp//http-proxy// 121 | 443/open/tcp//ssl|https?/// 122 | 8089/open/tcp//ssl|http 123 | Since we want to detect them all, let's just match on the word http 124 | and make special cases for things containing https and ssl when we 125 | construct the URLs. 126 | ''' 127 | port = item.split('/')[0] 128 | 129 | if item.find('https') != -1 or item.find('ssl') != -1: 130 | https = True 131 | #Add the current service item to the currentTarget list for this host 132 | currentTarget.append([port,https]) 133 | 134 | if(len(currentTarget) > 0): 135 | targets[ip] = currentTarget 136 | return targets 137 | 138 | 139 | def setupBrowserProfile(headless,proxy): 140 | browser = None 141 | if(proxy is not None): 142 | service_args=['--ignore-ssl-errors=true','--ssl-protocol=tlsv1','--proxy='+proxy,'--proxy-type=socks5'] 143 | else: 144 | service_args=['--ignore-ssl-errors=true','--ssl-protocol=tlsv1'] 145 | 146 | while(browser is None): 147 | try: 148 | if(not headless): 149 | fp = webdriver.FirefoxProfile() 150 | fp.set_preference("webdriver.accept.untrusted.certs",True) 151 | fp.set_preference("security.enable_java", False) 152 | fp.set_preference("webdriver.load.strategy", "fast"); 153 | if(proxy is not None): 154 | proxyItems = proxy.split(":") 155 | fp.set_preference("network.proxy.socks",proxyItems[0]) 156 | fp.set_preference("network.proxy.socks_port",int(proxyItems[1])) 157 | fp.set_preference("network.proxy.type",1) 158 | browser = webdriver.Firefox(fp) 159 | else: 160 | browser = webdriver.PhantomJS(service_args=service_args, executable_path="phantomjs") 161 | except Exception as e: 162 | print e 163 | time.sleep(1) 164 | continue 165 | return browser 166 | 167 | 168 | def writeImage(text, filename, fontsize=40, width=1024, height=200): 169 | image = Image.new("RGBA", (width,height), (255,255,255)) 170 | draw = ImageDraw.Draw(image) 171 | if (os.path.exists("/usr/share/httpscreenshot/LiberationSerif-BoldItalic.ttf")): 172 | font_path = "/usr/share/httpscreenshot/LiberationSerif-BoldItalic.ttf" 173 | else: 174 | font_path = os.path.dirname(os.path.realpath(__file__))+"/LiberationSerif-BoldItalic.ttf" 175 | font = ImageFont.truetype(font_path, fontsize) 176 | draw.text((10, 0), text, (0,0,0), font=font) 177 | image.save(filename) 178 | 179 | 180 | def worker(urlQueue, tout, debug, headless, doProfile, vhosts, subs, extraHosts, tryGUIOnFail, smartFetch,proxy): 181 | if(debug): 182 | print '[*] Starting worker' 183 | 184 | browser = None 185 | try: 186 | browser = setupBrowserProfile(headless,proxy) 187 | 188 | except: 189 | print "[-] Oh no! 
Couldn't create the browser, Selenium blew up" 190 | exc_type, exc_value, exc_traceback = sys.exc_info() 191 | lines = traceback.format_exception(exc_type, exc_value, exc_traceback) 192 | print ''.join('!! ' + line for line in lines) 193 | return 194 | 195 | while True: 196 | #Try to get a URL from the Queue 197 | if urlQueue.qsize() > 0: 198 | try: 199 | curUrl = urlQueue.get(timeout=tout) 200 | except Queue.Empty: 201 | continue 202 | print '[+] '+str(urlQueue.qsize())+' URLs remaining' 203 | screenshotName = quote(curUrl[0], safe='') 204 | if(debug): 205 | print '[+] Got URL: '+curUrl[0] 206 | print '[+] screenshotName: '+screenshotName 207 | if(os.path.exists(screenshotName+".png")): 208 | if(debug): 209 | print "[-] Screenshot already exists, skipping" 210 | continue 211 | else: 212 | if(debug): 213 | print'[-] URL queue is empty, quitting.' 214 | browser.quit() 215 | return 216 | 217 | try: 218 | if(doProfile): 219 | [resp,curUrl] = autodetectRequest(curUrl, timeout=tout, vhosts=vhosts, urlQueue=urlQueue, subs=subs, extraHosts=extraHosts,proxy=proxy) 220 | else: 221 | resp = doGet(curUrl, verify=False, timeout=tout, vhosts=vhosts, urlQueue=urlQueue, subs=subs, extraHosts=extraHosts,proxy=proxy) 222 | if(resp is not None and resp.status_code == 401): 223 | print curUrl[0]+" Requires HTTP Basic Auth" 224 | f = open(screenshotName+".html",'w') 225 | f.write(resp.headers.get('www-authenticate','NONE')) 226 | f.write('Basic Auth') 227 | f.close() 228 | writeImage(resp.headers.get('www-authenticate','NO WWW-AUTHENTICATE HEADER'),screenshotName+".png") 229 | continue 230 | 231 | elif(resp is not None): 232 | if(resp.text is not None): 233 | resp_hash = hashlib.md5(resp.text).hexdigest() 234 | else: 235 | resp_hash = None 236 | 237 | if smartFetch and resp_hash is not None and resp_hash in hash_basket: 238 | #We have this exact same page already, copy it instead of grabbing it again 239 | print "[+] Pre-fetch matches previously imaged service, no need to do it again!" 
240 | shutil.copy2(hash_basket[resp_hash]+".html",screenshotName+".html") 241 | shutil.copy2(hash_basket[resp_hash]+".png",screenshotName+".png") 242 | else: 243 | if smartFetch: 244 | hash_basket[resp_hash] = screenshotName 245 | 246 | 247 | browser.set_window_size(1024, 768) 248 | browser.set_page_load_timeout((tout)) 249 | old_url = browser.current_url 250 | browser.get(curUrl[0].strip()) 251 | if(browser.current_url == old_url): 252 | print "[-] Error fetching in browser but successfully fetched with Requests: "+curUrl[0] 253 | if(headless): 254 | if(debug): 255 | print "[+] Trying with sslv3 instead of TLS - known phantomjs bug: "+curUrl[0] 256 | browser2 = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'], executable_path="phantomjs") 257 | old_url = browser2.current_url 258 | browser2.get(curUrl[0].strip()) 259 | if(browser2.current_url == old_url): 260 | if(debug): 261 | print "[-] Didn't work with SSLv3 either..."+curUrl[0] 262 | browser2.close() 263 | else: 264 | print '[+] Saving: '+screenshotName 265 | html_source = browser2.page_source 266 | f = open(screenshotName+".html",'w') 267 | f.write(html_source) 268 | f.close() 269 | browser2.save_screenshot(screenshotName+".png") 270 | browser2.close() 271 | continue 272 | 273 | if(tryGUIOnFail and headless): 274 | print "[+] Attempting to fetch with FireFox: "+curUrl[0] 275 | browser2 = setupBrowserProfile(False,proxy) 276 | old_url = browser2.current_url 277 | browser2.get(curUrl[0].strip()) 278 | if(browser2.current_url == old_url): 279 | print "[-] Error fetching in GUI browser as well..."+curUrl[0] 280 | browser2.close() 281 | continue 282 | else: 283 | print '[+] Saving: '+screenshotName 284 | html_source = browser2.page_source 285 | f = open(screenshotName+".html",'w') 286 | f.write(html_source) 287 | f.close() 288 | browser2.save_screenshot(screenshotName+".png") 289 | browser2.close() 290 | continue 291 | else: 292 | continue 293 | 294 | print '[+] Saving: '+screenshotName 295 | html_source = browser.page_source 296 | f = open(screenshotName+".html",'w') 297 | f.write(html_source) 298 | f.close() 299 | browser.save_screenshot(screenshotName+".png") 300 | 301 | except Exception as e: 302 | print e 303 | print '[-] Something bad happened with URL: '+curUrl[0] 304 | if(curUrl[2] > 0): 305 | curUrl[2] = curUrl[2] - 1; 306 | urlQueue.put(curUrl) 307 | if(debug): 308 | exc_type, exc_value, exc_traceback = sys.exc_info() 309 | lines = traceback.format_exception(exc_type, exc_value, exc_traceback) 310 | print ''.join('!! ' + line for line in lines) 311 | browser.quit() 312 | browser = setupBrowserProfile(headless,proxy) 313 | continue 314 | 315 | 316 | def doGet(*args, **kwargs): 317 | url = args[0] 318 | doVhosts = kwargs.pop('vhosts' ,None) 319 | urlQueue = kwargs.pop('urlQueue' ,None) 320 | subs = kwargs.pop('subs' ,None) 321 | extraHosts = kwargs.pop('extraHosts',None) 322 | proxy = kwargs.pop('proxy',None) 323 | 324 | kwargs['allow_redirects'] = False 325 | session = requests.session() 326 | if(proxy is not None): 327 | session.proxies={'http':'socks5://'+proxy,'https':'socks5://'+proxy} 328 | resp = session.get(url[0],**kwargs) 329 | 330 | #If we have an https URL and we are configured to scrape hosts from the cert... 
331 | if(url[0].find('https') != -1 and url[1] == True): 332 | #Pull hostnames from cert, add as additional URLs and flag as not to pull certs 333 | host = urlparse(url[0]).hostname 334 | port = urlparse(url[0]).port 335 | if(port is None): 336 | port = 443 337 | names = [] 338 | try: 339 | cert = ssl.get_server_certificate((host,port),ssl_version=ssl.PROTOCOL_SSLv23) 340 | x509 = M2Crypto.X509.load_cert_string(cert.decode('string_escape')) 341 | subjText = x509.get_subject().as_text() 342 | names = re.findall("CN=([^\s]+)",subjText) 343 | altNames = x509.get_ext('subjectAltName').get_value() 344 | names.extend(re.findall("DNS:([^,]*)",altNames)) 345 | except: 346 | pass 347 | 348 | for name in names: 349 | if(name.find('*.') != -1): 350 | for sub in subs: 351 | try: 352 | sub = sub.strip() 353 | hostname = name.replace('*.',sub+'.') 354 | if(hostname not in extraHosts): 355 | extraHosts[hostname] = 1 356 | address = socket.gethostbyname(hostname) 357 | urlQueue.put(['https://'+hostname+':'+str(port),False,url[2]]) 358 | print '[+] Discovered subdomain '+address 359 | except: 360 | pass 361 | name = name.replace('*.','') 362 | if(name not in extraHosts): 363 | extraHosts[name] = 1 364 | urlQueue.put(['https://'+name+':'+str(port),False,url[2]]) 365 | print '[+] Added host '+name 366 | else: 367 | if (name not in extraHosts): 368 | extraHosts[name] = 1 369 | urlQueue.put(['https://'+name+':'+str(port),False,url[2]]) 370 | print '[+] Added host '+name 371 | return resp 372 | else: 373 | return resp 374 | 375 | 376 | def autodetectRequest(url, timeout, vhosts=False, urlQueue=None, subs=None, extraHosts=None,proxy=None): 377 | '''Takes a URL, ignores the scheme. Detect if the host/port is actually an HTTP or HTTPS 378 | server''' 379 | resp = None 380 | host = urlparse(url[0]).hostname 381 | port = urlparse(url[0]).port 382 | 383 | if(port is None): 384 | if('https' in url[0]): 385 | port = 443 386 | else: 387 | port = 80 388 | 389 | try: 390 | #cert = ssl.get_server_certificate((host,port)) 391 | 392 | cert = timeoutFn(ssl.get_server_certificate,kwargs={'addr':(host,port),'ssl_version':ssl.PROTOCOL_SSLv23},timeout_duration=3) 393 | 394 | if(cert is not None): 395 | if('https' not in url[0]): 396 | url[0] = url[0].replace('http','https') 397 | #print 'Got cert, changing to HTTPS '+url[0] 398 | 399 | else: 400 | url[0] = url[0].replace('https','http') 401 | #print 'Changing to HTTP '+url[0] 402 | 403 | 404 | except Exception as e: 405 | url[0] = url[0].replace('https','http') 406 | #print 'Changing to HTTP '+url[0] 407 | try: 408 | resp = doGet(url,verify=False, timeout=timeout, vhosts=vhosts, urlQueue=urlQueue, subs=subs, extraHosts=extraHosts, proxy=proxy) 409 | except Exception as e: 410 | print 'HTTP GET Error: '+str(e) 411 | print url[0] 412 | 413 | return [resp,url] 414 | 415 | 416 | def sslError(e): 417 | if('the handshake operation timed out' in str(e) or 'unknown protocol' in str(e) or 'Connection reset by peer' in str(e) or 'EOF occurred in violation of protocol' in str(e)): 418 | return True 419 | else: 420 | return False 421 | 422 | def signal_handler(signal, frame): 423 | print "[-] Ctrl-C received! Killing Thread(s)..." 
424 | os._exit(0) 425 | signal.signal(signal.SIGINT, signal_handler) 426 | 427 | if __name__ == '__main__': 428 | parser = argparse.ArgumentParser() 429 | 430 | parser.add_argument("-l","--list",help='List of input URLs') 431 | parser.add_argument("-i","--input",help='nmap gnmap output file') 432 | parser.add_argument("-p","--headless",action='store_true',default=False,help='Run in headless mode (using phantomjs)') 433 | parser.add_argument("-w","--workers",default=1,type=int,help='number of threads') 434 | parser.add_argument("-t","--timeout",type=int,default=10,help='time to wait for pageload before killing the browser') 435 | parser.add_argument("-v","--verbose",action='store_true',default=False,help='turn on verbose debugging') 436 | parser.add_argument("-a","--autodetect",action='store_true',default=False,help='Automatically detect if listening services are HTTP or HTTPS. Ignores NMAP service detction and URL schemes.') 437 | parser.add_argument("-vH","--vhosts",action='store_true',default=False,help='Attempt to scrape hostnames from SSL certificates and add these to the URL queue') 438 | parser.add_argument("-dB","--dns_brute",help='Specify a DNS subdomain wordlist for bruteforcing on wildcard SSL certs') 439 | parser.add_argument("-uL","--uri_list",help='Specify a list of URIs to fetch in addition to the root') 440 | parser.add_argument("-r","--retries",type=int,default=0,help='Number of retries if a URL fails or timesout') 441 | parser.add_argument("-tG","--trygui",action='store_true',default=False,help='Try to fetch the page with FireFox when headless fails') 442 | parser.add_argument("-sF","--smartfetch",action='store_true',default=False,help='Enables smart fetching to reduce network traffic, also increases speed if certain conditions are met.') 443 | parser.add_argument("-pX","--proxy",default=None,help='SOCKS5 Proxy in host:port format') 444 | 445 | 446 | args = parser.parse_args() 447 | 448 | if(len(sys.argv) < 2): 449 | parser.print_help() 450 | sys.exit(0) 451 | 452 | 453 | #read in the URI list if specificed 454 | uris = [''] 455 | if(args.uri_list != None): 456 | uris = open(args.uri_list,'r').readlines() 457 | uris.append('') 458 | 459 | if(args.input is not None): 460 | inFile = open(args.input,'r') 461 | if(detectFileType(inFile) == 'gnmap'): 462 | hosts = parseGnmap(inFile,args.autodetect) 463 | urls = [] 464 | for host,ports in hosts.items(): 465 | for port in ports: 466 | for uri in uris: 467 | url = '' 468 | if port[1] == True: 469 | url = ['https://'+host+':'+port[0]+uri.strip(),args.vhosts,args.retries] 470 | else: 471 | url = ['http://'+host+':'+port[0]+uri.strip(),args.vhosts,args.retries] 472 | urls.append(url) 473 | else: 474 | print 'Invalid input file - must be Nmap GNMAP' 475 | 476 | elif (args.list is not None): 477 | f = open(args.list,'r') 478 | lst = f.readlines() 479 | urls = [] 480 | for url in lst: 481 | urls.append([url.strip(),args.vhosts,args.retries]) 482 | else: 483 | print "No input specified" 484 | sys.exit(0) 485 | 486 | 487 | #shuffle the url list 488 | shuffle(urls) 489 | 490 | #read in the subdomain bruteforce list if specificed 491 | subs = [] 492 | if(args.dns_brute != None): 493 | subs = open(args.dns_brute,'r').readlines() 494 | 495 | #Fire up the workers 496 | urlQueue = multiprocessing.Queue() 497 | manager = multiprocessing.Manager() 498 | hostsDict = manager.dict() 499 | workers = [] 500 | hash_basket = {} 501 | 502 | for i in range(args.workers): 503 | p = multiprocessing.Process(target=worker, args=(urlQueue, args.timeout, 
args.verbose, args.headless, args.autodetect, args.vhosts, subs, hostsDict, args.trygui, args.smartfetch,args.proxy)) 504 | workers.append(p) 505 | p.start() 506 | 507 | for url in urls: 508 | urlQueue.put(url) 509 | 510 | for p in workers: 511 | p.join() 512 | 513 | -------------------------------------------------------------------------------- /install-dependencies.sh: -------------------------------------------------------------------------------- 1 | # Installation Script - tested on an ubuntu/trusty64 vagrant box 2 | 3 | # Show all commands being run 4 | #set -x 5 | 6 | # Error out if one fails 7 | set -e 8 | 9 | apt-get install -y swig swig2.0 libssl-dev python-dev 10 | 11 | # Newer version in PyPI 12 | #apt-get install -y python-requests 13 | 14 | # Newer version in PyPI 15 | #apt-get install -y python-m2crypto 16 | 17 | # Installing pillow from PIP for the latest 18 | #apt-get install -y python-pil 19 | 20 | # Install pip and install pytnon requirements through it 21 | apt-get install -y python-pip 22 | pip install -r requirements.txt 23 | 24 | # This binary is distributed with the code base, version is 25 | # more recent then the one in the ubuntu repo (1.9.1 vs 1.9.0) 26 | #apt-get install -y phantomjs 27 | 28 | # Grab the latest of phantomjs it directly from the source 29 | wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.8-linux-x86_64.tar.bz2 30 | 31 | phantom_md5sum=`md5sum phantomjs-1.9.8-linux-x86_64.tar.bz2 | cut -d' ' -f1` 32 | checksum="4ea7aa79e45fbc487a63ef4788a18ef7" 33 | 34 | if [ "$phantom_md5sum" != "$checksum" ] 35 | then 36 | echo "phantomjs checksum mismatch" 37 | exit 254 38 | fi 39 | 40 | tar xvf phantomjs-1.9.8-linux-x86_64.tar.bz2 41 | mv phantomjs-1.9.8-linux-x86_64/bin/phantomjs /usr/bin/phantomjs 42 | 43 | -------------------------------------------------------------------------------- /masshttp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | /root/masscan/bin/masscan -p80,443 -iL networks.txt -oG http.gnmap --rate 100000 4 | mkdir httpscreenshots 5 | cd httpscreenshots 6 | python ~/tools/httpscreenshot.py -i ../http.gnmap -p -t 30 -w 50 -a -vH -r 1 7 | python ~/tools/httpscreenshot.py -i ../http.gnmap -p -t 10 -w 10 -a -vH 8 | cd .. 
9 | python screenshotClustering/cluster.py -d httpscreenshots/ 10 | 11 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | m2crypto 2 | requests 3 | selenium 4 | beautifulsoup4 5 | pillow 6 | requesocks 7 | -------------------------------------------------------------------------------- /screenshotClustering/cluster.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import os 3 | import sys 4 | import argparse 5 | from collections import OrderedDict 6 | from collections import defaultdict 7 | import re 8 | import time 9 | from bs4 import BeautifulSoup 10 | 11 | try: 12 | from urllib.parse import quote,unquote 13 | except: 14 | from urllib import quote,unquote 15 | 16 | def addAttrToBag(attrName,url,link,wordBags,soup): 17 | for tag in soup.findAll('',{attrName:True}): 18 | if(isinstance(tag[attrName],str) or isinstance(tag[attrName],unicode)): 19 | tagStr = tag[attrName].encode('utf-8').strip() 20 | elif(isinstance(tag[attrName],list)): 21 | tagStr = tag[attrName][0].encode('utf-8').strip() 22 | else: 23 | print '[-] Strange tag type detected - '+str(type(tag[attrName])) 24 | tagStr = 'XXXXXXXXX' 25 | 26 | if(tagStr != ''): 27 | if(link): 28 | tagStr = linkStrip(tagStr) 29 | if(tagStr in wordBags[url]): 30 | wordBags[url][tagStr] += 1 31 | else: 32 | wordBags[url][tagStr] = 1 33 | 34 | def addTagToBag(tagName,url,link,wordBags,soup): 35 | for tag in soup.findAll(tagName): 36 | if(tag is not None): 37 | tagStr = tag.string 38 | if(link): 39 | tagStr = linkStrip(tagStr) 40 | if(tagStr in wordBags[url]): 41 | wordBags[url][tagStr] += 1 42 | else: 43 | wordBags[url][tagStr] = 1 44 | 45 | def linkStrip(linkStr): 46 | if(linkStr.find('/') != -1): 47 | linkStr = linkStr[linkStr.rfind('/'):] 48 | return linkStr 49 | 50 | def createWordBags(htmlList): 51 | wordBags={} 52 | 53 | for f in htmlList: 54 | htmlContent = open(f,'r').read() 55 | wordBags[f]={} 56 | soup = BeautifulSoup(htmlContent) 57 | addAttrToBag('name',f,False,wordBags,soup) 58 | addAttrToBag('href',f,True,wordBags,soup) 59 | addAttrToBag('src',f,True,wordBags,soup) 60 | addAttrToBag('id',f,False,wordBags,soup) 61 | addAttrToBag('class',f,False,wordBags,soup) 62 | addTagToBag('title',f,False,wordBags,soup) 63 | addTagToBag('h1',f,False,wordBags,soup) 64 | 65 | return wordBags 66 | 67 | def getNumWords(wordBag): 68 | count = 0 69 | for value in wordBag.values(): 70 | count = count+value 71 | return count 72 | 73 | def computeScore(wordBag1,wordBag2,debug=0): 74 | commonWords = 0 75 | wordBag1Length = getNumWords(wordBag1) 76 | wordBag2Length = getNumWords(wordBag2) 77 | 78 | 79 | if(len(wordBag1) == 0 and len(wordBag2) == 0): 80 | if debug: 81 | print 'Both have no words - return true' 82 | return 1 83 | elif (len(wordBag1) == 0 or len(wordBag2) == 0): 84 | if debug: 85 | print 'One has no words - return false' 86 | return 0 87 | 88 | for word in wordBag1.keys(): 89 | commonWords = commonWords+min(wordBag1[word],wordBag2.get(word,0)) 90 | 91 | score = (float(commonWords)/float(wordBag1Length)*(float(commonWords)/float(wordBag2Length))) 92 | 93 | if debug: 94 | print "Common Words: "+str(commonWords) 95 | print "WordBag1 Length: "+str(wordBag1Length) 96 | print "WordBag2 Length: "+str(wordBag2Length) 97 | print score 98 | 99 | return score 100 | 101 | def createClusters(wordBags,threshold): 102 | clusterData = {} 103 | i = 0 104 | siteList = 
wordBags.keys()
    for i in range(0,len(siteList)):
        clusterData[siteList[i]] = [threshold, i]

    for i in range(0,len(siteList)):
        for j in range(i+1,len(siteList)):
            score = computeScore(wordBags[siteList[i]],wordBags[siteList[j]])
            if (clusterData[siteList[i]][0] <= threshold and score > clusterData[siteList[i]][0]):
                clusterData[siteList[i]][1] = i
                clusterData[siteList[i]][0] = score
            if (clusterData[siteList[j]][0] <= threshold and score > clusterData[siteList[j]][0]):
                clusterData[siteList[j]][1] = i
                clusterData[siteList[j]][0] = score

    return clusterData

def getScopeHtml(scopeFile):
    if scopeFile is None:
        return None
    scope = open(scopeFile,'r')
    scopeText = 'Scope:'
    for line in scope.readlines():
        scopeText = scopeText + line+''
    return scopeText

def renderClusterHtml(clust,width,height,scopeFile=None):
    html = ''
    scopeHtml = getScopeHtml(scopeFile)
    header = '''
Web Application Catalog

Web Application Catalog
    '''
    if(scopeHtml is not None):
        header = header+scopeHtml
    header = header + '''
Catalog:
    '''
    html = html+''

    for cluster,siteList in clust.iteritems():
        html=html+''
        screenshotName = quote(siteList[0][0:-4], safe='./')
        html=html+''
        for site in siteList:
            screenshotName = quote(site[0:-5], safe='./')
            html=html+''
        html=html+''
        html=html+''+unquote(unquote(screenshotName[2:]).decode("utf-8")).decode("utf-8")+'
' 156 | footer = '' 157 | 158 | return [header,html,footer] 159 | def printJS(): 160 | js = """ 161 | function popUp(e,src) 162 | { 163 | x = e.clientX; 164 | y = e.clientY; 165 | 166 | var img = document.createElement("img"); 167 | img.src = src; 168 | img.setAttribute("class","popUp"); 169 | img.setAttribute("style","position:fixed;left:"+(x+15)+";top:"+0+";background-color:white"); 170 | //img.setAttribute("onmouseout","clearPopup(event)") 171 | // This next line will just add it to the tag 172 | document.body.appendChild(img); 173 | } 174 | 175 | function clearPopup() 176 | { 177 | var popUps = document.getElementsByClassName('popUp'); 178 | while(popUps[0]) { 179 | popUps[0].parentNode.removeChild(popUps[0]); 180 | } 181 | } 182 | """ 183 | 184 | f = open('popup.js','w') 185 | f.write(js) 186 | f.close() 187 | 188 | def doCluster(htmlList): 189 | siteWordBags = createWordBags(htmlList) 190 | clusterData = createClusters(siteWordBags,0.6) 191 | 192 | clusterDict = {} 193 | for site,data in clusterData.iteritems(): 194 | if data[1] in clusterDict: 195 | clusterDict[data[1]].append(site) 196 | else: 197 | clusterDict[data[1]]=[site] 198 | return clusterDict 199 | 200 | 201 | '''For a diff report we want 3 sections: 202 | 1. New sites 203 | 2. Removed sites 204 | 2. Changed sites 205 | ''' 206 | def doDiff(htmlList,diffList): 207 | '''Find new sites - this is easy just find any html filenames that are present in diffDir 208 | and not htmlList''' 209 | newList=[] 210 | for newItem in diffList: 211 | found = False 212 | newItemName = newItem[newItem.rfind('/')+1:] 213 | for oldItem in htmlList: 214 | oldItemName = oldItem[oldItem.rfind('/')+1:] 215 | if(oldItemName == newItemName): 216 | found = True 217 | break; 218 | if(not found): 219 | newList.append(newItem) 220 | 221 | '''Now find items that were in the previous scan but not the new''' 222 | oldList=[] 223 | for oldItem in htmlList: 224 | found = False 225 | oldItemName = oldItem[oldItem.rfind('/')+1:] 226 | for newItem in diffList: 227 | newItemName = newItem[newItem.rfind('/')+1:] 228 | if(newItemName == oldItemName): 229 | found = True 230 | break; 231 | if(not found): 232 | oldList.append(oldItem) 233 | 234 | '''Now find items that changed between the two scans''' 235 | changedList=[] 236 | oldPath = htmlList[0][:htmlList[0].rfind('/')+1] 237 | newPath = diffList[0][:diffList[0].rfind('/')+1] 238 | 239 | for newItem in diffList: 240 | newItemName = newItem[newItem.rfind('/')+1:] 241 | oldItem = oldPath+newItemName 242 | if(os.path.isfile(oldItem)): 243 | compare = [newItem,oldItem] 244 | wordBags = createWordBags(compare) 245 | score = computeScore(wordBags[newItem],wordBags[oldItem]) 246 | if(score < 0.6): 247 | changedList.append(newItem) 248 | 249 | return [newList,oldList,changedList] 250 | 251 | if __name__ == '__main__': 252 | parser = argparse.ArgumentParser() 253 | parser.add_argument("-d","--dir",help='Directory containing HTML files') 254 | parser.add_argument("-dF","--diff",default=None,help='Directory containing HTML files from a previous run to diff against') 255 | parser.add_argument("-t","--thumbsize",default='200x200',help='Thumbnail dimensions (e.g: 200x200).') 256 | parser.add_argument("-o","--output",default='clusters.html',help='Specify the HTML output filename') 257 | parser.add_argument("-s","--scope",default=None,help='Specify a scope file to include in the HTML output report') 258 | 259 | args = parser.parse_args() 260 | #create a list of images 261 | path = args.dir 262 | 263 | if(path is None): 264 | 
        parser.print_help()
        sys.exit(0)

    htmlList = []
    htmlRegex = re.compile('.*html.*')
    for fileName in os.listdir(path):
        if(htmlRegex.match(fileName)):
            htmlList.append(path+fileName)

    n = len(htmlList)


    width = int(args.thumbsize[0:args.thumbsize.find('x')])
    height = int(args.thumbsize[args.thumbsize.rfind('x')+1:])

    html = ''
    if(args.diff is not None):
        diffList = []
        for fileName in os.listdir(args.diff):
            if(htmlRegex.match(fileName)):
                diffList.append(args.diff+fileName)

        lists = doDiff(htmlList,diffList)

        newClusterDict = doCluster(lists[0])
        removedClusterDict = doCluster(lists[1])
        changedClusterDict = doCluster(lists[2])

        htmlList = renderClusterHtml(newClusterDict,width,height,scopeFile=args.scope)
        newClusterTable = htmlList[1]

        htmlList = renderClusterHtml(removedClusterDict,width,height,scopeFile=args.scope)
        oldClusterTable = htmlList[1]

        htmlList = renderClusterHtml(changedClusterDict,width,height,scopeFile=args.scope)
        changedClusterTable = htmlList[1]

        html = htmlList[0]
        html = html+"New Websites"
        html = html+newClusterTable
        html = html+"Deleted Websites"
        html = html+oldClusterTable
        html = html+"Changed Websites
" 307 | html = html+changedClusterTable 308 | html = html+htmlList[2] 309 | 310 | else: 311 | clusterDict = doCluster(htmlList) 312 | htmlList = renderClusterHtml(clusterDict,width,height,scopeFile=args.scope) 313 | html = htmlList[0]+htmlList[1]+htmlList[2] 314 | 315 | f = open(args.output,'w') 316 | f.write(html) 317 | printJS() 318 | 319 | -------------------------------------------------------------------------------- /screenshotClustering/popup.js: -------------------------------------------------------------------------------- 1 | 2 | function popUp(e,src) 3 | { 4 | x = e.clientX; 5 | y = e.clientY; 6 | 7 | var img = document.createElement("img"); 8 | img.src = src; 9 | img.setAttribute("class","popUp"); 10 | img.setAttribute("style","position:fixed;left:"+(x+15)+";top:"+0+";background-color:white"); 11 | //img.setAttribute("onmouseout","clearPopup(event)") 12 | // This next line will just add it to the tag 13 | document.body.appendChild(img); 14 | } 15 | 16 | function clearPopup() 17 | { 18 | var popUps = document.getElementsByClassName('popUp'); 19 | while(popUps[0]) { 20 | popUps[0].parentNode.removeChild(popUps[0]); 21 | } 22 | } 23 | --------------------------------------------------------------------------------