├── .gitignore
├── LICENSE
├── README.md
├── config.py.template
└── gen.py

/.gitignore:
--------------------------------------------------------------------------------
1 | # web-traffic-generator
2 | # user config file
3 | config.py
4 | 
5 | # PyInstaller
6 | # Usually these files are written by a python script from a template
7 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
8 | *.manifest
9 | *.spec
10 | 
11 | # Installer logs
12 | pip-log.txt
13 | pip-delete-this-directory.txt
14 | 
15 | # pyenv
16 | .python-version
17 | 
18 | # Byte-compiled / optimized / DLL files
19 | __pycache__/
20 | *.py[cod]
21 | *$py.class
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # web-traffic-generator
2 | 
3 | A quick and dirty HTTP/S "organic" traffic generator.
4 | 
5 | ## About
6 | 
7 | Just a simple (poorly written) Python script that aimlessly "browses" the internet by starting at pre-defined `ROOT_URLS` and randomly "clicking" links on pages until the pre-defined `MAX_DEPTH` is met.
8 | 
9 | I created this as a noise generator to use for an Incident Response / Network Defense simulation. The only issue is that my simulation environment uses multiple IDS/IPS/NGFW devices that will not pass and log simple TCPreplays of canned traffic. I needed the traffic to be as organic as possible, essentially mimicking real users browsing the web.
10 | 
11 | Tested on Ubuntu 14.04 & 16.04 minimal, but should work on any system with Python installed.
12 | 
13 | [![asciicast](https://asciinema.org/a/304683.png)](https://asciinema.org/a/304683)
14 | 
15 | ## How it works
16 | 
17 | About as simple as it gets...
18 | 
19 | **First, specify a few settings at the top of the script...**
20 | 
21 | - `MAX_DEPTH = 10`, `MIN_DEPTH = 3` Starting from each root URL (e.g. www.yahoo.com), our generator will click to a depth
22 | randomly selected between MIN_DEPTH and MAX_DEPTH.
23 | 
24 | *The interval between HTTP GET requests is chosen at random between the following two variables...*
25 | 
26 | - `MIN_WAIT = 5` Wait a minimum of `5` seconds between requests... Be careful with making requests too quickly, as that tends to piss off web servers.
27 | - `MAX_WAIT = 10` I think you get the point.
28 | 
29 | - `DEBUG = False` A poor man's logger. Set to `True` for verbose realtime printing to console for debugging or development. I'll incorporate proper logging later on (maybe).
30 | 
31 | - `ROOT_URLS = [url1,url2,url3]` The list of root URLs to start from when browsing. Randomly selected.
32 | 
33 | - `blacklist = [".gif", "intent/tweet", "badlink", etc...]` A blacklist of strings that we check every link against. If the link contains any of the strings in this list, it's discarded. Useful to avoid things that are not traffic-generator friendly like "Tweet this!" links or links to image files.
34 | 
35 | - `USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3).......'` You guessed it, the user-agent our headless browser hands over to the web server. You can probably leave it set to the default, but feel free to change it. I would strongly suggest using a common/valid one or else you'll likely get rate-limited quickly.
36 | 
37 | ## Dependencies
38 | 
39 | The only thing you need and *might* not have is `requests`. Grab it with:
40 | 
41 | ```bash
42 | sudo pip install requests
43 | ```
44 | 
45 | ## Usage
46 | 
47 | Create your config file first:
48 | 
49 | ```bash
50 | cp config.py.template config.py
51 | ```
52 | 
53 | Run the generator:
54 | 
55 | ```bash
56 | python gen.py
57 | ```
58 | 
59 | ## Troubleshooting and debugging
60 | 
61 | To get more deets on what is happening under the hood, change the `DEBUG` variable in `config.py` from `False` to `True`. This provides the following output...
62 | 
63 | ```console
64 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
65 | Traffic generator started
66 | Diving between 3 and 10 links deep into 489 different root URLs,
67 | Waiting between 5 and 10 seconds between requests.
68 | This script will run indefinitely. Ctrl+C to stop.
69 | Randomly selecting one of 489 URLs
70 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
71 | Recursively browsing [https://arstechnica.com] ~~~ [depth = 7]
72 | Requesting page...
73 | Page size: 77.6KB
74 | Data meter: 77.6KB
75 | Good requests: 1
76 | Bad requests: 0
77 | Scraping page for links
78 | Found 171 valid links
79 | Pausing for 7 seconds...
80 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
81 | Recursively browsing [https://arstechnica.com/author/jon-brodkin/] ~~~ [depth = 6]
82 | Requesting page...
83 | Page size: 75.7KB
84 | Data meter: 153.3KB
85 | Good requests: 2
86 | Bad requests: 0
87 | Scraping page for links
88 | Found 168 valid links
89 | Pausing for 9 seconds...
90 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
91 | Recursively browsing [https://arstechnica.com/information-technology/2020/01/directv-races-to-decommission-broken-boeing-satellite-before-it-explodes/] ~~~ [depth = 5]
92 | Requesting page...
93 | Page size: 43.8KB
94 | Data meter: 197.1KB
95 | Good requests: 3
96 | Bad requests: 0
97 | Scraping page for links
98 | Found 32 valid links
99 | Pausing for 8 seconds...
100 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101 | Recursively browsing [https://www.facebook.com/sharer.php?u=https%3A%2F%2Farstechnica.com%2F%3Fpost_type%3Dpost%26p%3D1647915] ~~~ [depth = 4]
102 | Requesting page...
103 | Page size: 64.2KB
104 | Data meter: 261.2KB
105 | Good requests: 4
106 | Bad requests: 0
107 | Scraping page for links
108 | Found 0 valid links
109 | Stopping and blacklisting: no links
110 | ```
111 | 
112 | The last URL attempted provides a good example of what happens when a particular URL throws an error or yields no links. We simply add it to our `config.blacklist` array in memory, and continue browsing. This prevents a known bad URL from returning to the queue.
113 | 
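114 | If you're curious what that filtering looks like, here is a minimal, self-contained sketch of the kind of substring check `get_links()` in `gen.py` applies to every scraped link. This is an illustration only; the `filter_links` helper and the sample HTML below are made up for this example, and the blacklist entries are borrowed from `config.py.template`:
115 | 
116 | ```python
117 | import re
118 | 
119 | # a few example substrings borrowed from config.py.template
120 | blacklist = [".gif", "intent/tweet", "bit.ly"]
121 | 
122 | def filter_links(html):
123 |     # pull every absolute href out of the raw HTML (same idea as gen.py's regex)
124 |     links = re.findall(r'href="(https?://[^"]+)"', html)
125 |     # keep a link only if no blacklisted substring appears anywhere in it
126 |     return [link for link in links if not any(b in link for b in blacklist)]
127 | 
128 | html = '<a href="https://arstechnica.com/">story</a> <a href="https://example.com/cat.gif">img</a>'
129 | print(filter_links(html))  # -> ['https://arstechnica.com/']
130 | ```
131 | 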
--------------------------------------------------------------------------------
/config.py.template:
--------------------------------------------------------------------------------
1 | MAX_DEPTH = 10  # maximum click depth
2 | MIN_DEPTH = 3  # minimum click depth
3 | MAX_WAIT = 10  # maximum amount of time to wait between HTTP requests
4 | MIN_WAIT = 5  # minimum amount of time allowed between HTTP requests
5 | DEBUG = False  # set to True to enable useful console output
6 | 
7 | # use this single item list to test how a site responds to this crawler
8 | # be sure to comment out the list below it.
9 | #ROOT_URLS = ["https://digg.com/"]
10 | 
11 | ROOT_URLS = [
12 |     "https://digg.com/",
13 |     "https://www.yahoo.com",
14 |     "https://www.reddit.com",
15 |     "http://www.cnn.com",
16 |     "http://www.ebay.com",
17 |     "https://en.wikipedia.org/wiki/Main_Page",
18 |     "https://austin.craigslist.org/"
19 | ]
20 | 
21 | 
22 | # items can be a URL "https://t.co" or simple string to check for "amazon"
23 | blacklist = [
24 |     "https://t.co",
25 |     "t.umblr.com",
26 |     "messenger.com",
27 |     "itunes.apple.com",
28 |     "l.facebook.com",
29 |     "bit.ly",
30 |     "mediawiki",
31 |     ".css",
32 |     ".ico",
33 |     ".xml",
34 |     "intent/tweet",
35 |     "twitter.com/share",
36 |     "signup",
37 |     "login",
38 |     "dialog/feed?",
39 |     ".png",
40 |     ".jpg",
41 |     ".json",
42 |     ".svg",
43 |     ".gif",
44 |     "zendesk",
45 |     "clickserve"
46 | ]
47 | 
48 | # must use a valid user agent or sites will hate you
49 | USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) ' \
50 |     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
51 | 
--------------------------------------------------------------------------------
/gen.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | 
3 | #
4 | # written by @eric_capuano
5 | # https://github.com/ecapuano/web-traffic-generator
6 | #
7 | # published under MIT license :) do what you want.
8 | #
9 | 
10 | # 20170714 shyft ADDED python 2.7 and 3.x compatibility and generic config
11 | # 20200225 rarawls ADDED recursive, depth-first browsing, color stdout
12 | from __future__ import print_function
13 | import requests
14 | import re
15 | import time
16 | import random
17 | try:
18 |     import config
19 | except ImportError:
20 | 
21 |     class ConfigClass:  # minimal config in case you don't have config.py
22 |         MAX_DEPTH = 10  # dive no deeper than this for each root URL
23 |         MIN_DEPTH = 3  # dive at least this deep into each root URL
24 |         MAX_WAIT = 10  # maximum amount of time to wait between HTTP requests
25 |         MIN_WAIT = 5  # minimum amount of time allowed between HTTP requests
26 |         DEBUG = False  # set to True to enable useful console output
27 | 
28 |         # use this single item list to test how a site responds to this crawler
29 |         # be sure to comment out the list below it.
30 |         #ROOT_URLS = ["https://digg.com/"]
31 |         ROOT_URLS = [
32 |             "https://www.reddit.com"
33 |         ]
34 | 
35 |         # items can be a URL "https://t.co" or simple string to check for "amazon"
36 |         blacklist = [
37 |             'facebook.com',
38 |             'pinterest.com'
39 |         ]
40 | 
41 |         # must use a valid user agent or sites will hate you
42 |         USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) ' \
43 |             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
44 |     config = ConfigClass
45 | 
46 | 
47 | class Colors:
48 |     RED = '\033[91m'
49 |     YELLOW = '\033[93m'
50 |     PURPLE = '\033[95m'
51 |     NONE = '\033[0m'
52 | 
53 | 
54 | def debug_print(message, color=Colors.NONE):
55 |     """ A method which prints if DEBUG is set """
56 |     if config.DEBUG:
57 |         print(color + message + Colors.NONE)
58 | 
59 | 
60 | def hr_bytes(bytes_, suffix='B', si=False):
61 |     """ A method providing a more legible byte format """
62 | 
63 |     bits = 1024.0 if si else 1000.0
64 | 
65 |     for unit in ['', 'K', 'M', 'G', 'T', 'P', 'E', 'Z']:
66 |         if abs(bytes_) < bits:
67 |             return "{:.1f}{}{}".format(bytes_, unit, suffix)
68 |         bytes_ /= bits
69 |     return "{:.1f}{}{}".format(bytes_, 'Y', suffix)
70 | 
71 | 
72 | def do_request(url):
73 |     """ A method which loads a page """
74 | 
75 |     global data_meter
76 |     global good_requests
77 |     global bad_requests
78 | 
79 |     debug_print(" Requesting page...")
80 | 
81 |     headers = {'user-agent': config.USER_AGENT}
82 | 
83 |     try:
84 |         r = requests.get(url, headers=headers, timeout=5)
85 |     except requests.exceptions.RequestException:
86 |         # Prevent 100% CPU loop in a net down situation
87 |         time.sleep(30)
88 |         return False
89 | 
90 |     page_size = len(r.content)
91 |     data_meter += page_size
92 | 
93 |     debug_print(" Page size: {}".format(hr_bytes(page_size)))
94 |     debug_print(" Data meter: {}".format(hr_bytes(data_meter)))
95 | 
96 |     status = r.status_code
97 | 
98 |     if (status != 200):
99 |         bad_requests += 1
100 |         debug_print(" Response status: {}".format(r.status_code), Colors.RED)
101 |         if (status == 429):
102 |             debug_print(
103 |                 " We're making requests too frequently... sleeping longer...")
sleeping longer...") 104 | config.MIN_WAIT += 10 105 | config.MAX_WAIT += 10 106 | else: 107 | good_requests += 1 108 | 109 | debug_print(" Good requests: {}".format(good_requests)) 110 | debug_print(" Bad reqeusts: {}".format(bad_requests)) 111 | 112 | return r 113 | 114 | 115 | def get_links(page): 116 | """ A method which returns all links from page, less blacklisted links """ 117 | 118 | pattern = r"(?:href\=\")(https?:\/\/[^\"]+)(?:\")" 119 | links = re.findall(pattern, str(page.content)) 120 | valid_links = [link for link in links if not any( 121 | b in link for b in config.blacklist)] 122 | return valid_links 123 | 124 | 125 | def recursive_browse(url, depth): 126 | """ A method which recursively browses URLs, using given depth """ 127 | # Base: load current page and return 128 | # Recursively: load page, pick random link and browse with decremented depth 129 | 130 | debug_print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~") 131 | debug_print( 132 | "Recursively browsing [{}] ~~~ [depth = {}]".format(url, depth)) 133 | 134 | if not depth: # base case: depth of zero, load page 135 | 136 | do_request(url) 137 | return 138 | 139 | else: # recursive case: load page, browse random link, decrement depth 140 | 141 | page = do_request(url) # load current page 142 | 143 | # give up if error loading page 144 | if not page: 145 | debug_print( 146 | " Stopping and blacklisting: page error".format(url), Colors.YELLOW) 147 | config.blacklist.append(url) 148 | return 149 | 150 | # scrape page for links not in blacklist 151 | debug_print(" Scraping page for links".format(url)) 152 | valid_links = get_links(page) 153 | debug_print(" Found {} valid links".format(len(valid_links))) 154 | 155 | # give up if no links to browse 156 | if not valid_links: 157 | debug_print(" Stopping and blacklisting: no links".format( 158 | url), Colors.YELLOW) 159 | config.blacklist.append(url) 160 | return 161 | 162 | # sleep and then recursively browse 163 | sleep_time = random.randrange(config.MIN_WAIT, config.MAX_WAIT) 164 | debug_print(" Pausing for {} seconds...".format(sleep_time)) 165 | time.sleep(sleep_time) 166 | 167 | recursive_browse(random.choice(valid_links), depth - 1) 168 | 169 | 170 | if __name__ == "__main__": 171 | 172 | # Initialize global variables 173 | data_meter = 0 174 | good_requests = 0 175 | bad_requests = 0 176 | 177 | print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~") 178 | print("Traffic generator started") 179 | print("https://github.com/ecapuano/web-traffic-generator") 180 | print("Diving between 3 and {} links deep into {} root URLs,".format( 181 | config.MAX_DEPTH, len(config.ROOT_URLS))) 182 | print("Waiting between {} and {} seconds between requests. ".format( 183 | config.MIN_WAIT, config.MAX_WAIT)) 184 | print("This script will run indefinitely. Ctrl+C to stop.") 185 | 186 | while True: 187 | 188 | debug_print("Randomly selecting one of {} Root URLs".format( 189 | len(config.ROOT_URLS)), Colors.PURPLE) 190 | 191 | random_url = random.choice(config.ROOT_URLS) 192 | depth = random.choice(range(config.MIN_DEPTH, config.MAX_DEPTH)) 193 | 194 | recursive_browse(random_url, depth) 195 | 196 | --------------------------------------------------------------------------------