├── .gitignore
├── LICENSE
├── README.md
├── config.py.template
└── gen.py

/.gitignore:
--------------------------------------------------------------------------------
1 | # web-traffic-generator
2 | # user config file
3 | config.py
4 | 
5 | # PyInstaller
6 | # Usually these files are written by a python script from a template
7 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
8 | *.manifest
9 | *.spec
10 | 
11 | # Installer logs
12 | pip-log.txt
13 | pip-delete-this-directory.txt
14 | 
15 | # pyenv
16 | .python-version
17 | 
18 | # Byte-compiled / optimized / DLL files
19 | __pycache__/
20 | *.py[cod]
21 | *$py.class
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # web-traffic-generator
2 | 
3 | A quick and dirty HTTP/S "organic" traffic generator.
4 | 
5 | ## About
6 | 
7 | Just a simple (poorly written) Python script that aimlessly "browses" the internet by starting at pre-defined `ROOT_URLS` and randomly "clicking" links on pages until the pre-defined `MAX_DEPTH` is met.
8 | 
9 | I created this as a noise generator to use for an Incident Response / Network Defense simulation. The only issue is that my simulation environment uses multiple IDS/IPS/NGFW devices that will not pass and log simple TCPreplays of canned traffic. I needed the traffic to be as organic as possible, essentially mimicking real users browsing the web.
10 | 
11 | Tested on Ubuntu 14.04 & 16.04 minimal, but should work on any system with Python installed.
12 | 
13 | [![asciicast](https://asciinema.org/a/304683.png)](https://asciinema.org/a/304683)
14 | 
15 | ## How it works
16 | 
17 | About as simple as it gets...
18 | 
19 | **First, specify a few settings at the top of the script...**
20 | 
21 | - `MAX_DEPTH = 10`, `MIN_DEPTH = 3` Starting from each root URL (e.g. www.yahoo.com), our generator will click to a depth
22 | randomly selected between MIN_DEPTH and MAX_DEPTH.
23 | 
24 | *The interval between HTTP GET requests is chosen at random between the following two variables...*
25 | 
26 | - `MIN_WAIT = 5` Wait a minimum of `5` seconds between requests... Be careful with making requests too quickly, as that tends to piss off web servers.
27 | - `MAX_WAIT = 10` I think you get the point.
28 | 
29 | - `DEBUG = False` A poor man's logger. Set to `True` for verbose realtime printing to console for debugging or development. I'll incorporate proper logging later on (maybe).
30 | 
31 | - `ROOT_URLS = [url1,url2,url3]` The list of root URLs to start from when browsing. Randomly selected.
32 | 
33 | - `blacklist = [".gif", "intent/tweet", "badlink", etc...]` A blacklist of strings that we check every link against. If the link contains any of the strings in this list, it's discarded. Useful to avoid things that are not traffic-generator friendly like "Tweet this!" links or links to image files.
34 | 
35 | - `USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3).......'` You guessed it, the user-agent our headless browser hands over to the web server. You can probably leave it set to the default, but feel free to change it. I would strongly suggest using a common/valid one or else you'll likely get rate-limited quickly.
36 | 
37 | ## Dependencies
38 | 
39 | The only thing you need and *might* not have is `requests`. Grab it with:
40 | 
41 | ```bash
42 | sudo pip install requests
43 | ```
44 | 
45 | ## Usage
46 | 
47 | Create your config file first:
48 | 
49 | ```bash
50 | cp config.py.template config.py
51 | ```
52 | 
53 | Run the generator:
54 | 
55 | ```bash
56 | python gen.py
57 | ```
58 | 
59 | ## Troubleshooting and debugging
60 | 
61 | To get more deets on what is happening under the hood, change the `DEBUG` variable in `config.py` from `False` to `True`. This provides the following output...
62 | 
63 | ```console
64 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
65 | Traffic generator started
66 | Diving between 3 and 10 links deep into 489 different root URLs,
67 | Waiting between 5 and 10 seconds between requests.
68 | This script will run indefinitely. Ctrl+C to stop.
69 | Randomly selecting one of 489 URLs
70 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
71 | Recursively browsing [https://arstechnica.com] ~~~ [depth = 7]
72 | Requesting page...
73 | Page size: 77.6KB
74 | Data meter: 77.6KB
75 | Good requests: 1
76 | Bad requests: 0
77 | Scraping page for links
78 | Found 171 valid links
79 | Pausing for 7 seconds...
80 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
81 | Recursively browsing [https://arstechnica.com/author/jon-brodkin/] ~~~ [depth = 6]
82 | Requesting page...
83 | Page size: 75.7KB
84 | Data meter: 153.3KB
85 | Good requests: 2
86 | Bad requests: 0
87 | Scraping page for links
88 | Found 168 valid links
89 | Pausing for 9 seconds...
90 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
91 | Recursively browsing [https://arstechnica.com/information-technology/2020/01/directv-races-to-decommission-broken-boeing-satellite-before-it-explodes/] ~~~ [depth = 5]
92 | Requesting page...
93 | Page size: 43.8KB
94 | Data meter: 197.1KB
95 | Good requests: 3
96 | Bad requests: 0
97 | Scraping page for links
98 | Found 32 valid links
99 | Pausing for 8 seconds...
100 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101 | Recursively browsing [https://www.facebook.com/sharer.php?u=https%3A%2F%2Farstechnica.com%2F%3Fpost_type%3Dpost%26p%3D1647915] ~~~ [depth = 4]
102 | Requesting page...
103 | Page size: 64.2KB
104 | Data meter: 261.2KB
105 | Good requests: 4
106 | Bad requests: 0
107 | Scraping page for links
108 | Found 0 valid links
109 | Stopping and blacklisting: no links
110 | ```
111 | 
112 | The last URL attempted provides a good example of what happens when a particular URL throws an error or yields no links. We simply add it to our `config.blacklist` array in memory, and continue browsing. This prevents a known bad URL from returning to the queue.
113 | 
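114 | If you're curious what that filtering looks like, here is a minimal, self-contained sketch of the kind of substring check `get_links()` in `gen.py` applies to every scraped link. This is an illustration only; the `filter_links` helper and the sample HTML below are made up for this example, and the blacklist entries are borrowed from `config.py.template`:
115 | 
116 | ```python
117 | import re
118 | 
119 | # a few example substrings borrowed from config.py.template
120 | blacklist = [".gif", "intent/tweet", "bit.ly"]
121 | 
122 | def filter_links(html):
123 |     # pull every absolute href out of the raw HTML (same idea as gen.py's regex)
124 |     links = re.findall(r'href="(https?://[^"]+)"', html)
125 |     # keep a link only if no blacklisted substring appears anywhere in it
126 |     return [link for link in links if not any(b in link for b in blacklist)]
127 | 
128 | html = '<a href="https://arstechnica.com/">story</a> <a href="https://example.com/cat.gif">img</a>'
129 | print(filter_links(html))  # -> ['https://arstechnica.com/']
130 | ```
131 | 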
--------------------------------------------------------------------------------
/config.py.template:
--------------------------------------------------------------------------------
1 | MAX_DEPTH = 10  # maximum click depth
2 | MIN_DEPTH = 3  # minimum click depth
3 | MAX_WAIT = 10  # maximum amount of time to wait between HTTP requests
4 | MIN_WAIT = 5  # minimum amount of time allowed between HTTP requests
5 | DEBUG = False  # set to True to enable useful console output
6 | 
7 | # use this single item list to test how a site responds to this crawler
8 | # be sure to comment out the list below it.
9 | #ROOT_URLS = ["https://digg.com/"]
10 | 
11 | ROOT_URLS = [
12 |     "https://digg.com/",
13 |     "https://www.yahoo.com",
14 |     "https://www.reddit.com",
15 |     "http://www.cnn.com",
16 |     "http://www.ebay.com",
17 |     "https://en.wikipedia.org/wiki/Main_Page",
18 |     "https://austin.craigslist.org/"
19 | ]
20 | 
21 | 
22 | # items can be a URL "https://t.co" or simple string to check for "amazon"
23 | blacklist = [
24 |     "https://t.co",
25 |     "t.umblr.com",
26 |     "messenger.com",
27 |     "itunes.apple.com",
28 |     "l.facebook.com",
29 |     "bit.ly",
30 |     "mediawiki",
31 |     ".css",
32 |     ".ico",
33 |     ".xml",
34 |     "intent/tweet",
35 |     "twitter.com/share",
36 |     "signup",
37 |     "login",
38 |     "dialog/feed?",
39 |     ".png",
40 |     ".jpg",
41 |     ".json",
42 |     ".svg",
43 |     ".gif",
44 |     "zendesk",
45 |     "clickserve"
46 | ]
47 | 
48 | # must use a valid user agent or sites will hate you
49 | USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) ' \
50 |     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
51 | 
--------------------------------------------------------------------------------
/gen.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | 
3 | #
4 | # written by @eric_capuano
5 | # https://github.com/ecapuano/web-traffic-generator
6 | #
7 | # published under MIT license :) do what you want.
8 | #
9 | 
10 | # 20170714 shyft ADDED python 2.7 and 3.x compatibility and generic config
11 | # 20200225 rarawls ADDED recursive, depth-first browsing, color stdout
12 | from __future__ import print_function
13 | import requests
14 | import re
15 | import time
16 | import random
17 | try:
18 |     import config
19 | except ImportError:
20 | 
21 |     class ConfigClass:  # minimal config in case you don't have config.py
22 |         MAX_DEPTH = 10  # dive no deeper than this for each root URL
23 |         MIN_DEPTH = 3  # dive at least this deep into each root URL
24 |         MAX_WAIT = 10  # maximum amount of time to wait between HTTP requests
25 |         MIN_WAIT = 5  # minimum amount of time allowed between HTTP requests
26 |         DEBUG = False  # set to True to enable useful console output
27 | 
28 |         # use this single item list to test how a site responds to this crawler
29 |         # be sure to comment out the list below it.
30 |         #ROOT_URLS = ["https://digg.com/"]
31 |         ROOT_URLS = [
32 |             "https://www.reddit.com"
33 |         ]
34 | 
35 |         # items can be a URL "https://t.co" or simple string to check for "amazon"
36 |         blacklist = [
37 |             'facebook.com',
38 |             'pinterest.com'
39 |         ]
40 | 
41 |         # must use a valid user agent or sites will hate you
42 |         USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) ' \
43 |             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
44 |     config = ConfigClass
45 | 
46 | 
47 | class Colors:
48 |     RED = '\033[91m'
49 |     YELLOW = '\033[93m'
50 |     PURPLE = '\033[95m'
51 |     NONE = '\033[0m'
52 | 
53 | 
54 | def debug_print(message, color=Colors.NONE):
55 |     """ A method which prints if DEBUG is set """
56 |     if config.DEBUG:
57 |         print(color + message + Colors.NONE)
58 | 
59 | 
60 | def hr_bytes(bytes_, suffix='B', si=False):
61 |     """ A method providing a more legible byte format """
62 | 
63 |     bits = 1024.0 if si else 1000.0
64 | 
65 |     for unit in ['', 'K', 'M', 'G', 'T', 'P', 'E', 'Z']:
66 |         if abs(bytes_) < bits:
67 |             return "{:.1f}{}{}".format(bytes_, unit, suffix)
68 |         bytes_ /= bits
69 |     return "{:.1f}{}{}".format(bytes_, 'Y', suffix)
70 | 
71 | 
72 | def do_request(url):
73 |     """ A method which loads a page """
74 | 
75 |     global data_meter
76 |     global good_requests
77 |     global bad_requests
78 | 
79 |     debug_print(" Requesting page...")
80 | 
81 |     headers = {'user-agent': config.USER_AGENT}
82 | 
83 |     try:
84 |         r = requests.get(url, headers=headers, timeout=5)
85 |     except requests.exceptions.RequestException:
86 |         # Prevent 100% CPU loop in a net down situation
87 |         time.sleep(30)
88 |         return False
89 | 
90 |     page_size = len(r.content)
91 |     data_meter += page_size
92 | 
93 |     debug_print(" Page size: {}".format(hr_bytes(page_size)))
94 |     debug_print(" Data meter: {}".format(hr_bytes(data_meter)))
95 | 
96 |     status = r.status_code
97 | 
98 |     if (status != 200):
99 |         bad_requests += 1
100 |         debug_print(" Response status: {}".format(r.status_code), Colors.RED)
101 |         if (status == 429):
102 |             debug_print(
103 |                 " We're making requests too frequently... sleeping longer...")
sleeping longer...") 104 | config.MIN_WAIT += 10 105 | config.MAX_WAIT += 10 106 | else: 107 | good_requests += 1 108 | 109 | debug_print(" Good requests: {}".format(good_requests)) 110 | debug_print(" Bad reqeusts: {}".format(bad_requests)) 111 | 112 | return r 113 | 114 | 115 | def get_links(page): 116 | """ A method which returns all links from page, less blacklisted links """ 117 | 118 | pattern = r"(?:href\=\")(https?:\/\/[^\"]+)(?:\")" 119 | links = re.findall(pattern, str(page.content)) 120 | valid_links = [link for link in links if not any( 121 | b in link for b in config.blacklist)] 122 | return valid_links 123 | 124 | 125 | def recursive_browse(url, depth): 126 | """ A method which recursively browses URLs, using given depth """ 127 | # Base: load current page and return 128 | # Recursively: load page, pick random link and browse with decremented depth 129 | 130 | debug_print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~") 131 | debug_print( 132 | "Recursively browsing [{}] ~~~ [depth = {}]".format(url, depth)) 133 | 134 | if not depth: # base case: depth of zero, load page 135 | 136 | do_request(url) 137 | return 138 | 139 | else: # recursive case: load page, browse random link, decrement depth 140 | 141 | page = do_request(url) # load current page 142 | 143 | # give up if error loading page 144 | if not page: 145 | debug_print( 146 | " Stopping and blacklisting: page error".format(url), Colors.YELLOW) 147 | config.blacklist.append(url) 148 | return 149 | 150 | # scrape page for links not in blacklist 151 | debug_print(" Scraping page for links".format(url)) 152 | valid_links = get_links(page) 153 | debug_print(" Found {} valid links".format(len(valid_links))) 154 | 155 | # give up if no links to browse 156 | if not valid_links: 157 | debug_print(" Stopping and blacklisting: no links".format( 158 | url), Colors.YELLOW) 159 | config.blacklist.append(url) 160 | return 161 | 162 | # sleep and then recursively browse 163 | sleep_time = random.randrange(config.MIN_WAIT, config.MAX_WAIT) 164 | debug_print(" Pausing for {} seconds...".format(sleep_time)) 165 | time.sleep(sleep_time) 166 | 167 | recursive_browse(random.choice(valid_links), depth - 1) 168 | 169 | 170 | if __name__ == "__main__": 171 | 172 | # Initialize global variables 173 | data_meter = 0 174 | good_requests = 0 175 | bad_requests = 0 176 | 177 | print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~") 178 | print("Traffic generator started") 179 | print("https://github.com/ecapuano/web-traffic-generator") 180 | print("Diving between 3 and {} links deep into {} root URLs,".format( 181 | config.MAX_DEPTH, len(config.ROOT_URLS))) 182 | print("Waiting between {} and {} seconds between requests. ".format( 183 | config.MIN_WAIT, config.MAX_WAIT)) 184 | print("This script will run indefinitely. Ctrl+C to stop.") 185 | 186 | while True: 187 | 188 | debug_print("Randomly selecting one of {} Root URLs".format( 189 | len(config.ROOT_URLS)), Colors.PURPLE) 190 | 191 | random_url = random.choice(config.ROOT_URLS) 192 | depth = random.choice(range(config.MIN_DEPTH, config.MAX_DEPTH)) 193 | 194 | recursive_browse(random_url, depth) 195 | 196 | --------------------------------------------------------------------------------