├── LICENSE
├── README.md
├── install.sh
├── requirements.txt
└── wwwordlist.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Zarcolio

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![](https://img.shields.io/github/license/Zarcolio/wwwordlist) ![](https://badges.pufler.dev/visits/Zarcolio/wwwordlist) ![](https://img.shields.io/github/stars/Zarcolio/wwwordlist) ![](https://img.shields.io/github/forks/Zarcolio/wwwordlist) ![](https://img.shields.io/github/issues/Zarcolio/wwwordlist) ![](https://img.shields.io/github/issues-closed-raw/Zarcolio/wwwordlist) ![](https://img.shields.io/github/issues-pr/Zarcolio/wwwordlist) ![](https://img.shields.io/github/issues-pr-closed-raw/Zarcolio/wwwordlist)

# About [WWWordList](https://github.com/Zarcolio/wwwordlist)
WWWordList is a wordlist generator: it takes input from stdin and extracts words from the HTML (parsed with BS4), the URLs, the JS/HTTP/input variables and the quoted text it finds in the supplied text or mail files.
It isn't a scraper or spider, so WWWordList is used in conjunction with a tool that downloads the HTML for it, for example wget.

# Why use WWWordList?
Because [![Twitter](https://img.shields.io/twitter/url/https/twitter.com/stokfredrik.svg?style=social&label=Stök)](https://twitter.com/stokfredrik) says you should use good wordlists, based on the content of the target. This is my attempt at creating a wordlist generator that supports this.

# Install
WWWordList should run on a default Kali Linux installation once BS4 is installed. To install WWWordList including BS4:
```
git clone https://github.com/Zarcolio/wwwordlist
cd wwwordlist
sudo bash install.sh
```
To run the installer unattended, for example in an automated environment, use:

```
sudo bash install.sh -auto
```

If you're running into trouble running WWWordList, please drop me an issue and I'll try to fix it :)

# Usage
```
usage: wwwordlist [-h] [-type ] [-case ] [-excl ] [-iwh ] [-iwn ] [-ii]
                  [-idu] [-min ] [-max ] [-mailfile]

Use WWWordList to generate a wordlist from input.

optional arguments:
  -h, --help show this help message and exit
  -type      Analyze the text between HTML tags, inside URLs found, inside quoted
             text or in the full text. Choose between
             httpvars|inputvars|jsvars|html|urls|quoted|full. Defaults to 'full'.
  -case      Apply original, lower or upper case. If no case type is specified,
             lower case is the default. If another case is specified, lower has to
             be specified to be included. Separate by commas.
  -excl      Leave out the words found in this file.
  -iwh       Ignore values containing a valid hexadecimal number of this length.
             Don't use low values, as letters a-f will be filtered.
  -iwn       Ignore values containing a valid decimal number of this length.
  -ii        Ignore words that are a valid integer number.
  -idu       Ignore words containing a dash or underscore, but break them in parts.
  -min       Defines the minimum length of a word to add to the wordlist, defaults
             to 3.
  -max       Defines the maximum length of a word to add to the wordlist, defaults
             to 10.
  -mailfile  Quoted-printable decode the input first. Use this option when
             inputting an email body.
```

# Examples
If you want to build a wordlist based on the text between the HTML tags, simply run the following command and let the wordlist generation begin:
```
cat index.html|wwwordlist -type html
```
If you want to build a wordlist based on the links inside a file, simply run:
```
cat index.html|wwwordlist -type urls
```
If you want to build a wordlist based on the text between the HTML tags, but you want it to be quite small, simply run:
```
cat index.html|wwwordlist -type html -iwh 4 -idu -max 8
```
If you want to build a wordlist based on the text between the HTML tags, but you want it to be really big, simply run:
```
cat index.html|wwwordlist -type html -iwh 4 -case o,l,u
```
If you want to build a wordlist based on the text from a webpage, simply run:
```
wget -qO - example.com|wwwordlist -type html
```
If you want to build a big wordlist based on a whole website and run it through ffuf, try:
```
wget -nd -r example.com -q -E -R woff,jpg,gif,eot,ttf,svg,png,otf,pdf,exe,zip,rar,tgz,docx,ico,jpeg
cat *.*|wwwordlist -iwh 4 -case o,l,u -max 10 -type full|ffuf -recursion -w - -u https://example.com/FUZZ -r
```
Want to throw [waybackurls](https://github.com/tomnomnom/waybackurls) in the mix? Use it together with xargs and [urlcoding](https://github.com/Zarcolio/urlcoding) (warning: this will take a lot of time):
```
cat domains.txt | waybackurls | urlcoding -e | parallel --pipe xargs -n1 wget -T 2 -qO - | wwwordlist -iwh 4
```
Got a Git repo cloned locally? Try the following command inside the clone folder:
```
find . -type f -exec strings {} +|wwwordlist
```

# Contribute?
Do you have useful additions to WWWordList?

* [![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](https://github.com/Zarcolio/wwwordlist/pulls)
* [![Twitter](https://img.shields.io/twitter/url/https/twitter.com/zarcolio.svg?style=social&label=Contact%20me)](https://twitter.com/zarcolio)
--------------------------------------------------------------------------------
/install.sh:
--------------------------------------------------------------------------------
#!/bin/bash

scriptname="wwwordlist.py"

# Attempt to install the packages using pip3; newer Debian/Kali releases refuse
# with "error: externally-managed-environment", which is detected here
sudo pip3 install -r requirements.txt 2>&1 | grep "error: externally-managed-environment" >/dev/null

if [ $? -eq 0 ]; then
    # pip3 refused, so install the packages using apt instead; apt can't parse
    # pip pins like beautifulsoup4==4.9.0, so use the Debian package name
    echo "Installing required packages using apt"
    sudo apt-get update
    sudo apt-get install -y python3-bs4
fi

dir=$(pwd)

# Check whether the 2ulb helper is already installed (exit code 127 = command not found)
2ulb >/dev/null 2>&1
if [ $? -eq 127 ]
then
    while true; do
        if [ "$1" = "-auto" ]; then
            # -auto: unattended installation, as documented in the README
            yn=y
        else
            read -p "2ulb not found, install 2ulb? [y/n]: " yn
        fi
        case $yn in
            [Yy]*) cd .. || exit; git clone https://github.com/Zarcolio/2ulb ; sudo python3 2ulb/2ulb.py 2ulb/2ulb.py ; cd "$dir" || exit; sudo 2ulb $scriptname; exit 0 ;;
            [Nn]*) echo "Aborted" ; exit 1 ;;
        esac
    done
else
    sudo 2ulb $scriptname
fi
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4==4.9.0
--------------------------------------------------------------------------------
/wwwordlist.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import argparse
import signal
import sys
import requests
import codecs
import re
import urllib.parse
import unicodedata
import quopri
import email

def GetArguments():
    # Get some commandline arguments:
    argParser=argparse.ArgumentParser(description='Use wwwordlist to generate a wordlist from input.')
    argParser.add_argument('-type', metavar="", help='Analyze the text between HTML tags, inside urls found, inside quoted text or in the full text. Choose between httpvars|inputvars|jsvars|html|urls|quoted|full. Defaults to \'full\'.')
    argParser.add_argument('-case', metavar="", help='Apply original, lower or upper case. If no case type is specified, lower case is the default. If another case is specified, lower has to be specified to be included. Separate by commas.')
    argParser.add_argument('-excl', metavar="", help='Leave out the words found in this file.')
    argParser.add_argument('-iwh', metavar="", help='Ignore values containing a valid hexadecimal number of this length. Don\'t use low values, as letters a-f will be filtered.', default=False)
    argParser.add_argument('-iwn', metavar="", help='Ignore values containing a valid decimal number of this length.', default=False)
    argParser.add_argument('-ii', help='Ignore words that are a valid integer number.', action="store_true", default=False)
    argParser.add_argument('-idu', help='Ignore words containing a dash or underscore, but break them in parts.', action="store_true", default=False)
    argParser.add_argument('-min', metavar="", help='Defines the minimum length of a word to add to the wordlist, defaults to 3.', default=3)
    argParser.add_argument('-max', metavar="", help='Defines the maximum length of a word to add to the wordlist, defaults to 10.', default=10)
    argParser.add_argument('-mailfile', help='Quoted-printable decode the input first. Use this option when inputting an email body.', action="store_true")

    return argParser.parse_args()

def GetCompFileContent(sCompFileName):
    # Read the -excl exclusion file into a list of non-empty lines.
    lWords = []  # stays empty when the file is missing, instead of raising UnboundLocalError
    try:
        f = open(sCompFileName, 'r')
        lWords = f.read().splitlines()
        lWords = list(filter(None, lWords))
        f.close()
    except FileNotFoundError:
        print("File named " + sCompFileName + " was not found.")

    return lWords
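
# Sketch of the -excl exclusion file format (the words below are hypothetical):
# one word per line, blank lines are ignored, and every word in the file is
# removed from the final output in main(), e.g.:
#
#   admin
#   login
#   logout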

def SignalHandler(sig, frame):
    # Create a break routine:
    sys.stderr.write("\nCtrl-C detected, exiting...\n")
    sys.exit(1)

signal.signal(signal.SIGINT, SignalHandler)

def StripAccents(text):
    # Remove diacritics from text
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def Unescape(s):
    # replace() hack because /, ., : and ; cannot be unescaped:
    s = s.replace(r"\/", "/")
    s = s.replace(r"\.", ".")
    s = s.replace(r"\:", ":")
    s = s.replace(r"\;", ";")

    def unescape_match(match):
        try:
            return codecs.decode(match.group(0), 'unicode-escape')
        except UnicodeDecodeError:
            return match.group(0) # leave undecodable escape sequences as-is

    return ESCAPE_SEQUENCE_RE.sub(unescape_match, s)
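
# A quick sketch of what Unescape handles (sample strings are hypothetical):
#   Unescape(r"path\/to\/file") -> "path/to/file"   (the replace() hack)
#   Unescape(r"name\u0041\x42") -> "nameAB"         (unicode and hex escapes)
#   Unescape(r"a\tb")           -> "a" + TAB + "b"  (single-character escapes)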

def GetHtmlWords(sHtml):
    sHtml = sHtml.replace("><", "> <") # needed because BS4 sometimes concatenates words when it shouldn't
    soup = BeautifulSoup(sHtml, 'html.parser')

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract() # rip it out

    # get text from HTML
    sText = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in sText.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    sText = '\n'.join(chunk for chunk in chunks if chunk)
    return sText

def GetVarsJs(strInput):
    # Extract JavaScript variable names (var foo = ...).
    regex = r"(var\s+)([a-z0-9]+)(\s*=)"
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group(2))
    return lMatches

# NOTE: the bodies of GetVarsInput, GetVarsHttp, Urls and RelUrls were garbled
# in this copy of the file; the versions below are minimal reconstructions
# based on their names and call sites, and the regexes are assumptions.

def GetVarsInput(strInput):
    # Extract the name= values of HTML input elements.
    regex = r"(<input[^>]+name\s*=\s*['\"]?)([a-z0-9_\-\[\]]+)"
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group(2))
    return lMatches

def GetVarsHttp(strInput):
    # Extract HTTP GET parameter names (?foo=...&bar=...).
    regex = r"[?&]([a-z0-9_\-\[\]]+)="
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group(1))
    return lMatches

def Urls(strInput):
    # Extract absolute (scheme://) URLs.
    regex = r"[a-z][a-z0-9+\-.]*://[^\s'\"<>]+"
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group(0))
    return lMatches

def RelUrls(strInput):
    # Extract relative URLs from URL-bearing HTML attributes, skipping
    # absolute (scheme://) and data: values; group(3) is the attribute value.
    regex = r"(<[^<]+(?:[<\s]action|background|cite|classid|codebase|data|formaction|href|icon|longdesc|manifest|poster|profile|src|usemap)\s*=\s*)(?!['\"]?(?:data|([a-zA-Z][a-zA-Z0-9+\-.]*\:\/\/)))['\"]?([^'\"\)\s>]+)"
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group(3))
    return lMatches

def RelUrlsQuoted(strInput):
    # Extract quoted absolute-path URLs such as "/api/v1/users".
    regex = r"([\"'])(\/[a-z0-9.\-_~!$&()*+,;=:@\[\]]+)\1"
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group(2))
    return lMatches

def GetQuotedStrings(strInput):
    # Match single- or double-quoted strings, allowing escaped quotes inside.
    regex = r"([\"'])(?:(?=(\\?))\2.)*?\1"
    matches = re.finditer(regex, strInput, re.MULTILINE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group())
    return " ".join(lMatches)

def GetLinks(strTotalInput):
    lUrls = Urls(strTotalInput)
    lRelUrls = RelUrls(strTotalInput)
    lRelUrlsQuoted = RelUrlsQuoted(strTotalInput)
    lTotal = lUrls + lRelUrls + lRelUrlsQuoted
    return " ".join(lTotal)

def FilterIh(lWords):
    # Drop words containing a run of hex characters of at least -iwh length.
    lTemp = []
    if lArgs.iwh:
        regex = r"[a-f0-9]{" + lArgs.iwh + ",}"
        for word in lWords:
            if not re.search(regex, word): # search, not match: "containing"
                lTemp.append(word)
        return lTemp
    else:
        return lWords

def FilterIn(lWords):
    # Drop words containing a run of decimal digits of at least -iwn length.
    lTemp = []
    if lArgs.iwn:
        regex = r"[0-9]{" + lArgs.iwn + ",}"
        for word in lWords:
            if not re.search(regex, word):
                lTemp.append(word)
        return lTemp
    else:
        return lWords
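
# Sketch of the two filters above (the words are hypothetical):
#   with -iwh 8, "deadbeefcafe" is dropped (12 hex characters in a row)
#   with -iwn 6, "build20240101" is dropped (8 decimal digits in a row)
#   "getUser" survives both filters.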

def FilterMin(lWords):
    lTemp = []
    for word in lWords:
        if len(word) >= int(lArgs.min):
            lTemp.append(word)
    return lTemp

def FilterMax(lWords):
    lTemp = []
    for word in lWords:
        if len(word) <= int(lArgs.max):
            lTemp.append(word)
    return lTemp

def FilterIi(lWords):
    # Drop words that are nothing but an integer number.
    lTemp = []
    for word in lWords:
        if not word.isdigit():
            lTemp.append(word)
    return lTemp

def RegStringsWithDashAndUnderscore(strInput):
    regex = r"([a-z0-9\-\_]+)"
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group())
    return lMatches

def RegStringsWithoutDashAndUnderscore(strInput):
    regex = r"([a-z0-9]+)"
    matches = re.finditer(regex, strInput, re.IGNORECASE)
    lMatches = []
    for matchNum, match in enumerate(matches, start=1):
        lMatches.append(match.group())
    return lMatches

def Strings(strInput):
    lMatches = RegStringsWithDashAndUnderscore(StripAccents(strInput)) + RegStringsWithoutDashAndUnderscore(StripAccents(strInput))
    lMatches = list(dict.fromkeys(lMatches)) # deduplicate, keep order
    return lMatches

def ToPlainText(strInput):
    # unquote twice to also catch double URL-encoded input
    strInput = urllib.parse.unquote(strInput)
    strInput = urllib.parse.unquote(strInput)
    strInput = Unescape(strInput)
    strInput = StripAccents(strInput)
    return strInput

def ReplaceInsideWords(lWords):
    # Add variants of each word: digits stripped from the edges, dashes and
    # underscores swapped for each other and (unless -idu) removed entirely.
    lTemp = []
    for word in lWords:
        if len(word) > 0:
            word2 = word.rstrip('0123456789').lstrip('0123456789')
            if word2 != word:
                lTemp.append(word2)

            lTemp.append(word)
            word2 = word.replace("-", "_")
            if word2 != word:
                lTemp.append(word2)
            word2 = word.replace("_", "-")
            if word2 != word:
                lTemp.append(word2)
            if not lArgs.idu:
                word2 = word.replace("-", "")
                if word2 != word:
                    lTemp.append(word2)
                word2 = word.replace("_", "")
                if word2 != word:
                    lTemp.append(word2)
    return lTemp

def StripStripes(lWords):
    # With -idu, drop words containing dashes or underscores altogether;
    # otherwise only drop words that start or end with one.
    lTemp = []
    for word in lWords:
        if len(word) > 0:
            if lArgs.idu:
                if "_" not in word and "-" not in word:
                    lTemp.append(word)
            else:
                if word[0] != "_" and word[0] != "-" and word[-1] != "_" and word[-1] != "-":
                    lTemp.append(word)
    return lTemp

# Parse the commandline arguments once, at module level; the filter
# functions above read lArgs when they are called from main().
lArgs = GetArguments()
requests.packages.urllib3.disable_warnings()

def PluralToSingle(lWords):
    # Add a singular variant for plural-looking words (entries -> entry,
    # words -> word); words ending in "ss" are left alone.
    lTemp = []
    for word in lWords:
        lTemp.append(word)
        if word.endswith("ies", len(word)-4):
            word = re.sub('ies$', 'y', word)
            lTemp.append(word)
        elif word.endswith("ss", len(word)-3):
            continue
        else:
            word = re.sub('s$', '', word)
            lTemp.append(word)

    return lTemp
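
# Sketch of the mutation pipeline main() runs below (the token is hypothetical):
#   ["api_keys"] -> ReplaceInsideWords -> ["api_keys", "api-keys", "apikeys"]
#                -> StripStripes (no -idu, no leading/trailing stripes) -> unchanged
#                -> PluralToSingle -> adds "api_key", "api-key" and "apikey"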

def main():

    if lArgs.excl:
        lCompWords = GetCompFileContent(lArgs.excl)

    if lArgs.case:
        lCaseArgs = lArgs.case.split(",")
    else:
        lCaseArgs = ["l"]

    if lArgs.type:
        lTypeArgs = lArgs.type
    else:
        lTypeArgs = "full"

    strTotalInput = ""

    try: # skip if binary values are given
        for strInput in sys.stdin:
            strTotalInput += strInput
    except UnicodeError:
        pass

    if lArgs.mailfile:
        strTotalInput = quopri.decodestring(strTotalInput).decode('utf-8', errors='ignore')
        b = email.message_from_string(strTotalInput)
        if b.is_multipart():
            # collect every part instead of keeping only the last one
            lPayloads = []
            for payload in b.get_payload():
                lPayloads.append(str(payload.get_payload()))
            strTotalInput = " ".join(lPayloads)
        else:
            strTotalInput = b.get_payload()

    strTotalInput = ToPlainText(strTotalInput)

    if lTypeArgs == "full":
        lMatches = Strings(strTotalInput)
    elif lTypeArgs == "jsvars":
        lMatches = GetVarsJs(strTotalInput)
    elif lTypeArgs == "httpvars":
        lMatches = GetVarsHttp(strTotalInput)
    elif lTypeArgs == "inputvars":
        lMatches = GetVarsInput(strTotalInput)
    elif lTypeArgs == "html":
        strTotalInput = GetHtmlWords(strTotalInput)
        lMatches = Strings(strTotalInput)
    elif lTypeArgs == "urls":
        strTotalInput = GetLinks(strTotalInput)
        lMatches = Strings(strTotalInput)
    elif lTypeArgs == "quoted":
        strTotalInput = GetQuotedStrings(strTotalInput)
        lMatches = Strings(strTotalInput)
    else:
        sys.stderr.write("Invalid type.\n\n")
        sys.exit(2)

    z = ReplaceInsideWords(lMatches)
    z = StripStripes(z)
    z = PluralToSingle(z)
    if lArgs.ii:
        z = FilterIi(z)

    if lArgs.min:
        z = FilterMin(z)

    if lArgs.max:
        z = FilterMax(z)

    if lArgs.iwh:
        z = FilterIh(z)

    if lArgs.iwn:
        z = FilterIn(z)

    lResult = []

    if "l" in lCaseArgs:
        lResult += [x.lower() for x in z]

    if "u" in lCaseArgs:
        lResult += [x.upper() for x in z]

    if "o" in lCaseArgs:
        lResult += z

    lResult = list(dict.fromkeys(lResult)) # deduplicate, keep order

    if lArgs.excl:
        lResult = set(lResult) - set(lCompWords)

    for x in sorted(lResult):
        print(x)

if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------