├── input └── .keep ├── results └── .keep ├── trained └── .keep ├── log ├── __init__.py └── multiprocessinglog.py ├── ngram ├── __init__.py └── ngram_creator.py ├── configs ├── __init__.py ├── dev.json ├── main.json └── configure.py ├── requirements.txt ├── .gitignore ├── utils ├── sortresult.py └── info.py ├── docs ├── LICENSE └── CHANGELOG.md ├── train.py ├── meter.py └── README.md /input/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /results/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /trained/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /log/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ngram/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /configs/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tqdm 2 | u-msgpack-python 3 | rainbow_logging_handler -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | trained/*.pack 6 | input/*.txt 7 | input/*.7z 8 | results/eval_result.txt -------------------------------------------------------------------------------- /configs/dev.json: -------------------------------------------------------------------------------- 1 | { 2 | "name" : "Development", 3 | "eval_file" : "eval.txt", 4 | "training_file" : "training.txt", 5 | "alphabet" : "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%", 6 | "lengths" : [4,6,8], 7 | "ngram_size" : 4, 8 | "no_cpus" : 8, 9 | "progress_bar": true 10 | } -------------------------------------------------------------------------------- /configs/main.json: -------------------------------------------------------------------------------- 1 | { 2 | "name" : "Yet Another Configuration", 3 | "eval_file" : "eval.txt", 4 | "training_file" : "training.txt", 5 | "alphabet" : "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%", 6 | "lengths" : [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23], 7 | "ngram_size" : 4, 8 | "no_cpus" : 8, 9 | "progress_bar": false 10 | } -------------------------------------------------------------------------------- /utils/sortresult.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env pypy 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' 5 | :author: Maximilian Golla 6 | :contact: maximilian.golla@rub.de 7 | :version: 0.7.1, 2019-07-11 8 | :description: Sorts passwords in the strength meter outfile 'eval_result.txt' by likelihood 9 | :usage: pypy utils/sortresult.py results/eval_result.txt > 
results/eval_result_sorted.txt 10 | ''' 11 | 12 | import sys 13 | 14 | # Read file 15 | with open(sys.argv[1], 'r') as inputfile: 16 | out = [] 17 | for line in inputfile: 18 | line = line.rstrip('\r\n') 19 | splitted = line.split('\t') 20 | if splitted[0].startswith("Info: No Markov model for this length:"): 21 | out.append((-1.0,splitted[1])) 22 | # Instead of adding them, you could also discard them 23 | # pass 24 | else: 25 | prob = float(splitted[0]) 26 | pw = splitted[1] 27 | out.append((prob,pw)) 28 | 29 | # Sort by prob 30 | out = sorted(out, key=lambda tup: tup[0], reverse=True) 31 | 32 | # Output 33 | for entry in out: 34 | prob = entry[0] 35 | pw = entry[1] 36 | print("{}\t{}".format(prob,pw)) -------------------------------------------------------------------------------- /docs/LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2019 Horst Goertz Institute for IT Security (Ruhr University Bochum) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /docs/CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | All notable changes to this project will be documented in this file. 3 | This project adheres to [Semantic Versioning](http://semver.org/). 
4 | 
5 | ## [Unreleased]
6 | ### Added
7 | - Natural Language Encoder (NLE)
8 | 
9 | ### Planned
10 | - Support for backoff model
11 | 
12 | ## [0.7.1] - 2019-07-11
13 | ### Fixed
14 | - Changed "Error" to "Info", if a Markov model of a specific size does not exist
15 | 
16 | ## [0.7.0] - 2019-07-11
17 | ### Added
18 | - Added support for configuration files
19 | 
20 | ## [0.6.0] - 2019-02-04
21 | ### Added
22 | - Adaptation to process Android unlock patterns
23 | 
24 | ## [0.5.0] - 2017-10-26
25 | ### Added
26 | - Rewrite to support a Markov model per password length
27 | 
28 | ## [0.4.0] - 2017-08-29
29 | ### Added
30 | - Complete rewrite using Python lists instead of Python OrderedDicts
31 | 
32 | ## [0.3.0] - 2016-12-11
33 | ### Added
34 | - Rewrite to process Emoji and PINs
35 | - Added support for efficient enumeration and 5-fold cross-validation
36 | 
37 | ## [0.2.0] - 2016-05-18
38 | ### Added
39 | - Added a Natural Language Encoder (`encode.py` and `decode.py`)
40 | 
41 | ## [0.1.0] - 2016-02-03
42 | ### Added
43 | - Initial version for NEMO
44 | - Main modules: training (`ngram_creator.py`)
45 | 
-------------------------------------------------------------------------------- /configs/configure.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env pypy
2 | # -*- coding: utf-8 -*-
3 | 
4 | ''' This script configures the Markov model
5 | :author: Maximilian Golla
6 | :contact: maximilian.golla@rub.de
7 | :version: 0.7.1, 2019-07-11
8 | '''
9 | 
10 | # Load external modules
11 | import sys, logging, json, datetime
12 | from threading import Thread
13 | 
14 | # Load own modules
15 | import multiprocessing
16 | from log.multiprocessinglog import *
17 | from ngram.ngram_creator import *
18 | 
19 | # Global variables
20 | mtlog = MultiProcessingLog('foo.log', 'a', 0, 0)
21 | logger = logging.getLogger()
22 | logger.addHandler(mtlog)
23 | logger.setLevel(logging.DEBUG) # DEBUG, INFO, CRITICAL
24 | logger = multiprocessing.log_to_stderr(logging.INFO) # DEBUG, INFO, CRITICAL
25 | 
26 | class Configure:
27 | 
28 |     def __init__(self, dict):
29 |         self.name = dict['name']
30 |         logging.debug("Constructor started for '{}'".format(self.name))
31 |         self._read_config()
32 |         # Settings are now available as attributes, e.g., self.EVAL_FILE
33 | 
34 |     def _read_config(self):
35 |         try:
36 |             with open('./configs/dev.json', 'r') as configfile:
37 |                 config = json.load(configfile)
38 |                 # These DEFAULTS are used if a key is missing from the config file
39 |                 self.NAME = config.get("name", "Demo")
40 |                 self.EVAL_FILE = config.get("eval_file", "eval.txt")
41 |                 self.TRAINING_FILE = config.get("training_file", "training.txt")
42 |                 self.ALPHABET = config.get("alphabet", "abcdefghijklmnopqrstuvwxyz")
43 |                 self.LENGTHS = config.get("lengths", [6,8])
44 |                 self.NGRAM_SIZE = config.get("ngram_size", 3)
45 |                 self.NO_CPUS = config.get("no_cpus", 8)
46 |                 self.PROGRESS_BAR = config.get("progress_bar", False)
47 |         except Exception as e:
48 |             sys.stderr.write("\x1b[1;%dm" % (31) + "Malformed config file: {}\n".format(e) + "\x1b[0m")
49 |             sys.exit(1)
50 | 
-------------------------------------------------------------------------------- /utils/info.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env pypy
2 | # -*- coding: utf-8 -*-
3 | 
4 | '''
5 | :author: Maximilian Golla
6 | :contact: maximilian.golla@rub.de
7 | :version: 0.7.1, 2019-07-11
8 | :description: Reports some statistics about password length, alphabet, ASCII encoding, etc. 
9 | :usage: pypy utils/info.py input/eval.txt 10 | ''' 11 | 12 | import sys 13 | import operator 14 | 15 | def is_ascii(s): 16 | return all((ord(c) >= 32 and ord(c) <= 126) for c in s) 17 | 18 | def main(): 19 | min_len = sys.maxsize 20 | max_len = -sys.maxsize - 1 21 | alphabet = dict() 22 | lengths = set() 23 | everything_ascii = "Yes" 24 | 25 | with open(sys.argv[1], 'r') as passwordfile: 26 | for line in passwordfile: 27 | line = line.rstrip('\r\n') 28 | length = len(line) 29 | for char in line: 30 | if char in alphabet: 31 | alphabet[char] += 1 32 | else: 33 | alphabet[char] = 1 34 | if length < min_len: 35 | min_len = length 36 | if length > max_len: 37 | max_len = length 38 | lengths.add(length) 39 | if is_ascii(line) == False: 40 | everything_ascii = "No" 41 | 42 | # Alphabet 43 | alphabet = sorted(alphabet.items(), key=operator.itemgetter(1), reverse=True) 44 | lengths = sorted(lengths) 45 | alpha = [] 46 | for e in alphabet: 47 | if e[0] == '"': 48 | alpha.append('\\"') # escape quotes 49 | elif e[0] == '\\': 50 | alpha.append('\\\\') # escape backslash 51 | else: 52 | alpha.append(e[0]) 53 | print("File: {}".format(sys.argv[1].split('/')[-1])) 54 | print("Min length: {}".format(min_len)) 55 | print("Max length: {}".format(max_len)) 56 | print("Observed password lengths: [{}]".format(','.join([str(x) for x in list(lengths)]))) 57 | print('Alphabet (escaped for Python, but watch out for the space char): "{}"'.format(''.join(alpha))) 58 | print("Alphabet length: {}".format(len(alphabet))) 59 | print("ASCII only: {}".format(everything_ascii)) 60 | 61 | if __name__ == '__main__': 62 | main() -------------------------------------------------------------------------------- /log/multiprocessinglog.py: -------------------------------------------------------------------------------- 1 | from logging.handlers import RotatingFileHandler 2 | import multiprocessing, threading, logging, sys, traceback 3 | from rainbow_logging_handler import RainbowLoggingHandler # pip install rainbow_logging_handler 4 | 5 | class MultiProcessingLog(logging.Handler): 6 | def __init__(self, name, mode, maxsize, rotate): 7 | logging.Handler.__init__(self) 8 | 9 | #self._handler = RotatingFileHandler(name, mode, maxsize, rotate) 10 | formatter = logging.Formatter("[%(asctime)s.%(msecs)03d] %(filename)16s Line %(lineno)3d %(funcName)s():\t %(message)s") 11 | self._handler = RainbowLoggingHandler(sys.stderr, color_funcName=('green', 'none', True)) 12 | self._handler.setFormatter(formatter) 13 | 14 | self.queue = multiprocessing.Queue(-1) 15 | 16 | t = threading.Thread(target=self.receive) 17 | t.daemon = True 18 | t.start() 19 | 20 | def setFormatter(self, fmt): 21 | logging.Handler.setFormatter(self, fmt) 22 | self._handler.setFormatter(fmt) 23 | 24 | def receive(self): 25 | while True: 26 | try: 27 | record = self.queue.get() 28 | self._handler.emit(record) 29 | except (KeyboardInterrupt, SystemExit): 30 | raise 31 | except EOFError: 32 | break 33 | except: 34 | traceback.print_exc(file=sys.stderr) 35 | 36 | def send(self, s): 37 | self.queue.put_nowait(s) 38 | 39 | def _format_record(self, record): 40 | # ensure that exc_info and args 41 | # have been stringified. 
Removes any chance of 42 | # unpickleable things inside and possibly reduces 43 | # message size sent over the pipe 44 | if record.args: 45 | record.msg = record.msg % record.args 46 | record.args = None 47 | if record.exc_info: 48 | dummy = self.format(record) 49 | record.exc_info = None 50 | 51 | return record 52 | 53 | def emit(self, record): 54 | try: 55 | s = self._format_record(record) 56 | self.send(s) 57 | except (KeyboardInterrupt, SystemExit): 58 | raise 59 | except: 60 | self.handleError(record) 61 | 62 | def close(self): 63 | self._handler.close() 64 | logging.Handler.close(self) -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env pypy 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' This script manages the training 5 | :author: Maximilian Golla 6 | :contact: maximilian.golla@rub.de 7 | :version: 0.7.1, 2019-07-11 8 | ''' 9 | 10 | # Load external modules 11 | from configs.configure import * 12 | 13 | ''' Generates a new ngram-object via init, count, prob, (save) ''' 14 | def worker(data): 15 | "This data was received by the process:" 16 | length = data[0] 17 | progress_bar = data[1] 18 | 19 | ngram_creator = NGramCreator({ 20 | "name": ("NGramCreator, Session: {}, Length: {}, Progress bar: {}".format(CONFIG.NAME, length, progress_bar)), 21 | "alphabet": CONFIG.ALPHABET, 22 | "ngram_size": CONFIG.NGRAM_SIZE, 23 | "training_file": "input/"+CONFIG.TRAINING_FILE, 24 | "length": length, 25 | "progress_bar": progress_bar 26 | }) 27 | 28 | # Initial probability (IP) 29 | logging.debug("ip_list init() ...") 30 | ngram_creator._init_lists("ip_list") 31 | 32 | logging.debug("ip_list count() ...") 33 | ngram_creator._count("ip_list") 34 | 35 | logging.debug("ip_list prob() ...") 36 | ngram_creator._prob("ip_list") 37 | 38 | logging.debug("ip_list save() ...") 39 | ngram_creator.save("ip_list") 40 | 41 | logging.debug("Training IP done ...") 42 | 43 | # Conditional probability (CP) 44 | logging.debug("cp_list init() ...") 45 | ngram_creator._init_lists("cp_list") 46 | 47 | logging.debug("cp_list count() ...") 48 | ngram_creator._count("cp_list") 49 | 50 | logging.debug("cp_list prob() ...") 51 | ngram_creator._prob("cp_list") 52 | 53 | logging.debug("cp_list save() ...") 54 | ngram_creator.save("cp_list") 55 | 56 | logging.debug("Training CP done ...") 57 | 58 | # End probability (EP) 59 | logging.debug("ep_list init() ...") 60 | ngram_creator._init_lists("ep_list") 61 | 62 | logging.debug("ep_list count() ...") 63 | ngram_creator._count("ep_list") 64 | 65 | logging.debug("ep_list prob() ...") 66 | ngram_creator._prob("ep_list") 67 | 68 | logging.debug("ep_list save() ...") 69 | ngram_creator.save("ep_list") 70 | 71 | logging.debug("Training EP done ...") 72 | 73 | ''' Manages the training ''' 74 | def train(): 75 | try: 76 | logging.debug("Training started ...") 77 | 78 | ''' Singleprocessing 79 | for length in CONFIG.LENGTHS: 80 | data = [length, CONFIG.PROGRESS_BAR] 81 | worker(data) 82 | ''' 83 | 84 | #''' Multiprocessing 85 | data = [] 86 | for length in CONFIG.LENGTHS: 87 | data.append([length, CONFIG.PROGRESS_BAR]) 88 | pool = multiprocessing.Pool(processes=CONFIG.NO_CPUS) 89 | pool.map(worker, data) 90 | pool.close() # no more tasks can be submitted to the pool 91 | pool.join() # wait for the worker processes to exit 92 | #''' 93 | 94 | except Exception as e: 95 | sys.stderr.write("\x1b[1;%dm" % (31) + "Training failed: {}\n".format(e) + "\x1b[0m") 96 | 
sys.exit(1)
97 | 
98 | def main():
99 |     try:
100 |         global CONFIG
101 |         CONFIG = Configure({"name":"My Config"})
102 |         train()
103 |     except KeyboardInterrupt:
104 |         print('User canceled')
105 |         sys.exit(1)
106 |     except Exception as e:
107 |         sys.stderr.write("\x1b[1;%dm" % (31) + "Error: {}\n".format(e) + "\x1b[0m")
108 |         sys.exit(1)
109 | 
110 | if __name__ == '__main__':
111 |     print("{0}: {1:%Y-%m-%d %H:%M:%S}\n".format("Start", datetime.datetime.now()))
112 |     print("Press Ctrl+C to shutdown")
113 |     main()
114 |     print("{0}: {1:%Y-%m-%d %H:%M:%S}".format("Done", datetime.datetime.now()))
115 | 
-------------------------------------------------------------------------------- /meter.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env pypy
2 | # -*- coding: utf-8 -*-
3 | 
4 | ''' This script loads the training and estimates the probability (strength) of some passwords
5 | :author: Maximilian Golla
6 | :contact: maximilian.golla@rub.de
7 | :version: 0.7.1, 2019-07-11
8 | '''
9 | 
10 | # Load external modules
11 | from configs.configure import *
12 | 
13 | ''' Loads the training data from disk '''
14 | def worker(length):
15 |     ngram_creator = NGramCreator({
16 |         "name": CONFIG.NAME,
17 |         "alphabet": CONFIG.ALPHABET,
18 |         "ngram_size": CONFIG.NGRAM_SIZE,
19 |         "training_file": "input/"+CONFIG.TRAINING_FILE,
20 |         "length": length,
21 |         "progress_bar": CONFIG.PROGRESS_BAR
22 |     })
23 |     logging.debug("Thread: {} - ip_list load() ...".format(length))
24 |     ngram_creator.load("ip_list")
25 |     logging.debug("Thread: {} - cp_list load() ...".format(length))
26 |     ngram_creator.load("cp_list")
27 |     logging.debug("Thread: {} - ep_list load() ...".format(length))
28 |     ngram_creator.load("ep_list")
29 |     logging.debug("Thread: {} - Loading done ...".format(length))
30 |     MARKOV_MODELS.append(ngram_creator)
31 | 
32 | ''' Every length has its own model; we select the correct model for every password '''
33 | def _select_correct_markov_model(pw_length, markov_models):
34 |     result = markov_models[0] # Fallback solution, if there is no model for the selected length
35 |     for model in markov_models:
36 |         if model.length == pw_length:
37 |             result = model
38 |     return result
39 | 
40 | ''' This function manages the password strength evaluation '''
41 | def eval():
42 |     # ngram creator
43 |     global MARKOV_MODELS
44 |     MARKOV_MODELS = []
45 |     threads = []
46 |     for length in CONFIG.LENGTHS:
47 |         # Using threads is not beneficial, because it is a disk-intensive task
48 |         thread = Thread(target = worker, args = (length,))
49 |         thread.start()
50 |         threads.append(thread)
51 |     # Wait for all threads to finish
52 |     for thread in threads:
53 |         thread.join()
54 | 
55 |     logging.debug("Training loaded from disk ...")
56 |     logging.debug("Number of Markov models: "+str(len(MARKOV_MODELS)))
57 |     fo = open("results/"+CONFIG.EVAL_FILE[:-4]+"_result.txt", "w") # [:-4] strips the '.txt' extension; rstrip('.txt') would also eat trailing 't'/'x' characters
58 |     with open("input/"+CONFIG.EVAL_FILE, 'r') as inputfile:
59 |         for line in inputfile:
60 |             line = line.rstrip('\r\n')
61 |             # Determine correct model
62 |             ngram_creator = _select_correct_markov_model(len(line), MARKOV_MODELS)
63 |             if len(line) != ngram_creator.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
64 |                 sys.stderr.write("\x1b[1;%dm" % (31) + "Info: No Markov model for this length: {} {}\n".format(len(line),line) + "\x1b[0m")
65 |                 fo.write("{} {}\t{}\n".format("Info: No Markov model for this length:", len(line), line))
66 |                 continue
67 |             if ngram_creator._is_in_alphabet(line): # Filter 
non-printable 68 | ip = line[:ngram_creator.ngram_size-1] 69 | ip_prob = ngram_creator.ip_list[ngram_creator._n2iIP(ip)] 70 | ep = line[len(line)-(ngram_creator.ngram_size-1):] 71 | ep_prob = ngram_creator.ep_list[ngram_creator._n2iIP(ep)] 72 | old_pos = 0 73 | cp_probs = [] 74 | for new_pos in range(ngram_creator.ngram_size, len(line)+1, 1): 75 | cp = line[old_pos:new_pos] 76 | cp_probs.append(ngram_creator.cp_list[ngram_creator._n2iCP(cp)]) 77 | old_pos += 1 78 | pw_prob = ip_prob * ep_prob 79 | for cp_prob in cp_probs: 80 | pw_prob = pw_prob * cp_prob 81 | fo.write("{}\t{}\n".format(pw_prob,line)) 82 | fo.flush() 83 | else: 84 | sys.stderr.write("\x1b[1;%dm" % (31) + "Info: Password contains invalid characters: {}\n".format(line) + "\x1b[0m") 85 | fo.write("{}\t{}\n".format("Info: Password contains invalid characters:", line)) 86 | continue 87 | fo.close() 88 | 89 | def main(): 90 | try: 91 | global CONFIG 92 | CONFIG = Configure({"name":"My Config"}) 93 | eval() 94 | except KeyboardInterrupt: 95 | print('User canceled') 96 | sys.exit(1) 97 | except Exception as e: 98 | sys.stderr.write("\x1b[1;%dm" % (31) + "Error: {}\n".format(e) + "\x1b[0m") 99 | sys.exit(1) 100 | 101 | if __name__ == '__main__': 102 | print("{0}: {1:%Y-%m-%d %H:%M:%S}\n".format("Start", datetime.datetime.now())) 103 | print("Press Ctrl+C to shutdown") 104 | main() 105 | print("{0}: {1:%Y-%m-%d %H:%M:%S}".format("Done", datetime.datetime.now())) 106 | -------------------------------------------------------------------------------- /ngram/ngram_creator.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env pypy 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' The Markov model 5 | :author: Maximilian Golla 6 | :contact: maximilian.golla@rub.de 7 | :version: 0.7.1, 2019-07-11 8 | ''' 9 | 10 | # External modules 11 | from collections import OrderedDict # storing the alphabet 12 | import os # load and save / file handling 13 | import umsgpack # load and save # pip install u-msgpack-python 14 | import math # only pow 15 | import logging # logging debug infos 16 | from rainbow_logging_handler import RainbowLoggingHandler # pip install rainbow_logging_handler 17 | from tqdm import tqdm # progress bar while reading the file # pip install tqdm 18 | import datetime 19 | 20 | class NGramCreator: 21 | 22 | def __init__(self, dict): 23 | self.name = dict['name'] 24 | logging.debug("Constructor started for '{}'".format(self.name)) 25 | self.alphabet = dict['alphabet'] 26 | self.alphabet_len = len(self.alphabet) 27 | self.alphabet_dict = OrderedDict.fromkeys(self.alphabet) #a 0, b 1, c 2 28 | i = 0 29 | for char in self.alphabet_dict: 30 | self.alphabet_dict[char] = i 31 | i += 1 32 | self.alphabet_list = list(self.alphabet) 33 | logging.debug("Used alphabet: {}".format(self.alphabet)) 34 | self.length = dict['length'] 35 | logging.debug("Model string length: {}".format(self.length)) 36 | self.ngram_size = dict['ngram_size'] 37 | assert self.ngram_size >= 2, "n-gram size < 2 does not make any sense! 
Your configured n-gram size is {}".format(self.ngram_size)
38 |         logging.debug("NGram size: {}".format(self.ngram_size))
39 |         self.training_file = dict['training_file']
40 |         self.training_file_lines = sum(1 for line in open(self.training_file))
41 |         self.disable_progress = False if dict['progress_bar'] else True
42 |         self.ip_list = []
43 |         self.cp_list = []
44 |         self.ep_list = []
45 |         self.no_ip_ngrams = int(math.pow(self.alphabet_len, (self.ngram_size-1)))
46 |         self.no_cp_ngrams = int(math.pow(self.alphabet_len, (self.ngram_size)))
47 |         self.no_ep_ngrams = self.no_ip_ngrams # save one exponentiation :-P
48 |         logging.debug("len(IP) theo: {}".format(self.no_ip_ngrams))
49 |         logging.debug("len(CP) theo: {} => {} * {}".format(self.no_cp_ngrams, int(math.pow(self.alphabet_len, (self.ngram_size-1))), self.alphabet_len))
50 |         logging.debug("len(EP) theo: {}".format(self.no_ep_ngrams))
51 | 
52 |     def __del__(self):
53 |         logging.debug("Destructor started for '{}'".format(self.name))
54 | 
55 |     def __str__(self):
56 |         return "Hello {}!".format(self.name)
57 | 
58 |     ########################################################################################################################
59 | 
60 |     def _is_in_alphabet(self, string):
61 |         for char in string:
62 |             if not char in self.alphabet:
63 |                 return False
64 |         return True
65 | 
66 |     # Checks whether two floats are approximately equal, e.g., 1.0 == 1.0
67 |     def _is_almost_equal(self, a, b, rel_tol=1e-09, abs_tol=0.0):
68 |         #print '{0:.16f}'.format(a), '{0:.16f}'.format(b)
69 |         return abs(a-b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
70 | 
71 |     ########################################################################################################################
72 | 
73 |     # ngram-to-initial-prob-index
74 |     def _n2iIP(self, ngram):
75 |         ngram = list(ngram)
76 |         if self.ngram_size == 5:
77 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[3]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**3 * self.alphabet_dict[ngram[0]] )
78 |         if self.ngram_size == 4:
79 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[0]] )
80 |         if self.ngram_size == 3:
81 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[0]] )
82 |         if self.ngram_size == 2:
83 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[0]] )
84 | 
85 |     # initial-prob-index-to-ngram
86 |     def _i2nIP(self, index):
87 |         if self.ngram_size == 5:
88 |             third, fourth = divmod(index, self.alphabet_len)
89 |             second, third = divmod(third, self.alphabet_len)
90 |             first, second = divmod(second, self.alphabet_len)
91 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third] + self.alphabet_list[fourth]
92 |         if self.ngram_size == 4:
93 |             second, third = divmod(index, self.alphabet_len)
94 |             first, second = divmod(second, self.alphabet_len)
95 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third]
96 |         if self.ngram_size == 3:
97 |             first, second = divmod(index, self.alphabet_len)
98 |             return self.alphabet_list[first] + self.alphabet_list[second]
99 |         if self.ngram_size == 2:
100 |             return self.alphabet_list[index]
101 | 
102 |     # ngram-to-conditional-prob-index
103 |     def _n2iCP(self, ngram):
104 |         ngram = list(ngram)
105 |         if self.ngram_size == 5:
106 |             return ( 
self.alphabet_len**0 * self.alphabet_dict[ngram[4]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[3]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**3 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**4 * self.alphabet_dict[ngram[0]] )
107 |         if self.ngram_size == 4:
108 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[3]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**3 * self.alphabet_dict[ngram[0]] )
109 |         if self.ngram_size == 3:
110 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[0]] )
111 |         if self.ngram_size == 2:
112 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[0]] )
113 | 
114 |     # conditional-prob-index-to-ngram
115 |     def _i2nCP(self, index):
116 |         if self.ngram_size == 5:
117 |             fourth, fifth = divmod(index, self.alphabet_len)
118 |             third, fourth = divmod(fourth, self.alphabet_len)
119 |             second, third = divmod(third, self.alphabet_len)
120 |             first, second = divmod(second, self.alphabet_len)
121 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third] + self.alphabet_list[fourth] + self.alphabet_list[fifth]
122 |         if self.ngram_size == 4:
123 |             third, fourth = divmod(index, self.alphabet_len)
124 |             second, third = divmod(third, self.alphabet_len)
125 |             first, second = divmod(second, self.alphabet_len)
126 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third] + self.alphabet_list[fourth]
127 |         if self.ngram_size == 3:
128 |             second, third = divmod(index, self.alphabet_len)
129 |             first, second = divmod(second, self.alphabet_len)
130 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third]
131 |         if self.ngram_size == 2:
132 |             first, second = divmod(index, self.alphabet_len)
133 |             return self.alphabet_list[first] + self.alphabet_list[second]
134 | 
135 |     ########################################################################################################################
136 | 
137 |     # Adds all possible combinations of ngrams to the list with initial count = 1
138 |     def _init_lists(self, kind):
139 |         if kind == "ip_list":
140 |             for i in range(0, int(math.pow(self.alphabet_len, self.ngram_size-1))):
141 |                 self.ip_list.append(1) # Smoothing, we initialize every possible ngram with count = 1
142 |         elif kind == "cp_list":
143 |             for i in range(0, int(math.pow(self.alphabet_len, self.ngram_size))):
144 |                 self.cp_list.append(1) # Smoothing, we initialize every possible ngram with count = 1
145 |         elif kind == "ep_list":
146 |             for i in range(0, int(math.pow(self.alphabet_len, self.ngram_size-1))):
147 |                 self.ep_list.append(1) # Smoothing, we initialize every possible ngram with count = 1
148 |         else:
149 |             raise Exception('Unknown list given (required: ip_list, cp_list, or ep_list)')
150 | 
151 |     ########################################################################################################################
152 | 
153 |     # Count the occurrences of ngrams in the training corpus
154 |     '''
155 |     password PW
156 |     pa IP
157 |     pas CP1
158 |     ass CP2
159 |     ssw CP3
160 |     swo CP4
161 |     wor CP5
162 |     ord CP6
163 |     rd EP
164 |     '''
165 |     def _count(self, kind):
166 |         if kind == "ip_list":
167 |             with open(self.training_file) as input_file:
168 |                 for line in 
tqdm(input_file, desc=self.training_file, total=self.training_file_lines, disable=self.disable_progress, miniters=1000, unit="pw"):
169 |                     line = line.rstrip('\r\n')
170 |                     if len(line) != self.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
171 |                         continue
172 |                     if self._is_in_alphabet(line): # Filter non-printable
173 |                         ngram = line[0:self.ngram_size-1] # Get IP ngram
174 |                         self.ip_list[self._n2iIP(ngram)] += 1 # Increase IP ngram count by 1
175 |         elif kind == "cp_list":
176 |             with open(self.training_file) as input_file: # Open training file
177 |                 for line in tqdm(input_file, desc=self.training_file, total=self.training_file_lines, disable=self.disable_progress, miniters=1000, unit="pw"):
178 |                     line = line.rstrip('\r\n')
179 |                     if len(line) != self.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
180 |                         continue
181 |                     if self._is_in_alphabet(line): # Filter non-printable
182 |                         old_pos = 0
183 |                         for new_pos in range(self.ngram_size, len(line)+1, 1): # Sliding window: pas|ass|ssw|swo|wor|ord
184 |                             ngram = line[old_pos:new_pos]
185 |                             old_pos += 1
186 |                             self.cp_list[self._n2iCP(ngram)] += 1 # Increase CP ngram count by 1
187 |         elif kind == "ep_list":
188 |             with open(self.training_file) as input_file: # Open training file
189 |                 for line in tqdm(input_file, desc=self.training_file, total=self.training_file_lines, disable=self.disable_progress, miniters=1000, unit="pw"):
190 |                     line = line.rstrip('\r\n')
191 |                     if len(line) != self.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
192 |                         continue
193 |                     if self._is_in_alphabet(line): # Filter non-printable
194 |                         ngram = line[-self.ngram_size+1:] # Get EP ngram
195 |                         self.ep_list[self._n2iIP(ngram)] += 1 # Increase EP ngram count by 1
196 |         else:
197 |             raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
198 | 
199 |     ########################################################################################################################
200 | 
201 |     # Determine the probability (based on the counts) of an ngram
202 |     def _prob(self, kind):
203 |         if kind == "ip_list":
204 |             no_ip_training_ngrams = 0.0 # must be a float
205 |             for ngram_count in self.ip_list:
206 |                 no_ip_training_ngrams += ngram_count
207 |             for index in range(0, len(self.ip_list)):
208 |                 self.ip_list[index] = self.ip_list[index] / no_ip_training_ngrams # count / all
209 |             # Validate that prob sums to 1.0, otherwise coding error. Check for rounding errors using Decimal(1.0) instead of float(1.0)
210 |             sum = 0.0
211 |             for ngram_prob in self.ip_list:
212 |                 sum += ngram_prob
213 |             logging.debug("IP probability sum: {0:.16f}".format(sum))
214 |             if not self._is_almost_equal(sum, 1.0):
215 |                 raise Exception("ip_list probabilities do not sum up to 1.0! It is only: {}".format(sum))
216 |         elif kind == "cp_list":
217 |             for index in range(0, len(self.cp_list), self.alphabet_len):
218 |                 no_cp_training_ngrams = 0.0 # must be a float
219 |                 for x in range(index, index+self.alphabet_len):
220 |                     no_cp_training_ngrams += self.cp_list[x] # Count all ngram occurrences within one ngram-1 category
221 |                 for x in range(index, index+self.alphabet_len):
222 |                     self.cp_list[x] = self.cp_list[x] / no_cp_training_ngrams # count / all (of current [x])
223 |             # Validate that prob sums to 1.0, otherwise coding error. 
Check for rounding errors using Decimal(1.0) instead of float(1.0)
224 |             '''
225 |             sum = 0.0
226 |             for x in range(index, index+self.alphabet_len):
227 |                 sum += self.cp_list[x]
228 |             #logging.debug("CP probability sum: {0:.16f}".format(sum))
229 |             if not self._is_almost_equal(sum, 1.0):
230 |                 raise Exception("cp_list probabilities do not sum up to 1.0! It is only: {}".format(sum))
231 |             '''
232 |         elif kind == "ep_list":
233 |             no_ep_training_ngrams = 0.0 # must be a float
234 |             for ngram_count in self.ep_list:
235 |                 no_ep_training_ngrams += ngram_count
236 |             for index in range(0, len(self.ep_list)):
237 |                 self.ep_list[index] = self.ep_list[index] / no_ep_training_ngrams # count / all
238 |             # Validate that prob sums to 1.0, otherwise coding error. Check for rounding errors using Decimal(1.0) instead of float(1.0)
239 |             sum = 0.0
240 |             for ngram_prob in self.ep_list:
241 |                 sum += ngram_prob
242 |             logging.debug("EP probability sum: {0:.16f}".format(sum))
243 |             if not self._is_almost_equal(sum, 1.0):
244 |                 raise Exception("ep_list probabilities do not sum up to 1.0! It is only: {}".format(sum))
245 |         else:
246 |             raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
247 | 
248 |     ########################################################################################################################
249 | 
250 |     '''
251 |     CP cPickle Storing the data on disk took: 0:01:18.987257 # Native?
252 |     CP simplejson Storing the data on disk took: 0:01:14.158285 # pip install simplejson
253 |     CP ujson Storing the data on disk took: 0:01:05.501812 # pip install ujson
254 |     CP cbor Storing the data on disk took: 0:00:17.168384 # pip install cbor
255 |     CP cbor2 Storing the data on disk took: 0:00:12.584272 # pip install cbor2
256 |     CP marshal Storing the data on disk took: 0:00:14.355625 # Native?
257 |     CP umsgpack Storing the data on disk took: 0:00:11.805770 # pip install u-msgpack-python
258 |     Loading the data from disk took: 0:00:17.505519
259 |     CP msgpack Storing the data on disk took: 0:00:07.918690 # pip install msgpack
260 |     ValueError: ('%s exceeds max_array_len(%s)', 804357, 131072)
261 |     '''
262 | 
263 |     def save(self, kind):
264 |         start = datetime.datetime.now()
265 |         logging.debug("Start: Writing result to disk, this gonna take a while ...")
266 |         path, file = os.path.split(self.training_file)
267 |         with open('trained/'+file[:-4]+'_'+kind+'_'+str(self.ngram_size)+'_'+str(self.length)+'.pack', 'wb') as fp:
268 |             if kind == "ip_list":
269 |                 umsgpack.dump(self.ip_list, fp)
270 |             elif kind == "cp_list":
271 |                 umsgpack.dump(self.cp_list, fp)
272 |             elif kind == "ep_list":
273 |                 umsgpack.dump(self.ep_list, fp)
274 |             else:
275 |                 raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
276 |         logging.debug("Done! Everything stored on disk.")
277 |         logging.debug("Storing the data on disk took: {}".format(datetime.datetime.now()-start))
278 | 
279 |     def load(self, kind):
280 |         start = datetime.datetime.now()
281 |         path, file = os.path.split(self.training_file)
282 |         with open('trained/'+file[:-4]+'_'+kind+'_'+str(self.ngram_size)+'_'+str(self.length)+'.pack', 'rb') as fp:
283 |             if kind == "ip_list":
284 |                 self.ip_list = umsgpack.load(fp)
285 |             elif kind == "cp_list":
286 |                 self.cp_list = umsgpack.load(fp)
287 |             elif kind == "ep_list":
288 |                 self.ep_list = umsgpack.load(fp)
289 |             else:
290 |                 raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
291 |         logging.debug("Done! 
Everything loaded from disk.")
292 |         logging.debug("Loading the data from disk took: {}".format(datetime.datetime.now()-start))
293 | 
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | # NEMO: Modeling Password Guessability Using Markov Models
2 | 
3 | ### tl;dr
4 | This is our ongoing effort to build probabilistic password models using Markov models.
5 | Common use cases include:
6 | * Strength estimation
7 | * Guessing
8 | * (Adaptive) Natural Language Encoders
9 | * ...
10 | 
11 | ### WARNING
12 | - This is research-quality code that should only be used for a proof of concept (PoC).
13 | - We share this code in the hope that the research community can benefit from it. Please share your code, too! :heart_eyes:
14 | - We recommend running this software using [PyPy](https://pypy.org/download.html) (see performance stats below).
15 | 
16 | ### About NEMO
17 | The scope of this project is not limited to passwords; this software has also been used in the context of other human-chosen secrets like Emoji, PINs, and Android unlock patterns.
18 | 
19 | The architecture of the software is inspired by [OMEN](https://github.com/RUB-SysSec/OMEN). More background information about OMEN can be found [here](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/omen/) and [here](https://www.mobsec.ruhr-uni-bochum.de/media/mobsec/arbeiten/2014/12/12/2013-ma-angelstorf-omen.pdf). An excellent Python implementation of OMEN, called `py_omen`, by [Matthew Weir](https://dblp.uni-trier.de/pers/hd/w/Weir:Matt) ([@lakiw](https://twitter.com/lakiw)) can be found [here](https://github.com/lakiw/py_omen).
20 | 
21 | #### Difference to OMEN
22 | OMEN makes use of so-called levels (a form of binning). This implementation does not. Thus, efficient enumeration of password candidates (guessing passwords as OMEN does) is not possible out of the box if the key space becomes too big. However, because of the non-binned output, this software has other advantages; for example, it can produce [more accurate strength estimates](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/accuracy-password-strength-meters/).
23 | 
24 | #### Overview: Markov Model-Based Password Guessing
25 | * In 2005, [Arvind Narayanan](https://dblp.uni-trier.de/pers/hd/n/Narayanan:Arvind) and [Vitaly Shmatikov](https://dblp.uni-trier.de/pers/hd/s/Shmatikov:Vitaly) proposed the use of Markov models to overcome some problems of dictionary-based password guessing attacks in their work [Fast Dictionary Attacks on Passwords Using Time-Space Tradeoff](https://www.cs.cornell.edu/~shmat/shmat_ccs05pwd.pdf). The idea behind Markov models is based on the observation that subsequent tokens, such as letters in a text, are rarely independently chosen and can often be accurately modeled based on a short history of tokens.
26 | 
27 | * In 2008, the popular password cracker [John the Ripper](https://www.openwall.com/john/) introduced a `-markov` mode. More details can be found [here](https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/doc/MARKOV), [here](https://openwall.info/wiki/john/markov), and [here](https://github.com/RUB-SysSec/Password-Guessing-Framework/blob/master/src/scripts/JTR_MARKOV.sh). 
[Simon Marechal](https://dblp.uni-trier.de/pers/hd/m/Marechal:Simon) ([@bartavelle](https://twitter.com/bartavelle)) compared this Markov model-based approach with various other guessing techniques in his work [Advances in Password Cracking](https://link.springer.com/article/10.1007/s11416-007-0064-y).
28 | 
29 | * In 2010, [Dell’Amico et al.](https://dblp.uni-trier.de/pers/hd/d/Dell=Amico:Matteo) used a Markov model-based approach to guess passwords in their work [Measuring Password Strength: An Empirical Analysis](https://arxiv.org/pdf/0907.3402.pdf).
30 | 
31 | * In 2012 and 2015, [Castelluccia et al.](https://dblp.uni-trier.de/pers/hd/c/Castelluccia:Claude) and [Dürmuth et al.](https://dblp.uni-trier.de/pers/hd/d/D=uuml=rmuth:Markus), respectively, improved the concept by generating password candidates according to their occurrence probabilities, i.e., by guessing the most likely passwords first. Please refer to their works, [Adaptive Password-Strength Meters from Markov Models](https://www.ei.ruhr-uni-bochum.de/media/ei/veroeffentlichungen/2016/01/15/2012-ndss-pwd-strength.pdf), [OMEN: Faster Password Guessing Using an Ordered Markov Enumerator](https://hal.archives-ouvertes.fr/hal-01112124/document), and [When Privacy Meets Security: Leveraging Personal Information for Password Cracking](https://arxiv.org/pdf/1304.6584.pdf) for more details.
32 | 
33 | * In 2014, [Ma et al.](https://dblp.uni-trier.de/pers/hd/m/Ma:Jerry) discussed other sources for improvements such as smoothing, backoff models, and issues related to data sparsity in their excellent work [A Study of Probabilistic Password Models](https://www.ieee-security.org/TC/SP2014/papers/AStudyofProbabilisticPasswordModels.pdf).
34 | 
35 | * In 2015, [Matteo Dell’Amico](https://dblp.uni-trier.de/pers/hd/d/Dell=Amico:Matteo) and [Maurizio Filippone](https://dblp.uni-trier.de/pers/hd/f/Filippone:Maurizio) published their work on [Monte Carlo Strength Evaluation: Fast and Reliable Password Checking](http://www.eurecom.fr/~filippon/Publications/ccs15.pdf). Their [*backoff*](https://github.com/matteodellamico/montecarlopwd) Markov model can be found on GitHub, too. :heart_eyes:
36 | 
37 | * In 2015, [Ur et al.](https://dblp.uni-trier.de/pers/hd/u/Ur:Blase) compared various password cracking methods in their work [Measuring Real-World Accuracies and Biases in Modeling Password Guessability](https://www.blaseur.com/papers/sec15-guessability.pdf). For Markov model-based attacks they used a copy of Ma et al.'s code, which is now available via Carnegie Mellon University's [Password Guessability Service (PGS)](https://pgs.ece.cmu.edu/) where it is called "Markov Model: wordlist-order5-smoothed."
38 | 
39 | * In 2016, [Melicher et al.](https://dblp.uni-trier.de/pers/hd/m/Melicher:William) compared their RNN-based approach to a Markov model in their work [Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks](https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/melicher). While some details are missing, their [model can be found on GitHub](https://github.com/cupslab/neural_network_cracking/blob/master/markov_model.py), too. :heart_eyes:
40 | 
41 | #### Publications
42 | In the past, we used different versions of this code in the following publications: :bowtie:
43 | * IEEE SP 2019: [Reasoning Analytically About Password-Cracking Software](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/reasoning-analytically-about-password-cracking/) (`Markov: Multi`)
44 | * ACM CCS 2018: [On the Accuracy of Password Strength Meters](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/accuracy-password-strength-meters/) (`ID: 4B/4C Markov (Single/Multi)`)
45 | * ACM CCS 2016: [On the Security of Cracking-Resistant Password Vaults](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/cracking-resistant-password-vaults/) (`Markov Model`)
46 | 
47 | A simpler version of this code has been used for other user-chosen secrets such as [Emoji](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/quantifying-security-emoji-based-authentication/) and [Android unlock patterns](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/accuracy-android-pattern-strength-meters/).
48 | 
49 | ### Design Decisions
50 | Warning: Markov models are memory-eating monsters! :smiling_imp:
51 | 
52 | We use three copies of a data structure (in the past: Python OrderedDicts(), today: plain Python lists) to store the frequencies of the *n-grams* in the training corpus. We use:
53 | 
54 | - IP: Initial probabilities (ngram_size - 1)
55 | - CP: Conditional probabilities
56 | - EP: End probabilities (ngram_size - 1)
57 | 
58 | Here is an example for 3-grams:
59 | ```
60 | password PW
61 | 
62 | pa IP (some literature uses this annotation: ^pa)
63 | 
64 | pas CP1
65 | ass CP2
66 | ssw CP3
67 | swo CP4
68 | wor CP5
69 | ord CP6
70 | 
71 | rd EP (some literature uses this annotation: rd$)
72 | ```
73 | 
74 | #### How Big Are They?
75 | 
76 | ```
77 | IP: alphabet_length ^ (ngram_size - 1)
78 | CP: alphabet_length ^ ngram_size
79 | EP: alphabet_length ^ (ngram_size - 1)
80 | ```
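
To make these numbers concrete: with the `dev.json` settings (an 85-character alphabet and `ngram_size` 4), the lists hold 85^3 = 614,125 IP/EP entries and 85^4 = 52,200,625 CP entries. The following minimal sketch (hypothetical helper code, not part of NEMO) splits a password into its IP, CP, and EP *n-grams* and evaluates the size formulas:

```
# Hypothetical sketch, not part of the code base: split a password into
# its IP, CP, and EP n-grams and compute the theoretical list sizes.
def split_ngrams(password, ngram_size=3):
    ip = password[:ngram_size - 1]                               # "pa"
    cps = [password[i:i + ngram_size]
           for i in range(len(password) - ngram_size + 1)]       # "pas", "ass", ...
    ep = password[-(ngram_size - 1):]                            # "rd"
    return ip, cps, ep

print(split_ngrams("password"))
# ('pa', ['pas', 'ass', 'ssw', 'swo', 'wor', 'ord'], 'rd')

alphabet_length, ngram_size = 85, 4         # values from configs/dev.json
print(alphabet_length ** (ngram_size - 1))  # len(IP) = len(EP) = 614125
print(alphabet_length ** ngram_size)        # len(CP) = 52200625
```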
81 | 
82 | #### Some Details For Those Interested:
83 | 
84 | :nerd_face:
85 | 
86 | __*n-gram* size__: Currently, we support 2,3,4,5-grams. The higher the order of the Markov chains, the more accurate the model becomes. Unfortunately, this also introduces the risk of overfitting and sparsity. If one does not have enough training data, e.g., when using the model with Android unlock patterns, computing the transition probabilities from such small counts becomes too noisy. While we only support fixed-order Markov chains, we recommend Dell’Amico and Filippone's [*backoff*](https://github.com/matteodellamico/montecarlopwd) model for variable-order Markov chains.
87 | 
88 | __Smoothing__: Currently, we only support additive smoothing (add '1' to the counts), also known as Laplace smoothing.
89 | 
90 | __Alphabet__: We tested this software with ASCII passwords only. Using non-ASCII passwords likely requires dropping support for Python 2 first. Hint: You can use the `info.py` script in the `utils` folder to determine the alphabet.
91 | 
92 | ### Development
93 | In early versions of this code, we made heavy use of Python's (Ordered)-Dictionary class. Fun fact: As of Python 3.7 [dictionaries are always ordered](https://mail.python.org/pipermail/python-dev/2017-December/151283.html) :)
94 | 
95 | ```
96 | cp_dict_full:
97 | key: "aaaa", value: 0.0071192...
98 | key: "aaab", value: 0.0034128...
99 | ...
100 | ```
101 | 
102 | A few months later, we optimized the memory consumption by only storing *n-grams* that really occur in the training corpus. If a rare *n-gram* like the 4-gram `o9py` does not occur in the training file, we used to return a very small default probability instead. This helped quite a lot to reduce the required memory; still, like Google Chrome, our solution easily occupied more than __20GB of RAM__. :poop:
103 | 
104 | ```
105 | cp_dict_sparse:
106 | key: "assw", value: 0.0838103...
107 | key: "sswo", value: 0.0954193...
108 | ...
109 | ```
110 | 
111 | Thus, we decided to refactor the code again to further limit the required memory, today down to approx. __16GB of RAM__.
112 | Today, we use simple Python lists to store the *n-gram* probabilities in memory.
113 | However, this forced us to come up with an `ngram-to-listindex` function, which differs for CP compared to IP/EP.
114 | 
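Conceptually, these index helpers interpret an *n-gram* as a number written in base `alphabet_len`. The code base unrolls this computation per *n-gram* size (`_n2iIP()`, `_n2iCP()`, `_i2nIP()`, and `_i2nCP()` in `ngram_creator.py`); the following generic sketch (hypothetical, assuming `alphabet_dict` maps each character to its position and `alphabet_list` is the inverse mapping, as in `NGramCreator`) shows the underlying idea:

```
# Hypothetical generalization of the unrolled _n2i*/_i2n* helpers:
# treat an n-gram as a base-alphabet_len number.
def n2i(ngram, alphabet_dict, alphabet_len):
    index = 0
    for char in ngram:  # Horner's scheme: first char is the most significant digit
        index = index * alphabet_len + alphabet_dict[char]
    return index

def i2n(index, n, alphabet_list, alphabet_len):
    chars = []
    for _ in range(n):  # peel off digits, least significant first
        index, remainder = divmod(index, alphabet_len)
        chars.append(alphabet_list[remainder])
    return ''.join(reversed(chars))
```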
115 | ```
116 | 
117 | _n2i(): ngram, e.g., "assw" to index in list, e.g., ngram_cp_list[87453]
118 | _i2n(): index in list, e.g., ngram_cp_list[87453] to ngram, e.g., "assw"
119 | 
120 | cp_list_full:
121 | index: 0, value: 0.0071192... | ("aaaa")
122 | index: 1, value: 0.0034128... | ("aaab")
123 | ...
124 | index: 87453, value: 0.0838103... | ("assw")
125 | ...
126 | index: 8133135, value: 0.0954193... | ("sswo")
127 | ...
128 | ```
129 | 
130 | The current version of the code supports this operation for 2,3,4, and 5-grams.
131 | Fortunately, while this approach achieves the desired memory savings, the additional function call does not increase the runtime significantly compared to the O(1) HashMap access offered by Python dictionaries.
132 | 
133 | ### Performance Testing
134 | 
135 | We highly recommend replacing Python with [PyPy](https://pypy.org/download.html) before using this software. :100: :thumbsup:
136 | ```
137 |                     MEMORY      TIME
138 | # PYTHON 2
139 | CPython 2.7.10      15.36GB     53m 27s
140 | PyPy2 7.1.1         5.88GB      3m 8s     (based on Python 2.7.13)  <- Highly recommended
141 | 
142 | # PYTHON 3
143 | CPython 3.7.3       14.47GB     12m 34s
144 | CPython 3.6.5       14.49GB     13m 13s
145 | PyPy3 7.1.1         7.33GB      2m 13s    (based on Python 3.6.1)   <- Highly recommended
146 | ```
147 | 
148 | ### Getting Started
149 | #### Folder Structure
150 | 
151 | ```
152 | .
153 | ├── README.md
154 | ├── configs
155 | │   ├── configure.py
156 | │   ├── dev.json
157 | │   └── main.json
158 | ├── docs
159 | │   ├── CHANGELOG.md
160 | │   └── LICENSE
161 | ├── input
162 | ├── log
163 | │   └── multiprocessinglog.py
164 | ├── meter.py
165 | ├── ngram
166 | │   └── ngram_creator.py
167 | ├── requirements.txt
168 | ├── results
169 | ├── train.py
170 | ├── trained
171 | │   ├── training_cp_list__.pack
172 | │   ├── training_ep_list__.pack
173 | │   └── training_ip_list__.pack
174 | └── utils
175 |     ├── info.py
176 |     └── sortresult.py
177 | ```
178 | 
179 | #### Installation
180 | 
181 | Install PyPy (for Python 2 or better Python 3), and create a virtual environment just to keep your system light and clean:
182 | 
183 | `$ virtualenv -p $(which pypy) nemo-venv`
184 | ```
185 | Running virtualenv with interpreter /usr/local/bin/pypy
186 | New pypy executable in /home//nemo-venv/bin/pypy
187 | Installing setuptools, pip, wheel...
188 | done. 
```
189 | 
190 | Activate the new virtual environment:
191 | 
192 | `$ source nemo-venv/bin/activate`
193 | 
194 | Now clone the repo:
195 | 
196 | `(nemo-venv) $ git clone https://github.com/RUB-SysSec/NEMO.git`
197 | 
198 | Change into the newly cloned folder:
199 | 
200 | `(nemo-venv) $ cd NEMO`
201 | 
202 | Now install the requirements:
203 | 
204 | `(nemo-venv) $ pip install -r requirements.txt`
205 | 
206 | This includes:
207 | - `tqdm` # for a fancy progress bar
208 | - `u-msgpack-python` # required to store/load the trained model to/from disk
209 | - `rainbow_logging_handler` # for colorful log messages
210 | 
211 | #### Dataset
212 | While the Markov model can be used for a variety of things, in the following we focus on a simple **strength meter use case**.
213 | 
214 | For this, you will need two files:
215 | 
216 | - `input/training.txt`: Contains the passwords that you would like to use to train your Markov model.
217 | - `input/eval.txt`: Contains the passwords whose guessability you would like to estimate.
218 | 
219 | I will not share any of those password files, but using the "RockYou" or the "LinkedIn" password leak sounds like a great idea. Make sure to clean and (ASCII) filter the files to optimize the performance.
220 | 
221 | For optimal accuracy, consider training with a password distribution that is similar to the one you would like to evaluate (e.g., a 90%/10% split). Please do not train on a dictionary / word list; this won't work. :stuck_out_tongue_winking_eye: You need a real password distribution, i.e., one including duplicates.
222 | 
223 | - The file must be placed in the `input` folder.
224 | - One password per line.
225 | - The file must be a real password distribution (not a dictionary / word list), i.e., it must contain multiplicities.
226 | - All passwords that are shorter or longer than the specified `lengths` will be ignored.
227 | - All passwords that contain characters which are not in the specified `alphabet` will be ignored.
228 | 
229 | During development, we tested our code with a file that contained ~10m passwords.
230 | 
231 | #### Configuration
232 | Before training, you need to provide a configuration file.
233 | You can specify which configuration file to use by editing the following line in `configure.py` in the `configs` folder:
234 | 
235 | ```
236 | with open('./configs/dev.json', 'r') as configfile:
237 | ```
238 | 
239 | Here is the default content of `dev.json`; feel free to edit the file as you like.
240 | ```
241 | {
242 |     "name" : "Development",
243 |     "eval_file" : "eval.txt",
244 |     "training_file" : "training.txt",
245 |     "alphabet" : "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%",
246 |     "lengths" : [4,6,8],
247 |     "ngram_size" : 4,
248 |     "no_cpus" : 8,
249 |     "progress_bar": true
250 | }
251 | ```
252 | 
253 | Please note: You can use the `info.py` script in the `utils` folder to learn the alphabet of your training / evaluation file.
254 | 
255 | For example run:
256 | 
257 | `(nemo-venv) $ pypy utils/info.py input/eval.txt`
258 | 
259 | ```
260 | File: eval.txt
261 | Min length: 3
262 | Max length: 23
263 | Observed password lengths: [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23]
264 | Alphabet (escaped for Python, but watch out for the space char): "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%"
265 | Alphabet length: 85
266 | ASCII only: Yes
267 | ```
268 | 
269 | If you encounter any issues, go to `train.py` and change the `train()` function from multi to single processing. 
This way, it is easier to debug the actual problem. 267 | 268 | #### Training 269 | 270 | ##### Training Requirements 271 | * ~2-5 minutes 272 | * ~8 threads (4 cores + hyper-threading) 273 | * ~16GB of RAM 274 | * ~6GB of disk space 275 | 276 | ##### How to Train the Model 277 | To train the model run: 278 | 279 | `(nemo-venv) $ pypy train.py` 280 | 281 | Once the training is done, you should have multiple `*.pack` files in the `trained` folder. We use a lightweight [MessagePack](https://github.com/vsergeev/u-msgpack-python) implementation to serialize the model. 282 | 283 | A successful training looks like this: 284 | 285 | ``` 286 | Start: 2019-07-06 15:54:13 287 | 288 | Press Ctrl+C to shutdown 289 | [15:54:13.239] configure.py Line 30 __init__(): Constructor started for 'My Config' 290 | [15:54:13.241] train.py Line 76 train(): Training started ... 291 | [15:54:13.242] ngram_creator.py Line 24 __init__(): Constructor started for 'NGramCreator, Session: Development, Length: 4, Progress bar: True' 292 | [15:54:13.242] ngram_creator.py Line 33 __init__(): Used alphabet: ae10i2onrls9384t5m67cdyhubkgpjvfzwAxEONIRSLM.TC_DqBHYUKPJG!-*F @VWXZ/,#+&?$Q)<'=;^[(%\~]`:|"> 293 | [15:54:13.242] ngram_creator.py Line 35 __init__(): Model string length: 4 294 | [15:54:13.242] ngram_creator.py Line 38 __init__(): NGram size: 4 295 | [15:54:15.315] ngram_creator.py Line 48 __init__(): len(IP) theo: 804357 296 | [15:54:15.315] ngram_creator.py Line 49 __init__(): len(CP) theo: 74805201 => 804357 * 93 297 | [15:54:15.315] ngram_creator.py Line 50 __init__(): len(EP) theo: 804357 298 | 299 | [15:54:15.315] train.py Line 29 worker(): ip_list init() ... 300 | [15:54:15.343] train.py Line 32 worker(): ip_list count() ... 301 | input/training.txt: 100%|███████████████████████████████████████████████████████████████████| 10000000/10000000 [00:03<00:00, 2916292.68pw/s] 302 | [15:54:18.776] train.py Line 35 worker(): ip_list prob() ... 303 | [15:54:18.794] ngram_creator.py Line 213 _prob(): IP probability sum: 1.0000000000141687 304 | [15:54:18.794] train.py Line 38 worker(): ip_list save() ... 305 | [15:54:18.794] ngram_creator.py Line 265 save(): Start: Writing result to disk, this gonna take a while ... 306 | [15:54:19.022] ngram_creator.py Line 276 save(): Done! Everything stored on disk. 307 | [15:54:19.023] ngram_creator.py Line 277 save(): Storing the data on disk took: 0:00:00.228256 308 | [15:54:19.023] train.py Line 41 worker(): Training IP done ... 309 | 310 | [15:54:19.023] train.py Line 44 worker(): cp_list init() ... 311 | [15:54:21.722] train.py Line 47 worker(): cp_list count() ... 312 | input/training.txt: 100%|███████████████████████████████████████████████████████████████████| 10000000/10000000 [00:03<00:00, 2995344.77pw/s] 313 | [15:54:25.063] train.py Line 50 worker(): cp_list prob() ... 314 | [15:54:25.893] train.py Line 53 worker(): cp_list save() ... 315 | [15:54:25.893] ngram_creator.py Line 265 save(): Start: Writing result to disk, this gonna take a while ... 316 | [15:54:45.189] ngram_creator.py Line 276 save(): Done! Everything stored on disk. 317 | [15:54:45.189] ngram_creator.py Line 277 save(): Storing the data on disk took: 0:00:19.295808 318 | [15:54:45.190] train.py Line 56 worker(): Training CP done ... 319 | 320 | [15:54:45.190] train.py Line 59 worker(): ep_list init() ... 321 | [15:54:45.211] train.py Line 62 worker(): ep_list count() ... 
input/training.txt: 100%|███████████████████████████████████████████████████████████████████| 10000000/10000000 [00:03<00:00, 3005917.73pw/s]
323 | [15:54:48.542] train.py Line 65 worker(): ep_list prob() ...
324 | [15:54:48.553] ngram_creator.py Line 242 _prob(): EP probability sum: 1.0000000000141684
325 | [15:54:48.553] train.py Line 68 worker(): ep_list save() ...
326 | [15:54:48.554] ngram_creator.py Line 265 save(): Start: Writing result to disk, this gonna take a while ...
327 | [15:54:48.781] ngram_creator.py Line 276 save(): Done! Everything stored on disk.
328 | [15:54:48.782] ngram_creator.py Line 277 save(): Storing the data on disk took: 0:00:00.227519
329 | [15:54:48.782] train.py Line 71 worker(): Training EP done ...
330 | 
331 | [15:54:55.686] ngram_creator.py Line 53 __del__(): Destructor started for 'NGramCreator, Session: Development, Length: 4, Progress bar: True'
332 | ...
333 | Done: 2019-07-06 15:56:11
334 | ```
335 | 
336 | 
337 | #### Strength Estimation
338 | After training, we can use the model, for example, to estimate the strength of a list of passwords that originate from a similar password distribution.
339 | To do so, please double check that your `eval_file` is specified correctly in your configuration `.json`.
340 | 
341 | For the strength estimation, we will read the trained *n-gram* frequencies from disk and then evaluate all passwords from the specified `eval_file`.
342 | 
343 | `(nemo-venv) $ pypy meter.py`
344 | 
345 | The result of this strength estimation can be found in the `results` folder in a file called `eval_result.txt`.
346 | 
347 | ```
348 | ...
349 | 1.7228127641947414e-13 funnygirl2
350 | 4.03572701534676e-13 single42
351 | 3.669804567773374e-16 silkk
352 | 3.345752850966769e-11 car345
353 | 6.9565427286338e-11 password1991
354 | 4.494395283171681e-12 abby28
355 | 3.1035094651948957e-13 1595159
356 | 7.936477209731241e-13 bhagwati
357 | 1.3319042593247044e-22 natt4evasexy
358 | 1.5909371909986554e-15 curbside
359 | ...
360 | ```
361 | 
362 | The values are tab (`\t`) separated.
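
Each value is the product of the password's initial probability (IP), its sliding-window conditional probabilities (CPs), and its end probability (EP), as computed in the `eval()` loop of `meter.py`. A minimal sketch of that scoring step (assuming `m` is an `NGramCreator` instance with its lists already loaded, and `pw` matches `m.length`):

```
# Sketch of the scoring performed in meter.py's eval() loop, assuming a
# loaded NGramCreator instance `m` and a password `pw` of matching length.
def score(m, pw):
    pw_prob = m.ip_list[m._n2iIP(pw[:m.ngram_size - 1])]       # initial probability
    for i in range(len(pw) - m.ngram_size + 1):                # conditional probabilities
        pw_prob *= m.cp_list[m._n2iCP(pw[i:i + m.ngram_size])]
    pw_prob *= m.ep_list[m._n2iIP(pw[-(m.ngram_size - 1):])]   # end probability
    return pw_prob
```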
363 | You can use `sortresult.py` from the `utils` folder to sort the passwords.
364 | 
365 | For example run:
366 | 
367 | `(nemo-venv) $ pypy utils/sortresult.py results/eval_result.txt > results/eval_result_sorted.txt`
368 | 
369 | A successful strength estimation looks like this:
370 | 
371 | ```
372 | Start: 2019-07-06 16:07:58
373 | 
374 | Press Ctrl+C to shutdown
375 | [16:07:58.346] configure.py Line 30 __init__(): Constructor started for 'My Config'
376 | [16:07:58.349] ngram_creator.py Line 24 __init__(): Constructor started for 'Development'
377 | [16:07:58.349] ngram_creator.py Line 33 __init__(): Used alphabet: ae10i2onrls9384t5m67cdyhubkgpjvfzwAxEONIRSLM.TC_DqBHYUKPJG!-*F @VWXZ/,#+&?$Q)<'=;^[(%\~]`:|">
378 | [16:07:58.349] ngram_creator.py Line 35 __init__(): Model string length: 8
379 | [16:07:58.349] ngram_creator.py Line 38 __init__(): NGram size: 4
380 | [16:08:00.253] ngram_creator.py Line 48 __init__(): len(IP) theo: 804357
381 | [16:08:00.253] ngram_creator.py Line 49 __init__(): len(CP) theo: 74805201 => 804357 * 93
382 | [16:08:00.253] ngram_creator.py Line 50 __init__(): len(EP) theo: 804357
383 | 
384 | [16:08:00.253] meter.py Line 23 worker(): Thread: 8 - ip_list load() ...
385 | [16:08:00.438] ngram_creator.py Line 291 load(): Done! Everything loaded from disk.
386 | [16:08:00.439] ngram_creator.py Line 292 load(): Loading the data from disk took: 0:00:00.184483
387 | 
388 | [16:08:00.439] meter.py Line 25 worker(): Thread: 8 - cp_list load() ...
389 | [16:08:14.075] ngram_creator.py Line 291 load(): Done! Everything loaded from disk.
390 | [16:08:14.076] ngram_creator.py Line 292 load(): Loading the data from disk took: 0:00:13.635805
391 | 
392 | [16:08:14.076] meter.py Line 27 worker(): Thread: 8 - ep_list load() ...
393 | [16:08:14.224] ngram_creator.py Line 291 load(): Done! Everything loaded from disk.
394 | [16:08:14.224] ngram_creator.py Line 292 load(): Loading the data from disk took: 0:00:00.148400
395 | 
396 | [16:08:14.224] meter.py Line 29 worker(): Thread: 8 - Loading done ...
397 | [16:08:14.225] meter.py Line 55 eval(): Training loaded from disk ...
398 | ...
399 | 
400 | Info: No Markov model for this length: 13 jake1password
401 | Info: No Markov model for this length: 16 marasalvatrucha3
402 | ...
403 | Done: 2019-07-06 16:08:14
404 | 
405 | ```
406 | 
407 | ### FAQ
408 | 
409 | - Usage: ASCII pre-filter your input / eval files.
410 | 
411 | - Usage: Limit the alphabet `alphabet` (lower+upper+digits), the *n-gram* size `ngram_size` (3- or 4-grams), and the password lengths `lengths` (e.g., 6 or 8 character long passwords).
412 | 
413 | - Usage: Make sure you train on a real password distribution, not the kind of word list / dictionary one normally uses with tools like Hashcat / John the Ripper.
414 | 
415 | - Debugging: If you encounter any issues, go to `train.py` and change the `train()` function from multi to single processing. This way, it is easier to debug the actual problem.
416 | 
417 | - Debugging: In `configure.py` you can change the verbosity of the `rainbow_logging_handler` from `DEBUG` to `INFO` or `CRITICAL`.
418 | 
419 | 
420 | ### License
421 | 
422 | **NEMO** is licensed under the MIT license. Refer to [docs/LICENSE](docs/LICENSE) for more information.
423 | 
424 | ### Third-Party Libraries
425 | * **tqdm** is a library that can be used to display a progress meter. It is a "*product of collaborative work*" from multiple authors and is using the MIT license. The license and the source code can be found
426 | [here](https://tqdm.github.io/licence/).
427 | * **u-msgpack-python** is a lightweight MessagePack serializer developed by Ivan (Vanya) A. Sergeev and is using the MIT license. The
428 | source code and the license can be downloaded [here](https://github.com/vsergeev/u-msgpack-python#license).
429 | * **rainbow_logging_handler** is a colorized logger developed by Mikko Ohtamaa and Sho Nakatani. The authors released it as "*free and unencumbered public domain software*". The source code and "license" can be found [here](https://github.com/laysakura/rainbow_logging_handler).
430 | 
431 | ### Contact
432 | Visit our [website](https://www.mobsec.rub.de) and follow us on [Twitter](https://twitter.com/hgi_bochum). If you are interested in passwords, consider contributing and attending the [International Conference on Passwords (PASSWORDS)](https://passwordscon.org).
433 | 
--------------------------------------------------------------------------------