├── input └── .keep ├── results └── .keep ├── trained └── .keep ├── log ├── __init__.py └── multiprocessinglog.py ├── ngram ├── __init__.py └── ngram_creator.py ├── configs ├── __init__.py ├── dev.json ├── main.json └── configure.py ├── requirements.txt ├── .gitignore ├── utils ├── sortresult.py └── info.py ├── docs ├── LICENSE └── CHANGELOG.md ├── train.py ├── meter.py └── README.md /input/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /results/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /trained/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /log/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ngram/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /configs/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tqdm 2 | u-msgpack-python 3 | rainbow_logging_handler -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | trained/*.pack 6 | input/*.txt 7 | input/*.7z 8 | results/eval_result.txt -------------------------------------------------------------------------------- /configs/dev.json: -------------------------------------------------------------------------------- 1 | { 2 | "name" : "Development", 3 | "eval_file" : "eval.txt", 4 | "training_file" : "training.txt", 5 | "alphabet" : "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%", 6 | "lengths" : [4,6,8], 7 | "ngram_size" : 4, 8 | "no_cpus" : 8, 9 | "progress_bar": true 10 | } -------------------------------------------------------------------------------- /configs/main.json: -------------------------------------------------------------------------------- 1 | { 2 | "name" : "Yet Another Configuration", 3 | "eval_file" : "eval.txt", 4 | "training_file" : "training.txt", 5 | "alphabet" : "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%", 6 | "lengths" : [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23], 7 | "ngram_size" : 4, 8 | "no_cpus" : 8, 9 | "progress_bar": false 10 | } -------------------------------------------------------------------------------- /utils/sortresult.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env pypy 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' 5 | :author: Maximilian Golla 6 | :contact: maximilian.golla@rub.de 7 | :version: 0.7.1, 2019-07-11 8 | :description: Sorts passwords in the strength meter outfile 'eval_result.txt' by likelihood 9 | :usage: pypy utils/sortresult.py results/eval_result.txt > 
results/eval_result_sorted.txt 10 | ''' 11 | 12 | import sys 13 | 14 | # Read file 15 | with open(sys.argv[1], 'r') as inputfile: 16 | out = [] 17 | for line in inputfile: 18 | line = line.rstrip('\r\n') 19 | splitted = line.split('\t') 20 | if splitted[0].startswith("Info: No Markov model for this length:"): 21 | out.append((-1.0,splitted[1])) 22 | # Instead of adding them, you could also discard them 23 | # pass 24 | else: 25 | prob = float(splitted[0]) 26 | pw = splitted[1] 27 | out.append((prob,pw)) 28 | 29 | # Sort by prob 30 | out = sorted(out, key=lambda tup: tup[0], reverse=True) 31 | 32 | # Output 33 | for entry in out: 34 | prob = entry[0] 35 | pw = entry[1] 36 | print("{}\t{}".format(prob,pw)) -------------------------------------------------------------------------------- /docs/LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2019 Horst Goertz Institute for IT Security (Ruhr University Bochum) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /docs/CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | All notable changes to this project will be documented in this file. 3 | This project adheres to [Semantic Versioning](http://semver.org/). 
4 | 
5 | ## [Unreleased]
6 | ### Added
7 | - Natural Language Encoder (NLE)
8 | 
9 | ### Planned
10 | - Support for backoff model
11 | 
12 | ## [0.7.1] - 2019-07-11
13 | ### Fixed
14 | - Changed "Error" to "Info", if a Markov model of a specific size does not exist
15 | 
16 | ## [0.7.0] - 2019-07-11
17 | ### Added
18 | - Added support for configuration files
19 | 
20 | ## [0.6.0] - 2019-02-04
21 | ### Added
22 | - Adaptation to process Android unlock patterns
23 | 
24 | ## [0.5.0] - 2017-10-26
25 | ### Added
26 | - Rewrite to support a Markov model per password length
27 | 
28 | ## [0.4.0] - 2017-08-29
29 | ### Added
30 | - Complete rewrite using Python lists instead of Python OrderedDicts
31 | 
32 | ## [0.3.0] - 2016-12-11
33 | ### Added
34 | - Rewrite to process Emoji and PINs
35 | - Added support for efficient enumeration and 5-fold cross-validation
36 | 
37 | ## [0.2.0] - 2016-05-18
38 | ### Added
39 | - Added a Natural Language Encoder (`encode.py` and `decode.py`)
40 | 
41 | ## [0.1.0] - 2016-02-03
42 | ### Added
43 | - Initial version for NEMO
44 | - Main modules: training (`ngram_creator.py`)
45 | 
-------------------------------------------------------------------------------- /configs/configure.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env pypy
2 | # -*- coding: utf-8 -*-
3 | 
4 | ''' This script configures the Markov model
5 | :author: Maximilian Golla
6 | :contact: maximilian.golla@rub.de
7 | :version: 0.7.1, 2019-07-11
8 | '''
9 | 
10 | # Load external modules
11 | import sys, logging, json, datetime
12 | from threading import Thread
13 | 
14 | # Load own modules
15 | import multiprocessing
16 | from log.multiprocessinglog import *
17 | from ngram.ngram_creator import *
18 | 
19 | # Global variables
20 | mtlog = MultiProcessingLog('foo.log', 'a', 0, 0)
21 | logger = logging.getLogger()
22 | logger.addHandler(mtlog)
23 | logger.setLevel(logging.DEBUG) # DEBUG, INFO, CRITICAL
24 | logger = multiprocessing.log_to_stderr(logging.INFO) # DEBUG, INFO, CRITICAL
25 | 
26 | class Configure:
27 | 
28 |     def __init__(self, dict):
29 |         self.name = dict['name']
30 |         logging.debug("Constructor started for '{}'".format(self.name))
31 |         self._read_config()
32 |         # Settings are now available as attributes, e.g., self.EVAL_FILE
33 | 
34 |     def _read_config(self):
35 |         try:
36 |             with open('./configs/dev.json', 'r') as configfile:
37 |                 config = json.load(configfile)
38 |                 # These DEFAULTS are used if a key is missing from the config file
39 |                 self.NAME = config.get("name", "Demo")
40 |                 self.EVAL_FILE = config.get("eval_file", "eval.txt")
41 |                 self.TRAINING_FILE = config.get("training_file", "training.txt")
42 |                 self.ALPHABET = config.get("alphabet", "abcdefghijklmnopqrstuvwxyz")
43 |                 self.LENGTHS = config.get("lengths", [6,8])
44 |                 self.NGRAM_SIZE = config.get("ngram_size", 3)
45 |                 self.NO_CPUS = config.get("no_cpus", 8)
46 |                 self.PROGRESS_BAR = config.get("progress_bar", False)
47 |         except Exception as e:
48 |             sys.stderr.write("\x1b[1;%dm" % (31) + "Malformed config file: {}\n".format(e) + "\x1b[0m")
49 |             sys.exit(1)
50 | 
-------------------------------------------------------------------------------- /utils/info.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env pypy
2 | # -*- coding: utf-8 -*-
3 | 
4 | '''
5 | :author: Maximilian Golla
6 | :contact: maximilian.golla@rub.de
7 | :version: 0.7.1, 2019-07-11
8 | :description: Reports some statistics about password length, alphabet, ASCII encoding, etc. 
9 | :usage: pypy utils/info.py input/eval.txt 10 | ''' 11 | 12 | import sys 13 | import operator 14 | 15 | def is_ascii(s): 16 | return all((ord(c) >= 32 and ord(c) <= 126) for c in s) 17 | 18 | def main(): 19 | min_len = sys.maxsize 20 | max_len = -sys.maxsize - 1 21 | alphabet = dict() 22 | lengths = set() 23 | everything_ascii = "Yes" 24 | 25 | with open(sys.argv[1], 'r') as passwordfile: 26 | for line in passwordfile: 27 | line = line.rstrip('\r\n') 28 | length = len(line) 29 | for char in line: 30 | if char in alphabet: 31 | alphabet[char] += 1 32 | else: 33 | alphabet[char] = 1 34 | if length < min_len: 35 | min_len = length 36 | if length > max_len: 37 | max_len = length 38 | lengths.add(length) 39 | if is_ascii(line) == False: 40 | everything_ascii = "No" 41 | 42 | # Alphabet 43 | alphabet = sorted(alphabet.items(), key=operator.itemgetter(1), reverse=True) 44 | lengths = sorted(lengths) 45 | alpha = [] 46 | for e in alphabet: 47 | if e[0] == '"': 48 | alpha.append('\\"') # escape quotes 49 | elif e[0] == '\\': 50 | alpha.append('\\\\') # escape backslash 51 | else: 52 | alpha.append(e[0]) 53 | print("File: {}".format(sys.argv[1].split('/')[-1])) 54 | print("Min length: {}".format(min_len)) 55 | print("Max length: {}".format(max_len)) 56 | print("Observed password lengths: [{}]".format(','.join([str(x) for x in list(lengths)]))) 57 | print('Alphabet (escaped for Python, but watch out for the space char): "{}"'.format(''.join(alpha))) 58 | print("Alphabet length: {}".format(len(alphabet))) 59 | print("ASCII only: {}".format(everything_ascii)) 60 | 61 | if __name__ == '__main__': 62 | main() -------------------------------------------------------------------------------- /log/multiprocessinglog.py: -------------------------------------------------------------------------------- 1 | from logging.handlers import RotatingFileHandler 2 | import multiprocessing, threading, logging, sys, traceback 3 | from rainbow_logging_handler import RainbowLoggingHandler # pip install rainbow_logging_handler 4 | 5 | class MultiProcessingLog(logging.Handler): 6 | def __init__(self, name, mode, maxsize, rotate): 7 | logging.Handler.__init__(self) 8 | 9 | #self._handler = RotatingFileHandler(name, mode, maxsize, rotate) 10 | formatter = logging.Formatter("[%(asctime)s.%(msecs)03d] %(filename)16s Line %(lineno)3d %(funcName)s():\t %(message)s") 11 | self._handler = RainbowLoggingHandler(sys.stderr, color_funcName=('green', 'none', True)) 12 | self._handler.setFormatter(formatter) 13 | 14 | self.queue = multiprocessing.Queue(-1) 15 | 16 | t = threading.Thread(target=self.receive) 17 | t.daemon = True 18 | t.start() 19 | 20 | def setFormatter(self, fmt): 21 | logging.Handler.setFormatter(self, fmt) 22 | self._handler.setFormatter(fmt) 23 | 24 | def receive(self): 25 | while True: 26 | try: 27 | record = self.queue.get() 28 | self._handler.emit(record) 29 | except (KeyboardInterrupt, SystemExit): 30 | raise 31 | except EOFError: 32 | break 33 | except: 34 | traceback.print_exc(file=sys.stderr) 35 | 36 | def send(self, s): 37 | self.queue.put_nowait(s) 38 | 39 | def _format_record(self, record): 40 | # ensure that exc_info and args 41 | # have been stringified. 
Removes any chance of 42 | # unpickleable things inside and possibly reduces 43 | # message size sent over the pipe 44 | if record.args: 45 | record.msg = record.msg % record.args 46 | record.args = None 47 | if record.exc_info: 48 | dummy = self.format(record) 49 | record.exc_info = None 50 | 51 | return record 52 | 53 | def emit(self, record): 54 | try: 55 | s = self._format_record(record) 56 | self.send(s) 57 | except (KeyboardInterrupt, SystemExit): 58 | raise 59 | except: 60 | self.handleError(record) 61 | 62 | def close(self): 63 | self._handler.close() 64 | logging.Handler.close(self) -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env pypy 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' This script manages the training 5 | :author: Maximilian Golla 6 | :contact: maximilian.golla@rub.de 7 | :version: 0.7.1, 2019-07-11 8 | ''' 9 | 10 | # Load external modules 11 | from configs.configure import * 12 | 13 | ''' Generates a new ngram-object via init, count, prob, (save) ''' 14 | def worker(data): 15 | "This data was received by the process:" 16 | length = data[0] 17 | progress_bar = data[1] 18 | 19 | ngram_creator = NGramCreator({ 20 | "name": ("NGramCreator, Session: {}, Length: {}, Progress bar: {}".format(CONFIG.NAME, length, progress_bar)), 21 | "alphabet": CONFIG.ALPHABET, 22 | "ngram_size": CONFIG.NGRAM_SIZE, 23 | "training_file": "input/"+CONFIG.TRAINING_FILE, 24 | "length": length, 25 | "progress_bar": progress_bar 26 | }) 27 | 28 | # Initial probability (IP) 29 | logging.debug("ip_list init() ...") 30 | ngram_creator._init_lists("ip_list") 31 | 32 | logging.debug("ip_list count() ...") 33 | ngram_creator._count("ip_list") 34 | 35 | logging.debug("ip_list prob() ...") 36 | ngram_creator._prob("ip_list") 37 | 38 | logging.debug("ip_list save() ...") 39 | ngram_creator.save("ip_list") 40 | 41 | logging.debug("Training IP done ...") 42 | 43 | # Conditional probability (CP) 44 | logging.debug("cp_list init() ...") 45 | ngram_creator._init_lists("cp_list") 46 | 47 | logging.debug("cp_list count() ...") 48 | ngram_creator._count("cp_list") 49 | 50 | logging.debug("cp_list prob() ...") 51 | ngram_creator._prob("cp_list") 52 | 53 | logging.debug("cp_list save() ...") 54 | ngram_creator.save("cp_list") 55 | 56 | logging.debug("Training CP done ...") 57 | 58 | # End probability (EP) 59 | logging.debug("ep_list init() ...") 60 | ngram_creator._init_lists("ep_list") 61 | 62 | logging.debug("ep_list count() ...") 63 | ngram_creator._count("ep_list") 64 | 65 | logging.debug("ep_list prob() ...") 66 | ngram_creator._prob("ep_list") 67 | 68 | logging.debug("ep_list save() ...") 69 | ngram_creator.save("ep_list") 70 | 71 | logging.debug("Training EP done ...") 72 | 73 | ''' Manages the training ''' 74 | def train(): 75 | try: 76 | logging.debug("Training started ...") 77 | 78 | ''' Singleprocessing 79 | for length in CONFIG.LENGTHS: 80 | data = [length, CONFIG.PROGRESS_BAR] 81 | worker(data) 82 | ''' 83 | 84 | #''' Multiprocessing 85 | data = [] 86 | for length in CONFIG.LENGTHS: 87 | data.append([length, CONFIG.PROGRESS_BAR]) 88 | pool = multiprocessing.Pool(processes=CONFIG.NO_CPUS) 89 | pool.map(worker, data) 90 | pool.close() # no more tasks can be submitted to the pool 91 | pool.join() # wait for the worker processes to exit 92 | #''' 93 | 94 | except Exception as e: 95 | sys.stderr.write("\x1b[1;%dm" % (31) + "Training failed: {}\n".format(e) + "\x1b[0m") 96 | 
sys.exit(1)
97 | 
98 | def main():
99 |     try:
100 |         global CONFIG
101 |         CONFIG = Configure({"name":"My Config"})
102 |         train()
103 |     except KeyboardInterrupt:
104 |         print('User canceled')
105 |         sys.exit(1)
106 |     except Exception as e:
107 |         sys.stderr.write("\x1b[1;%dm" % (31) + "Error: {}\n".format(e) + "\x1b[0m")
108 |         sys.exit(1)
109 | 
110 | if __name__ == '__main__':
111 |     print("{0}: {1:%Y-%m-%d %H:%M:%S}\n".format("Start", datetime.datetime.now()))
112 |     print("Press Ctrl+C to shutdown")
113 |     main()
114 |     print("{0}: {1:%Y-%m-%d %H:%M:%S}".format("Done", datetime.datetime.now()))
115 | 
-------------------------------------------------------------------------------- /meter.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env pypy
2 | # -*- coding: utf-8 -*-
3 | 
4 | ''' This script loads the training and estimates the probability (strength) of some passwords
5 | :author: Maximilian Golla
6 | :contact: maximilian.golla@rub.de
7 | :version: 0.7.1, 2019-07-11
8 | '''
9 | 
10 | # Load external modules
11 | from configs.configure import *
12 | 
13 | ''' Loads the training data from disk '''
14 | def worker(length):
15 |     ngram_creator = NGramCreator({
16 |         "name": CONFIG.NAME,
17 |         "alphabet": CONFIG.ALPHABET,
18 |         "ngram_size": CONFIG.NGRAM_SIZE,
19 |         "training_file": "input/"+CONFIG.TRAINING_FILE,
20 |         "length": length,
21 |         "progress_bar": CONFIG.PROGRESS_BAR
22 |     })
23 |     logging.debug("Thread: {} - ip_list load() ...".format(length))
24 |     ngram_creator.load("ip_list")
25 |     logging.debug("Thread: {} - cp_list load() ...".format(length))
26 |     ngram_creator.load("cp_list")
27 |     logging.debug("Thread: {} - ep_list load() ...".format(length))
28 |     ngram_creator.load("ep_list")
29 |     logging.debug("Thread: {} - Loading done ...".format(length))
30 |     MARKOV_MODELS.append(ngram_creator)
31 | 
32 | ''' Every length has its own model; we select the correct model for every password '''
33 | def _select_correct_markov_model(pw_length, markov_models):
34 |     result = markov_models[0] # Fallback solution, if there is no model for the selected length
35 |     for model in markov_models:
36 |         if model.length == pw_length:
37 |             result = model
38 |     return result
39 | 
40 | ''' This function manages the password strength evaluation '''
41 | def eval():
42 |     # ngram creator
43 |     global MARKOV_MODELS
44 |     MARKOV_MODELS = []
45 |     threads = []
46 |     for length in CONFIG.LENGTHS:
47 |         # Using threads is not beneficial, because it is a disk-intensive task
48 |         thread = Thread(target = worker, args = (length,))
49 |         thread.start()
50 |         threads.append(thread)
51 |     # Wait for all threads to finish
52 |     for thread in threads:
53 |         thread.join()
54 | 
55 |     logging.debug("Training loaded from disk ...")
56 |     logging.debug("Number of Markov models: "+str(len(MARKOV_MODELS)))
57 |     fo = open("results/"+CONFIG.EVAL_FILE[:-4]+"_result.txt", "w") # [:-4] strips the '.txt' extension; rstrip('.txt') would also eat trailing 't'/'x' characters
58 |     with open("input/"+CONFIG.EVAL_FILE, 'r') as inputfile:
59 |         for line in inputfile:
60 |             line = line.rstrip('\r\n')
61 |             # Determine correct model
62 |             ngram_creator = _select_correct_markov_model(len(line), MARKOV_MODELS)
63 |             if len(line) != ngram_creator.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
64 |                 sys.stderr.write("\x1b[1;%dm" % (31) + "Info: No Markov model for this length: {} {}\n".format(len(line),line) + "\x1b[0m")
65 |                 fo.write("{} {}\t{}\n".format("Info: No Markov model for this length:", len(line), line))
66 |                 continue
67 |             if ngram_creator._is_in_alphabet(line): # Filter 
non-printable 68 | ip = line[:ngram_creator.ngram_size-1] 69 | ip_prob = ngram_creator.ip_list[ngram_creator._n2iIP(ip)] 70 | ep = line[len(line)-(ngram_creator.ngram_size-1):] 71 | ep_prob = ngram_creator.ep_list[ngram_creator._n2iIP(ep)] 72 | old_pos = 0 73 | cp_probs = [] 74 | for new_pos in range(ngram_creator.ngram_size, len(line)+1, 1): 75 | cp = line[old_pos:new_pos] 76 | cp_probs.append(ngram_creator.cp_list[ngram_creator._n2iCP(cp)]) 77 | old_pos += 1 78 | pw_prob = ip_prob * ep_prob 79 | for cp_prob in cp_probs: 80 | pw_prob = pw_prob * cp_prob 81 | fo.write("{}\t{}\n".format(pw_prob,line)) 82 | fo.flush() 83 | else: 84 | sys.stderr.write("\x1b[1;%dm" % (31) + "Info: Password contains invalid characters: {}\n".format(line) + "\x1b[0m") 85 | fo.write("{}\t{}\n".format("Info: Password contains invalid characters:", line)) 86 | continue 87 | fo.close() 88 | 89 | def main(): 90 | try: 91 | global CONFIG 92 | CONFIG = Configure({"name":"My Config"}) 93 | eval() 94 | except KeyboardInterrupt: 95 | print('User canceled') 96 | sys.exit(1) 97 | except Exception as e: 98 | sys.stderr.write("\x1b[1;%dm" % (31) + "Error: {}\n".format(e) + "\x1b[0m") 99 | sys.exit(1) 100 | 101 | if __name__ == '__main__': 102 | print("{0}: {1:%Y-%m-%d %H:%M:%S}\n".format("Start", datetime.datetime.now())) 103 | print("Press Ctrl+C to shutdown") 104 | main() 105 | print("{0}: {1:%Y-%m-%d %H:%M:%S}".format("Done", datetime.datetime.now())) 106 | -------------------------------------------------------------------------------- /ngram/ngram_creator.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env pypy 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' The Markov model 5 | :author: Maximilian Golla 6 | :contact: maximilian.golla@rub.de 7 | :version: 0.7.1, 2019-07-11 8 | ''' 9 | 10 | # External modules 11 | from collections import OrderedDict # storing the alphabet 12 | import os # load and save / file handling 13 | import umsgpack # load and save # pip install u-msgpack-python 14 | import math # only pow 15 | import logging # logging debug infos 16 | from rainbow_logging_handler import RainbowLoggingHandler # pip install rainbow_logging_handler 17 | from tqdm import tqdm # progress bar while reading the file # pip install tqdm 18 | import datetime 19 | 20 | class NGramCreator: 21 | 22 | def __init__(self, dict): 23 | self.name = dict['name'] 24 | logging.debug("Constructor started for '{}'".format(self.name)) 25 | self.alphabet = dict['alphabet'] 26 | self.alphabet_len = len(self.alphabet) 27 | self.alphabet_dict = OrderedDict.fromkeys(self.alphabet) #a 0, b 1, c 2 28 | i = 0 29 | for char in self.alphabet_dict: 30 | self.alphabet_dict[char] = i 31 | i += 1 32 | self.alphabet_list = list(self.alphabet) 33 | logging.debug("Used alphabet: {}".format(self.alphabet)) 34 | self.length = dict['length'] 35 | logging.debug("Model string length: {}".format(self.length)) 36 | self.ngram_size = dict['ngram_size'] 37 | assert self.ngram_size >= 2, "n-gram size < 2 does not make any sense! 
Your configured n-gram size is {}".format(self.ngram_size)
38 |         logging.debug("NGram size: {}".format(self.ngram_size))
39 |         self.training_file = dict['training_file']
40 |         self.training_file_lines = sum(1 for line in open(self.training_file))
41 |         self.disable_progress = False if dict['progress_bar'] else True
42 |         self.ip_list = []
43 |         self.cp_list = []
44 |         self.ep_list = []
45 |         self.no_ip_ngrams = int(math.pow(self.alphabet_len, (self.ngram_size-1)))
46 |         self.no_cp_ngrams = int(math.pow(self.alphabet_len, (self.ngram_size)))
47 |         self.no_ep_ngrams = self.no_ip_ngrams # save one exponentiation :-P
48 |         logging.debug("len(IP) theo: {}".format(self.no_ip_ngrams))
49 |         logging.debug("len(CP) theo: {} => {} * {}".format(self.no_cp_ngrams, int(math.pow(self.alphabet_len, (self.ngram_size-1))), self.alphabet_len))
50 |         logging.debug("len(EP) theo: {}".format(self.no_ep_ngrams))
51 | 
52 |     def __del__(self):
53 |         logging.debug("Destructor started for '{}'".format(self.name))
54 | 
55 |     def __str__(self):
56 |         return "Hello {}!".format(self.name)
57 | 
58 |     ########################################################################################################################
59 | 
60 |     def _is_in_alphabet(self, string):
61 |         for char in string:
62 |             if not char in self.alphabet:
63 |                 return False
64 |         return True
65 | 
66 |     # Checks whether two floats are approximately equal, e.g., 1.0 == 1.0
67 |     def _is_almost_equal(self, a, b, rel_tol=1e-09, abs_tol=0.0):
68 |         #print '{0:.16f}'.format(a), '{0:.16f}'.format(b)
69 |         return abs(a-b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
70 | 
71 |     ########################################################################################################################
72 | 
73 |     # ngram-to-initial-prob-index
74 |     def _n2iIP(self, ngram):
75 |         ngram = list(ngram)
76 |         if self.ngram_size == 5:
77 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[3]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**3 * self.alphabet_dict[ngram[0]] )
78 |         if self.ngram_size == 4:
79 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[0]] )
80 |         if self.ngram_size == 3:
81 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[0]] )
82 |         if self.ngram_size == 2:
83 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[0]] )
84 | 
85 |     # initial-prob-index-to-ngram
86 |     def _i2nIP(self, index):
87 |         if self.ngram_size == 5:
88 |             third, fourth = divmod(index, self.alphabet_len)
89 |             second, third = divmod(third, self.alphabet_len)
90 |             first, second = divmod(second, self.alphabet_len)
91 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third] + self.alphabet_list[fourth]
92 |         if self.ngram_size == 4:
93 |             second, third = divmod(index, self.alphabet_len)
94 |             first, second = divmod(second, self.alphabet_len)
95 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third]
96 |         if self.ngram_size == 3:
97 |             first, second = divmod(index, self.alphabet_len)
98 |             return self.alphabet_list[first] + self.alphabet_list[second]
99 |         if self.ngram_size == 2:
100 |             return self.alphabet_list[index]
101 | 
102 |     # ngram-to-conditional-prob-index
103 |     def _n2iCP(self, ngram):
104 |         ngram = list(ngram)
105 |         if self.ngram_size == 5:
106 |             return ( 
self.alphabet_len**0 * self.alphabet_dict[ngram[4]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[3]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**3 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**4 * self.alphabet_dict[ngram[0]] )
107 |         if self.ngram_size == 4:
108 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[3]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**3 * self.alphabet_dict[ngram[0]] )
109 |         if self.ngram_size == 3:
110 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[2]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**2 * self.alphabet_dict[ngram[0]] )
111 |         if self.ngram_size == 2:
112 |             return ( self.alphabet_len**0 * self.alphabet_dict[ngram[1]] ) + ( self.alphabet_len**1 * self.alphabet_dict[ngram[0]] )
113 | 
114 |     # conditional-prob-index-to-ngram
115 |     def _i2nCP(self, index):
116 |         if self.ngram_size == 5:
117 |             fourth, fifth = divmod(index, self.alphabet_len)
118 |             third, fourth = divmod(fourth, self.alphabet_len)
119 |             second, third = divmod(third, self.alphabet_len)
120 |             first, second = divmod(second, self.alphabet_len)
121 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third] + self.alphabet_list[fourth] + self.alphabet_list[fifth]
122 |         if self.ngram_size == 4:
123 |             third, fourth = divmod(index, self.alphabet_len)
124 |             second, third = divmod(third, self.alphabet_len)
125 |             first, second = divmod(second, self.alphabet_len)
126 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third] + self.alphabet_list[fourth]
127 |         if self.ngram_size == 3:
128 |             second, third = divmod(index, self.alphabet_len)
129 |             first, second = divmod(second, self.alphabet_len)
130 |             return self.alphabet_list[first] + self.alphabet_list[second] + self.alphabet_list[third]
131 |         if self.ngram_size == 2:
132 |             first, second = divmod(index, self.alphabet_len)
133 |             return self.alphabet_list[first] + self.alphabet_list[second]
134 | 
135 |     ########################################################################################################################
136 | 
137 |     # Adds all possible combinations of ngrams to the list with initial count = 1
138 |     def _init_lists(self, kind):
139 |         if kind == "ip_list":
140 |             for i in range(0, int(math.pow(self.alphabet_len, self.ngram_size-1))):
141 |                 self.ip_list.append(1) # Smoothing, we initialize every possible ngram with count = 1
142 |         elif kind == "cp_list":
143 |             for i in range(0, int(math.pow(self.alphabet_len, self.ngram_size))):
144 |                 self.cp_list.append(1) # Smoothing, we initialize every possible ngram with count = 1
145 |         elif kind == "ep_list":
146 |             for i in range(0, int(math.pow(self.alphabet_len, self.ngram_size-1))):
147 |                 self.ep_list.append(1) # Smoothing, we initialize every possible ngram with count = 1
148 |         else:
149 |             raise Exception('Unknown list given (required: ip_list, cp_list, or ep_list)')
150 | 
151 |     ########################################################################################################################
152 | 
153 |     # Count the occurrences of ngrams in the training corpus
154 |     '''
155 |     password PW
156 |     pa IP
157 |     pas CP1
158 |     ass CP2
159 |     ssw CP3
160 |     swo CP4
161 |     wor CP5
162 |     ord CP6
163 |     rd EP
164 |     '''
165 |     def _count(self, kind):
166 |         if kind == "ip_list":
167 |             with open(self.training_file) as input_file:
168 |                 for line in 
tqdm(input_file, desc=self.training_file, total=self.training_file_lines, disable=self.disable_progress, miniters=1000, unit="pw"):
169 |                     line = line.rstrip('\r\n')
170 |                     if len(line) != self.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
171 |                         continue
172 |                     if self._is_in_alphabet(line): # Filter non-printable
173 |                         ngram = line[0:self.ngram_size-1] # Get IP ngram
174 |                         self.ip_list[self._n2iIP(ngram)] += 1 # Increase IP ngram count by 1
175 |         elif kind == "cp_list":
176 |             with open(self.training_file) as input_file: # Open training file
177 |                 for line in tqdm(input_file, desc=self.training_file, total=self.training_file_lines, disable=self.disable_progress, miniters=1000, unit="pw"):
178 |                     line = line.rstrip('\r\n')
179 |                     if len(line) != self.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
180 |                         continue
181 |                     if self._is_in_alphabet(line): # Filter non-printable
182 |                         old_pos = 0
183 |                         for new_pos in range(self.ngram_size, len(line)+1, 1): # Sliding window: pas|ass|ssw|swo|wor|ord
184 |                             ngram = line[old_pos:new_pos]
185 |                             old_pos += 1
186 |                             self.cp_list[self._n2iCP(ngram)] += 1 # Increase CP ngram count by 1
187 |         elif kind == "ep_list":
188 |             with open(self.training_file) as input_file: # Open training file
189 |                 for line in tqdm(input_file, desc=self.training_file, total=self.training_file_lines, disable=self.disable_progress, miniters=1000, unit="pw"):
190 |                     line = line.rstrip('\r\n')
191 |                     if len(line) != self.length: # Important to prevent generating "passwor", or "iloveyo", or "babygir"
192 |                         continue
193 |                     if self._is_in_alphabet(line): # Filter non-printable
194 |                         ngram = line[-self.ngram_size+1:] # Get EP ngram
195 |                         self.ep_list[self._n2iIP(ngram)] += 1 # Increase EP ngram count by 1
196 |         else:
197 |             raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
198 | 
199 |     ########################################################################################################################
200 | 
201 |     # Determine the probability (based on the counts) of an ngram
202 |     def _prob(self, kind):
203 |         if kind == "ip_list":
204 |             no_ip_training_ngrams = 0.0 # must be a float
205 |             for ngram_count in self.ip_list:
206 |                 no_ip_training_ngrams += ngram_count
207 |             for index in range(0, len(self.ip_list)):
208 |                 self.ip_list[index] = self.ip_list[index] / no_ip_training_ngrams # count / all
209 |             # Validate that prob sums to 1.0, otherwise coding error. Check for rounding errors using Decimal(1.0) instead of float(1.0)
210 |             sum = 0.0
211 |             for ngram_prob in self.ip_list:
212 |                 sum += ngram_prob
213 |             logging.debug("IP probability sum: {0:.16f}".format(sum))
214 |             if not self._is_almost_equal(sum, 1.0):
215 |                 raise Exception("ip_list probabilities do not sum up to 1.0! It is only: {}".format(sum))
216 |         elif kind == "cp_list":
217 |             for index in range(0, len(self.cp_list), self.alphabet_len):
218 |                 no_cp_training_ngrams = 0.0 # must be a float
219 |                 for x in range(index, index+self.alphabet_len):
220 |                     no_cp_training_ngrams += self.cp_list[x] # Count all ngram occurrences within one ngram-1 category
221 |                 for x in range(index, index+self.alphabet_len):
222 |                     self.cp_list[x] = self.cp_list[x] / no_cp_training_ngrams # count / all (of current [x])
223 |             # Validate that prob sums to 1.0, otherwise coding error. 
Check for rounding errors using Decimal(1.0) instead of float(1.0)
224 |             '''
225 |             sum = 0.0
226 |             for x in range(index, index+self.alphabet_len):
227 |                 sum += self.cp_list[x]
228 |             #logging.debug("CP probability sum: {0:.16f}".format(sum))
229 |             if not self._is_almost_equal(sum, 1.0):
230 |                 raise Exception("cp_list probabilities do not sum up to 1.0! It is only: {}".format(sum))
231 |             '''
232 |         elif kind == "ep_list":
233 |             no_ep_training_ngrams = 0.0 # must be a float
234 |             for ngram_count in self.ep_list:
235 |                 no_ep_training_ngrams += ngram_count
236 |             for index in range(0, len(self.ep_list)):
237 |                 self.ep_list[index] = self.ep_list[index] / no_ep_training_ngrams # count / all
238 |             # Validate that prob sums to 1.0, otherwise coding error. Check for rounding errors using Decimal(1.0) instead of float(1.0)
239 |             sum = 0.0
240 |             for ngram_prob in self.ep_list:
241 |                 sum += ngram_prob
242 |             logging.debug("EP probability sum: {0:.16f}".format(sum))
243 |             if not self._is_almost_equal(sum, 1.0):
244 |                 raise Exception("ep_list probabilities do not sum up to 1.0! It is only: {}".format(sum))
245 |         else:
246 |             raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
247 | 
248 |     ########################################################################################################################
249 | 
250 |     '''
251 |     CP cPickle Storing the data on disk took: 0:01:18.987257 # Native?
252 |     CP simplejson Storing the data on disk took: 0:01:14.158285 # pip install simplejson
253 |     CP ujson Storing the data on disk took: 0:01:05.501812 # pip install ujson
254 |     CP cbor Storing the data on disk took: 0:00:17.168384 # pip install cbor
255 |     CP cbor2 Storing the data on disk took: 0:00:12.584272 # pip install cbor2
256 |     CP marshal Storing the data on disk took: 0:00:14.355625 # Native?
257 |     CP umsgpack Storing the data on disk took: 0:00:11.805770 # pip install u-msgpack-python
258 |     Loading the data from disk took: 0:00:17.505519
259 |     CP msgpack Storing the data on disk took: 0:00:07.918690 # pip install msgpack
260 |     ValueError: ('%s exceeds max_array_len(%s)', 804357, 131072)
261 |     '''
262 | 
263 |     def save(self, kind):
264 |         start = datetime.datetime.now()
265 |         logging.debug("Start: Writing result to disk, this gonna take a while ...")
266 |         path, file = os.path.split(self.training_file)
267 |         with open('trained/'+file[:-4]+'_'+kind+'_'+str(self.ngram_size)+'_'+str(self.length)+'.pack', 'wb') as fp:
268 |             if kind == "ip_list":
269 |                 umsgpack.dump(self.ip_list, fp)
270 |             elif kind == "cp_list":
271 |                 umsgpack.dump(self.cp_list, fp)
272 |             elif kind == "ep_list":
273 |                 umsgpack.dump(self.ep_list, fp)
274 |             else:
275 |                 raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
276 |         logging.debug("Done! Everything stored on disk.")
277 |         logging.debug("Storing the data on disk took: {}".format(datetime.datetime.now()-start))
278 | 
279 |     def load(self, kind):
280 |         start = datetime.datetime.now()
281 |         path, file = os.path.split(self.training_file)
282 |         with open('trained/'+file[:-4]+'_'+kind+'_'+str(self.ngram_size)+'_'+str(self.length)+'.pack', 'rb') as fp:
283 |             if kind == "ip_list":
284 |                 self.ip_list = umsgpack.load(fp)
285 |             elif kind == "cp_list":
286 |                 self.cp_list = umsgpack.load(fp)
287 |             elif kind == "ep_list":
288 |                 self.ep_list = umsgpack.load(fp)
289 |             else:
290 |                 raise Exception("Unknown list given (required: ip_list, cp_list, or ep_list)")
291 |         logging.debug("Done! 
Everything loaded from disk.")
292 |         logging.debug("Loading the data from disk took: {}".format(datetime.datetime.now()-start))
293 | 
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | # NEMO: Modeling Password Guessability Using Markov Models
2 | 
3 | ### tl;dr
4 | This is our ongoing effort to build probabilistic password models using Markov models.
5 | Common use cases include:
6 | * Strength estimation
7 | * Guessing
8 | * (Adaptive) Natural Language Encoders
9 | * ...
10 | 
11 | ### WARNING
12 | - This is research-quality code that should only be used for a proof of concept (PoC).
13 | - We share this code in the hope that the research community can benefit from it. Please share your code, too! :heart_eyes:
14 | - We recommend running this software using [PyPy](https://pypy.org/download.html) (see performance stats below).
15 | 
16 | ### About NEMO
17 | The scope of this project is not limited to passwords; this software has also been used in the context of other human-chosen secrets like Emoji, PINs, and Android unlock patterns.
18 | 
19 | The architecture of the software is inspired by [OMEN](https://github.com/RUB-SysSec/OMEN). More background information about OMEN can be found [here](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/omen/) and [here](https://www.mobsec.ruhr-uni-bochum.de/media/mobsec/arbeiten/2014/12/12/2013-ma-angelstorf-omen.pdf). An excellent Python implementation of OMEN, called `py_omen`, by [Matthew Weir](https://dblp.uni-trier.de/pers/hd/w/Weir:Matt) ([@lakiw](https://twitter.com/lakiw)) can be found [here](https://github.com/lakiw/py_omen).
20 | 
21 | #### Difference to OMEN
22 | OMEN makes use of so-called levels (a form of binning). This implementation does not. Thus, efficient enumeration of password candidates (guessing passwords as OMEN does) is not possible out of the box if the key space becomes too big. However, because of the non-binned output, this software has other advantages; for example, it can produce [more accurate strength estimates](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/accuracy-password-strength-meters/).
23 | 
24 | #### Overview: Markov Model-Based Password Guessing
25 | * In 2005, [Arvind Narayanan](https://dblp.uni-trier.de/pers/hd/n/Narayanan:Arvind) and [Vitaly Shmatikov](https://dblp.uni-trier.de/pers/hd/s/Shmatikov:Vitaly) proposed the use of Markov models to overcome some problems of dictionary-based password guessing attacks in their work [Fast Dictionary Attacks on Passwords Using Time-Space Tradeoff](https://www.cs.cornell.edu/~shmat/shmat_ccs05pwd.pdf). The idea behind Markov models is based on the observation that subsequent tokens, such as letters in a text, are rarely independently chosen and can often be accurately modeled based on a short history of tokens.
26 | 
27 | * In 2008, the popular password cracker [John the Ripper](https://www.openwall.com/john/) introduced a `-markov` mode. More details can be found [here](https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/doc/MARKOV), [here](https://openwall.info/wiki/john/markov), and [here](https://github.com/RUB-SysSec/Password-Guessing-Framework/blob/master/src/scripts/JTR_MARKOV.sh). 
[Simon Marechal](https://dblp.uni-trier.de/pers/hd/m/Marechal:Simon) ([@bartavelle](https://twitter.com/bartavelle)) compared this Markov model-based approach with various other guessing techniques in his work [Advances in Password Cracking](https://link.springer.com/article/10.1007/s11416-007-0064-y).
28 | 
29 | * In 2010, [Dell’Amico et al.](https://dblp.uni-trier.de/pers/hd/d/Dell=Amico:Matteo) used a Markov model-based approach to guess passwords in their work [Measuring Password Strength: An Empirical Analysis](https://arxiv.org/pdf/0907.3402.pdf).
30 | 
31 | * In 2012 and 2015, [Castelluccia et al.](https://dblp.uni-trier.de/pers/hd/c/Castelluccia:Claude) and [Dürmuth et al.](https://dblp.uni-trier.de/pers/hd/d/D=uuml=rmuth:Markus), respectively, improved the concept by generating password candidates according to their occurrence probabilities, i.e., by guessing the most likely passwords first. Please refer to their works, [Adaptive Password-Strength Meters from Markov Models](https://www.ei.ruhr-uni-bochum.de/media/ei/veroeffentlichungen/2016/01/15/2012-ndss-pwd-strength.pdf), [OMEN: Faster Password Guessing Using an Ordered Markov Enumerator](https://hal.archives-ouvertes.fr/hal-01112124/document), and [When Privacy Meets Security: Leveraging Personal Information for Password Cracking](https://arxiv.org/pdf/1304.6584.pdf) for more details.
32 | 
33 | * In 2014, [Ma et al.](https://dblp.uni-trier.de/pers/hd/m/Ma:Jerry) discussed other sources for improvements such as smoothing, backoff models, and issues related to data sparsity in their excellent work [A Study of Probabilistic Password Models](https://www.ieee-security.org/TC/SP2014/papers/AStudyofProbabilisticPasswordModels.pdf).
34 | 
35 | * In 2015, [Matteo Dell’Amico](https://dblp.uni-trier.de/pers/hd/d/Dell=Amico:Matteo) and [Maurizio Filippone](https://dblp.uni-trier.de/pers/hd/f/Filippone:Maurizio) published their work on [Monte Carlo Strength Evaluation: Fast and Reliable Password Checking](http://www.eurecom.fr/~filippon/Publications/ccs15.pdf). Their [*backoff*](https://github.com/matteodellamico/montecarlopwd) Markov model can be found on GitHub, too. :heart_eyes:
36 | 
37 | * In 2015, [Ur et al.](https://dblp.uni-trier.de/pers/hd/u/Ur:Blase) compared various password cracking methods in their work [Measuring Real-World Accuracies and Biases in Modeling Password Guessability](https://www.blaseur.com/papers/sec15-guessability.pdf). For Markov model-based attacks they used a copy of Ma et al.'s code, which is now available via Carnegie Mellon University's [Password Guessability Service (PGS)](https://pgs.ece.cmu.edu/) where it is called "Markov Model: wordlist-order5-smoothed."
38 | 
39 | * In 2016, [Melicher et al.](https://dblp.uni-trier.de/pers/hd/m/Melicher:William) compared their RNN-based approach to a Markov model in their work [Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks](https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/melicher). While some details are missing, their [model can be found on GitHub](https://github.com/cupslab/neural_network_cracking/blob/master/markov_model.py), too. :heart_eyes:
40 | 
41 | #### Publications
42 | In the past, we used different versions of this code in the following publications: :bowtie:
43 | * IEEE SP 2019: [Reasoning Analytically About Password-Cracking Software](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/reasoning-analytically-about-password-cracking/) (`Markov: Multi`)
44 | * ACM CCS 2018: [On the Accuracy of Password Strength Meters](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/accuracy-password-strength-meters/) (`ID: 4B/4C Markov (Single/Multi)`)
45 | * ACM CCS 2016: [On the Security of Cracking-Resistant Password Vaults](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/cracking-resistant-password-vaults/) (`Markov Model`)
46 | 
47 | A simpler version of this code has been used for other user-chosen secrets such as [Emoji](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/quantifying-security-emoji-based-authentication/) and [Android unlock patterns](https://www.mobsec.ruhr-uni-bochum.de/forschung/veroeffentlichungen/accuracy-android-pattern-strength-meters/).
48 | 
49 | ### Design Decisions
50 | Warning: Markov models are memory-eating monsters! :smiling_imp:
51 | 
52 | We use three copies of a data structure (in the past: Python OrderedDicts(), today: plain Python lists) to store the frequencies of the *n-grams* in the training corpus. We use:
53 | 
54 | - IP: Initial probabilities (ngram_size - 1)
55 | - CP: Conditional probabilities
56 | - EP: End probabilities (ngram_size - 1)
57 | 
58 | Here is an example for 3-grams:
59 | ```
60 | password PW
61 | 
62 | pa IP (some literature uses this annotation: ^pa)
63 | 
64 | pas CP1
65 | ass CP2
66 | ssw CP3
67 | swo CP4
68 | wor CP5
69 | ord CP6
70 | 
71 | rd EP (some literature uses this annotation: rd$)
72 | ```
73 | 
74 | #### How Big Are They?
75 | 
76 | ```
77 | IP: alphabet_length ^ (ngram_size - 1)
78 | CP: alphabet_length ^ ngram_size
79 | EP: alphabet_length ^ (ngram_size - 1)
80 | ```
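
To make these numbers concrete: with the `dev.json` settings (an 85-character alphabet and `ngram_size` 4), the lists hold 85^3 = 614,125 IP/EP entries and 85^4 = 52,200,625 CP entries. The following minimal sketch (hypothetical helper code, not part of NEMO) splits a password into its IP, CP, and EP *n-grams* and evaluates the size formulas:

```
# Hypothetical sketch, not part of the code base: split a password into
# its IP, CP, and EP n-grams and compute the theoretical list sizes.
def split_ngrams(password, ngram_size=3):
    ip = password[:ngram_size - 1]                               # "pa"
    cps = [password[i:i + ngram_size]
           for i in range(len(password) - ngram_size + 1)]       # "pas", "ass", ...
    ep = password[-(ngram_size - 1):]                            # "rd"
    return ip, cps, ep

print(split_ngrams("password"))
# ('pa', ['pas', 'ass', 'ssw', 'swo', 'wor', 'ord'], 'rd')

alphabet_length, ngram_size = 85, 4         # values from configs/dev.json
print(alphabet_length ** (ngram_size - 1))  # len(IP) = len(EP) = 614125
print(alphabet_length ** ngram_size)        # len(CP) = 52200625
```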
81 | 
82 | #### Some Details For Those Interested:
83 | 
84 | :nerd_face:
85 | 
86 | __*n-gram* size__: Currently, we support 2,3,4,5-grams. The higher the order of the Markov chains, the more accurate the model becomes. Unfortunately, this also introduces the risk of overfitting and sparsity. If one does not have enough training data, e.g., when using the model with Android unlock patterns, computing the transition probabilities from such small counts becomes too noisy. While we only support fixed-order Markov chains, we recommend Dell’Amico and Filippone's [*backoff*](https://github.com/matteodellamico/montecarlopwd) model for variable-order Markov chains.
87 | 
88 | __Smoothing__: Currently, we only support additive smoothing (add '1' to the counts), also known as Laplace smoothing.
89 | 
90 | __Alphabet__: We tested this software with ASCII passwords only. Using non-ASCII passwords likely requires dropping support for Python 2 first. Hint: You can use the `info.py` script in the `utils` folder to determine the alphabet.
91 | 
92 | ### Development
93 | In early versions of this code, we made heavy use of Python's (Ordered)-Dictionary class. Fun fact: As of Python 3.7 [dictionaries are always ordered](https://mail.python.org/pipermail/python-dev/2017-December/151283.html) :)
94 | 
95 | ```
96 | cp_dict_full:
97 | key: "aaaa", value: 0.0071192...
98 | key: "aaab", value: 0.0034128...
99 | ...
100 | ```
101 | 
102 | A few months later, we optimized the memory consumption by only storing *n-grams* that really occur in the training corpus. If a rare *n-gram* like the 4-gram `o9py` does not occur in the training file, we used to return a very small default probability instead. This helped quite a lot to reduce the required memory; still, like Google Chrome, our solution easily occupied more than __20GB of RAM__. :poop:
103 | 
104 | ```
105 | cp_dict_sparse:
106 | key: "assw", value: 0.0838103...
107 | key: "sswo", value: 0.0954193...
108 | ...
109 | ```
110 | 
111 | Thus, we decided to refactor the code again to further limit the required memory, today down to approx. __16GB of RAM__.
112 | Today, we use simple Python lists to store the *n-gram* probabilities in memory.
113 | However, this forced us to come up with an `ngram-to-listindex` function, which differs for CP compared to IP/EP.
114 | 
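Conceptually, these index helpers interpret an *n-gram* as a number written in base `alphabet_len`. The code base unrolls this computation per *n-gram* size (`_n2iIP()`, `_n2iCP()`, `_i2nIP()`, and `_i2nCP()` in `ngram_creator.py`); the following generic sketch (hypothetical, assuming `alphabet_dict` maps each character to its position and `alphabet_list` is the inverse mapping, as in `NGramCreator`) shows the underlying idea:

```
# Hypothetical generalization of the unrolled _n2i*/_i2n* helpers:
# treat an n-gram as a base-alphabet_len number.
def n2i(ngram, alphabet_dict, alphabet_len):
    index = 0
    for char in ngram:  # Horner's scheme: first char is the most significant digit
        index = index * alphabet_len + alphabet_dict[char]
    return index

def i2n(index, n, alphabet_list, alphabet_len):
    chars = []
    for _ in range(n):  # peel off digits, least significant first
        index, remainder = divmod(index, alphabet_len)
        chars.append(alphabet_list[remainder])
    return ''.join(reversed(chars))
```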
115 | ```
116 | 
117 | _n2i(): ngram, e.g., "assw" to index in list, e.g., ngram_cp_list[87453]
118 | _i2n(): index in list, e.g., ngram_cp_list[87453] to ngram, e.g., "assw"
119 | 
120 | cp_list_full:
121 | index: 0, value: 0.0071192... | ("aaaa")
122 | index: 1, value: 0.0034128... | ("aaab")
123 | ...
124 | index: 87453, value: 0.0838103... | ("assw")
125 | ...
126 | index: 8133135, value: 0.0954193... | ("sswo")
127 | ...
128 | ```
129 | 
130 | The current version of the code supports this operation for 2,3,4, and 5-grams.
131 | Fortunately, while this approach achieves the desired memory savings, the additional function call does not increase the runtime significantly compared to the O(1) HashMap access offered by Python dictionaries.
132 | 
133 | ### Performance Testing
134 | 
135 | We highly recommend replacing Python with [PyPy](https://pypy.org/download.html) before using this software. :100: :thumbsup:
136 | ```
137 |                     MEMORY      TIME
138 | # PYTHON 2
139 | CPython 2.7.10      15.36GB     53m 27s
140 | PyPy2 7.1.1         5.88GB      3m 8s     (based on Python 2.7.13)  <- Highly recommended
141 | 
142 | # PYTHON 3
143 | CPython 3.7.3       14.47GB     12m 34s
144 | CPython 3.6.5       14.49GB     13m 13s
145 | PyPy3 7.1.1         7.33GB      2m 13s    (based on Python 3.6.1)   <- Highly recommended
146 | ```
147 | 
148 | ### Getting Started
149 | #### Folder Structure
150 | 
151 | ```
152 | .
153 | ├── README.md
154 | ├── configs
155 | │   ├── configure.py
156 | │   ├── dev.json
157 | │   └── main.json
158 | ├── docs
159 | │   ├── CHANGELOG.md
160 | │   └── LICENSE
161 | ├── input
162 | ├── log
163 | │   └── multiprocessinglog.py
164 | ├── meter.py
165 | ├── ngram
166 | │   └── ngram_creator.py
167 | ├── requirements.txt
168 | ├── results
169 | ├── train.py
170 | ├── trained
171 | │   ├── training_cp_list__.pack
172 | │   ├── training_ep_list__.pack
173 | │   └── training_ip_list__.pack
174 | └── utils
175 |     ├── info.py
176 |     └── sortresult.py
177 | ```
178 | 
179 | #### Installation
180 | 
181 | Install PyPy (for Python 2 or better Python 3), and create a virtual environment just to keep your system light and clean:
182 | 
183 | `$ virtualenv -p $(which pypy) nemo-venv`
184 | ```
185 | Running virtualenv with interpreter /usr/local/bin/pypy
186 | New pypy executable in /home//nemo-venv/bin/pypy
187 | Installing setuptools, pip, wheel...
188 | done. 
```
189 | 
190 | Activate the new virtual environment:
191 | 
192 | `$ source nemo-venv/bin/activate`
193 | 
194 | Now clone the repo:
195 | 
196 | `(nemo-venv) $ git clone https://github.com/RUB-SysSec/NEMO.git`
197 | 
198 | Change into the newly cloned folder:
199 | 
200 | `(nemo-venv) $ cd NEMO`
201 | 
202 | Now install the requirements:
203 | 
204 | `(nemo-venv) $ pip install -r requirements.txt`
205 | 
206 | This includes:
207 | - `tqdm` # for a fancy progress bar
208 | - `u-msgpack-python` # required to store/load the trained model to/from disk
209 | - `rainbow_logging_handler` # for colorful log messages
210 | 
211 | #### Dataset
212 | While the Markov model can be used for a variety of things, in the following we focus on a simple **strength meter use case**.
213 | 
214 | For this, you will need two files:
215 | 
216 | - `input/training.txt`: Contains the passwords that you would like to use to train your Markov model.
217 | - `input/eval.txt`: Contains the passwords whose guessability you would like to estimate.
218 | 
219 | I will not share any of those password files, but using the "RockYou" or the "LinkedIn" password leak sounds like a great idea. Make sure to clean and (ASCII) filter the files to optimize the performance.
220 | 
221 | For optimal accuracy, consider training with a password distribution that is similar to the one you would like to evaluate (e.g., a 90%/10% split). Please do not train on a dictionary / word list; this won't work. :stuck_out_tongue_winking_eye: You need a real password distribution, i.e., one including duplicates.
222 | 
223 | - The file must be placed in the `input` folder.
224 | - One password per line.
225 | - The file must be a real password distribution (not a dictionary / word list), i.e., it must contain multiplicities.
226 | - All passwords that are shorter or longer than the specified `lengths` will be ignored.
227 | - All passwords that contain characters which are not in the specified `alphabet` will be ignored.
228 | 
229 | During development, we tested our code with a file that contained ~10m passwords.
230 | 
231 | #### Configuration
232 | Before training, you need to provide a configuration file.
233 | You can specify which configuration file to use by editing the following line in `configure.py` in the `configs` folder:
234 | 
235 | ```
236 | with open('./configs/dev.json', 'r') as configfile:
237 | ```
238 | 
239 | Here is the default content of `dev.json`; feel free to edit the file as you like.
240 | ```
241 | {
242 |     "name" : "Development",
243 |     "eval_file" : "eval.txt",
244 |     "training_file" : "training.txt",
245 |     "alphabet" : "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%",
246 |     "lengths" : [4,6,8],
247 |     "ngram_size" : 4,
248 |     "no_cpus" : 8,
249 |     "progress_bar": true
250 | }
251 | ```
252 | 
253 | Please note: You can use the `info.py` script in the `utils` folder to learn the alphabet of your training / evaluation file.
254 | 
255 | For example run:
256 | 
257 | `(nemo-venv) $ pypy utils/info.py input/eval.txt`
258 | 
259 | ```
260 | File: eval.txt
261 | Min length: 3
262 | Max length: 23
263 | Observed password lengths: [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23]
264 | Alphabet (escaped for Python, but watch out for the space char): "aeio1nrlst20mcuydh93bkgp84576vjfwzxAEqORLNISMTBYCP!.UGHDJ F-K*#V_\\XZW';Q],@&?~+$={^/%"
265 | Alphabet length: 85
266 | ASCII only: Yes
267 | ```
268 | 
269 | If you encounter any issues, go to `train.py` and change the `train()` function from multi to single processing. 
This way, it is easier to debug the actual problem. 267 | 268 | #### Training 269 | 270 | ##### Training Requirements 271 | * ~2-5 minutes 272 | * ~8 threads (4 cores + hyper-threading) 273 | * ~16GB of RAM 274 | * ~6GB of disk space 275 | 276 | ##### How to Train the Model 277 | To train the model run: 278 | 279 | `(nemo-venv) $ pypy train.py` 280 | 281 | Once the training is done, you should have multiple `*.pack` files in the `trained` folder. We use a lightweight [MessagePack](https://github.com/vsergeev/u-msgpack-python) implementation to serialize the model. 282 | 283 | A successful training looks like this: 284 | 285 | ``` 286 | Start: 2019-07-06 15:54:13 287 | 288 | Press Ctrl+C to shutdown 289 | [15:54:13.239] configure.py Line 30 __init__(): Constructor started for 'My Config' 290 | [15:54:13.241] train.py Line 76 train(): Training started ... 291 | [15:54:13.242] ngram_creator.py Line 24 __init__(): Constructor started for 'NGramCreator, Session: Development, Length: 4, Progress bar: True' 292 | [15:54:13.242] ngram_creator.py Line 33 __init__(): Used alphabet: ae10i2onrls9384t5m67cdyhubkgpjvfzwAxEONIRSLM.TC_DqBHYUKPJG!-*F @VWXZ/,#+&?$Q)<'=;^[(%\~]`:|"> 293 | [15:54:13.242] ngram_creator.py Line 35 __init__(): Model string length: 4 294 | [15:54:13.242] ngram_creator.py Line 38 __init__(): NGram size: 4 295 | [15:54:15.315] ngram_creator.py Line 48 __init__(): len(IP) theo: 804357 296 | [15:54:15.315] ngram_creator.py Line 49 __init__(): len(CP) theo: 74805201 => 804357 * 93 297 | [15:54:15.315] ngram_creator.py Line 50 __init__(): len(EP) theo: 804357 298 | 299 | [15:54:15.315] train.py Line 29 worker(): ip_list init() ... 300 | [15:54:15.343] train.py Line 32 worker(): ip_list count() ... 301 | input/training.txt: 100%|███████████████████████████████████████████████████████████████████| 10000000/10000000 [00:03<00:00, 2916292.68pw/s] 302 | [15:54:18.776] train.py Line 35 worker(): ip_list prob() ... 303 | [15:54:18.794] ngram_creator.py Line 213 _prob(): IP probability sum: 1.0000000000141687 304 | [15:54:18.794] train.py Line 38 worker(): ip_list save() ... 305 | [15:54:18.794] ngram_creator.py Line 265 save(): Start: Writing result to disk, this gonna take a while ... 306 | [15:54:19.022] ngram_creator.py Line 276 save(): Done! Everything stored on disk. 307 | [15:54:19.023] ngram_creator.py Line 277 save(): Storing the data on disk took: 0:00:00.228256 308 | [15:54:19.023] train.py Line 41 worker(): Training IP done ... 309 | 310 | [15:54:19.023] train.py Line 44 worker(): cp_list init() ... 311 | [15:54:21.722] train.py Line 47 worker(): cp_list count() ... 312 | input/training.txt: 100%|███████████████████████████████████████████████████████████████████| 10000000/10000000 [00:03<00:00, 2995344.77pw/s] 313 | [15:54:25.063] train.py Line 50 worker(): cp_list prob() ... 314 | [15:54:25.893] train.py Line 53 worker(): cp_list save() ... 315 | [15:54:25.893] ngram_creator.py Line 265 save(): Start: Writing result to disk, this gonna take a while ... 316 | [15:54:45.189] ngram_creator.py Line 276 save(): Done! Everything stored on disk. 317 | [15:54:45.189] ngram_creator.py Line 277 save(): Storing the data on disk took: 0:00:19.295808 318 | [15:54:45.190] train.py Line 56 worker(): Training CP done ... 319 | 320 | [15:54:45.190] train.py Line 59 worker(): ep_list init() ... 321 | [15:54:45.211] train.py Line 62 worker(): ep_list count() ... 
input/training.txt: 100%|███████████████████████████████████████████████████████████████████| 10000000/10000000 [00:03<00:00, 3005917.73pw/s]
323 | [15:54:48.542] train.py Line 65 worker(): ep_list prob() ...
324 | [15:54:48.553] ngram_creator.py Line 242 _prob(): EP probability sum: 1.0000000000141684
325 | [15:54:48.553] train.py Line 68 worker(): ep_list save() ...
326 | [15:54:48.554] ngram_creator.py Line 265 save(): Start: Writing result to disk, this gonna take a while ...
327 | [15:54:48.781] ngram_creator.py Line 276 save(): Done! Everything stored on disk.
328 | [15:54:48.782] ngram_creator.py Line 277 save(): Storing the data on disk took: 0:00:00.227519
329 | [15:54:48.782] train.py Line 71 worker(): Training EP done ...
330 | 
331 | [15:54:55.686] ngram_creator.py Line 53 __del__(): Destructor started for 'NGramCreator, Session: Development, Length: 4, Progress bar: True'
332 | ...
333 | Done: 2019-07-06 15:56:11
334 | ```
335 | 
336 | 
337 | #### Strength Estimation
338 | After training, we can use the model, for example, to estimate the strength of a list of passwords that originate from a similar password distribution.
339 | To do so, please double check that your `eval_file` is specified correctly in your configuration `.json`.
340 | 
341 | For the strength estimation, we will read the trained *n-gram* frequencies from disk and then evaluate all passwords from the specified `eval_file`.
342 | 
343 | `(nemo-venv) $ pypy meter.py`
344 | 
345 | The result of this strength estimation can be found in the `results` folder in a file called `eval_result.txt`.
346 | 
347 | ```
348 | ...
349 | 1.7228127641947414e-13 funnygirl2
350 | 4.03572701534676e-13 single42
351 | 3.669804567773374e-16 silkk
352 | 3.345752850966769e-11 car345
353 | 6.9565427286338e-11 password1991
354 | 4.494395283171681e-12 abby28
355 | 3.1035094651948957e-13 1595159
356 | 7.936477209731241e-13 bhagwati
357 | 1.3319042593247044e-22 natt4evasexy
358 | 1.5909371909986554e-15 curbside
359 | ...
360 | ```
361 | 
362 | The values are tab (`\t`) separated.
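
Each value is the product of the password's initial probability (IP), its sliding-window conditional probabilities (CPs), and its end probability (EP), as computed in the `eval()` loop of `meter.py`. A minimal sketch of that scoring step (assuming `m` is an `NGramCreator` instance with its lists already loaded, and `pw` matches `m.length`):

```
# Sketch of the scoring performed in meter.py's eval() loop, assuming a
# loaded NGramCreator instance `m` and a password `pw` of matching length.
def score(m, pw):
    pw_prob = m.ip_list[m._n2iIP(pw[:m.ngram_size - 1])]       # initial probability
    for i in range(len(pw) - m.ngram_size + 1):                # conditional probabilities
        pw_prob *= m.cp_list[m._n2iCP(pw[i:i + m.ngram_size])]
    pw_prob *= m.ep_list[m._n2iIP(pw[-(m.ngram_size - 1):])]   # end probability
    return pw_prob
```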
363 | You can use `sortresult.py` from the `utils` folder to sort the passwords.
364 | 
365 | For example run:
366 | 
367 | `(nemo-venv) $ pypy utils/sortresult.py results/eval_result.txt > results/eval_result_sorted.txt`
368 | 
369 | A successful strength estimation looks like this:
370 | 
371 | ```
372 | Start: 2019-07-06 16:07:58
373 | 
374 | Press Ctrl+C to shutdown
375 | [16:07:58.346] configure.py Line 30 __init__(): Constructor started for 'My Config'
376 | [16:07:58.349] ngram_creator.py Line 24 __init__(): Constructor started for 'Development'
377 | [16:07:58.349] ngram_creator.py Line 33 __init__(): Used alphabet: ae10i2onrls9384t5m67cdyhubkgpjvfzwAxEONIRSLM.TC_DqBHYUKPJG!-*F @VWXZ/,#+&?$Q)<'=;^[(%\~]`:|">
378 | [16:07:58.349] ngram_creator.py Line 35 __init__(): Model string length: 8
379 | [16:07:58.349] ngram_creator.py Line 38 __init__(): NGram size: 4
380 | [16:08:00.253] ngram_creator.py Line 48 __init__(): len(IP) theo: 804357
381 | [16:08:00.253] ngram_creator.py Line 49 __init__(): len(CP) theo: 74805201 => 804357 * 93
382 | [16:08:00.253] ngram_creator.py Line 50 __init__(): len(EP) theo: 804357
383 | 
384 | [16:08:00.253] meter.py Line 23 worker(): Thread: 8 - ip_list load() ...
385 | [16:08:00.438] ngram_creator.py Line 291 load(): Done! Everything loaded from disk.
386 | [16:08:00.439] ngram_creator.py Line 292 load(): Loading the data from disk took: 0:00:00.184483
387 | 
388 | [16:08:00.439] meter.py Line 25 worker(): Thread: 8 - cp_list load() ...
389 | [16:08:14.075] ngram_creator.py Line 291 load(): Done! Everything loaded from disk.
390 | [16:08:14.076] ngram_creator.py Line 292 load(): Loading the data from disk took: 0:00:13.635805
391 | 
392 | [16:08:14.076] meter.py Line 27 worker(): Thread: 8 - ep_list load() ...
393 | [16:08:14.224] ngram_creator.py Line 291 load(): Done! Everything loaded from disk.
394 | [16:08:14.224] ngram_creator.py Line 292 load(): Loading the data from disk took: 0:00:00.148400
395 | 
396 | [16:08:14.224] meter.py Line 29 worker(): Thread: 8 - Loading done ...
397 | [16:08:14.225] meter.py Line 55 eval(): Training loaded from disk ...
398 | ...
399 | 
400 | Info: No Markov model for this length: 13 jake1password
401 | Info: No Markov model for this length: 16 marasalvatrucha3
402 | ...
403 | Done: 2019-07-06 16:08:14
404 | 
405 | ```
406 | 
407 | ### FAQ
408 | 
409 | - Usage: ASCII pre-filter your input / eval files.
410 | 
411 | - Usage: Limit the alphabet `alphabet` (lower+upper+digits), the *n-gram* size `ngram_size` (3- or 4-grams), and the password lengths `lengths` (e.g., 6 or 8 character long passwords).
412 | 
413 | - Usage: Make sure you train on a real password distribution, not the kind of word list / dictionary one normally uses with tools like Hashcat / John the Ripper.
414 | 
415 | - Debugging: If you encounter any issues, go to `train.py` and change the `train()` function from multi to single processing. This way, it is easier to debug the actual problem.
416 | 
417 | - Debugging: In `configure.py` you can change the verbosity of the `rainbow_logging_handler` from `DEBUG` to `INFO` or `CRITICAL`.
418 | 
419 | 
420 | ### License
421 | 
422 | **NEMO** is licensed under the MIT license. Refer to [docs/LICENSE](docs/LICENSE) for more information.
423 | 
424 | ### Third-Party Libraries
425 | * **tqdm** is a library that can be used to display a progress meter. It is a "*product of collaborative work*" from multiple authors and is using the MIT license. The license and the source code can be found
426 | [here](https://tqdm.github.io/licence/).
427 | * **u-msgpack-python** is a lightweight MessagePack serializer developed by Ivan (Vanya) A. Sergeev and is using the MIT license. The
428 | source code and the license can be downloaded [here](https://github.com/vsergeev/u-msgpack-python#license).
429 | * **rainbow_logging_handler** is a colorized logger developed by Mikko Ohtamaa and Sho Nakatani. The authors released it as "*free and unencumbered public domain software*". The source code and "license" can be found [here](https://github.com/laysakura/rainbow_logging_handler).
430 | 
431 | ### Contact
432 | Visit our [website](https://www.mobsec.rub.de) and follow us on [Twitter](https://twitter.com/hgi_bochum). If you are interested in passwords, consider contributing and attending the [International Conference on Passwords (PASSWORDS)](https://passwordscon.org).
433 | 
--------------------------------------------------------------------------------