├── .gitignore ├── LICENSE ├── ReadMe.md ├── calibration.R ├── common.py ├── config ├── baseline │ ├── config.json │ └── prune-nvd.js └── logging.json ├── curation ├── fetch-nvd.sh ├── merge-json.py └── parse-cve.py ├── docs ├── Baseline-model.md └── Data-acquisition.md ├── encode.py ├── encoders.py ├── linear.py ├── neural-net.py ├── preprocess.py ├── preprocessors.py ├── requirements.txt ├── settings.py └── tests ├── test_encoders.py └── test_preprocessors.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2017 Jerry Gagelman 2 | 3 | Redistribution and use in source and binary forms, with or without 4 | modification, are permitted provided that the following conditions are met: 5 | 6 | 1. Redistributions of source code must retain the above copyright notice, this 7 | list of conditions and the following disclaimer. 8 | 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 17 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 18 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 19 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 20 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 21 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 22 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | -------------------------------------------------------------------------------- /ReadMe.md: -------------------------------------------------------------------------------- 1 | # Scoring software vulnerabilities 2 | 3 | This project evolved as a collection of tools for analyzing software 4 | vulnerability data. It is largely a set of command line utilities. Each 5 | script focuses on a single unit of work, the aim being that more complex 6 | processing pipelines are built via composition. This design allows for 7 | leveraging other CL utilities, keeping the API to a minimal surface area. 8 | 9 | One of the main intended uses is training ML models for the 10 | [exploit prediction](https://arxiv.org/abs/1707.08015) problem. 11 | Please see that paper references for more background. 12 | 13 | ## System requirements 14 | 15 | The utilities target Python 3 (tested against 3.5-7). See requirements.txt 16 | for the Python dependencies. 17 | 18 | [jq 1.5+](https://stedolan.github.io/jq/) is required for essentially all 19 | data processing tasks. (See **data workflow** below.) One can download the 20 | latest stable version for your target platform, and Linux systems allow for 21 | installation via the system package manager. 22 | 23 | ## Data workflow 24 | 25 | Exploit prediction is a supervised learning problem. 
Most machine learning
26 | workflows start by marshaling the data into a tabular format--an N-by-D
27 | feature matrix, together with an additional column for the labels--and perform
28 | all cleaning and feature engineering steps from there. The DataFrame structures
29 | in R and Pandas are designed around this.
30 |
31 | The tools here emphasize a different, "opinionated" workflow whose point of
32 | departure is the fact that raw vulnerability data is most readily available
33 | in a hierarchically structured format like XML or JSON instead of flat tables.
34 | The target format for the data is a line-delimited file of JSON records -- the
35 | so-called _JSONL format._ Each data cleaning or feature engineering step
36 | consumes a JSONL file and emits a new one, thereby building a pipeline of
37 | processing steps with checkpoints along the way.
38 |
39 | One of the design decisions that is best made explicit early on involves the
40 | preferred way of defining and encoding features. Suppose that the input
41 | records have a top-level property called "foo," each being an object of
42 | categorical attributes:
43 |
44 | {..., "foo": {"type": "debian", "version": "1.2", "affected": true}, ...}
45 | {..., "foo": {"type": "legacy", "affected": false}, ...}
46 | ...
47 |
48 | One possible approach is to create a feature for each of the paths
49 | `foo.type`, `foo.version`, `foo.affected`, etc., each of which will be
50 | a categorical variable and have its own one-hot encoding. _Instead, the
51 | preferred approach is to use a bag-of-words encoding for the top-level
52 | property._ Its vocabulary is the space of all intra-object paths, e.g.,
53 | `type.debian`, `version.1.2`, etc., so that the preprocessed
54 | records become:
55 |
56 | {..., "foo": ["type.debian", "version.1.2", "affected.true"], ...}
57 | {..., "foo": ["type.legacy", "affected.false"], ...}
58 | ...
59 |
60 | The two approaches are mathematically equivalent. However, the latter helps
61 | to keep the data wrangling steps simpler. For each data set, one only needs
62 | to specify the transforms and encodings for a bounded set of top-level
63 | properties.
64 |
65 | The data cleaning and feature engineering steps of the workflow operate on
66 | the data one record at a time (or "row-wise"), and then the final encoding
67 | step transforms the data into a columnar format consumable for ML training
68 | and evaluation. That target format is a Python dictionary associating feature
69 | names (the keys) to 2D [numpy](http://www.numpy.org/) arrays. A given array
70 | will have shape `(N, K)`, where N is the number of records, and K is the
71 | dimensionality of the vector encoding for that feature. Note that the term
72 | "feature" here is applied loosely, as it may include the class labels for a
73 | supervised learning problem, in which case K=1.
74 |
75 | ### Workflow outline
76 |
77 | 1. Create a file of JSON records, where all records have the same set of
78 | keys corresponding to the "features" of interest. A basic walk-through
79 | on [data acquisition](docs/Data-acquisition.md) illustrates this.
80 |
81 | 2. Apply the [preprocessing](preprocess.py) script to the raw data, creating
82 | another file of JSON records with the same top-level keys, but the
83 | corresponding values are either arrays of strings (literally bags of tokens)
84 | or numeric values.
85 |
86 | 3. Apply the [encoding](encode.py) script to transform the preprocessed
87 | records into the target dictionary of numpy arrays. The snippet below shows how to inspect the result.
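
Concretely, the encoded artifact is a Python dictionary serialized with
[joblib](https://joblib.readthedocs.io/), so the output of a pipeline run can be
inspected with a few lines of Python. This is only a sketch, and the file name
below is a placeholder for whatever path was passed to the encoding script:

    import joblib

    # artifact emitted by encode.py: a dict of feature name -> array
    data = joblib.load("encoded-features.pkl")
    for name, array in data.items():
        # works for dense numpy arrays and scipy sparse matrices alike
        print(name, array.shape, array.dtype)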
88 |
89 |
90 | ## Command line API
91 |
92 | This section documents the preprocessing and encoding scripts in more detail.
93 | Each of these scripts consumes and emits files as part of a data pipeline
94 | that can be summarized as follows:
95 |
96 | ### `preprocess.py`
97 |
98 | | Argument | State | Description |
99 | | ---------- | --------------- | --------------------------------- |
100 | | config | required input | JSON configuration file. |
101 | | rawdata | required input | JSONL file of raw features. |
102 | | processed | output | JSONL file of processed features. |
103 | | vocabulary | optional output | JSON file of vocabularies. |
104 |
105 | ### `encode.py`
106 |
107 | | Argument | State | Description |
108 | | ---------- | --------------- | --------------------------------- |
109 | | config | required input | JSON configuration file. |
110 | | vocabulary | required input | JSON file of vocabularies. |
111 | | processed | required input | JSONL file of processed features. |
112 | | encoded | output | Dictionary of numpy arrays. |
113 |
114 | ### `config` schema
115 |
116 | Both of the scripts take a **config** argument that defines all of the
117 | preprocessing and encoding methods applied to each feature. It is a JSON
118 | array of objects, one for each feature, with the following schema:
119 |
120 | [
121 | {
122 | "key": // key name in JSON input records.
123 | "preprocessor": // reference in preprocess.PREPROCESSORS
124 | "encoder": // reference in encode.ENCODERS
125 | // optional keyword arguments for the preprocessor and encoder methods.
126 | },
127 | ...
128 | ]
129 |
130 | ### `vocabulary` schema
131 |
132 | When working with a feature (text or structured data) to which a bag-of-words
133 | encoding will be applied, it is important to extract the _vocabulary_ for that
134 | feature, which fixes an assignment of each token to its dimension in the vector
135 | representation. As it is critical that the same vector representation for a
136 | feature used to train an estimator is also applied to new examples during
137 | inference, the vocabulary needs to be treated as an artifact of preprocessing
138 | that becomes an input of any encoding step.
139 |
140 | The **vocabulary** artifact emitted by the preprocessing script is a JSON
141 | file with a simple nested format:
142 |
143 | {
144 | <feature>: {
145 | <token>: <fraction>,
146 | <token>: <fraction>,
147 | ...
148 | }
149 | ...
150 | }
151 |
152 | The top-level keys are features from the input data, but only those targeting
153 | bag-of-words encoding; numeric features are absent. The nested maps associate
154 | each token in that "feature space" to the fraction of records in which that
155 | token appears.
156 |
157 | When this object is consumed by the encoding script, the only thing that
158 | matters for the vector representation of a feature is its "key space" of
159 | tokens, as the token-to-dimension mapping is established by sorting. This
160 | allows for different dimension reduction strategies by pruning or otherwise
161 | transforming these nested objects in the input vocabulary; the numeric frequencies
162 | are only provided as an aid toward these steps.
163 |
--------------------------------------------------------------------------------
/calibration.R:
--------------------------------------------------------------------------------
1 | library(ggplot2)
2 |
3 | BINS <- 25
4 | COLORS <- c("#a6cee3", "#e31a1c")
5 |
6 | # Loads and cleans data.
7 | # 8 | # The `filename` argument must be a CSV file with column names 9 | # "p_hat", "y_true" for the predicted probabilties and ground truth labels. 10 | # 11 | get.data <- function(filename) { 12 | data <- read.csv(filename) 13 | data$y_true <- as.factor(data$y_true) 14 | data 15 | } 16 | 17 | # Returns a ggplot object. Input `data` frame must include column names 18 | # "p_hat", "y_true" for the predicted probabilties and ground truth labels. 19 | # 20 | get.plot <- function(data) { 21 | ggplot(data, aes(p_hat, fill=y_true)) + 22 | geom_histogram(bins=BINS, position="stack", colour="grey") + 23 | scale_fill_manual(values=COLORS, name="Ground truth") + 24 | labs(x="Predicted probability", title="", y="") 25 | } 26 | -------------------------------------------------------------------------------- /common.py: -------------------------------------------------------------------------------- 1 | import joblib 2 | import json 3 | import logging 4 | 5 | LOGGER = logging.getLogger('cve-score') 6 | 7 | 8 | def load_json(filename): 9 | '''Reads a JSON object from a file.''' 10 | with open(filename) as fh: 11 | LOGGER.info('reading %s', filename) 12 | return json.load(fh) 13 | 14 | 15 | def dump_json(filename, data): 16 | '''Writes a JSON-serializable object to a file.''' 17 | with open(filename, 'w') as fh: 18 | LOGGER.info('writing %s', filename) 19 | json.dump(data, fh) 20 | 21 | 22 | def deserialize(filename): 23 | '''Loads a pickled object.''' 24 | LOGGER.info('loading %s', filename) 25 | return joblib.load(filename) 26 | 27 | 28 | def serialize(filename, object_): 29 | '''Writes an object to a pickle file.''' 30 | LOGGER.info('writing %s', filename) 31 | joblib.dump(object_, filename) 32 | -------------------------------------------------------------------------------- /config/baseline/config.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "key": "cvssV2", 4 | "preprocessor": "flatten", 5 | "encoder": "dense" 6 | }, 7 | { 8 | "key": "description", 9 | "preprocessor": "tokenize", 10 | "encoder": "sparse", 11 | "vocabulary_max_size": 10000 12 | }, 13 | { 14 | "key": "exploitdb", 15 | "preprocessor": "binarize", 16 | "encoder": "numeric" 17 | } 18 | ] 19 | -------------------------------------------------------------------------------- /config/baseline/prune-nvd.js: -------------------------------------------------------------------------------- 1 | select (.impact.baseMetricV2) | { 2 | cveid: .cve.CVE_data_meta.ID, 3 | description: .cve.description.description_data[0].value, 4 | cvssV2: .impact.baseMetricV2.cvssV2 5 | } 6 | -------------------------------------------------------------------------------- /config/logging.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": 1, 3 | "formatters": { 4 | "default": { 5 | "format": "%(asctime)s %(name)s %(levelname)s %(message)s", 6 | "dateformat": "%Y-%m-%d %H:%M:%S" 7 | } 8 | }, 9 | "handlers": { 10 | "console": { 11 | "class": "logging.StreamHandler", 12 | "formatter": "default" 13 | } 14 | }, 15 | "loggers": { 16 | "tensorflow": { 17 | "handlers": ["console"], 18 | "level": "WARN" 19 | }, 20 | "cve-score": { 21 | "handlers": ["console"], 22 | "level": "INFO" 23 | } 24 | } 25 | } 26 | -------------------------------------------------------------------------------- /curation/fetch-nvd.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 
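# Fetches the yearly NVD JSON 1.0 feeds listed below, then unpacks and
# concatenates them into a single JSONL file of CVE records, dropping
# entries marked "** REJECT **".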
SOURCES="https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2002.json.gz 4 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2003.json.gz 5 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2004.json.gz 6 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2005.json.gz 7 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2006.json.gz 8 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2007.json.gz 9 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2008.json.gz 10 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2009.json.gz 11 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2010.json.gz 12 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2011.json.gz 13 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2012.json.gz 14 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2013.json.gz 15 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2014.json.gz 16 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2015.json.gz 17 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2016.json.gz 18 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2017.json.gz 19 | https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2018.json.gz 20 | " 21 | 22 | PREFIX=$1 23 | [[ -n "$PREFIX" ]] || PREFIX=$HOME/nvd-data 24 | 25 | mkdir -p $PREFIX/raw 26 | for URL in $SOURCES 27 | do 28 | wget --no-clobber $URL -P $PREFIX/raw 29 | done 30 | 31 | gunzip -c $PREFIX/raw/* \ 32 | |jq -c ".CVE_Items[]" \ 33 | |grep -vE "\*\* REJECT \*\*" \ 34 | > $PREFIX/nvd-records.jsonl 35 | -------------------------------------------------------------------------------- /curation/merge-json.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import json 4 | import logging 5 | import argparse 6 | 7 | DEFAULT_LOGGING = { 8 | 'format': '%(asctime)s %(levelname)s %(message)s', 9 | 'level': logging.INFO, 10 | } 11 | 12 | 13 | def index(keyname, filename): 14 | '''Indexes a file of JSON records the "keyname" attributes.''' 15 | with open(filename) as fh: 16 | data = {record[keyname]: record for record in map(json.loads, fh)} 17 | 18 | logging.info('read %d unique records from %s', len(data), filename) 19 | return data 20 | 21 | def merge(left_index, right_index): 22 | '''Merges the right records into the left records, returning the set 23 | of merged records.''' 24 | for key, record in left_index.items(): 25 | record.update(right_index.get(key, {})) 26 | 27 | return left_index.values() 28 | 29 | 30 | if __name__ == '__main__': 31 | parser = argparse.ArgumentParser() 32 | parser.add_argument('leftfile') 33 | parser.add_argument('rightfile') 34 | parser.add_argument('outfile') 35 | parser.add_argument('--keyname', default='cveid', 36 | help='join key name [cveid]') 37 | args = parser.parse_args() 38 | logging.basicConfig(**DEFAULT_LOGGING) 39 | 40 | left_index = index(args.keyname, args.leftfile) 41 | right_index = index(args.keyname, args.rightfile) 42 | with open(args.outfile, 'w') as fh: 43 | logging.info('writing %s', args.outfile) 44 | for record in merge(left_index, right_index): 45 | fh.write('%s\n' % json.dumps(record)) 46 | -------------------------------------------------------------------------------- /curation/parse-cve.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import os 4 | import re 5 | import glob 6 | import json 7 | import logging 8 | 
import argparse 9 | from collections import defaultdict 10 | 11 | DEFAULT_LOGGING = { 12 | 'format': '%(asctime)s %(levelname)s %(message)s', 13 | 'level': logging.INFO, 14 | } 15 | CVE_RE = re.compile('CVE-\d{4}-\d+') 16 | 17 | def extract_cves(filename): 18 | '''Applies regex matching to file contents.''' 19 | with open(filename) as fh: 20 | logging.info('scanning %s', filename) 21 | return CVE_RE.findall(fh.read()) 22 | 23 | def _filenames(prefix, subdir): 24 | '''Returns a list of valid filenames.''' 25 | search_str = os.path.join(prefix, subdir, '**') 26 | logging.debug('searching %s', search_str) 27 | return filter(os.path.isfile, glob.glob(search_str, recursive=True)) 28 | 29 | def index_exploit_db(prefix): 30 | '''Indexes all exploit-db exploits. Returns a list of records 31 | { 32 | cveid: "CVE-YYYY-nnnn", 33 | exploitdb: "" 34 | } 35 | ''' 36 | index = defaultdict(list) 37 | for subdir in ['exploits', 'shellcodes']: 38 | for filename in _filenames(prefix, subdir): 39 | token = filename.split(prefix)[1] 40 | for cveid in extract_cves(filename): 41 | index[cveid].append(token) 42 | 43 | return [ 44 | dict(cveid=cveid, exploitdb=data) for (cveid, data) in index.items() 45 | ] 46 | 47 | def write_records(outfile, data): 48 | with open(outfile, 'w') as fh: 49 | logging.info('writing %d records to %s', len(data), outfile) 50 | for record in data: 51 | fh.write('%s\n' % json.dumps(record)) 52 | 53 | 54 | if __name__ == '__main__': 55 | parser = argparse.ArgumentParser() 56 | parser.add_argument('outfile') 57 | parser.add_argument('--exploit-db', help='path prefix for exploit-db data') 58 | args = parser.parse_args() 59 | logging.basicConfig(**DEFAULT_LOGGING) 60 | 61 | if args.exploit_db: 62 | records = index_exploit_db(args.exploit_db) 63 | write_records(args.outfile, records) 64 | -------------------------------------------------------------------------------- /docs/Baseline-model.md: -------------------------------------------------------------------------------- 1 | # Exploit prediction from CVSS 2 | 3 | This walk-through illustrates the main steps of the data workflow, training 4 | and evaluating one of the simplest baseline models for exploit prediction: 5 | a linear classifier using CVSS features. 6 | 7 | The input is a file of JSON records with raw fields like the one produced 8 | in [[data acquisition|Data-acquisition]]. In keeping with the steps there, 9 | it is assumed that the file `$HOME/cve-data/nvd-edb-merged.jsonl` exists, 10 | and each line is a JSON record which contains the keys `"cveid"`, `"cvssV2"`, 11 | and `"exploitdb"`. 12 | 13 | ## Step 1. Training/test split 14 | 15 | Evaluating how well an estimator generalizes involves measuring its 16 | performance on set of test data that is disjoint from the training data. 17 | What constitutes a faithful training/test split is somewhat dependent on 18 | the problem domain and is not always immediately ascertained. Vulnerability 19 | data is intrinsically temporal. 20 | A [principled approach](https://arxiv.org/abs/1707.08015) is to split the 21 | training and test sets along a time boundary, the idea being that a random 22 | training/test split may introduce subtle leakage of future information 23 | into the past. 24 | 25 | This approach is followed here, but using a method that is intentionally 26 | simplistic for the sake of illustration; namely, the data is sorted into 27 | coarse bins based on the YEAR component of the CVE identifier. 
Records
28 | with a CVE issued from 2011-2015 form the training data, and those with a
29 | CVE issued from 2016-2017 are the test data.
30 |
31 | $ cat ~/cve-data/nvd-edb-merged.jsonl \
32 | | jq -c 'select(.cveid |test("CVE-201[1-5]"))' \
33 | > ~/cve-data/nvd-edb-train-raw.jsonl
34 |
35 | $ cat ~/cve-data/nvd-edb-merged.jsonl \
36 | | jq -c 'select(.cveid |test("CVE-201[67]"))' \
37 | > ~/cve-data/nvd-edb-test-raw.jsonl
38 |
39 | $ wc -l ~/cve-data/nvd-edb*.jsonl
40 | ...
41 | 22352 /Users/.../nvd-edb-test-raw.jsonl
42 | 30663 /Users/.../nvd-edb-train-raw.jsonl
43 |
44 | So there are about 30.7k training examples and 22.4k test examples.
45 |
46 | ## Step 2. Preprocessing
47 |
48 | The repository includes a configuration object, `config.json`, for baseline
49 | models. This defines the preprocessing steps for converting raw fields into
50 | a uniform format:
51 |
52 | $ ./preprocess.py config/baseline/config.json \
53 | ~/cve-data/nvd-edb-train-raw.jsonl \
54 | ~/cve-data/nvd-edb-train-prep.jsonl \
55 | --vocabulary ~/cve-data/nvd-edb-vocabulary.json
56 |
57 | $ ./preprocess.py config/baseline/config.json \
58 | ~/cve-data/nvd-edb-test-raw.jsonl \
59 | ~/cve-data/nvd-edb-test-prep.jsonl
60 |
61 | Note that the `--vocabulary` artifact need only be created from the training
62 | set, as that will be taken as the source of truth for the token space.
63 |
64 | ## Step 3. Encoding
65 |
66 | The same configuration object used for preprocessing also defines the steps
67 | for transforming the JSON records into columnar numpy data:
68 |
69 | $ ./encode.py config/baseline/config.json \
70 | ~/cve-data/nvd-edb-vocabulary.json \
71 | ~/cve-data/nvd-edb-train-prep.jsonl \
72 | ~/cve-data/nvd-edb-train-numpy.pkl
73 |
74 | $ ./encode.py config/baseline/config.json \
75 | ~/cve-data/nvd-edb-vocabulary.json \
76 | ~/cve-data/nvd-edb-test-prep.jsonl \
77 | ~/cve-data/nvd-edb-test-numpy.pkl
78 |
79 | ## Training a model
80 |
81 | The repository includes a script, `linear.py`, for training and evaluating a
82 | [linear classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).
83 | Its performance is generally far from state-of-the-art for virtually any
84 | task around vulnerability scoring; however, it is very quick to train and
85 | is useful for smoke testing the data pipeline.
86 |
87 | The script has two commands, **train** and **eval**, and each takes two
88 | positional arguments specifying a dataset and an estimator, respectively.
89 | The dataset argument must be an artifact emitted by `encode.py`.
90 |
91 | The command line interface is somewhat verbose. The following command
92 | is used to produce an estimator from the training set
93 | [created above](#step-3-encoding):
94 |
95 | $ ./linear.py train ~/cve-data/nvd-edb-train-numpy.pkl \
96 | ~/cve-data/nvd-edb-linear.pkl \
97 | --feature-keys cvssV2 --label-key exploitdb
98 |
99 | This emits an estimator to the file `nvd-edb-linear.pkl`, using the `cvssV2`
100 | data as the features and `exploitdb` as the labels.
101 |
102 | Evaluating the estimator on the test set uses the same arguments, with
103 | only the command and the input dataset changed:
104 |
105 | $ ./linear.py eval ~/cve-data/nvd-edb-test-numpy.pkl \
106 | ~/cve-data/nvd-edb-linear.pkl \
107 | --feature-keys cvssV2 --label-key exploitdb
108 |
109 | ...
INFO metrics => {"AUC": 0.615513597301432}
110 |
--------------------------------------------------------------------------------
/docs/Data-acquisition.md:
--------------------------------------------------------------------------------
1 | # Label engineering: NVD and Exploit-DB
2 |
3 | Scripts in the `curation/` folder are used to gather data and apply some basic
4 | preprocessing so that it is ready for downstream feature engineering and
5 | encoding steps. The walk-through here builds a dataset from two separate
6 | sources: [NVD](https://nvd.nist.gov/) records, which constitute the features,
7 | and [Exploit DB](https://www.exploit-db.com/) exploits, which constitute
8 | positive labels.
9 |
10 | All steps assume that [jq](https://stedolan.github.io/jq/) is installed,
11 | and that [the repo](https://github.com/drjerry/cve-score) is cloned locally
12 | and is the working directory for all shell commands.
13 |
14 | ## Step 1. Collect and clean raw features
15 |
16 | The script `curation/fetch-nvd.sh` pulls compressed JSON files from the NVD repository and unpacks these as a single file of JSON records. It takes a destination path prefix as an optional argument.
17 |
18 | $ mkdir ~/cve-data
19 | $ ./curation/fetch-nvd.sh ~/cve-data
20 | $ wc -l ~/cve-data/nvd-records.jsonl
21 | 112294 /Users/.../nvd-records.jsonl
22 |
23 | Inspecting the first few records of the file reveals how deeply nested the
24 | JSON schema is:
25 |
26 | $ head ~/cve-data/nvd-records.jsonl | jq
27 |
28 | The first transform we apply selects attributes of interest and makes them
29 | top-level keys. It relies on a predefined JQ script in the `config/baseline/`
30 | folder:
31 |
32 | $ cat ~/cve-data/nvd-records.jsonl \
33 | | jq -c -f config/baseline/prune-nvd.js \
34 | > ~/cve-data/nvd-pruned.jsonl
35 | $ head ~/cve-data/nvd-pruned.jsonl | jq
36 |
37 |
38 | ## Step 2. Collect and clean raw labels
39 |
40 | Exploit DB maintains a version of its database as a simple file tree that is synced as a GitHub repository. To extract labels from its entries, a script is provided that (naively) associates exploits to CVEs by string matching on the file contents:
41 |
42 | $ git clone https://github.com/offensive-security/exploit-database.git ~/cve-data/exploit-db
43 | $ ./curation/parse-cve.py ~/cve-data/exploit-db.jsonl --exploit-db ~/cve-data/exploit-db
44 |
45 | The `--exploit-db` argument should point to the local path prefix of the
46 | Exploit DB repository. The positional argument is the target file. Each line is
47 | a JSON object that represents a one-to-many relationship between a single
48 | CVE ID and its associated exploits.
49 |
50 | $ head ~/cve-data/exploit-db.jsonl |jq
51 |
52 | ## Step 3. Merge the features and labels
53 |
54 | The `curation/merge-json.py` script is similar to the Linux `join` utility, except that it operates on JSON records and joins on a key name instead of a fixed
55 | column:
56 |
57 | $ ./curation/merge-json.py \
58 | ~/cve-data/nvd-pruned.jsonl \
59 | ~/cve-data/exploit-db.jsonl \
60 | ~/cve-data/nvd-edb-merged.jsonl --keyname cveid
61 |
62 | The first and second arguments are files where all records are expected to
63 | share a common key specified by the `--keyname` argument. The third argument
64 | is the output, effectively a "left outer join," where each record from the
65 | second file is merged into the corresponding record from the first where
66 | their specified keys coincide.
67 |
68 | The `--keyname` field _must serve as a primary key_ across both input files.
69 |
69 | If the same value is repeated across multiple records, later records in the 70 | file will overwrite the earlier ones. 71 | -------------------------------------------------------------------------------- /encode.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import joblib 5 | import json 6 | import logging 7 | from collections import defaultdict 8 | 9 | from common import load_json, serialize 10 | import encoders 11 | import settings 12 | 13 | LOGGER = logging.getLogger('cve-score') 14 | 15 | META_KEY = 'META' 16 | 17 | ENCODERS = { 18 | 'dense': encoders.DenseEncoder, 19 | 'sparse': encoders.SparseEncoder, 20 | 'numeric': encoders.NumericEncoder, 21 | 'embedding': encoders.EmbeddingEncoder, 22 | } 23 | 24 | 25 | def get_encoder(config, vocabulary): 26 | '''Factory method to instantiate a class in the `encoders` module, 27 | tied to a particular key. 28 | 29 | Arguments: 30 | config: [dict] specifying the "key," "encoder" type, and additional 31 | constructor arguments. 32 | vocabulary: [dict] mapping keys to token-frequency mappings. 33 | 34 | Returns a pair (key, encoder) on success, consisting of a string and a 35 | lambda function. Returns `None` on failure. 36 | ''' 37 | encoder_name = config.get('encoder') 38 | if encoder_name is None: 39 | return None 40 | if encoder_name not in ENCODERS: 41 | LOGGER.warning('unrecognized encoder: %s', encoder_name) 42 | return None 43 | key = config['key'] 44 | constructor = ENCODERS[encoder_name] 45 | kwargs = dict(vocabulary=vocabulary.get(key), **config) 46 | try: 47 | return (key, constructor(**kwargs)) 48 | except (ValueError, TypeError) as ex: 49 | LOGGER.warning('failure to instantiate "%s": %s', encoder_name, ex) 50 | return None 51 | 52 | 53 | def encode(encoders, records): 54 | '''Converts a stream of JSON records into a dictionary of numpy arrays. 55 | 56 | Arguments: 57 | encoders: list of (key, encoder) pairs, where the right hand value 58 | is an instance from the `encoders` module. 59 | records: file of line-separated JSON records. 60 | 61 | Returns a dictionary mapping "keys" from encoders list to numpy objects. 
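    For example, with the baseline config the result resembles
    {'cvssV2': ndarray (N, D), 'description': CSR matrix (N, V), 'exploitdb': ndarray (N,)},
    where N is the number of input records.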
62 | ''' 63 | indices = defaultdict(list) 64 | for record in map(json.loads, records): 65 | for (key, encoder) in encoders: 66 | indices[key].append(encoder(record[key])) 67 | 68 | return { 69 | key: encoder.transform(indices[key]) 70 | for (key, encoder) in encoders 71 | } 72 | 73 | 74 | if __name__ == '__main__': 75 | parser = argparse.ArgumentParser() 76 | parser.add_argument('config', help='JSON processing config') 77 | parser.add_argument('vocabulary', 78 | help='JSON file of token-frequency mappings') 79 | parser.add_argument('infile', help='raw JSON records, one per line') 80 | parser.add_argument('outfile', help='pickled dict of numpy arrays') 81 | args = parser.parse_args() 82 | settings.configure_logging() 83 | 84 | config = load_json(args.config) 85 | vocabulary = load_json(args.vocabulary) 86 | encoders = list(filter(None, 87 | [get_encoder(item, vocabulary) for item in config])) 88 | 89 | with open(args.infile) as fh: 90 | LOGGER.info('reading %s', args.infile) 91 | data = encode(encoders, fh) 92 | 93 | serialize(args.outfile, data) 94 | -------------------------------------------------------------------------------- /encoders.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy import sparse 3 | 4 | 5 | def trim_vocabulary(vocabulary, vocabulary_max_size=None): 6 | '''Extracts a vocabulary constrained to a target size. 7 | 8 | Arguments: 9 | vocabulary: [dict] mapping tokens to frequencies. 10 | vocabulary_max_size: [int] target size of return value. If `None` 11 | the full vocabulary is used, and likewise if it exceeds the 12 | size of the input. 13 | 14 | Returns a _list_ of target size. 15 | ''' 16 | size = vocabulary_max_size or len(vocabulary) 17 | if isinstance(vocabulary, dict): 18 | items = [(rank, token) for (token, rank) in vocabulary.items()] 19 | sorted_ = sorted(items, reverse=True) 20 | return [token for (_, token) in sorted_[:size]] 21 | raise ValueError('"vocabulary" must be a dict.') 22 | 23 | 24 | class _BoWEncoder(object): 25 | '''Base class for encoding tokens as integers. Initializes its state with 26 | an `_index` object, that maps the space of tokens to integer indices. 
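    For example, given token frequencies {'foo': 0.3, 'bar': 0.1, 'baz': 0.2},
    tokens are indexed by descending frequency as {'foo': 0, 'baz': 1, 'bar': 2},
    and calling the encoder on ['baz', 'qux', 'foo'] returns [1, 0]; tokens
    outside the vocabulary are silently dropped.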
27 | '''
28 | def __init__(self, vocabulary, vocabulary_max_size=None, **kwargs):
29 | trimmed = trim_vocabulary(vocabulary, vocabulary_max_size)
30 | self._index = {token: i for (i, token) in enumerate(trimmed)}
31 | self._input_dim = len(self._index)
32 |
33 | def __call__(self, tokens):
34 | return [
35 | self._index[token] for token in tokens
36 | if token in self._index
37 | ]
38 |
39 | def transform(self, records):
40 | raise NotImplementedError()
41 |
42 |
43 | class DenseEncoder(_BoWEncoder):
44 | '''Token encoder whose `transform` method returns a dense NumPy array.'''
45 |
46 | def __init__(self, vocabulary, vocabulary_max_size=None, **kwargs):
47 | super().__init__(vocabulary, vocabulary_max_size)
48 |
49 | def transform(self, records):
50 | '''Transforms a list of encoded records into a 2D NumPy tensor.'''
51 | shape = len(records), self._input_dim
52 | matrix = np.zeros(shape, dtype=np.float32)  # zeros so that absent tokens stay 0
53 | for row, indices in enumerate(records):
54 | matrix[row, indices] = 1
55 | return matrix
56 |
57 |
58 | class SparseEncoder(_BoWEncoder):
59 | '''Token encoder whose `transform` method returns a SciPy sparse array.'''
60 |
61 | def __init__(self, vocabulary, vocabulary_max_size=None, **kwargs):
62 | super().__init__(vocabulary, vocabulary_max_size)
63 |
64 | def transform(self, records):
65 | '''Transforms a list of encoded records into a 2D sparse tensor.'''
66 | shape = len(records), self._input_dim
67 | row_ind, col_ind = [], []
68 | for row, indices in enumerate(records):
69 | col_ind.extend(indices)
70 | row_ind.extend([row] * len(indices))
71 |
72 | return sparse.csr_matrix(
73 | ([1] * len(row_ind), (row_ind, col_ind)),
74 | shape=shape,
75 | dtype=np.float32)
76 |
77 |
78 | class EmbeddingEncoder(object):
79 | '''Token encoder whose `transform` method returns a NumPy array of
80 | integer indices compatible with TensorFlow Embedding layers.
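    Index 0 is reserved for padding; `transform` truncates or right-pads each
    record with zeros to exactly `input_length` indices.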
81 | ''' 82 | 83 | def __init__(self, 84 | vocabulary, 85 | input_length, 86 | vocabulary_max_size=None, **kwargs): 87 | trimmed = trim_vocabulary(vocabulary, vocabulary_max_size) 88 | # uses index 0 for padding 89 | self._index = {token: i+1 for (i, token) in enumerate(trimmed)} 90 | self._input_dim = len(self._index)+1 91 | self._input_length = int(input_length) 92 | 93 | def __call__(self, tokens): 94 | return [ 95 | self._index[token] for token in tokens 96 | if token in self._index 97 | ] 98 | 99 | def transform(self, records): 100 | '''Transforms a list of encoded records into a 2D NumPy tensor of 101 | integer indices with padding.''' 102 | shape = len(records), self._input_length 103 | # this is where 0-padding is enforceds 104 | matrix = np.zeros(shape, dtype=np.int32) 105 | for row, indices in enumerate(records): 106 | row_size = min(self._input_length, len(indices)) 107 | matrix[row,0:row_size] = indices[:row_size] 108 | return matrix 109 | 110 | 111 | class NumericEncoder(object): 112 | 113 | def __init__(self, dtype='int32', **kwargs): 114 | self._dtype = np.dtype(dtype) 115 | 116 | def __call__(self, value): 117 | return value 118 | 119 | def transform(self, values): 120 | return np.array(values, dtype=self._dtype) 121 | -------------------------------------------------------------------------------- /linear.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import json 5 | import logging 6 | 7 | import numpy as np 8 | from scipy import sparse 9 | from sklearn.linear_model import SGDClassifier 10 | from sklearn.metrics import roc_auc_score 11 | 12 | from common import serialize, deserialize 13 | import settings 14 | 15 | 16 | LOGGER = logging.getLogger('cve-score') 17 | 18 | CLF_DEFAULTS = { 19 | 'loss': 'log', 20 | 'penalty': 'elasticnet', 21 | 'class_weight': 'balanced', 22 | 'alpha': 1e-4, 23 | 'l1_ratio': 0.0, 24 | 'tol': 1e-4, 25 | } 26 | 27 | _PARSER = argparse.ArgumentParser() 28 | _PARSER.add_argument('command', help='train|eval') 29 | _PARSER.add_argument('dataset', help='serialized dict of numpy arrays') 30 | _PARSER.add_argument('estimator', help='serialized sklearn estimator') 31 | _PARSER.add_argument('--feature-keys', nargs='*', 32 | help='feature key in dataset') 33 | _PARSER.add_argument('--label-key', help='label key in dataset') 34 | _PARSER.add_argument('--alpha', type=float, default=1e-4, 35 | help='classifier L2 penalty term') 36 | _PARSER.add_argument('--l1-ratio', type=float, default=0, 37 | help='classifier Elastic Net mixing parameter') 38 | 39 | 40 | def reformat(data, feature_keys, label_key=None): 41 | '''Massages a dictionary of numpy arrays into a pair consumable by 42 | estimators in the sklearn API. 43 | 44 | Arguments: 45 | data: dictionary associating names to NumPy/SciPy objects. 46 | feature_keys: list selecting values in `data` to combine into the 47 | design matrix. 48 | label_key: [str|None] selects the array in `data` for the labels. 49 | 50 | Returns a pair (X, y), where y is `None` if label_key is not supplied. 
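    Note: the selected feature arrays are assumed to all be of the same kind
    (all dense NumPy arrays or all SciPy sparse matrices), since they are
    stacked horizontally into a single design matrix.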
51 | ''' 52 | feature_keys = [key for key in feature_keys if key in data] 53 | LOGGER.info('using features: %s', json.dumps(feature_keys)) 54 | 55 | labels = data.get(label_key) 56 | features = [data[key] for key in feature_keys] 57 | if isinstance(features[0], np.ndarray): 58 | return np.hstack(features), labels 59 | elif isinstance(features[0], sparse.spmatrix): 60 | return sparse.hstack(features), labels 61 | 62 | 63 | def train(dataset, estimator, feature_keys, label_key, **kwargs): 64 | '''Implements the "train" command.''' 65 | data = deserialize(dataset) 66 | features, labels = reformat(data, feature_keys, label_key) 67 | params = { 68 | key: kwargs.get(key, default) 69 | for (key, default) in CLF_DEFAULTS.items() 70 | } 71 | LOGGER.info('estimator params => %s', json.dumps(params)) 72 | clf = SGDClassifier(**params) 73 | clf.fit(features, labels) 74 | serialize(estimator, clf) 75 | 76 | 77 | def eval(dataset, estimator, feature_keys, label_key, **kwargs): 78 | '''Implements the "eval" command.''' 79 | data = deserialize(dataset) 80 | features, labels = reformat(data, feature_keys, label_key) 81 | clf = deserialize(estimator) 82 | class_probs = clf.predict_proba(features) 83 | if label_key is not None: 84 | metrics = dict(AUC=roc_auc_score(labels, class_probs[:, 1])) 85 | LOGGER.info('metrics => %s', json.dumps(metrics)) 86 | 87 | 88 | if __name__ == '__main__': 89 | args = _PARSER.parse_args() 90 | settings.configure_logging() 91 | 92 | if args.command == 'train': 93 | train(**vars(args)) 94 | elif args.command == 'eval': 95 | eval(**vars(args)) 96 | else: 97 | print('invalid command: %s' % args.command) 98 | -------------------------------------------------------------------------------- /neural-net.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import joblib 5 | import json 6 | import logging 7 | 8 | import numpy as np 9 | import tensorflow as tf 10 | from tensorflow.keras import layers, models 11 | 12 | from common import deserialize, serialize 13 | from encode import META_KEY 14 | import settings 15 | 16 | 17 | _PARSER = argparse.ArgumentParser() 18 | _PARSER.add_argument('command', help='train|eval') 19 | _PARSER.add_argument('dataset', help='serialized dict of numpy arrays') 20 | _PARSER.add_argument('estimator', help='serialized sklearn estimator') 21 | _PARSER.add_argument('--feature-keys', nargs='*', 22 | help='feature key in dataset') 23 | _PARSER.add_argument('--label-key', help='label key in dataset') 24 | _PARSER.add_argument('--batch-size', type=int, default=32, 25 | help='keras.models.Model#fit argument') 26 | _PARSER.add_argument('--epochs', type=int, default=10, 27 | help='keras.models.Model#fit argument') 28 | _PARSER.add_argument('--validation-split', type=float, default=0.25, 29 | help='keras.models.Model#fit argument') 30 | 31 | LOGGER = logging.getLogger('cve-score') 32 | 33 | OUTPUT_DIM = 128 34 | KERAS_METRICS = [ 35 | tf.keras.metrics.Precision(), 36 | tf.keras.metrics.Recall(), 37 | ] 38 | 39 | 40 | def _input_layer(key, tensor): 41 | input_dim = int(tensor.shape[-1]) 42 | return layers.Input(shape=(input_dim,), name=key) 43 | 44 | 45 | def _embedding_layer(input_layer, key, tensor): 46 | if tensor.dtype != 'int32': 47 | return input_layer 48 | input_dim = np.max(tensor) + 1 49 | input_length = tensor.shape[-1] 50 | embedding = layers.Embedding( 51 | input_dim, OUTPUT_DIM, input_length=input_length)(input_layer) 52 | return layers.Flatten()(embedding) 53 | 54 | 55 
| def _class_weights(labels):
56 | '''Computes 'balanced' class weights, as in `sklearn.utils.class_weight`.'''
57 | n_samples = labels.shape[0]
58 | weights = n_samples / 2 / np.bincount(labels)
59 | return dict(zip(range(2), weights))
60 |
61 |
62 | def get_model(features):
63 | '''Returns a compiled keras.Model.'''
64 | features = list(features.items()) # fix the ordering
65 | inputs = [
66 | _input_layer(key, tensor) for (key, tensor) in features
67 | ]
68 | embeddings = [
69 | _embedding_layer(input, key, tensor)
70 | for (input, (key, tensor)) in zip(inputs, features)
71 | ]
72 | concatenated = (embeddings[0] if len(embeddings) == 1
73 | else layers.concatenate(embeddings))
74 |
75 | outputs = layers.Dense(1, activation=tf.nn.sigmoid)(concatenated)  # sigmoid for a single-unit binary output
76 |
77 | model = models.Model(inputs=inputs, outputs=outputs)
78 | model.compile('adam', loss='binary_crossentropy', metrics=KERAS_METRICS)
79 | LOGGER.info('model => %s', model.get_config())
80 | return model
81 |
82 |
83 | def train(
84 | dataset, estimator, feature_keys, label_key, **kwargs):
85 | data = deserialize(dataset)
86 | features = {
87 | key: tensor for (key, tensor) in data.items()
88 | if key in feature_keys
89 | }
90 | labels = data[label_key]
91 | model = get_model(features)
92 | params = {
93 | key: value for (key, value) in kwargs.items()
94 | if key in {'batch_size', 'epochs', 'validation_split'}
95 | }
96 | params['class_weight'] = _class_weights(labels)
97 | model.fit(features, labels, **params)
98 |
99 |
100 | def eval(dataset, estimator, feature_keys, label_key, **kwargs):
101 | pass
102 |
103 | if __name__ == '__main__':
104 | args = _PARSER.parse_args()
105 | settings.configure_logging()
106 |
107 | if args.command == 'train':
108 | train(**vars(args))
109 | elif args.command == 'eval':
110 | eval(**vars(args))
111 | else:
112 | print('invalid command: %s' % args.command)
113 |
--------------------------------------------------------------------------------
/preprocess.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import argparse
4 | import json
5 | import logging
6 | from collections import defaultdict, Counter
7 |
8 | from common import load_json, dump_json
9 | import preprocessors
10 | import settings
11 |
12 | LOGGER = logging.getLogger('cve-score')
13 |
14 | PREPROCESSORS = {
15 | 'flatten': preprocessors.FlatMap,
16 | 'tokenize': preprocessors.Tokenizer,
17 | 'binarize': preprocessors.Binarizer,
18 | 'identity': preprocessors.PassThrough,
19 | }
20 |
21 | _TOKEN_COUNTS = defaultdict(Counter)
22 |
23 |
24 | def get_preprocessor(config):
25 | '''Factory method to instantiate a class in the `preprocessors` module,
26 | tied to a particular key.
27 |
28 | Arguments:
29 | config: [dict] specifying the "key," "preprocessor" type, and
30 | additional constructor arguments.
31 |
32 | Returns a pair (key, preprocessor) on success, consisting of a string and
33 | a callable preprocessor instance. Returns `None` on failure.
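    For example, {"key": "description", "preprocessor": "tokenize"} yields
    ('description', Tokenizer(...)).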
34 | ''' 35 | preproc_name = config.get('preprocessor') 36 | if preproc_name not in PREPROCESSORS: 37 | LOGGER.warning('unrecognized preprocessor: %s', preproc_name) 38 | return None 39 | key = config['key'] 40 | constructor = PREPROCESSORS[preproc_name] 41 | try: 42 | return (key, constructor(**config)) 43 | except Exception as ex: 44 | LOGGER.warning('failure to instantiate "%s": %s', preproc_name, ex) 45 | return None 46 | 47 | 48 | def preprocess(preprocessors, records): 49 | '''Iterator transforming a stream of raw JSON records into a stram 50 | of preprocessed records. 51 | 52 | Arguments: 53 | preprocessors: list of (key, lambda) pairs. 54 | records: `file` object of line-separated JSON records. 55 | 56 | Returns an iterator of dictionaries. 57 | ''' 58 | for record in map(json.loads, records): 59 | result = {} 60 | for (key, function) in preprocessors: 61 | value = function(record.get(key)) 62 | result[key] = value 63 | if isinstance(value, list): 64 | _TOKEN_COUNTS[key].update(set(value)) 65 | 66 | yield result 67 | 68 | 69 | if __name__ == '__main__': 70 | parser = argparse.ArgumentParser() 71 | parser.add_argument('config', help='JSON processing config') 72 | parser.add_argument('infile', help='raw JSON records, one per line') 73 | parser.add_argument('outfile') 74 | parser.add_argument('--vocabulary', 75 | help='target JSON file of token-count mappings') 76 | parser.add_argument('--logging', help='JSON logging config') 77 | args = parser.parse_args() 78 | settings.configure_logging(args.logging) 79 | 80 | config = load_json(args.config) 81 | preprocessors = list(filter(None, 82 | [get_preprocessor(item) for item in config])) 83 | 84 | with open(args.infile) as f_in: 85 | with open(args.outfile, 'w') as f_out: 86 | LOGGER.info('writing %s', args.outfile) 87 | for record in preprocess(preprocessors, f_in): 88 | f_out.write('%s\n' % json.dumps(record)) 89 | 90 | if args.vocabulary: 91 | dump_json(args.vocabulary, _TOKEN_COUNTS) 92 | -------------------------------------------------------------------------------- /preprocessors.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | class Binarizer(object): 5 | '''Callable for transforming "truthy/falsey" values to 0/1 respectively. 6 | ''' 7 | def __init__(self, **kwargs): 8 | pass 9 | 10 | def __call__(self, value): 11 | return int(bool(value)) 12 | 13 | 14 | class FlatMap(object): 15 | '''Callable for transforming a nested structure composted of dictionaries 16 | and lists into a single list of unique paths. The return value is always 17 | a list; an input that is a primitive (string or numeric) value is 18 | transformed into a list containing that value. 19 | ''' 20 | def __init__(self, **kwargs): 21 | pass 22 | 23 | def __call__(self, document): 24 | state = [] 25 | 26 | def recur(key, value): 27 | # Unpacks the `value` object to recursively add keys to the state 28 | if isinstance(value, dict): 29 | for child, new_value in value.items(): 30 | new_key = '.'.join(filter(None, [key, child])) 31 | recur(new_key, new_value) 32 | elif isinstance(value, list): 33 | for child in value: 34 | recur(key, child) 35 | else: 36 | state.append(':'.join(filter(None, [key, str(value)]))) 37 | 38 | recur('', document) 39 | return state 40 | 41 | 42 | class Tokenizer(object): 43 | '''Transforms a string into a list of "canonical" tokens via two 44 | operations: (1) normalizing to lower case, and (2) extracting only 45 | matches of a single regex. 
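    For example, Tokenizer(regex='[\\w\\-]+') maps 'Remote Code-Execution!' to
    ['remote', 'code-execution'].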
46 | ''' 47 | def __init__(self, regex='\\S+', **kwargs): 48 | self._regex = re.compile(regex) 49 | 50 | def __call__(self, document): 51 | state = [] 52 | 53 | def recur(datum): 54 | if isinstance(datum, str): 55 | state.extend(self._regex.findall(datum.lower())) 56 | elif isinstance(datum, list): 57 | for child in datum: 58 | recur(child) 59 | 60 | recur(document) 61 | return list(filter(None, state)) 62 | 63 | 64 | class PassThrough(object): 65 | '''Implements the identity transform.''' 66 | def __init__(self, **kwargs): 67 | pass 68 | 69 | def __call__(self, document): 70 | return document 71 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.13.0 2 | tensorflow>=1.13.1 3 | scipy>=1.2.1 4 | nose>=1.3.7 5 | -------------------------------------------------------------------------------- /settings.py: -------------------------------------------------------------------------------- 1 | import json 2 | from logging.config import dictConfig 3 | 4 | 5 | LOGGER_CFG = { 6 | 'version': 1, 7 | 'formatters': { 8 | 'default': { 9 | 'format': '%(asctime)s %(name)s %(levelname)s %(message)s', 10 | 'dateformat': '%Y-%m-%d %H:%M:%S', 11 | }, 12 | }, 13 | 'handlers': { 14 | 'console': { 15 | 'class': 'logging.StreamHandler', 16 | 'formatter': 'default', 17 | }, 18 | }, 19 | 'loggers': { 20 | 'tensorflow': { 21 | 'handlers': ['console'], 22 | 'level': 'INFO', 23 | }, 24 | 'cve-score': { 25 | 'handlers': ['console'], 26 | 'level': 'INFO', 27 | } 28 | } 29 | } 30 | 31 | 32 | def configure_logging(filename=None): 33 | '''Configures logging using an optional JSON file, or the defaults.''' 34 | if filename is not None: 35 | with open(filename) as fh: 36 | cfg = json.load(fh) 37 | else: 38 | cfg = LOGGER_CFG 39 | 40 | dictConfig(cfg) 41 | -------------------------------------------------------------------------------- /tests/test_encoders.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import numpy as np 3 | 4 | from encoders import _BoWEncoder 5 | from encoders import DenseEncoder 6 | from encoders import EmbeddingEncoder 7 | from encoders import NumericEncoder 8 | from encoders import SparseEncoder 9 | from encoders import trim_vocabulary 10 | 11 | 12 | class MethodsTest(unittest.TestCase): 13 | 14 | def test_trim_vocabulary(self): 15 | 16 | vocabulary = { 17 | 'foo': 2, 'bar': 1, 'baz': 1 18 | } 19 | expected = ['foo', 'baz', 'bar'] 20 | 21 | self.assertEqual(expected, trim_vocabulary(vocabulary)) 22 | 23 | 24 | class BoWEncoderTest(unittest.TestCase): 25 | 26 | def test_dictionary_initializer(self): 27 | 28 | vocabulary = { 29 | 'foo': 0.3, 'bar': 0.1, 'baz': 0.2 30 | } 31 | # [bar, baz, foo] => [2, 1, 0] 32 | data = ['baz', 'qux', 'foo'] 33 | expected = [1, 0] 34 | 35 | encoder = _BoWEncoder(vocabulary) 36 | self.assertEqual(expected, encoder(data)) 37 | 38 | def test_encoder_correctness(self): 39 | 40 | vocabulary = { 41 | 'foo': 0.3, 'bar': 0.1, 'baz': 0.2 42 | } 43 | # [bar, baz, foo] => [2, 1, 0] 44 | vocabulary_max_size = 2 45 | data = ['baz', 'bar'] 46 | expected = [1] 47 | 48 | encoder = _BoWEncoder( 49 | vocabulary, vocabulary_max_size=vocabulary_max_size) 50 | self.assertEqual(expected, encoder(data)) 51 | 52 | 53 | class DenseEncoderTest(unittest.TestCase): 54 | 55 | def test_delegated_initializer(self): 56 | # BoWEncoderTest.test_dictionary_initializer 57 | 58 | vocabulary = { 59 | 'foo': 3, 
'bar': 1, 'baz': 2 60 | } 61 | data = ['baz', 'qux', 'foo'] 62 | expected = [1, 0] 63 | 64 | encoder = DenseEncoder(vocabulary) 65 | self.assertEqual(expected, encoder(data)) 66 | 67 | def test_transform(self): 68 | 69 | vocabulary = { 70 | 'foo': 3, 'bar': 1, 'baz': 2 71 | } 72 | encoder = DenseEncoder(vocabulary) 73 | 74 | data = [ 75 | [2, 1], [0, 2], 76 | ] 77 | onehot = [ 78 | [0, 1, 1], [1, 0, 1], 79 | ] 80 | 81 | actual = encoder.transform(data) 82 | expected = np.array(onehot, dtype=np.float32) 83 | self.assertTrue(np.allclose(expected, actual)) 84 | 85 | 86 | class SparseEncoderTest(unittest.TestCase): 87 | 88 | def test_transform(self): 89 | 90 | vocabulary = { 91 | 'foo': 3, 'bar': 1, 'baz': 2 92 | } 93 | encoder = SparseEncoder(vocabulary) 94 | 95 | data = [ 96 | [2, 1], [0, 2], 97 | ] 98 | onehot = [ 99 | [0, 1, 1], [1, 0, 1], 100 | ] 101 | 102 | actual = encoder.transform(data).toarray() 103 | expected = np.array(onehot, dtype=np.float32) 104 | self.assertTrue(np.allclose(expected, actual)) 105 | 106 | 107 | class NumericeEncoderTest(unittest.TestCase): 108 | 109 | def test_transform(self): 110 | 111 | data = [[1], [3], [2]] 112 | expected_shape = (3, 1) 113 | 114 | encoder = NumericEncoder() 115 | actual = encoder.transform(data) 116 | self.assertEqual(expected_shape, actual.shape) 117 | 118 | 119 | class EmbeddingEncoderTest(unittest.TestCase): 120 | 121 | def test_encoder_correctness(self): 122 | 123 | vocabulary = { 124 | 'foo': 3, 'bar': 1, 'baz': 2 125 | } 126 | # [bar, baz, foo] => [3, 2, 1] 127 | input_length = 2 128 | encoder = EmbeddingEncoder(vocabulary, input_length) 129 | 130 | data = ['baz', 'bar'] 131 | expected = [2, 3] 132 | 133 | self.assertEqual(expected, encoder(data)) 134 | 135 | def test_transform(self): 136 | 137 | vocabulary = { 138 | 'foo': 3, 'bar': 1, 'baz': 2 139 | } 140 | # [bar, baz, foo] => [3, 2, 1] 141 | input_length = 2 142 | encoder = EmbeddingEncoder(vocabulary, input_length) 143 | 144 | data = [ 145 | [3, 2], 146 | [2], 147 | [3, 2, 1], 148 | ] 149 | embedded = [ 150 | [3, 2], 151 | [2, 0], 152 | [3, 2], 153 | ] 154 | 155 | actual = encoder.transform(data) 156 | expected = np.array(embedded, dtype=np.int32) 157 | self.assertTrue(np.array_equal(expected, actual)) 158 | -------------------------------------------------------------------------------- /tests/test_preprocessors.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | from preprocessors import Binarizer 3 | from preprocessors import FlatMap 4 | from preprocessors import Tokenizer 5 | 6 | 7 | class TestBinarizer(unittest.TestCase): 8 | 9 | def test_correctness(self): 10 | # pairs of (, ) 11 | fixtures = [ 12 | ('token', 1), 13 | ([2, 'vals'], 1), 14 | ({}, 0), 15 | (None, 0), 16 | ] 17 | 18 | processor = Binarizer() 19 | for datum, expected in fixtures: 20 | self.assertEqual(expected, processor(datum)) 21 | 22 | 23 | class TestFlatMap(unittest.TestCase): 24 | 25 | def test_nested_objects(self): 26 | # pairs of (, ) 27 | fixtures = [ 28 | ({'A': {'a': '1', 'b': 1}}, ['A.a:1', 'A.b:1']), 29 | ({'A': ['a', 'b']}, ['A:a', 'A:b']), 30 | (['A', {'B': 'b'}], ['A', 'B:b']), 31 | ] 32 | 33 | processor = FlatMap() 34 | for datum, expected in fixtures: 35 | actual = processor(datum) 36 | self.assertEqual(sorted(expected), sorted(actual)) 37 | 38 | def test_non_nested_objects(self): 39 | # pairs of (, ) 40 | fixtures = [ 41 | (['A', 'B'], ['A', 'B']), 42 | ('A', ['A']), 43 | (1, ['1']), 44 | ] 45 | 46 | processor = FlatMap() 47 | for datum, 
expected in fixtures: 48 | actual = processor(datum) 49 | self.assertEqual(sorted(expected), sorted(actual)) 50 | 51 | 52 | class TestTokenizer(unittest.TestCase): 53 | 54 | def test_regex_matcher(self): 55 | 56 | regex = '[\\w\\-]+' # matches words with hyphens 57 | fixtures = [ 58 | ( 59 | 'Hypenated-strings are preserved!', 60 | ['hypenated-strings', 'are', 'preserved'], 61 | ), 62 | ( 63 | 'Tokens with inter=punc/tua/tion 4R3 split', 64 | ['tokens', 'with', 'inter', 'punc', 'tua', 65 | 'tion', '4r3', 'split'], 66 | ) 67 | ] 68 | 69 | processor = Tokenizer(regex) 70 | for datum, expected in fixtures: 71 | self.assertEqual(expected, processor(datum)) 72 | 73 | def test_list_handling(self): 74 | 75 | regex = '\\w+' 76 | fixtures = [ 77 | ( 78 | [ 79 | 'Some first string,', 80 | '-- and a second -- are handled!' 81 | ], 82 | [ 83 | 'some', 'first', 'string', 'and', 'a', 'second', 84 | 'are', 'handled' 85 | ], 86 | ), 87 | ] 88 | 89 | processor = Tokenizer(regex) 90 | for datum, expected in fixtures: 91 | self.assertEqual(expected, processor(datum)) 92 | --------------------------------------------------------------------------------