├── .gitattributes
├── vulner_embedding.bin
└── README.md


/.gitattributes:
--------------------------------------------------------------------------------
vulner_embedding.bin filter=lfs diff=lfs merge=lfs -text

--------------------------------------------------------------------------------
/vulner_embedding.bin:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:d2c57010c9bad7c25de5fd7559f21a34d1e0339f838b0fbbbc5dbea22896be6b
size 137771920

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Word Representation for Cyber Security Vulnerability Domain
This repo provides a word representation (SecVuln_WE) and a dataset for benchmarking word similarity and relatedness in the cyber security vulnerability domain. The following paper describes the step-by-step procedure for training the word embedding and constructing the similarity dataset.

## SecVuln_WE
SecVuln_WE is a word2vec model trained on multiple heterogeneous sources, including Vulners, English Wikipedia (Security category), Information Security Stack Exchange Q&As, Common Weakness Enumeration (CWE), and Stack Overflow.
### Model file
The pre-trained word embedding (SecVuln_WE) is stored in a binary file, `vulner_embedding.bin` (approximately 138 MB), which is tracked with Git LFS.

### Instructions on how to use the model
#### Prerequisites
To load the model, you will need Python 3.5 and the [gensim](https://radimrehurek.com/gensim/) library.
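
Because `vulner_embedding.bin` is tracked with Git LFS, a clone made without LFS leaves a small text pointer in place of the binary, and that pointer cannot be loaded as a model. The sketch below is not part of the original repo. It checks for that case and compares the file against the size and SHA-256 digest recorded in the LFS pointer. The function names are illustrative assumptions.

```python
import hashlib
import os

# Values copied from the Git LFS pointer for vulner_embedding.bin in this repo.
EXPECTED_SHA256 = "d2c57010c9bad7c25de5fd7559f21a34d1e0339f838b0fbbbc5dbea22896be6b"
EXPECTED_SIZE = 137771920  # bytes

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the first bytes of a file match a Git LFS pointer."""
    return head.startswith(LFS_POINTER_PREFIX)


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so the whole model is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_model_file(path: str) -> str:
    """Return 'pointer', 'ok', or 'mismatch' for the file at `path` (hypothetical helper)."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    if looks_like_lfs_pointer(head):
        return "pointer"  # the real binary has not been fetched yet
    size_ok = os.path.getsize(path) == EXPECTED_SIZE
    if size_ok and sha256_of_file(path) == EXPECTED_SHA256:
        return "ok"
    return "mismatch"
```

If `check_model_file("vulner_embedding.bin")` returns `"pointer"`, running `git lfs pull` inside the clone should fetch the actual model file.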

#### Loading the model
```
from gensim.models.keyedvectors import KeyedVectors

# Load the pre-trained vectors from the binary word2vec file
word_vect = KeyedVectors.load_word2vec_format("vulner_embedding.bin", binary=True)
```
#### Querying the model

Examples of semantic similarity queries
```
words = ['vulnerability', 'patch']
for w in words:
    try:
        # Five nearest neighbours of w by cosine similarity
        print(word_vect.most_similar(w, topn=5))
    except KeyError as e:
        # Raised when w is not in the model's vocabulary
        print(e)
# >> [(u'vulnerabilities', 0.889), (u'bug', 0.786), (u'flaw', 0.742), (u'exploit', 0.740), (u'issues', 0.739)]
# >> [(u'patches', 0.816), (u'updates', 0.707), (u'fixes', 0.702), (u'fix', 0.688), (u'upgrade', 0.667)]
```
```
# Cosine similarity between two words
print(word_vect.similarity('bug', 'flaw'))
# >> 0.72691536
```
```
# Word that fits least with the others
print(word_vect.doesnt_match("exploit attack weakness python".split()))
# >> python
```
Examples of analogy queries

```
# Vector arithmetic: exploit + title - ubuntu
print(word_vect.most_similar(positive=['exploit', 'title'], negative=['ubuntu']))
# >> [(u'vulnerability', 0.571), (u'xss', 0.556), (u'injection', 0.501)]
```

## Word Similarity Dataset
The Word Similarity dataset is a collection of word pairs for measuring the similarity and relatedness of cyber security words.
The file is in CSV format and consists of two columns, word1 and word2. The dataset is available [here](https://drive.google.com/drive/u/0/folders/1NyyNKD0UogYBg4iQl-zos40HygDIf_lE) for download.

--------------------------------------------------------------------------------