├── .gitattributes
├── vulner_embedding.bin
└── README.md


/.gitattributes:
--------------------------------------------------------------------------------
1 | vulner_embedding.bin filter=lfs diff=lfs merge=lfs -text
2 | 


--------------------------------------------------------------------------------
/vulner_embedding.bin:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:d2c57010c9bad7c25de5fd7559f21a34d1e0339f838b0fbbbc5dbea22896be6b
3 | size 137771920
4 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Word Representation for Cyber Security Vulnerability Domain
 2 | This repo provides a word representation (SecVuln_WE) and a dataset for benchmarking word similarity and relatedness in the cyber security vulnerability domain. The following paper describes the step-by-step procedure for training the word embedding and constructing the similarity dataset.
 3 | 
 4 | ## SecVuln_WE
 5 | A word2vec model trained on multiple heterogeneous sources, including Vulners, English Wikipedia (Security category), Information Security Stack Exchange Q&As, Common Weakness Enumeration (CWE) and Stack Overflow.
 6 | ### Model file
 7 | The pre-trained word embedding (SecVuln_WE) is stored in the binary file `vulner_embedding.bin` (approximately 138 MB, tracked with Git LFS).
 8 | 
 9 | ### Instructions on how to use the model
10 | #### Prerequisites
11 | To load the model you will need Python 3.5 and the [gensim](https://radimrehurek.com/gensim/) library.
12 | 
13 | #### Loading the model
14 | ```
15 | from gensim.models.keyedvectors import KeyedVectors
16 | word_vect = KeyedVectors.load_word2vec_format("vulner_embedding.bin", binary=True)
17 | ```
18 | #### Querying the model
19 | 
20 | Examples of semantic similarity queries
21 | ```
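22 | words = ['vulnerability', 'patch']
23 | for w in words:
24 |     try:
25 |         # Five nearest neighbours by cosine similarity
26 |         print(word_vect.most_similar(w, topn=5))
27 |     except KeyError as e:
28 |         # Raised when the word is not in the model vocabulary
29 |         print(e)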
28 | >> [(u'vulnerabilities', 0.889), (u'bug', 0.786), (u'flaw', 0.742), (u'exploit', 0.740), (u'issues', 0.739)]
29 | >> [(u'patches', 0.816), (u'updates', 0.707), (u'fixes', 0.702), (u'fix', 0.688), (u'upgrade', 0.667)]
30 | ```
31 | ```
32 | print(word_vect.similarity('bug', 'flaw'))
33 | >> 0.72691536
34 | ```
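The score returned by `similarity` is the cosine similarity between the two word vectors. A minimal sketch of that computation on plain Python lists follows; the helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for illustration and are not taken from the model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical vectors, for illustration only (not real embedding values).
bug = [1.0, 2.0, 0.0]
flaw = [2.0, 4.0, 0.0]
print(cosine_similarity(bug, flaw))
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, so a value such as the one printed for `bug` and `flaw` above indicates fairly similar usage contexts.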
35 | ```
36 | print(word_vect.doesnt_match("exploit attack weakness python".split()))
37 | >> python
38 | ```
39 | Examples of analogy queries
40 | 
41 | ```
42 | print(word_vect.most_similar(positive=['exploit', 'title'], negative=['ubuntu']))
43 | >> [(u'vulnerability', 0.571), (u'xss', 0.556), (u'injection', 0.501)]
44 | ```
45 | 
46 | ## Word Similarity Dataset
47 | The Word Similarity dataset is a collection of word pairs for measuring the similarity and relatedness of cyber security terms.
48 | The dataset file is in CSV format and consists of two columns, word1 and word2. It is available [here](https://drive.google.com/drive/u/0/folders/1NyyNKD0UogYBg4iQl-zos40HygDIf_lE) for download.
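One way to use the dataset with the model is to score every word pair in the CSV file. The sketch below is hedged: it assumes the file has a header row naming the columns `word1` and `word2`, and the helper `score_pairs`, the in-memory sample and the toy similarity values are all hypothetical, so that the example runs without the model file.

```python
import csv
import io


def score_pairs(rows, similarity, in_vocab):
    """Return (word1, word2, score) per row; score is None if a word is unknown."""
    results = []
    for row in rows:
        w1 = row["word1"].strip()
        w2 = row["word2"].strip()
        if in_vocab(w1) and in_vocab(w2):
            results.append((w1, w2, similarity(w1, w2)))
        else:
            # Keep out-of-vocabulary pairs visible instead of dropping them.
            results.append((w1, w2, None))
    return results


# Hypothetical stand-ins for the CSV file and the model (illustration only).
sample = io.StringIO("word1,word2\nbug,flaw\nbug,unknownword\n")
toy_scores = {("bug", "flaw"): 0.73}
toy_vocab = {"bug", "flaw"}

results = score_pairs(
    csv.DictReader(sample),
    similarity=lambda a, b: toy_scores[(a, b)],
    in_vocab=lambda w: w in toy_vocab,
)
for item in results:
    print(item)
```

With the model loaded as `word_vect`, the same helper could be called with `similarity=word_vect.similarity` and `in_vocab=lambda w: w in word_vect`, passing a `csv.DictReader` opened on the downloaded file.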
49 | 


--------------------------------------------------------------------------------