├── .gitignore
├── README.md
├── Data Features NVD.ipynb
├── CERT Impact Analysis .ipynb
├── Bi Grams.ipynb
├── Nessus Plugins Prediction .ipynb
├── Word2Vectors.ipynb
├── Yearly NVD Analysis .ipynb
└── Tensor Flow NN.ipynb

/.gitignore:
--------------------------------------------------------------------------------
1 | cve-2002.json
2 | cve-2003.json
3 | cve-2004.json
4 | cve-2005.json
5 | cve-2006.json
6 | cve-2007.json
7 | cve-2008.json
8 | cve-2009.json
9 | cve-2010.json
10 | cve-2011.json
11 | cve-2012.json
12 | cve-2013.json
13 | cve-2014.json
14 | cve-2015.json
15 | cve-2016.json
16 | cve-2017.json
17 | cve-2018.json
18 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Vuln-Analysis-ML-Python
2 | This project deals with vulnerability analysis and classification using machine learning techniques, specifically natural language processing (NLP).
3 |
4 | ## Abstract
5 | With the advent of the Internet, systems have become increasingly vulnerable over the years. It is therefore crucial to assess how vulnerable a system is and what level of protection is required to secure its sensitive data. This need gave rise to the process of vulnerability assessment, or vulnerability analysis. The process has several steps. First, it is necessary to identify how vulnerable a system is. The identified vulnerability then needs to be classified in order to assess the impact it could have on the system if exploited. For classifying already identified vulnerabilities, a US-based database created in the year 2000, the National Vulnerability Database (NVD), is widely considered the common reference standard. It uses the Security Content Automation Protocol (SCAP). Security analysts use the Common Vulnerability Scoring System (CVSS) to classify an identified vulnerability; this system is based on an extensive set of CVSS metrics.
6 |
7 |
8 | The main idea behind this thesis is to automate the process of assessing these vulnerabilities. To avoid the cumbersome and time-consuming manual classification of the severity of an identified vulnerability, a much swifter, state-of-the-art machine learning approach is taken to predict the impact of an identified vulnerability. Our research mainly focuses on machine-learning-based natural language processing (NLP).
9 |
10 |
11 | The research focuses on two aspects of classifying the impact of an identified vulnerability. First, it classifies the vulnerability impact into classes: low, medium, high, and critical. Second, based on a regression model, the system also predicts the CVSS score assigned to an identified vulnerability. The results are achieved with considerable precision and are discussed further in the results section. Deep-learning-based neural networks are also explored to observe how a deep learning model affects the results.
12 |
13 |
14 | In conclusion, the developed model employs a robust machine learning based approach to classify and assess identified vulnerabilities. Future work may include further refining the model or using deep-learning-based models as the National Vulnerability Database grows.
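
As a companion to the notebooks in this repository, the following is a minimal sketch of the pipeline they implement: load an NVD JSON data feed (the per-year feeds listed in `.gitignore`, e.g. `cve-2016.json`), extract the vulnerability descriptions together with their CVSS v2 severity and base score, build bag-of-words features, then train a Random Forest classifier for severity and a Gradient Boosting regressor for the score. Hyperparameters mirror the notebooks; the simple train/test split below stands in for the StratifiedKFold loops used there, so treat it as an illustration rather than the exact experimental setup.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load one NVD JSON data feed (downloaded separately; see .gitignore).
dataset = pd.read_json('cve-2016.json')

# Keep only CVE entries that carry CVSS v2 impact data.
descriptions, severities, scores = [], [], []
for item in dataset.CVE_Items:
    if 'baseMetricV2' in item['impact']:
        descriptions.append(item['cve']['description']['description_data'][0]['value'])
        severities.append(item['impact']['baseMetricV2']['severity'])
        scores.append(item['impact']['baseMetricV2']['cvssV2']['baseScore'])

# Bag-of-words features over the raw descriptions.
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(descriptions).toarray()

# Classification: LOW / MEDIUM / HIGH severity labels.
X_tr, X_te, y_tr, y_te = train_test_split(X, severities, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_tr, y_tr)
print('severity accuracy:', clf.score(X_te, y_te))

# Regression: predict the numeric CVSS v2 base score.
X_tr, X_te, s_tr, s_te = train_test_split(
    X, np.array(scores, dtype=float), test_size=0.2, random_state=0)
reg = GradientBoostingRegressor(n_estimators=500, max_depth=4, learning_rate=0.01)
reg.fit(X_tr, s_tr)
print('score MAE:', mean_absolute_error(s_te, reg.predict(X_te)))
```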
15 | -------------------------------------------------------------------------------- /Data Features NVD.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import warnings\n", 12 | "warnings.filterwarnings('ignore')" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "dataset2018 = pd.read_json('cve-2018.json')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 3, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "(13949, 6)" 33 | ] 34 | }, 35 | "execution_count": 3, 36 | "metadata": {}, 37 | "output_type": "execute_result" 38 | } 39 | ], 40 | "source": [ 41 | "dataset2018.shape" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 4, 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "text/plain": [ 52 | "array(['CVE_Items', 'CVE_data_format', 'CVE_data_numberOfCVEs',\n", 53 | " 'CVE_data_timestamp', 'CVE_data_type', 'CVE_data_version'],\n", 54 | " dtype=object)" 55 | ] 56 | }, 57 | "execution_count": 4, 58 | "metadata": {}, 59 | "output_type": "execute_result" 60 | } 61 | ], 62 | "source": [ 63 | "dataset2018.columns.values" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "{'configurations': {'CVE_data_version': '4.0',\n", 75 | " 'nodes': [{'cpe': [{'cpe22Uri': 'cpe:/o:microsoft:windows_10:1511',\n", 76 | " 'cpe23Uri': 'cpe:2.3:o:microsoft:windows_10:1511:*:*:*:*:*:*:*',\n", 77 | " 'vulnerable': True},\n", 78 | " {'cpe22Uri': 'cpe:/o:microsoft:windows_10:1607',\n", 79 | " 'cpe23Uri': 'cpe:2.3:o:microsoft:windows_10:1607:*:*:*:*:*:*:*',\n", 80 | " 'vulnerable': True},\n", 81 | " {'cpe22Uri': 'cpe:/o:microsoft:windows_server_2016:-',\n", 82 | " 'cpe23Uri': 'cpe:2.3:o:microsoft:windows_server_2016:-:*:*:*:*:*:*:*',\n", 83 | " 'vulnerable': True}],\n", 84 | " 'operator': 'OR'}]},\n", 85 | " 'cve': {'CVE_data_meta': {'ASSIGNER': 'cve@mitre.org', 'ID': 'CVE-2017-0219'},\n", 86 | " 'affects': {'vendor': {'vendor_data': [{'product': {'product_data': [{'product_name': 'windows_10',\n", 87 | " 'version': {'version_data': [{'version_value': '1511'},\n", 88 | " {'version_value': '1607'}]}},\n", 89 | " {'product_name': 'windows_server_2016',\n", 90 | " 'version': {'version_data': [{'version_value': '-'}]}}]},\n", 91 | " 'vendor_name': 'microsoft'}]}},\n", 92 | " 'data_format': 'MITRE',\n", 93 | " 'data_type': 'CVE',\n", 94 | " 'data_version': '4.0',\n", 95 | " 'description': {'description_data': [{'lang': 'en',\n", 96 | " 'value': 'Microsoft Windows 10 Gold, Windows 10 1511, Windows 10 1607, and Windows Server 2016 allow an attacker to exploit a security feature bypass vulnerability in Device Guard that could allow the attacker to inject malicious code into a Windows PowerShell session, aka \"Device Guard Code Integrity Policy Security Feature Bypass Vulnerability.\" This CVE ID is unique from CVE-2017-0173, CVE-2017-0215, CVE-2017-0216, and CVE-2017-0218.'}]},\n", 97 | " 'problemtype': {'problemtype_data': [{'description': [{'lang': 'en',\n", 98 | " 'value': 'CWE-254'}]}]},\n", 99 | " 'references': {'reference_data': [{'url': 'http://www.securityfocus.com/bid/98898'},\n", 100 | " 
{'url': 'https://portal.msrc.microsoft.com/en-US/security-guidance/advisory/CVE-2017-0219'}]}},\n", 101 | " 'impact': {'baseMetricV2': {'cvssV2': {'accessComplexity': 'LOW',\n", 102 | " 'accessVector': 'LOCAL',\n", 103 | " 'authentication': 'NONE',\n", 104 | " 'availabilityImpact': 'PARTIAL',\n", 105 | " 'baseScore': 4.6,\n", 106 | " 'confidentialityImpact': 'PARTIAL',\n", 107 | " 'integrityImpact': 'PARTIAL',\n", 108 | " 'vectorString': '(AV:L/AC:L/Au:N/C:P/I:P/A:P)',\n", 109 | " 'version': '2.0'},\n", 110 | " 'exploitabilityScore': 3.9,\n", 111 | " 'impactScore': 6.4,\n", 112 | " 'obtainAllPrivilege': False,\n", 113 | " 'obtainOtherPrivilege': False,\n", 114 | " 'obtainUserPrivilege': False,\n", 115 | " 'severity': 'MEDIUM',\n", 116 | " 'userInteractionRequired': False},\n", 117 | " 'baseMetricV3': {'cvssV3': {'attackComplexity': 'LOW',\n", 118 | " 'attackVector': 'LOCAL',\n", 119 | " 'availabilityImpact': 'LOW',\n", 120 | " 'baseScore': 5.3,\n", 121 | " 'baseSeverity': 'MEDIUM',\n", 122 | " 'confidentialityImpact': 'LOW',\n", 123 | " 'integrityImpact': 'LOW',\n", 124 | " 'privilegesRequired': 'LOW',\n", 125 | " 'scope': 'UNCHANGED',\n", 126 | " 'userInteraction': 'NONE',\n", 127 | " 'vectorString': 'CVSS:3.0/AV:L/AC:L/PR:L/UI:N/S:U/C:L/I:L/A:L',\n", 128 | " 'version': '3.0'},\n", 129 | " 'exploitabilityScore': 1.8,\n", 130 | " 'impactScore': 3.4}},\n", 131 | " 'lastModifiedDate': '2017-06-21T17:48Z',\n", 132 | " 'publishedDate': '2017-06-15T01:29Z'}" 133 | ] 134 | }, 135 | "execution_count": 5, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "dataset2018.CVE_Items[200]" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 6, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "dict_keys(['cve', 'configurations', 'impact', 'publishedDate', 'lastModifiedDate'])" 153 | ] 154 | }, 155 | "execution_count": 6, 156 | "metadata": {}, 157 | "output_type": "execute_result" 158 | } 159 | ], 160 | "source": [ 161 | "dataset2018.CVE_Items[200].keys()" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 15, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "rev = []\n", 171 | "novals = []\n", 172 | "j=0\n", 173 | "for i in range(dataset2018.shape[0]):\n", 174 | " new=dataset2018.CVE_Items[i]\n", 175 | " if('baseMetricV2' in new['impact'].keys()):\n", 176 | " var=new['impact']['baseMetricV2']['severity']\n", 177 | " rev.append(var)\n", 178 | " else:\n", 179 | " rev.append('Not Exists')\n", 180 | " novals.append(j)\n", 181 | " j=j+1 " 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 10, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "rev=np.array(rev)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 11, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "data": { 200 | "text/plain": [ 201 | "(13949,)" 202 | ] 203 | }, 204 | "execution_count": 11, 205 | "metadata": {}, 206 | "output_type": "execute_result" 207 | } 208 | ], 209 | "source": [ 210 | "rev.shape" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 12, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "num = 0\n", 220 | "val = []\n", 221 | "for i in range(rev.shape[0]):\n", 222 | " if (rev[i] == 'Not Exists'):\n", 223 | " num = num + 1\n", 224 | " val.append(i)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 13, 230 | "metadata": {}, 
231 | "outputs": [ 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "1817\n" 237 | ] 238 | } 239 | ], 240 | "source": [ 241 | "print(num)" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "### so out of 13949 rows , 1817 doesn't have severity value in there " 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "### an example of row that doesn't has severity in there " 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 17, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "novals=np.array(novals)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 18, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "(1817,)" 276 | ] 277 | }, 278 | "execution_count": 18, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "novals.shape" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 19, 290 | "metadata": {}, 291 | "outputs": [ 292 | { 293 | "data": { 294 | "text/plain": [ 295 | "330" 296 | ] 297 | }, 298 | "execution_count": 19, 299 | "metadata": {}, 300 | "output_type": "execute_result" 301 | } 302 | ], 303 | "source": [ 304 | "novals[0]" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 23, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "data": { 314 | "text/plain": [ 315 | "{'configurations': {'CVE_data_version': '4.0', 'nodes': []},\n", 316 | " 'cve': {'CVE_data_meta': {'ASSIGNER': 'cve@mitre.org', 'ID': 'CVE-2017-0357'},\n", 317 | " 'affects': {'vendor': {'vendor_data': []}},\n", 318 | " 'data_format': 'MITRE',\n", 319 | " 'data_type': 'CVE',\n", 320 | " 'data_version': '4.0',\n", 321 | " 'description': {'description_data': [{'lang': 'en',\n", 322 | " 'value': 'A heap-overflow flaw exists in the -tr loader of iucode-tool starting with v1.4 and before v2.1.1, potentially leading to SIGSEGV, or heap corruption.'}]},\n", 323 | " 'problemtype': {'problemtype_data': [{'description': []}]},\n", 324 | " 'references': {'reference_data': [{'url': 'http://www.securityfocus.com/bid/95432'},\n", 325 | " {'url': 'https://gitlab.com/iucode-tool/iucode-tool/issues/3'},\n", 326 | " {'url': 'https://security-tracker.debian.org/tracker/CVE-2017-0357'}]}},\n", 327 | " 'impact': {},\n", 328 | " 'lastModifiedDate': '2018-04-16T09:58Z',\n", 329 | " 'publishedDate': '2018-04-13T15:29Z'}" 330 | ] 331 | }, 332 | "execution_count": 23, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "dataset2018.CVE_Items[331]" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "### above example has id CVE-2017-0356\n", 346 | "### it has the link https://nvd.nist.gov/vuln/detail/CVE-2017-0357" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "# one can see that these are newly identified vulnerbalities that have not been put in the datafeeds " 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [] 362 | } 363 | ], 364 | "metadata": { 365 | "kernelspec": { 366 | "display_name": "Python 3", 367 | "language": "python", 368 | "name": "python3" 369 | }, 370 | "language_info": { 371 | "codemirror_mode": { 372 | "name": "ipython", 373 | "version": 
3 374 | }, 375 | "file_extension": ".py", 376 | "mimetype": "text/x-python", 377 | "name": "python", 378 | "nbconvert_exporter": "python", 379 | "pygments_lexer": "ipython3", 380 | "version": "3.6.4" 381 | } 382 | }, 383 | "nbformat": 4, 384 | "nbformat_minor": 2 385 | } 386 | -------------------------------------------------------------------------------- /CERT Impact Analysis .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import warnings\n", 12 | "warnings.filterwarnings('ignore')" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "dataset = pd.read_excel('cert.xlsx')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 3, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "(150, 5)" 33 | ] 34 | }, 35 | "execution_count": 3, 36 | "metadata": {}, 37 | "output_type": "execute_result" 38 | } 39 | ], 40 | "source": [ 41 | "dataset.shape" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 4, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "dataset=np.array(dataset)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 5, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "description = []\n", 60 | "severity = []\n", 61 | "scores = []\n", 62 | "impact = []\n", 63 | "\n", 64 | "\n", 65 | "for i in range(dataset.shape[0]):\n", 66 | " severity.append(dataset[i][4])\n", 67 | " scores.append(dataset[i][3])\n", 68 | " description.append(dataset[i][2])\n", 69 | " impact.append(dataset[i][1])" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 6, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "from bs4 import BeautifulSoup \n", 79 | "import re\n", 80 | "from nltk.corpus import stopwords\n", 81 | "def review_to_words( raw_review ):\n", 82 | " # Function to convert a raw review to a string of words\n", 83 | " # The input is a single string (a raw movie review), and \n", 84 | " # the output is a single string (a preprocessed movie review)\n", 85 | " #\n", 86 | " # 1. Remove HTML\n", 87 | " review_text = BeautifulSoup(raw_review).get_text() \n", 88 | " #\n", 89 | " # 2. Remove non-letters \n", 90 | " letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n", 91 | " #\n", 92 | " # 3. Convert to lower case, split into individual words\n", 93 | " words = letters_only.lower().split() \n", 94 | " #\n", 95 | " # 4. In Python, searching a set is much faster than searching\n", 96 | " # a list, so convert the stop words to a set\n", 97 | " stops = set(stopwords.words(\"english\")) \n", 98 | " # \n", 99 | " # 5. Remove stop words\n", 100 | " meaningful_words = [w for w in words if not w in stops] \n", 101 | " #\n", 102 | " # 6. 
Join the words back into one string separated by space, \n", 103 | " # and return the result.\n", 104 | " return( \" \".join( meaningful_words )) " 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 7, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "# Get the number of reviews based on the dataframe column size\n", 114 | "\n", 115 | "# Initialize an empty list to hold the clean reviews\n", 116 | "clean_description = []\n", 117 | "clean_impact = []\n", 118 | "clean_severity = []\n", 119 | "# Loop over each review; create an index i that goes from 0 to the length\n", 120 | "# of the movie review list \n", 121 | "for i in range(150):\n", 122 | " # Call our function for each one, and add the result to the list of\n", 123 | " # clean reviews\n", 124 | " clean_description.append( review_to_words( description[i] ) )\n", 125 | " clean_impact.append( review_to_words( impact[i] ) )\n", 126 | " clean_severity.append( review_to_words( severity[i] ) )" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 8, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "clean_severity=np.array(clean_severity)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 9, 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "name": "stdout", 145 | "output_type": "stream", 146 | "text": [ 147 | "Creating the bag of words...\n", 148 | "\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "clean_description_array = np.array(clean_description)\n", 154 | "print (\"Creating the bag of words...\\n\")\n", 155 | "from sklearn.feature_extraction.text import CountVectorizer\n", 156 | "\n", 157 | "# Initialize the \"CountVectorizer\" object, which is scikit-learn's\n", 158 | "# bag of words tool. \n", 159 | "vectorizer = CountVectorizer(analyzer = \"word\", \\\n", 160 | " tokenizer = None, \\\n", 161 | " preprocessor = None, \\\n", 162 | " stop_words = None, \\\n", 163 | " max_features = 500) \n", 164 | "\n", 165 | "# fit_transform() does two functions: First, it fits the model\n", 166 | "# and learns the vocabulary; second, it transforms our training data\n", 167 | "# into feature vectors. The input to fit_transform should be a list of \n", 168 | "# strings.\n", 169 | "description_features = vectorizer.fit_transform(clean_description_array)\n", 170 | "\n", 171 | "# Numpy arrays are easy to work with, so convert the result to an \n", 172 | "# array\n", 173 | "description_features = description_features.toarray()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 10, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stderr", 183 | "output_type": "stream", 184 | "text": [ 185 | "/usr/local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. 
This module will be removed in 0.20.\n", 186 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 187 | ] 188 | }, 189 | { 190 | "name": "stdout", 191 | "output_type": "stream", 192 | "text": [ 193 | "0.6129032258064516\n", 194 | "0.5806451612903226\n", 195 | "0.5483870967741935\n", 196 | "0.5517241379310345\n", 197 | "0.42857142857142855\n", 198 | "avg 0.5444462100746861\n" 199 | ] 200 | } 201 | ], 202 | "source": [ 203 | "from sklearn.cross_validation import StratifiedKFold\n", 204 | "from sklearn.ensemble import RandomForestClassifier\n", 205 | "from sklearn import metrics\n", 206 | "n_folds = 5\n", 207 | "score = 0.0\n", 208 | "skf = StratifiedKFold(severity, n_folds)\n", 209 | "avg_score = 0\n", 210 | "\n", 211 | "for train_index, test_index in skf:\n", 212 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 213 | " y_train, y_test = clean_severity[train_index], clean_severity[test_index]\n", 214 | " forest = RandomForestClassifier(n_estimators = 50)\n", 215 | " forest.fit( X_train, y_train )\n", 216 | " score = forest.score(X_test,y_test)\n", 217 | " avg_score += score \n", 218 | " print(score)\n", 219 | " \n", 220 | "print(\"avg\",avg_score/n_folds)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 11, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "clean_description = np.array(clean_description)\n", 230 | "clean_impact = np.array(clean_impact)\n", 231 | "imp_desc = np.array(clean_description)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 12, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "for i in range(150):\n", 241 | " imp_desc[i] = clean_description[i]+clean_impact[i]" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 15, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/plain": [ 252 | "'unprotected transport credentials issue discovered abb ellipse ellipse released prior december including ellipse select vulnerability exists authentication ellipse ldap ad using ldap protocol attacker could exploit vulnerability sniffing local network traffic allowing discovery authentication credentials'" 253 | ] 254 | }, 255 | "execution_count": 15, 256 | "metadata": {}, 257 | "output_type": "execute_result" 258 | } 259 | ], 260 | "source": [ 261 | "clean_description[4]" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 70, 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "name": "stdout", 271 | "output_type": "stream", 272 | "text": [ 273 | "Creating the bag of words...\n", 274 | "\n" 275 | ] 276 | } 277 | ], 278 | "source": [ 279 | "print (\"Creating the bag of words...\\n\")\n", 280 | "from sklearn.feature_extraction.text import CountVectorizer\n", 281 | "\n", 282 | "# Initialize the \"CountVectorizer\" object, which is scikit-learn's\n", 283 | "# bag of words tool. \n", 284 | "vectorizer = CountVectorizer(analyzer = \"word\", \\\n", 285 | " tokenizer = None, \\\n", 286 | " preprocessor = None, \\\n", 287 | " stop_words = None, \\\n", 288 | " max_features = 500) \n", 289 | "\n", 290 | "# fit_transform() does two functions: First, it fits the model\n", 291 | "# and learns the vocabulary; second, it transforms our training data\n", 292 | "# into feature vectors. 
The input to fit_transform should be a list of \n", 293 | "# strings.\n", 294 | "description_features = vectorizer.fit_transform(imp_desc)\n", 295 | "\n", 296 | "# Numpy arrays are easy to work with, so convert the result to an \n", 297 | "# array\n", 298 | "description_features = description_features.toarray()" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 71, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "0.6129032258064516\n", 311 | "0.6129032258064516\n", 312 | "0.5161290322580645\n", 313 | "0.41379310344827586\n", 314 | "0.5714285714285714\n", 315 | "avg 0.545431431749563\n" 316 | ] 317 | } 318 | ], 319 | "source": [ 320 | "from sklearn.cross_validation import StratifiedKFold\n", 321 | "from sklearn.ensemble import RandomForestClassifier\n", 322 | "from sklearn import metrics\n", 323 | "n_folds = 5\n", 324 | "score = 0.0\n", 325 | "skf = StratifiedKFold(severity, n_folds)\n", 326 | "avg_score = 0\n", 327 | "\n", 328 | "for train_index, test_index in skf:\n", 329 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 330 | " y_train, y_test = clean_severity[train_index], clean_severity[test_index]\n", 331 | " forest = RandomForestClassifier(n_estimators = 50)\n", 332 | " forest.fit( X_train, y_train )\n", 333 | " score = forest.score(X_test,y_test)\n", 334 | " avg_score += score \n", 335 | " print(score)\n", 336 | " \n", 337 | "print(\"avg\",avg_score/n_folds)" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [] 346 | } 347 | ], 348 | "metadata": { 349 | "kernelspec": { 350 | "display_name": "Python 3", 351 | "language": "python", 352 | "name": "python3" 353 | }, 354 | "language_info": { 355 | "codemirror_mode": { 356 | "name": "ipython", 357 | "version": 3 358 | }, 359 | "file_extension": ".py", 360 | "mimetype": "text/x-python", 361 | "name": "python", 362 | "nbconvert_exporter": "python", 363 | "pygments_lexer": "ipython3", 364 | "version": "3.6.4" 365 | } 366 | }, 367 | "nbformat": 4, 368 | "nbformat_minor": 2 369 | } 370 | -------------------------------------------------------------------------------- /Bi Grams.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 10, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import warnings\n", 12 | "warnings.filterwarnings('ignore')" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 11, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "dataset = pd.read_json('cve-2016.json')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 16, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "array(['CVE_Items', 'CVE_data_format', 'CVE_data_numberOfCVEs',\n", 33 | " 'CVE_data_timestamp', 'CVE_data_type', 'CVE_data_version'],\n", 34 | " dtype=object)" 35 | ] 36 | }, 37 | "execution_count": 16, 38 | "metadata": {}, 39 | "output_type": "execute_result" 40 | } 41 | ], 42 | "source": [ 43 | "dataset.columns.values" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 24, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "rev = []\n", 53 | "j=0\n", 54 | "for i in range(dataset.shape[0]):\n", 55 | " 
new=dataset.CVE_Items[i]\n", 56 | " if('baseMetricV2' in new['impact'].keys()):\n", 57 | " var=new['impact']['baseMetricV2']['severity']\n", 58 | " rev.append(var)\n", 59 | " else:\n", 60 | " rev.append('Not Exists')\n", 61 | " j=j+1 " 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 25, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "data": { 71 | "text/plain": [ 72 | "(9417,)" 73 | ] 74 | }, 75 | "execution_count": 25, 76 | "metadata": {}, 77 | "output_type": "execute_result" 78 | } 79 | ], 80 | "source": [ 81 | "rev=np.array(rev)\n", 82 | "rev.shape" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 26, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "description = []\n", 92 | "severity = []\n", 93 | "scores = []\n", 94 | "\n", 95 | "for i in range(dataset.shape[0]):\n", 96 | " new=dataset.CVE_Items[i]\n", 97 | " if('baseMetricV2' in new['impact'].keys()):\n", 98 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 99 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 100 | " description.append(new['cve']['description']['description_data'][0]['value'])" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 27, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "description = np.array(description)\n", 110 | "severity = np.array(severity)\n", 111 | "scores = np.array(scores)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 28, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "from bs4 import BeautifulSoup \n", 121 | "import re\n", 122 | "from nltk.corpus import stopwords\n", 123 | "def review_to_words( raw_review ):\n", 124 | " # Function to convert a raw review to a string of words\n", 125 | " # The input is a single string (a raw movie review), and \n", 126 | " # the output is a single string (a preprocessed movie review)\n", 127 | " #\n", 128 | " # 1. Remove HTML\n", 129 | " review_text = BeautifulSoup(raw_review).get_text() \n", 130 | " #\n", 131 | " # 2. Remove non-letters \n", 132 | " letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n", 133 | " #\n", 134 | " # 3. Convert to lower case, split into individual words\n", 135 | " words = letters_only.lower().split() \n", 136 | " #\n", 137 | " # 4. In Python, searching a set is much faster than searching\n", 138 | " # a list, so convert the stop words to a set\n", 139 | " stops = set(stopwords.words(\"english\")) \n", 140 | " # \n", 141 | " # 5. Remove stop words\n", 142 | " meaningful_words = [w for w in words if not w in stops] \n", 143 | " #\n", 144 | " # 6. 
Join the words back into one string separated by space, \n", 145 | " # and return the result.\n", 146 | " return( \" \".join( meaningful_words )) " 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 29, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# Get the number of reviews based on the dataframe column size\n", 156 | "\n", 157 | "# Initialize an empty list to hold the clean reviews\n", 158 | "clean_description = []\n", 159 | "\n", 160 | "# Loop over each review; create an index i that goes from 0 to the length\n", 161 | "# of the movie review list \n", 162 | "for i in range(description.shape[0]):\n", 163 | " # Call our function for each one, and add the result to the list of\n", 164 | " # clean reviews\n", 165 | " clean_description.append( review_to_words( description[i] ) )" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 35, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "clean_description_array = np.array(clean_description)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 37, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "from sklearn.feature_extraction.text import CountVectorizer\n", 184 | "\n", 185 | "vectorizer = CountVectorizer(analyzer = \"word\", \\\n", 186 | " tokenizer = None, \\\n", 187 | " preprocessor = None, \\\n", 188 | " stop_words = None, \\\n", 189 | " max_features = 5000) \n", 190 | "\n", 191 | "bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\\b\\w+\\b', min_df=1) \n", 192 | "train_data_features1 = vectorizer.fit_transform(clean_description_array)\n", 193 | "\n", 194 | "# Numpy arrays are easy to work with, so convert the result to an \n", 195 | "# array\n", 196 | "train_data_features1 = train_data_features1.toarray()" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 38, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "vocab_bi = vectorizer.get_feature_names()" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 39, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "#print (vocab_bi)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 40, 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "data": { 224 | "text/plain": [ 225 | "(8205, 5000)" 226 | ] 227 | }, 228 | "execution_count": 40, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "train_data_features1.shape" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 41, 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "<8205x5000 sparse matrix of type ''\n", 246 | "\twith 179826 stored elements in Compressed Sparse Row format>" 247 | ] 248 | }, 249 | "execution_count": 41, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "from sklearn.feature_extraction.text import TfidfTransformer\n", 256 | "tfidf_transformer = TfidfTransformer(smooth_idf=False)\n", 257 | "train_data_features1 = tfidf_transformer.fit_transform(train_data_features1)\n", 258 | "train_data_features1" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 42, 264 | "metadata": {}, 265 | "outputs": [ 266 | { 267 | "data": { 268 | "text/plain": [ 269 | "(8205,)" 270 | ] 271 | }, 272 | "execution_count": 42, 273 | "metadata": {}, 274 | "output_type": "execute_result" 275 | } 276 | 
], 277 | "source": [ 278 | "score_result = np.array(scores).astype(np.float)\n", 279 | "score_result.shape" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 43, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "name": "stdout", 289 | "output_type": "stream", 290 | "text": [ 291 | "MAE: 0.9860\n", 292 | "MAE: 0.9221\n", 293 | "MAE: 0.9988\n", 294 | "MAE: 0.9547\n", 295 | "MAE: 1.0171\n", 296 | "avg 0.9757471145780707\n" 297 | ] 298 | } 299 | ], 300 | "source": [ 301 | "from sklearn import ensemble\n", 302 | "from sklearn.metrics import mean_squared_error,mean_absolute_error\n", 303 | "from sklearn.cross_validation import StratifiedKFold\n", 304 | "\n", 305 | "n_folds = 5\n", 306 | "avg_mad = 0\n", 307 | "skf = StratifiedKFold(score_result, n_folds)\n", 308 | "\n", 309 | "\n", 310 | "\n", 311 | "for train_index, test_index in skf:\n", 312 | " X_train, X_test = train_data_features1[train_index], train_data_features1[test_index]\n", 313 | " y_train, y_test = score_result[train_index], score_result[test_index]\n", 314 | " params = {'n_estimators': 500, 'max_depth': 8, 'min_samples_split': 2,\n", 315 | " 'learning_rate': 0.01, 'loss': 'ls'}\n", 316 | " clf = ensemble.GradientBoostingRegressor(**params)\n", 317 | " clf.fit(X_train, y_train)\n", 318 | " mse = mean_squared_error(y_test, clf.predict(X_test))\n", 319 | " mad = mean_absolute_error(y_test, clf.predict(X_test))\n", 320 | " #print(\"MSE: %.4f\" % mse)\n", 321 | " avg_mad += mad\n", 322 | " print(\"MAE: %.4f\" % mad)\n", 323 | "print(\"avg\",avg_mad/n_folds) " 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 19, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "ename": "NameError", 333 | "evalue": "name 'description_features' is not defined", 334 | "output_type": "error", 335 | "traceback": [ 336 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 337 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 338 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtemp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdescription_features\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m30\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m31\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m500\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 339 | "\u001b[0;31mNameError\u001b[0m: name 'description_features' is not defined" 340 | ] 341 | } 342 | ], 343 | "source": [ 344 | "temp = description_features[30:31,0:500]" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 20, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "ename": "NameError", 354 | "evalue": "name 'clf' is not defined", 355 | "output_type": "error", 356 | "traceback": [ 357 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 358 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 359 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mclf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtemp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 360 | "\u001b[0;31mNameError\u001b[0m: name 'clf' is not defined" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "clf.predict(temp)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 
21, 371 | "metadata": {}, 372 | "outputs": [ 373 | { 374 | "data": { 375 | "text/plain": [ 376 | "5.0" 377 | ] 378 | }, 379 | "execution_count": 21, 380 | "metadata": {}, 381 | "output_type": "execute_result" 382 | } 383 | ], 384 | "source": [ 385 | "scores[30]" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [] 401 | } 402 | ], 403 | "metadata": { 404 | "kernelspec": { 405 | "display_name": "Python 3", 406 | "language": "python", 407 | "name": "python3" 408 | }, 409 | "language_info": { 410 | "codemirror_mode": { 411 | "name": "ipython", 412 | "version": 3 413 | }, 414 | "file_extension": ".py", 415 | "mimetype": "text/x-python", 416 | "name": "python", 417 | "nbconvert_exporter": "python", 418 | "pygments_lexer": "ipython3", 419 | "version": "3.6.4" 420 | } 421 | }, 422 | "nbformat": 4, 423 | "nbformat_minor": 2 424 | } 425 | -------------------------------------------------------------------------------- /Nessus Plugins Prediction .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Importing the Libraries" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "#import nltk\n", 17 | "#from nltk.corpus import stopwords\n", 18 | "#set(stopwords.words('english'))" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 3, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import numpy as np\n", 28 | "import pandas as pd\n", 29 | "import warnings\n", 30 | "warnings.filterwarnings('ignore')" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# Reading the Dataset " 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 4, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "dataset = pd.read_csv('train-set.csv', header=0, \\\n", 47 | " delimiter=\"\\t\")" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "### Checking Dataset Shape" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 5, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "data": { 64 | "text/plain": [ 65 | "(64, 1)" 66 | ] 67 | }, 68 | "execution_count": 5, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "dataset.shape" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### Checking the Column Names" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 6, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "array(['id;description;severity;type;family;risk factor;score'],\n", 93 | " dtype=object)" 94 | ] 95 | }, 96 | "execution_count": 6, 97 | "metadata": {}, 98 | "output_type": "execute_result" 99 | } 100 | ], 101 | "source": [ 102 | "dataset.columns.values" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "### Converting the Dataset into an Array" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 7, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "dataset_array = np.array(dataset)" 
119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### Accessing the Third Row " 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 8, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "'3;The version of MySQL running on the remote host is 5.6.x prior to 5.6.39. It is, therefore, affected by multiple vulnerabilities as noted in the January 2018 Critical Patch Update advisory. Please consult the CVRF details for the applicable CVEs for additional information.;medium ;local ;databases;medium ;4.3'" 137 | ] 138 | }, 139 | "execution_count": 8, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "dataset_array[2][0]" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "# Extracting the description from the Dataset " 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 9, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "description = []\n", 162 | "for i in range(dataset.shape[0]):\n", 163 | " temporary_variable = dataset_array[i][0].split(';')\n", 164 | " temporary_variable=np.array(temporary_variable)\n", 165 | " description.append(temporary_variable[1]) " 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 10, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "description = np.array(description)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 11, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/plain": [ 185 | "(64,)" 186 | ] 187 | }, 188 | "execution_count": 11, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "description.shape" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### Removing unecessary Details i.e. Stop Words, Non Letters, HTML" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 12, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "from bs4 import BeautifulSoup \n", 211 | "import re\n", 212 | "from nltk.corpus import stopwords\n", 213 | "def review_to_words( raw_review ):\n", 214 | " # Function to convert a raw review to a string of words\n", 215 | " # The input is a single string (a raw movie review), and \n", 216 | " # the output is a single string (a preprocessed movie review)\n", 217 | " #\n", 218 | " # 1. Remove HTML\n", 219 | " review_text = BeautifulSoup(raw_review).get_text() \n", 220 | " #\n", 221 | " # 2. Remove non-letters \n", 222 | " letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n", 223 | " #\n", 224 | " # 3. Convert to lower case, split into individual words\n", 225 | " words = letters_only.lower().split() \n", 226 | " #\n", 227 | " # 4. In Python, searching a set is much faster than searching\n", 228 | " # a list, so convert the stop words to a set\n", 229 | " stops = set(stopwords.words(\"english\")) \n", 230 | " # \n", 231 | " # 5. Remove stop words\n", 232 | " meaningful_words = [w for w in words if not w in stops] \n", 233 | " #\n", 234 | " # 6. 
Join the words back into one string separated by space, \n", 235 | " # and return the result.\n", 236 | " return( \" \".join( meaningful_words )) " 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 13, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "# Get the number of reviews based on the dataframe column size\n", 246 | "\n", 247 | "# Initialize an empty list to hold the clean reviews\n", 248 | "clean_description = []\n", 249 | "\n", 250 | "# Loop over each review; create an index i that goes from 0 to the length\n", 251 | "# of the movie review list \n", 252 | "for i in range(dataset.shape[0]):\n", 253 | " # Call our function for each one, and add the result to the list of\n", 254 | " # clean reviews\n", 255 | " clean_description.append( review_to_words( description[i] ) )" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 14, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "clean_description_array = np.array(clean_description)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 15, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "(64,)" 276 | ] 277 | }, 278 | "execution_count": 15, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "clean_description_array.shape" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 16, 290 | "metadata": {}, 291 | "outputs": [ 292 | { 293 | "data": { 294 | "text/plain": [ 295 | "'plugin runs hydra find http proxy accounts passwords brute force use plugin enter logins file passwords file hydra nasl wrappers options advanced settings block'" 296 | ] 297 | }, 298 | "execution_count": 16, 299 | "metadata": {}, 300 | "output_type": "execute_result" 301 | } 302 | ], 303 | "source": [ 304 | "clean_description_array[6]" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "# Creating the Bag of Words " 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 17, 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "name": "stdout", 321 | "output_type": "stream", 322 | "text": [ 323 | "Creating the bag of words...\n", 324 | "\n" 325 | ] 326 | } 327 | ], 328 | "source": [ 329 | "print (\"Creating the bag of words...\\n\")\n", 330 | "from sklearn.feature_extraction.text import CountVectorizer\n", 331 | "\n", 332 | "# Initialize the \"CountVectorizer\" object, which is scikit-learn's\n", 333 | "# bag of words tool. \n", 334 | "vectorizer = CountVectorizer(analyzer = \"word\", \\\n", 335 | " tokenizer = None, \\\n", 336 | " preprocessor = None, \\\n", 337 | " stop_words = None, \\\n", 338 | " max_features = 5000) \n", 339 | "\n", 340 | "# fit_transform() does two functions: First, it fits the model\n", 341 | "# and learns the vocabulary; second, it transforms our training data\n", 342 | "# into feature vectors. 
The input to fit_transform should be a list of \n", 343 | "# strings.\n", 344 | "description_features = vectorizer.fit_transform(clean_description_array)\n", 345 | "\n", 346 | "# Numpy arrays are easy to work with, so convert the result to an \n", 347 | "# array\n", 348 | "description_features = description_features.toarray()" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 18, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "data": { 358 | "text/plain": [ 359 | "(64, 618)" 360 | ] 361 | }, 362 | "execution_count": 18, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "description_features.shape" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "## Creating the Result array as per SEVERITY" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 19, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "severity = []\n", 385 | "for i in range(dataset.shape[0]):\n", 386 | " temporary_variable = dataset_array[i][0].split(';')\n", 387 | " temporary_variable=np.array(temporary_variable)\n", 388 | " severity.append(temporary_variable[2]) " 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 20, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "severity = np.array(severity)" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 21, 403 | "metadata": {}, 404 | "outputs": [ 405 | { 406 | "data": { 407 | "text/plain": [ 408 | "(64,)" 409 | ] 410 | }, 411 | "execution_count": 21, 412 | "metadata": {}, 413 | "output_type": "execute_result" 414 | } 415 | ], 416 | "source": [ 417 | "severity.shape" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 22, 423 | "metadata": {}, 424 | "outputs": [ 425 | { 426 | "data": { 427 | "text/plain": [ 428 | "'high'" 429 | ] 430 | }, 431 | "execution_count": 22, 432 | "metadata": {}, 433 | "output_type": "execute_result" 434 | } 435 | ], 436 | "source": [ 437 | "severity[6]" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "### Testing with Random Forest (3 Class Classification as per Severity)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 23, 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "name": "stderr", 454 | "output_type": "stream", 455 | "text": [ 456 | "/usr/local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. 
This module will be removed in 0.20.\n", 457 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 458 | ] 459 | }, 460 | { 461 | "name": "stdout", 462 | "output_type": "stream", 463 | "text": [ 464 | "0.25\n", 465 | "0.38461538461538464\n", 466 | "0.6666666666666666\n", 467 | "0.3333333333333333\n", 468 | "0.2727272727272727\n", 469 | "avg 0.38146853146853144\n" 470 | ] 471 | } 472 | ], 473 | "source": [ 474 | "from sklearn.cross_validation import StratifiedKFold\n", 475 | "from sklearn.ensemble import RandomForestClassifier\n", 476 | "from sklearn import metrics\n", 477 | "n_folds = 5\n", 478 | "score = 0.0\n", 479 | "skf = StratifiedKFold(severity, n_folds)\n", 480 | "avg_score = 0\n", 481 | "\n", 482 | "for train_index, test_index in skf:\n", 483 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 484 | " y_train, y_test = severity[train_index], severity[test_index]\n", 485 | " forest = RandomForestClassifier(n_estimators = 50)\n", 486 | " forest.fit( X_train, y_train )\n", 487 | " score = forest.score(X_test,y_test)\n", 488 | " avg_score += score \n", 489 | " print(score)\n", 490 | " \n", 491 | "print(\"avg\",avg_score/n_folds)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "# Creating the Result array as per SCORE" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": 24, 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "score_result = []\n", 508 | "for i in range(dataset.shape[0]):\n", 509 | " temporary_variable = dataset_array[i][0].split(';')\n", 510 | " temporary_variable=np.array(temporary_variable)\n", 511 | " score_result.append(temporary_variable[6]) " 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": 25, 517 | "metadata": {}, 518 | "outputs": [], 519 | "source": [ 520 | "score_result[12]='7.5'\n", 521 | "score_result[54]='7.5'\n", 522 | "score_result = np.array(score_result).astype(np.float)" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 26, 528 | "metadata": {}, 529 | "outputs": [ 530 | { 531 | "data": { 532 | "text/plain": [ 533 | "(64,)" 534 | ] 535 | }, 536 | "execution_count": 26, 537 | "metadata": {}, 538 | "output_type": "execute_result" 539 | } 540 | ], 541 | "source": [ 542 | "score_result.shape" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 27, 548 | "metadata": {}, 549 | "outputs": [ 550 | { 551 | "name": "stdout", 552 | "output_type": "stream", 553 | "text": [ 554 | "7.5\n", 555 | "3.5\n", 556 | "4.3\n", 557 | "10.0\n", 558 | "7.5\n", 559 | "7.5\n", 560 | "7.5\n", 561 | "10.0\n", 562 | "10.0\n", 563 | "6.5\n", 564 | "3.3\n", 565 | "4.0\n", 566 | "7.5\n", 567 | "7.8\n", 568 | "10.0\n", 569 | "10.0\n", 570 | "10.0\n", 571 | "10.0\n", 572 | "10.0\n", 573 | "9.3\n", 574 | "9.4\n", 575 | "7.5\n", 576 | "9.7\n", 577 | "9.3\n", 578 | "6.4\n", 579 | "5.6\n", 580 | "5.0\n", 581 | "5.0\n", 582 | "10.0\n", 583 | "10.0\n", 584 | "10.0\n", 585 | "10.0\n", 586 | "10.0\n", 587 | "10.0\n", 588 | "10.0\n", 589 | "10.0\n", 590 | "10.0\n", 591 | "7.5\n", 592 | "7.6\n", 593 | "7.5\n", 594 | "7.5\n", 595 | "7.5\n", 596 | "9.3\n", 597 | "9.0\n", 598 | "7.5\n", 599 | "7.5\n", 600 | "7.5\n", 601 | "9.0\n", 602 | "4.3\n", 603 | "6.8\n", 604 | "4.9\n", 605 | "4.1\n", 606 | "5.0\n", 607 | "4.6\n", 608 | "7.5\n", 609 | "5.8\n", 610 | "4.8\n", 611 | "2.1\n", 612 | "3.7\n", 613 | "3.5\n", 614 | "3.5\n", 615 | "2.1\n", 616 | "2.6\n", 617 | "2.1\n" 618 | ] 
619 | } 620 | ], 621 | "source": [ 622 | "for i in range(dataset.shape[0]):\n", 623 | " print(score_result[i])" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "# Testing with Gradient Boosting Regression " 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 28, 636 | "metadata": {}, 637 | "outputs": [ 638 | { 639 | "name": "stdout", 640 | "output_type": "stream", 641 | "text": [ 642 | "MSE: 8.1312\n", 643 | "MSE: 2.9190\n", 644 | "MSE: 2.8643\n", 645 | "MSE: 9.3840\n", 646 | "MSE: 3.1921\n" 647 | ] 648 | } 649 | ], 650 | "source": [ 651 | "from sklearn import ensemble\n", 652 | "from sklearn.metrics import mean_squared_error\n", 653 | "\n", 654 | "n_folds = 5\n", 655 | "skf = StratifiedKFold(score_result, n_folds)\n", 656 | "\n", 657 | "\n", 658 | "\n", 659 | "for train_index, test_index in skf:\n", 660 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 661 | " y_train, y_test = score_result[train_index], score_result[test_index]\n", 662 | " params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,\n", 663 | " 'learning_rate': 0.01, 'loss': 'ls'}\n", 664 | " clf = ensemble.GradientBoostingRegressor(**params)\n", 665 | " clf.fit(X_train, y_train)\n", 666 | " mse = mean_squared_error(y_test, clf.predict(X_test))\n", 667 | " print(\"MSE: %.4f\" % mse)\n", 668 | " \n" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": {}, 675 | "outputs": [], 676 | "source": [] 677 | } 678 | ], 679 | "metadata": { 680 | "kernelspec": { 681 | "display_name": "Python 3", 682 | "language": "python", 683 | "name": "python3" 684 | }, 685 | "language_info": { 686 | "codemirror_mode": { 687 | "name": "ipython", 688 | "version": 3 689 | }, 690 | "file_extension": ".py", 691 | "mimetype": "text/x-python", 692 | "name": "python", 693 | "nbconvert_exporter": "python", 694 | "pygments_lexer": "ipython3", 695 | "version": "3.6.4" 696 | } 697 | }, 698 | "nbformat": 4, 699 | "nbformat_minor": 2 700 | } 701 | -------------------------------------------------------------------------------- /Word2Vectors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 13, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import warnings\n", 12 | "import nltk.data\n", 13 | "#nltk.download() \n", 14 | "warnings.filterwarnings('ignore')" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 14, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "dataset = pd.read_json('cve-2016.json')" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 45, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/plain": [ 34 | "(9417, 6)" 35 | ] 36 | }, 37 | "execution_count": 45, 38 | "metadata": {}, 39 | "output_type": "execute_result" 40 | } 41 | ], 42 | "source": [ 43 | "dataset.shape" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 15, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "data": { 53 | "text/plain": [ 54 | "array(['CVE_Items', 'CVE_data_format', 'CVE_data_numberOfCVEs',\n", 55 | " 'CVE_data_timestamp', 'CVE_data_type', 'CVE_data_version'],\n", 56 | " dtype=object)" 57 | ] 58 | }, 59 | "execution_count": 15, 60 | "metadata": {}, 61 | "output_type": "execute_result" 62 | } 63 | ], 64 | "source": 
[ 65 | "dataset.columns.values" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 16, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "description = []\n", 75 | "severity = []\n", 76 | "scores = []\n", 77 | "\n", 78 | "for i in range(dataset.shape[0]):\n", 79 | " new=dataset.CVE_Items[i]\n", 80 | " if('baseMetricV2' in new['impact'].keys()):\n", 81 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 82 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 83 | " description.append(new['cve']['description']['description_data'][0]['value'])" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 17, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "description = np.array(description)\n", 93 | "severity = np.array(severity)\n", 94 | "scores = np.array(scores)" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 18, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "from bs4 import BeautifulSoup \n", 104 | "import re\n", 105 | "from nltk.corpus import stopwords\n", 106 | "def review_to_wordlist( raw_review, remove_stopwords=False ):\n", 107 | " # Function to convert a raw review to a string of words\n", 108 | " # The input is a single string (a raw movie review), and \n", 109 | " # the output is a single string (a preprocessed movie review)\n", 110 | " #\n", 111 | " # 1. Remove HTML\n", 112 | " review_text = BeautifulSoup(raw_review).get_text() \n", 113 | " #\n", 114 | " # 2. Remove non-letters \n", 115 | " letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n", 116 | " #\n", 117 | " # 3. Convert to lower case, split into individual words\n", 118 | " words = letters_only.lower().split() \n", 119 | " #\n", 120 | " # 4. In Python, searching a set is much faster than searching\n", 121 | " # a list, so convert the stop words to a set\n", 122 | " stops = set(stopwords.words(\"english\")) \n", 123 | " # \n", 124 | " # 5. Remove stop words\n", 125 | " words = [w for w in words if not w in stops] \n", 126 | " #\n", 127 | " # 6. Join the words back into one string separated by space, \n", 128 | " # and return the result.\n", 129 | " return(words ) " 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 19, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "\n", 139 | "# Load the punkt tokenizer\n", 140 | "tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')\n", 141 | "\n", 142 | "# Define a function to split a review into parsed sentences\n", 143 | "def review_to_sentences( review, tokenizer, remove_stopwords=False ):\n", 144 | " # Function to split a review into parsed sentences. Returns a \n", 145 | " # list of sentences, where each sentence is a list of words\n", 146 | " #\n", 147 | " # 1. Use the NLTK tokenizer to split the paragraph into sentences\n", 148 | " raw_sentences = tokenizer.tokenize(review.strip())\n", 149 | " #\n", 150 | " # 2. 
Loop over each sentence\n", 151 | " sentences = []\n", 152 | " for raw_sentence in raw_sentences:\n", 153 | " # If a sentence is empty, skip it\n", 154 | " if len(raw_sentence) > 0:\n", 155 | " # Otherwise, call review_to_wordlist to get a list of words\n", 156 | " sentences.append( review_to_wordlist( raw_sentence, \\\n", 157 | " remove_stopwords ))\n", 158 | " #\n", 159 | " # Return the list of sentences (each sentence is a list of words,\n", 160 | " # so this returns a list of lists\n", 161 | " return sentences" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 20, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "sentences = [] # Initialize an empty list of sentences\n", 171 | "\n", 172 | "\n", 173 | "for review in description:\n", 174 | " sentences += review_to_sentences(review, tokenizer)\n", 175 | "\n" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 21, 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "data": { 185 | "text/plain": [ 186 | "11870" 187 | ] 188 | }, 189 | "execution_count": 21, 190 | "metadata": {}, 191 | "output_type": "execute_result" 192 | } 193 | ], 194 | "source": [ 195 | "len(sentences)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 22, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "data": { 205 | "text/plain": [ 206 | "['ibm',\n", 207 | " 'sametime',\n", 208 | " 'meeting',\n", 209 | " 'server',\n", 210 | " 'vulnerable',\n", 211 | " 'cross',\n", 212 | " 'site',\n", 213 | " 'scripting']" 214 | ] 215 | }, 216 | "execution_count": 22, 217 | "metadata": {}, 218 | "output_type": "execute_result" 219 | } 220 | ], 221 | "source": [ 222 | "sentences[3400]" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 23, 228 | "metadata": {}, 229 | "outputs": [ 230 | { 231 | "data": { 232 | "text/plain": [ 233 | "(8205,)" 234 | ] 235 | }, 236 | "execution_count": 23, 237 | "metadata": {}, 238 | "output_type": "execute_result" 239 | } 240 | ], 241 | "source": [ 242 | "scores.shape" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 24, 248 | "metadata": {}, 249 | "outputs": [ 250 | { 251 | "name": "stderr", 252 | "output_type": "stream", 253 | "text": [ 254 | "2018-07-09 02:03:02,365 : INFO : 'pattern' package not found; tag filters are not available for English\n", 255 | "2018-07-09 02:03:02,371 : INFO : collecting all words and their counts\n", 256 | "2018-07-09 02:03:02,372 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", 257 | "2018-07-09 02:03:02,424 : INFO : PROGRESS: at sentence #10000, processed 202304 words, keeping 9499 word types\n", 258 | "2018-07-09 02:03:02,434 : INFO : collected 10441 word types from a corpus of 230372 raw words and 11870 sentences\n", 259 | "2018-07-09 02:03:02,435 : INFO : Loading a fresh vocabulary\n", 260 | "2018-07-09 02:03:02,442 : INFO : min_count=40 retains 705 unique words (6% of original 10441, drops 9736)\n", 261 | "2018-07-09 02:03:02,442 : INFO : min_count=40 leaves 189168 word corpus (82% of original 230372, drops 41204)\n", 262 | "2018-07-09 02:03:02,447 : INFO : deleting the raw counts dictionary of 10441 items\n", 263 | "2018-07-09 02:03:02,448 : INFO : sample=0.001 downsamples 77 most-common words\n", 264 | "2018-07-09 02:03:02,449 : INFO : downsampling leaves estimated 127140 word corpus (67.2% of prior 189168)\n", 265 | "2018-07-09 02:03:02,454 : INFO : estimated required memory for 705 words and 300 dimensions: 2044500 bytes\n", 266 
| "2018-07-09 02:03:02,455 : INFO : resetting layer weights\n", 267 | "2018-07-09 02:03:02,477 : INFO : training model with 4 workers on 705 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=10\n" 268 | ] 269 | }, 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "Training model...\n" 275 | ] 276 | }, 277 | { 278 | "name": "stderr", 279 | "output_type": "stream", 280 | "text": [ 281 | "2018-07-09 02:03:02,613 : INFO : worker thread finished; awaiting finish of 3 more threads\n", 282 | "2018-07-09 02:03:02,618 : INFO : worker thread finished; awaiting finish of 2 more threads\n", 283 | "2018-07-09 02:03:02,621 : INFO : worker thread finished; awaiting finish of 1 more threads\n", 284 | "2018-07-09 02:03:02,624 : INFO : worker thread finished; awaiting finish of 0 more threads\n", 285 | "2018-07-09 02:03:02,625 : INFO : EPOCH - 1 : training on 230372 raw words (126976 effective words) took 0.1s, 936605 effective words/s\n", 286 | "2018-07-09 02:03:02,741 : INFO : worker thread finished; awaiting finish of 3 more threads\n", 287 | "2018-07-09 02:03:02,743 : INFO : worker thread finished; awaiting finish of 2 more threads\n", 288 | "2018-07-09 02:03:02,744 : INFO : worker thread finished; awaiting finish of 1 more threads\n", 289 | "2018-07-09 02:03:02,749 : INFO : worker thread finished; awaiting finish of 0 more threads\n", 290 | "2018-07-09 02:03:02,750 : INFO : EPOCH - 2 : training on 230372 raw words (127006 effective words) took 0.1s, 1090665 effective words/s\n", 291 | "2018-07-09 02:03:02,861 : INFO : worker thread finished; awaiting finish of 3 more threads\n", 292 | "2018-07-09 02:03:02,868 : INFO : worker thread finished; awaiting finish of 2 more threads\n", 293 | "2018-07-09 02:03:02,870 : INFO : worker thread finished; awaiting finish of 1 more threads\n", 294 | "2018-07-09 02:03:02,871 : INFO : worker thread finished; awaiting finish of 0 more threads\n", 295 | "2018-07-09 02:03:02,872 : INFO : EPOCH - 3 : training on 230372 raw words (127203 effective words) took 0.1s, 1109499 effective words/s\n", 296 | "2018-07-09 02:03:02,976 : INFO : worker thread finished; awaiting finish of 3 more threads\n", 297 | "2018-07-09 02:03:02,981 : INFO : worker thread finished; awaiting finish of 2 more threads\n", 298 | "2018-07-09 02:03:02,982 : INFO : worker thread finished; awaiting finish of 1 more threads\n", 299 | "2018-07-09 02:03:02,984 : INFO : worker thread finished; awaiting finish of 0 more threads\n", 300 | "2018-07-09 02:03:02,985 : INFO : EPOCH - 4 : training on 230372 raw words (126960 effective words) took 0.1s, 1228739 effective words/s\n", 301 | "2018-07-09 02:03:03,098 : INFO : worker thread finished; awaiting finish of 3 more threads\n", 302 | "2018-07-09 02:03:03,105 : INFO : worker thread finished; awaiting finish of 2 more threads\n", 303 | "2018-07-09 02:03:03,108 : INFO : worker thread finished; awaiting finish of 1 more threads\n", 304 | "2018-07-09 02:03:03,110 : INFO : worker thread finished; awaiting finish of 0 more threads\n", 305 | "2018-07-09 02:03:03,111 : INFO : EPOCH - 5 : training on 230372 raw words (127279 effective words) took 0.1s, 1078009 effective words/s\n", 306 | "2018-07-09 02:03:03,112 : INFO : training on a 1151860 raw words (635424 effective words) took 0.6s, 1001218 effective words/s\n", 307 | "2018-07-09 02:03:03,113 : INFO : precomputing L2-norms of word weight vectors\n", 308 | "2018-07-09 02:03:03,124 : INFO : saving Word2Vec object under 300features_40minwords_10context, 
separately None\n", 309 | "2018-07-09 02:03:03,125 : INFO : not storing attribute vectors_norm\n", 310 | "2018-07-09 02:03:03,126 : INFO : not storing attribute cum_table\n", 311 | "2018-07-09 02:03:03,152 : INFO : saved 300features_40minwords_10context\n" 312 | ] 313 | } 314 | ], 315 | "source": [ 316 | "# Import the built-in logging module and configure it so that Word2Vec \n", 317 | "# creates nice output messages\n", 318 | "import logging\n", 319 | "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\\\n", 320 | " level=logging.INFO)\n", 321 | "\n", 322 | "# Set values for various parameters\n", 323 | "num_features = 300 # Word vector dimensionality \n", 324 | "min_word_count = 40 # Minimum word count \n", 325 | "num_workers = 4 # Number of threads to run in parallel\n", 326 | "context = 10 # Context window size \n", 327 | "downsampling = 1e-3 # Downsample setting for frequent words\n", 328 | "\n", 329 | "# Initialize and train the model (this will take some time)\n", 330 | "from gensim.models import word2vec\n", 331 | "print (\"Training model...\")\n", 332 | "model = word2vec.Word2Vec(sentences, workers=num_workers, \\\n", 333 | " size=num_features, min_count = min_word_count, \\\n", 334 | " window = context, sample = downsampling)\n", 335 | "\n", 336 | "# If you don't plan to train the model any further, calling \n", 337 | "# init_sims will make the model much more memory-efficient.\n", 338 | "model.init_sims(replace=True)\n", 339 | "\n", 340 | "# It can be helpful to create a meaningful model name and \n", 341 | "# save the model for later use. You can load it later using Word2Vec.load()\n", 342 | "model_name = \"300features_40minwords_10context\"\n", 343 | "model.save(model_name)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 25, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "data": { 353 | "text/plain": [ 354 | "[('gold', 0.9094369411468506),\n", 355 | " ('rt', 0.8332555890083313),\n", 356 | " ('loading', 0.827616274356842),\n", 357 | " ('r', 0.7401587963104248),\n", 358 | " ('continuous', 0.7247640490531921),\n", 359 | " ('microsoft', 0.7045366168022156),\n", 360 | " ('os', 0.690426766872406),\n", 361 | " ('sp', 0.6736881732940674),\n", 362 | " ('execute', 0.6600849628448486),\n", 363 | " ('privileges', 0.6469964981079102)]" 364 | ] 365 | }, 366 | "execution_count": 25, 367 | "metadata": {}, 368 | "output_type": "execute_result" 369 | } 370 | ], 371 | "source": [ 372 | "model.most_similar(\"windows\")" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 26, 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "data": { 382 | "text/plain": [ 383 | "[('caused', 0.8722426891326904),\n", 384 | " ('forgery', 0.8688178062438965),\n", 385 | " ('cross', 0.855127215385437),\n", 386 | " ('websphere', 0.8501957654953003),\n", 387 | " ('csrf', 0.7897992134094238),\n", 388 | " ('reflected', 0.7893112897872925),\n", 389 | " ('lifecycle', 0.7889853715896606),\n", 390 | " ('tivoli', 0.7777188420295715),\n", 391 | " ('actions', 0.776308536529541),\n", 392 | " ('victim', 0.7574336528778076)]" 393 | ] 394 | }, 395 | "execution_count": 26, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "model.most_similar(\"vulnerable\") " 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 28, 407 | "metadata": {}, 408 | "outputs": [ 409 | { 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "Word2Vec(vocab=705, size=300, 
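The Word2Vec cell above was written against gensim 3.x: `size=` became `vector_size=` in gensim 4, `model.init_sims()` is no longer needed, and both similarity queries and raw vectors now live on `model.wv`. A minimal sketch under those assumptions, reusing the `sentences` list of token lists built earlier:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences,          # list of token lists built from the CVE descriptions
    vector_size=300,    # was `size=` in gensim 3.x
    min_count=40,       # drop words seen fewer than 40 times
    window=10,          # context window
    sample=1e-3,        # downsample very frequent words
    workers=4,
)
model.save("300features_40minwords_10context")

print(len(model.wv.key_to_index))        # vocabulary size (model.wv.vocab in 3.x)
print(model.wv["windows"].shape)         # (300,) word vector
print(model.wv.most_similar("vulnerable")[:5])
```

Neighbours such as "forgery", "csrf" and "reflected" for "vulnerable" likely reflect the recurring "... is vulnerable to cross-site scripting / cross-site request forgery ..." boilerplate in the 2016 descriptions.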
alpha=0.025)\n" 414 | ] 415 | } 416 | ], 417 | "source": [ 418 | "print(model)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 36, 424 | "metadata": {}, 425 | "outputs": [ 426 | { 427 | "data": { 428 | "text/plain": [ 429 | "(300,)" 430 | ] 431 | }, 432 | "execution_count": 36, 433 | "metadata": {}, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "model['internet'].shape" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 34, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "words = list(model.wv.vocab)\n", 448 | "#print(words)" 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": 41, 454 | "metadata": {}, 455 | "outputs": [ 456 | { 457 | "name": "stdout", 458 | "output_type": "stream", 459 | "text": [ 460 | "[-0.04629976 -0.03775239 -0.0612707 0.04138781 0.01347958 -0.13033734\n", 461 | " 0.03593222 -0.09561654 0.09193933 0.07477549 -0.01045573 -0.06526111\n", 462 | " 0.04529662 0.00507673 -0.05801545 -0.08556713 -0.03978465 0.00544046\n", 463 | " 0.05104588 -0.06756661 0.0370345 -0.04885554 -0.06992226 0.06885462\n", 464 | " -0.01036942 -0.01555624 0.01590928 0.05672768 0.03949884 0.00494205\n", 465 | " 0.09041023 0.00571667 0.09692305 0.00783475 -0.01307637 -0.09706757\n", 466 | " 0.01655647 -0.05390599 -0.04951017 -0.0244361 -0.03356093 -0.05495088\n", 467 | " -0.09433108 -0.01117052 0.01180344 0.06742935 0.02205241 -0.01068878\n", 468 | " 0.03313516 0.05438668 0.07552817 -0.09990546 -0.08005269 0.04091886\n", 469 | " 0.07880871 -0.04165225 -0.00918824 0.11682916 0.01531674 0.03951841\n", 470 | " 0.06148017 -0.03348539 0.07114412 -0.05529675 0.02345757 0.05184504\n", 471 | " -0.08638492 0.03308222 -0.04185783 0.06531723 -0.02965119 0.05407631\n", 472 | " -0.08639693 0.01736786 0.0844368 0.03411895 -0.04996516 -0.04844293\n", 473 | " -0.00310361 0.11538444 -0.05378686 -0.02967841 0.07217427 -0.00684789\n", 474 | " -0.0179275 0.00763353 0.09376784 -0.02292064 0.00583797 -0.05738004\n", 475 | " 0.05958892 -0.00936469 -0.0802224 0.02582087 -0.01356882 0.05525155\n", 476 | " -0.03021552 0.05766937 0.07452633 -0.03261406 0.01363617 0.02784311\n", 477 | " 0.05403625 -0.0721683 -0.07162187 0.07456078 -0.05203228 0.0164813\n", 478 | " 0.0517411 -0.04355783 0.02554489 0.03528553 -0.01070342 0.02726801\n", 479 | " -0.02041939 -0.05251348 0.08820312 0.01178347 0.00489954 0.00588435\n", 480 | " 0.01646947 0.07780317 0.09179305 0.11958703 -0.0700468 0.08392422\n", 481 | " -0.10074809 -0.0024156 -0.03128896 0.05947265 -0.02526046 -0.07388628\n", 482 | " -0.01059673 -0.02448021 0.01914293 -0.08530535 -0.0962625 0.02729028\n", 483 | " 0.08145312 -0.0462271 -0.06010916 -0.09902808 -0.05604574 0.09974051\n", 484 | " -0.04245941 0.08706457 0.07051359 -0.00488337 -0.04609452 0.1145611\n", 485 | " -0.0057279 -0.03807061 -0.05118256 -0.11710254 0.02467629 0.06788233\n", 486 | " -0.06110778 -0.01176782 0.00729163 -0.03396017 -0.0781162 -0.06268714\n", 487 | " 0.03930313 -0.03695241 -0.02600467 -0.12235393 0.02224673 0.07157399\n", 488 | " -0.05865877 0.00422364 -0.0476161 0.07807697 0.16394554 -0.03288585\n", 489 | " -0.04041051 -0.0528004 -0.0232111 -0.02279345 0.00045042 0.09485012\n", 490 | " 0.03876022 -0.05911956 0.01148629 -0.00197921 0.05525924 -0.03475343\n", 491 | " 0.04871691 -0.01206489 -0.02855589 0.01153949 0.01555656 0.03300145\n", 492 | " -0.03109088 -0.03301695 0.07278996 0.01285422 0.00177927 -0.0844135\n", 493 | " 0.11162003 0.07612702 0.02196101 0.0828164 
-0.11434157 -0.0276711\n", 494 | " 0.07023861 0.10799347 0.03814689 -0.00362443 -0.08742201 -0.07609434\n", 495 | " 0.00698442 0.04796709 -0.03997444 -0.00326881 0.08609959 0.09073488\n", 496 | " -0.0240713 0.03292091 -0.03615429 0.0610457 -0.10379627 -0.02782106\n", 497 | " -0.00432852 0.07537484 0.05877474 -0.00925471 -0.03437344 -0.0141502\n", 498 | " 0.04523747 -0.09719168 -0.04366856 -0.030666 -0.05388606 0.02842467\n", 499 | " -0.04205393 0.04848009 -0.12231693 0.03359671 0.05251773 0.00452684\n", 500 | " -0.02673173 -0.04534775 -0.04329343 0.04071627 -0.04099033 -0.00925254\n", 501 | " 0.00621853 -0.03125075 0.08599172 -0.03397847 -0.08195248 -0.05399286\n", 502 | " 0.03652589 0.00599218 0.07541826 -0.074653 -0.00042468 0.06669124\n", 503 | " -0.00104121 0.071053 0.00749078 -0.03258489 0.02988847 -0.09998156\n", 504 | " -0.11790653 0.09019578 0.11468264 -0.07821774 -0.01264623 0.04464556\n", 505 | " -0.01605871 0.04245372 -0.03344565 -0.08192168 -0.00658706 -0.08812152\n", 506 | " 0.01667917 -0.04725152 -0.02900057 0.00827639 0.01228501 0.08989947\n", 507 | " 0.02437822 0.01969418 -0.06138816 0.03426072 0.02008862 -0.06260328\n", 508 | " -0.1004454 0.05077861 0.08392266 0.06744874 0.03703656 -0.02402286\n", 509 | " 0.06734611 0.08680253 0.09700686 0.02178777 -0.02805798 -0.00768509]\n" 510 | ] 511 | } 512 | ], 513 | "source": [ 514 | "print(model[words[704]])" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [] 523 | } 524 | ], 525 | "metadata": { 526 | "kernelspec": { 527 | "display_name": "Python 3", 528 | "language": "python", 529 | "name": "python3" 530 | }, 531 | "language_info": { 532 | "codemirror_mode": { 533 | "name": "ipython", 534 | "version": 3 535 | }, 536 | "file_extension": ".py", 537 | "mimetype": "text/x-python", 538 | "name": "python", 539 | "nbconvert_exporter": "python", 540 | "pygments_lexer": "ipython3", 541 | "version": "3.6.4" 542 | } 543 | }, 544 | "nbformat": 4, 545 | "nbformat_minor": 2 546 | } 547 | -------------------------------------------------------------------------------- /Yearly NVD Analysis .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import warnings\n", 12 | "warnings.filterwarnings('ignore')" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "dataset2002 = pd.read_json('cve-2002.json')\n", 22 | "dataset2003 = pd.read_json('cve-2003.json')\n", 23 | "dataset2004 = pd.read_json('cve-2004.json')\n", 24 | "dataset2005 = pd.read_json('cve-2005.json')\n", 25 | "dataset2006 = pd.read_json('cve-2006.json')\n", 26 | "dataset2007 = pd.read_json('cve-2007.json')\n", 27 | "dataset2008 = pd.read_json('cve-2008.json')\n", 28 | "dataset2009 = pd.read_json('cve-2009.json')\n", 29 | "dataset2010 = pd.read_json('cve-2010.json')\n", 30 | "dataset2011 = pd.read_json('cve-2011.json')\n", 31 | "dataset2012 = pd.read_json('cve-2012.json')\n", 32 | "dataset2013 = pd.read_json('cve-2013.json')\n", 33 | "dataset2014 = pd.read_json('cve-2014.json')\n", 34 | "dataset2015 = pd.read_json('cve-2015.json')\n", 35 | "dataset2016 = pd.read_json('cve-2016.json')\n", 36 | "dataset2017 = pd.read_json('cve-2017.json')\n", 37 | "dataset2018 = 
pd.read_json('cve-2018.json')" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "# yearly inclusions " 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 61, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "description = []\n", 54 | "severity = []\n", 55 | "scores = []\n", 56 | "exploitability = []\n", 57 | "\n", 58 | "test_description = []\n", 59 | "test_severity = []\n", 60 | "test_scores = []\n", 61 | "test_exploitability = []\n", 62 | "\n", 63 | "for i in range(dataset2018.shape[0]):\n", 64 | " new=dataset2018.CVE_Items[i]\n", 65 | " if('baseMetricV2' in new['impact'].keys()):\n", 66 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 67 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 68 | " description.append(new['cve']['description']['description_data'][0]['value'])\n", 69 | " exploitability.append(new['impact']['baseMetricV2'][ 'exploitabilityScore'])\n", 70 | " \n", 71 | "for i in range(dataset2017.shape[0]):\n", 72 | " new=dataset2017.CVE_Items[i]\n", 73 | " if('baseMetricV2' in new['impact'].keys()):\n", 74 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 75 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 76 | " description.append(new['cve']['description']['description_data'][0]['value'])\n", 77 | " exploitability.append(new['impact']['baseMetricV2']['exploitabilityScore'])\n", 78 | "\n", 79 | "for i in range(dataset2016.shape[0]):\n", 80 | " new=dataset2016.CVE_Items[i]\n", 81 | " if('baseMetricV2' in new['impact'].keys()):\n", 82 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 83 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 84 | " description.append(new['cve']['description']['description_data'][0]['value'])\n", 85 | "\n", 86 | "for i in range(dataset2015.shape[0]):\n", 87 | " new=dataset2015.CVE_Items[i]\n", 88 | " if('baseMetricV2' in new['impact'].keys()):\n", 89 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 90 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 91 | " description.append(new['cve']['description']['description_data'][0]['value'])\n", 92 | "\n", 93 | "for i in range(dataset2014.shape[0]):\n", 94 | " new=dataset2014.CVE_Items[i]\n", 95 | " if('baseMetricV2' in new['impact'].keys()):\n", 96 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 97 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 98 | " description.append(new['cve']['description']['description_data'][0]['value']) \n", 99 | "\n", 100 | "for i in range(dataset2013.shape[0]):\n", 101 | " new=dataset2013.CVE_Items[i]\n", 102 | " if('baseMetricV2' in new['impact'].keys()):\n", 103 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 104 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 105 | " description.append(new['cve']['description']['description_data'][0]['value']) \n", 106 | " \n", 107 | "for i in range(dataset2012.shape[0]):\n", 108 | " new=dataset2012.CVE_Items[i]\n", 109 | " if('baseMetricV2' in new['impact'].keys()):\n", 110 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 111 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 112 | " description.append(new['cve']['description']['description_data'][0]['value']) \n", 113 | " \n", 114 | "for i in range(dataset2011.shape[0]):\n", 115 | " 
new=dataset2011.CVE_Items[i]\n", 116 | " if('baseMetricV2' in new['impact'].keys()):\n", 117 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 118 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 119 | " description.append(new['cve']['description']['description_data'][0]['value']) \n", 120 | "\n", 121 | "for i in range(dataset2010.shape[0]):\n", 122 | " new=dataset2010.CVE_Items[i]\n", 123 | " if('baseMetricV2' in new['impact'].keys()):\n", 124 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 125 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 126 | " description.append(new['cve']['description']['description_data'][0]['value']) \n", 127 | "\n", 128 | "for i in range(dataset2009.shape[0]):\n", 129 | " new=dataset2009.CVE_Items[i]\n", 130 | " if('baseMetricV2' in new['impact'].keys()):\n", 131 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 132 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 133 | " description.append(new['cve']['description']['description_data'][0]['value']) \n", 134 | " \n", 135 | "for i in range(dataset2008.shape[0]):\n", 136 | " new=dataset2008.CVE_Items[i]\n", 137 | " if('baseMetricV2' in new['impact'].keys()):\n", 138 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 139 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 140 | " description.append(new['cve']['description']['description_data'][0]['value']) \n", 141 | " \n", 142 | "#for i in range(dataset2007.shape[0]):\n", 143 | " # new=dataset2007.CVE_Items[i]\n", 144 | " # if('baseMetricV2' in new['impact'].keys()):\n", 145 | " # severity.append(new['impact']['baseMetricV2']['severity'])\n", 146 | " # scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 147 | " # description.append(new['cve']['description']['description_data'][0]['value']) \n", 148 | " \n", 149 | "#for i in range(dataset2006.shape[0]):\n", 150 | " # new=dataset2006.CVE_Items[i]\n", 151 | " # if('baseMetricV2' in new['impact'].keys()):\n", 152 | " # severity.append(new['impact']['baseMetricV2']['severity'])\n", 153 | " # scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 154 | " # description.append(new['cve']['description']['description_data'][0]['value']) \n", 155 | " \n", 156 | "#for i in range(dataset2005.shape[0]):\n", 157 | " # new=dataset2005.CVE_Items[i]\n", 158 | " # if('baseMetricV2' in new['impact'].keys()):\n", 159 | " # severity.append(new['impact']['baseMetricV2']['severity'])\n", 160 | " # scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 161 | " # description.append(new['cve']['description']['description_data'][0]['value']) \n", 162 | " \n", 163 | "#for i in range(dataset2004.shape[0]):\n", 164 | " # new=dataset2004.CVE_Items[i]\n", 165 | " # if('baseMetricV2' in new['impact'].keys()):\n", 166 | " # severity.append(new['impact']['baseMetricV2']['severity'])\n", 167 | " # scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 168 | " # description.append(new['cve']['description']['description_data'][0]['value']) \n", 169 | " \n", 170 | "#for i in range(dataset2003.shape[0]):\n", 171 | " # new=dataset2003.CVE_Items[i]\n", 172 | " # if('baseMetricV2' in new['impact'].keys()):\n", 173 | " # severity.append(new['impact']['baseMetricV2']['severity'])\n", 174 | " # scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 175 | " # 
description.append(new['cve']['description']['description_data'][0]['value']) \n", 176 | " \n", 177 | "#for i in range(dataset2002.shape[0]):\n", 178 | " # new=dataset2002.CVE_Items[i]\n", 179 | " # if('baseMetricV2' in new['impact'].keys()):\n", 180 | " # severity.append(new['impact']['baseMetricV2']['severity'])\n", 181 | " # scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 182 | " # description.append(new['cve']['description']['description_data'][0]['value']) \n", 183 | " " 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 62, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "description = np.array(description)\n", 193 | "severity = np.array(severity)\n", 194 | "scores = np.array(scores)\n", 195 | "exploitability = np.array(exploitability)\n", 196 | "\n", 197 | "test_description = np.array(description)\n", 198 | "test_severity = np.array(severity)\n", 199 | "test_scores = np.array(scores)\n", 200 | "test_exploitability = np.array(exploitability)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 63, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "from bs4 import BeautifulSoup \n", 210 | "import re\n", 211 | "from nltk.corpus import stopwords\n", 212 | "def review_to_words( raw_review ):\n", 213 | " # Function to convert a raw review to a string of words\n", 214 | " # The input is a single string (a raw movie review), and \n", 215 | " # the output is a single string (a preprocessed movie review)\n", 216 | " #\n", 217 | " # 1. Remove HTML\n", 218 | " review_text = BeautifulSoup(raw_review).get_text() \n", 219 | " #\n", 220 | " # 2. Remove non-letters \n", 221 | " letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n", 222 | " #\n", 223 | " # 3. Convert to lower case, split into individual words\n", 224 | " words = letters_only.lower().split() \n", 225 | " #\n", 226 | " # 4. In Python, searching a set is much faster than searching\n", 227 | " # a list, so convert the stop words to a set\n", 228 | " stops = set(stopwords.words(\"english\")) \n", 229 | " # \n", 230 | " # 5. Remove stop words\n", 231 | " meaningful_words = [w for w in words if not w in stops] \n", 232 | " #\n", 233 | " # 6. 
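The block above repeats an identical extraction loop once per year (with 2002-2007 commented out), and only the 2018 and 2017 loops collect `exploitabilityScore`, which leaves the `exploitability` list shorter than the others and contributes to the `StratifiedKFold` failure further down. A sketch that loads and extracts all selected years in one pass, collecting every field for every year (file names as in the cell that reads the feeds):

```python
import pandas as pd

description, severity, scores, exploitability = [], [], [], []
for year in range(2008, 2019):                 # the uncommented years above
    frame = pd.read_json(f'cve-{year}.json')
    for item in frame.CVE_Items:
        metric = item['impact'].get('baseMetricV2')
        if metric is None:                     # no CVSS v2 scoring -> skip
            continue
        severity.append(metric['severity'])
        scores.append(metric['cvssV2']['baseScore'])
        exploitability.append(metric['exploitabilityScore'])
        description.append(item['cve']['description']['description_data'][0]['value'])
```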
Join the words back into one string separated by space, \n", 234 | " # and return the result.\n", 235 | " return( \" \".join( meaningful_words )) " 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 64, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "# Initialize an empty list to hold the clean reviews\n", 245 | "clean_description = []\n", 246 | "\n", 247 | "# Loop over each review; create an index i that goes from 0 to the length\n", 248 | "# of the movie review list \n", 249 | "for i in range(description.shape[0]):\n", 250 | " # Call our function for each one, and add the result to the list of\n", 251 | " # clean reviews\n", 252 | " clean_description.append( review_to_words( description[i] ) )" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 65, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "clean_description_array = np.array(clean_description)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 66, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "# Initialize an empty list to hold the clean reviews\n", 271 | "test_clean_description = []\n", 272 | "\n", 273 | "# Loop over each review; create an index i that goes from 0 to the length\n", 274 | "# of the movie review list \n", 275 | "for i in range(test_description.shape[0]):\n", 276 | " # Call our function for each one, and add the result to the list of\n", 277 | " # clean reviews\n", 278 | " test_clean_description.append( review_to_words( test_description[i] ) )" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 67, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "test_clean_description_array = np.array(test_clean_description)" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 68, 293 | "metadata": {}, 294 | "outputs": [ 295 | { 296 | "name": "stdout", 297 | "output_type": "stream", 298 | "text": [ 299 | "Creating the bag of words...\n", 300 | "\n" 301 | ] 302 | } 303 | ], 304 | "source": [ 305 | "print (\"Creating the bag of words...\\n\")\n", 306 | "from sklearn.feature_extraction.text import CountVectorizer\n", 307 | "\n", 308 | "# Initialize the \"CountVectorizer\" object, which is scikit-learn's\n", 309 | "# bag of words tool. \n", 310 | "vectorizer = CountVectorizer(analyzer = \"word\", \\\n", 311 | " tokenizer = None, \\\n", 312 | " preprocessor = None, \\\n", 313 | " stop_words = None, \\\n", 314 | " max_features = 500) \n", 315 | "\n", 316 | "# fit_transform() does two functions: First, it fits the model\n", 317 | "# and learns the vocabulary; second, it transforms our training data\n", 318 | "# into feature vectors. 
The input to fit_transform should be a list of \n", 319 | "# strings.\n", 320 | "description_features = vectorizer.fit_transform(clean_description_array)\n", 321 | "#test_description_features = vectorizer.fit_transform(test_clean_description_array)\n", 322 | "\n", 323 | "# Numpy arrays are easy to work with, so convert the result to an \n", 324 | "# array\n", 325 | "description_features = description_features.toarray()\n", 326 | "#test_description_features = test_description_features.toarray()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "# Random Forest" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 56, 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "0.7204157659936531\n", 346 | "0.8023271857609345\n", 347 | "0.8078373654677583\n", 348 | "0.7493330880323797\n", 349 | "0.7197332106715731\n", 350 | "avg 0.7599293231852597\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "from sklearn.cross_validation import StratifiedKFold\n", 356 | "from sklearn.ensemble import RandomForestClassifier\n", 357 | "from sklearn import metrics\n", 358 | "n_folds = 5\n", 359 | "score = 0.0\n", 360 | "skf = StratifiedKFold(severity, n_folds)\n", 361 | "avg_score = 0\n", 362 | "\n", 363 | "for train_index, test_index in skf:\n", 364 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 365 | " y_train, y_test = severity[train_index], severity[test_index]\n", 366 | " forest = RandomForestClassifier(n_estimators = 50)\n", 367 | " forest.fit( X_train, y_train )\n", 368 | " score = forest.score(X_test,y_test)\n", 369 | " avg_score += score \n", 370 | " print(score)\n", 371 | " \n", 372 | "print(\"avg\",avg_score/n_folds)" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "# Testing on 18 and trainin on 17/16" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 37, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "name": "stdout", 389 | "output_type": "stream", 390 | "text": [ 391 | "0.9903553797364893\n" 392 | ] 393 | } 394 | ], 395 | "source": [ 396 | "from sklearn.ensemble import RandomForestClassifier\n", 397 | "from sklearn import metrics\n", 398 | "\n", 399 | "\n", 400 | "\n", 401 | "\n", 402 | "\n", 403 | "forest = RandomForestClassifier(n_estimators = 50)\n", 404 | "\n", 405 | "forest.fit( description_features, severity )\n", 406 | "\n", 407 | "score = forest.score( test_description_features, test_severity)\n", 408 | "\n", 409 | "\n", 410 | "print(score)\n", 411 | " \n" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "# MLP Classifier" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 57, 424 | "metadata": {}, 425 | "outputs": [ 426 | { 427 | "name": "stdout", 428 | "output_type": "stream", 429 | "text": [ 430 | "0.6878075702524951\n", 431 | "0.7675113829738307\n", 432 | "0.7924293993192899\n", 433 | "0.7344770490295282\n", 434 | "0.7050137994480221\n", 435 | "avg 0.7374478402046332\n" 436 | ] 437 | } 438 | ], 439 | "source": [ 440 | "from sklearn.neural_network import MLPClassifier\n", 441 | "n_folds = 5\n", 442 | "score = 0.0\n", 443 | "skf = StratifiedKFold(severity, n_folds)\n", 444 | "clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)\n", 445 | "avg_score = 0\n", 446 | "\n", 447 | "for train_index, 
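Two caveats about the "Testing on 18 and training on 17/16" result above. First, `test_description`, `test_severity` and friends are assigned from the very same lists that were just used for training, so the 0.99 score is essentially accuracy on already-seen descriptions rather than on a held-out year. Second, the commented-out `vectorizer.fit_transform(test_clean_description_array)` would re-fit the vocabulary on the test text; held-out descriptions should only be passed through `transform`. A minimal sketch of the intended split, where `train_clean_description`/`train_severity` (2016-2017) and `test_clean_description`/`test_severity` (2018) are hypothetical names for genuinely separate arrays:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

vectorizer = CountVectorizer(analyzer="word", max_features=500)

# Fit the vocabulary on the training years only...
X_train = vectorizer.fit_transform(train_clean_description).toarray()
# ...and only *apply* it to the held-out year.
X_test = vectorizer.transform(test_clean_description).toarray()

forest = RandomForestClassifier(n_estimators=50)
forest.fit(X_train, train_severity)
print("held-out 2018 accuracy:", forest.score(X_test, test_severity))
```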
test_index in skf:\n", 448 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 449 | " y_train, y_test = severity[train_index], severity[test_index]\n", 450 | " \n", 451 | " clf.fit(X_train, y_train)\n", 452 | " #YPred = forest.predict(X_test)\n", 453 | " #y.append(YPred)\n", 454 | " score = clf.score(X_test,y_test)\n", 455 | " print(score)\n", 456 | " avg_score += score\n", 457 | "print(\"avg\",avg_score/n_folds)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "# K neighbor" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 58, 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "name": "stdout", 474 | "output_type": "stream", 475 | "text": [ 476 | "0.6799889619647703\n", 477 | "0.744883410752886\n", 478 | "0.7761475485235949\n", 479 | "0.7358108729647687\n", 480 | "0.7125114995400184\n", 481 | "avg 0.7298684587492077\n" 482 | ] 483 | } 484 | ], 485 | "source": [ 486 | "from sklearn.neighbors import KNeighborsClassifier\n", 487 | "n_folds = 5\n", 488 | "score = 0.0\n", 489 | "skf = StratifiedKFold(severity, n_folds)\n", 490 | "neigh = KNeighborsClassifier(n_neighbors=25)\n", 491 | "avg_score = 0\n", 492 | "\n", 493 | "for train_index, test_index in skf:\n", 494 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 495 | " y_train, y_test = severity[train_index], severity[test_index]\n", 496 | " \n", 497 | " #YPred = forest.predict(X_test)\n", 498 | " #y.append(YPred)\n", 499 | " score = neigh.fit(X_train, y_train).score(X_test,y_test)\n", 500 | " print(score)\n", 501 | " avg_score += score\n", 502 | "print(\"avg\",avg_score/n_folds)" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "# Regression Scoring Analysis " 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 70, 515 | "metadata": {}, 516 | "outputs": [], 517 | "source": [ 518 | "score_result = np.array(scores).astype(np.float)\n", 519 | "exploitability_result = np.array(exploitability).astype(np.float)\n", 520 | "\n", 521 | "test_score_result = np.array(test_scores).astype(np.float)\n", 522 | "test_exploitability_result = np.array(test_exploitability).astype(np.float)\n" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "### Random Forest Regressor " 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": 71, 535 | "metadata": {}, 536 | "outputs": [ 537 | { 538 | "name": "stdout", 539 | "output_type": "stream", 540 | "text": [ 541 | "MAE: 0.7591\n", 542 | "MAE: 0.6233\n", 543 | "MAE: 0.7976\n", 544 | "MAE: 0.7355\n", 545 | "MAE: 0.8243\n", 546 | "avg 0.7479518694443361\n" 547 | ] 548 | } 549 | ], 550 | "source": [ 551 | "from sklearn.ensemble import RandomForestRegressor\n", 552 | "from sklearn.metrics import mean_squared_error,mean_absolute_error\n", 553 | "from sklearn.cross_validation import StratifiedKFold\n", 554 | "\n", 555 | "n_folds = 5\n", 556 | "avg_mad = 0\n", 557 | "skf = StratifiedKFold(score_result, n_folds)\n", 558 | "\n", 559 | "\n", 560 | "\n", 561 | "for train_index, test_index in skf:\n", 562 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 563 | " y_train, y_test = score_result[train_index], score_result[test_index]\n", 564 | " \n", 565 | " clf = RandomForestRegressor(max_depth=45, random_state=0)\n", 566 | " clf.fit(X_train, y_train)\n", 567 | " #mse = 
mean_squared_error(y_test, clf.predict(X_test))\n", 568 | " mad = mean_absolute_error(y_test, clf.predict(X_test))\n", 569 | " #print(\"MSE: %.4f\" % mse)\n", 570 | " avg_mad += mad\n", 571 | " print(\"MAE: %.4f\" % mad)\n", 572 | "print(\"avg\",avg_mad/n_folds) " 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 40, 578 | "metadata": {}, 579 | "outputs": [ 580 | { 581 | "ename": "ValueError", 582 | "evalue": "Cannot have number of folds n_folds=5 greater than the number of samples: 0.", 583 | "output_type": "error", 584 | "traceback": [ 585 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 586 | "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", 587 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mn_folds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mavg_mad\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mskf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mStratifiedKFold\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mexploitability_result\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn_folds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 588 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/sklearn/cross_validation.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, y, n_folds, shuffle, random_state)\u001b[0m\n\u001b[1;32m 536\u001b[0m random_state=None):\n\u001b[1;32m 537\u001b[0m super(StratifiedKFold, self).__init__(\n\u001b[0;32m--> 538\u001b[0;31m len(y), n_folds, shuffle, random_state)\n\u001b[0m\u001b[1;32m 539\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 540\u001b[0m \u001b[0mn_samples\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 589 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/sklearn/cross_validation.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, n, n_folds, shuffle, random_state)\u001b[0m\n\u001b[1;32m 260\u001b[0m raise ValueError(\n\u001b[1;32m 261\u001b[0m (\"Cannot have number of folds n_folds={0} greater\"\n\u001b[0;32m--> 262\u001b[0;31m \" than the number of samples: {1}.\").format(n_folds, n))\n\u001b[0m\u001b[1;32m 263\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 264\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mshuffle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbool\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 590 | "\u001b[0;31mValueError\u001b[0m: Cannot have number of folds n_folds=5 greater than the number of samples: 0." 
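The `ValueError` above is raised because `exploitability_result` is empty in this run: `exploitabilityScore` is only appended inside the 2018/2017 loops, so depending on which cells were executed the array can hold far fewer samples than `description_features` (here, zero). Separately, `StratifiedKFold` expects discrete class labels, so a plain `KFold` is the natural splitter for a continuous exploitability or base-score target. A sketch with both points addressed, assuming the arrays are the ones built earlier in this notebook:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Guard against the length mismatch that produced the ValueError above.
assert len(exploitability_result) == description_features.shape[0], \
    "exploitabilityScore was not collected for every description"

maes = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(description_features):
    reg = RandomForestRegressor(max_depth=45, random_state=0)
    reg.fit(description_features[train_idx], exploitability_result[train_idx])
    maes.append(mean_absolute_error(exploitability_result[test_idx],
                                    reg.predict(description_features[test_idx])))
    print("MAE: %.4f" % maes[-1])
print("avg", sum(maes) / len(maes))
```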
591 | ] 592 | } 593 | ], 594 | "source": [ 595 | "from sklearn.ensemble import RandomForestRegressor\n", 596 | "from sklearn.metrics import mean_squared_error,mean_absolute_error\n", 597 | "from sklearn.cross_validation import StratifiedKFold\n", 598 | "\n", 599 | "n_folds = 5\n", 600 | "avg_mad = 0\n", 601 | "skf = StratifiedKFold(exploitability_result, n_folds)\n", 602 | "\n", 603 | "\n", 604 | "\n", 605 | "for train_index, test_index in skf:\n", 606 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 607 | " y_train, y_test = exploitability_result[train_index], exploitability_result[test_index]\n", 608 | " \n", 609 | " clf = RandomForestRegressor(max_depth=45, random_state=0)\n", 610 | " clf.fit(X_train, y_train)\n", 611 | " #mse = mean_squared_error(y_test, clf.predict(X_test))\n", 612 | " mad = mean_absolute_error(y_test, clf.predict(X_test))\n", 613 | " #print(\"MSE: %.4f\" % mse)\n", 614 | " avg_mad += mad\n", 615 | " print(\"MAE: %.4f\" % mad)\n", 616 | "print(\"avg\",avg_mad/n_folds) " 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "# Testing on 2018 .... training on 16/17" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": 41, 629 | "metadata": {}, 630 | "outputs": [ 631 | { 632 | "name": "stdout", 633 | "output_type": "stream", 634 | "text": [ 635 | "MAE: 0.3269\n" 636 | ] 637 | } 638 | ], 639 | "source": [ 640 | "from sklearn.ensemble import RandomForestRegressor\n", 641 | "from sklearn.metrics import mean_squared_error,mean_absolute_error\n", 642 | "\n", 643 | "\n", 644 | "\n", 645 | "clf = RandomForestRegressor(max_depth=45, random_state=0)\n", 646 | "clf.fit(description_features, score_result)\n", 647 | "#mse = mean_squared_error(y_test, clf.predict(X_test))\n", 648 | "mad = mean_absolute_error(test_score_result, clf.predict(test_description_features))\n", 649 | "#print(\"MSE: %.4f\" % mse)\n", 650 | "\n", 651 | "print(\"MAE: %.4f\" % mad)\n", 652 | " " 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": {}, 658 | "source": [ 659 | "# testing with scada descriptions " 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 128, 665 | "metadata": {}, 666 | "outputs": [], 667 | "source": [ 668 | "desc_test = \"Delta PMSoft versions 2.10 and prior have multiple stack-based buffer overflow vulnerabilities where a .ppm file can introduce a value larger than is readable by PMSoft's fixed-length stack buffer. This can cause the buffer to be overwritten, which may allow arbitrary code execution or cause the application to crash. CVSS v3 base score: 7.1; CVSS vector string: AV:L/AC:L/PR:N/UI:R/S:U/C:N/I:H/A:H. Delta Electronics recommends affected users update to at least PMSoft v2.11, which was made available as of March 22, 2018, or the latest available version.\"" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 129, 674 | "metadata": {}, 675 | "outputs": [ 676 | { 677 | "data": { 678 | "text/plain": [ 679 | "\"Delta PMSoft versions 2.10 and prior have multiple stack-based buffer overflow vulnerabilities where a .ppm file can introduce a value larger than is readable by PMSoft's fixed-length stack buffer. This can cause the buffer to be overwritten, which may allow arbitrary code execution or cause the application to crash. CVSS v3 base score: 7.1; CVSS vector string: AV:L/AC:L/PR:N/UI:R/S:U/C:N/I:H/A:H. 
Delta Electronics recommends affected users update to at least PMSoft v2.11, which was made available as of March 22, 2018, or the latest available version.\"" 680 | ] 681 | }, 682 | "execution_count": 129, 683 | "metadata": {}, 684 | "output_type": "execute_result" 685 | } 686 | ], 687 | "source": [ 688 | "desc_test" 689 | ] 690 | }, 691 | { 692 | "cell_type": "code", 693 | "execution_count": 140, 694 | "metadata": {}, 695 | "outputs": [], 696 | "source": [ 697 | "clean_desc_test = review_to_words (desc_test)" 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": 141, 703 | "metadata": {}, 704 | "outputs": [ 705 | { 706 | "data": { 707 | "text/plain": [ 708 | "'delta pmsoft versions prior multiple stack based buffer overflow vulnerabilities ppm file introduce value larger readable pmsoft fixed length stack buffer cause buffer overwritten may allow arbitrary code execution cause application crash cvss v base score cvss vector string av l ac l pr n ui r u c n h h delta electronics recommends affected users update least pmsoft v made available march latest available version'" 709 | ] 710 | }, 711 | "execution_count": 141, 712 | "metadata": {}, 713 | "output_type": "execute_result" 714 | } 715 | ], 716 | "source": [ 717 | "clean_desc_test" 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 142, 723 | "metadata": {}, 724 | "outputs": [], 725 | "source": [ 726 | "clean_desc_test = np.array(clean_desc_test)" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": 144, 732 | "metadata": {}, 733 | "outputs": [ 734 | { 735 | "data": { 736 | "text/plain": [ 737 | "()" 738 | ] 739 | }, 740 | "execution_count": 144, 741 | "metadata": {}, 742 | "output_type": "execute_result" 743 | } 744 | ], 745 | "source": [] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 136, 750 | "metadata": {}, 751 | "outputs": [ 752 | { 753 | "ename": "TypeError", 754 | "evalue": "iteration over a 0-d array", 755 | "output_type": "error", 756 | "traceback": [ 757 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 758 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 759 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mclean_desc_test_features\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mvectorizer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit_transform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mclean_desc_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 760 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py\u001b[0m in \u001b[0;36mfit_transform\u001b[0;34m(self, raw_documents, y)\u001b[0m\n\u001b[1;32m 867\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 868\u001b[0m vocabulary, X = self._count_vocab(raw_documents,\n\u001b[0;32m--> 869\u001b[0;31m self.fixed_vocabulary_)\n\u001b[0m\u001b[1;32m 870\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 871\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbinary\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 761 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py\u001b[0m in \u001b[0;36m_count_vocab\u001b[0;34m(self, raw_documents, fixed_vocab)\u001b[0m\n\u001b[1;32m 788\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m 
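The `TypeError: iteration over a 0-d array` shown here comes from wrapping the single cleaned SCADA description in `np.array(...)`: `CountVectorizer` expects an iterable of documents, and a 0-dimensional array is not iterable. Calling `fit_transform` again would also throw away the vocabulary learned from the NVD descriptions. A minimal sketch of scoring one new description, assuming `vectorizer` is the fitted CountVectorizer and `clf` the RandomForestRegressor trained on the full feature matrix a few cells earlier:

```python
# Wrap the single document in a list and reuse the already-fitted vocabulary.
features = vectorizer.transform([review_to_words(desc_test)]).toarray()
predicted_score = clf.predict(features)[0]
print("predicted CVSS v2 base score: %.1f" % predicted_score)
```

The ICS advisory text itself quotes a CVSS v3 base score of 7.1, so the prediction can be loosely sanity-checked against it, keeping in mind that the model was trained on v2 scores.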
\u001b[0m_make_int_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 789\u001b[0m \u001b[0mindptr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 790\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mdoc\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mraw_documents\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 791\u001b[0m \u001b[0mfeature_counter\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 792\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mfeature\u001b[0m \u001b[0;32min\u001b[0m \u001b[0manalyze\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 762 | "\u001b[0;31mTypeError\u001b[0m: iteration over a 0-d array" 763 | ] 764 | } 765 | ], 766 | "source": [ 767 | "clean_desc_test_features = vectorizer.fit_transform(clean_desc_test)" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": null, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [] 776 | } 777 | ], 778 | "metadata": { 779 | "kernelspec": { 780 | "display_name": "Python 3", 781 | "language": "python", 782 | "name": "python3" 783 | }, 784 | "language_info": { 785 | "codemirror_mode": { 786 | "name": "ipython", 787 | "version": 3 788 | }, 789 | "file_extension": ".py", 790 | "mimetype": "text/x-python", 791 | "name": "python", 792 | "nbconvert_exporter": "python", 793 | "pygments_lexer": "ipython3", 794 | "version": "3.6.4" 795 | } 796 | }, 797 | "nbformat": 4, 798 | "nbformat_minor": 2 799 | } 800 | -------------------------------------------------------------------------------- /Tensor Flow NN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import warnings\n", 12 | "warnings.filterwarnings('ignore')" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "dataset = pd.read_json('cve-2016.json')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 3, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "description = []\n", 31 | "severity = []\n", 32 | "scores = []\n", 33 | "\n", 34 | "for i in range(dataset.shape[0]):\n", 35 | " new=dataset.CVE_Items[i]\n", 36 | " if('baseMetricV2' in new['impact'].keys()):\n", 37 | " severity.append(new['impact']['baseMetricV2']['severity'])\n", 38 | " scores.append(new['impact']['baseMetricV2']['cvssV2']['baseScore'])\n", 39 | " description.append(new['cve']['description']['description_data'][0]['value'])\n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 4, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "description = np.array(description)\n", 49 | "severity = np.array(severity)\n", 50 | "scores = np.array(scores)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 15, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "score_result = np.array(scores).astype(np.float)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 5, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 
| "from bs4 import BeautifulSoup \n", 69 | "import re\n", 70 | "from nltk.corpus import stopwords\n", 71 | "def review_to_words( raw_review ):\n", 72 | " # Function to convert a raw review to a string of words\n", 73 | " # The input is a single string (a raw movie review), and \n", 74 | " # the output is a single string (a preprocessed movie review)\n", 75 | " #\n", 76 | " # 1. Remove HTML\n", 77 | " review_text = BeautifulSoup(raw_review).get_text() \n", 78 | " #\n", 79 | " # 2. Remove non-letters \n", 80 | " letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n", 81 | " #\n", 82 | " # 3. Convert to lower case, split into individual words\n", 83 | " words = letters_only.lower().split() \n", 84 | " #\n", 85 | " # 4. In Python, searching a set is much faster than searching\n", 86 | " # a list, so convert the stop words to a set\n", 87 | " stops = set(stopwords.words(\"english\")) \n", 88 | " # \n", 89 | " # 5. Remove stop words\n", 90 | " meaningful_words = [w for w in words if not w in stops] \n", 91 | " #\n", 92 | " # 6. Join the words back into one string separated by space, \n", 93 | " # and return the result.\n", 94 | " return( \" \".join( meaningful_words )) " 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 6, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# Get the number of reviews based on the dataframe column size\n", 104 | "\n", 105 | "# Initialize an empty list to hold the clean reviews\n", 106 | "clean_description = []\n", 107 | "\n", 108 | "# Loop over each review; create an index i that goes from 0 to the length\n", 109 | "# of the movie review list \n", 110 | "for i in range(description.shape[0]):\n", 111 | " # Call our function for each one, and add the result to the list of\n", 112 | " # clean reviews\n", 113 | " clean_description.append( review_to_words( description[i] ) )" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 7, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "clean_description_array = np.array(clean_description)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 8, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "Creating the bag of words...\n", 135 | "\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "print (\"Creating the bag of words...\\n\")\n", 141 | "from sklearn.feature_extraction.text import CountVectorizer\n", 142 | "\n", 143 | "# Initialize the \"CountVectorizer\" object, which is scikit-learn's\n", 144 | "# bag of words tool. \n", 145 | "vectorizer = CountVectorizer(analyzer = \"word\", \\\n", 146 | " tokenizer = None, \\\n", 147 | " preprocessor = None, \\\n", 148 | " stop_words = None, \\\n", 149 | " max_features = 500) \n", 150 | "\n", 151 | "# fit_transform() does two functions: First, it fits the model\n", 152 | "# and learns the vocabulary; second, it transforms our training data\n", 153 | "# into feature vectors. 
The input to fit_transform should be a list of \n", 154 | "# strings.\n", 155 | "description_features = vectorizer.fit_transform(clean_description_array)\n", 156 | "\n", 157 | "# Numpy arrays are easy to work with, so convert the result to an \n", 158 | "# array\n", 159 | "description_features = description_features.toarray()" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 9, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "import os\n", 169 | "import tensorflow as tf\n", 170 | "from sklearn.preprocessing import MinMaxScaler" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 17, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "0 5.809585 5.8462367\n", 183 | "5 5.160324 5.2313447\n", 184 | "10 4.2164226 4.246761\n", 185 | "15 2.9862418 2.9408264\n", 186 | "20 2.5988214 2.5132635\n", 187 | "25 2.5496793 2.5486178\n", 188 | "30 2.2028866 2.3043842\n", 189 | "35 2.2061298 2.329694\n", 190 | "40 2.134942 2.26963\n", 191 | "45 2.0858686 2.248738\n", 192 | "50 2.0794406 2.2617583\n", 193 | "55 2.04184 2.2180786\n", 194 | "60 2.0302649 2.1936624\n", 195 | "65 2.0178003 2.1843233\n", 196 | "70 2.0057778 2.1821067\n", 197 | "75 1.9960316 2.1751335\n", 198 | "80 1.9873366 2.158892\n", 199 | "85 1.9803038 2.1441638\n", 200 | "90 1.9724272 2.1340103\n", 201 | "95 1.9656241 2.1253836\n", 202 | "Training is complete!\n", 203 | "Final Training cost: 1.9599733352661133\n", 204 | "Final Testing cost: 2.116116523742676\n", 205 | "0 6.3137636 6.2920585\n", 206 | "5 5.868054 5.8787446\n", 207 | "10 5.4937167 5.4885683\n", 208 | "15 4.9287376 4.952388\n", 209 | "20 4.118076 4.2019634\n", 210 | "25 3.10795 3.1922522\n", 211 | "30 2.6273646 2.5404053\n", 212 | "35 2.6899126 2.554878\n", 213 | "40 2.353303 2.2635539\n", 214 | "45 2.248467 2.1674213\n", 215 | "50 2.2123804 2.126564\n", 216 | "55 2.135391 2.0603943\n", 217 | "60 2.1185682 2.0704641\n", 218 | "65 2.09118 2.0492766\n", 219 | "70 2.0647955 2.0151448\n", 220 | "75 2.0486777 2.00189\n", 221 | "80 2.0305424 1.9952497\n", 222 | "85 2.017988 1.9920549\n", 223 | "90 2.0043027 1.9815844\n", 224 | "95 1.9927137 1.9734163\n", 225 | "Training is complete!\n", 226 | "Final Training cost: 1.9831268787384033\n", 227 | "Final Testing cost: 1.969703197479248\n", 228 | "0 6.0931044 6.107996\n", 229 | "5 5.6935997 5.6785045\n", 230 | "10 5.110418 5.07619\n", 231 | "15 4.2169714 4.1914697\n" 232 | ] 233 | }, 234 | { 235 | "ename": "KeyboardInterrupt", 236 | "evalue": "", 237 | "output_type": "error", 238 | "traceback": [ 239 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 240 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 241 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 77\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 78\u001b[0m \u001b[0;31m# Feed in the training data and do one step of neural network training\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 79\u001b[0;31m \u001b[0msession\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moptimizer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY\u001b[0m\u001b[0;34m:\u001b[0m 
\u001b[0mY_train\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 80\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 81\u001b[0m \u001b[0;31m# Every 5 training steps, log our progress\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 242 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 898\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 899\u001b[0m result = self._run(None, fetches, feed_dict, options_ptr,\n\u001b[0;32m--> 900\u001b[0;31m run_metadata_ptr)\n\u001b[0m\u001b[1;32m 901\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mrun_metadata\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 902\u001b[0m \u001b[0mproto_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtf_session\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTF_GetBuffer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun_metadata_ptr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 243 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_run\u001b[0;34m(self, handle, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 1133\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mfinal_fetches\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mfinal_targets\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mhandle\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mfeed_dict_tensor\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1134\u001b[0m results = self._do_run(handle, final_targets, final_fetches,\n\u001b[0;32m-> 1135\u001b[0;31m feed_dict_tensor, options, run_metadata)\n\u001b[0m\u001b[1;32m 1136\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1137\u001b[0m \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 244 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_do_run\u001b[0;34m(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 1314\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mhandle\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1315\u001b[0m return self._do_call(_run_fn, feeds, fetches, targets, options,\n\u001b[0;32m-> 1316\u001b[0;31m run_metadata)\n\u001b[0m\u001b[1;32m 1317\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1318\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_do_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_prun_fn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeeds\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetches\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 245 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_do_call\u001b[0;34m(self, fn, *args)\u001b[0m\n\u001b[1;32m 1320\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_do_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1321\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1322\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1323\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mOpError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1324\u001b[0m \u001b[0mmessage\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcompat\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mas_text\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 246 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_run_fn\u001b[0;34m(feed_dict, fetch_list, target_list, options, run_metadata)\u001b[0m\n\u001b[1;32m 1305\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_extend_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1306\u001b[0m return self._call_tf_sessionrun(\n\u001b[0;32m-> 1307\u001b[0;31m options, feed_dict, fetch_list, target_list, run_metadata)\n\u001b[0m\u001b[1;32m 1308\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1309\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_prun_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 247 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_call_tf_sessionrun\u001b[0;34m(self, options, feed_dict, fetch_list, target_list, run_metadata)\u001b[0m\n\u001b[1;32m 1407\u001b[0m return tf_session.TF_SessionRun_wrapper(\n\u001b[1;32m 1408\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_session\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptions\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_list\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1409\u001b[0;31m run_metadata)\n\u001b[0m\u001b[1;32m 1410\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1411\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_exception_on_not_ok_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mstatus\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 248 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 249 | ] 250 | } 251 | ], 252 | "source": [ 253 | "from sklearn import ensemble\n", 254 | "from sklearn.metrics import mean_squared_error,mean_absolute_error\n", 255 | "from sklearn.cross_validation import StratifiedKFold\n", 256 | "\n", 257 | "n_folds = 5\n", 258 | "avg_mad = 0\n", 259 | "skf = StratifiedKFold(score_result, n_folds)\n", 260 | "\n", 261 | "for train_index, test_index in skf:\n", 262 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 263 | " Y_train, Y_test = 
score_result[train_index], score_result[test_index]\n", 264 | " \n", 265 | " \n", 266 | " # Define model parameters\n", 267 | " learning_rate = 0.001\n", 268 | " training_epochs = 100\n", 269 | "\n", 270 | " # Define how many inputs and outputs are in our neural network\n", 271 | " number_of_inputs = 500\n", 272 | " number_of_outputs = 1\n", 273 | "\n", 274 | " # Define how many neurons we want in each layer of our neural network\n", 275 | " layer_1_nodes = 50\n", 276 | " layer_2_nodes = 100\n", 277 | " layer_3_nodes = 50\n", 278 | " \n", 279 | " tf.reset_default_graph()\n", 280 | " \n", 281 | " # Input Layer\n", 282 | " with tf.variable_scope('input'):\n", 283 | " X = tf.placeholder(tf.float32, shape=(None, number_of_inputs))\n", 284 | "\n", 285 | " # Layer 1\n", 286 | " with tf.variable_scope('layer_1'):\n", 287 | " weights = tf.get_variable(\"weights1\", shape=[number_of_inputs, layer_1_nodes], initializer=tf.contrib.layers.xavier_initializer())\n", 288 | " biases = tf.get_variable(name=\"biases1\", shape=[layer_1_nodes], initializer=tf.zeros_initializer())\n", 289 | " layer_1_output = tf.nn.relu(tf.matmul(X, weights) + biases)\n", 290 | "\n", 291 | " # Layer 2\n", 292 | " with tf.variable_scope('layer_2'):\n", 293 | " weights = tf.get_variable(\"weights2\", shape=[layer_1_nodes, layer_2_nodes], initializer=tf.contrib.layers.xavier_initializer())\n", 294 | " biases = tf.get_variable(name=\"biases2\", shape=[layer_2_nodes], initializer=tf.zeros_initializer())\n", 295 | " layer_2_output = tf.nn.relu(tf.matmul(layer_1_output, weights) + biases)\n", 296 | "\n", 297 | " # Layer 3\n", 298 | " with tf.variable_scope('layer_3'):\n", 299 | " weights = tf.get_variable(\"weights3\", shape=[layer_2_nodes, layer_3_nodes], initializer=tf.contrib.layers.xavier_initializer())\n", 300 | " biases = tf.get_variable(name=\"biases3\", shape=[layer_3_nodes], initializer=tf.zeros_initializer())\n", 301 | " layer_3_output = tf.nn.relu(tf.matmul(layer_2_output, weights) + biases)\n", 302 | "\n", 303 | " # Output Layer\n", 304 | " with tf.variable_scope('output'):\n", 305 | " weights = tf.get_variable(\"weights4\", shape=[layer_3_nodes, number_of_outputs])\n", 306 | " biases = tf.get_variable(name=\"biases4\", shape=[number_of_outputs], initializer=tf.zeros_initializer())\n", 307 | " prediction = tf.matmul(layer_3_output, weights) + biases\n", 308 | "\n", 309 | " # Section Two: Define the cost function of the neural network that will measure prediction accuracy during training\n", 310 | "\n", 311 | " with tf.variable_scope('cost'):\n", 312 | " Y = tf.placeholder(tf.float32)\n", 313 | " cost = tf.reduce_mean(abs(prediction-Y))\n", 314 | " \n", 315 | " # Section Three: Define the optimizer function that will be run to optimize the neural network\n", 316 | "\n", 317 | " with tf.variable_scope('train'):\n", 318 | " optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)\n", 319 | " \n", 320 | " # Initialize a session so that we can run TensorFlow operations\n", 321 | " with tf.Session() as session:\n", 322 | "\n", 323 | " # Run the global variable initializer to initialize all variables and layers of the neural network\n", 324 | " session.run(tf.global_variables_initializer())\n", 325 | "\n", 326 | " # Run the optimizer over and over to train the network.\n", 327 | " # One epoch is one full run through the training data set.\n", 328 | " for epoch in range(training_epochs):\n", 329 | "\n", 330 | " # Feed in the training data and do one step of neural network training\n", 331 | " session.run(optimizer, 
feed_dict={X: X_train, Y: Y_train})\n", 332 | "\n", 333 | " # Every 5 training steps, log our progress\n", 334 | " if epoch % 5 == 0:\n", 335 | " training_cost = session.run(cost, feed_dict={X: X_train, Y:Y_train})\n", 336 | " testing_cost = session.run(cost, feed_dict={X: X_test, Y:Y_test})\n", 337 | "\n", 338 | " print(epoch, training_cost, testing_cost)\n", 339 | "\n", 340 | " # Training is now complete!\n", 341 | " print(\"Training is complete!\")\n", 342 | "\n", 343 | " final_training_cost = session.run(cost, feed_dict={X: X_train, Y: Y_train})\n", 344 | " final_testing_cost = session.run(cost, feed_dict={X: X_test, Y: Y_test})\n", 345 | "\n", 346 | " print(\"Final Training cost: {}\".format(final_training_cost))\n", 347 | " print(\"Final Testing cost: {}\".format(final_testing_cost))\n", 348 | " " 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 22, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "0 6.3913636 6.330703\n", 361 | "5 5.9288187 5.916041\n", 362 | "10 5.5377703 5.5372915\n", 363 | "15 5.135031 5.1657634\n", 364 | "20 4.736758 4.78527\n", 365 | "25 4.3379664 4.392524\n", 366 | "30 3.9444056 4.0002966\n", 367 | "35 3.5720854 3.6296792\n", 368 | "40 3.2340565 3.2989545\n", 369 | "45 2.943415 3.0183892\n", 370 | "50 2.7128747 2.8018372\n", 371 | "55 2.5437698 2.6515915\n", 372 | "60 2.4343283 2.5563335\n", 373 | "65 2.3604124 2.4999278\n", 374 | "70 2.3145652 2.4660306\n", 375 | "75 2.279439 2.4442096\n", 376 | "80 2.2531784 2.4279125\n", 377 | "85 2.2313697 2.4141328\n", 378 | "90 2.2133071 2.4030728\n", 379 | "95 2.1971588 2.3945718\n", 380 | "100 2.1838992 2.387891\n", 381 | "105 2.1721396 2.382463\n", 382 | "110 2.162445 2.3777144\n", 383 | "115 2.154373 2.3734365\n", 384 | "120 2.1467428 2.3697414\n", 385 | "125 2.1401405 2.3663604\n", 386 | "130 2.133463 2.3634777\n", 387 | "135 2.1269307 2.360345\n", 388 | "140 2.121909 2.3576157\n", 389 | "145 2.1158257 2.354647\n", 390 | "150 2.1107652 2.3515654\n", 391 | "155 2.1055086 2.34805\n", 392 | "160 2.100851 2.3449965\n", 393 | "165 2.097205 2.3415732\n", 394 | "170 2.0933006 2.3382661\n", 395 | "175 2.0886867 2.3346088\n", 396 | "180 2.0846198 2.3310363\n", 397 | "185 2.0806203 2.327309\n", 398 | "190 2.076965 2.3236957\n", 399 | "195 2.0726862 2.3197658\n", 400 | "Training is complete!\n", 401 | "Final Training cost: 2.0701630115509033\n", 402 | "Final Testing cost: 2.3166565895080566\n", 403 | "0 6.386008 6.3886228\n", 404 | "5 5.9259405 5.9336896\n", 405 | "10 5.503696 5.504508\n" 406 | ] 407 | }, 408 | { 409 | "ename": "KeyboardInterrupt", 410 | "evalue": "", 411 | "output_type": "error", 412 | "traceback": [ 413 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 414 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 415 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 71\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 72\u001b[0m \u001b[0;31m# Feed in the training data and do one step of neural network training\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 73\u001b[0;31m \u001b[0msession\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moptimizer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mY\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mY_train\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 74\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 75\u001b[0m \u001b[0;31m# Every 5 training steps, log our progress\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 416 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 898\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 899\u001b[0m result = self._run(None, fetches, feed_dict, options_ptr,\n\u001b[0;32m--> 900\u001b[0;31m run_metadata_ptr)\n\u001b[0m\u001b[1;32m 901\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mrun_metadata\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 902\u001b[0m \u001b[0mproto_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtf_session\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTF_GetBuffer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun_metadata_ptr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 417 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_run\u001b[0;34m(self, handle, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 1133\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mfinal_fetches\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mfinal_targets\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mhandle\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mfeed_dict_tensor\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1134\u001b[0m results = self._do_run(handle, final_targets, final_fetches,\n\u001b[0;32m-> 1135\u001b[0;31m feed_dict_tensor, options, run_metadata)\n\u001b[0m\u001b[1;32m 1136\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1137\u001b[0m \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 418 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_do_run\u001b[0;34m(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 1314\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mhandle\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1315\u001b[0m return self._do_call(_run_fn, feeds, fetches, targets, options,\n\u001b[0;32m-> 1316\u001b[0;31m run_metadata)\n\u001b[0m\u001b[1;32m 1317\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1318\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_do_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_prun_fn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeeds\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetches\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 419 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_do_call\u001b[0;34m(self, fn, *args)\u001b[0m\n\u001b[1;32m 1320\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_do_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mfn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1321\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1322\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1323\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mOpError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1324\u001b[0m \u001b[0mmessage\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcompat\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mas_text\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 420 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_run_fn\u001b[0;34m(feed_dict, fetch_list, target_list, options, run_metadata)\u001b[0m\n\u001b[1;32m 1305\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_extend_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1306\u001b[0m return self._call_tf_sessionrun(\n\u001b[0;32m-> 1307\u001b[0;31m options, feed_dict, fetch_list, target_list, run_metadata)\n\u001b[0m\u001b[1;32m 1308\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1309\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_prun_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 421 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_call_tf_sessionrun\u001b[0;34m(self, options, feed_dict, fetch_list, target_list, run_metadata)\u001b[0m\n\u001b[1;32m 1407\u001b[0m return tf_session.TF_SessionRun_wrapper(\n\u001b[1;32m 1408\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_session\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptions\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_list\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1409\u001b[0;31m run_metadata)\n\u001b[0m\u001b[1;32m 1410\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1411\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_exception_on_not_ok_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mstatus\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 422 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 423 | ] 424 | } 425 | ], 426 | "source": [ 427 | "from sklearn import ensemble\n", 428 | "from sklearn.metrics import mean_squared_error,mean_absolute_error\n", 429 | "from sklearn.cross_validation import StratifiedKFold\n", 430 | "\n", 431 | "\n", 432 | "n_folds = 5\n", 433 | "avg_mad = 0\n", 434 | "skf = StratifiedKFold(score_result, n_folds)\n", 435 | "\n", 436 | "\n", 437 | "for train_index, test_index in skf:\n", 438 | " X_train, X_test = description_features[train_index], 
description_features[test_index]\n", 439 | " Y_train, Y_test = score_result[train_index], score_result[test_index]\n", 440 | " \n", 441 | " \n", 442 | " #regularisation\n", 443 | " regularizer = tf.contrib.layers.l2_regularizer(scale=0.01)\n", 444 | " \n", 445 | " # Define model parameters\n", 446 | " learning_rate = 0.001\n", 447 | " training_epochs = 200\n", 448 | "\n", 449 | " # Define how many inputs and outputs are in our neural network\n", 450 | " number_of_inputs = 500\n", 451 | " number_of_outputs = 1\n", 452 | "\n", 453 | " # Define how many neurons we want in each layer of our neural network\n", 454 | " layer_1_nodes = 50\n", 455 | " layer_2_nodes = 100\n", 456 | " layer_3_nodes = 50\n", 457 | " \n", 458 | " tf.reset_default_graph()\n", 459 | " \n", 460 | " # Input Layer\n", 461 | " with tf.variable_scope('input'):\n", 462 | " X = tf.placeholder(tf.float32, shape=(None, number_of_inputs))\n", 463 | "\n", 464 | " # Layer 1\n", 465 | " with tf.variable_scope('layer_1'):\n", 466 | " weights = tf.get_variable(\"weights1\", regularizer=regularizer,shape=[number_of_inputs, layer_1_nodes], initializer=tf.contrib.layers.xavier_initializer())\n", 467 | " biases = tf.get_variable(name=\"biases1\", shape=[layer_1_nodes], initializer=tf.zeros_initializer())\n", 468 | " layer_1_output = tf.nn.relu(tf.matmul(X, weights) + biases)\n", 469 | "\n", 470 | "\n", 471 | " # Output Layer\n", 472 | " with tf.variable_scope('output'):\n", 473 | " weights = tf.get_variable(\"weights4\", regularizer=regularizer, shape=[layer_1_nodes, number_of_outputs])\n", 474 | " biases = tf.get_variable(name=\"biases4\", shape=[number_of_outputs], initializer=tf.zeros_initializer())\n", 475 | " prediction = tf.matmul(layer_1_output, weights) + biases\n", 476 | "\n", 477 | " # Section Two: Define the cost function of the neural network that will measure prediction accuracy during training\n", 478 | "\n", 479 | " with tf.variable_scope('cost'):\n", 480 | " Y = tf.placeholder(tf.float32)\n", 481 | " cost = tf.reduce_mean(abs(prediction-Y))\n", 482 | " \n", 483 | " # Section Three: Define the optimizer function that will be run to optimize the neural network\n", 484 | "\n", 485 | " with tf.variable_scope('train'):\n", 486 | " optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)\n", 487 | " \n", 488 | " # Initialize a session so that we can run TensorFlow operations\n", 489 | " with tf.Session() as session:\n", 490 | "\n", 491 | " # Run the global variable initializer to initialize all variables and layers of the neural network\n", 492 | " session.run(tf.global_variables_initializer())\n", 493 | "\n", 494 | " # Run the optimizer over and over to train the network.\n", 495 | " # One epoch is one full run through the training data set.\n", 496 | " for epoch in range(training_epochs):\n", 497 | "\n", 498 | " # Feed in the training data and do one step of neural network training\n", 499 | " session.run(optimizer, feed_dict={X: X_train, Y: Y_train})\n", 500 | "\n", 501 | " # Every 5 training steps, log our progress\n", 502 | " if epoch % 5 == 0:\n", 503 | " training_cost = session.run(cost, feed_dict={X: X_train, Y:Y_train})\n", 504 | " testing_cost = session.run(cost, feed_dict={X: X_test, Y:Y_test})\n", 505 | "\n", 506 | " print(epoch, training_cost, testing_cost)\n", 507 | "\n", 508 | " # Training is now complete!\n", 509 | " print(\"Training is complete!\")\n", 510 | "\n", 511 | " final_training_cost = session.run(cost, feed_dict={X: X_train, Y: Y_train})\n", 512 | " final_testing_cost = session.run(cost, 
feed_dict={X: X_test, Y: Y_test})\n", 513 | "\n", 514 | " print(\"Final Training cost: {}\".format(final_training_cost))\n", 515 | " print(\"Final Testing cost: {}\".format(final_testing_cost))" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 23, 521 | "metadata": {}, 522 | "outputs": [ 523 | { 524 | "name": "stdout", 525 | "output_type": "stream", 526 | "text": [ 527 | "0 6.083245 6.080838\n", 528 | "5 5.5260377 5.5600276\n", 529 | "10 4.971763 5.038488\n", 530 | "15 4.4246273 4.5104346\n", 531 | "20 3.9139543 3.9918978\n", 532 | "25 3.4488997 3.524602\n", 533 | "30 3.0483305 3.1278381\n", 534 | "35 2.7404468 2.828259\n", 535 | "40 2.53088 2.6351495\n", 536 | "45 2.4048758 2.5277576\n", 537 | "50 2.334103 2.4732242\n", 538 | "55 2.2906017 2.4425502\n", 539 | "60 2.2589564 2.4197042\n", 540 | "65 2.2281182 2.3995092\n", 541 | "70 2.202884 2.38372\n", 542 | "75 2.1825671 2.3724823\n", 543 | "80 2.1671972 2.364674\n", 544 | "85 2.1542888 2.3584723\n", 545 | "90 2.144201 2.3531485\n", 546 | "95 2.1350422 2.348176\n", 547 | "100 2.1265638 2.3436902\n", 548 | "105 2.1175482 2.3393404\n", 549 | "110 2.1106508 2.3347213\n", 550 | "115 2.1047485 2.3300486\n", 551 | "120 2.098937 2.3253934\n", 552 | "125 2.0932546 2.3205848\n", 553 | "130 2.0880895 2.315838\n", 554 | "135 2.0822582 2.3107066\n", 555 | "140 2.0773337 2.3060513\n", 556 | "145 2.073586 2.3012295\n", 557 | "150 2.068536 2.2963436\n", 558 | "155 2.0648196 2.2918763\n", 559 | "160 2.0598803 2.2871017\n", 560 | "165 2.0558808 2.2826457\n", 561 | "170 2.052055 2.2777278\n", 562 | "175 2.0484335 2.2728\n", 563 | "180 2.0448995 2.2680101\n", 564 | "185 2.0410495 2.2634091\n", 565 | "190 2.037872 2.258481\n", 566 | "195 2.033979 2.2536285\n", 567 | "Training is complete!\n", 568 | "Final Training cost: 2.031540870666504\n", 569 | "Final Testing cost: 2.2498011589050293\n", 570 | "0 5.786031 5.7584634\n", 571 | "5 5.276452 5.24619\n", 572 | "10 4.7634377 4.7501187\n", 573 | "15 4.2564397 4.2635527\n", 574 | "20 3.763897 3.7807336\n", 575 | "25 3.3099797 3.3099504\n", 576 | "30 2.9288201 2.879146\n", 577 | "35 2.6489623 2.5442593\n", 578 | "40 2.4751177 2.3303452\n", 579 | "45 2.3823915 2.2176428\n", 580 | "50 2.3349264 2.163645\n", 581 | "55 2.3025234 2.140226\n", 582 | "60 2.272921 2.124854\n", 583 | "65 2.2469532 2.1120722\n", 584 | "70 2.2247584 2.1003363\n", 585 | "75 2.2082613 2.0902684\n", 586 | "80 2.1953907 2.0815587\n", 587 | "85 2.183322 2.0747714\n", 588 | "90 2.1739733 2.0689385\n", 589 | "95 2.1655912 2.0635607\n", 590 | "100 2.1566525 2.0588613\n", 591 | "105 2.149286 2.054494\n", 592 | "110 2.1417162 2.0503848\n", 593 | "115 2.135081 2.0465348\n", 594 | "120 2.128687 2.042894\n", 595 | "125 2.1223211 2.039422\n", 596 | "130 2.1166208 2.0362449\n", 597 | "135 2.111051 2.0333471\n", 598 | "140 2.1053548 2.030591\n", 599 | "145 2.1002746 2.0280256\n", 600 | "150 2.0952697 2.0257974\n", 601 | "155 2.0905244 2.0235007\n", 602 | "160 2.0863369 2.0213807\n", 603 | "165 2.0814583 2.019396\n", 604 | "170 2.0768123 2.0173519\n", 605 | "175 2.0723486 2.0154889\n", 606 | "180 2.0683796 2.013409\n", 607 | "185 2.0639787 2.011601\n", 608 | "190 2.060075 2.009731\n", 609 | "195 2.0561123 2.007988\n", 610 | "Training is complete!\n", 611 | "Final Training cost: 2.0533547401428223\n", 612 | "Final Testing cost: 2.0065207481384277\n", 613 | "0 5.8722553 5.874944\n", 614 | "5 5.2967777 5.289318\n", 615 | "10 4.730514 4.7286253\n", 616 | "15 4.172605 4.1904984\n", 617 | "20 3.6481903 3.6828492\n", 618 | "25 3.1920922 
3.2236845\n", 619 | "30 2.827173 2.8439672\n", 620 | "35 2.5852838 2.5776653\n", 621 | "40 2.446994 2.4183152\n", 622 | "45 2.3764029 2.3357441\n", 623 | "50 2.3387005 2.2936857\n", 624 | "55 2.3070948 2.263823\n", 625 | "60 2.2776423 2.2371588\n", 626 | "65 2.2515697 2.2135673\n", 627 | "70 2.2304595 2.1949704\n", 628 | "75 2.213401 2.1802068\n", 629 | "80 2.2003808 2.1689966\n", 630 | "85 2.1907675 2.15936\n", 631 | "90 2.179741 2.1498966\n", 632 | "95 2.169519 2.1411004\n", 633 | "100 2.161231 2.1332774\n", 634 | "105 2.1525738 2.1262095\n", 635 | "110 2.1449878 2.119555\n", 636 | "115 2.13767 2.1135612\n", 637 | "120 2.1307268 2.1081827\n", 638 | "125 2.1245189 2.1029258\n", 639 | "130 2.1180696 2.0984974\n", 640 | "135 2.1126537 2.09404\n", 641 | "140 2.1078687 2.0902355\n", 642 | "145 2.1021779 2.0865002\n", 643 | "150 2.0966923 2.0831409\n", 644 | "155 2.0912092 2.079788\n", 645 | "160 2.0869124 2.0768535\n", 646 | "165 2.0826051 2.0739443\n", 647 | "170 2.0790567 2.0713716\n", 648 | "175 2.0751283 2.0687284\n", 649 | "180 2.0720415 2.0664299\n", 650 | "185 2.0684783 2.0640426\n", 651 | "190 2.0643828 2.0617573\n", 652 | "195 2.0611863 2.059693\n", 653 | "Training is complete!\n", 654 | "Final Training cost: 2.0586845874786377\n", 655 | "Final Testing cost: 2.05816650390625\n", 656 | "0 6.003251 6.0009375\n", 657 | "5 5.5456467 5.546373\n", 658 | "10 5.0589247 5.0673103\n", 659 | "15 4.5497584 4.5540247\n", 660 | "20 4.034681 4.0333495\n", 661 | "25 3.561919 3.535421\n", 662 | "30 3.1461713 3.0988853\n", 663 | "35 2.8193307 2.7726498\n", 664 | "40 2.5958555 2.5585785\n", 665 | "45 2.461613 2.4354646\n", 666 | "50 2.3848336 2.3735893\n", 667 | "55 2.3369153 2.3382103\n", 668 | "60 2.2982998 2.3076613\n", 669 | "65 2.2672374 2.2773237\n", 670 | "70 2.2413843 2.2498527\n", 671 | "75 2.2221472 2.2279947\n", 672 | "80 2.207646 2.2105472\n", 673 | "85 2.1952562 2.1963797\n", 674 | "90 2.1836543 2.1846797\n", 675 | "95 2.1732273 2.1742082\n", 676 | "100 2.1631167 2.165288\n", 677 | "105 2.1544719 2.1570706\n", 678 | "110 2.1463413 2.149796\n", 679 | "115 2.1387305 2.1428277\n", 680 | "120 2.1325114 2.1362693\n", 681 | "125 2.1261013 2.1302793\n", 682 | "130 2.1203005 2.1242847\n", 683 | "135 2.11481 2.118746\n", 684 | "140 2.1088216 2.1140485\n", 685 | "145 2.103245 2.1090016\n", 686 | "150 2.097926 2.1047196\n", 687 | "155 2.0930274 2.1006134\n", 688 | "160 2.088081 2.0965412\n", 689 | "165 2.0831625 2.0928378\n", 690 | "170 2.0787213 2.0891888\n", 691 | "175 2.074456 2.086069\n", 692 | "180 2.070619 2.0830286\n", 693 | "185 2.0668306 2.0800366\n", 694 | "190 2.0631332 2.076991\n", 695 | "195 2.0599709 2.0743442\n", 696 | "Training is complete!\n", 697 | "Final Training cost: 2.0569071769714355\n", 698 | "Final Testing cost: 2.0722782611846924\n", 699 | "0 5.8064704 5.7994876\n", 700 | "5 5.276487 5.2141395\n", 701 | "10 4.7305403 4.65042\n", 702 | "15 4.1816673 4.1109486\n", 703 | "20 3.6428988 3.6145203\n", 704 | "25 3.1623962 3.2038553\n", 705 | "30 2.777672 2.9087696\n", 706 | "35 2.5253992 2.7344189\n", 707 | "40 2.3897297 2.648426\n", 708 | "45 2.3254619 2.6035998\n", 709 | "50 2.2896852 2.5651326\n", 710 | "55 2.2620988 2.5207853\n", 711 | "60 2.2332084 2.467924\n", 712 | "65 2.2075698 2.4195442\n", 713 | "70 2.187628 2.3826606\n", 714 | "75 2.1702995 2.3570383\n", 715 | "80 2.1585033 2.339875\n", 716 | "85 2.147122 2.3270495\n", 717 | "90 2.1368384 2.3164332\n", 718 | "95 2.128505 2.306792\n", 719 | "100 2.1195886 2.2983844\n", 720 | "105 2.1117687 2.2917898\n", 721 | "110 
2.1043434 2.286257\n", 722 | "115 2.0971017 2.2811835\n", 723 | "120 2.0913007 2.2769423\n", 724 | "125 2.0853148 2.2730267\n", 725 | "130 2.0799198 2.2690527\n", 726 | "135 2.0745983 2.2656088\n", 727 | "140 2.069324 2.2622397\n", 728 | "145 2.0640712 2.2592444\n", 729 | "150 2.059761 2.2559388\n", 730 | "155 2.0556314 2.2530465\n", 731 | "160 2.0514395 2.2502763\n", 732 | "165 2.0470777 2.2474878\n", 733 | "170 2.0431151 2.244634\n", 734 | "175 2.0398836 2.2423286\n", 735 | "180 2.0364077 2.2397652\n", 736 | "185 2.0335624 2.2375617\n", 737 | "190 2.0304675 2.2356656\n", 738 | "195 2.0272236 2.2333982\n", 739 | "Training is complete!\n", 740 | "Final Training cost: 2.0245625972747803\n", 741 | "Final Testing cost: 2.2318248748779297\n" 742 | ] 743 | } 744 | ], 745 | "source": [ 746 | "n_folds = 5\n", 747 | "avg_mad = 0\n", 748 | "skf = StratifiedKFold(score_result, n_folds)\n", 749 | "\n", 750 | "\n", 751 | "for train_index, test_index in skf:\n", 752 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 753 | " Y_train, Y_test = score_result[train_index], score_result[test_index]\n", 754 | " \n", 755 | " \n", 756 | " # Define model parameters\n", 757 | " learning_rate = 0.001\n", 758 | " training_epochs = 200\n", 759 | "\n", 760 | " # Define how many inputs and outputs are in our neural network\n", 761 | " number_of_inputs = 500\n", 762 | " number_of_outputs = 1\n", 763 | "\n", 764 | " # Define how many neurons we want in each layer of our neural network\n", 765 | " layer_1_nodes = 50\n", 766 | " layer_2_nodes = 100\n", 767 | " layer_3_nodes = 50\n", 768 | " \n", 769 | " tf.reset_default_graph()\n", 770 | " \n", 771 | " # Input Layer\n", 772 | " with tf.variable_scope('input'):\n", 773 | " X = tf.placeholder(tf.float32, shape=(None, number_of_inputs))\n", 774 | "\n", 775 | " # Layer 1\n", 776 | " with tf.variable_scope('layer_1'):\n", 777 | " weights = tf.get_variable(\"weights1\",shape=[number_of_inputs, layer_1_nodes], initializer=tf.contrib.layers.xavier_initializer())\n", 778 | " biases = tf.get_variable(name=\"biases1\", shape=[layer_1_nodes], initializer=tf.zeros_initializer())\n", 779 | " layer_1_output = tf.nn.relu(tf.matmul(X, weights) + biases)\n", 780 | "\n", 781 | "\n", 782 | " # Output Layer\n", 783 | " with tf.variable_scope('output'):\n", 784 | " weights = tf.get_variable(\"weights4\", shape=[layer_1_nodes, number_of_outputs])\n", 785 | " biases = tf.get_variable(name=\"biases4\", shape=[number_of_outputs], initializer=tf.zeros_initializer())\n", 786 | " prediction = tf.matmul(layer_1_output, weights) + biases\n", 787 | "\n", 788 | " # Section Two: Define the cost function of the neural network that will measure prediction accuracy during training\n", 789 | "\n", 790 | " with tf.variable_scope('cost'):\n", 791 | " Y = tf.placeholder(tf.float32)\n", 792 | " cost = tf.reduce_mean(abs(prediction-Y))\n", 793 | " \n", 794 | " # Section Three: Define the optimizer function that will be run to optimize the neural network\n", 795 | "\n", 796 | " with tf.variable_scope('train'):\n", 797 | " optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)\n", 798 | " \n", 799 | " # Initialize a session so that we can run TensorFlow operations\n", 800 | " with tf.Session() as session:\n", 801 | "\n", 802 | " # Run the global variable initializer to initialize all variables and layers of the neural network\n", 803 | " session.run(tf.global_variables_initializer())\n", 804 | "\n", 805 | " # Run the optimizer over and over to train the 
network.\n", 806 | " # One epoch is one full run through the training data set.\n", 807 | " for epoch in range(training_epochs):\n", 808 | "\n", 809 | " # Feed in the training data and do one step of neural network training\n", 810 | " session.run(optimizer, feed_dict={X: X_train, Y: Y_train})\n", 811 | "\n", 812 | " # Every 5 training steps, log our progress\n", 813 | " if epoch % 5 == 0:\n", 814 | " training_cost = session.run(cost, feed_dict={X: X_train, Y:Y_train})\n", 815 | " testing_cost = session.run(cost, feed_dict={X: X_test, Y:Y_test})\n", 816 | "\n", 817 | " print(epoch, training_cost, testing_cost)\n", 818 | "\n", 819 | " # Training is now complete!\n", 820 | " print(\"Training is complete!\")\n", 821 | "\n", 822 | " final_training_cost = session.run(cost, feed_dict={X: X_train, Y: Y_train})\n", 823 | " final_testing_cost = session.run(cost, feed_dict={X: X_test, Y: Y_test})\n", 824 | "\n", 825 | " print(\"Final Training cost: {}\".format(final_training_cost))\n", 826 | " print(\"Final Testing cost: {}\".format(final_testing_cost))" 827 | ] 828 | }, 829 | { 830 | "cell_type": "code", 831 | "execution_count": 29, 832 | "metadata": {}, 833 | "outputs": [ 834 | { 835 | "name": "stdout", 836 | "output_type": "stream", 837 | "text": [ 838 | "0 6.062321 6.0602183\n", 839 | "5 5.5669346 5.595575\n", 840 | "10 5.0552363 5.1093874\n", 841 | "15 4.5413995 4.6046963\n", 842 | "20 4.0413485 4.0950303\n", 843 | "25 3.5686922 3.6246138\n", 844 | "30 3.1437864 3.2121816\n", 845 | "35 2.7962208 2.8848507\n", 846 | "40 2.553022 2.6630778\n", 847 | "45 2.4085565 2.5366564\n", 848 | "50 2.330534 2.4762907\n", 849 | "55 2.2872834 2.4477959\n", 850 | "60 2.2569604 2.4284132\n", 851 | "65 2.2283502 2.4105759\n", 852 | "70 2.2004101 2.3945217\n", 853 | "75 2.1798089 2.3826413\n", 854 | "80 2.164123 2.3740633\n", 855 | "85 2.1521626 2.367618\n", 856 | "90 2.1415398 2.36214\n", 857 | "95 2.1327498 2.357028\n", 858 | "100 2.1234882 2.3523703\n", 859 | "105 2.116745 2.348004\n", 860 | "110 2.1101744 2.343452\n", 861 | "115 2.1029496 2.338674\n", 862 | "120 2.0973334 2.3335829\n", 863 | "125 2.091523 2.3282762\n", 864 | "130 2.0854359 2.3232312\n", 865 | "135 2.080473 2.3177183\n", 866 | "140 2.0758843 2.3126848\n", 867 | "145 2.0710387 2.3077934\n", 868 | "150 2.066506 2.302633\n", 869 | "155 2.0620923 2.2976885\n", 870 | "160 2.0581756 2.2925594\n", 871 | "165 2.054114 2.2874322\n", 872 | "170 2.0510647 2.282283\n", 873 | "175 2.0470674 2.2772973\n", 874 | "180 2.0432103 2.271976\n", 875 | "185 2.0395427 2.2668421\n", 876 | "190 2.0356562 2.2619812\n", 877 | "195 2.0326715 2.2568953\n", 878 | "200 2.0296624 2.2519739\n", 879 | "205 2.0266175 2.247082\n", 880 | "210 2.0233376 2.242167\n", 881 | "215 2.0201082 2.2373335\n", 882 | "220 2.0173252 2.232589\n", 883 | "225 2.0141754 2.22831\n", 884 | "230 2.011061 2.223964\n", 885 | "235 2.008475 2.2197325\n", 886 | "240 2.0058863 2.215397\n", 887 | "245 2.0035982 2.2111194\n", 888 | "250 2.001077 2.2066622\n", 889 | "255 1.9983191 2.202334\n", 890 | "260 1.9965649 2.1980445\n", 891 | "265 1.9944901 2.193703\n", 892 | "270 1.9916476 2.1895187\n", 893 | "275 1.9893954 2.1850965\n", 894 | "280 1.9869858 2.1810696\n", 895 | "285 1.985123 2.1771178\n", 896 | "290 1.983007 2.1733663\n", 897 | "295 1.980544 2.169393\n", 898 | "Training is complete!\n", 899 | "Final Training cost: 1.9792307615280151\n", 900 | "Final Testing cost: 2.166503667831421\n", 901 | "0 5.9911466 6.0206876\n", 902 | "5 5.470893 5.48928\n", 903 | "10 4.938597 4.950726\n", 904 | "15 4.414492 
4.423245\n", 905 | "20 3.9229224 3.936662\n" 906 | ] 907 | }, 908 | { 909 | "ename": "KeyboardInterrupt", 910 | "evalue": "", 911 | "output_type": "error", 912 | "traceback": [ 913 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 914 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 915 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 66\u001b[0m \u001b[0;31m# Feed in the training data and do one step of neural network training\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 67\u001b[0;31m \u001b[0msession\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moptimizer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mY_train\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 68\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;31m# Every 5 training steps, log our progress\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 916 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 898\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 899\u001b[0m result = self._run(None, fetches, feed_dict, options_ptr,\n\u001b[0;32m--> 900\u001b[0;31m run_metadata_ptr)\n\u001b[0m\u001b[1;32m 901\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mrun_metadata\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 902\u001b[0m \u001b[0mproto_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtf_session\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTF_GetBuffer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun_metadata_ptr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 917 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_run\u001b[0;34m(self, handle, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 1133\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mfinal_fetches\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mfinal_targets\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mhandle\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mfeed_dict_tensor\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1134\u001b[0m results = self._do_run(handle, final_targets, final_fetches,\n\u001b[0;32m-> 1135\u001b[0;31m feed_dict_tensor, options, run_metadata)\n\u001b[0m\u001b[1;32m 1136\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1137\u001b[0m \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 918 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_do_run\u001b[0;34m(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)\u001b[0m\n\u001b[1;32m 1314\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mhandle\u001b[0m \u001b[0;32mis\u001b[0m 
\u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1315\u001b[0m return self._do_call(_run_fn, feeds, fetches, targets, options,\n\u001b[0;32m-> 1316\u001b[0;31m run_metadata)\n\u001b[0m\u001b[1;32m 1317\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1318\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_do_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_prun_fn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeeds\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetches\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 919 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_do_call\u001b[0;34m(self, fn, *args)\u001b[0m\n\u001b[1;32m 1320\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_do_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1321\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1322\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1323\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mOpError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1324\u001b[0m \u001b[0mmessage\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcompat\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mas_text\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 920 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_run_fn\u001b[0;34m(feed_dict, fetch_list, target_list, options, run_metadata)\u001b[0m\n\u001b[1;32m 1305\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_extend_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1306\u001b[0m return self._call_tf_sessionrun(\n\u001b[0;32m-> 1307\u001b[0;31m options, feed_dict, fetch_list, target_list, run_metadata)\n\u001b[0m\u001b[1;32m 1308\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1309\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_prun_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 921 | "\u001b[0;32m/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py\u001b[0m in \u001b[0;36m_call_tf_sessionrun\u001b[0;34m(self, options, feed_dict, fetch_list, target_list, run_metadata)\u001b[0m\n\u001b[1;32m 1407\u001b[0m return tf_session.TF_SessionRun_wrapper(\n\u001b[1;32m 1408\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_session\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptions\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mtarget_list\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1409\u001b[0;31m run_metadata)\n\u001b[0m\u001b[1;32m 1410\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1411\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_exception_on_not_ok_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mstatus\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 922 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 923 | ] 924 | } 925 | ], 926 | "source": [ 927 | "n_folds = 5\n", 928 | "avg_mad = 0\n", 929 | "skf = StratifiedKFold(score_result, n_folds)\n", 930 | "dropout = 0.5\n", 931 | "\n", 932 | "for train_index, test_index in skf:\n", 933 | " X_train, X_test = description_features[train_index], description_features[test_index]\n", 934 | " Y_train, Y_test = score_result[train_index], score_result[test_index]\n", 935 | " \n", 936 | " \n", 937 | " # Define model parameters\n", 938 | " learning_rate = 0.001\n", 939 | " training_epochs = 300\n", 940 | "\n", 941 | " # Define how many inputs and outputs are in our neural network\n", 942 | " number_of_inputs = 500\n", 943 | " number_of_outputs = 1\n", 944 | "\n", 945 | " # Define how many neurons we want in each layer of our neural network\n", 946 | " layer_1_nodes = 50\n", 947 | " layer_2_nodes = 100\n", 948 | " layer_3_nodes = 50\n", 949 | " \n", 950 | " tf.reset_default_graph()\n", 951 | " \n", 952 | " # Input Layer\n", 953 | " with tf.variable_scope('input'):\n", 954 | " X = tf.placeholder(tf.float32, shape=(None, number_of_inputs))\n", 955 | "\n", 956 | " # Layer 1\n", 957 | " with tf.variable_scope('layer_1'):\n", 958 | " weights = tf.get_variable(\"weights1\",shape=[number_of_inputs, layer_1_nodes], initializer=tf.contrib.layers.xavier_initializer())\n", 959 | " biases = tf.get_variable(name=\"biases1\", shape=[layer_1_nodes], initializer=tf.zeros_initializer())\n", 960 | " layer_1_output = tf.nn.relu(tf.matmul(X, weights) + biases)\n", 961 | " layer_1_output = tf.layers.dropout(layer_1_output, rate=dropout)\n", 962 | "\n", 963 | " # Output Layer\n", 964 | " with tf.variable_scope('output'):\n", 965 | " weights = tf.get_variable(\"weights4\", shape=[layer_1_nodes, number_of_outputs])\n", 966 | " biases = tf.get_variable(name=\"biases4\", shape=[number_of_outputs], initializer=tf.zeros_initializer())\n", 967 | " prediction = tf.matmul(layer_1_output, weights) + biases\n", 968 | " prediction = tf.layers.dropout(prediction, rate=dropout)\n", 969 | " \n", 970 | " \n", 971 | " # Section Two: Define the cost function of the neural network that will measure prediction accuracy during training\n", 972 | "\n", 973 | " with tf.variable_scope('cost'):\n", 974 | " Y = tf.placeholder(tf.float32)\n", 975 | " cost = tf.reduce_mean(abs(prediction-Y))\n", 976 | " \n", 977 | " # Section Three: Define the optimizer function that will be run to optimize the neural network\n", 978 | "\n", 979 | " with tf.variable_scope('train'):\n", 980 | " optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)\n", 981 | " \n", 982 | " # Initialize a session so that we can run TensorFlow operations\n", 983 | " with tf.Session() as session:\n", 984 | "\n", 985 | " # Run the global variable initializer to initialize all variables and layers of the neural network\n", 986 | " session.run(tf.global_variables_initializer())\n", 987 | "\n", 988 | " # Run the optimizer over and over to train the 
network.\n", 989 | " # One epoch is one full run through the training data set.\n", 990 | " for epoch in range(training_epochs):\n", 991 | "\n", 992 | " # Feed in the training data and do one step of neural network training\n", 993 | " session.run(optimizer, feed_dict={X: X_train, Y: Y_train})\n", 994 | "\n", 995 | " # Every 5 training steps, log our progress\n", 996 | " if epoch % 5 == 0:\n", 997 | " training_cost = session.run(cost, feed_dict={X: X_train, Y:Y_train})\n", 998 | " testing_cost = session.run(cost, feed_dict={X: X_test, Y:Y_test})\n", 999 | "\n", 1000 | " print(epoch, training_cost, testing_cost)\n", 1001 | "\n", 1002 | " # Training is now complete!\n", 1003 | " print(\"Training is complete!\")\n", 1004 | "\n", 1005 | " final_training_cost = session.run(cost, feed_dict={X: X_train, Y: Y_train})\n", 1006 | " final_testing_cost = session.run(cost, feed_dict={X: X_test, Y: Y_test})\n", 1007 | "\n", 1008 | " print(\"Final Training cost: {}\".format(final_training_cost))\n", 1009 | " print(\"Final Testing cost: {}\".format(final_testing_cost))" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "code", 1014 | "execution_count": null, 1015 | "metadata": {}, 1016 | "outputs": [], 1017 | "source": [] 1018 | } 1019 | ], 1020 | "metadata": { 1021 | "kernelspec": { 1022 | "display_name": "Python 3", 1023 | "language": "python", 1024 | "name": "python3" 1025 | }, 1026 | "language_info": { 1027 | "codemirror_mode": { 1028 | "name": "ipython", 1029 | "version": 3 1030 | }, 1031 | "file_extension": ".py", 1032 | "mimetype": "text/x-python", 1033 | "name": "python", 1034 | "nbconvert_exporter": "python", 1035 | "pygments_lexer": "ipython3", 1036 | "version": "3.6.4" 1037 | } 1038 | }, 1039 | "nbformat": 4, 1040 | "nbformat_minor": 2 1041 | } 1042 | --------------------------------------------------------------------------------