├── Data processing ├── Input_for_NER_task.ipynb ├── Input_for_RE_task.ipynb └── web_scraping.ipynb ├── KG analysis └── Analysis.ipynb ├── KG construction ├── Input_for_KG_construction_task.ipynb └── KG_construction.ipynb ├── NER ├── WordCharacterEmbedding_BiLSTM_CRF.ipynb ├── ner_BERT+BiLSTM (1).ipynb ├── ner_ELMO+BiLSTM+CRF.ipynb └── ner_bert_base_cased.ipynb ├── README.md └── Relation extraction ├── Aug_RE_BiLSTM_Att.ipynb └── Aug_RE_CNN_Att.ipynb /Data processing/web_scraping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "web scraping.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | }, 17 | "accelerator": "GPU" 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "code", 22 | "source": [ 23 | "from google.colab import drive\n", 24 | "drive.mount('/drive')" 25 | ], 26 | "metadata": { 27 | "colab": { 28 | "base_uri": "https://localhost:8080/" 29 | }, 30 | "id": "Th1dtj3u-9uR", 31 | "outputId": "90de24ec-b59b-485d-cefe-a36bf79aa73a" 32 | }, 33 | "execution_count": null, 34 | "outputs": [ 35 | { 36 | "output_type": "stream", 37 | "name": "stdout", 38 | "text": [ 39 | "Mounted at /drive\n" 40 | ] 41 | } 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "source": [ 47 | "# Extraction of the needed datasets" 48 | ], 49 | "metadata": { 50 | "id": "XTgN78e3DBMv" 51 | } 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "source": [ 56 | "**The purpose of this notebook is to enrich the knowledge graph by extracting information that I consider a valuable addition; this will set my project apart from the existing work on Cybersecurity Knowledge Graphs** 🍀" 57 | ], 58 | "metadata": { 59 | "id": "6Wr5LwYa_TwH" 60 | } 61 | }, 62 | 
{ 63 | "cell_type": "markdown", 64 | "source": [ 65 | "The information I plan to add: the **APT Group**, the **Techniques** used by the attackers, the **tactics**, and the **Mitigations** to counter the attack. Every type of malware belongs to a specific APT group, and from that group we can obtain a set of information about the attacker's offensive methods and the defensive methods to counter them. Unfortunately, this information does not exist in the annotated reports (APT malware notes) that make up my corpus :( . For this reason, I decided to adopt a technique that assigns each malware to the APT group it belongs to. " 66 | ], 67 | "metadata": { 68 | "id": "5YAtJz2hA6cm" 69 | } 70 | }, 71 | { 72 | "cell_type": "code", 73 | "source": [ 74 | "import requests\n", 75 | "from lxml import etree\n", 76 | "import time\n", 77 | "import csv" 78 | ], 79 | "metadata": { 80 | "id": "qdNhtsw4Bsuw" 81 | }, 82 | "execution_count": null, 83 | "outputs": [] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "source": [ 88 | "Tactics_info_list = []\n", 89 | "\n", 90 | "Techniques_info_list = []\n", 91 | "Techniques_url = []\n", 92 | "\n", 93 | "Mitigations_info_list = []\n", 94 | "Mitigations_info_list_temp = []\n", 95 | "Mitigations_url = []\n", 96 | "\n", 97 | "Groups_info_list = []\n", 98 | "Groups_info_list_temp = []\n", 99 | "Groups_url = []" 100 | ], 101 | "metadata": { 102 | "id": "UtVneuB3OKId" 103 | }, 104 | "execution_count": null, 105 | "outputs": [] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "source": [ 110 | "def get_tactics_info(url):\n", 111 | " \n", 112 | " html = requests.get(url)\n", 113 | " selector = etree.HTML(html.text)\n", 114 | " \n", 115 | " Tactics_ID = selector.xpath('//div[@class=\"card-body\"]/div[1]/text()')\n", 116 | " Tactics_Created = selector.xpath('//div[@class=\"card-body\"]/div[2]/text()')\n", 117 | " Tactics_LM = 
selector.xpath('//div[@class=\"card-body\"]/div[3]/text()')\n", 118 | " Tactics_Name = selector.xpath('//div[@class=\"container-fluid\"]/h1/text()')\n", 119 | " Tactics_Intros = selector.xpath('//div[@class=\"container-fluid\"]/div[1]/div[1]/div[1]/p/text()')\n", 120 | " info_list = [Tactics_Name[0].strip(), Tactics_Intros[0], Tactics_ID[0].strip(), Tactics_Created[0], Tactics_LM[0]]\n", 121 | " Tactics_info_list.append(info_list)\n", 122 | "\n", 123 | " \n", 124 | " Techniques_url_infos = selector.xpath('//table[@class=\"table-techniques\"]/tbody')\n", 125 | " for info in Techniques_url_infos:\n", 126 | " Techniques_ID = info.xpath('tr[@class=\"technique\"]/td[1]/a/text()')\n", 127 | " for ID in Techniques_ID:\n", 128 | " Techniques_url.append('https://attack.mitre.org/techniques/{}/'.format(str(ID.strip())))" 129 | ], 130 | "metadata": { 131 | "id": "giuUovZROQbM" 132 | }, 133 | "execution_count": null, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "source": [ 139 | "def get_techniques_info(url):\n", 140 | " headers = {\n", 141 | " 'User-Agent': 'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}\n", 142 | " html = requests.get(url, headers=headers)\n", 143 | " selector = etree.HTML(html.text)\n", 144 | " \n", 145 | " Techniques_Name = selector.xpath('//div[@class=\"container-fluid\"]/h1/text()')\n", 146 | " Techniques_ID = selector.xpath('//div[@class=\"card-body\"]/div/div/span[contains(text(),\"ID\")]/../text()')\n", 147 | " Techniques_Platforms = selector.xpath('//div[@class=\"card-body\"]/div/div/span[contains(text(),\"Platforms\")]/../text()')\n", 148 | " Techniques_Tactic = selector.xpath('//div[@class=\"card-body\"]/div/div/span[contains(text(),\"Tactic\")]/../text()')\n", 149 | "\n", 150 | "\n", 151 | " Techniques_Sub_tec = selector.xpath(\n", 152 | " '//div[@class=\"card-body\"]/div/div/span[contains(text(),\"Sub-techniques\")]/../text()')\n", 153 | " if 
Techniques_Sub_tec[0].strip() != 'No sub-techniques':\n", 154 | " Techniques_Sub_tec = selector.xpath(\n", 155 | " '//div[@class=\"card-body\"]/div/span[contains(text(),\"Sub-techniques\")]/../a/text()')\n", 156 | " Techniques_Sub_tec = str(len(Techniques_Sub_tec)) + ' sub-techniques'\n", 157 | " else:\n", 158 | " Techniques_Sub_tec = Techniques_Sub_tec[0].strip()\n", 159 | " Techniques_PR = selector.xpath(\n", 160 | " '//div[@class=\"card-body\"]/div/span[contains(text(),\"Permissions Required\")]/../text()')\n", 161 | " if Techniques_PR != []:\n", 162 | " Techniques_PR = Techniques_PR[0].strip()\n", 163 | " Techniques_DS = selector.xpath('//div[@class=\"card-body\"]/div/span[contains(text(),\"Data Sources\")]/../text()')\n", 164 | " if Techniques_DS != []:\n", 165 | " Techniques_DS = Techniques_DS[0].strip()\n", 166 | "\n", 167 | " info_list = [Techniques_Name[0].strip(), Techniques_ID[0].strip(), Techniques_Sub_tec,\n", 168 | " Techniques_Tactic[0].replace('\\n', '').replace(' ', ''), Techniques_Platforms[0],\n", 169 | " Techniques_DS, Techniques_PR]\n", 170 | " Techniques_info_list.append(info_list)" 171 | ], 172 | "metadata": { 173 | "id": "K0rzdMtWOa8s" 174 | }, 175 | "execution_count": null, 176 | "outputs": [] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "source": [ 181 | "def get_mitigations_url():\n", 182 | " headers = {\n", 183 | " 'User-Agent': 'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}\n", 184 | " html = requests.get('https://attack.mitre.org/mitigations/enterprise/', headers=headers)\n", 185 | " selector = etree.HTML(html.text)\n", 186 | " \n", 187 | " Mitigation_ID = selector.xpath('//div[@class=\"overflow-x-auto\"]/table/tbody/tr/td[1]/a/text()')\n", 188 | " Mitigation_Name = selector.xpath('//div[@class=\"overflow-x-auto\"]/table/tbody/tr/td[2]/a/text()')\n", 189 | " Mitigation_Des = 
selector.xpath('//div[@class=\"overflow-x-auto\"]/table/tbody/tr/td[3]/text()')\n", 190 | " for i in range(0, len(Mitigation_Des)):\n", 191 | " Mitigations_url.append('https://attack.mitre.org/mitigations/{}/'.format(str(Mitigation_ID[i].strip())))\n", 192 | " info_list = [Mitigation_Name[i], Mitigation_ID[i], Mitigation_Des[i].strip()]\n", 193 | " Mitigations_info_list.append(info_list)\n" 194 | ], 195 | "metadata": { 196 | "id": "eYXDE-ETOf4a" 197 | }, 198 | "execution_count": null, 199 | "outputs": [] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "source": [ 204 | "def get_mitigations_info(url):\n", 205 | " headers = {\n", 206 | " 'User-Agent': 'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}\n", 207 | " html = requests.get(url, headers=headers)\n", 208 | " selector = etree.HTML(html.text)\n", 209 | " \n", 210 | " Tec_Addressed_by_Mitigation = selector.xpath(\n", 211 | " '//div[@class=\"container-fluid\"]/table/tbody/tr[@class=\"technique\"]/td[3]/a/text()')\n", 212 | " Mitigations_info_list_temp.append(Tec_Addressed_by_Mitigation)\n" 213 | ], 214 | "metadata": { 215 | "id": "AffKQ_rtOmVp" 216 | }, 217 | "execution_count": null, 218 | "outputs": [] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "source": [ 223 | "def get_groups_url():\n", 224 | " headers = {\n", 225 | " 'User-Agent': 'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}\n", 226 | " html = requests.get('https://attack.mitre.org/groups/', headers=headers)\n", 227 | " selector = etree.HTML(html.text)\n", 228 | " \n", 229 | " Associated_Groups = selector.xpath('//table[@class=\"table table-bordered table-alternate mt-2\"]/tbody/tr/td[1]/text()')\n", 230 | " for group in Associated_Groups:\n", 231 | " Groups_info_list_temp.append(group.strip())\n", 232 | " Group_ID = selector.xpath('//table[@class=\"table table-bordered table-alternate 
mt-2\"]/tbody/tr/td[2]/a/@href')\n", 233 | " for ID in Group_ID:\n", 234 | " Groups_url.append('https://attack.mitre.org{}/'.format(str(ID)))" 235 | ], 236 | "metadata": { 237 | "id": "10Bs8W1tOqab" 238 | }, 239 | "execution_count": null, 240 | "outputs": [] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "source": [ 245 | "def get_groups_info(url):\n", 246 | " headers = {\n", 247 | " 'User-Agent': 'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}\n", 248 | " html = requests.get(url, headers=headers)\n", 249 | " selector = etree.HTML(html.text)\n", 250 | " \n", 251 | " Tecs_Used_by_Group = selector.xpath(\n", 252 | " '//table[@class=\"table techniques-used table-bordered mt-2\"]/tbody/tr/td[2]/a/text()')\n", 253 | " Group_Name =selector.xpath('//div[@class=\"container-fluid\"]/h1/text()')\n", 254 | " Group_ID =selector.xpath('//div[@class=\"card-body\"]/div[1]/text()')\n", 255 | " info_list = [Group_Name[0].strip(), Group_ID[0].replace(':', '').replace(' ', ''), Tecs_Used_by_Group]\n", 256 | " Groups_info_list.append(info_list)" 257 | ], 258 | "metadata": { 259 | "id": "JmB8ka4GOvhD" 260 | }, 261 | "execution_count": null, 262 | "outputs": [] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "source": [ 267 | "def get_urls1():\n", 268 | " urls = ['https://attack.mitre.org/tactics/TA000{}/'.format(str(i)) for i in range(1, 10)]\n", 269 | " urls.extend(['https://attack.mitre.org/tactics/TA0010/', 'https://attack.mitre.org/tactics/TA0011/',\n", 270 | " 'https://attack.mitre.org/tactics/TA0040/'])\n", 271 | " for url in urls:\n", 272 | " get_tactics_info(url)\n", 273 | " time.sleep(0.05)" 274 | ], 275 | "metadata": { 276 | "id": "P04J3wajOxcf" 277 | }, 278 | "execution_count": null, 279 | "outputs": [] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "source": [ 284 | "def get_urls2():\n", 285 | " for url in Techniques_url:\n", 286 | " get_techniques_info(url)\n", 287 | " time.sleep(0.05)" 
288 | ], 289 | "metadata": { 290 | "id": "XciIJe-sO0az" 291 | }, 292 | "execution_count": null, 293 | "outputs": [] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "source": [ 298 | "def get_urls3():\n", 299 | " get_mitigations_url()\n", 300 | " for url in Mitigations_url:\n", 301 | " get_mitigations_info(url)\n", 302 | " time.sleep(0.05)\n", 303 | "\n", 304 | " for i in range(0, len(Mitigations_info_list_temp)):\n", 305 | " Mitigations_info_list[i].append(Mitigations_info_list_temp[i])" 306 | ], 307 | "metadata": { 308 | "id": "5XZRMLBGQgSJ" 309 | }, 310 | "execution_count": null, 311 | "outputs": [] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "source": [ 316 | "def get_urls4():\n", 317 | " get_groups_url()\n", 318 | " for url in Groups_url:\n", 319 | " get_groups_info(url)\n", 320 | " time.sleep(0.05)\n", 321 | "\n", 322 | " for i in range(0, len(Groups_info_list)):\n", 323 | " Groups_info_list[i].append(Groups_info_list_temp[i])" 324 | ], 325 | "metadata": { 326 | "id": "qLr3Yp5jQo1k" 327 | }, 328 | "execution_count": null, 329 | "outputs": [] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "source": [ 334 | "len(Groups_info_list_temp)" 335 | ], 336 | "metadata": { 337 | "colab": { 338 | "base_uri": "https://localhost:8080/" 339 | }, 340 | "id": "bcEknLync47-", 341 | "outputId": "44da9715-7d0a-4ae8-c30b-3abd20f6e7fc" 342 | }, 343 | "execution_count": null, 344 | "outputs": [ 345 | { 346 | "output_type": "execute_result", 347 | "data": { 348 | "text/plain": [ 349 | "266" 350 | ] 351 | }, 352 | "metadata": {}, 353 | "execution_count": 49 354 | } 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "source": [ 360 | "len(Groups_info_list)" 361 | ], 362 | "metadata": { 363 | "colab": { 364 | "base_uri": "https://localhost:8080/" 365 | }, 366 | "id": "WtXu5pgRc-sW", 367 | "outputId": "f467cf54-ae40-457d-d75c-da82d11e2e0d" 368 | }, 369 | "execution_count": null, 370 | "outputs": [ 371 | { 372 | "output_type": "execute_result", 373 | "data": { 374 | 
"text/plain": [ 375 | "133" 376 | ] 377 | }, 378 | "metadata": {}, 379 | "execution_count": 50 380 | } 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "source": [ 386 | "get_urls1()\n", 387 | "get_urls2()\n", 388 | "get_urls3()\n", 389 | "get_urls4()\n", 390 | " " 391 | ], 392 | "metadata": { 393 | "id": "P1G6HgFPQqtu" 394 | }, 395 | "execution_count": null, 396 | "outputs": [] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "source": [ 401 | "**Datasets headers**" 402 | ], 403 | "metadata": { 404 | "id": "tBzdfpqBRJoR" 405 | } 406 | }, 407 | { 408 | "cell_type": "code", 409 | "source": [ 410 | "header1 = ['Name', 'Intro', 'ID', 'Created', 'Last_Modified']\n", 411 | "header2 = ['Name', 'ID', 'Sub-Tec', 'Tactic', 'Platforms', 'Data Sources', 'Permissions Required']\n", 412 | "header3 = ['Name', 'ID', 'Description', 'Tecs Addressed by Mitigation']\n", 413 | "header4 = ['Name', 'ID', 'Tecs Used by Group', 'Associated Groups']" 414 | ], 415 | "metadata": { 416 | "id": "sBTLneRsRBz4" 417 | }, 418 | "execution_count": null, 419 | "outputs": [] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "source": [ 424 | "**Create csv files**" 425 | ], 426 | "metadata": { 427 | "id": "EvydSAjyRdXR" 428 | } 429 | }, 430 | { 431 | "cell_type": "code", 432 | "source": [ 433 | "csvfile1 = open('ATT&CK MATRICES Tac.csv', 'w', errors='ignore', newline='')\n", 434 | "csvfile2 = open('ATT&CK MATRICES Tec.csv', 'w', errors='ignore', newline='')\n", 435 | "csvfile3 = open('ATT&CK MATRICES Miti.csv', 'w', errors='ignore', newline='')\n", 436 | "csvfile4 = open('ATT&CK MATRICES Group.csv', 'w', errors='ignore', newline='')" 437 | ], 438 | "metadata": { 439 | "id": "PvVGUxghRSvz" 440 | }, 441 | "execution_count": null, 442 | "outputs": [] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "source": [ 447 | "**Filling csv files with extracted data**" 448 | ], 449 | "metadata": { 450 | "id": "8IjJIN-uRjeC" 451 | } 452 | }, 453 | { 454 | "cell_type": "code", 455 | "source": 
[ 456 | "sheet1 = csv.writer(csvfile1)\n", 457 | "sheet2 = csv.writer(csvfile2)\n", 458 | "sheet3 = csv.writer(csvfile3)\n", 459 | "sheet4 = csv.writer(csvfile4)\n", 460 | "\n", 461 | "sheet1.writerow(header1)\n", 462 | "sheet2.writerow(header2)\n", 463 | "sheet3.writerow(header3)\n", 464 | "sheet4.writerow(header4)\n", 465 | " \n", 466 | "for list1 in Tactics_info_list:\n", 467 | " sheet1.writerow(list1)\n", 468 | "for list2 in Techniques_info_list:\n", 469 | " sheet2.writerow(list2)\n", 470 | "for list3 in Mitigations_info_list:\n", 471 | " sheet3.writerow(list3)\n", 472 | "for list4 in Groups_info_list:\n", 473 | " sheet4.writerow(list4)" 474 | ], 475 | "metadata": { 476 | "id": "pMc3ySPVRjzr" 477 | }, 478 | "execution_count": null, 479 | "outputs": [] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "source": [ 484 | "csvfile1.close()\n", 485 | "csvfile2.close()\n", 486 | "csvfile3.close()\n", 487 | "csvfile4.close()" 488 | ], 489 | "metadata": { 490 | "id": "8LTTFTbySIvR" 491 | }, 492 | "execution_count": null, 493 | "outputs": [] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "source": [ 498 | "" 499 | ], 500 | "metadata": { 501 | "id": "7p-l1SH7sOlJ" 502 | }, 503 | "execution_count": null, 504 | "outputs": [] 505 | } 506 | ] 507 | } -------------------------------------------------------------------------------- /KG construction/Input_for_KG_construction_task.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "The goal is to create a dataset containing the triples generated from a set of texts. \n", 7 | "\n", 8 | "- Source node, Target, Relationship label\n", 9 | "\n", 10 | "Since I will build the KG with py2neo, I will also need the labels to which the source and target nodes belong. 
\n", 11 | "\n", 12 | "The output of this notebook is a dataset containing the triples and the labels; this dataset will then be used to build the knowledge graph" 13 | ], 14 | "metadata": { 15 | "id": "eQdXcPSwkEJh" 16 | } 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": { 22 | "colab": { 23 | "base_uri": "https://localhost:8080/" 24 | }, 25 | "id": "311eP72iZ5d_", 26 | "outputId": "67e6ae72-2d6c-4c3f-bd8d-7cecb9c7a480" 27 | }, 28 | "outputs": [ 29 | { 30 | "output_type": "stream", 31 | "name": "stdout", 32 | "text": [ 33 | "Mounted at /drive\n" 34 | ] 35 | } 36 | ], 37 | "source": [ 38 | "from google.colab import drive\n", 39 | "drive.mount('/drive')" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "colab": { 47 | "base_uri": "https://localhost:8080/" 48 | }, 49 | "id": "R-B3zH4C0JWj", 50 | "outputId": "bd11d7c3-4301-4237-8842-b98db483f0f8" 51 | }, 52 | "outputs": [ 53 | { 54 | "output_type": "stream", 55 | "name": "stdout", 56 | "text": [ 57 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 58 | "Collecting torch==1.6.0\n", 59 | " Downloading torch-1.6.0-cp37-cp37m-manylinux1_x86_64.whl (748.8 MB)\n", 60 | "\u001b[K |████████████████████████████████| 748.8 MB 15 kB/s \n", 61 | "\u001b[?25hRequirement already satisfied: future in /usr/local/lib/python3.7/dist-packages (from torch==1.6.0) (0.16.0)\n", 62 | "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from torch==1.6.0) (1.21.6)\n", 63 | "Installing collected packages: torch\n", 64 | " Attempting uninstall: torch\n", 65 | " Found existing installation: torch 1.11.0+cu113\n", 66 | " Uninstalling torch-1.11.0+cu113:\n", 67 | " Successfully uninstalled torch-1.11.0+cu113\n", 68 | "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", 69 | "torchvision 0.12.0+cu113 requires torch==1.11.0, but you have torch 1.6.0 which is incompatible.\n", 70 | "torchtext 0.12.0 requires torch==1.11.0, but you have torch 1.6.0 which is incompatible.\n", 71 | "torchaudio 0.11.0+cu113 requires torch==1.11.0, but you have torch 1.6.0 which is incompatible.\u001b[0m\n", 72 | "Successfully installed torch-1.6.0\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "!pip install torch==1.6.0" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "id": "vsK5ZuVGYT73" 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "GIT_URL_1 = \"https://github.com/kbandla/APTnotes\"\n", 89 | "GIT_URL_2 = \"https://github.com/CyberMonitor/APT_CyberCriminal_Campagin_Collections\"\n", 90 | "SOURCE_PDF_DIR='/drive/My Drive/PDF'\n", 91 | "DST_TXT_DIR='/drive/My Drive/TXT'" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": { 97 | "id": "gQ52tc_2Y2dH" 98 | }, 99 | "source": [ 100 | "## **Download malwares reports**" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": { 107 | "id": "H1JePLD0aIZR" 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "import os" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "id": "8W-kE4SzYZSl" 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "def download_repo(github_url, dst_dir):\n", 123 | " repo_name = github_url.split('/')[-1]\n", 124 | " clone_bash_command = f\"git clone {github_url}.git {dst_dir}/{repo_name}\"\n", 125 | " print(f\"Run: {clone_bash_command}\")\n", 126 | " os.system(clone_bash_command)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "colab": { 134 | "base_uri": "https://localhost:8080/" 135 | }, 136 | "id": "33-XWIZ3YdoO", 137 | "outputId": "8cb622a4-7f51-494d-ca6d-4b27b24ddb69" 138 
| }, 139 | "outputs": [ 140 | { 141 | "name": "stdout", 142 | "output_type": "stream", 143 | "text": [ 144 | "Run: git clone https://github.com/kbandla/APTnotes.git /drive/My Drive/PDF/APTnotes\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "download_repo(GIT_URL_1, SOURCE_PDF_DIR)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": { 156 | "colab": { 157 | "base_uri": "https://localhost:8080/" 158 | }, 159 | "id": "Kby-mdh3m_2Y", 160 | "outputId": "ab419905-c001-494b-da8f-1f7ed86c69b2" 161 | }, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Cloning into '/drive/My Drive/PDF/APTnotes'...\n", 168 | "remote: Enumerating objects: 1612, done.\u001b[K\n", 169 | "remote: Total 1612 (delta 0), reused 0 (delta 0), pack-reused 1612\u001b[K\n", 170 | "Receiving objects: 100% (1612/1612), 456.14 MiB | 16.69 MiB/s, done.\n", 171 | "Resolving deltas: 100% (878/878), done.\n", 172 | "Checking out files: 100% (303/303), done.\n" 173 | ] 174 | } 175 | ], 176 | "source": [ 177 | "!git clone https://github.com/kbandla/APTnotes.git '/drive/My Drive/PDF/APTnotes'" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": { 183 | "id": "1aUhP-rWY_iz" 184 | }, 185 | "source": [ 186 | "## **Convert PDF to TXT**" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": { 193 | "colab": { 194 | "base_uri": "https://localhost:8080/" 195 | }, 196 | "id": "HJLHzC-1fbW4", 197 | "outputId": "16f44017-c5b4-4fdd-eb89-88a2ed036dd9" 198 | }, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "Reading package lists... Done\n", 205 | "Building dependency tree \n", 206 | "Reading state information... 
Done\n", 207 | "The following package was automatically installed and is no longer required:\n", 208 | " libnvidia-common-460\n", 209 | "Use 'apt autoremove' to remove it.\n", 210 | "The following additional packages will be installed:\n", 211 | " libpoppler-cpp0v5\n", 212 | "The following NEW packages will be installed:\n", 213 | " libpoppler-cpp-dev libpoppler-cpp0v5\n", 214 | "0 upgraded, 2 newly installed, 0 to remove and 42 not upgraded.\n", 215 | "Need to get 36.7 kB of archives.\n", 216 | "After this operation, 188 kB of additional disk space will be used.\n", 217 | "Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libpoppler-cpp0v5 amd64 0.62.0-2ubuntu2.12 [28.0 kB]\n", 218 | "Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libpoppler-cpp-dev amd64 0.62.0-2ubuntu2.12 [8,676 B]\n", 219 | "Fetched 36.7 kB in 0s (690 kB/s)\n", 220 | "Selecting previously unselected package libpoppler-cpp0v5:amd64.\n", 221 | "(Reading database ... 155657 files and directories currently installed.)\n", 222 | "Preparing to unpack .../libpoppler-cpp0v5_0.62.0-2ubuntu2.12_amd64.deb ...\n", 223 | "Unpacking libpoppler-cpp0v5:amd64 (0.62.0-2ubuntu2.12) ...\n", 224 | "Selecting previously unselected package libpoppler-cpp-dev:amd64.\n", 225 | "Preparing to unpack .../libpoppler-cpp-dev_0.62.0-2ubuntu2.12_amd64.deb ...\n", 226 | "Unpacking libpoppler-cpp-dev:amd64 (0.62.0-2ubuntu2.12) ...\n", 227 | "Setting up libpoppler-cpp0v5:amd64 (0.62.0-2ubuntu2.12) ...\n", 228 | "Setting up libpoppler-cpp-dev:amd64 (0.62.0-2ubuntu2.12) ...\n", 229 | "Processing triggers for libc-bin (2.27-3ubuntu1.3) ...\n", 230 | "/sbin/ldconfig.real: /usr/local/lib/python3.7/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link\n", 231 | "\n" 232 | ] 233 | } 234 | ], 235 | "source": [ 236 | "!apt-get install libpoppler-cpp-dev" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "metadata": { 243 | "colab": { 244 | "base_uri": 
"https://localhost:8080/" 245 | }, 246 | "id": "5GtxqByZaeTv", 247 | "outputId": "df0deb90-c153-45d0-9ee8-502a35a96423" 248 | }, 249 | "outputs": [ 250 | { 251 | "name": "stdout", 252 | "output_type": "stream", 253 | "text": [ 254 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 255 | "Collecting pdftotext\n", 256 | " Using cached pdftotext-2.2.2.tar.gz (113 kB)\n", 257 | "Building wheels for collected packages: pdftotext\n", 258 | " Building wheel for pdftotext (setup.py) ... \u001b[?25l\u001b[?25hdone\n", 259 | " Created wheel for pdftotext: filename=pdftotext-2.2.2-cp37-cp37m-linux_x86_64.whl size=54932 sha256=41c41745e0e649a953cbc8f886a490e7478824de2af07d7b5d77c3d2284b3efe\n", 260 | " Stored in directory: /root/.cache/pip/wheels/98/19/8e/e8648026db8b7ef3324ad9afa1f7c9109a7e7509846f693ed9\n", 261 | "Successfully built pdftotext\n", 262 | "Installing collected packages: pdftotext\n", 263 | "Successfully installed pdftotext-2.2.2\n" 264 | ] 265 | } 266 | ], 267 | "source": [ 268 | "!pip install pdftotext" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "colab": { 276 | "base_uri": "https://localhost:8080/" 277 | }, 278 | "id": "El3mnvEt-lnH", 279 | "outputId": "c49da6cd-17b0-432e-f33a-972228b89a3f" 280 | }, 281 | "outputs": [ 282 | { 283 | "output_type": "stream", 284 | "name": "stdout", 285 | "text": [ 286 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 287 | "Collecting transformers\n", 288 | " Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)\n", 289 | "\u001b[K |████████████████████████████████| 4.2 MB 11.9 MB/s \n", 290 | "\u001b[?25hRequirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.7.0)\n", 291 | "Collecting pyyaml>=5.1\n", 292 | " Downloading 
PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)\n", 293 | "\u001b[K |████████████████████████████████| 596 kB 53.4 MB/s \n", 294 | "\u001b[?25hCollecting tokenizers!=0.11.3,<0.13,>=0.11.1\n", 295 | " Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)\n", 296 | "\u001b[K |████████████████████████████████| 6.6 MB 49.2 MB/s \n", 297 | "\u001b[?25hRequirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.64.0)\n", 298 | "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from transformers) (21.3)\n", 299 | "Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)\n", 300 | "Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (4.11.4)\n", 301 | "Collecting huggingface-hub<1.0,>=0.1.0\n", 302 | " Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)\n", 303 | "\u001b[K |████████████████████████████████| 86 kB 3.7 MB/s \n", 304 | "\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.21.6)\n", 305 | "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)\n", 306 | "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0,>=0.1.0->transformers) (4.2.0)\n", 307 | "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->transformers) (3.0.9)\n", 308 | "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.8.0)\n", 309 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from 
requests->transformers) (2022.5.18.1)\n", 310 | "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)\n", 311 | "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)\n", 312 | "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)\n", 313 | "Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers\n", 314 | " Attempting uninstall: pyyaml\n", 315 | " Found existing installation: PyYAML 3.13\n", 316 | " Uninstalling PyYAML-3.13:\n", 317 | " Successfully uninstalled PyYAML-3.13\n", 318 | "Successfully installed huggingface-hub-0.7.0 pyyaml-6.0 tokenizers-0.12.1 transformers-4.19.2\n" 319 | ] 320 | } 321 | ], 322 | "source": [ 323 | "!pip install -U transformers" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": { 330 | "id": "L6jFbaPfajLC" 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "from pathlib import Path\n", 335 | "import pdftotext" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "id": "1OHvPkfTZC-n" 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "def pdftotext_converter(source_pdf_dir, dst_txt_dir):\n", 347 | " source_pdf_dir = Path(source_pdf_dir)\n", 348 | " dst_txt_dir = Path(dst_txt_dir)\n", 349 | " print(f\"pdf_dir : {source_pdf_dir}\")\n", 350 | " print(f\"dst_txt_dir : {dst_txt_dir}\")\n", 351 | " bad_counter = 0\n", 352 | " for i, pdf_path in enumerate(source_pdf_dir.rglob(\"*pdf\")):\n", 353 | " rel_pdf_path = pdf_path.relative_to(source_pdf_dir)\n", 354 | " dst_path = dst_txt_dir / f\"{rel_pdf_path}.txt\"\n", 355 | " dst_path.parent.mkdir(exist_ok=True, parents=True)\n", 356 | " # Load your PDF\n", 357 | " try:\n", 358 | " with open(pdf_path, \"rb\") as f:\n", 359 
| " pdf = pdftotext.PDF(f)\n", 360 | "\n", 361 | " with open(dst_path, 'w') as f:\n", 362 | " f.write(\"\\n\\n\".join(pdf))\n", 363 | "\n", 364 | " print(\"converted\", i - bad_counter, dst_path)\n", 365 | " except Exception as e:\n", 366 | " print(e, i, dst_path)\n", 367 | " bad_counter += 1" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": { 374 | "colab": { 375 | "base_uri": "https://localhost:8080/" 376 | }, 377 | "id": "1oVOZ6XggghX", 378 | "outputId": "5e9a914e-4a80-4ab2-d064-8e887ec85ccf" 379 | }, 380 | "outputs": [ 381 | { 382 | "name": "stdout", 383 | "output_type": "stream", 384 | "text": [ 385 | "pdf_dir : /drive/My Drive/PDF/APTnotes\n", 386 | "dst_txt_dir : /drive/My Drive/TXT/APTnotes\n", 387 | "converted 0 /drive/My Drive/TXT/APTnotes/2008/556_10535_798405_Annex87_CyberAttacks.pdf.txt\n", 388 | "converted 1 /drive/My Drive/TXT/APTnotes/2009/ghostnet.pdf.txt\n", 389 | "converted 2 /drive/My Drive/TXT/APTnotes/2010/Aurora_Botnet_Command_Structure.pdf.txt\n", 390 | "converted 3 /drive/My Drive/TXT/APTnotes/2010/Aurora_HBGARY_DRAFT.pdf.txt\n", 391 | "converted 4 /drive/My Drive/TXT/APTnotes/2010/Case_Study_Operation_Aurora_V11.pdf.txt\n", 392 | "converted 5 /drive/My Drive/TXT/APTnotes/2010/Combating Threats - Operation Aurora.pdf.txt\n", 393 | "converted 6 /drive/My Drive/TXT/APTnotes/2010/MSUpdaterTrojanWhitepaper.pdf.txt\n", 394 | "converted 7 /drive/My Drive/TXT/APTnotes/2010/WhitePaper HBGary Threat Report, Operation Aurora.pdf.txt\n", 395 | "converted 8 /drive/My Drive/TXT/APTnotes/2010/how_can_u_tell_Aurora.pdf.txt\n", 396 | "converted 9 /drive/My Drive/TXT/APTnotes/2010/in-depth_analysis_of_hydraq_final_231538.pdf.txt\n", 397 | "converted 10 /drive/My Drive/TXT/APTnotes/2010/shadows-in-the-cloud.pdf.txt\n", 398 | "converted 11 /drive/My Drive/TXT/APTnotes/2011/Alerts DL-2011 Alerts-A-2011-02-18-01 Night Dragon Attachment 1.pdf.txt\n", 399 | "converted 12 /drive/My 
Drive/TXT/APTnotes/2011/C5_APT_ADecadeInReview.pdf.txt\n", 400 | "converted 13 /drive/My Drive/TXT/APTnotes/2011/C5_APT_SKHack.pdf.txt\n", 401 | "converted 14 /drive/My Drive/TXT/APTnotes/2011/Duqu_Trojan_Questions_and_Answers.pdf.txt\n", 402 | "converted 15 /drive/My Drive/TXT/APTnotes/2011/Evolution_Drivers_Duqu_Stuxnet.pdf.txt\n", 403 | "converted 16 /drive/My Drive/TXT/APTnotes/2011/HTran_and_the_Advanced_Persistent_Threat.pdf.txt\n", 404 | "converted 17 /drive/My Drive/TXT/APTnotes/2011/Palebot_Palestinian_credentials.pdf.txt\n", 405 | "converted 18 /drive/My Drive/TXT/APTnotes/2011/Stuxnet_Under_the_Microscope.pdf.txt\n", 406 | "converted 19 /drive/My Drive/TXT/APTnotes/2011/shady_rat_vanity.pdf.txt\n", 407 | "converted 20 /drive/My Drive/TXT/APTnotes/2011/tb_advanced_persistent_threats.pdf.txt\n", 408 | "converted 21 /drive/My Drive/TXT/APTnotes/2011/the_nitro_attacks.pdf.txt\n", 409 | "converted 22 /drive/My Drive/TXT/APTnotes/2011/w32_stuxnet_dossier.pdf.txt\n", 410 | "converted 23 /drive/My Drive/TXT/APTnotes/2011/wp-global-energy-cyberattacks-night-dragon.pdf.txt\n", 411 | "converted 24 /drive/My Drive/TXT/APTnotes/2011/wp-operation-shady-rat.pdf.txt\n", 412 | "converted 25 /drive/My Drive/TXT/APTnotes/2011/wp_dissecting-lurid-apt.pdf.txt\n", 413 | "converted 26 /drive/My Drive/TXT/APTnotes/2012/Crouching_tiger_hidden_dragon.pdf.txt\n", 414 | "converted 27 /drive/My Drive/TXT/APTnotes/2012/Crypto-DarkComet-Report.pdf.txt\n", 415 | "converted 28 /drive/My Drive/TXT/APTnotes/2012/Cyberattack_against_Israeli_and_Palestinian_targets.pdf.txt\n", 416 | "converted 29 /drive/My Drive/TXT/APTnotes/2012/FTA 1007 - Shamoon.pdf.txt\n", 417 | "converted 30 /drive/My Drive/TXT/APTnotes/2012/Faces_Ghost_RAT.pdf.txt\n", 418 | "converted 31 /drive/My Drive/TXT/APTnotes/2012/From-Bahrain-With-Love-FinFishers-Spy-Kit-Exposed.pdf.txt\n", 419 | "converted 32 /drive/My Drive/TXT/APTnotes/2012/IEXPL0RE_RAT.pdf.txt\n", 420 | "converted 33 /drive/My 
Drive/TXT/APTnotes/2012/NormanShark-MaudiOperation.pdf.txt\n", 421 | "converted 34 /drive/My Drive/TXT/APTnotes/2012/OSX_SabPub.pdf.txt\n", 422 | "converted 35 /drive/My Drive/TXT/APTnotes/2012/PEST-CONTROL.pdf.txt\n", 423 | "converted 36 /drive/My Drive/TXT/APTnotes/2012/The_Madi_Infostealers.pdf.txt\n", 424 | "converted 37 /drive/My Drive/TXT/APTnotes/2012/The_Mirage_Campaign.pdf.txt\n", 425 | "converted 38 /drive/My Drive/TXT/APTnotes/2012/The_Sin_Digoo_Affair.pdf.txt\n", 426 | "converted 39 /drive/My Drive/TXT/APTnotes/2012/Tibet_Lurk.pdf.txt\n", 427 | "converted 40 /drive/My Drive/TXT/APTnotes/2012/VOHO_WP_FINAL_READY-FOR-Publication-09242012_AC.pdf.txt\n", 428 | "converted 41 /drive/My Drive/TXT/APTnotes/2012/WickedRose_andNCPH.pdf.txt\n", 429 | "converted 42 /drive/My Drive/TXT/APTnotes/2012/kaspersky-lab-gauss.pdf.txt\n", 430 | "converted 43 /drive/My Drive/TXT/APTnotes/2012/skywiper.pdf.txt\n", 431 | "converted 44 /drive/My Drive/TXT/APTnotes/2012/the-elderwood-project.pdf.txt\n", 432 | "converted 45 /drive/My Drive/TXT/APTnotes/2012/trojan_taidoor-targeting_think_tanks.pdf.txt\n", 433 | "converted 46 /drive/My Drive/TXT/APTnotes/2012/w32_flamer_newsforyou.pdf.txt\n", 434 | "converted 47 /drive/My Drive/TXT/APTnotes/2012/wp_ixeshe.pdf.txt\n", 435 | "converted 48 /drive/My Drive/TXT/APTnotes/2012/wp_luckycat_redux.pdf.txt\n", 436 | "converted 49 /drive/My Drive/TXT/APTnotes/2012/wp_the-heartbeat-apt-campaign.pdf.txt\n", 437 | "converted 50 /drive/My Drive/TXT/APTnotes/2013/15-2013-youonlyclicktwice.pdf.txt\n", 438 | "converted 51 /drive/My Drive/TXT/APTnotes/2013/19-2013-acalltoharm.pdf.txt\n", 439 | "converted 52 /drive/My Drive/TXT/APTnotes/2013/2013-9.pdf.txt\n", 440 | "converted 53 /drive/My Drive/TXT/APTnotes/2013/2q-report-on-targeted-attack-campaigns.pdf.txt\n", 441 | "converted 54 /drive/My Drive/TXT/APTnotes/2013/ByeBye_Shell_target.pdf.txt\n", 442 | "converted 55 /drive/My Drive/TXT/APTnotes/2013/C5_APT_C2InTheFifthDomain.pdf.txt\n", 443 | 
"converted 56 /drive/My Drive/TXT/APTnotes/2013/Dark_Seoul_Cyberattack.pdf.txt\n", 444 | "converted 57 /drive/My Drive/TXT/APTnotes/2013/ETSO_APT_Attacks_Analysis.pdf.txt\n", 445 | "converted 58 /drive/My Drive/TXT/APTnotes/2013/FTA 1010 - njRAT The Saga Continues.pdf.txt\n", 446 | "converted 59 /drive/My Drive/TXT/APTnotes/2013/FireEye-Terminator_RAT.pdf.txt\n", 447 | "converted 60 /drive/My Drive/TXT/APTnotes/2013/India_Pak_Tranchulas.pdf.txt\n", 448 | "converted 61 /drive/My Drive/TXT/APTnotes/2013/Inside_Report_by_Infosec_Consortium.pdf.txt\n", 449 | "converted 62 /drive/My Drive/TXT/APTnotes/2013/KeyBoy_Vietnam_India.pdf.txt\n", 450 | "converted 63 /drive/My Drive/TXT/APTnotes/2013/Kimsuky.pdf.txt\n", 451 | "converted 64 /drive/My Drive/TXT/APTnotes/2013/Mandiant_APT1_Report.pdf.txt\n", 452 | "converted 65 /drive/My Drive/TXT/APTnotes/2013/McAfee_Labs_Threat_Advisory_Exploit_Operation_Red_Oct.pdf.txt\n", 453 | "converted 66 /drive/My Drive/TXT/APTnotes/2013/MiniDuke_Paper_Final.pdf.txt\n", 454 | "converted 67 /drive/My Drive/TXT/APTnotes/2013/NS-Unveiling-an-Indian-Cyberattack-Infrastructure_FINAL_Web.pdf.txt\n", 455 | "converted 68 /drive/My Drive/TXT/APTnotes/2013/NormanShark-MaudiOperation.pdf.txt\n", 456 | "converted 69 /drive/My Drive/TXT/APTnotes/2013/Norman_HangOver report_Executive Summary_042513.pdf.txt\n", 457 | "converted 70 /drive/My Drive/TXT/APTnotes/2013/Operation_DeputyDog.pdf.txt\n", 458 | "converted 71 /drive/My Drive/TXT/APTnotes/2013/Operation_EphemeralHydra.pdf.txt\n", 459 | "converted 72 /drive/My Drive/TXT/APTnotes/2013/Operation_Molerats.pdf.txt\n", 460 | "converted 73 /drive/My Drive/TXT/APTnotes/2013/Plugx_Smoaler.pdf.txt\n", 461 | "converted 74 /drive/My Drive/TXT/APTnotes/2013/Presentation_Targeted-Attacks_EN.pdf.txt\n", 462 | "converted 75 /drive/My Drive/TXT/APTnotes/2013/RAP002_APT1_Technical_backstage.1.0.pdf.txt\n", 463 | "converted 76 /drive/My Drive/TXT/APTnotes/2013/Safe-a-targeted-threat.pdf.txt\n", 464 | "converted 77 
/drive/My Drive/TXT/APTnotes/2013/Secrets_of_the_Comfoo_Masters.pdf.txt\n", 465 | "converted 78 /drive/My Drive/TXT/APTnotes/2013/Securelist_RedOctober.pdf.txt\n", 466 | "converted 79 /drive/My Drive/TXT/APTnotes/2013/Securelist_RedOctober_Detail.pdf.txt\n", 467 | "converted 80 /drive/My Drive/TXT/APTnotes/2013/Surtr_Malware_Tibetan.pdf.txt\n", 468 | "converted 81 /drive/My Drive/TXT/APTnotes/2013/Trojan.APT.BaneChant.pdf.txt\n", 469 | "converted 82 /drive/My Drive/TXT/APTnotes/2013/Trojan.APT.Seinup.pdf.txt\n", 470 | "converted 83 /drive/My Drive/TXT/APTnotes/2013/US-13-Yarochkin-In-Depth-Analysis-of-Escalated-APT-Attacks-Slides.pdf.txt\n", 471 | "converted 84 /drive/My Drive/TXT/APTnotes/2013/Unveiling an Indian Cyberattack Infrastructure - appendixes.pdf.txt\n", 472 | "converted 85 /drive/My Drive/TXT/APTnotes/2013/circl-analysisreport-miniduke-stage3-public.pdf.txt\n", 473 | "converted 86 /drive/My Drive/TXT/APTnotes/2013/comment_crew_indicators_of_compromise.pdf.txt\n", 474 | "converted 87 /drive/My Drive/TXT/APTnotes/2013/dissecting-operation-troy.pdf.txt\n", 475 | "converted 88 /drive/My Drive/TXT/APTnotes/2013/energy-at-risk.pdf.txt\n", 476 | "converted 89 /drive/My Drive/TXT/APTnotes/2013/fireeye-china-chopper-report.pdf.txt\n", 477 | "converted 90 /drive/My Drive/TXT/APTnotes/2013/fireeye-malware-supply-chain.pdf.txt\n", 478 | "converted 91 /drive/My Drive/TXT/APTnotes/2013/fireeye-operation-ke3chang.pdf.txt\n", 479 | "converted 92 /drive/My Drive/TXT/APTnotes/2013/fireeye-poison-ivy-report.pdf.txt\n", 480 | "converted 93 /drive/My Drive/TXT/APTnotes/2013/fireeye-wwc-report.pdf.txt\n", 481 | "converted 94 /drive/My Drive/TXT/APTnotes/2013/fta-1009---njrat-uncovered-1.pdf.txt\n", 482 | "converted 95 /drive/My Drive/TXT/APTnotes/2013/hidden_lynx.pdf.txt\n", 483 | "converted 96 /drive/My Drive/TXT/APTnotes/2013/icefog.pdf.txt\n", 484 | "converted 97 /drive/My Drive/TXT/APTnotes/2013/kaspersky-the-net-traveler-part1-final.pdf.txt\n", 485 | "converted 98 
/drive/My Drive/TXT/APTnotes/2013/miniduke_indicators_public.pdf.txt\n", 486 | "converted 99 /drive/My Drive/TXT/APTnotes/2013/stuxnet_0_5_the_missing_link.pdf.txt\n", 487 | "converted 100 /drive/My Drive/TXT/APTnotes/2013/themysteryofthepdf0-dayassemblermicrobackdoor.pdf.txt\n", 488 | "converted 101 /drive/My Drive/TXT/APTnotes/2013/theteamspystory_final_t2.pdf.txt\n", 489 | "converted 102 /drive/My Drive/TXT/APTnotes/2013/tr-12-circl-plugx-analysis-v1.pdf.txt\n", 490 | "converted 103 /drive/My Drive/TXT/APTnotes/2013/winnti-more-than-just-a-game-130410.pdf.txt\n", 491 | "converted 104 /drive/My Drive/TXT/APTnotes/2013/wp-fakem-rat.pdf.txt\n", 492 | "converted 105 /drive/My Drive/TXT/APTnotes/2014/ASERT-Threat-Intelligence-Brief-2014-07-Illuminating-Etumbot-APT.pdf.txt\n", 493 | "converted 106 /drive/My Drive/TXT/APTnotes/2014/AdversaryIntelligenceReport_DeepPanda_0 (1).pdf.txt\n", 494 | "converted 107 /drive/My Drive/TXT/APTnotes/2014/Aided_Frame_Aided_Direction.pdf.txt\n", 495 | "converted 108 /drive/My Drive/TXT/APTnotes/2014/Alienvault_Scanbox.pdf.txt\n", 496 | "converted 109 /drive/My Drive/TXT/APTnotes/2014/Anunak_APT_against_financial_institutions.pdf.txt\n", 497 | "converted 110 /drive/My Drive/TXT/APTnotes/2014/BlackEnergy2_Plugins_Router.pdf.txt\n", 498 | "converted 111 /drive/My Drive/TXT/APTnotes/2014/Chinese_MITM_Google.pdf.txt\n", 499 | "converted 112 /drive/My Drive/TXT/APTnotes/2014/CloudAtlas_RedOctober_APT.pdf.txt\n", 500 | "converted 113 /drive/My Drive/TXT/APTnotes/2014/Compromise_Greece_Beijing.pdf.txt\n", 501 | "converted 114 /drive/My Drive/TXT/APTnotes/2014/CrowdStrike_Flying_Kitten.pdf.txt\n", 502 | "converted 115 /drive/My Drive/TXT/APTnotes/2014/Cylance_Operation_Cleaver_Report.pdf.txt\n", 503 | "converted 116 /drive/My Drive/TXT/APTnotes/2014/DEEP_PANDA_Sakula.pdf.txt\n", 504 | "converted 117 /drive/My Drive/TXT/APTnotes/2014/Darwin_fav_APT_Group.pdf.txt\n", 505 | "converted 118 /drive/My 
Drive/TXT/APTnotes/2014/Democracy_HongKong_Under_Attack.pdf.txt\n", 506 | "converted 119 /drive/My Drive/TXT/APTnotes/2014/Derusbi_Server_Analysis-Final.pdf.txt\n", 507 | "converted 120 /drive/My Drive/TXT/APTnotes/2014/Dragonfly_Threat_Against_Western_Energy_Suppliers.pdf.txt\n", 508 | "converted 121 /drive/My Drive/TXT/APTnotes/2014/EB-YetiJuly2014-Public.pdf.txt\n", 509 | "converted 122 /drive/My Drive/TXT/APTnotes/2014/El_Machete.pdf.txt\n", 510 | "converted 123 /drive/My Drive/TXT/APTnotes/2014/EvilBunny_Suspect4_v1.0.pdf.txt\n", 511 | "converted 124 /drive/My Drive/TXT/APTnotes/2014/FTA 1001 FINAL 1.15.14.pdf.txt\n", 512 | "converted 125 /drive/My Drive/TXT/APTnotes/2014/FTA 1011 Follow UP.pdf.txt\n", 513 | "converted 126 /drive/My Drive/TXT/APTnotes/2014/FTA 1012 STTEAM Final.pdf.txt\n", 514 | "converted 127 /drive/My Drive/TXT/APTnotes/2014/FTA_1013_RAT_in_a_jar.pdf.txt\n", 515 | "converted 128 /drive/My Drive/TXT/APTnotes/2014/FTA_1014_Bots_Machines_and_the_Matrix.pdf.txt\n", 516 | "converted 129 /drive/My Drive/TXT/APTnotes/2014/GDATA_TooHash_CaseStudy_102014_EN_v1.pdf.txt\n", 517 | "converted 130 /drive/My Drive/TXT/APTnotes/2014/GData_Uroburos_RedPaper_EN_v1.pdf.txt\n", 518 | "converted 131 /drive/My Drive/TXT/APTnotes/2014/Gholee_Protective_Edge_themed_spear_phishing_campaign.pdf.txt\n", 519 | "converted 132 /drive/My Drive/TXT/APTnotes/2014/Group72_Opening_ZxShell.pdf.txt\n", 520 | "converted 133 /drive/My Drive/TXT/APTnotes/2014/Group_72.pdf.txt\n", 521 | "converted 134 /drive/My Drive/TXT/APTnotes/2014/HPSR SecurityBriefing_Episode16_NorthKorea.pdf.txt\n", 522 | "converted 135 /drive/My Drive/TXT/APTnotes/2014/Hikit_Analysis-Final.pdf.txt\n", 523 | "converted 136 /drive/My Drive/TXT/APTnotes/2014/ICS_Havex_backdoors.pdf.txt\n", 524 | "converted 137 /drive/My Drive/TXT/APTnotes/2014/KL_Epic_Turla_Technical_Appendix_20140806.pdf.txt\n", 525 | "converted 138 /drive/My Drive/TXT/APTnotes/2014/KL_report_syrian_malware.pdf.txt\n", 526 | "converted 139 
/drive/My Drive/TXT/APTnotes/2014/Kaspersky_Lab_crouching_yeti_appendixes_eng_final.pdf.txt\n", 527 | "converted 140 /drive/My Drive/TXT/APTnotes/2014/Kaspersky_Lab_whitepaper_Regin_platform_eng.pdf.txt\n", 528 | "converted 141 /drive/My Drive/TXT/APTnotes/2014/Korplug_Afghanistan_Tajikistan.pdf.txt\n", 529 | "converted 142 /drive/My Drive/TXT/APTnotes/2014/LeoUncia_OrcaRat.pdf.txt\n", 530 | "converted 143 /drive/My Drive/TXT/APTnotes/2014/Micro-Targeted-Malvertising-WP-10-27-14-1.pdf.txt\n", 531 | "converted 144 /drive/My Drive/TXT/APTnotes/2014/Miniduke_twitter.pdf.txt\n", 532 | "converted 145 /drive/My Drive/TXT/APTnotes/2014/Modified_Binaries_Tor.pdf.txt\n", 533 | "converted 146 /drive/My Drive/TXT/APTnotes/2014/NYTimes_Attackers_Evolve_Quickly.pdf.txt\n", 534 | "converted 147 /drive/My Drive/TXT/APTnotes/2014/NetTraveler_Makeover_10th_Birthday.pdf.txt\n", 535 | "converted 148 /drive/My Drive/TXT/APTnotes/2014/OnionDuke_Tor.pdf.txt\n", 536 | "converted 149 /drive/My Drive/TXT/APTnotes/2014/Op_Clandestine_Fox.pdf.txt\n", 537 | "converted 150 /drive/My Drive/TXT/APTnotes/2014/Op_SnowMan_DeputyDog.pdf.txt\n", 538 | "converted 151 /drive/My Drive/TXT/APTnotes/2014/OperationCleaver_The_Notepad_Files.pdf.txt\n", 539 | "converted 152 /drive/My Drive/TXT/APTnotes/2014/OperationDoubleTap.pdf.txt\n", 540 | "converted 153 /drive/My Drive/TXT/APTnotes/2014/Operation_CloudyOmega_Ichitaro.pdf.txt\n", 541 | "converted 154 /drive/My Drive/TXT/APTnotes/2014/Operation_GreedyWonk.pdf.txt\n", 542 | "converted 155 /drive/My Drive/TXT/APTnotes/2014/Operation_Poisoned_Handover.pdf.txt\n", 543 | "converted 156 /drive/My Drive/TXT/APTnotes/2014/Operation_Poisoned_Hurricane.pdf.txt\n", 544 | "converted 157 /drive/My Drive/TXT/APTnotes/2014/Operation_SnowMan.pdf.txt\n", 545 | "converted 158 /drive/My Drive/TXT/APTnotes/2014/OrcaRAT.pdf.txt\n", 546 | "converted 159 /drive/My Drive/TXT/APTnotes/2014/PAN_Nitro.pdf.txt\n", 547 | "converted 160 /drive/My 
Drive/TXT/APTnotes/2014/Pitty_Tiger_Final_Report.pdf.txt\n", 548 | "converted 161 /drive/My Drive/TXT/APTnotes/2014/Regis_The_Intercept.pdf.txt\n", 549 | "converted 162 /drive/My Drive/TXT/APTnotes/2014/Reuters_Turla.pdf.txt\n", 550 | "converted 163 /drive/My Drive/TXT/APTnotes/2014/Sandworm_briefing2.pdf.txt\n", 551 | "converted 164 /drive/My Drive/TXT/APTnotes/2014/Sayad_Flying_Kitten_analysis.pdf.txt\n", 552 | "converted 165 /drive/My Drive/TXT/APTnotes/2014/Syrian_Malware_Team_BlackWorm.pdf.txt\n", 553 | "converted 166 /drive/My Drive/TXT/APTnotes/2014/TA14-353A_wiper.pdf.txt\n", 554 | "converted 167 /drive/My Drive/TXT/APTnotes/2014/Targeted_Attacks_Lense_NGO.pdf.txt\n", 555 | "converted 168 /drive/My Drive/TXT/APTnotes/2014/Targeting_Syrian_ISIS_Critics.pdf.txt\n", 556 | "converted 169 /drive/My Drive/TXT/APTnotes/2014/The_Epic_Turla_Operation.pdf.txt\n", 557 | "converted 170 /drive/My Drive/TXT/APTnotes/2014/The_Monju_Incident.pdf.txt\n", 558 | "converted 171 /drive/My Drive/TXT/APTnotes/2014/The_Siesta_Campaign.pdf.txt\n", 559 | "converted 172 /drive/My Drive/TXT/APTnotes/2014/The_Uroburos_case.pdf.txt\n", 560 | "converted 173 /drive/My Drive/TXT/APTnotes/2014/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt\n", 561 | "converted 174 /drive/My Drive/TXT/APTnotes/2014/TrapX_ZOMBIE_Report_Final.pdf.txt\n", 562 | "converted 175 /drive/My Drive/TXT/APTnotes/2014/Turla_2_Penquin.pdf.txt\n", 563 | "converted 176 /drive/My Drive/TXT/APTnotes/2014/Vinself_steganography.pdf.txt\n", 564 | "converted 177 /drive/My Drive/TXT/APTnotes/2014/Wiper_Malware.pdf.txt\n", 565 | "converted 178 /drive/My Drive/TXT/APTnotes/2014/XSLCmd_OSX.pdf.txt\n", 566 | "converted 179 /drive/My Drive/TXT/APTnotes/2014/XtremeRAT_fireeye.pdf.txt\n", 567 | "converted 180 /drive/My Drive/TXT/APTnotes/2014/ZoxPNG_Full_Analysis-Final.pdf.txt\n", 568 | "converted 181 /drive/My Drive/TXT/APTnotes/2014/apt28.pdf.txt\n", 569 | "converted 182 /drive/My 
Drive/TXT/APTnotes/2014/bcs_wp_InceptionReport_EN_v12914.pdf.txt\n", 570 | "converted 183 /drive/My Drive/TXT/APTnotes/2014/blackenergy_whitepaper.pdf.txt\n", 571 | "converted 184 /drive/My Drive/TXT/APTnotes/2014/circl-tr25-analysis-turla-pfinet-snake-uroburos.pdf.txt\n", 572 | "converted 185 /drive/My Drive/TXT/APTnotes/2014/cosmicduke_whitepaper.pdf.txt\n", 573 | "converted 186 /drive/My Drive/TXT/APTnotes/2014/darkhotel_kl_07.11.pdf.txt\n", 574 | "converted 187 /drive/My Drive/TXT/APTnotes/2014/darkhotelappendixindicators_kl.pdf.txt\n", 575 | "converted 188 /drive/My Drive/TXT/APTnotes/2014/deep-panda-webshells.pdf.txt\n", 576 | "converted 189 /drive/My Drive/TXT/APTnotes/2014/fireeye-operation-quantum-entanglement.pdf.txt\n", 577 | "converted 190 /drive/My Drive/TXT/APTnotes/2014/fireeye-operation-saffron-rose.pdf.txt\n", 578 | "converted 191 /drive/My Drive/TXT/APTnotes/2014/fireeye-sidewinder-targeted-attack.pdf.txt\n", 579 | "converted 192 /drive/My Drive/TXT/APTnotes/2014/h12756-wp-shell-crew.pdf.txt\n", 580 | "converted 193 /drive/My Drive/TXT/APTnotes/2014/korea_power_plant_wiper.pdf.txt\n", 581 | "converted 194 /drive/My Drive/TXT/APTnotes/2014/operation-poisoned-helmand.pdf.txt\n", 582 | "converted 195 /drive/My Drive/TXT/APTnotes/2014/putter-panda.pdf.txt\n", 583 | "converted 196 /drive/My Drive/TXT/APTnotes/2014/pwc_ScanBox_framework.pdf.txt\n", 584 | "converted 197 /drive/My Drive/TXT/APTnotes/2014/regin-analysis.pdf.txt\n", 585 | "converted 198 /drive/My Drive/TXT/APTnotes/2014/roaming_tiger_zeronights_2014.pdf.txt\n", 586 | "converted 199 /drive/My Drive/TXT/APTnotes/2014/rpt-fin4.pdf.txt\n", 587 | "converted 200 /drive/My Drive/TXT/APTnotes/2014/sec14-paper-hardy.pdf.txt\n", 588 | "converted 201 /drive/My Drive/TXT/APTnotes/2014/sec14-paper-marczak.pdf.txt\n", 589 | "converted 202 /drive/My Drive/TXT/APTnotes/2014/snake_whitepaper.pdf.txt\n", 590 | "converted 203 /drive/My Drive/TXT/APTnotes/2014/sophos-rotten-tomato-campaign.pdf.txt\n", 591 | 
"converted 204 /drive/My Drive/TXT/APTnotes/2014/tactical-intelligence-bulletin---sofacy-phishing-.pdf.txt\n", 592 | "converted 205 /drive/My Drive/TXT/APTnotes/2014/targeted_attacks_against_the_energy_sector.pdf.txt\n", 593 | "converted 206 /drive/My Drive/TXT/APTnotes/2014/th3bug_Watering_Hole_PoisonIvy.pdf.txt\n", 594 | "converted 207 /drive/My Drive/TXT/APTnotes/2014/unveilingthemask_v1.0.pdf.txt\n", 595 | "converted 208 /drive/My Drive/TXT/APTnotes/2014/w32_regin_stage_1.pdf.txt\n", 596 | "converted 209 /drive/My Drive/TXT/APTnotes/2014/w64_regin_stage_1.pdf.txt\n", 597 | "converted 210 /drive/My Drive/TXT/APTnotes/2014/wp-operation-pawn-storm.pdf.txt\n", 598 | "converted 211 /drive/My Drive/TXT/APTnotes/2015/ANALYSIS-ON-APT-TO-BE-ATTACK-THAT-FOCUSING-ON-CHINAS-GOVERNMENT-AGENCY-.pdf.txt\n", 599 | "converted 212 /drive/My Drive/TXT/APTnotes/2015/Agent.BTZ_to_ComRAT.pdf.txt\n", 600 | "converted 213 /drive/My Drive/TXT/APTnotes/2015/Anthem_hack_all_roads_lead_to_China.pdf.txt\n", 601 | "converted 214 /drive/My Drive/TXT/APTnotes/2015/Attacks against Israeli & Palestinian interests - Cyber security updates.pdf.txt\n", 602 | "converted 215 /drive/My Drive/TXT/APTnotes/2015/Backdoor.Winnti_Trojan.Skelky.pdf.txt\n", 603 | "converted 216 /drive/My Drive/TXT/APTnotes/2015/BlueTermite_Japan.pdf.txt\n", 604 | "converted 217 /drive/My Drive/TXT/APTnotes/2015/Carbanak_APT_eng.pdf.txt\n", 605 | "converted 218 /drive/My Drive/TXT/APTnotes/2015/China_Peace_Palace.pdf.txt\n", 606 | "converted 219 /drive/My Drive/TXT/APTnotes/2015/CmstarDownloader_Lurid_Enfal_Cousin.pdf.txt\n", 607 | "converted 220 /drive/My Drive/TXT/APTnotes/2015/CozyDuke.pdf.txt\n", 608 | "converted 221 /drive/My Drive/TXT/APTnotes/2015/Cylance SPEAR Team_ A Threat Actor Resurfaces.pdf.txt\n", 609 | "poppler error creating document 222 /drive/My Drive/TXT/APTnotes/2015/DTL-06282015-01.pdf.txt\n", 610 | "converted 222 /drive/My Drive/TXT/APTnotes/2015/DTL-12012015-01.pdf.txt\n", 611 | "converted 223 
/drive/My Drive/TXT/APTnotes/2015/Dino – the latest spying malware from an allegedly French espionage group analyzed.pdf.txt\n", 612 | "converted 224 /drive/My Drive/TXT/APTnotes/2015/Dissecting-LinuxMoose.pdf.txt\n", 613 | "converted 225 /drive/My Drive/TXT/APTnotes/2015/Dissecting-the-Kraken.pdf.txt\n", 614 | "converted 226 /drive/My Drive/TXT/APTnotes/2015/Duke_cloud_Linux.pdf.txt\n", 615 | "converted 227 /drive/My Drive/TXT/APTnotes/2015/Elephantosis.pdf.txt\n", 616 | "converted 228 /drive/My Drive/TXT/APTnotes/2015/Equation_group_questions_and_answers.pdf.txt\n", 617 | "converted 229 /drive/My Drive/TXT/APTnotes/2015/FSOFACY.pdf.txt\n", 618 | "converted 230 /drive/My Drive/TXT/APTnotes/2015/Forkmeiamfamous_SeaDuke.pdf.txt\n", 619 | "converted 231 /drive/My Drive/TXT/APTnotes/2015/GlobalThreatIntelReport.pdf.txt\n", 620 | "converted 232 /drive/My Drive/TXT/APTnotes/2015/Grabit.pdf.txt\n", 621 | "converted 233 /drive/My Drive/TXT/APTnotes/2015/Inception_APT_Analysis_Bluecoat.pdf.txt\n", 622 | "converted 234 /drive/My Drive/TXT/APTnotes/2015/Indicators_of_Compormise_Hellsing.pdf.txt\n", 623 | "converted 235 /drive/My Drive/TXT/APTnotes/2015/Inside_EquationDrug_Espionage_Platform.pdf.txt\n", 624 | "converted 236 /drive/My Drive/TXT/APTnotes/2015/Minerva_Clearsky_CopyKittens(11-23-15).pdf.txt\n", 625 | "converted 237 /drive/My Drive/TXT/APTnotes/2015/MiniDionis_CozyCar_Seaduke.pdf.txt\n", 626 | "converted 238 /drive/My Drive/TXT/APTnotes/2015/OceanLotusReport.pdf.txt\n", 627 | "converted 239 /drive/My Drive/TXT/APTnotes/2015/Operation RussianDoll.pdf.txt\n", 628 | "converted 240 /drive/My Drive/TXT/APTnotes/2015/Operation-Potao-Express_final_v2.pdf.txt\n", 629 | "converted 241 /drive/My Drive/TXT/APTnotes/2015/OperationClandestineWolf.pdf.txt\n", 630 | "converted 242 /drive/My Drive/TXT/APTnotes/2015/P2P_PlugX_Analysis.pdf.txt\n", 631 | "converted 243 /drive/My Drive/TXT/APTnotes/2015/PawnStorm_iOS.pdf.txt\n", 632 | "converted 244 /drive/My 
Drive/TXT/APTnotes/2015/Project_Cobra_Analysis.pdf.txt\n", 633 | "converted 245 /drive/My Drive/TXT/APTnotes/2015/Regin_Hopscotch_Legspin.pdf.txt\n", 634 | "converted 246 /drive/My Drive/TXT/APTnotes/2015/Scarab_Russian.pdf.txt\n", 635 | "converted 247 /drive/My Drive/TXT/APTnotes/2015/Skeleton_Key_Analysis.pdf.txt\n", 636 | "converted 248 /drive/My Drive/TXT/APTnotes/2015/Targeted-Attacks-against-Tibetan-and-Hong-Kong-Groups-Exploiting-CVE-2014-4114.pdf.txt\n", 637 | "converted 249 /drive/My Drive/TXT/APTnotes/2015/Terracotta-VPN-Report-Final-8-3.pdf.txt\n", 638 | "converted 250 /drive/My Drive/TXT/APTnotes/2015/Thamar-Reservoir.pdf.txt\n", 639 | "converted 251 /drive/My Drive/TXT/APTnotes/2015/The Chronicles of the Hellsing APT_ the Empire Strikes Back - Securelist.pdf.txt\n", 640 | "converted 252 /drive/My Drive/TXT/APTnotes/2015/The CozyDuke APT - Securelist.pdf.txt\n", 641 | "converted 253 /drive/My Drive/TXT/APTnotes/2015/The Naikon APT - Securelist.pdf.txt\n", 642 | "converted 254 /drive/My Drive/TXT/APTnotes/2015/The-Desert-Falcons-targeted-attacks.pdf.txt\n", 643 | "converted 255 /drive/My Drive/TXT/APTnotes/2015/TheNaikonAPT-MsnMM1.pdf.txt\n", 644 | "converted 256 /drive/My Drive/TXT/APTnotes/2015/TheNaikonAPT-MsnMM2.pdf.txt\n", 645 | "converted 257 /drive/My Drive/TXT/APTnotes/2015/The_Mystery_of_Duqu_2_0_a_sophisticated_cyberespionage_actor_returns.pdf.txt\n", 646 | "converted 258 /drive/My Drive/TXT/APTnotes/2015/Tibetan-Uprising-Day-Malware-Attacks_websitepdf.pdf.txt\n", 647 | "converted 259 /drive/My Drive/TXT/APTnotes/2015/UnFIN4ished_Business_pwd.pdf.txt\n", 648 | "converted 260 /drive/My Drive/TXT/APTnotes/2015/WateringHole_Aerospace_CVE-2015-5122_IsSpace.pdf.txt\n", 649 | "converted 261 /drive/My Drive/TXT/APTnotes/2015/WildNeutron_Economic_espionage.pdf.txt\n", 650 | "converted 262 /drive/My Drive/TXT/APTnotes/2015/apt29-hammertoss-stealthy-tactics-define-a.pdf.txt\n", 651 | "converted 263 /drive/My 
Drive/TXT/APTnotes/2015/butterfly-corporate-spies-out-for-financial-gain.pdf.txt\n", 652 | "converted 264 /drive/My Drive/TXT/APTnotes/2015/cto-tib-20150223-01a.pdf.txt\n", 653 | "converted 265 /drive/My Drive/TXT/APTnotes/2015/cto-tib-20150420-01a.pdf.txt\n", 654 | "converted 266 /drive/My Drive/TXT/APTnotes/2015/duqu2_crysys.pdf.txt\n", 655 | "converted 267 /drive/My Drive/TXT/APTnotes/2015/oil-tanker-en.pdf.txt\n", 656 | "converted 268 /drive/My Drive/TXT/APTnotes/2015/operation-arid-viper-whitepaper-en.pdf.txt\n", 657 | "converted 269 /drive/My Drive/TXT/APTnotes/2015/plugx-goes-to-the-registry-and-india.pdf.txt\n", 658 | "converted 270 /drive/My Drive/TXT/APTnotes/2015/rpt-apt30.pdf.txt\n", 659 | "converted 271 /drive/My Drive/TXT/APTnotes/2015/rpt-behind-the-syria-conflict.pdf.txt\n", 660 | "converted 272 /drive/My Drive/TXT/APTnotes/2015/rpt-southeast-asia-threat-landscape.pdf.txt\n", 661 | "converted 273 /drive/My Drive/TXT/APTnotes/2015/the-black-vine-cyberespionage-group.pdf.txt\n", 662 | "converted 274 /drive/My Drive/TXT/APTnotes/2015/unit42-operation-lotus-blossom.pdf.txt\n", 663 | "converted 275 /drive/My Drive/TXT/APTnotes/2015/volatile-cedar-technical-report.pdf.txt\n", 664 | "converted 276 /drive/My Drive/TXT/APTnotes/2015/waterbug-attack-group.pdf.txt\n", 665 | "converted 277 /drive/My Drive/TXT/APTnotes/2015/winnti_pharmaceutical.pdf.txt\n", 666 | "converted 278 /drive/My Drive/TXT/APTnotes/2015/wp-operation-tropic-trooper.pdf.txt\n", 667 | "converted 279 /drive/My Drive/TXT/APTnotes/2015/wp-operation-woolen-goldfish.pdf.txt\n", 668 | "converted 280 /drive/My Drive/TXT/APTnotes/historical/2008/Cyberwar.pdf.txt\n", 669 | "converted 281 /drive/My Drive/TXT/APTnotes/historical/2008/chinas-electronic.pdf.txt\n", 670 | "converted 282 /drive/My Drive/TXT/APTnotes/historical/2009/Ashmore - Impact of Alleged Russian Cyber Attacks .pdf.txt\n", 671 | "converted 283 /drive/My Drive/TXT/APTnotes/historical/2009/Cyber-030.pdf.txt\n", 672 | "converted 284 
/drive/My Drive/TXT/APTnotes/historical/2009/DECLAWING THE DRAGON.pdf.txt\n", 673 | "converted 285 /drive/My Drive/TXT/APTnotes/historical/2011/CyberEspionage.pdf.txt\n", 674 | "converted 286 /drive/My Drive/TXT/APTnotes/historical/2011/enter-the-cyberdragon.pdf.txt\n", 675 | "converted 287 /drive/My Drive/TXT/APTnotes/historical/2011/vol7no2Ball.pdf.txt\n" 676 | ] 677 | } 678 | ], 679 | "source": [ 680 | "pdftotext_converter('/drive/My Drive/PDF/APTnotes', '/drive/My Drive/TXT/APTnotes')" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": { 686 | "id": "INRAVkSKpF_8" 687 | }, 688 | "source": [ 689 | "## **Let's apply our best NER & RE models**" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": { 696 | "id": "lvQ8LNNL3cug" 697 | }, 698 | "outputs": [], 699 | "source": [ 700 | "import transformers\n", 701 | "import torch\n", 702 | "from transformers import BertForTokenClassification, AdamW\n", 703 | "from transformers import BertTokenizer, BertConfig\n", 704 | "import pandas as pd\n", 705 | "from copy import deepcopy\n", 706 | "from collections import OrderedDict\n", 707 | "from transformers import pipeline\n", 708 | "from transformers import AutoTokenizer, AutoModelForTokenClassification\n", 709 | "from nltk import tokenize" 710 | ] 711 | }, 712 | { 713 | "cell_type": "markdown", 714 | "metadata": { 715 | "id": "zzPpCGHDZYAw" 716 | }, 717 | "source": [ 718 | "**Loading the NER model**" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": null, 724 | "metadata": { 725 | "id": "YapqzZ6g4EYk" 726 | }, 727 | "outputs": [], 728 | "source": [ 729 | "# Load the fine-tuned NER tokenizer and model from the saved checkpoint directory\n", 730 | "tokenizer = AutoTokenizer.from_pretrained('/drive/My Drive/bestmodel/')\n", 731 | "model = AutoModelForTokenClassification.from_pretrained('/drive/My Drive/bestmodel/')" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | 
"execution_count": null, 737 | "metadata": { 738 | "id": "ifHuizuAGZJT" 739 | }, 740 | "outputs": [], 741 | "source": [ 742 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": { 749 | "id": "vpPZVg2JGOnq" 750 | }, 751 | "outputs": [], 752 | "source": [ 753 | "id2label= {\n", 754 | " \"0\": \"B-ATTACKER\",\n", 755 | " \"1\": \"I-ATTACKER\",\n", 756 | " \"2\": \"B-COMPAIGN\",\n", 757 | " \"3\": \"I-COMPAIGN\",\n", 758 | " \"4\": \"B-DATE\",\n", 759 | " \"5\": \"I-DATE\",\n", 760 | " \"6\": \"B-ExploitTargetObject\",\n", 761 | " \"7\": \"I-ExploitTargetObject\",\n", 762 | " \"8\": \"B-INDICATOR\",\n", 763 | " \"9\": \"I-INDICATOR\",\n", 764 | " \"10\": \"B-INFORMATION\",\n", 765 | " \"11\": \"I-INFORMATION\",\n", 766 | " \"12\": \"B-LOC\",\n", 767 | " \"13\": \"I-LOC\",\n", 768 | " \"14\": \"B-MALWARE\",\n", 769 | " \"15\": \"I-MALWARE\",\n", 770 | " \"16\": \"B-MALWARECHARACTERISTICS\",\n", 771 | " \"17\": \"I-MALWARECHARACTERISTICS\",\n", 772 | " \"18\": \"B-ORG\",\n", 773 | " \"19\": \"I-ORG\",\n", 774 | " \"20\": \"B-PRODUCT\",\n", 775 | " \"21\": \"I-PRODUCT\",\n", 776 | " \"22\": \"B-VULNERABILITY\",\n", 777 | " \"23\": \"I-VULNERABILITY\",\n", 778 | " \"24\": \"O\"\n", 779 | " }" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": null, 785 | "metadata": { 786 | "colab": { 787 | "base_uri": "https://localhost:8080/" 788 | }, 789 | "id": "nxRSznreazXA", 790 | "outputId": "9199d1aa-51ab-46aa-d9ad-cb979f9d83a1" 791 | }, 792 | "outputs": [ 793 | { 794 | "output_type": "execute_result", 795 | "data": { 796 | "text/plain": [ 797 | "BertForTokenClassification(\n", 798 | " (bert): BertModel(\n", 799 | " (embeddings): BertEmbeddings(\n", 800 | " (word_embeddings): Embedding(28996, 768, padding_idx=0)\n", 801 | " (position_embeddings): Embedding(512, 768)\n", 802 | " (token_type_embeddings): Embedding(2, 768)\n", 803 | " 
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 804 | " (dropout): Dropout(p=0.1, inplace=False)\n", 805 | " )\n", 806 | " (encoder): BertEncoder(\n", 807 | " (layer): ModuleList(\n", 808 | " (0): BertLayer(\n", 809 | " (attention): BertAttention(\n", 810 | " (self): BertSelfAttention(\n", 811 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 812 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 813 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 814 | " (dropout): Dropout(p=0.1, inplace=False)\n", 815 | " )\n", 816 | " (output): BertSelfOutput(\n", 817 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 818 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 819 | " (dropout): Dropout(p=0.1, inplace=False)\n", 820 | " )\n", 821 | " )\n", 822 | " (intermediate): BertIntermediate(\n", 823 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 824 | " (intermediate_act_fn): GELUActivation()\n", 825 | " )\n", 826 | " (output): BertOutput(\n", 827 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 828 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 829 | " (dropout): Dropout(p=0.1, inplace=False)\n", 830 | " )\n", 831 | " )\n", 832 | " (1): BertLayer(\n", 833 | " (attention): BertAttention(\n", 834 | " (self): BertSelfAttention(\n", 835 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 836 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 837 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 838 | " (dropout): Dropout(p=0.1, inplace=False)\n", 839 | " )\n", 840 | " (output): BertSelfOutput(\n", 841 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 842 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 843 | " (dropout): Dropout(p=0.1, inplace=False)\n", 844 | " )\n", 845 | " )\n", 846 
| " (intermediate): BertIntermediate(\n", 847 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 848 | " (intermediate_act_fn): GELUActivation()\n", 849 | " )\n", 850 | " (output): BertOutput(\n", 851 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 852 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 853 | " (dropout): Dropout(p=0.1, inplace=False)\n", 854 | " )\n", 855 | " )\n", 856 | " (2): BertLayer(\n", 857 | " (attention): BertAttention(\n", 858 | " (self): BertSelfAttention(\n", 859 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 860 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 861 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 862 | " (dropout): Dropout(p=0.1, inplace=False)\n", 863 | " )\n", 864 | " (output): BertSelfOutput(\n", 865 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 866 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 867 | " (dropout): Dropout(p=0.1, inplace=False)\n", 868 | " )\n", 869 | " )\n", 870 | " (intermediate): BertIntermediate(\n", 871 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 872 | " (intermediate_act_fn): GELUActivation()\n", 873 | " )\n", 874 | " (output): BertOutput(\n", 875 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 876 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 877 | " (dropout): Dropout(p=0.1, inplace=False)\n", 878 | " )\n", 879 | " )\n", 880 | " (3): BertLayer(\n", 881 | " (attention): BertAttention(\n", 882 | " (self): BertSelfAttention(\n", 883 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 884 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 885 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 886 | " (dropout): Dropout(p=0.1, inplace=False)\n", 887 | " )\n", 888 | " (output): 
BertSelfOutput(\n", 889 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 890 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 891 | " (dropout): Dropout(p=0.1, inplace=False)\n", 892 | " )\n", 893 | " )\n", 894 | " (intermediate): BertIntermediate(\n", 895 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 896 | " (intermediate_act_fn): GELUActivation()\n", 897 | " )\n", 898 | " (output): BertOutput(\n", 899 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 900 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 901 | " (dropout): Dropout(p=0.1, inplace=False)\n", 902 | " )\n", 903 | " )\n", 904 | " (4): BertLayer(\n", 905 | " (attention): BertAttention(\n", 906 | " (self): BertSelfAttention(\n", 907 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 908 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 909 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 910 | " (dropout): Dropout(p=0.1, inplace=False)\n", 911 | " )\n", 912 | " (output): BertSelfOutput(\n", 913 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 914 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 915 | " (dropout): Dropout(p=0.1, inplace=False)\n", 916 | " )\n", 917 | " )\n", 918 | " (intermediate): BertIntermediate(\n", 919 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 920 | " (intermediate_act_fn): GELUActivation()\n", 921 | " )\n", 922 | " (output): BertOutput(\n", 923 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 924 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 925 | " (dropout): Dropout(p=0.1, inplace=False)\n", 926 | " )\n", 927 | " )\n", 928 | " (5): BertLayer(\n", 929 | " (attention): BertAttention(\n", 930 | " (self): BertSelfAttention(\n", 931 | " (query): Linear(in_features=768, out_features=768, 
bias=True)\n", 932 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 933 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 934 | " (dropout): Dropout(p=0.1, inplace=False)\n", 935 | " )\n", 936 | " (output): BertSelfOutput(\n", 937 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 938 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 939 | " (dropout): Dropout(p=0.1, inplace=False)\n", 940 | " )\n", 941 | " )\n", 942 | " (intermediate): BertIntermediate(\n", 943 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 944 | " (intermediate_act_fn): GELUActivation()\n", 945 | " )\n", 946 | " (output): BertOutput(\n", 947 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 948 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 949 | " (dropout): Dropout(p=0.1, inplace=False)\n", 950 | " )\n", 951 | " )\n", 952 | " (6): BertLayer(\n", 953 | " (attention): BertAttention(\n", 954 | " (self): BertSelfAttention(\n", 955 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 956 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 957 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 958 | " (dropout): Dropout(p=0.1, inplace=False)\n", 959 | " )\n", 960 | " (output): BertSelfOutput(\n", 961 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 962 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 963 | " (dropout): Dropout(p=0.1, inplace=False)\n", 964 | " )\n", 965 | " )\n", 966 | " (intermediate): BertIntermediate(\n", 967 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 968 | " (intermediate_act_fn): GELUActivation()\n", 969 | " )\n", 970 | " (output): BertOutput(\n", 971 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 972 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 
973 | " (dropout): Dropout(p=0.1, inplace=False)\n", 974 | " )\n", 975 | " )\n", 976 | " (7): BertLayer(\n", 977 | " (attention): BertAttention(\n", 978 | " (self): BertSelfAttention(\n", 979 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 980 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 981 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 982 | " (dropout): Dropout(p=0.1, inplace=False)\n", 983 | " )\n", 984 | " (output): BertSelfOutput(\n", 985 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 986 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 987 | " (dropout): Dropout(p=0.1, inplace=False)\n", 988 | " )\n", 989 | " )\n", 990 | " (intermediate): BertIntermediate(\n", 991 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 992 | " (intermediate_act_fn): GELUActivation()\n", 993 | " )\n", 994 | " (output): BertOutput(\n", 995 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 996 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 997 | " (dropout): Dropout(p=0.1, inplace=False)\n", 998 | " )\n", 999 | " )\n", 1000 | " (8): BertLayer(\n", 1001 | " (attention): BertAttention(\n", 1002 | " (self): BertSelfAttention(\n", 1003 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 1004 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 1005 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 1006 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1007 | " )\n", 1008 | " (output): BertSelfOutput(\n", 1009 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 1010 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1011 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1012 | " )\n", 1013 | " )\n", 1014 | " (intermediate): BertIntermediate(\n", 1015 | " (dense): Linear(in_features=768, out_features=3072, 
bias=True)\n", 1016 | " (intermediate_act_fn): GELUActivation()\n", 1017 | " )\n", 1018 | " (output): BertOutput(\n", 1019 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 1020 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1021 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1022 | " )\n", 1023 | " )\n", 1024 | " (9): BertLayer(\n", 1025 | " (attention): BertAttention(\n", 1026 | " (self): BertSelfAttention(\n", 1027 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 1028 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 1029 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 1030 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1031 | " )\n", 1032 | " (output): BertSelfOutput(\n", 1033 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 1034 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1035 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1036 | " )\n", 1037 | " )\n", 1038 | " (intermediate): BertIntermediate(\n", 1039 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 1040 | " (intermediate_act_fn): GELUActivation()\n", 1041 | " )\n", 1042 | " (output): BertOutput(\n", 1043 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 1044 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1045 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1046 | " )\n", 1047 | " )\n", 1048 | " (10): BertLayer(\n", 1049 | " (attention): BertAttention(\n", 1050 | " (self): BertSelfAttention(\n", 1051 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 1052 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 1053 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 1054 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1055 | " )\n", 1056 | " (output): BertSelfOutput(\n", 1057 | " (dense): Linear(in_features=768, 
out_features=768, bias=True)\n", 1058 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1059 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1060 | " )\n", 1061 | " )\n", 1062 | " (intermediate): BertIntermediate(\n", 1063 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 1064 | " (intermediate_act_fn): GELUActivation()\n", 1065 | " )\n", 1066 | " (output): BertOutput(\n", 1067 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 1068 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1069 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1070 | " )\n", 1071 | " )\n", 1072 | " (11): BertLayer(\n", 1073 | " (attention): BertAttention(\n", 1074 | " (self): BertSelfAttention(\n", 1075 | " (query): Linear(in_features=768, out_features=768, bias=True)\n", 1076 | " (key): Linear(in_features=768, out_features=768, bias=True)\n", 1077 | " (value): Linear(in_features=768, out_features=768, bias=True)\n", 1078 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1079 | " )\n", 1080 | " (output): BertSelfOutput(\n", 1081 | " (dense): Linear(in_features=768, out_features=768, bias=True)\n", 1082 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1083 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1084 | " )\n", 1085 | " )\n", 1086 | " (intermediate): BertIntermediate(\n", 1087 | " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", 1088 | " (intermediate_act_fn): GELUActivation()\n", 1089 | " )\n", 1090 | " (output): BertOutput(\n", 1091 | " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", 1092 | " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", 1093 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1094 | " )\n", 1095 | " )\n", 1096 | " )\n", 1097 | " )\n", 1098 | " )\n", 1099 | " (dropout): Dropout(p=0.1, inplace=False)\n", 1100 | " (classifier): Linear(in_features=768, out_features=25, bias=True)\n", 1101 | ")" 
1102 | ] 1103 | }, 1104 | "metadata": {}, 1105 | "execution_count": 10 1106 | } 1107 | ], 1108 | "source": [ 1109 | "model.to(device)" 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "markdown", 1114 | "metadata": { 1115 | "id": "OobZr9PkZeos" 1116 | }, 1117 | "source": [ 1118 | "**Loading relation extraction model**" 1119 | ] 1120 | }, 1121 | { 1122 | "cell_type": "code", 1123 | "execution_count": null, 1124 | "metadata": { 1125 | "id": "NdJQ5wbqlIiX" 1126 | }, 1127 | "outputs": [], 1128 | "source": [ 1129 | "rel2id={'': 0,\n", 1130 | " 'authored(e1,e2)': 20,\n", 1131 | " 'authored(e2,e1)': 19,\n", 1132 | " 'belongsto(e1,e2)': 25,\n", 1133 | " 'belongsto(e2,e1)': 6,\n", 1134 | " 'exploits(e1,e2)': 9,\n", 1135 | " 'exploits(e2,e1)': 11,\n", 1136 | " 'hasattacklocation(e1,e2)': 7,\n", 1137 | " 'hasattacklocation(e2,e1)': 5,\n", 1138 | " 'hasattacktime(e1,e2)': 1,\n", 1139 | " 'hasattacktime(e2,e1)': 14,\n", 1140 | " 'hascharacteristics(e1,e2)': 15,\n", 1141 | " 'hascharacteristics(e2,e1)': 21,\n", 1142 | " 'hasproduct(e1,e2)': 13,\n", 1143 | " 'hasproduct(e2,e1)': 18,\n", 1144 | " 'hasvulnerability(e1,e2)': 3,\n", 1145 | " 'hasvulnerability(e2,e1)': 10,\n", 1146 | " 'indicates(e1,e2)': 12,\n", 1147 | " 'indicates(e2,e1)': 2,\n", 1148 | " 'involvesmalware(e1,e2)': 22,\n", 1149 | " 'involvesmalware(e2,e1)': 23,\n", 1150 | " 'other': 8,\n", 1151 | " 'targets(e1,e2)': 4,\n", 1152 | " 'targets(e2,e1)': 17,\n", 1153 | " 'usesmalware(e1,e2)': 24,\n", 1154 | " 'usesmalware(e2,e1)': 16}" 1155 | ] 1156 | }, 1157 | { 1158 | "cell_type": "code", 1159 | "execution_count": null, 1160 | "metadata": { 1161 | "id": "GDZWE_vnZ6_j" 1162 | }, 1163 | "outputs": [], 1164 | "source": [ 1165 | "id2rel={0: '',\n", 1166 | " 1: 'hasattacktime(e1,e2)',\n", 1167 | " 2: 'indicates(e2,e1)',\n", 1168 | " 3: 'hasvulnerability(e1,e2)',\n", 1169 | " 4: 'targets(e1,e2)',\n", 1170 | " 5: 'hasattacklocation(e2,e1)',\n", 1171 | " 6: 'belongsto(e2,e1)',\n", 1172 | " 7: 'hasattacklocation(e1,e2)',\n", 1173 
| "              8: 'other',\n", 1174 | "              9: 'exploits(e1,e2)',\n", 1175 | "              10: 'hasvulnerability(e2,e1)',\n", 1176 | "              11: 'exploits(e2,e1)',\n", 1177 | "              12: 'indicates(e1,e2)',\n", 1178 | "              13: 'hasproduct(e1,e2)',\n", 1179 | "              14: 'hasattacktime(e2,e1)',\n", 1180 | "              15: 'hascharacteristics(e1,e2)',\n", 1181 | "              16: 'usesmalware(e2,e1)',\n", 1182 | "              17: 'targets(e2,e1)',\n", 1183 | "              18: 'hasproduct(e2,e1)',\n", 1184 | "              19: 'authored(e2,e1)',\n", 1185 | "              20: 'authored(e1,e2)',\n", 1186 | "              21: 'hascharacteristics(e2,e1)',\n", 1187 | "              22: 'involvesmalware(e1,e2)',\n", 1188 | "              23: 'involvesmalware(e2,e1)',\n", 1189 | "              24: 'usesmalware(e1,e2)',\n", 1190 | "              25: 'belongsto(e1,e2)'}" 1191 | ] 1192 | }, 1193 | { 1194 | "cell_type": "code", 1195 | "execution_count": null, 1196 | "metadata": { 1197 | "id": "5EwIXuFCgHVD" 1198 | }, 1199 | "outputs": [], 1200 | "source": [ 1201 | "import ast\n", 1202 | "file = open(\"w2id.txt\", \"r\")\n", 1203 | "\n", 1204 | "contents = file.read()\n", 1205 | "word2id = ast.literal_eval(contents)\n", 1206 | "\n", 1207 | "file.close()" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "code", 1212 | "execution_count": null, 1213 | "metadata": { 1214 | "id": "P1XWFzaSEpnv" 1215 | }, 1216 | "outputs": [], 1217 | "source": [ 1218 | "word_vec=torch.load('/drive/My Drive/tensors.pt')" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": null, 1224 | "metadata": { 1225 | "id": "7OgKK_wCgls7" 1226 | }, 1227 | "outputs": [], 1228 | "source": [ 1229 | "import re\n", 1230 | "def search_entity(sentence):\n", 1231 | "    e1 = re.findall(r'<e1>(.*)</e1>', sentence)[0]\n", 1232 | "    e2 = re.findall(r'<e2>(.*)</e2>', sentence)[0]\n", 1233 | "    sentence = sentence.replace('<e1>' + e1 + '</e1>', ' <e1> ' + e1 + ' </e1> ', 1)\n", 1234 | "    sentence = sentence.replace('<e2>' + e2 + '</e2>', ' <e2> ' + e2 + ' </e2> ', 1)\n", 1235 | "    sentence = sentence.split()\n", 1236 | "    sentence = ' '.join(sentence)\n", 1237 | "    sentence = sentence.replace('< e1 >', '<e1>')\n", 1238 | "    sentence = sentence.replace('< e2 >', '<e2>')\n", 1239 | "    sentence = sentence.replace('< /e1 >', '</e1>')\n", 1240 | "    sentence = sentence.replace('< /e2 >', '</e2>')\n", 1241 | "    sentence = sentence.split()\n", 1242 | "\n", 1243 | "    assert '<e1>' in sentence\n", 1244 | "    assert '</e1>' in sentence\n", 1245 | "    assert '<e2>' in sentence\n", 1246 | "    assert '</e2>' in sentence\n", 1247 | "\n", 1248 | "    return sentence" 1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": null, 1254 | "metadata": { 1255 | "id": "SM6WYec5a5ys" 1256 | }, 1257 | "outputs": [], 1258 | "source": [ 1259 | "import json\n", 1260 | "def convert(path_src, path_des):\n", 1261 | "    with open(path_src, 'r', encoding='utf-8') as fr:\n", 1262 | "        data = fr.readlines()\n", 1263 | "    \n", 1264 | "    with open(path_des, 'w', encoding='utf-8') as fw:\n", 1265 | "        for i in range(0, len(data), 3):\n", 1266 | "            id_s, sentence = data[i].strip().split(' ', 1)  # split off the id only; the sentence itself contains spaces\n", 1267 | "            sentence = sentence[1:len(sentence)]\n", 1268 | "            sentence = search_entity(sentence)\n", 1269 | "            meta = dict(\n", 1270 | "                id=id_s,\n", 1271 | "                relation=data[i+1].strip(),\n", 1272 | "                sentence=sentence,\n", 1273 | "                \n", 1274 | "            )\n", 1275 | "            json.dump(meta, fw, ensure_ascii=False)\n", 1276 | "            fw.write('\\n')" 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "code", 1281 | "execution_count": null, 1282 | "metadata": { 1283 | "id": "p-Az1YBbbTxf" 1284 | }, 1285 | "outputs": [], 1286 | "source": [ 1287 | "import numpy as np\n", 1288 | "class BrevetsDataLoader(object):\n", 1289 | "    def __init__(self, rel2id, word2id, batch_size, max_len):\n", 1290 | "        self.rel2id = rel2id\n", 1291 | "        self.word2id = word2id\n", 1292 | "        self.batch_size = batch_size\n", 1293 | "        self.max_len = max_len\n", 1294 | "\n", 1295 | "    def __collate_fn(self, batch):\n", 1296 | "        data = zip(*batch)  # unzip the batch data\n", 1297 | "        data = list(data)\n", 1298 | "        #label = list(label)\n", 1299 | "        data = torch.from_numpy(np.concatenate(data, axis=0))\n", 1300 | "        #label = 
torch.from_numpy(np.asarray(label, dtype=np.int64))\n", 1301 | " return data\n", 1302 | "\n", 1303 | " def __get_data(self, filename, shuffle=False):\n", 1304 | " dataset = BrevetsDateset(filename, self.rel2id, self.word2id, self.max_len)\n", 1305 | " loader = DataLoader(\n", 1306 | " dataset=dataset,\n", 1307 | " batch_size= self.batch_size,\n", 1308 | " shuffle=shuffle,\n", 1309 | " num_workers=2,\n", 1310 | " collate_fn=self.__collate_fn\n", 1311 | " )\n", 1312 | " return loader\n", 1313 | " def get_test(self):\n", 1314 | " return self.__get_data('sentence.json', shuffle=False)\n", 1315 | "\n", 1316 | "batch_size = 10\n", 1317 | "max_len = 500\n", 1318 | "loader = BrevetsDataLoader(rel2id, word2id, batch_size, max_len)\n", 1319 | "\n", 1320 | "#test_loader = loader.get_test()" 1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": null, 1326 | "metadata": { 1327 | "id": "y5df5txlDPct" 1328 | }, 1329 | "outputs": [], 1330 | "source": [ 1331 | "import torch\n", 1332 | "import torch.nn as nn\n", 1333 | "import torch.nn.functional as F\n", 1334 | "from torch.nn import init\n", 1335 | "from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence\n", 1336 | "\n", 1337 | "\n", 1338 | "class Att_BLSTM(nn.Module):\n", 1339 | " def __init__(self, word_vec, class_num, max_len, word_dim, hidden_size, layers_num, dropout):\n", 1340 | " super().__init__()\n", 1341 | " self.word_vec = word_vec\n", 1342 | " self.class_num = class_num\n", 1343 | "\n", 1344 | " self.max_len = max_len\n", 1345 | " self.word_dim = word_dim\n", 1346 | " self.hidden_size = hidden_size\n", 1347 | " self.layers_num = layers_num\n", 1348 | " self.dropout_value = dropout\n", 1349 | "\n", 1350 | " self.word_embedding = nn.Embedding.from_pretrained(\n", 1351 | " embeddings=self.word_vec,\n", 1352 | " freeze=False,\n", 1353 | " )\n", 1354 | " self.lstm = nn.LSTM(\n", 1355 | " input_size=self.word_dim,\n", 1356 | " hidden_size=self.hidden_size,\n", 1357 | " 
num_layers=self.layers_num,\n", 1358 | " bias=True,\n", 1359 | " batch_first=True,\n", 1360 | " dropout=self.dropout_value,\n", 1361 | " bidirectional=True,\n", 1362 | " )\n", 1363 | " self.tanh = nn.Tanh()\n", 1364 | " self.dropout = nn.Dropout(self.dropout_value)\n", 1365 | "\n", 1366 | " self.att_weight = nn.Parameter(torch.randn(1, self.hidden_size, 1))\n", 1367 | " self.dense = nn.Linear(\n", 1368 | " in_features=self.hidden_size,\n", 1369 | " out_features=self.class_num,\n", 1370 | " bias=True\n", 1371 | " )\n", 1372 | " init.xavier_normal_(self.dense.weight)\n", 1373 | " init.constant_(self.dense.bias, 0.)\n", 1374 | "\n", 1375 | " def lstm_layer(self, x, mask):\n", 1376 | " lengths = torch.sum(mask.gt(0), dim=-1)\n", 1377 | " x = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)\n", 1378 | " h, (_, _) = self.lstm(x)\n", 1379 | " h, _ = pad_packed_sequence(h, batch_first=True, padding_value=0.0, total_length=self.max_len)\n", 1380 | " h = h.view(-1, self.max_len, 2, self.hidden_size)\n", 1381 | " h = torch.sum(h, dim=2) \n", 1382 | " return h\n", 1383 | "\n", 1384 | " def attention_layer(self, h, mask):\n", 1385 | " att_weight = self.att_weight.expand(mask.shape[0], -1, -1) \n", 1386 | " att_score = torch.bmm(self.tanh(h), att_weight) \n", 1387 | " mask = mask.unsqueeze(dim=-1) \n", 1388 | " att_score = att_score.masked_fill(mask.eq(0), float('-inf')) \n", 1389 | " att_weight = F.softmax(att_score, dim=1) \n", 1390 | " reps = torch.bmm(h.transpose(1, 2), att_weight).squeeze(dim=-1) \n", 1391 | " reps = self.tanh(reps) \n", 1392 | " return reps\n", 1393 | "\n", 1394 | " def forward(self, data):\n", 1395 | " token = data[:, 0, :].contiguous().view(-1, self.max_len)\n", 1396 | " mask = data[:, 1, :].contiguous().view(-1, self.max_len)\n", 1397 | " emb = self.word_embedding(token) \n", 1398 | " emb = self.dropout(emb)\n", 1399 | " h = self.lstm_layer(emb, mask) \n", 1400 | " h = self.dropout(h)\n", 1401 | " reps = self.attention_layer(h, 
mask) \n", 1402 | " reps = self.dropout(reps)\n", 1403 | " logits = self.dense(reps)\n", 1404 | " return logits" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "code", 1409 | "execution_count": null, 1410 | "metadata": { 1411 | "colab": { 1412 | "base_uri": "https://localhost:8080/" 1413 | }, 1414 | "id": "h2IYncqRDQZ1", 1415 | "outputId": "fd25ecec-885a-4a00-ab58-07a8d9a81022" 1416 | }, 1417 | "outputs": [ 1418 | { 1419 | "output_type": "stream", 1420 | "name": "stderr", 1421 | "text": [ 1422 | "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/rnn.py:60: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1\n", 1423 | " \"num_layers={}\".format(dropout, num_layers))\n" 1424 | ] 1425 | } 1426 | ], 1427 | "source": [ 1428 | "model_re = Att_BLSTM(word_vec = word_vec, class_num = 26, max_len = 500, word_dim = 300, hidden_size = 100, layers_num = 1, dropout = 0.5).to(device)" 1429 | ] 1430 | }, 1431 | { 1432 | "cell_type": "code", 1433 | "execution_count": null, 1434 | "metadata": { 1435 | "colab": { 1436 | "base_uri": "https://localhost:8080/" 1437 | }, 1438 | "id": "jHaPdaUpF1cn", 1439 | "outputId": "1c720fe0-d3ff-40c5-e735-066bb048877e" 1440 | }, 1441 | "outputs": [ 1442 | { 1443 | "output_type": "execute_result", 1444 | "data": { 1445 | "text/plain": [ 1446 | "Att_BLSTM(\n", 1447 | " (word_embedding): Embedding(400006, 300)\n", 1448 | " (lstm): LSTM(300, 100, batch_first=True, dropout=0.5, bidirectional=True)\n", 1449 | " (tanh): Tanh()\n", 1450 | " (dropout): Dropout(p=0.5, inplace=False)\n", 1451 | " (dense): Linear(in_features=100, out_features=26, bias=True)\n", 1452 | ")" 1453 | ] 1454 | }, 1455 | "metadata": {}, 1456 | "execution_count": 19 1457 | } 1458 | ], 1459 | "source": [ 1460 | "model_re.load_state_dict(torch.load('/drive/My Drive/tut2-model_ner.pt'))\n", 1461 | "model_re.eval()" 1462 | ] 1463 | }, 1464 | { 1465 | "cell_type": 
"code", 1466 | "execution_count": null, 1467 | "metadata": { 1468 | "id": "mXKXQUCkF4Cd" 1469 | }, 1470 | "outputs": [], 1471 | "source": [ 1472 | "def align_tokens_and_predicted_labels(toks_cpu, preds_cpu):\n", 1473 | "    aligned_toks, aligned_preds = [], []\n", 1474 | "    prev_tok = None\n", 1475 | "    for tok, pred in zip(toks_cpu, preds_cpu):\n", 1476 | "        if tok.startswith(\"##\") and prev_tok is not None:\n", 1477 | "            prev_tok += tok[2:]\n", 1478 | "        else:\n", 1479 | "            if prev_tok is not None:\n", 1480 | "                aligned_toks.append(prev_tok)\n", 1481 | "                aligned_preds.append(id2label[str(prev_pred)])\n", 1482 | "            prev_tok = tok\n", 1483 | "            prev_pred = pred\n", 1484 | "    if prev_tok is not None:\n", 1485 | "        aligned_toks.append(prev_tok)\n", 1486 | "        aligned_preds.append(id2label[str(prev_pred)])\n", 1487 | "    return aligned_toks, aligned_preds\n", 1488 | "\n", 1489 | "\n", 1490 | "def predict(texts):\n", 1491 | "    aligned_tok_list, aligned_pred_list = [], []\n", 1492 | "    for text in texts:\n", 1493 | "        \n", 1494 | "        inputs = tokenizer(text, return_tensors=\"pt\").to(device)\n", 1495 | "        outputs = model(**inputs)\n", 1496 | "        tokens_cpu = tokenizer.convert_ids_to_tokens(inputs.input_ids.view(-1))\n", 1497 | "        preds_cpu = torch.argmax(outputs.logits, dim=-1)[0].cpu().numpy()\n", 1498 | "\n", 1499 | "        aligned_toks, aligned_preds = align_tokens_and_predicted_labels(tokens_cpu, preds_cpu)\n", 1500 | "\n", 1501 | "        aligned_tok_list.append(aligned_toks)\n", 1502 | "        aligned_pred_list.append(aligned_preds)\n", 1503 | "\n", 1504 | "    return aligned_tok_list, aligned_pred_list\n", 1505 | "\n", 1506 | "\n", 1507 | "predicted_tokens, predicted_tags = predict([\n", 1508 | "    [\"Marie Curie won the Nobel Prize in 1903 and 1911 .Joe Biden is the current President of the United States .\"],\n", 1509 | "    [\"Joe Biden is the current President of the United States .\"]\n", 1510 | "])" 1511 | ] 1512 | }, 1527 | { 1528 | "cell_type": "markdown", 1529 | "metadata": { 1530 | "id": "y1-8yFBPbAq1" 1531 | }, 1532 | "source": [ 1533 | "### After loading the best NER & RE models, let's apply them to our corpus" 1534 | ] 1535 | }, 1536 | { 1537 | "cell_type": "code", 1538 | "execution_count": null, 1539 | "metadata": { 1540 | "id": "f6ZDwDNC2YL4" 1541 | }, 1542 | "outputs": [], 1543 | "source": [ 1544 | "import glob\n", 1545 | "path='/drive/My Drive/TXT/APTnotes'\n", 1546 | "all_files = glob.glob(path + \"/*\")" 1547 | ] 1548 | }, 1549 | { 1550 | "cell_type": "code", 1551 | "execution_count": null, 1552 | "metadata": { 1553 | "colab": { 1554 | "base_uri": "https://localhost:8080/" 1555 | }, 1556 | "id": "myZX6CqgeH5s", 1557 | "outputId": "9cb63324-b94d-4bbc-9b7c-3afecc9d68ba" 1558 | }, 1559 | "outputs": [ 1560 | { 1561 | "output_type": "stream", 1562 | "name": "stdout", 1563 | "text": [ 1564 | "/drive/My Drive/TXT/APTnotes/2008/556_10535_798405_Annex87_CyberAttacks.pdf.txt\n", 1565 | "/drive/My Drive/TXT/APTnotes/2009/ghostnet.pdf.txt\n", 1566 | "/drive/My Drive/TXT/APTnotes/2010/Aurora_Botnet_Command_Structure.pdf.txt\n", 1567 | "/drive/My Drive/TXT/APTnotes/2011/Alerts DL-2011 Alerts-A-2011-02-18-01 Night Dragon Attachment 1.pdf.txt\n", 1568 | "/drive/My Drive/TXT/APTnotes/2012/Crouching_tiger_hidden_dragon.pdf.txt\n", 1569 | "/drive/My Drive/TXT/APTnotes/2013/15-2013-youonlyclicktwice.pdf.txt\n", 1570 | "/drive/My Drive/TXT/APTnotes/2014/ASERT-Threat-Intelligence-Brief-2014-07-Illuminating-Etumbot-APT.pdf.txt\n", 1571 | "/drive/My 
Drive/TXT/APTnotes/2015/ANALYSIS-ON-APT-TO-BE-ATTACK-THAT-FOCUSING-ON-CHINAS-GOVERNMENT-AGENCY-.pdf.txt\n" 1572 | ] 1573 | } 1574 | ], 1575 | "source": [ 1576 | "for j in all_files:\n", 1577 | "    text_files=glob.glob(j +\"/*.txt\")\n", 1578 | "    for x in text_files:\n", 1579 | "        print(x)\n", 1580 | "        break" 1581 | ] 1582 | }, 1583 | { 1584 | "cell_type": "markdown", 1585 | "metadata": { 1586 | "id": "Kyw0MJWX77hy" 1587 | }, 1588 | "source": [ 1589 | "- read each file\n", 1590 | "- get sentences\n", 1591 | "- extract entities from each sentence using the NER model\n", 1592 | "- if a sentence contains two entities: surround them with the `<e1>`/`<e2>` tags, then apply the relation extraction model\n", 1593 | "- store the following information in a dataframe: \"entity1\", \"tag1\", \"relation_label\", \"entity2\", \"tag2\"\n", 1594 | "- create a KG using py2neo\n" 1595 | ] 1596 | }, 1597 | { 1598 | "cell_type": "code", 1599 | "execution_count": null, 1600 | "metadata": { 1601 | "id": "e_yD95zdiPf4" 1602 | }, 1603 | "outputs": [], 1604 | "source": [ 1605 | "source=[]\n", 1606 | "target=[]\n", 1607 | "relation=[]\n", 1608 | "tags1=[]\n", 1609 | "tags2=[]" 1610 | ] 1611 | }, 1612 | { 1613 | "cell_type": "code", 1614 | "execution_count": null, 1615 | "metadata": { 1616 | "id": "92WZ7PPfj1wi" 1617 | }, 1618 | "outputs": [], 1619 | "source": [ 1620 | "def get_prediction(model, iterator):\n", 1621 | "    predict_label=[]\n", 1622 | "    for _, data in enumerate(iterator):\n", 1623 | "        data = data.to(device)\n", 1624 | "\n", 1625 | "        logits = model(data)\n", 1626 | "        \n", 1627 | "\n", 1628 | "        _, pred = torch.max(logits, dim=1) \n", 1629 | "        pred = pred.cpu().detach().numpy().reshape((-1, 1))\n", 1630 | "        predict_label.append(pred)\n", 1631 | "    predict_label = np.concatenate(predict_label, axis=0).reshape(-1).astype(np.int64)\n", 1632 | "    return predict_label" 1633 | ] 1634 | }, 1635 | { 1636 | "cell_type": "code", 1637 | "execution_count": null, 1638 | "metadata": { 1639 | "id": "n4UWgV5yNLaE" 1640 | },
1641 | "outputs": [], 1642 | "source": [ 1643 | "import os\n", 1644 | "import json\n", 1645 | "import torch\n", 1646 | "import numpy as np\n", 1647 | "from torch.utils.data import Dataset, DataLoader" 1648 | ] 1649 | }, 1650 | { 1651 | "cell_type": "code", 1652 | "execution_count": null, 1653 | "metadata": { 1654 | "id": "x9D8dKMqBWpG" 1655 | }, 1656 | "outputs": [], 1657 | "source": [ 1658 | "class BrevetsDateset(Dataset):\n", 1659 | " def __init__(self, filename, rel2id, word2id, max_len):\n", 1660 | " self.filename = filename\n", 1661 | " self.rel2id = rel2id\n", 1662 | " self.word2id = word2id\n", 1663 | " self.max_len = max_len\n", 1664 | " self.data_dir = ''\n", 1665 | " self.dataset = self.__load_data()\n", 1666 | "\n", 1667 | " def __symbolize_sentence(self, sentence):\n", 1668 | " \n", 1669 | " mask = [1] * len(sentence)\n", 1670 | " words = []\n", 1671 | " length = min(self.max_len, len(sentence))\n", 1672 | " mask = mask[:length]\n", 1673 | "\n", 1674 | " for i in range(length):\n", 1675 | " words.append(self.word2id.get(sentence[i].lower(), self.word2id['UNK']))\n", 1676 | "\n", 1677 | " if length < self.max_len:\n", 1678 | " for i in range(length, self.max_len):\n", 1679 | " mask.append(0) # 'PAD' mask is zero\n", 1680 | " words.append(self.word2id['PAD'])\n", 1681 | "\n", 1682 | " unit = np.asarray([words, mask], dtype=np.int64)\n", 1683 | " unit = np.reshape(unit, newshape=(1, 2, self.max_len))\n", 1684 | " return unit\n", 1685 | "\n", 1686 | " def __load_data(self):\n", 1687 | " path_data_file = os.path.join(self.data_dir, self.filename)\n", 1688 | " data = []\n", 1689 | " #labels = []\n", 1690 | " with open(path_data_file, 'r', encoding='utf-8') as fr:\n", 1691 | " for line in fr:\n", 1692 | " line = json.loads(line.strip())\n", 1693 | " #label = line['relation']\n", 1694 | " sentence = line['sentence']\n", 1695 | " #label_idx = self.rel2id[label]\n", 1696 | "\n", 1697 | " one_sentence = self.__symbolize_sentence(sentence)\n", 1698 | " 
data.append(one_sentence)\n", 1699 | "                #labels.append(label_idx)\n", 1700 | "        return data\n", 1701 | "\n", 1702 | "    def __getitem__(self, index):\n", 1703 | "\n", 1704 | "        data = self.dataset[index]\n", 1705 | "        #label = self.label[index]\n", 1706 | "        return data\n", 1707 | "    def __len__(self):\n", 1708 | "        return len(self.dataset)" 1709 | ] 1710 | }, 1711 | { 1712 | "cell_type": "code", 1713 | "execution_count": null, 1714 | "metadata": { 1715 | "colab": { 1716 | "base_uri": "https://localhost:8080/" 1717 | }, 1718 | "id": "EfxRfD7UBeha", 1719 | "outputId": "cf3d6756-284e-47a9-cb3f-229eab7e0a1f" 1720 | }, 1721 | "outputs": [ 1722 | { 1723 | "output_type": "stream", 1724 | "name": "stdout", 1725 | "text": [ 1726 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 1727 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 1728 | ] 1729 | }, 1730 | { 1731 | "output_type": "execute_result", 1732 | "data": { 1733 | "text/plain": [ 1734 | "True" 1735 | ] 1736 | }, 1737 | "metadata": {}, 1738 | "execution_count": 28 1739 | } 1740 | ], 1741 | "source": [ 1742 | "import nltk; from nltk import tokenize\n", 1743 | "nltk.download('punkt')" 1744 | ] 1745 | }, 1746 | { 1747 | "cell_type": "code", 1748 | "execution_count": null, 1749 | "metadata": { 1750 | "id": "XEOV7x1EBkq7" 1751 | }, 1752 | "outputs": [], 1753 | "source": [ 1754 | "# remove extra spaces and ending space if any\n", 1755 | "import re\n", 1756 | "spaces = ['\\u200b', '\\u200e', '\\u202a', '\\u202c', '\\ufeff', '\\uf0d8', '\\u2061', '\\x10', '\\x7f', '\\x9d', '\\xad', '\\xa0']\n", 1757 | "def remove_space(text):\n", 1758 | "    for space in spaces:\n", 1759 | "        text = text.replace(space, ' ')\n", 1760 | "    text = text.strip()\n", 1761 | "    text = re.sub('\\s+', ' ', text)\n", 1762 | "    return text" 1763 | ] 1764 | }, 1765 | { 1766 | "cell_type": "code", 1767 | "execution_count": null, 1768 | "metadata": { 1769 | "id": "BFtcvOwG2mNS" 1770 | }, 1771 | "outputs": [], 1772 | "source": [ 1773 | "id=0\n", 1774 | "for folder in 
all_files:\n", 1775 | "    text_files=glob.glob(folder +\"/*.txt\")\n", 1776 | "    for file in text_files:\n", 1777 | "        with open(file, 'r', encoding = 'utf-8') as f:\n", 1778 | "            text = f.read()\n", 1779 | "        text=text.replace('\\n',' ')\n", 1780 | "        text=remove_space(text)\n", 1781 | "        sentences=tokenize.sent_tokenize(text)\n", 1782 | "        sentences=[[s[:513]] for s in sentences] # wrap each sentence in a list so the NER model can be applied to it \n", 1783 | "        predicted_tokens, predicted_tags = predict(sentences)\n", 1784 | "        for i in range (len(sentences)):\n", 1785 | "            firstTag=True\n", 1786 | "            sentence=\"\"\n", 1787 | "            other=['O' for o in range (len(predicted_tags[i]))] # sentence which doesn't contain any tag\n", 1788 | "            if predicted_tags[i] != other: \n", 1789 | "                for tok in range (len(predicted_tokens[i])):\n", 1790 | "                    if predicted_tags[i][tok]=='O':\n", 1791 | "                        sentence=sentence+' '+predicted_tokens[i][tok]\n", 1792 | "                    elif predicted_tokens[i][tok]!='[SEP]' and predicted_tags[i][tok].startswith('B-') and firstTag:\n", 1793 | "                        tag1=predicted_tags[i][tok][2:]\n", 1794 | "                        if tok!=(len(predicted_tags[i])-1) and predicted_tags[i][tok+1].startswith('I-'):\n", 1795 | "                            sentence=sentence+' '+'<e1>'+predicted_tokens[i][tok]\n", 1796 | "                        else:\n", 1797 | "                            sentence=sentence+' '+'<e1>'+predicted_tokens[i][tok]+'</e1>'\n", 1798 | "                        firstTag=False\n", 1799 | "                    elif predicted_tokens[i][tok]!='[SEP]' and predicted_tags[i][tok].startswith('B-') and firstTag is False:\n", 1800 | "                        tag2=predicted_tags[i][tok][2:]\n", 1801 | "                        if tok!=(len(predicted_tags[i])-1) and predicted_tags[i][tok+1].startswith('I-') and predicted_tokens[i][tok+1]!='[SEP]':\n", 1802 | "                            sentence=sentence+' '+'<e2>'+predicted_tokens[i][tok]\n", 1803 | "                        else:\n", 1804 | "                            sentence=sentence+' '+'<e2>'+predicted_tokens[i][tok]+'</e2>'\n", 1805 | "                    else:\n", 1806 | "                        if predicted_tokens[i][tok]!='[SEP]':\n", 1807 | "                            if tok!=(len(predicted_tags[i])-1) and predicted_tags[i][tok+1].startswith('I-') and predicted_tokens[i][tok+1]!='[SEP]' :\n", 1808 | "                                sentence=sentence+' '+predicted_tokens[i][tok]\n", 1809 | "                            else:\n", 1810 | "                                if \"</e1>\" in sentence:\n", 1811 | "                                    sentence=sentence+' '+predicted_tokens[i][tok]+\"</e2>\"\n", 1812 | "                                else:\n", 1813 | "                                    sentence=sentence+' '+predicted_tokens[i][tok]+'</e1>'\n", 1814 | "\n", 1815 | "\n", 1816 | "            if sentence.count(\"<e1>\")==1 and sentence.count(\"<e2>\")==1:\n", 1817 | "                m=sentence.find('<e1>') \n", 1818 | "                n=sentence.find('</e1>')\n", 1819 | "                k=m+4\n", 1820 | "                entity1=sentence[k:n] \n", 1821 | "                p= sentence.find('<e2>')\n", 1822 | "                q=sentence.find('</e2>')\n", 1823 | "                y=p+4\n", 1824 | "                entity2=sentence[y:q]\n", 1825 | "                #write sentence in text file\n", 1826 | "                with open('sentence.txt', 'w') as f:\n", 1827 | "                    id=1\n", 1828 | "                    f.write(str(id)+' '+sentence)\n", 1829 | "                    f.write(\"\\n\")\n", 1830 | "                    f.write('relation')\n", 1831 | "\n", 1832 | "                convert('sentence.txt','sentence.json')\n", 1833 | "                test_loader = loader.get_test()\n", 1834 | "                pred=get_prediction(model_re, test_loader)\n", 1835 | "                rel=id2rel[pred[0]]\n", 1836 | "                if \"(e1,e2)\" in rel:\n", 1837 | "                    source.append(entity1)\n", 1838 | "                    target.append(entity2)\n", 1839 | "                    relation.append(rel[:-7])\n", 1840 | "                    tags1.append(tag1)\n", 1841 | "                    tags2.append(tag2)\n", 1842 | "                elif \"(e2,e1)\" in rel:\n", 1843 | "                    source.append(entity2)\n", 1844 | "                    target.append(entity1)\n", 1845 | "                    relation.append(rel[:-7])\n", 1846 | "                    tags1.append(tag2)\n", 1847 | "                    tags2.append(tag1)\n" 1848 | ] 1849 | }, 1850 | { 1851 | "cell_type": "code", 1852 | "execution_count": null, 1853 | "metadata": { 1854 | "id": "59eUimG6Bam8" 1855 | }, 1856 | "outputs": [], 1857 | "source": [ 1858 | "data2=pd.DataFrame({\n", 1859 | "    'source':source,\n", 1860 | "    'tag1':tags1,\n", 1861 | "    'relation_label':relation,\n", 1862 | "    'target':target,\n", 1863 | "    'tag2':tags2\n", 1864 | "})" 1865 | ] 1866 | }, 1867 | { 1868 | "cell_type": "code", 1869 | "execution_count": null, 1870 | "metadata": { 1871 | "colab": { 1872 | 
"base_uri": "https://localhost:8080/", 1873 | "height": 206 1874 | }, 1875 | "id": "UpMOcc_qS3em", 1876 | "outputId": "04857f88-126d-44ea-eaab-1df74882a677" 1877 | }, 1878 | "outputs": [ 1879 | { 1880 | "output_type": "execute_result", 1881 | "data": { 1882 | "text/plain": [ 1883 | " source tag1 relation_label target tag2\n", 1884 | "0 Microsoft ORG hasproduct Windows PRODUCT\n", 1885 | "1 Microsoft ORG hasproduct PsExec PRODUCT\n", 1886 | "2 Adobe ORG hasproduct Acrobat PRODUCT\n", 1887 | "3 BlackEnergy ORG hasproduct Trojans PRODUCT\n", 1888 | "4 Tsai ATTACKER hasattacklocation China LOC" 1889 | ], 1890 | "text/html": [ 1891 | "\n", 1892 | "
\n", 1893 | "
\n", 1894 | "
\n", 1895 | "\n", 1908 | "\n", 1909 | " \n", 1910 | " \n", 1911 | " \n", 1912 | " \n", 1913 | " \n", 1914 | " \n", 1915 | " \n", 1916 | " \n", 1917 | " \n", 1918 | " \n", 1919 | " \n", 1920 | " \n", 1921 | " \n", 1922 | " \n", 1923 | " \n", 1924 | " \n", 1925 | " \n", 1926 | " \n", 1927 | " \n", 1928 | " \n", 1929 | " \n", 1930 | " \n", 1931 | " \n", 1932 | " \n", 1933 | " \n", 1934 | " \n", 1935 | " \n", 1936 | " \n", 1937 | " \n", 1938 | " \n", 1939 | " \n", 1940 | " \n", 1941 | " \n", 1942 | " \n", 1943 | " \n", 1944 | " \n", 1945 | " \n", 1946 | " \n", 1947 | " \n", 1948 | " \n", 1949 | " \n", 1950 | " \n", 1951 | " \n", 1952 | " \n", 1953 | " \n", 1954 | " \n", 1955 | " \n", 1956 | " \n", 1957 | " \n", 1958 | " \n", 1959 | " \n", 1960 | " \n", 1961 | "
sourcetag1relation_labeltargettag2
0MicrosoftORGhasproductWindowsPRODUCT
1MicrosoftORGhasproductPsExecPRODUCT
2AdobeORGhasproductAcrobatPRODUCT
3BlackEnergyORGhasproductTrojansPRODUCT
4TsaiATTACKERhasattacklocationChinaLOC
\n", 1962 | "
\n", 1963 | " \n", 1973 | " \n", 1974 | " \n", 2011 | "\n", 2012 | " \n", 2036 | "
\n", 2037 | "
\n", 2038 | " " 2039 | ] 2040 | }, 2041 | "metadata": {}, 2042 | "execution_count": 35 2043 | } 2044 | ], 2045 | "source": [ 2046 | "data2.head()" 2047 | ] 2048 | }, 2049 | { 2050 | "cell_type": "code", 2051 | "execution_count": null, 2052 | "metadata": { 2053 | "id": "y1G_ywb_pMq8" 2054 | }, 2055 | "outputs": [], 2056 | "source": [ 2057 | "import pandas as pd\n", 2058 | "group=pd.read_csv('/drive/My Drive/ATT&CK MATRICES Group.csv')" 2059 | ] 2060 | }, 2061 | { 2062 | "cell_type": "code", 2063 | "execution_count": null, 2064 | "metadata": { 2065 | "colab": { 2066 | "base_uri": "https://localhost:8080/", 2067 | "height": 206 2068 | }, 2069 | "id": "E6Mklfymptwb", 2070 | "outputId": "ac89ecb0-7f02-4df4-a662-dbd97832cff3" 2071 | }, 2072 | "outputs": [ 2073 | { 2074 | "output_type": "execute_result", 2075 | "data": { 2076 | "text/plain": [ 2077 | " Name ID Tecs Used by Group \\\n", 2078 | "0 admin@338 G0018 ['T1087', 'T1059', 'T1203', 'T1083', 'T1036', ... \n", 2079 | "1 APT-C-36 G0099 ['T1059', 'T1105', 'T1036', 'T1571', 'T1027', ... \n", 2080 | "2 APT1 G0006 ['T1087', 'T1583', 'T1560', 'T1119', 'T1059', ... \n", 2081 | "3 APT12 G0005 ['T1568', 'T1203', 'T1566', 'T1204', 'T1102'] \n", 2082 | "4 APT16 G0023 ['T1584'] \n", 2083 | "\n", 2084 | " Associated Groups \n", 2085 | "0 NaN \n", 2086 | "1 Blind Eagle \n", 2087 | "2 Comment Crew, Comment Group, Comment Panda \n", 2088 | "3 IXESHE, DynCalc, Numbered Panda, DNSCALC \n", 2089 | "4 NaN " 2090 | ], 2091 | "text/html": [ 2092 | "\n", 2093 | "
\n", 2094 | "
\n", 2095 | "
\n", 2096 | "\n", 2109 | "\n", 2110 | " \n", 2111 | " \n", 2112 | " \n", 2113 | " \n", 2114 | " \n", 2115 | " \n", 2116 | " \n", 2117 | " \n", 2118 | " \n", 2119 | " \n", 2120 | " \n", 2121 | " \n", 2122 | " \n", 2123 | " \n", 2124 | " \n", 2125 | " \n", 2126 | " \n", 2127 | " \n", 2128 | " \n", 2129 | " \n", 2130 | " \n", 2131 | " \n", 2132 | " \n", 2133 | " \n", 2134 | " \n", 2135 | " \n", 2136 | " \n", 2137 | " \n", 2138 | " \n", 2139 | " \n", 2140 | " \n", 2141 | " \n", 2142 | " \n", 2143 | " \n", 2144 | " \n", 2145 | " \n", 2146 | " \n", 2147 | " \n", 2148 | " \n", 2149 | " \n", 2150 | " \n", 2151 | " \n", 2152 | " \n", 2153 | " \n", 2154 | " \n", 2155 | " \n", 2156 | "
NameIDTecs Used by GroupAssociated Groups
0admin@338G0018['T1087', 'T1059', 'T1203', 'T1083', 'T1036', ...NaN
1APT-C-36G0099['T1059', 'T1105', 'T1036', 'T1571', 'T1027', ...Blind Eagle
2APT1G0006['T1087', 'T1583', 'T1560', 'T1119', 'T1059', ...Comment Crew, Comment Group, Comment Panda
3APT12G0005['T1568', 'T1203', 'T1566', 'T1204', 'T1102']IXESHE, DynCalc, Numbered Panda, DNSCALC
4APT16G0023['T1584']NaN
\n", 2157 | "
\n", 2158 | " \n", 2168 | " \n", 2169 | " \n", 2206 | "\n", 2207 | " \n", 2231 | "
\n", 2232 | "
\n", 2233 | " " 2234 | ] 2235 | }, 2236 | "metadata": {}, 2237 | "execution_count": 201 2238 | } 2239 | ], 2240 | "source": [ 2241 | "group.head()" 2242 | ] 2243 | }, 2244 | { 2245 | "cell_type": "code", 2246 | "execution_count": null, 2247 | "metadata": { 2248 | "id": "iqccHOoLpO0v" 2249 | }, 2250 | "outputs": [], 2251 | "source": [ 2252 | "compaigns=[]\n", 2253 | "for i in group['Associated Groups']:\n", 2254 | " if type(i)!=float:\n", 2255 | " compaigns=compaigns+i.split(',')" 2256 | ] 2257 | }, 2258 | { 2259 | "cell_type": "code", 2260 | "execution_count": null, 2261 | "metadata": { 2262 | "id": "_nCgoHGkqk5G" 2263 | }, 2264 | "outputs": [], 2265 | "source": [ 2266 | "for i in range (len(compaigns)):\n", 2267 | " compaigns[i]=remove_space(compaigns[i])\n", 2268 | " compaigns[i]=compaigns[i].lower()" 2269 | ] 2270 | }, 2271 | { 2272 | "cell_type": "code", 2273 | "source": [ 2274 | "df=data" 2275 | ], 2276 | "metadata": { 2277 | "id": "zKKul9xrepqu" 2278 | }, 2279 | "execution_count": null, 2280 | "outputs": [] 2281 | }, 2282 | { 2283 | "cell_type": "code", 2284 | "execution_count": null, 2285 | "metadata": { 2286 | "id": "Kl1urQWPdz4h", 2287 | "colab": { 2288 | "base_uri": "https://localhost:8080/" 2289 | }, 2290 | "outputId": "794d5f29-c4a8-4e14-b46a-44f192a18521" 2291 | }, 2292 | "outputs": [ 2293 | { 2294 | "output_type": "stream", 2295 | "name": "stderr", 2296 | "text": [ 2297 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2298 | " \"\"\"Entry point for launching an IPython kernel.\n", 2299 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: FutureWarning: The default value of regex will change from True to False in a future version. 
In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2300 | " \n", 2301 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2302 | " This is separate from the ipykernel package so we can avoid doing imports until\n", 2303 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:4: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2304 | " after removing the cwd from sys.path.\n", 2305 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:7: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2306 | " import sys\n", 2307 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:8: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2308 | " \n", 2309 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:9: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2310 | " if __name__ == '__main__':\n", 2311 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:10: FutureWarning: The default value of regex will change from True to False in a future version. 
In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2312 | " # Remove the CWD from sys.path while we load stuff.\n", 2313 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:15: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2314 | " from ipykernel import kernelapp as app\n", 2315 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2316 | " app.launch_new_instance()\n", 2317 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:17: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2318 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:18: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2319 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:19: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2320 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:21: FutureWarning: The default value of regex will change from True to False in a future version. 
In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2321 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:23: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2322 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:24: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", 2323 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:25: FutureWarning: The default value of regex will change from True to False in a future version.\n", 2324 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:26: FutureWarning: The default value of regex will change from True to False in a future version.\n" 2325 | ] 2326 | } 2327 | ], 2328 | "source": [ 2329 | "df['source']=df['source'].str.replace(')','')\n", 2330 | "df['source']=df['source'].str.replace('(','')\n", 2331 | "df['source']=df['source'].str.replace('[','')\n", 2332 | "df['source']=df['source'].str.replace(']','')\n", 2333 | "df['source']=df['source'].str.replace(\"'s\",'')\n", 2334 | "df['source']=df['source'].str.replace(\"'\",' ')\n", 2335 | "df['target']=df['target'].str.replace(')','')\n", 2336 | "df['target']=df['target'].str.replace('(','')\n", 2337 | "df['target']=df['target'].str.replace('[','')\n", 2338 | "df['target']=df['target'].str.replace(']','')\n", 2339 | "df['target']=df['target'].str.replace(\"'s\",'')\n", 2340 | "df['target']=df['target'].str.replace(\"'\",' ')\n", 2341 | "df['target']=df['target'].str.replace('\"','')\n", 2342 | "df['source']=df['source'].str.replace('\"','')\n", 2343 | "df['target']=df['target'].str.replace('{','')\n", 2344 | "df['target']=df['target'].str.replace('}','')\n", 
2345 | "df['source']=df['source'].str.replace('{','')\n", 2346 | "df['source']=df['source'].str.replace('}','')\n", 2347 | "df['target']=df['target'].str.replace('*','')\n", 2348 | "df['target']=df['target'].str.replace('\"','')\n", 2349 | "df['source']=df['source'].str.replace('*','')\n", 2350 | "df['source']=df['source'].str.replace('\"','')\n", 2351 | "df['target']=df['target'].str.replace(\"\\\\\",'')\n", 2352 | "df['source']=df['source'].str.replace(\"\\\\\",'')\n", 2353 | "df['target']=df['target'].str.replace(\"\\ \",'')\n", 2354 | "df['source']=df['source'].str.replace(\"\\ \",'')" 2355 | ] 2356 | }, 2357 | { 2358 | "cell_type": "code", 2359 | "execution_count": null, 2360 | "metadata": { 2361 | "id": "kHfHC--LWrPs" 2362 | }, 2363 | "outputs": [], 2364 | "source": [ 2365 | "df.loc[df.source.str.lower().isin(compaigns), 'tag1']= 'CAMPAIGN'\n", 2366 | "df.loc[df.target.str.lower().isin(compaigns), 'tag2']= 'CAMPAIGN'" 2367 | ] 2368 | }, 2369 | { 2370 | "cell_type": "code", 2371 | "execution_count": null, 2372 | "metadata": { 2373 | "id": "CG2wSok4ewL7" 2374 | }, 2375 | "outputs": [], 2376 | "source": [ 2377 | "df.loc[df.tag1=='COMPAIGN', 'tag1']='CAMPAIGN'\n", 2378 | "df.loc[df.tag2=='COMPAIGN', 'tag2']='CAMPAIGN'" 2379 | ] 2380 | }, 2381 | { 2382 | "cell_type": "code", 2383 | "source": [ 2384 | "df" 2385 | ], 2386 | "metadata": { 2387 | "colab": { 2388 | "base_uri": "https://localhost:8080/", 2389 | "height": 424 2390 | }, 2391 | "id": "7qB1npCMJZS6", 2392 | "outputId": "b530a254-2482-42b3-90cf-84da50743572" 2393 | }, 2394 | "execution_count": null, 2395 | "outputs": [ 2396 | { 2397 | "output_type": "execute_result", 2398 | "data": { 2399 | "text/plain": [ 2400 | " source tag1 relation_label target tag2\n", 2401 | "0 Microsoft ORG hasproduct Windows PRODUCT\n", 2402 | "1 Microsoft ORG hasproduct PsExec PRODUCT\n", 2403 | "2 Adobe ORG hasproduct Acrobat PRODUCT\n", 2404 | "3 BlackEnergy ORG hasproduct Trojans PRODUCT\n", 2405 | "4 Tsai ATTACKER 
hasattacklocation China LOC\n", 2406 | "... ... ... ... ... ...\n", 2407 | "21894 107 MALWARE indicates 181 INDICATOR\n", 2408 | "21895 xxxxxxxx INDICATOR indicates / INDICATOR\n", 2409 | "21896 index INDICATOR hascharacteristics index INDICATOR\n", 2410 | "21897 www INDICATOR hasproduct trendmicro INDICATOR\n", 2411 | "21898 john MALWARE hasattacktime 2015 DATE\n", 2412 | "\n", 2413 | "[21899 rows x 5 columns]" 2414 | ], 2415 | "text/html": [ 2416 | "\n", 2417 | "
\n", 2418 | "
\n", 2419 | "
\n", 2420 | "\n", 2433 | "\n", 2434 | " \n", 2435 | " \n", 2436 | " \n", 2437 | " \n", 2438 | " \n", 2439 | " \n", 2440 | " \n", 2441 | " \n", 2442 | " \n", 2443 | " \n", 2444 | " \n", 2445 | " \n", 2446 | " \n", 2447 | " \n", 2448 | " \n", 2449 | " \n", 2450 | " \n", 2451 | " \n", 2452 | " \n", 2453 | " \n", 2454 | " \n", 2455 | " \n", 2456 | " \n", 2457 | " \n", 2458 | " \n", 2459 | " \n", 2460 | " \n", 2461 | " \n", 2462 | " \n", 2463 | " \n", 2464 | " \n", 2465 | " \n", 2466 | " \n", 2467 | " \n", 2468 | " \n", 2469 | " \n", 2470 | " \n", 2471 | " \n", 2472 | " \n", 2473 | " \n", 2474 | " \n", 2475 | " \n", 2476 | " \n", 2477 | " \n", 2478 | " \n", 2479 | " \n", 2480 | " \n", 2481 | " \n", 2482 | " \n", 2483 | " \n", 2484 | " \n", 2485 | " \n", 2486 | " \n", 2487 | " \n", 2488 | " \n", 2489 | " \n", 2490 | " \n", 2491 | " \n", 2492 | " \n", 2493 | " \n", 2494 | " \n", 2495 | " \n", 2496 | " \n", 2497 | " \n", 2498 | " \n", 2499 | " \n", 2500 | " \n", 2501 | " \n", 2502 | " \n", 2503 | " \n", 2504 | " \n", 2505 | " \n", 2506 | " \n", 2507 | " \n", 2508 | " \n", 2509 | " \n", 2510 | " \n", 2511 | " \n", 2512 | " \n", 2513 | " \n", 2514 | " \n", 2515 | " \n", 2516 | " \n", 2517 | " \n", 2518 | " \n", 2519 | " \n", 2520 | " \n", 2521 | " \n", 2522 | " \n", 2523 | " \n", 2524 | " \n", 2525 | " \n", 2526 | " \n", 2527 | " \n", 2528 | " \n", 2529 | " \n", 2530 | " \n", 2531 | " \n", 2532 | " \n", 2533 | " \n", 2534 | "
sourcetag1relation_labeltargettag2
0MicrosoftORGhasproductWindowsPRODUCT
1MicrosoftORGhasproductPsExecPRODUCT
2AdobeORGhasproductAcrobatPRODUCT
3BlackEnergyORGhasproductTrojansPRODUCT
4TsaiATTACKERhasattacklocationChinaLOC
..................
21894107MALWAREindicates181INDICATOR
21895xxxxxxxxINDICATORindicates/INDICATOR
21896indexINDICATORhascharacteristicsindexINDICATOR
21897wwwINDICATORhasproducttrendmicroINDICATOR
21898johnMALWAREhasattacktime2015DATE
\n", 2535 | "

21899 rows × 5 columns

\n", 2536 | "
\n", 2537 | " \n", 2547 | " \n", 2548 | " \n", 2585 | "\n", 2586 | " \n", 2610 | "
\n", 2611 | "
\n", 2612 | " " 2613 | ] 2614 | }, 2615 | "metadata": {}, 2616 | "execution_count": 214 2617 | } 2618 | ] 2619 | }, 2620 | { 2621 | "cell_type": "code", 2622 | "source": [ 2623 | "df.to_csv('df.csv')" 2624 | ], 2625 | "metadata": { 2626 | "id": "3N-rrxXKfH55" 2627 | }, 2628 | "execution_count": null, 2629 | "outputs": [] 2630 | } 2631 | ], 2632 | "metadata": { 2633 | "accelerator": "GPU", 2634 | "colab": { 2635 | "background_execution": "on", 2636 | "collapsed_sections": [ 2637 | "gQ52tc_2Y2dH", 2638 | "1aUhP-rWY_iz", 2639 | "YS9fedEh4oB9", 2640 | "o2RzEngCaxSy", 2641 | "fdABbgSba3BK", 2642 | "BbcA6-xKkFMX", 2643 | "UKp9DjPHkLbi", 2644 | "T1g9zv0FkVBz", 2645 | "Subt2FHrkZ0Z", 2646 | "r1js2OnXkezk", 2647 | "NsmsigSiklqc", 2648 | "XM3kB_Eckt27", 2649 | "dSYWIe_Mkyz0", 2650 | "faQq5UBkk3UY", 2651 | "BtLLF0rGk6mF" 2652 | ], 2653 | "machine_shape": "hm", 2654 | "name": "Input for KG construction task.ipynb", 2655 | "provenance": [] 2656 | }, 2657 | "kernelspec": { 2658 | "display_name": "Python 3", 2659 | "name": "python3" 2660 | }, 2661 | "language_info": { 2662 | "name": "python" 2663 | } 2664 | }, 2665 | "nbformat": 4, 2666 | "nbformat_minor": 0 2667 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | The repository contains the following tasks : Construction of the corpus, Named Entity Recognition , Relationship Extraction , Construction of knowledge graph using py2neo , Analysis on the constructed KG with cypher query 2 | 3 | **Conceptual model of the knowledge graph :** 4 | 5 | ![image](https://user-images.githubusercontent.com/77699359/205594664-66976e6e-9034-445a-95c0-d969bb658b81.png) 6 | 7 | **Overall design of the solution :** 8 | 9 | ![image](https://user-images.githubusercontent.com/77699359/205593799-b33e8936-6da4-481a-a02e-5e48bd7b52cb.png) 10 | 11 | In general, it is essential to build beforehand an idea on the usefulness and 
the types of information that we need to access in our knowledge graph, given that the field of cybersecurity is wide. This will make it easier to choose the documents on which the creation of the graph will be based. 12 | 13 | First of all, we have to build the corpus: we can either rely on a single collection of reports bearing the same type of information, or expand the field of knowledge by adding other types of information, from various sources, deemed useful during the analysis, while ensuring the homogeneity of the corpus. Before starting the entity-extraction step, we have to carry out preprocessing on the documents contained in the corpus in order to prepare a data set suitable for the application of the NER approach. 14 | 15 | By trying different models and different types of architecture, we have a better chance of obtaining an efficient model for extracting entities. After obtaining the entities, we need to know the context in which they occur. In order to contextualize the entities, we first have to prepare the data so that it is adequate for the relation-extraction task, and then train a model that predicts the relationships between the entities as accurately as possible. 16 | 17 | Once we have a model that performs well at recognizing named entities and another that performs well at extracting relationships, we can extract the triples from a collection of reports (entities and the relations that contextualize them) and then build our knowledge graph. 18 | 19 | Since the objective is not only the construction of the graph but also its usefulness and efficiency in countering a cyber-attack, an analysis step is added at the end, which consists of analyzing and querying the constructed knowledge graph.
The figure above illustrates the approach followed. 20 | 21 | **Detailed design of the solution:** 22 | 23 | ![image](https://user-images.githubusercontent.com/77699359/205594972-7cfc9dd3-a9cd-4409-80b0-fd58e82ff6ba.png) 24 | 25 | **Model Performance for the NER task** 26 | 27 | ![image](https://user-images.githubusercontent.com/77699359/205599401-dc4d6fbc-706d-4af1-aaff-9840b448f114.png) 28 | 29 | **Model Performance for the Relation-Extraction task** 30 | 31 | ![image](https://user-images.githubusercontent.com/77699359/205599828-dea66006-f39b-432f-a7cf-fdcb916fd763.png) 32 | 33 | **Small Overview of the constructed graph** 34 | 35 | ![image](https://user-images.githubusercontent.com/77699359/205599934-a3238a63-029e-4084-afc2-a10bafe0241a.png) 36 | 37 | 38 | 39 | 40 | --------------------------------------------------------------------------------
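The KG-construction step described in the README (triples loaded into a graph with py2neo, then queried with Cypher) can be illustrated with a small sketch. The snippet below only generates Cypher `MERGE` statements from extracted `(source, tag1, relation_label, target, tag2)` rows and does not connect to a database; the example rows, the helper names, and the quoting logic are illustrative assumptions, not the repository's actual code.

```python
# Turn extracted (source, tag1, relation_label, target, tag2) triples into
# Cypher MERGE statements that build the knowledge graph.
# Sketch only: no database connection is made here; with py2neo one could
# run each statement against a live Neo4j instance instead.

def escape(value):
    """Naive single-quote escaping, sufficient for this sketch."""
    return value.replace("'", "\\'")

def triple_to_cypher(source, tag1, relation, target, tag2):
    """Build one MERGE statement for a single extracted triple.

    Entity tags become node labels and the relation label becomes the
    relationship type, so duplicate entities merge into a single node.
    """
    return (
        f"MERGE (a:{tag1} {{name: '{escape(source)}'}}) "
        f"MERGE (b:{tag2} {{name: '{escape(target)}'}}) "
        f"MERGE (a)-[:{relation.upper()}]->(b)"
    )

# Example rows modeled on the df shown in the notebook output above.
triples = [
    ("Microsoft", "ORG", "hasproduct", "Windows", "PRODUCT"),
    ("Tsai", "ATTACKER", "hasattacklocation", "China", "LOC"),
]
statements = [triple_to_cypher(*t) for t in triples]
for stmt in statements:
    print(stmt)
```

With py2neo, each generated statement could be passed to `Graph.run`, or equivalent `Node`/`Relationship` objects could be merged directly; `MERGE` (rather than `CREATE`) keeps the graph free of duplicate entity nodes when the same entity appears in many sentences.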