├── Named_Entity_Recognition-LSTM-CNN-CRF-Tutorial.ipynb ├── README.md └── bio2bioes.py /Named_Entity_Recognition-LSTM-CNN-CRF-Tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this tutorial we will demonstrate how to implement a state of the art Bi-directional LSTM-CNN-CRF architecture (Published at ACL'16. [Link To Paper](http://www.aclweb.org/anthology/P16-1101)) for Named Entity Recognition using Pytorch. \n", 15 | "\n", 16 | "The main aim of the tutorial is to make the audience comfortable with pytorch using this tutorial and give a step-by-step walk through of the Bi-LSTM-CNN-CRF architecture for NER. Some familiarity with pytorch (or any other deep learning framework) would definitely be a plus. \n", 17 | "\n", 18 | "The agenda of this tutorial is as follows:\n", 19 | "\n", 20 | "1. Getting Ready with the data \n", 21 | "2. Network Definition. This includes\n", 22 | " * CNN Encoder for Character Level representation.\n", 23 | " * Bi-directional LSTM for Word-Level Encoding.\n", 24 | " * Conditional Random Fields(CRF) for output decoding\n", 25 | "3. Training \n", 26 | "4. Model testing\n", 27 | "\n", 28 | "This tutorial draws its content/design heavily from [this](https://github.com/ZhixiuYe/NER-pytorch) Github implementation of NER model. We reuse their data preprocessing/Model creation methodology. This helps in focussing more on explaining model architecture and it's translation from formulae to code. " 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "**Authors:**\n", 36 | "[**Anirudh Ganesh**](https://www.linkedin.com/in/anirudh-ganesh95/),\n", 37 | "[**Peddamail Jayavardhan Reddy**](https://www.linkedin.com/in/jayavardhan-reddy-peddamail-6b4125a0/)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Data Preparation\n", 45 | "\n", 46 | "The paper uses the English data from CoNLL 2003 shared task\\[1\\], which is present in the \"data\" directory of this project. We will later apply more preprocessing steps to generate tag mapping, word mapping and character mapping. The data set contains four different types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC and uses the BIO tagging scheme\n", 47 | "\n", 48 | "BIO tagging Scheme:\n", 49 | "\n", 50 | " I - Word is inside a phrase of type TYPE\n", 51 | " B - If two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE \n", 52 | " O - Word is not part of a phrase\n", 53 | " \n", 54 | "Example of English-NER sentence available in the data:\n", 55 | " \n", 56 | " U.N. NNP I-NP I-ORG \n", 57 | " official NN I-NP O \n", 58 | " Ekeus NNP I-NP I-PER \n", 59 | " heads VBZ I-VP O \n", 60 | " for IN I-PP O \n", 61 | " Baghdad NNP I-NP I-LOC \n", 62 | " . . 
O O \n", 63 | " \n", 64 | "Data Split(We use the same split as mentioned in paper):\n", 65 | "\n", 66 | " Training Data - eng.train\n", 67 | " Validation Data - eng.testa\n", 68 | " Testing Data - eng.testb\n", 69 | " \n", 70 | "\n", 71 | " To get started we first import the necessary libraries" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 1, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "from __future__ import print_function\n", 81 | "from collections import OrderedDict\n", 82 | "\n", 83 | "import torch\n", 84 | "import torch.nn as nn\n", 85 | "from torch.nn import init\n", 86 | "from torch.autograd import Variable\n", 87 | "from torch import autograd\n", 88 | "\n", 89 | "import time\n", 90 | "import _pickle as cPickle\n", 91 | "\n", 92 | "import urllib\n", 93 | "import matplotlib.pyplot as plt\n", 94 | "plt.rcParams['figure.dpi'] = 80\n", 95 | "plt.style.use('seaborn-pastel')\n", 96 | "\n", 97 | "import os\n", 98 | "import sys\n", 99 | "import codecs\n", 100 | "import re\n", 101 | "import numpy as np" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "##### Define constants and paramaters" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "We now define some constants and parameters that we will be using later" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 2, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "#parameters for the Model\n", 125 | "parameters = OrderedDict()\n", 126 | "parameters['train'] = \"./data/eng.train\" #Path to train file\n", 127 | "parameters['dev'] = \"./data/eng.testa\" #Path to test file\n", 128 | "parameters['test'] = \"./data/eng.testb\" #Path to dev file\n", 129 | "parameters['tag_scheme'] = \"BIOES\" #BIO or BIOES\n", 130 | "parameters['lower'] = True # Boolean variable to control lowercasing of words\n", 131 | "parameters['zeros'] = True # Boolean variable to control replacement of all digits by 0 \n", 132 | "parameters['char_dim'] = 30 #Char embedding dimension\n", 133 | "parameters['word_dim'] = 100 #Token embedding dimension\n", 134 | "parameters['word_lstm_dim'] = 200 #Token LSTM hidden layer size\n", 135 | "parameters['word_bidirect'] = True #Use a bidirectional LSTM for words\n", 136 | "parameters['embedding_path'] = \"./data/glove.6B.100d.txt\" #Location of pretrained embeddings\n", 137 | "parameters['all_emb'] = 1 #Load all embeddings\n", 138 | "parameters['crf'] =1 #Use CRF (0 to disable)\n", 139 | "parameters['dropout'] = 0.5 #Droupout on the input (0 = no dropout)\n", 140 | "parameters['epoch'] = 50 #Number of epochs to run\"\n", 141 | "parameters['weights'] = \"\" #path to Pretrained for from a previous run\n", 142 | "parameters['name'] = \"self-trained-model\" # Model name\n", 143 | "parameters['gradient_clip']=5.0\n", 144 | "models_path = \"./models/\" #path to saved models\n", 145 | "\n", 146 | "#GPU\n", 147 | "parameters['use_gpu'] = torch.cuda.is_available() #GPU Check\n", 148 | "use_gpu = parameters['use_gpu']\n", 149 | "\n", 150 | "parameters['reload'] = \"./models/pre-trained-model\" \n", 151 | "\n", 152 | "#Constants\n", 153 | "START_TAG = ''\n", 154 | "STOP_TAG = ''" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 3, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "#paths to files \n", 164 | "#To stored mapping file\n", 165 | "mapping_file = './data/mapping.pkl'\n", 166 | "\n", 167 | "#To stored 
model\n", 168 | "name = parameters['name']\n", 169 | "model_name = models_path + name #get_name(parameters)\n", 170 | "\n", 171 | "if not os.path.exists(models_path):\n", 172 | " os.makedirs(models_path)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "##### Load data and preprocess" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Firstly, the data is loaded from the train, dev and test files into a list of sentences.\n", 187 | "\n", 188 | "Preprocessing:\n", 189 | "\n", 190 | " * All the digits in the words are replaced by 0\n", 191 | " \n", 192 | "Why this preprocessing step?\n", 193 | " * For the Named Entity Recognition task, the information present in numerical digits doesnot help in predicting the entity. So, we replace all the digits by 0. So, now the model can concentrate on more important alphabets." 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 4, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "def zero_digits(s):\n", 203 | " \"\"\"\n", 204 | " Replace every digit in a string by a zero.\n", 205 | " \"\"\"\n", 206 | " return re.sub('\\d', '0', s)\n", 207 | "\n", 208 | "def load_sentences(path, zeros):\n", 209 | " \"\"\"\n", 210 | " Load sentences. A line must contain at least a word and its tag.\n", 211 | " Sentences are separated by empty lines.\n", 212 | " \"\"\"\n", 213 | " sentences = []\n", 214 | " sentence = []\n", 215 | " for line in codecs.open(path, 'r', 'utf8'):\n", 216 | " line = zero_digits(line.rstrip()) if zeros else line.rstrip()\n", 217 | " if not line:\n", 218 | " if len(sentence) > 0:\n", 219 | " if 'DOCSTART' not in sentence[0][0]:\n", 220 | " sentences.append(sentence)\n", 221 | " sentence = []\n", 222 | " else:\n", 223 | " word = line.split()\n", 224 | " assert len(word) >= 2\n", 225 | " sentence.append(word)\n", 226 | " if len(sentence) > 0:\n", 227 | " if 'DOCSTART' not in sentence[0][0]:\n", 228 | " sentences.append(sentence)\n", 229 | " return sentences" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 5, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "train_sentences = load_sentences(parameters['train'], parameters['zeros'])\n", 239 | "test_sentences = load_sentences(parameters['test'], parameters['zeros'])\n", 240 | "dev_sentences = load_sentences(parameters['dev'], parameters['zeros'])" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "##### Update tagging scheme" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "Different types of tagging schemes can be used for NER. We update the tags for train, test and dev data ( depending on the parameters \\[ tag_scheme \\] ).\n", 255 | "\n", 256 | "In the paper, the authors use the tagging Scheme ( BIOES ) rather than BIO (which is used by the dataset). 
So, we need to first update the data to convert tag scheme from BIO to BIOES.\n", 257 | "\n", 258 | "BIOES tagging scheme:\n", 259 | "\n", 260 | " I - Word is inside a phrase of type TYPE\n", 261 | " B - If two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE \n", 262 | " O - Word is not part of a phrase\n", 263 | " E - End ( E will not appear in a prefix-only partial match )\n", 264 | " S - Single" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 6, 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "def iob2(tags):\n", 274 | " \"\"\"\n", 275 | " Check that tags have a valid BIO format.\n", 276 | " Tags in BIO1 format are converted to BIO2.\n", 277 | " \"\"\"\n", 278 | " for i, tag in enumerate(tags):\n", 279 | " if tag == 'O':\n", 280 | " continue\n", 281 | " split = tag.split('-')\n", 282 | " if len(split) != 2 or split[0] not in ['I', 'B']:\n", 283 | " return False\n", 284 | " if split[0] == 'B':\n", 285 | " continue\n", 286 | " elif i == 0 or tags[i - 1] == 'O': # conversion IOB1 to IOB2\n", 287 | " tags[i] = 'B' + tag[1:]\n", 288 | " elif tags[i - 1][1:] == tag[1:]:\n", 289 | " continue\n", 290 | " else: # conversion IOB1 to IOB2\n", 291 | " tags[i] = 'B' + tag[1:]\n", 292 | " return True\n", 293 | "\n", 294 | "def iob_iobes(tags):\n", 295 | " \"\"\"\n", 296 | " the function is used to convert\n", 297 | " BIO -> BIOES tagging\n", 298 | " \"\"\"\n", 299 | " new_tags = []\n", 300 | " for i, tag in enumerate(tags):\n", 301 | " if tag == 'O':\n", 302 | " new_tags.append(tag)\n", 303 | " elif tag.split('-')[0] == 'B':\n", 304 | " if i + 1 != len(tags) and \\\n", 305 | " tags[i + 1].split('-')[0] == 'I':\n", 306 | " new_tags.append(tag)\n", 307 | " else:\n", 308 | " new_tags.append(tag.replace('B-', 'S-'))\n", 309 | " elif tag.split('-')[0] == 'I':\n", 310 | " if i + 1 < len(tags) and \\\n", 311 | " tags[i + 1].split('-')[0] == 'I':\n", 312 | " new_tags.append(tag)\n", 313 | " else:\n", 314 | " new_tags.append(tag.replace('I-', 'E-'))\n", 315 | " else:\n", 316 | " raise Exception('Invalid IOB format!')\n", 317 | " return new_tags\n", 318 | "\n", 319 | "def update_tag_scheme(sentences, tag_scheme):\n", 320 | " \"\"\"\n", 321 | " Check and update sentences tagging scheme to BIO2\n", 322 | " Only BIO1 and BIO2 schemes are accepted for input data.\n", 323 | " \"\"\"\n", 324 | " for i, s in enumerate(sentences):\n", 325 | " tags = [w[-1] for w in s]\n", 326 | " # Check that tags are given in the BIO format\n", 327 | " if not iob2(tags):\n", 328 | " s_str = '\\n'.join(' '.join(w) for w in s)\n", 329 | " raise Exception('Sentences should be given in BIO format! 
' +\n", 330 | " 'Please check sentence %i:\\n%s' % (i, s_str))\n", 331 | " if tag_scheme == 'BIOES':\n", 332 | " new_tags = iob_iobes(tags)\n", 333 | " for word, new_tag in zip(s, new_tags):\n", 334 | " word[-1] = new_tag\n", 335 | " else:\n", 336 | " raise Exception('Wrong tagging scheme!')" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 7, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "update_tag_scheme(train_sentences, parameters['tag_scheme'])\n", 346 | "update_tag_scheme(dev_sentences, parameters['tag_scheme'])\n", 347 | "update_tag_scheme(test_sentences, parameters['tag_scheme'])" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "##### Create Mappings for Words, Characters and Tags" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "After we have updated the tag scheme. We now have a list of sentences which are words along with their modified tags. Now, we want to map these individual words, tags and characters in each word, to unique numerical ID's so that each unique word, character and tag in the vocabulary is represented by a particular integer ID. To do this, we first create a functions that do these mapping for us" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "##### Why mapping is important?" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "These indices for words, tags and characters help us employ matrix (tensor) operations inside the neural network architecture, which are considerably faster." 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 8, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "def create_dico(item_list):\n", 385 | " \"\"\"\n", 386 | " Create a dictionary of items from a list of list of items.\n", 387 | " \"\"\"\n", 388 | " assert type(item_list) is list\n", 389 | " dico = {}\n", 390 | " for items in item_list:\n", 391 | " for item in items:\n", 392 | " if item not in dico:\n", 393 | " dico[item] = 1\n", 394 | " else:\n", 395 | " dico[item] += 1\n", 396 | " return dico\n", 397 | "\n", 398 | "def create_mapping(dico):\n", 399 | " \"\"\"\n", 400 | " Create a mapping (item to ID / ID to item) from a dictionary.\n", 401 | " Items are ordered by decreasing frequency.\n", 402 | " \"\"\"\n", 403 | " sorted_items = sorted(dico.items(), key=lambda x: (-x[1], x[0]))\n", 404 | " id_to_item = {i: v[0] for i, v in enumerate(sorted_items)}\n", 405 | " item_to_id = {v: k for k, v in id_to_item.items()}\n", 406 | " return item_to_id, id_to_item\n", 407 | "\n", 408 | "def word_mapping(sentences, lower):\n", 409 | " \"\"\"\n", 410 | " Create a dictionary and a mapping of words, sorted by frequency.\n", 411 | " \"\"\"\n", 412 | " words = [[x[0].lower() if lower else x[0] for x in s] for s in sentences]\n", 413 | " dico = create_dico(words)\n", 414 | " dico[''] = 10000000 #UNK tag for unknown words\n", 415 | " word_to_id, id_to_word = create_mapping(dico)\n", 416 | " print(\"Found %i unique words (%i in total)\" % (\n", 417 | " len(dico), sum(len(x) for x in words)\n", 418 | " ))\n", 419 | " return dico, word_to_id, id_to_word\n", 420 | "\n", 421 | "def char_mapping(sentences):\n", 422 | " \"\"\"\n", 423 | " Create a dictionary and mapping of characters, sorted by frequency.\n", 424 | " \"\"\"\n", 425 | " chars = [\"\".join([w[0] for w in s]) for s in sentences]\n", 426 | 
" dico = create_dico(chars)\n", 427 | " char_to_id, id_to_char = create_mapping(dico)\n", 428 | " print(\"Found %i unique characters\" % len(dico))\n", 429 | " return dico, char_to_id, id_to_char\n", 430 | "\n", 431 | "def tag_mapping(sentences):\n", 432 | " \"\"\"\n", 433 | " Create a dictionary and a mapping of tags, sorted by frequency.\n", 434 | " \"\"\"\n", 435 | " tags = [[word[-1] for word in s] for s in sentences]\n", 436 | " dico = create_dico(tags)\n", 437 | " dico[START_TAG] = -1\n", 438 | " dico[STOP_TAG] = -2\n", 439 | " tag_to_id, id_to_tag = create_mapping(dico)\n", 440 | " print(\"Found %i unique named entity tags\" % len(dico))\n", 441 | " return dico, tag_to_id, id_to_tag" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": 9, 447 | "metadata": {}, 448 | "outputs": [ 449 | { 450 | "name": "stdout", 451 | "output_type": "stream", 452 | "text": [ 453 | "Found 17493 unique words (203621 in total)\n", 454 | "Found 75 unique characters\n", 455 | "Found 19 unique named entity tags\n" 456 | ] 457 | } 458 | ], 459 | "source": [ 460 | "dico_words,word_to_id,id_to_word = word_mapping(train_sentences, parameters['lower'])\n", 461 | "dico_chars, char_to_id, id_to_char = char_mapping(train_sentences)\n", 462 | "dico_tags, tag_to_id, id_to_tag = tag_mapping(train_sentences)" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "##### Preparing final dataset" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "The function prepare dataset returns a list of dictionaries ( one dictionary per each sentence )\n", 477 | "\n", 478 | "Each of the dictionary returned by the function contains\n", 479 | " 1. list of all words in the sentence\n", 480 | " 2. list of word index for all words in the sentence\n", 481 | " 3. list of lists, containing character id of each character for words in the sentence\n", 482 | " 4. list of tag for each word in the sentence." 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 10, 488 | "metadata": {}, 489 | "outputs": [], 490 | "source": [ 491 | "def lower_case(x,lower=False):\n", 492 | " if lower:\n", 493 | " return x.lower() \n", 494 | " else:\n", 495 | " return x" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 11, 501 | "metadata": {}, 502 | "outputs": [ 503 | { 504 | "name": "stdout", 505 | "output_type": "stream", 506 | "text": [ 507 | "14041 / 3250 / 3453 sentences in train / dev / test.\n" 508 | ] 509 | } 510 | ], 511 | "source": [ 512 | "def prepare_dataset(sentences, word_to_id, char_to_id, tag_to_id, lower=False):\n", 513 | " \"\"\"\n", 514 | " Prepare the dataset. 
Return a list of lists of dictionaries containing:\n", 515 | " - word indexes\n", 516 | " - word char indexes\n", 517 | " - tag indexes\n", 518 | " \"\"\"\n", 519 | " data = []\n", 520 | " for s in sentences:\n", 521 | " str_words = [w[0] for w in s]\n", 522 | " words = [word_to_id[lower_case(w,lower) if lower_case(w,lower) in word_to_id else '']\n", 523 | " for w in str_words]\n", 524 | " # Skip characters that are not in the training set\n", 525 | " chars = [[char_to_id[c] for c in w if c in char_to_id]\n", 526 | " for w in str_words]\n", 527 | " tags = [tag_to_id[w[-1]] for w in s]\n", 528 | " data.append({\n", 529 | " 'str_words': str_words,\n", 530 | " 'words': words,\n", 531 | " 'chars': chars,\n", 532 | " 'tags': tags,\n", 533 | " })\n", 534 | " return data\n", 535 | "\n", 536 | "train_data = prepare_dataset(\n", 537 | " train_sentences, word_to_id, char_to_id, tag_to_id, parameters['lower']\n", 538 | ")\n", 539 | "dev_data = prepare_dataset(\n", 540 | " dev_sentences, word_to_id, char_to_id, tag_to_id, parameters['lower']\n", 541 | ")\n", 542 | "test_data = prepare_dataset(\n", 543 | " test_sentences, word_to_id, char_to_id, tag_to_id, parameters['lower']\n", 544 | ")\n", 545 | "print(\"{} / {} / {} sentences in train / dev / test.\".format(len(train_data), len(dev_data), len(test_data)))" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": {}, 551 | "source": [ 552 | "We are done with the preprocessing step for input data. It ready to be given as input to the model ! ! !" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "##### Load Word Embeddings\n", 560 | "\n", 561 | "Now, We move to the next step of loading the pre-trained word embeddings.\n", 562 | "\n", 563 | "The paper uses glove vectors 100 dimension vectors trained on the ( Wikipedia 2014 + Gigaword 5 ) corpus containing 6 Billion Words. The word embedding file ( glove.6B.100d.txt ) is placed in the data folder." 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 12, 569 | "metadata": {}, 570 | "outputs": [ 571 | { 572 | "name": "stdout", 573 | "output_type": "stream", 574 | "text": [ 575 | "Loaded 400000 pretrained embeddings.\n" 576 | ] 577 | } 578 | ], 579 | "source": [ 580 | "all_word_embeds = {}\n", 581 | "for i, line in enumerate(codecs.open(parameters['embedding_path'], 'r', 'utf-8')):\n", 582 | " s = line.strip().split()\n", 583 | " if len(s) == parameters['word_dim'] + 1:\n", 584 | " all_word_embeds[s[0]] = np.array([float(i) for i in s[1:]])\n", 585 | "\n", 586 | "#Intializing Word Embedding Matrix\n", 587 | "word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), parameters['word_dim']))\n", 588 | "\n", 589 | "for w in word_to_id:\n", 590 | " if w in all_word_embeds:\n", 591 | " word_embeds[word_to_id[w]] = all_word_embeds[w]\n", 592 | " elif w.lower() in all_word_embeds:\n", 593 | " word_embeds[word_to_id[w]] = all_word_embeds[w.lower()]\n", 594 | "\n", 595 | "print('Loaded %i pretrained embeddings.' % len(all_word_embeds))" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "##### Storing Processed Data for Reuse\n", 603 | "\n", 604 | "We can store the preprocessed data and the embedding matrix for future reuse. This helps us avoid the time taken by the step of preprocessing, when we are trying to tune the hyper parameters for the model." 
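As a complementary sketch (ours, not a cell from the original notebook), this is how the pickled mappings written by the next code cell can be restored in a later session, so the preprocessing steps above can be skipped; it assumes `mapping_file` still points at the pickle produced below.

```python
import _pickle as cPickle

# Sketch: restore the mappings dumped by the following cell
# (assumes ./data/mapping.pkl was written by an earlier run of this notebook).
with open(mapping_file, 'rb') as f:
    mappings = cPickle.load(f)

word_to_id = mappings['word_to_id']
tag_to_id = mappings['tag_to_id']
char_to_id = mappings['char_to_id']
word_embeds = mappings['word_embeds']
print('Restored %i words, %i tags, %i characters' % (len(word_to_id), len(tag_to_id), len(char_to_id)))
```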
605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": 13, 610 | "metadata": {}, 611 | "outputs": [ 612 | { 613 | "name": "stdout", 614 | "output_type": "stream", 615 | "text": [ 616 | "word_to_id: 17493\n" 617 | ] 618 | } 619 | ], 620 | "source": [ 621 | "with open(mapping_file, 'wb') as f:\n", 622 | " mappings = {\n", 623 | " 'word_to_id': word_to_id,\n", 624 | " 'tag_to_id': tag_to_id,\n", 625 | " 'char_to_id': char_to_id,\n", 626 | " 'parameters': parameters,\n", 627 | " 'word_embeds': word_embeds\n", 628 | " }\n", 629 | " cPickle.dump(mappings, f)\n", 630 | "\n", 631 | "print('word_to_id: ', len(word_to_id))" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "metadata": {}, 637 | "source": [ 638 | "### Model\n" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "The model that we are presenting is a complicated one, since its a hybridized network using LSTMs and CNNs. So in order to break down the complexity, we have attempted to simplify the process by splitting up operations into individual functions that we can go over part by part. This hopefully makes the whole thing more easily digestable and gives a more intuitive understanding of the whole process." 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": {}, 651 | "source": [ 652 | "##### Initialization of weights" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": {}, 658 | "source": [ 659 | "We start with the init_embedding function, which just initializes the embedding layer by pooling from a random sample. \n", 660 | "\n", 661 | "The distribution is pooled from $-\\sqrt{\\frac{3}{V}}$ to $+\\sqrt{\\frac{3}{V}}$ where $V$ is the embedding dimension size." 662 | ] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "execution_count": 14, 667 | "metadata": {}, 668 | "outputs": [], 669 | "source": [ 670 | "def init_embedding(input_embedding):\n", 671 | " \"\"\"\n", 672 | " Initialize embedding\n", 673 | " \"\"\"\n", 674 | " bias = np.sqrt(3.0 / input_embedding.size(1))\n", 675 | " nn.init.uniform(input_embedding, -bias, bias)" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "Similar to the initialization above, except this is for the linear layer." 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": 15, 688 | "metadata": {}, 689 | "outputs": [], 690 | "source": [ 691 | "def init_linear(input_linear):\n", 692 | " \"\"\"\n", 693 | " Initialize linear transformation\n", 694 | " \"\"\"\n", 695 | " bias = np.sqrt(6.0 / (input_linear.weight.size(0) + input_linear.weight.size(1)))\n", 696 | " nn.init.uniform(input_linear.weight, -bias, bias)\n", 697 | " if input_linear.bias is not None:\n", 698 | " input_linear.bias.data.zero_()" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "This is the initialization scheme for the LSTM layers. \n", 706 | "\n", 707 | "The LSTM layers are initialized by uniform sampling from $-\\sqrt{\\frac{6}{r+c}}$ to $+\\sqrt{\\frac{6}{r+c}}$. Where $r$ is the number of rows, $c$ is the number of columns (based on the shape of the weight matrix)." 
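Before looking at the initializer itself, here is a small illustrative sketch (our addition, assuming a standard PyTorch `nn.LSTM`) of why the code below divides `weight.size(0)` by 4: PyTorch stacks the weights of the four gates (input, forget, cell, output) along dimension 0, so the number of rows belonging to a single gate is `hidden_size`, not `4 * hidden_size`.

```python
import numpy as np
import torch.nn as nn

# weight_ih_l0 has shape (4 * hidden_size, input_size): the four gate
# matrices stacked along dim 0.
lstm = nn.LSTM(input_size=125,   # word_dim (100) + char CNN out_channels (25), as in the model below
               hidden_size=200,
               bidirectional=True)
w = lstm.weight_ih_l0
print(tuple(w.shape))                  # (800, 125)

rows_per_gate = w.size(0) // 4         # 200 rows per gate
bound = np.sqrt(6.0 / (rows_per_gate + w.size(1)))
print(round(bound, 3))                 # sampling range used by init_lstm below (~0.136)
```

This makes the sampling range the Glorot/Xavier-style uniform bound computed per gate rather than for the full stacked weight matrix.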
708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": 16, 713 | "metadata": {}, 714 | "outputs": [], 715 | "source": [ 716 | "def init_lstm(input_lstm):\n", 717 | " \"\"\"\n", 718 | " Initialize lstm\n", 719 | " \n", 720 | " PyTorch weights parameters:\n", 721 | " \n", 722 | " weight_ih_l[k]: the learnable input-hidden weights of the k-th layer,\n", 723 | " of shape `(hidden_size * input_size)` for `k = 0`. Otherwise, the shape is\n", 724 | " `(hidden_size * hidden_size)`\n", 725 | " \n", 726 | " weight_hh_l[k]: the learnable hidden-hidden weights of the k-th layer,\n", 727 | " of shape `(hidden_size * hidden_size)` \n", 728 | " \"\"\"\n", 729 | " \n", 730 | " # Weights init for forward layer\n", 731 | " for ind in range(0, input_lstm.num_layers):\n", 732 | " \n", 733 | " ## Gets the weights Tensor from our model, for the input-hidden weights in our current layer\n", 734 | " weight = eval('input_lstm.weight_ih_l' + str(ind))\n", 735 | " \n", 736 | " # Initialize the sampling range\n", 737 | " sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))\n", 738 | " \n", 739 | " # Randomly sample from our samping range using uniform distribution and apply it to our current layer\n", 740 | " nn.init.uniform(weight, -sampling_range, sampling_range)\n", 741 | " \n", 742 | " # Similar to above but for the hidden-hidden weights of the current layer\n", 743 | " weight = eval('input_lstm.weight_hh_l' + str(ind))\n", 744 | " sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))\n", 745 | " nn.init.uniform(weight, -sampling_range, sampling_range)\n", 746 | " \n", 747 | " \n", 748 | " # We do the above again, for the backward layer if we are using a bi-directional LSTM (our final model uses this)\n", 749 | " if input_lstm.bidirectional:\n", 750 | " for ind in range(0, input_lstm.num_layers):\n", 751 | " weight = eval('input_lstm.weight_ih_l' + str(ind) + '_reverse')\n", 752 | " sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))\n", 753 | " nn.init.uniform(weight, -sampling_range, sampling_range)\n", 754 | " weight = eval('input_lstm.weight_hh_l' + str(ind) + '_reverse')\n", 755 | " sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))\n", 756 | " nn.init.uniform(weight, -sampling_range, sampling_range)\n", 757 | "\n", 758 | " # Bias initialization steps\n", 759 | " \n", 760 | " # We initialize them to zero except for the forget gate bias, which is initialized to 1\n", 761 | " if input_lstm.bias:\n", 762 | " for ind in range(0, input_lstm.num_layers):\n", 763 | " bias = eval('input_lstm.bias_ih_l' + str(ind))\n", 764 | " \n", 765 | " # Initializing to zero\n", 766 | " bias.data.zero_()\n", 767 | " \n", 768 | " # This is the range of indices for our forget gates for each LSTM cell\n", 769 | " bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1\n", 770 | " \n", 771 | " #Similar for the hidden-hidden layer\n", 772 | " bias = eval('input_lstm.bias_hh_l' + str(ind))\n", 773 | " bias.data.zero_()\n", 774 | " bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1\n", 775 | " \n", 776 | " # Similar to above, we do for backward layer if we are using a bi-directional LSTM \n", 777 | " if input_lstm.bidirectional:\n", 778 | " for ind in range(0, input_lstm.num_layers):\n", 779 | " bias = eval('input_lstm.bias_ih_l' + str(ind) + '_reverse')\n", 780 | " bias.data.zero_()\n", 781 | " bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1\n", 782 | " bias = eval('input_lstm.bias_hh_l' + str(ind) + 
'_reverse')\n", 783 | " bias.data.zero_()\n", 784 | " bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1" 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "##### CRF Layer\n", 792 | "\n", 793 | "We have two options: \n", 794 | "\n", 795 | "* softmax: normalize the scores into a vector such that can be interpreted as the probability that the word belongs to class. Eventually, the probability of a sequence of tag $y$ is the product of all tags.\n", 796 | "\n", 797 | "\n", 798 | "* linear-chain CRF: the first method makes local choices. In other words, even if we capture some information from the context thanks to the bi-LSTM, the tagging decision is still local. We don’t make use of the neighbooring tagging decisions. Given a sequence of words $w_1,…,w_m$, a sequence of score vectors $s_1,…,s_m$ and a sequence of tags $y_1,…,y_m$, a linear-chain CRF defines a global score $C \\in \\mathbb{R}$ such that\n", 799 | "\n", 800 | "$$% $$\n", 805 | "\n", 806 | "where $T$ is a transition matrix in $R^{9×9}$ and $e,b \\in R^9$ are vectors of scores that capture the cost of beginning or ending with a given tag. The use of the matrix $T$ captures linear (one step) dependencies between tagging decisions.\n", 807 | "\n", 808 | "The motivation behind CRFs was to generate sentence level likelihoods for optimal tags. What that means is for each word we estimate maximum likelihood and then we use the Viterbi algorithm to decode the tag sequence optimally.\n", 809 | "\n", 810 | "\n", 811 | "**Advantages of CRF over Softmax:**\n", 812 | "* Softmax doesn't value any dependencies, this is a problem since NER the context heavily influences the tag that is assigned. This is solved by applying CRF as it takes into account the full sequence to assign the tag. \n", 813 | "* *Example: I-ORG cannot directly follow I-PER.*\n", 814 | "\n", 815 | "\n", 816 | "(Image Source)\n", 817 | "\n", 818 | "The figure shows a simple CRF network, in our case we have the inputs feeding in from our BiLSTMs, but otherwise the structure largely remains the same." 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "##### Evaluation schemes: Forward pass and Viterbi algorithm" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "Recall that the CRF computes a conditional probability. Let $y$ be a tag sequence and $x$ an input sequence of words. Then we compute\n", 833 | "\n", 834 | "$$P(y|x) = \\frac{\\exp{(\\text{Score}(x, y)})}{\\sum_{y'} \\exp{(\\text{Score}(x, y')})}$$\n", 835 | "\n", 836 | "Where the score is determined by defining some log potentials $\\log \\psi_i(x,y)$ such that\n", 837 | "\n", 838 | "$$\\text{Score}(x,y) = \\sum_i \\log \\psi_i(x,y)$$\n", 839 | "\n", 840 | "In our model, we define two kinds of potentials: emission and transition. The emission potential for the word at index $i$ comes from the hidden state of the Bi-LSTM at timestep $i$. The transition scores are stored in a $|T|x|T|$ matrix $P$, where $T$ is the tag set. In my implementation, $P_{j,k}$ is the score of transitioning to tag $j$ from tag $k$. 
So:\n", 841 | "\n", 842 | "$$\\text{Score}(x,y) = \\sum_i \\log \\psi_\\text{EMIT}(y_i \\rightarrow x_i) + \\log \\psi_\\text{TRANS}(y_{i-1} \\rightarrow y_i)$$\n", 843 | "$$= \\sum_i h_i[y_i] + \\textbf{P}_{y_i, y_{i-1}}$$" 844 | ] 845 | }, 846 | { 847 | "cell_type": "markdown", 848 | "metadata": {}, 849 | "source": [ 850 | "##### Helper Functions" 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "metadata": {}, 856 | "source": [ 857 | "Now, we define some helper functions for numerical operations and score calculations" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 17, 863 | "metadata": {}, 864 | "outputs": [], 865 | "source": [ 866 | "def log_sum_exp(vec):\n", 867 | " '''\n", 868 | " This function calculates the score explained above for the forward algorithm\n", 869 | " vec 2D: 1 * tagset_size\n", 870 | " '''\n", 871 | " max_score = vec[0, argmax(vec)]\n", 872 | " max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])\n", 873 | " return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))\n", 874 | " \n", 875 | "def argmax(vec):\n", 876 | " '''\n", 877 | " This function returns the max index in a vector\n", 878 | " '''\n", 879 | " _, idx = torch.max(vec, 1)\n", 880 | " return to_scalar(idx)\n", 881 | "\n", 882 | "def to_scalar(var):\n", 883 | " '''\n", 884 | " Function to convert pytorch tensor to a scalar\n", 885 | " '''\n", 886 | " return var.view(-1).data.tolist()[0]" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 | "metadata": {}, 892 | "source": [ 893 | "##### Helper function to calculate score\n", 894 | "\n", 895 | "This is a score function for our sentences. \n", 896 | "\n", 897 | "This function takes two things, a list of ground truths that tell us what the corresponding tags are, the other are the features which contains the supposed tagged parts of the function. Which is then used to compute the score." 
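As a quick illustration (toy numbers of our own, not taken from the notebook), the gold score computed by `score_sentences` below is just the sum of the transition scores along the START/STOP-padded tag path plus the emission score of each gold tag:

```python
import torch

# Hypothetical 4-tag example: 'O', 'S-PER', plus start/stop tags.
toy_tag_to_ix = {'O': 0, 'S-PER': 1, 'START': 2, 'STOP': 3}
transitions = torch.zeros(4, 4)                       # transitions[to, from], all zero here
feats = torch.FloatTensor([[1.0, 0.2, 0.0, 0.0],      # emission scores: len(sentence) x tagset_size
                           [0.1, 2.0, 0.0, 0.0]])
tags = torch.LongTensor([0, 1])                       # gold path: O, S-PER

pad_start_tags = torch.cat([torch.LongTensor([toy_tag_to_ix['START']]), tags])
pad_stop_tags = torch.cat([tags, torch.LongTensor([toy_tag_to_ix['STOP']])])
r = torch.LongTensor(range(feats.size(0)))

score = torch.sum(transitions[pad_stop_tags, pad_start_tags]) + torch.sum(feats[r, tags])
print(score)   # 3.0 = emissions 1.0 + 2.0, all transition scores zero
```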
898 | ] 899 | }, 900 | { 901 | "cell_type": "code", 902 | "execution_count": 18, 903 | "metadata": {}, 904 | "outputs": [], 905 | "source": [ 906 | "def score_sentences(self, feats, tags):\n", 907 | " # tags is ground_truth, a list of ints, length is len(sentence)\n", 908 | " # feats is a 2D tensor, len(sentence) * tagset_size\n", 909 | " r = torch.LongTensor(range(feats.size()[0]))\n", 910 | " if self.use_gpu:\n", 911 | " r = r.cuda()\n", 912 | " pad_start_tags = torch.cat([torch.cuda.LongTensor([self.tag_to_ix[START_TAG]]), tags])\n", 913 | " pad_stop_tags = torch.cat([tags, torch.cuda.LongTensor([self.tag_to_ix[STOP_TAG]])])\n", 914 | " else:\n", 915 | " pad_start_tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])\n", 916 | " pad_stop_tags = torch.cat([tags, torch.LongTensor([self.tag_to_ix[STOP_TAG]])])\n", 917 | "\n", 918 | " score = torch.sum(self.transitions[pad_stop_tags, pad_start_tags]) + torch.sum(feats[r, tags])\n", 919 | "\n", 920 | " return score" 921 | ] 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "metadata": {}, 926 | "source": [ 927 | "##### Implementation of Forward Algorithm" 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "execution_count": 19, 933 | "metadata": {}, 934 | "outputs": [], 935 | "source": [ 936 | "def forward_alg(self, feats):\n", 937 | " '''\n", 938 | " This function performs the forward algorithm explained above\n", 939 | " '''\n", 940 | " # calculate in log domain\n", 941 | " # feats is len(sentence) * tagset_size\n", 942 | " # initialize alpha with a Tensor with values all equal to -10000.\n", 943 | " \n", 944 | " # Do the forward algorithm to compute the partition function\n", 945 | " init_alphas = torch.Tensor(1, self.tagset_size).fill_(-10000.)\n", 946 | " \n", 947 | " # START_TAG has all of the score.\n", 948 | " init_alphas[0][self.tag_to_ix[START_TAG]] = 0.\n", 949 | " \n", 950 | " # Wrap in a variable so that we will get automatic backprop\n", 951 | " forward_var = autograd.Variable(init_alphas)\n", 952 | " if self.use_gpu:\n", 953 | " forward_var = forward_var.cuda()\n", 954 | " \n", 955 | " # Iterate through the sentence\n", 956 | " for feat in feats:\n", 957 | " # broadcast the emission score: it is the same regardless of\n", 958 | " # the previous tag\n", 959 | " emit_score = feat.view(-1, 1)\n", 960 | " \n", 961 | " # the ith entry of trans_score is the score of transitioning to\n", 962 | " # next_tag from i\n", 963 | " tag_var = forward_var + self.transitions + emit_score\n", 964 | " \n", 965 | " # The ith entry of next_tag_var is the value for the\n", 966 | " # edge (i -> next_tag) before we do log-sum-exp\n", 967 | " max_tag_var, _ = torch.max(tag_var, dim=1)\n", 968 | " \n", 969 | " # The forward variable for this tag is log-sum-exp of all the\n", 970 | " # scores.\n", 971 | " tag_var = tag_var - max_tag_var.view(-1, 1)\n", 972 | " \n", 973 | " # Compute log sum exp in a numerically stable way for the forward algorithm\n", 974 | " forward_var = max_tag_var + torch.log(torch.sum(torch.exp(tag_var), dim=1)).view(1, -1) # ).view(1, -1)\n", 975 | " terminal_var = (forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]).view(1, -1)\n", 976 | " alpha = log_sum_exp(terminal_var)\n", 977 | " # Z(x)\n", 978 | " return alpha" 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": {}, 984 | "source": [ 985 | "##### Viterbi decode" 986 | ] 987 | }, 988 | { 989 | "cell_type": "markdown", 990 | "metadata": {}, 991 | "source": [ 992 | "Viterbi decode is basically applying dynamic 
programming to choosing our tag sequence. Let’s suppose that we have the solution $\\tilde{s}_{t+1} (y^{t+1})$ for time steps $t + 1, ...., m$ for sequences that start with $y^{t+1}$ for each of the possible $y^{t+1}$. Then the solution $\\tilde{s}_t(y_t)$ for time steps $t, ..., m$ that starts with $y_t$ verifies \n", 993 | "\n", 994 | "$$ % $$\n", 999 | "\n", 1000 | "Then, we can easily define the probability of a given sequence of tags as\n", 1001 | "\n", 1002 | "$$ \\mathbb{P}(y_1, \\ldots, y_m) = \\frac{e^{C(y_1, \\ldots, y_m)}}{Z} $$" 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "markdown", 1007 | "metadata": {}, 1008 | "source": [ 1009 | "##### Implementation of Viterbi Algorithm" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "code", 1014 | "execution_count": 20, 1015 | "metadata": {}, 1016 | "outputs": [], 1017 | "source": [ 1018 | "def viterbi_algo(self, feats):\n", 1019 | " '''\n", 1020 | " In this function, we implement the viterbi algorithm explained above.\n", 1021 | " A Dynamic programming based approach to find the best tag sequence\n", 1022 | " '''\n", 1023 | " backpointers = []\n", 1024 | " # analogous to forward\n", 1025 | " \n", 1026 | " # Initialize the viterbi variables in log space\n", 1027 | " init_vvars = torch.Tensor(1, self.tagset_size).fill_(-10000.)\n", 1028 | " init_vvars[0][self.tag_to_ix[START_TAG]] = 0\n", 1029 | " \n", 1030 | " # forward_var at step i holds the viterbi variables for step i-1\n", 1031 | " forward_var = Variable(init_vvars)\n", 1032 | " if self.use_gpu:\n", 1033 | " forward_var = forward_var.cuda()\n", 1034 | " for feat in feats:\n", 1035 | " next_tag_var = forward_var.view(1, -1).expand(self.tagset_size, self.tagset_size) + self.transitions\n", 1036 | " _, bptrs_t = torch.max(next_tag_var, dim=1)\n", 1037 | " bptrs_t = bptrs_t.squeeze().data.cpu().numpy() # holds the backpointers for this step\n", 1038 | " next_tag_var = next_tag_var.data.cpu().numpy() \n", 1039 | " viterbivars_t = next_tag_var[range(len(bptrs_t)), bptrs_t] # holds the viterbi variables for this step\n", 1040 | " viterbivars_t = Variable(torch.FloatTensor(viterbivars_t))\n", 1041 | " if self.use_gpu:\n", 1042 | " viterbivars_t = viterbivars_t.cuda()\n", 1043 | " \n", 1044 | " # Now add in the emission scores, and assign forward_var to the set\n", 1045 | " # of viterbi variables we just computed\n", 1046 | " forward_var = viterbivars_t + feat\n", 1047 | " backpointers.append(bptrs_t)\n", 1048 | "\n", 1049 | " # Transition to STOP_TAG\n", 1050 | " terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]\n", 1051 | " terminal_var.data[self.tag_to_ix[STOP_TAG]] = -10000.\n", 1052 | " terminal_var.data[self.tag_to_ix[START_TAG]] = -10000.\n", 1053 | " best_tag_id = argmax(terminal_var.unsqueeze(0))\n", 1054 | " path_score = terminal_var[best_tag_id]\n", 1055 | " \n", 1056 | " # Follow the back pointers to decode the best path.\n", 1057 | " best_path = [best_tag_id]\n", 1058 | " for bptrs_t in reversed(backpointers):\n", 1059 | " best_tag_id = bptrs_t[best_tag_id]\n", 1060 | " best_path.append(best_tag_id)\n", 1061 | " \n", 1062 | " # Pop off the start tag (we dont want to return that to the caller)\n", 1063 | " start = best_path.pop()\n", 1064 | " assert start == self.tag_to_ix[START_TAG] # Sanity check\n", 1065 | " best_path.reverse()\n", 1066 | " return path_score, best_path" 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "code", 1071 | "execution_count": 21, 1072 | "metadata": {}, 1073 | "outputs": [], 1074 | "source": [ 1075 | "def forward_calc(self, 
sentence, chars, chars2_length, d):\n", 1076 | " \n", 1077 | " '''\n", 1078 | " The function calls viterbi decode and generates the \n", 1079 | " most probable sequence of tags for the sentence\n", 1080 | " '''\n", 1081 | " \n", 1082 | " # Get the emission scores from the BiLSTM\n", 1083 | " feats = self._get_lstm_features(sentence, chars, chars2_length, d)\n", 1084 | " # viterbi to get tag_seq\n", 1085 | " \n", 1086 | " # Find the best path, given the features.\n", 1087 | " if self.use_crf:\n", 1088 | " score, tag_seq = self.viterbi_decode(feats)\n", 1089 | " else:\n", 1090 | " score, tag_seq = torch.max(feats, 1)\n", 1091 | " tag_seq = list(tag_seq.cpu().data)\n", 1092 | "\n", 1093 | " return score, tag_seq" 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "### Details fo the Model" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "markdown", 1105 | "metadata": {}, 1106 | "source": [ 1107 | "##### 1. CNN model for generating character embeddings\n", 1108 | "\n", 1109 | "\n", 1110 | "Consider the word 'cat', we pad it on both ends to get our maximum word length ( this is mainly an implementation quirk since we can't have variable length layers at run time, our algorithm will ignore the pads).\n", 1111 | "\n", 1112 | "We then apply a convolution layer on top that generates spatial coherence across characters, we use a maxpool to extract meaningful features out of our convolution layer. This now gives us a dense vector representation of each word. This representation will be concatenated with the pre-trained GloVe embeddings using a simple lookup.\n", 1113 | "\n", 1114 | "\n", 1115 | "\n", 1116 | "Image Source\n", 1117 | "\n", 1118 | "\n", 1119 | "This snippet shows us how the CNN is implemented in pytorch\n", 1120 | "\n", 1121 | "`self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))`\n", 1122 | "\n", 1123 | "##### 2. Rest of the model (LSTM based) that generates tags for the given sequence\n", 1124 | "\n", 1125 | "The word-embeddings( glove+char embedding ) that we generated above, we feed to a bi-directional LSTM model. The LSTM model has 2 layers, \n", 1126 | "* The forward layer takes in a sequence of word vectors and generates a new vector based on what it has seen so far in the forward direction (starting from the start word up until current word) this vector can be thought of as a summary of all the words it has seen. \n", 1127 | "\n", 1128 | "* The backwards layer does the same but in opposite direction, i.e., from the end of the sentence to the current word.\n", 1129 | "\n", 1130 | "The forward vector and the backwards vector at current word concatanate to generate a unified representation.\n", 1131 | "\n", 1132 | "\n", 1133 | "Image Source\n", 1134 | "\n", 1135 | "This snippet shows us how the BiLSTM is implemented in pytorch\n", 1136 | "\n", 1137 | "`self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)`\n", 1138 | "\n", 1139 | "Finally, we have a linear layer to map hidden vectors to tag space." 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "markdown", 1144 | "metadata": {}, 1145 | "source": [ 1146 | "##### Main Model Implementation" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "markdown", 1151 | "metadata": {}, 1152 | "source": [ 1153 | "The get_lstm_features function returns the LSTM's tag vectors. The function performs all the steps mentioned above for the model.\n", 1154 | "\n", 1155 | "Steps:\n", 1156 | "1. 
It takes in characters, converts them to embeddings using our character CNN.\n", 1157 | "2. We concat Character Embeeding with glove vectors, use this as features that we feed to Bidirectional-LSTM. \n", 1158 | "3. The Bidirectional-LSTM generates outputs based on these set of features.\n", 1159 | "4. The output are passed through a linear layer to convert to tag space." 1160 | ] 1161 | }, 1162 | { 1163 | "cell_type": "code", 1164 | "execution_count": 22, 1165 | "metadata": {}, 1166 | "outputs": [], 1167 | "source": [ 1168 | "def get_lstm_features(self, sentence, chars2, chars2_length, d):\n", 1169 | " \n", 1170 | " chars_embeds = self.char_embeds(chars2).unsqueeze(1)\n", 1171 | " \n", 1172 | " ## Creating Character level representation using Convolutional Neural Netowrk\n", 1173 | " ## followed by a Maxpooling Layer\n", 1174 | " chars_cnn_out3 = self.char_cnn3(chars_embeds)\n", 1175 | " chars_embeds = nn.functional.max_pool2d(chars_cnn_out3,\n", 1176 | " kernel_size=(chars_cnn_out3.size(2), 1)).view(chars_cnn_out3.size(0), self.out_channels)\n", 1177 | "\n", 1178 | " ## Loading word embeddings\n", 1179 | " embeds = self.word_embeds(sentence)\n", 1180 | " \n", 1181 | " ## We concatenate the word embeddings and the character level representation\n", 1182 | " ## to create unified representation for each word\n", 1183 | " embeds = torch.cat((embeds, chars_embeds), 1)\n", 1184 | "\n", 1185 | " embeds = embeds.unsqueeze(1)\n", 1186 | " \n", 1187 | " ## Dropout on the unified embeddings\n", 1188 | " embeds = self.dropout(embeds)\n", 1189 | " \n", 1190 | " ## Word lstm\n", 1191 | " ## Takes words as input and generates a output at each step\n", 1192 | " lstm_out, _ = self.lstm(embeds)\n", 1193 | " \n", 1194 | " ## Reshaping the outputs from the lstm layer\n", 1195 | " lstm_out = lstm_out.view(len(sentence), self.hidden_dim*2)\n", 1196 | " \n", 1197 | " ## Dropout on the lstm output\n", 1198 | " lstm_out = self.dropout(lstm_out)\n", 1199 | " \n", 1200 | " ## Linear layer converts the ouput vectors to tag space\n", 1201 | " lstm_feats = self.hidden2tag(lstm_out)\n", 1202 | " \n", 1203 | " return lstm_feats" 1204 | ] 1205 | }, 1206 | { 1207 | "cell_type": "markdown", 1208 | "metadata": {}, 1209 | "source": [ 1210 | "##### Funtion for Negative log likelihood calculation" 1211 | ] 1212 | }, 1213 | { 1214 | "cell_type": "markdown", 1215 | "metadata": {}, 1216 | "source": [ 1217 | "This is a helper function that calculates the negative log likelihood. \n", 1218 | "\n", 1219 | "The functions takes as input the previously calulcated lstm features to use to calculate the sentence score and then perform a forward run score and compare it with our predicted score to generate a log likelihood. 
\n", 1220 | "\n", 1221 | "`Implementation detail: Notice we do not pump out any log conversion in this function that is supposedly about log likelihood calculation, this is because we have ensured that we get the scores from our helper functions in the log domain.`" 1222 | ] 1223 | }, 1224 | { 1225 | "cell_type": "code", 1226 | "execution_count": 23, 1227 | "metadata": {}, 1228 | "outputs": [], 1229 | "source": [ 1230 | "def get_neg_log_likelihood(self, sentence, tags, chars2, chars2_length, d):\n", 1231 | " # sentence, tags is a list of ints\n", 1232 | " # features is a 2D tensor, len(sentence) * self.tagset_size\n", 1233 | " feats = self._get_lstm_features(sentence, chars2, chars2_length, d)\n", 1234 | "\n", 1235 | " if self.use_crf:\n", 1236 | " forward_score = self._forward_alg(feats)\n", 1237 | " gold_score = self._score_sentence(feats, tags)\n", 1238 | " return forward_score - gold_score\n", 1239 | " else:\n", 1240 | " tags = Variable(tags)\n", 1241 | " scores = nn.functional.cross_entropy(feats, tags)\n", 1242 | " return scores" 1243 | ] 1244 | }, 1245 | { 1246 | "cell_type": "markdown", 1247 | "metadata": {}, 1248 | "source": [ 1249 | "##### Main Model Class" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "code", 1254 | "execution_count": 24, 1255 | "metadata": {}, 1256 | "outputs": [], 1257 | "source": [ 1258 | "class BiLSTM_CRF(nn.Module):\n", 1259 | "\n", 1260 | " def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim,\n", 1261 | " char_to_ix=None, pre_word_embeds=None, char_out_dimension=25,char_embedding_dim=25, use_gpu=False\n", 1262 | " , use_crf=True):\n", 1263 | " '''\n", 1264 | " Input parameters:\n", 1265 | " \n", 1266 | " vocab_size= Size of vocabulary (int)\n", 1267 | " tag_to_ix = Dictionary that maps NER tags to indices\n", 1268 | " embedding_dim = Dimension of word embeddings (int)\n", 1269 | " hidden_dim = The hidden dimension of the LSTM layer (int)\n", 1270 | " char_to_ix = Dictionary that maps characters to indices\n", 1271 | " pre_word_embeds = Numpy array which provides mapping from word embeddings to word indices\n", 1272 | " char_out_dimension = Output dimension from the CNN encoder for character\n", 1273 | " char_embedding_dim = Dimension of the character embeddings\n", 1274 | " use_gpu = defines availability of GPU, \n", 1275 | " when True: CUDA function calls are made\n", 1276 | " else: Normal CPU function calls are made\n", 1277 | " use_crf = parameter which decides if you want to use the CRF layer for output decoding\n", 1278 | " '''\n", 1279 | " \n", 1280 | " super(BiLSTM_CRF, self).__init__()\n", 1281 | " \n", 1282 | " #parameter initialization for the model\n", 1283 | " self.use_gpu = use_gpu\n", 1284 | " self.embedding_dim = embedding_dim\n", 1285 | " self.hidden_dim = hidden_dim\n", 1286 | " self.vocab_size = vocab_size\n", 1287 | " self.tag_to_ix = tag_to_ix\n", 1288 | " self.use_crf = use_crf\n", 1289 | " self.tagset_size = len(tag_to_ix)\n", 1290 | " self.out_channels = char_out_dimension\n", 1291 | "\n", 1292 | " if char_embedding_dim is not None:\n", 1293 | " self.char_embedding_dim = char_embedding_dim\n", 1294 | " \n", 1295 | " #Initializing the character embedding layer\n", 1296 | " self.char_embeds = nn.Embedding(len(char_to_ix), char_embedding_dim)\n", 1297 | " init_embedding(self.char_embeds.weight)\n", 1298 | " \n", 1299 | " #Performing CNN encoding on the character embeddings\n", 1300 | " self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))\n", 1301 
| "\n", 1302 | " #Creating Embedding layer with dimension of ( number of words * dimension of each word)\n", 1303 | " self.word_embeds = nn.Embedding(vocab_size, embedding_dim)\n", 1304 | " if pre_word_embeds is not None:\n", 1305 | " #Initializes the word embeddings with pretrained word embeddings\n", 1306 | " self.pre_word_embeds = True\n", 1307 | " self.word_embeds.weight = nn.Parameter(torch.FloatTensor(pre_word_embeds))\n", 1308 | " else:\n", 1309 | " self.pre_word_embeds = False\n", 1310 | " \n", 1311 | " #Initializing the dropout layer, with dropout specificed in parameters\n", 1312 | " self.dropout = nn.Dropout(parameters['dropout'])\n", 1313 | " \n", 1314 | " #Lstm Layer:\n", 1315 | " #input dimension: word embedding dimension + character level representation\n", 1316 | " #bidirectional=True, specifies that we are using the bidirectional LSTM\n", 1317 | " self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)\n", 1318 | " \n", 1319 | " #Initializing the lstm layer using predefined function for initialization\n", 1320 | " init_lstm(self.lstm)\n", 1321 | " \n", 1322 | " # Linear layer which maps the output of the bidirectional LSTM into tag space.\n", 1323 | " self.hidden2tag = nn.Linear(hidden_dim*2, self.tagset_size)\n", 1324 | " \n", 1325 | " #Initializing the linear layer using predefined function for initialization\n", 1326 | " init_linear(self.hidden2tag) \n", 1327 | "\n", 1328 | " if self.use_crf:\n", 1329 | " # Matrix of transition parameters. Entry i,j is the score of transitioning *to* i *from* j.\n", 1330 | " # Matrix has a dimension of (total number of tags * total number of tags)\n", 1331 | " self.transitions = nn.Parameter(\n", 1332 | " torch.zeros(self.tagset_size, self.tagset_size))\n", 1333 | " \n", 1334 | " # These two statements enforce the constraint that we never transfer\n", 1335 | " # to the start tag and we never transfer from the stop tag\n", 1336 | " self.transitions.data[tag_to_ix[START_TAG], :] = -10000\n", 1337 | " self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000\n", 1338 | "\n", 1339 | " #assigning the functions, which we have defined earlier\n", 1340 | " _score_sentence = score_sentences\n", 1341 | " _get_lstm_features = get_lstm_features\n", 1342 | " _forward_alg = forward_alg\n", 1343 | " viterbi_decode = viterbi_algo\n", 1344 | " neg_log_likelihood = get_neg_log_likelihood\n", 1345 | " forward = forward_calc" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "code", 1350 | "execution_count": 25, 1351 | "metadata": {}, 1352 | "outputs": [ 1353 | { 1354 | "name": "stdout", 1355 | "output_type": "stream", 1356 | "text": [ 1357 | "Model Initialized!!!\n" 1358 | ] 1359 | } 1360 | ], 1361 | "source": [ 1362 | "#creating the model using the Class defined above\n", 1363 | "model = BiLSTM_CRF(vocab_size=len(word_to_id),\n", 1364 | " tag_to_ix=tag_to_id,\n", 1365 | " embedding_dim=parameters['word_dim'],\n", 1366 | " hidden_dim=parameters['word_lstm_dim'],\n", 1367 | " use_gpu=use_gpu,\n", 1368 | " char_to_ix=char_to_id,\n", 1369 | " pre_word_embeds=word_embeds,\n", 1370 | " use_crf=parameters['crf'])\n", 1371 | "print(\"Model Initialized!!!\")" 1372 | ] 1373 | }, 1374 | { 1375 | "cell_type": "code", 1376 | "execution_count": 26, 1377 | "metadata": {}, 1378 | "outputs": [ 1379 | { 1380 | "name": "stdout", 1381 | "output_type": "stream", 1382 | "text": [ 1383 | "downloading pre-trained model\n", 1384 | "model reloaded : ./models/pre-trained-model\n" 1385 | ] 1386 | } 1387 | ], 1388 | "source": [ 1389 | "#Reload a saved model, 
if parameter[\"reload\"] is set to a path\n", 1390 | "if parameters['reload']:\n", 1391 | " if not os.path.exists(parameters['reload']):\n", 1392 | " print(\"downloading pre-trained model\")\n", 1393 | " model_url=\"https://github.com/TheAnig/NER-LSTM-CNN-Pytorch/raw/master/trained-model-cpu\"\n", 1394 | " urllib.request.urlretrieve(model_url, parameters['reload'])\n", 1395 | " model.load_state_dict(torch.load(parameters['reload']))\n", 1396 | " print(\"model reloaded :\", parameters['reload'])\n", 1397 | "\n", 1398 | "if use_gpu:\n", 1399 | " model.cuda()" 1400 | ] 1401 | }, 1402 | { 1403 | "cell_type": "markdown", 1404 | "metadata": {}, 1405 | "source": [ 1406 | "##### Training Paramaters" 1407 | ] 1408 | }, 1409 | { 1410 | "cell_type": "code", 1411 | "execution_count": 27, 1412 | "metadata": {}, 1413 | "outputs": [], 1414 | "source": [ 1415 | "#Initializing the optimizer\n", 1416 | "#The best results in the paper where achived using stochastic gradient descent (SGD) \n", 1417 | "#learning rate=0.015 and momentum=0.9 \n", 1418 | "#decay_rate=0.05 \n", 1419 | "\n", 1420 | "learning_rate = 0.015\n", 1421 | "momentum = 0.9\n", 1422 | "number_of_epochs = parameters['epoch'] \n", 1423 | "decay_rate = 0.05\n", 1424 | "gradient_clip = parameters['gradient_clip']\n", 1425 | "optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)\n", 1426 | "\n", 1427 | "#variables which will used in training process\n", 1428 | "losses = [] #list to store all losses\n", 1429 | "loss = 0.0 #Loss Initializatoin\n", 1430 | "best_dev_F = -1.0 # Current best F-1 Score on Dev Set\n", 1431 | "best_test_F = -1.0 # Current best F-1 Score on Test Set\n", 1432 | "best_train_F = -1.0 # Current best F-1 Score on Train Set\n", 1433 | "all_F = [[0, 0, 0]] # List storing all the F-1 Scores\n", 1434 | "eval_every = len(train_data) # Calculate F-1 Score after this many iterations\n", 1435 | "plot_every = 2000 # Store loss after this many iterations\n", 1436 | "count = 0 #Counts the number of iterations" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "markdown", 1441 | "metadata": {}, 1442 | "source": [ 1443 | "### Evaluation" 1444 | ] 1445 | }, 1446 | { 1447 | "cell_type": "markdown", 1448 | "metadata": {}, 1449 | "source": [ 1450 | "##### Helper functions for evaluation" 1451 | ] 1452 | }, 1453 | { 1454 | "cell_type": "code", 1455 | "execution_count": 28, 1456 | "metadata": {}, 1457 | "outputs": [], 1458 | "source": [ 1459 | "def get_chunk_type(tok, idx_to_tag):\n", 1460 | " \"\"\"\n", 1461 | " The function takes in a chunk (\"B-PER\") and then splits it into the tag (PER) and its class (B)\n", 1462 | " as defined in BIOES\n", 1463 | " \n", 1464 | " Args:\n", 1465 | " tok: id of token, ex 4\n", 1466 | " idx_to_tag: dictionary {4: \"B-PER\", ...}\n", 1467 | "\n", 1468 | " Returns:\n", 1469 | " tuple: \"B\", \"PER\"\n", 1470 | "\n", 1471 | " \"\"\"\n", 1472 | " \n", 1473 | " tag_name = idx_to_tag[tok]\n", 1474 | " tag_class = tag_name.split('-')[0]\n", 1475 | " tag_type = tag_name.split('-')[-1]\n", 1476 | " return tag_class, tag_type" 1477 | ] 1478 | }, 1479 | { 1480 | "cell_type": "code", 1481 | "execution_count": 29, 1482 | "metadata": {}, 1483 | "outputs": [], 1484 | "source": [ 1485 | "def get_chunks(seq, tags):\n", 1486 | " \"\"\"Given a sequence of tags, group entities and their position\n", 1487 | "\n", 1488 | " Args:\n", 1489 | " seq: [4, 4, 0, 0, ...] 
sequence of labels\n", 1490 | " tags: dict[\"O\"] = 4\n", 1491 | "\n", 1492 | " Returns:\n", 1493 | " list of (chunk_type, chunk_start, chunk_end)\n", 1494 | "\n", 1495 | " Example:\n", 1496 | " seq = [4, 5, 0, 3]\n", 1497 | " tags = {\"B-PER\": 4, \"I-PER\": 5, \"B-LOC\": 3}\n", 1498 | " result = [(\"PER\", 0, 2), (\"LOC\", 3, 4)]\n", 1499 | "\n", 1500 | " \"\"\"\n", 1501 | " \n", 1502 | " # We assume by default the tags lie outside a named entity\n", 1503 | " default = tags[\"O\"]\n", 1504 | " \n", 1505 | " idx_to_tag = {idx: tag for tag, idx in tags.items()}\n", 1506 | " \n", 1507 | " chunks = []\n", 1508 | " \n", 1509 | " chunk_type, chunk_start = None, None\n", 1510 | " for i, tok in enumerate(seq):\n", 1511 | " # End of a chunk 1\n", 1512 | " if tok == default and chunk_type is not None:\n", 1513 | " # Add a chunk.\n", 1514 | " chunk = (chunk_type, chunk_start, i)\n", 1515 | " chunks.append(chunk)\n", 1516 | " chunk_type, chunk_start = None, None\n", 1517 | "\n", 1518 | " # End of a chunk + start of a chunk!\n", 1519 | " elif tok != default:\n", 1520 | " tok_chunk_class, tok_chunk_type = get_chunk_type(tok, idx_to_tag)\n", 1521 | " if chunk_type is None:\n", 1522 | " # Initialize chunk for each entity\n", 1523 | " chunk_type, chunk_start = tok_chunk_type, i\n", 1524 | " elif tok_chunk_type != chunk_type or tok_chunk_class == \"B\":\n", 1525 | " # If chunk class is B, i.e., its a beginning of a new named entity\n", 1526 | " # or, if the chunk type is different from the previous one, then we\n", 1527 | " # start labelling it as a new entity\n", 1528 | " chunk = (chunk_type, chunk_start, i)\n", 1529 | " chunks.append(chunk)\n", 1530 | " chunk_type, chunk_start = tok_chunk_type, i\n", 1531 | " else:\n", 1532 | " pass\n", 1533 | "\n", 1534 | " # end condition\n", 1535 | " if chunk_type is not None:\n", 1536 | " chunk = (chunk_type, chunk_start, len(seq))\n", 1537 | " chunks.append(chunk)\n", 1538 | "\n", 1539 | " return chunks" 1540 | ] 1541 | }, 1542 | { 1543 | "cell_type": "code", 1544 | "execution_count": 30, 1545 | "metadata": {}, 1546 | "outputs": [], 1547 | "source": [ 1548 | "def evaluating(model, datas, best_F,dataset=\"Train\"):\n", 1549 | " '''\n", 1550 | " The function takes as input the model, data and calcuates F-1 Score\n", 1551 | " It performs conditional updates \n", 1552 | " 1) Flag to save the model \n", 1553 | " 2) Best F-1 score\n", 1554 | " ,if the F-1 score calculated improves on the previous F-1 score\n", 1555 | " '''\n", 1556 | " # Initializations\n", 1557 | " prediction = [] # A list that stores predicted tags\n", 1558 | " save = False # Flag that tells us if the model needs to be saved\n", 1559 | " new_F = 0.0 # Variable to store the current F1-Score (may not be the best)\n", 1560 | " correct_preds, total_correct, total_preds = 0., 0., 0. 
# Count variables\n", 1561 | " \n", 1562 | " for data in datas:\n", 1563 | " ground_truth_id = data['tags']\n", 1564 | " words = data['str_words']\n", 1565 | " chars2 = data['chars']\n", 1566 | "\n", 1567 | " d = {} \n", 1568 | " \n", 1569 | " # Padding the each word to max word size of that sentence\n", 1570 | " chars2_length = [len(c) for c in chars2]\n", 1571 | " char_maxl = max(chars2_length)\n", 1572 | " chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')\n", 1573 | " for i, c in enumerate(chars2):\n", 1574 | " chars2_mask[i, :chars2_length[i]] = c\n", 1575 | " chars2_mask = Variable(torch.LongTensor(chars2_mask))\n", 1576 | "\n", 1577 | " dwords = Variable(torch.LongTensor(data['words']))\n", 1578 | " \n", 1579 | " # We are getting the predicted output from our model\n", 1580 | " if use_gpu:\n", 1581 | " val,out = model(dwords.cuda(), chars2_mask.cuda(), chars2_length, d)\n", 1582 | " else:\n", 1583 | " val,out = model(dwords, chars2_mask, chars2_length, d)\n", 1584 | " predicted_id = out\n", 1585 | " \n", 1586 | " \n", 1587 | " # We use the get chunks function defined above to get the true chunks\n", 1588 | " # and the predicted chunks from true labels and predicted labels respectively\n", 1589 | " lab_chunks = set(get_chunks(ground_truth_id,tag_to_id))\n", 1590 | " lab_pred_chunks = set(get_chunks(predicted_id,\n", 1591 | " tag_to_id))\n", 1592 | "\n", 1593 | " # Updating the count variables\n", 1594 | " correct_preds += len(lab_chunks & lab_pred_chunks)\n", 1595 | " total_preds += len(lab_pred_chunks)\n", 1596 | " total_correct += len(lab_chunks)\n", 1597 | " \n", 1598 | " # Calculating the F1-Score\n", 1599 | " p = correct_preds / total_preds if correct_preds > 0 else 0\n", 1600 | " r = correct_preds / total_correct if correct_preds > 0 else 0\n", 1601 | " new_F = 2 * p * r / (p + r) if correct_preds > 0 else 0\n", 1602 | "\n", 1603 | " print(\"{}: new_F: {} best_F: {} \".format(dataset,new_F,best_F))\n", 1604 | " \n", 1605 | " # If our current F1-Score is better than the previous best, we update the best\n", 1606 | " # to current F1 and we set the flag to indicate that we need to checkpoint this model\n", 1607 | " \n", 1608 | " if new_F>best_F:\n", 1609 | " best_F=new_F\n", 1610 | " save=True\n", 1611 | "\n", 1612 | " return best_F, new_F, save" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "markdown", 1617 | "metadata": {}, 1618 | "source": [ 1619 | "##### Helper function for performing Learning rate decay" 1620 | ] 1621 | }, 1622 | { 1623 | "cell_type": "code", 1624 | "execution_count": 31, 1625 | "metadata": {}, 1626 | "outputs": [], 1627 | "source": [ 1628 | "def adjust_learning_rate(optimizer, lr):\n", 1629 | " \"\"\"\n", 1630 | " shrink learning rate\n", 1631 | " \"\"\"\n", 1632 | " for param_group in optimizer.param_groups:\n", 1633 | " param_group['lr'] = lr" 1634 | ] 1635 | }, 1636 | { 1637 | "cell_type": "markdown", 1638 | "metadata": {}, 1639 | "source": [ 1640 | "### Training Step" 1641 | ] 1642 | }, 1643 | { 1644 | "cell_type": "markdown", 1645 | "metadata": {}, 1646 | "source": [ 1647 | "If `parameters['reload']` is set, we already have a model to load of off, so we can skip the training. We have originally specified a pre-trained model since training is an expensive process, but we encourage readers to try this out once they're done with the tutorial." 
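    ,
    "\n",
    "\n",
    "The training loop below also shrinks the learning rate once per epoch through `adjust_learning_rate`, following the schedule lr_E = lr_0 / (1 + decay_rate * E). As a rough illustration (reusing the `learning_rate = 0.015` and `decay_rate = 0.05` values defined above, not introducing new parameters), the effective learning rate over the first few epochs behaves like this:\n",
    "\n",
    "```python\n",
    "# Illustrative only: the decay schedule applied at the end of each epoch E.\n",
    "base_lr, decay = 0.015, 0.05\n",
    "for E in range(6):\n",
    "    print(E, round(base_lr / (1 + decay * E), 5))\n",
    "# 0 0.015, 1 0.01429, 2 0.01364, 3 0.01304, 4 0.0125, 5 0.012\n",
    "```"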
1648 | ] 1649 | }, 1650 | { 1651 | "cell_type": "code", 1652 | "execution_count": 32, 1653 | "metadata": {}, 1654 | "outputs": [], 1655 | "source": [ 1656 | "#parameters['reload']=False\n", 1657 | "\n", 1658 | "if not parameters['reload']:\n", 1659 | " tr = time.time()\n", 1660 | " model.train(True)\n", 1661 | " for epoch in range(1,number_of_epochs):\n", 1662 | " for i, index in enumerate(np.random.permutation(len(train_data))):\n", 1663 | " count += 1\n", 1664 | " data = train_data[index]\n", 1665 | "\n", 1666 | " ##gradient updates for each data entry\n", 1667 | " model.zero_grad()\n", 1668 | "\n", 1669 | " sentence_in = data['words']\n", 1670 | " sentence_in = Variable(torch.LongTensor(sentence_in))\n", 1671 | " tags = data['tags']\n", 1672 | " chars2 = data['chars']\n", 1673 | "\n", 1674 | " d = {}\n", 1675 | "\n", 1676 | " ## Padding the each word to max word size of that sentence\n", 1677 | " chars2_length = [len(c) for c in chars2]\n", 1678 | " char_maxl = max(chars2_length)\n", 1679 | " chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')\n", 1680 | " for i, c in enumerate(chars2):\n", 1681 | " chars2_mask[i, :chars2_length[i]] = c\n", 1682 | " chars2_mask = Variable(torch.LongTensor(chars2_mask))\n", 1683 | "\n", 1684 | "\n", 1685 | " targets = torch.LongTensor(tags)\n", 1686 | "\n", 1687 | " #we calculate the negative log-likelihood for the predicted tags using the predefined function\n", 1688 | " if use_gpu:\n", 1689 | " neg_log_likelihood = model.neg_log_likelihood(sentence_in.cuda(), targets.cuda(), chars2_mask.cuda(), chars2_length, d)\n", 1690 | " else:\n", 1691 | " neg_log_likelihood = model.neg_log_likelihood(sentence_in, targets, chars2_mask, chars2_length, d)\n", 1692 | " loss += neg_log_likelihood.data[0] / len(data['words'])\n", 1693 | " neg_log_likelihood.backward()\n", 1694 | "\n", 1695 | " #we use gradient clipping to avoid exploding gradients\n", 1696 | " torch.nn.utils.clip_grad_norm(model.parameters(), gradient_clip)\n", 1697 | " optimizer.step()\n", 1698 | "\n", 1699 | " #Storing loss\n", 1700 | " if count % plot_every == 0:\n", 1701 | " loss /= plot_every\n", 1702 | " print(count, ': ', loss)\n", 1703 | " if losses == []:\n", 1704 | " losses.append(loss)\n", 1705 | " losses.append(loss)\n", 1706 | " loss = 0.0\n", 1707 | "\n", 1708 | " #Evaluating on Train, Test, Dev Sets\n", 1709 | " if count % (eval_every) == 0 and count > (eval_every * 20) or \\\n", 1710 | " count % (eval_every*4) == 0 and count < (eval_every * 20):\n", 1711 | " model.train(False)\n", 1712 | " best_train_F, new_train_F, _ = evaluating(model, train_data, best_train_F,\"Train\")\n", 1713 | " best_dev_F, new_dev_F, save = evaluating(model, dev_data, best_dev_F,\"Dev\")\n", 1714 | " if save:\n", 1715 | " print(\"Saving Model to \", model_name)\n", 1716 | " torch.save(model.state_dict(), model_name)\n", 1717 | " best_test_F, new_test_F, _ = evaluating(model, test_data, best_test_F,\"Test\")\n", 1718 | "\n", 1719 | " all_F.append([new_train_F, new_dev_F, new_test_F])\n", 1720 | " model.train(True)\n", 1721 | "\n", 1722 | " #Performing decay on the learning rate\n", 1723 | " if count % len(train_data) == 0:\n", 1724 | " adjust_learning_rate(optimizer, lr=learning_rate/(1+decay_rate*count/len(train_data)))\n", 1725 | "\n", 1726 | " print(time.time() - tr)\n", 1727 | " plt.plot(losses)\n", 1728 | " plt.show()\n", 1729 | "\n", 1730 | "if not parameters['reload']:\n", 1731 | " #reload the best model saved from training\n", 1732 | " model.load_state_dict(torch.load(model_name))" 1733 | 
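    ,
    "\n",
    "\n",
    "# Quick illustration of the chunk-based F1 that `evaluating` computes above.\n",
    "# The tag ids below are toy values, not the notebook's tag_to_id mapping; the\n",
    "# sketch only shows how exact-span chunk matching turns into precision/recall/F1.\n",
    "toy_tags = {\"O\": 0, \"B-PER\": 1, \"I-PER\": 2, \"B-LOC\": 3}\n",
    "toy_gold = set(get_chunks([1, 2, 0, 3], toy_tags))  # {('PER', 0, 2), ('LOC', 3, 4)}\n",
    "toy_pred = set(get_chunks([1, 0, 0, 3], toy_tags))  # {('PER', 0, 1), ('LOC', 3, 4)}\n",
    "toy_correct = len(toy_gold & toy_pred)              # one exact-span match: ('LOC', 3, 4)\n",
    "toy_p, toy_r = toy_correct / len(toy_pred), toy_correct / len(toy_gold)\n",
    "toy_f1 = 2 * toy_p * toy_r / (toy_p + toy_r)        # 0.5 for this toy example\n"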
] 1734 | }, 1735 | { 1736 | "cell_type": "markdown", 1737 | "metadata": {}, 1738 | "source": [ 1739 | "### Model Testing\n", 1740 | "\n", 1741 | "This is where we provide our readers with some fun, they can try out how the trained model functions on the sentences that you throw at it. Feel free to play around.\n", 1742 | "\n", 1743 | "\n", 1744 | "##### LIVE: PRODUCTION!" 1745 | ] 1746 | }, 1747 | { 1748 | "cell_type": "code", 1749 | "execution_count": 38, 1750 | "metadata": {}, 1751 | "outputs": [ 1752 | { 1753 | "name": "stdout", 1754 | "output_type": "stream", 1755 | "text": [ 1756 | "Prediction:\n", 1757 | "word : tag\n", 1758 | "Jay : PER\n", 1759 | "is : NA\n", 1760 | "from : NA\n", 1761 | "India : LOC\n", 1762 | "\n", 1763 | "\n", 1764 | "Donald : PER\n", 1765 | "is : NA\n", 1766 | "the : NA\n", 1767 | "president : NA\n", 1768 | "of : NA\n", 1769 | "USA : LOC\n", 1770 | "\n", 1771 | "\n" 1772 | ] 1773 | } 1774 | ], 1775 | "source": [ 1776 | "model_testing_sentences = ['Jay is from India','Donald is the president of USA']\n", 1777 | "\n", 1778 | "#parameters\n", 1779 | "lower=parameters['lower']\n", 1780 | "\n", 1781 | "#preprocessing\n", 1782 | "final_test_data = []\n", 1783 | "for sentence in model_testing_sentences:\n", 1784 | " s=sentence.split()\n", 1785 | " str_words = [w for w in s]\n", 1786 | " words = [word_to_id[lower_case(w,lower) if lower_case(w,lower) in word_to_id else ''] for w in str_words]\n", 1787 | " \n", 1788 | " # Skip characters that are not in the training set\n", 1789 | " chars = [[char_to_id[c] for c in w if c in char_to_id] for w in str_words]\n", 1790 | " \n", 1791 | " final_test_data.append({\n", 1792 | " 'str_words': str_words,\n", 1793 | " 'words': words,\n", 1794 | " 'chars': chars,\n", 1795 | " })\n", 1796 | "\n", 1797 | "#prediction\n", 1798 | "predictions = []\n", 1799 | "print(\"Prediction:\")\n", 1800 | "print(\"word : tag\")\n", 1801 | "for data in final_test_data:\n", 1802 | " words = data['str_words']\n", 1803 | " chars2 = data['chars']\n", 1804 | "\n", 1805 | " d = {} \n", 1806 | " \n", 1807 | " # Padding the each word to max word size of that sentence\n", 1808 | " chars2_length = [len(c) for c in chars2]\n", 1809 | " char_maxl = max(chars2_length)\n", 1810 | " chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')\n", 1811 | " for i, c in enumerate(chars2):\n", 1812 | " chars2_mask[i, :chars2_length[i]] = c\n", 1813 | " chars2_mask = Variable(torch.LongTensor(chars2_mask))\n", 1814 | "\n", 1815 | " dwords = Variable(torch.LongTensor(data['words']))\n", 1816 | "\n", 1817 | " # We are getting the predicted output from our model\n", 1818 | " if use_gpu:\n", 1819 | " val,predicted_id = model(dwords.cuda(), chars2_mask.cuda(), chars2_length, d)\n", 1820 | " else:\n", 1821 | " val,predicted_id = model(dwords, chars2_mask, chars2_length, d)\n", 1822 | "\n", 1823 | " pred_chunks = get_chunks(predicted_id,tag_to_id)\n", 1824 | " temp_list_tags=['NA']*len(words)\n", 1825 | " for p in pred_chunks:\n", 1826 | " temp_list_tags[p[1]]=p[0]\n", 1827 | " \n", 1828 | " for word,tag in zip(words,temp_list_tags):\n", 1829 | " print(word,':',tag)\n", 1830 | " print('\\n')" 1831 | ] 1832 | }, 1833 | { 1834 | "cell_type": "markdown", 1835 | "metadata": {}, 1836 | "source": [ 1837 | "### References" 1838 | ] 1839 | }, 1840 | { 1841 | "cell_type": "markdown", 1842 | "metadata": {}, 1843 | "source": [ 1844 | "1) Xuezhe Ma and Eduard Hovy. 2016. 
** End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF .** In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: LongPapers). Association for Computational Linguistics, Berlin, Germany ** (https://arxiv.org/pdf/1603.01354.pdf) **\n", 1845 | "\n", 1846 | "2) Official PyTorch Tutorial : [** Advanced: Making Dynamic Decisions and the Bi-LSTM CRF **](http://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html#sphx-glr-beginner-nlp-advanced-tutorial-py)\n", 1847 | "\n", 1848 | "3) [** Sequence Tagging with Tensorflow **](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html) using bi-LSTM + CRF with character embeddings for NER and POS by Guillaume Genthial\n", 1849 | "\n", 1850 | "4) Github Repository - [** Reference Github Repository **](https://github.com/jayavardhanr/End-to-end-Sequence-Labeling-via-Bi-directional-LSTM-CNNs-CRF-Tutorial)\n" 1851 | ] 1852 | } 1853 | ], 1854 | "metadata": { 1855 | "kernelspec": { 1856 | "display_name": "Python 3", 1857 | "language": "python", 1858 | "name": "python3" 1859 | }, 1860 | "language_info": { 1861 | "codemirror_mode": { 1862 | "name": "ipython", 1863 | "version": 3 1864 | }, 1865 | "file_extension": ".py", 1866 | "mimetype": "text/x-python", 1867 | "name": "python", 1868 | "nbconvert_exporter": "python", 1869 | "pygments_lexer": "ipython3", 1870 | "version": "3.5.5" 1871 | } 1872 | }, 1873 | "nbformat": 4, 1874 | "nbformat_minor": 2 1875 | } 1876 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Named-Entity-Recognition 2 | 3 | #### Model 4 | - Attention mechanism 5 | - CNNs-BiLSTM-CRF 6 | - self-adaption learning rate 7 | - clipped_gradients 8 | 9 | #### Datasets 10 | - [Conll2003](https://www.clips.uantwerpen.be/conll2003/) BIOES 11 | - [Jnlpba2004](http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html) BIOES 12 | 13 | #### Reference 14 | - [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354) 15 | - [基于CNN-LSTM-CRF模型的生物医学命名实体识别](http://www.cips-cl.org/static/anthology/CCL-2017/CCL-17-001.pdf) 16 | -------------------------------------------------------------------------------- /bio2bioes.py: -------------------------------------------------------------------------------- 1 | 2 | import codecs,re 3 | 4 | def iob2(tags): 5 | """ 6 | Check that tags have a valid BIO format. 7 | Tags in BIO1 format are converted to BIO2. 
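    For example, tags = ['I-PER', 'I-PER', 'O', 'I-LOC'] (IOB1 style) is
    rewritten in place to ['B-PER', 'I-PER', 'O', 'B-LOC'] and the
    function returns True.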
8 | """ 9 | for i, tag in enumerate(tags): 10 | if tag == 'O': 11 | continue 12 | split = tag.split('-') 13 | if len(split) != 2 or split[0] not in ['I', 'B']: 14 | return False 15 | if split[0] == 'B': 16 | continue 17 | elif i == 0 or tags[i - 1] == 'O': # conversion IOB1 to IOB2 18 | tags[i] = 'B' + tag[1:] 19 | elif tags[i - 1][1:] == tag[1:]: 20 | continue 21 | else: # conversion IOB1 to IOB2 22 | tags[i] = 'B' + tag[1:] 23 | return True 24 | 25 | def iob_iobes(tags): 26 | """ 27 | the function is used to convert 28 | BIO -> BIOES tagging 29 | """ 30 | new_tags = [] 31 | for i, tag in enumerate(tags): 32 | if tag == 'O': 33 | new_tags.append(tag) 34 | elif tag.split('-')[0] == 'B': 35 | if i + 1 != len(tags) and \ 36 | tags[i + 1].split('-')[0] == 'I': 37 | new_tags.append(tag) 38 | else: 39 | new_tags.append(tag.replace('B-', 'S-')) 40 | elif tag.split('-')[0] == 'I': 41 | if i + 1 < len(tags) and \ 42 | tags[i + 1].split('-')[0] == 'I': 43 | new_tags.append(tag) 44 | else: 45 | new_tags.append(tag.replace('I-', 'E-')) 46 | else: 47 | raise Exception('Invalid IOB format!') 48 | return new_tags 49 | 50 | def zero_digits(s): 51 | """ 52 | Replace every digit in a string by a zero. 53 | """ 54 | return re.sub('\d', '0', s) 55 | 56 | def update_tag_scheme(sentences, tag_scheme): 57 | """ 58 | Check and update sentences tagging scheme to BIO2 59 | Only BIO1 and BIO2 schemes are accepted for input data. 60 | """ 61 | for i, s in enumerate(sentences): 62 | tags = [w[-1] for w in s] 63 | # Check that tags are given in the BIO format 64 | if not iob2(tags): 65 | s_str = '\n'.join(' '.join(w) for w in s) 66 | raise Exception('Sentences should be given in BIO format! ' + 67 | 'Please check sentence %i:\n%s' % (i, s_str)) 68 | if tag_scheme == 'BIOES': 69 | new_tags = iob_iobes(tags) 70 | for word, new_tag in zip(s, new_tags): 71 | word[-1] = new_tag 72 | else: 73 | raise Exception('Wrong tagging scheme!') 74 | 75 | def load_sentences(path, zeros): 76 | """ 77 | Load sentences. A line must contain at least a word and its tag. 78 | Sentences are separated by empty lines. 79 | """ 80 | sentences = [] 81 | sentence = [] 82 | for line in codecs.open(path, 'r', 'utf8'): 83 | line = zero_digits(line.rstrip()) if zeros else line.rstrip() 84 | if not line: 85 | if len(sentence) > 0: 86 | if 'DOCSTART' not in sentence[0][0]: 87 | sentences.append(sentence) 88 | sentence = [] 89 | else: 90 | word = line.split() 91 | assert len(word) >= 2 92 | sentence.append(word) 93 | if len(sentence) > 0: 94 | if 'DOCSTART' not in sentence[0][0]: 95 | sentences.append(sentence) 96 | return sentences 97 | 98 | if __name__ == '__main__': 99 | path = 'E:\\HotEvent\\NER_corpus\\NER_corpus\\jnlpba2004\\ner\\train.txt' 100 | bioes_path = 'E:\\HotEvent\\NER_corpus\\NER_corpus\\jnlpba2004\\ner\\bioes_train.txt' 101 | sentences = load_sentences(path, True) 102 | update_tag_scheme(sentences, 'BIOES') 103 | sens_string = '' 104 | for sentence in sentences: 105 | for word_label in sentence: 106 | sens_string = sens_string + word_label[0] + '\t' + word_label[1] + '\n' 107 | print(word_label[0],word_label[1]) 108 | sens_string = sens_string + '\n' 109 | with open(bioes_path,'wb') as f: 110 | f.write(sens_string.encode('utf-8')) 111 | --------------------------------------------------------------------------------
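For reference, here is a minimal usage sketch of the BIO -> BIOES conversion implemented in `bio2bioes.py` (assuming the script is importable as a module; the tag list is a toy example, not project data):

```python
# Toy illustration of iob_iobes: BIO2 tags in, BIOES tags out.
from bio2bioes import iob_iobes

bio_tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(iob_iobes(bio_tags))
# ['B-PER', 'E-PER', 'O', 'S-LOC']  (entity-final tokens -> E-, single-token entities -> S-)
```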