├── README.md └── NER solution.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # spaCy-Custom-NER-creation 2 | 3 | ## spaCy for NER 4 | spaCy provides an efficient statistical system for NER in Python, which assigns labels to contiguous groups of tokens. Its default models recognize a wide range of named and numerical entities, including person, organization, language, and event. Beyond these default entities, spaCy also lets us add arbitrary classes to the NER model by updating it with new training examples. 5 | 6 | 1. Load the model 7 | 1.1. spacy.load('en') 8 | --> Disable the other pipeline components (nlp.disable_pipes) 9 | 1.2. spacy.blank('en') 10 | --> Add the entity recognizer to the pipeline 11 | 2. Shuffle and loop over the examples 12 | --> Update the model (nlp.update) 13 | 3. Save the trained model (nlp.to_disk) 14 | 4. Test 15 | -------------------------------------------------------------------------------- /NER solution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Load Packages" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "from __future__ import unicode_literals, print_function\n", 17 | "import plac\n", 18 | "import random\n", 19 | "from pathlib import Path\n", 20 | "import spacy\n", 21 | "from tqdm import tqdm " 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "nlp1 = spacy.load('en_core_web_lg')" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## Working of NER" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata":
{}, 44 | "outputs": [], 45 | "source": [ 46 | "docx1 = nlp1(u\"Who is Nishanth?\")" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 4, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | "Nishanth 7 15 PERSON\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "for ent in docx1.ents:\n", 64 | " print(ent.text, ent.start_char, ent.end_char, ent.label_)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 5, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "docx2 = nlp1(u\"Who is Kamal Khumar?\")" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 6, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Kamal Khumar 7 19 PERSON\n" 86 | ] 87 | } 88 | ], 89 | "source": [ 90 | "for ent in docx2.ents:\n", 91 | " print(ent.text, ent.start_char, ent.end_char, ent.label_)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## Train Data" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 7, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "TRAIN_DATA = [\n", 108 | " ('Who is Nishanth?', {\n", 109 | " 'entities': [(7, 15, 'PERSON')]\n", 110 | " }),\n", 111 | " ('Who is Kamal Khumar?', {\n", 112 | " 'entities': [(7, 19, 'PERSON')]\n", 113 | " }),\n", 114 | " ('I like London and Berlin.', {\n", 115 | " 'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]\n", 116 | " })\n", 117 | "]" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Define our variables" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 8, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "model = None\n", 134 | "output_dir=Path(\"C:\\\\Users\\\\nithi\\\\Documents\\\\ner\")\n", 135 | "n_iter=100" 136 | ] 137 | },
138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "## Load the model" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 9, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "Created blank 'en' model\n" 155 | ] 156 | } 157 | ], 158 | "source": [ 159 | "if model is not None:\n", 160 | " nlp = spacy.load(model) \n", 161 | " print(\"Loaded model '%s'\" % model)\n", 162 | "else:\n", 163 | " nlp = spacy.blank('en') \n", 164 | " print(\"Created blank 'en' model\")" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "## Set up the pipeline" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 10, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "if 'ner' not in nlp.pipe_names:\n", 181 | " ner = nlp.create_pipe('ner')\n", 182 | " nlp.add_pipe(ner, last=True)\n", 183 | "else:\n", 184 | " ner = nlp.get_pipe('ner')" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "## Train the Recognizer" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 11, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "name": "stderr", 201 | "output_type": "stream", 202 | "text": [ 203 | "C:\\Users\\nithi\\Anaconda3\\lib\\site-packages\\spacy\\language.py:639: UserWarning: [W033] Training a new parser or NER using a model with no lexeme normalization table. This may degrade the performance of the model to some degree. If this is intentional or the language you're using doesn't have a normalization table, please ignore this warning. If this is surprising, make sure you have the spacy-lookups-data package installed. 
The languages with lexeme normalization tables are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.\n", 204 | " **kwargs\n", 205 | "100%|██████████| 3/3 [00:00<00:00, 26.62it/s]\n" 206 | ] 207 | }, 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "{'ner': 13.289173007011414}\n" 213 | ] 214 | }, 215 | { 216 | "name": "stderr", 217 | "output_type": "stream", 218 | "text": [ 219 | "100%|██████████| 3/3 [00:00<00:00, 28.11it/s]\n" 220 | ] 221 | }, 222 | { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | "{'ner': 12.414458155632019}\n" 227 | ] 228 | }, 229 | { 230 | "name": "stderr", 231 | "output_type": "stream", 232 | "text": [ 233 | "100%|██████████| 3/3 [00:00<00:00, 27.10it/s]\n" 234 | ] 235 | }, 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "{'ner': 11.202702164649963}\n" 241 | ] 242 | }, 243 | { 244 | "name": "stderr", 245 | "output_type": "stream", 246 | "text": [ 247 | "100%|██████████| 3/3 [00:00<00:00, 32.00it/s]\n" 248 | ] 249 | }, 250 | { 251 | "name": "stdout", 252 | "output_type": "stream", 253 | "text": [ 254 | "{'ner': 10.165919601917267}\n" 255 | ] 256 | }, 257 | { 258 | "name": "stderr", 259 | "output_type": "stream", 260 | "text": [ 261 | "100%|██████████| 3/3 [00:00<00:00, 30.38it/s]\n" 262 | ] 263 | }, 264 | { 265 | "name": "stdout", 266 | "output_type": "stream", 267 | "text": [ 268 | "{'ner': 8.44960543513298}\n" 269 | ] 270 | }, 271 | { 272 | "name": "stderr", 273 | "output_type": "stream", 274 | "text": [ 275 | "100%|██████████| 3/3 [00:00<00:00, 28.11it/s]\n" 276 | ] 277 | }, 278 | { 279 | "name": "stdout", 280 | "output_type": "stream", 281 | "text": [ 282 | "{'ner': 7.798196479678154}\n" 283 | ] 284 | }, 285 | { 286 | "name": "stderr", 287 | "output_type": "stream", 288 | "text": [ 289 | "100%|██████████| 3/3 [00:00<00:00, 33.42it/s]\n" 290 | ] 291 | }, 292 | { 293 | "name": "stdout", 294 | "output_type": "stream", 295 | "text": [ 296 
| "{'ner': 6.569828731939197}\n" 297 | ] 298 | }, 299 | { 300 | "name": "stderr", 301 | "output_type": "stream", 302 | "text": [ 303 | "100%|██████████| 3/3 [00:00<00:00, 29.20it/s]\n" 304 | ] 305 | }, 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "{'ner': 6.784278305480257}\n" 311 | ] 312 | }, 313 | { 314 | "name": "stderr", 315 | "output_type": "stream", 316 | "text": [ 317 | "100%|██████████| 3/3 [00:00<00:00, 33.42it/s]\n" 318 | ] 319 | }, 320 | { 321 | "name": "stdout", 322 | "output_type": "stream", 323 | "text": [ 324 | "{'ner': 6.996531369164586}\n" 325 | ] 326 | }, 327 | { 328 | "name": "stderr", 329 | "output_type": "stream", 330 | "text": [ 331 | "100%|██████████| 3/3 [00:00<00:00, 32.70it/s]\n" 332 | ] 333 | }, 334 | { 335 | "name": "stdout", 336 | "output_type": "stream", 337 | "text": [ 338 | "{'ner': 6.852636652998626}\n" 339 | ] 340 | }, 341 | { 342 | "name": "stderr", 343 | "output_type": "stream", 344 | "text": [ 345 | "100%|██████████| 3/3 [00:00<00:00, 32.34it/s]\n" 346 | ] 347 | }, 348 | { 349 | "name": "stdout", 350 | "output_type": "stream", 351 | "text": [ 352 | "{'ner': 6.637710725655779}\n" 353 | ] 354 | }, 355 | { 356 | "name": "stderr", 357 | "output_type": "stream", 358 | "text": [ 359 | "100%|██████████| 3/3 [00:00<00:00, 31.01it/s]\n" 360 | ] 361 | }, 362 | { 363 | "name": "stdout", 364 | "output_type": "stream", 365 | "text": [ 366 | "{'ner': 5.308007724117488}\n" 367 | ] 368 | }, 369 | { 370 | "name": "stderr", 371 | "output_type": "stream", 372 | "text": [ 373 | "100%|██████████| 3/3 [00:00<00:00, 34.18it/s]\n" 374 | ] 375 | }, 376 | { 377 | "name": "stdout", 378 | "output_type": "stream", 379 | "text": [ 380 | "{'ner': 6.261936842012801}\n" 381 | ] 382 | }, 383 | { 384 | "name": "stderr", 385 | "output_type": "stream", 386 | "text": [ 387 | "100%|██████████| 3/3 [00:00<00:00, 31.33it/s]\n" 388 | ] 389 | }, 390 | { 391 | "name": "stdout", 392 | "output_type": "stream", 393 | "text": [ 394 | 
"{'ner': 5.696804424747825}\n" 395 | ] 396 | }, 397 | { 398 | "name": "stderr", 399 | "output_type": "stream", 400 | "text": [ 401 | "100%|██████████| 3/3 [00:00<00:00, 32.66it/s]\n" 402 | ] 403 | }, 404 | { 405 | "name": "stdout", 406 | "output_type": "stream", 407 | "text": [ 408 | "{'ner': 4.2220914154313505}\n" 409 | ] 410 | }, 411 | { 412 | "name": "stderr", 413 | "output_type": "stream", 414 | "text": [ 415 | "100%|██████████| 3/3 [00:00<00:00, 29.78it/s]\n" 416 | ] 417 | }, 418 | { 419 | "name": "stdout", 420 | "output_type": "stream", 421 | "text": [ 422 | "{'ner': 7.32105001504533}\n" 423 | ] 424 | }, 425 | { 426 | "name": "stderr", 427 | "output_type": "stream", 428 | "text": [ 429 | "100%|██████████| 3/3 [00:00<00:00, 29.20it/s]\n" 430 | ] 431 | }, 432 | { 433 | "name": "stdout", 434 | "output_type": "stream", 435 | "text": [ 436 | "{'ner': 4.733935753349215}\n" 437 | ] 438 | }, 439 | { 440 | "name": "stderr", 441 | "output_type": "stream", 442 | "text": [ 443 | "100%|██████████| 3/3 [00:00<00:00, 33.80it/s]\n" 444 | ] 445 | }, 446 | { 447 | "name": "stdout", 448 | "output_type": "stream", 449 | "text": [ 450 | "{'ner': 4.7929259040392935}\n" 451 | ] 452 | }, 453 | { 454 | "name": "stderr", 455 | "output_type": "stream", 456 | "text": [ 457 | "100%|██████████| 3/3 [00:00<00:00, 31.66it/s]\n" 458 | ] 459 | }, 460 | { 461 | "name": "stdout", 462 | "output_type": "stream", 463 | "text": [ 464 | "{'ner': 3.7480255567934364}\n" 465 | ] 466 | }, 467 | { 468 | "name": "stderr", 469 | "output_type": "stream", 470 | "text": [ 471 | "100%|██████████| 3/3 [00:00<00:00, 32.70it/s]\n" 472 | ] 473 | }, 474 | { 475 | "name": "stdout", 476 | "output_type": "stream", 477 | "text": [ 478 | "{'ner': 3.6030448971432634}\n" 479 | ] 480 | }, 481 | { 482 | "name": "stderr", 483 | "output_type": "stream", 484 | "text": [ 485 | "100%|██████████| 3/3 [00:00<00:00, 31.01it/s]\n" 486 | ] 487 | }, 488 | { 489 | "name": "stdout", 490 | "output_type": "stream", 491 | "text": [ 492 | 
"{'ner': 2.984586422695429}\n" 493 | ] 494 | }, 495 | { 496 | "name": "stderr", 497 | "output_type": "stream", 498 | "text": [ 499 | "100%|██████████| 3/3 [00:00<00:00, 35.35it/s]\n" 500 | ] 501 | }, 502 | { 503 | "name": "stdout", 504 | "output_type": "stream", 505 | "text": [ 506 | "{'ner': 4.080246267847542}\n" 507 | ] 508 | }, 509 | { 510 | "name": "stderr", 511 | "output_type": "stream", 512 | "text": [ 513 | "100%|██████████| 3/3 [00:00<00:00, 35.80it/s]\n" 514 | ] 515 | }, 516 | { 517 | "name": "stdout", 518 | "output_type": "stream", 519 | "text": [ 520 | "{'ner': 2.396151978294256}\n" 521 | ] 522 | }, 523 | { 524 | "name": "stderr", 525 | "output_type": "stream", 526 | "text": [ 527 | "100%|██████████| 3/3 [00:00<00:00, 33.41it/s]\n" 528 | ] 529 | }, 530 | { 531 | "name": "stdout", 532 | "output_type": "stream", 533 | "text": [ 534 | "{'ner': 2.9708919061977213}\n" 535 | ] 536 | }, 537 | { 538 | "name": "stderr", 539 | "output_type": "stream", 540 | "text": [ 541 | "100%|██████████| 3/3 [00:00<00:00, 37.58it/s]\n" 542 | ] 543 | }, 544 | { 545 | "name": "stdout", 546 | "output_type": "stream", 547 | "text": [ 548 | "{'ner': 3.124516086777021}\n" 549 | ] 550 | }, 551 | { 552 | "name": "stderr", 553 | "output_type": "stream", 554 | "text": [ 555 | "100%|██████████| 3/3 [00:00<00:00, 31.33it/s]\n" 556 | ] 557 | }, 558 | { 559 | "name": "stdout", 560 | "output_type": "stream", 561 | "text": [ 562 | "{'ner': 2.266252386643032}\n" 563 | ] 564 | }, 565 | { 566 | "name": "stderr", 567 | "output_type": "stream", 568 | "text": [ 569 | "100%|██████████| 3/3 [00:00<00:00, 28.65it/s]\n" 570 | ] 571 | }, 572 | { 573 | "name": "stdout", 574 | "output_type": "stream", 575 | "text": [ 576 | "{'ner': 2.0699961034052876}\n" 577 | ] 578 | }, 579 | { 580 | "name": "stderr", 581 | "output_type": "stream", 582 | "text": [ 583 | "100%|██████████| 3/3 [00:00<00:00, 32.45it/s]\n" 584 | ] 585 | }, 586 | { 587 | "name": "stdout", 588 | "output_type": "stream", 589 | "text": [ 590 | 
"{'ner': 1.2966782864483477}\n" 591 | ] 592 | }, 593 | { 594 | "name": "stderr", 595 | "output_type": "stream", 596 | "text": [ 597 | "100%|██████████| 3/3 [00:00<00:00, 35.07it/s]\n" 598 | ] 599 | }, 600 | { 601 | "name": "stdout", 602 | "output_type": "stream", 603 | "text": [ 604 | "{'ner': 1.645277187816894}\n" 605 | ] 606 | }, 607 | { 608 | "name": "stderr", 609 | "output_type": "stream", 610 | "text": [ 611 | "100%|██████████| 3/3 [00:00<00:00, 35.49it/s]\n" 612 | ] 613 | }, 614 | { 615 | "name": "stdout", 616 | "output_type": "stream", 617 | "text": [ 618 | "{'ner': 1.2471649073949607}\n" 619 | ] 620 | }, 621 | { 622 | "name": "stderr", 623 | "output_type": "stream", 624 | "text": [ 625 | "100%|██████████| 3/3 [00:00<00:00, 35.80it/s]\n" 626 | ] 627 | }, 628 | { 629 | "name": "stdout", 630 | "output_type": "stream", 631 | "text": [ 632 | "{'ner': 1.9767626742924236}\n" 633 | ] 634 | }, 635 | { 636 | "name": "stderr", 637 | "output_type": "stream", 638 | "text": [ 639 | "100%|██████████| 3/3 [00:00<00:00, 33.65it/s]\n" 640 | ] 641 | }, 642 | { 643 | "name": "stdout", 644 | "output_type": "stream", 645 | "text": [ 646 | "{'ner': 2.2609614619708998}\n" 647 | ] 648 | }, 649 | { 650 | "name": "stderr", 651 | "output_type": "stream", 652 | "text": [ 653 | "100%|██████████| 3/3 [00:00<00:00, 25.22it/s]\n" 654 | ] 655 | }, 656 | { 657 | "name": "stdout", 658 | "output_type": "stream", 659 | "text": [ 660 | "{'ner': 1.0743873100139631}\n" 661 | ] 662 | }, 663 | { 664 | "name": "stderr", 665 | "output_type": "stream", 666 | "text": [ 667 | "100%|██████████| 3/3 [00:00<00:00, 24.06it/s]\n" 668 | ] 669 | }, 670 | { 671 | "name": "stdout", 672 | "output_type": "stream", 673 | "text": [ 674 | "{'ner': 1.8448130177425868}\n" 675 | ] 676 | }, 677 | { 678 | "name": "stderr", 679 | "output_type": "stream", 680 | "text": [ 681 | "100%|██████████| 3/3 [00:00<00:00, 26.39it/s]\n" 682 | ] 683 | }, 684 | { 685 | "name": "stdout", 686 | "output_type": "stream", 687 | "text": [ 688 
| "{'ner': 1.357637208115494}\n" 689 | ] 690 | }, 691 | { 692 | "name": "stderr", 693 | "output_type": "stream", 694 | "text": [ 695 | "100%|██████████| 3/3 [00:00<00:00, 21.49it/s]\n" 696 | ] 697 | }, 698 | { 699 | "name": "stdout", 700 | "output_type": "stream", 701 | "text": [ 702 | "{'ner': 1.8424517484679943}\n" 703 | ] 704 | }, 705 | { 706 | "name": "stderr", 707 | "output_type": "stream", 708 | "text": [ 709 | "100%|██████████| 3/3 [00:00<00:00, 23.32it/s]\n" 710 | ] 711 | }, 712 | { 713 | "name": "stdout", 714 | "output_type": "stream", 715 | "text": [ 716 | "{'ner': 0.9615059040750317}\n" 717 | ] 718 | }, 719 | { 720 | "name": "stderr", 721 | "output_type": "stream", 722 | "text": [ 723 | "100%|██████████| 3/3 [00:00<00:00, 25.93it/s]\n" 724 | ] 725 | }, 726 | { 727 | "name": "stdout", 728 | "output_type": "stream", 729 | "text": [ 730 | "{'ner': 0.537510085635887}\n" 731 | ] 732 | }, 733 | { 734 | "name": "stderr", 735 | "output_type": "stream", 736 | "text": [ 737 | "100%|██████████| 3/3 [00:00<00:00, 24.66it/s]\n" 738 | ] 739 | }, 740 | { 741 | "name": "stdout", 742 | "output_type": "stream", 743 | "text": [ 744 | "{'ner': 0.7948578974412663}\n" 745 | ] 746 | }, 747 | { 748 | "name": "stderr", 749 | "output_type": "stream", 750 | "text": [ 751 | "100%|██████████| 3/3 [00:00<00:00, 24.13it/s]\n" 752 | ] 753 | }, 754 | { 755 | "name": "stdout", 756 | "output_type": "stream", 757 | "text": [ 758 | "{'ner': 0.1137402939171647}\n" 759 | ] 760 | }, 761 | { 762 | "name": "stderr", 763 | "output_type": "stream", 764 | "text": [ 765 | "100%|██████████| 3/3 [00:00<00:00, 24.13it/s]\n" 766 | ] 767 | }, 768 | { 769 | "name": "stdout", 770 | "output_type": "stream", 771 | "text": [ 772 | "{'ner': 0.31659301493247805}\n" 773 | ] 774 | }, 775 | { 776 | "name": "stderr", 777 | "output_type": "stream", 778 | "text": [ 779 | "100%|██████████| 3/3 [00:00<00:00, 23.38it/s]\n" 780 | ] 781 | }, 782 | { 783 | "name": "stdout", 784 | "output_type": "stream", 785 | "text": [ 
786 | "{'ner': 0.2985648904777062}\n" 787 | ] 788 | }, 789 | { 790 | "name": "stderr", 791 | "output_type": "stream", 792 | "text": [ 793 | "100%|██████████| 3/3 [00:00<00:00, 25.94it/s]\n" 794 | ] 795 | }, 796 | { 797 | "name": "stdout", 798 | "output_type": "stream", 799 | "text": [ 800 | "{'ner': 0.005982262522983435}\n" 801 | ] 802 | }, 803 | { 804 | "name": "stderr", 805 | "output_type": "stream", 806 | "text": [ 807 | "100%|██████████| 3/3 [00:00<00:00, 25.33it/s]\n" 808 | ] 809 | }, 810 | { 811 | "name": "stdout", 812 | "output_type": "stream", 813 | "text": [ 814 | "{'ner': 0.19967248298938595}\n" 815 | ] 816 | }, 817 | { 818 | "name": "stderr", 819 | "output_type": "stream", 820 | "text": [ 821 | "100%|██████████| 3/3 [00:00<00:00, 20.23it/s]\n" 822 | ] 823 | }, 824 | { 825 | "name": "stdout", 826 | "output_type": "stream", 827 | "text": [ 828 | "{'ner': 0.027748550969521342}\n" 829 | ] 830 | }, 831 | { 832 | "name": "stderr", 833 | "output_type": "stream", 834 | "text": [ 835 | "100%|██████████| 3/3 [00:00<00:00, 25.40it/s]\n" 836 | ] 837 | }, 838 | { 839 | "name": "stdout", 840 | "output_type": "stream", 841 | "text": [ 842 | "{'ner': 0.0002355359583202347}\n" 843 | ] 844 | }, 845 | { 846 | "name": "stderr", 847 | "output_type": "stream", 848 | "text": [ 849 | "100%|██████████| 3/3 [00:00<00:00, 25.93it/s]\n" 850 | ] 851 | }, 852 | { 853 | "name": "stdout", 854 | "output_type": "stream", 855 | "text": [ 856 | "{'ner': 0.001245846615631348}\n" 857 | ] 858 | }, 859 | { 860 | "name": "stderr", 861 | "output_type": "stream", 862 | "text": [ 863 | "100%|██████████| 3/3 [00:00<00:00, 25.73it/s]\n" 864 | ] 865 | }, 866 | { 867 | "name": "stdout", 868 | "output_type": "stream", 869 | "text": [ 870 | "{'ner': 0.1384277629389947}\n" 871 | ] 872 | }, 873 | { 874 | "name": "stderr", 875 | "output_type": "stream", 876 | "text": [ 877 | "100%|██████████| 3/3 [00:00<00:00, 23.83it/s]\n" 878 | ] 879 | }, 880 | { 881 | "name": "stdout", 882 | "output_type": "stream", 883 
| "text": [ 884 | "{'ner': 2.5388879362475033e-06}\n" 885 | ] 886 | }, 887 | { 888 | "name": "stderr", 889 | "output_type": "stream", 890 | "text": [ 891 | "100%|██████████| 3/3 [00:00<00:00, 25.71it/s]\n" 892 | ] 893 | }, 894 | { 895 | "name": "stdout", 896 | "output_type": "stream", 897 | "text": [ 898 | "{'ner': 0.014741281109069068}\n" 899 | ] 900 | }, 901 | { 902 | "name": "stderr", 903 | "output_type": "stream", 904 | "text": [ 905 | "100%|██████████| 3/3 [00:00<00:00, 23.71it/s]\n" 906 | ] 907 | }, 908 | { 909 | "name": "stdout", 910 | "output_type": "stream", 911 | "text": [ 912 | "{'ner': 0.7157285214382185}\n" 913 | ] 914 | }, 915 | { 916 | "name": "stderr", 917 | "output_type": "stream", 918 | "text": [ 919 | "100%|██████████| 3/3 [00:00<00:00, 25.28it/s]\n" 920 | ] 921 | }, 922 | { 923 | "name": "stdout", 924 | "output_type": "stream", 925 | "text": [ 926 | "{'ner': 3.244267929675676e-05}\n" 927 | ] 928 | }, 929 | { 930 | "name": "stderr", 931 | "output_type": "stream", 932 | "text": [ 933 | "100%|██████████| 3/3 [00:00<00:00, 25.71it/s]\n" 934 | ] 935 | }, 936 | { 937 | "name": "stdout", 938 | "output_type": "stream", 939 | "text": [ 940 | "{'ner': 0.05835713364862018}\n" 941 | ] 942 | }, 943 | { 944 | "name": "stderr", 945 | "output_type": "stream", 946 | "text": [ 947 | "100%|██████████| 3/3 [00:00<00:00, 25.93it/s]\n" 948 | ] 949 | }, 950 | { 951 | "name": "stdout", 952 | "output_type": "stream", 953 | "text": [ 954 | "{'ner': 0.0002508708162295204}\n" 955 | ] 956 | }, 957 | { 958 | "name": "stderr", 959 | "output_type": "stream", 960 | "text": [ 961 | "100%|██████████| 3/3 [00:00<00:00, 34.18it/s]\n" 962 | ] 963 | }, 964 | { 965 | "name": "stdout", 966 | "output_type": "stream", 967 | "text": [ 968 | "{'ner': 1.6946091970760512e-05}\n" 969 | ] 970 | }, 971 | { 972 | "name": "stderr", 973 | "output_type": "stream", 974 | "text": [ 975 | "100%|██████████| 3/3 [00:00<00:00, 27.85it/s]\n" 976 | ] 977 | }, 978 | { 979 | "name": "stdout", 980 | 
"output_type": "stream", 981 | "text": [ 982 | "{'ner': 9.62541011568001e-06}\n" 983 | ] 984 | }, 985 | { 986 | "name": "stderr", 987 | "output_type": "stream", 988 | "text": [ 989 | "100%|██████████| 3/3 [00:00<00:00, 26.38it/s]\n" 990 | ] 991 | }, 992 | { 993 | "name": "stdout", 994 | "output_type": "stream", 995 | "text": [ 996 | "{'ner': 0.10284944012563473}\n" 997 | ] 998 | }, 999 | { 1000 | "name": "stderr", 1001 | "output_type": "stream", 1002 | "text": [ 1003 | "100%|██████████| 3/3 [00:00<00:00, 24.66it/s]\n" 1004 | ] 1005 | }, 1006 | { 1007 | "name": "stdout", 1008 | "output_type": "stream", 1009 | "text": [ 1010 | "{'ner': 0.0007663793138746722}\n" 1011 | ] 1012 | }, 1013 | { 1014 | "name": "stderr", 1015 | "output_type": "stream", 1016 | "text": [ 1017 | "100%|██████████| 3/3 [00:00<00:00, 25.27it/s]\n" 1018 | ] 1019 | }, 1020 | { 1021 | "name": "stdout", 1022 | "output_type": "stream", 1023 | "text": [ 1024 | "{'ner': 3.126371562519889e-07}\n" 1025 | ] 1026 | }, 1027 | { 1028 | "name": "stderr", 1029 | "output_type": "stream", 1030 | "text": [ 1031 | "100%|██████████| 3/3 [00:00<00:00, 25.07it/s]\n" 1032 | ] 1033 | }, 1034 | { 1035 | "name": "stdout", 1036 | "output_type": "stream", 1037 | "text": [ 1038 | "{'ner': 2.0217684843293586e-05}\n" 1039 | ] 1040 | }, 1041 | { 1042 | "name": "stderr", 1043 | "output_type": "stream", 1044 | "text": [ 1045 | "100%|██████████| 3/3 [00:00<00:00, 24.06it/s]\n" 1046 | ] 1047 | }, 1048 | { 1049 | "name": "stdout", 1050 | "output_type": "stream", 1051 | "text": [ 1052 | "{'ner': 1.218231428182522e-05}\n" 1053 | ] 1054 | }, 1055 | { 1056 | "name": "stderr", 1057 | "output_type": "stream", 1058 | "text": [ 1059 | "100%|██████████| 3/3 [00:00<00:00, 24.46it/s]\n" 1060 | ] 1061 | }, 1062 | { 1063 | "name": "stdout", 1064 | "output_type": "stream", 1065 | "text": [ 1066 | "{'ner': 3.181537376351195e-06}\n" 1067 | ] 1068 | }, 1069 | { 1070 | "name": "stderr", 1071 | "output_type": "stream", 1072 | "text": [ 1073 | 
"100%|██████████| 3/3 [00:00<00:00, 21.33it/s]\n" 1074 | ] 1075 | }, 1076 | { 1077 | "name": "stdout", 1078 | "output_type": "stream", 1079 | "text": [ 1080 | "{'ner': 0.00026197582790653343}\n" 1081 | ] 1082 | }, 1083 | { 1084 | "name": "stderr", 1085 | "output_type": "stream", 1086 | "text": [ 1087 | "100%|██████████| 3/3 [00:00<00:00, 21.80it/s]\n" 1088 | ] 1089 | }, 1090 | { 1091 | "name": "stdout", 1092 | "output_type": "stream", 1093 | "text": [ 1094 | "{'ner': 0.0003894786458399904}\n" 1095 | ] 1096 | }, 1097 | { 1098 | "name": "stderr", 1099 | "output_type": "stream", 1100 | "text": [ 1101 | "100%|██████████| 3/3 [00:00<00:00, 22.28it/s]\n" 1102 | ] 1103 | }, 1104 | { 1105 | "name": "stdout", 1106 | "output_type": "stream", 1107 | "text": [ 1108 | "{'ner': 3.4010406020859926e-05}\n" 1109 | ] 1110 | }, 1111 | { 1112 | "name": "stderr", 1113 | "output_type": "stream", 1114 | "text": [ 1115 | "100%|██████████| 3/3 [00:00<00:00, 21.65it/s]\n" 1116 | ] 1117 | }, 1118 | { 1119 | "name": "stdout", 1120 | "output_type": "stream", 1121 | "text": [ 1122 | "{'ner': 1.9612036935329582e-05}\n" 1123 | ] 1124 | }, 1125 | { 1126 | "name": "stderr", 1127 | "output_type": "stream", 1128 | "text": [ 1129 | "100%|██████████| 3/3 [00:00<00:00, 22.11it/s]\n" 1130 | ] 1131 | }, 1132 | { 1133 | "name": "stdout", 1134 | "output_type": "stream", 1135 | "text": [ 1136 | "{'ner': 0.004094531692732815}\n" 1137 | ] 1138 | }, 1139 | { 1140 | "name": "stderr", 1141 | "output_type": "stream", 1142 | "text": [ 1143 | "100%|██████████| 3/3 [00:00<00:00, 19.72it/s]\n" 1144 | ] 1145 | }, 1146 | { 1147 | "name": "stdout", 1148 | "output_type": "stream", 1149 | "text": [ 1150 | "{'ner': 3.1664290765182284e-07}\n" 1151 | ] 1152 | }, 1153 | { 1154 | "name": "stderr", 1155 | "output_type": "stream", 1156 | "text": [ 1157 | "100%|██████████| 3/3 [00:00<00:00, 21.91it/s]\n" 1158 | ] 1159 | }, 1160 | { 1161 | "name": "stdout", 1162 | "output_type": "stream", 1163 | "text": [ 1164 | "{'ner': 
7.285047079350139e-06}\n" 1165 | ] 1166 | }, 1167 | { 1168 | "name": "stderr", 1169 | "output_type": "stream", 1170 | "text": [ 1171 | "100%|██████████| 3/3 [00:00<00:00, 19.66it/s]\n" 1172 | ] 1173 | }, 1174 | { 1175 | "name": "stdout", 1176 | "output_type": "stream", 1177 | "text": [ 1178 | "{'ner': 2.394377973120872e-07}\n" 1179 | ] 1180 | }, 1181 | { 1182 | "name": "stderr", 1183 | "output_type": "stream", 1184 | "text": [ 1185 | "100%|██████████| 3/3 [00:00<00:00, 18.12it/s]\n" 1186 | ] 1187 | }, 1188 | { 1189 | "name": "stdout", 1190 | "output_type": "stream", 1191 | "text": [ 1192 | "{'ner': 0.00022465953246274834}\n" 1193 | ] 1194 | }, 1195 | { 1196 | "name": "stderr", 1197 | "output_type": "stream", 1198 | "text": [ 1199 | "100%|██████████| 3/3 [00:00<00:00, 22.14it/s]\n" 1200 | ] 1201 | }, 1202 | { 1203 | "name": "stdout", 1204 | "output_type": "stream", 1205 | "text": [ 1206 | "{'ner': 1.0863004763571723e-06}\n" 1207 | ] 1208 | }, 1209 | { 1210 | "name": "stderr", 1211 | "output_type": "stream", 1212 | "text": [ 1213 | "100%|██████████| 3/3 [00:00<00:00, 22.74it/s]\n" 1214 | ] 1215 | }, 1216 | { 1217 | "name": "stdout", 1218 | "output_type": "stream", 1219 | "text": [ 1220 | "{'ner': 0.0023946468426480406}\n" 1221 | ] 1222 | }, 1223 | { 1224 | "name": "stderr", 1225 | "output_type": "stream", 1226 | "text": [ 1227 | "100%|██████████| 3/3 [00:00<00:00, 20.35it/s]\n" 1228 | ] 1229 | }, 1230 | { 1231 | "name": "stdout", 1232 | "output_type": "stream", 1233 | "text": [ 1234 | "{'ner': 6.169837382418367e-06}\n" 1235 | ] 1236 | }, 1237 | { 1238 | "name": "stderr", 1239 | "output_type": "stream", 1240 | "text": [ 1241 | "100%|██████████| 3/3 [00:00<00:00, 22.12it/s]\n" 1242 | ] 1243 | }, 1244 | { 1245 | "name": "stdout", 1246 | "output_type": "stream", 1247 | "text": [ 1248 | "{'ner': 0.00030678138916277324}\n" 1249 | ] 1250 | }, 1251 | { 1252 | "name": "stderr", 1253 | "output_type": "stream", 1254 | "text": [ 1255 | "100%|██████████| 3/3 [00:00<00:00, 
23.14it/s]\n" 1256 | ] 1257 | }, 1258 | { 1259 | "name": "stdout", 1260 | "output_type": "stream", 1261 | "text": [ 1262 | "{'ner': 0.00022935201453786304}\n" 1263 | ] 1264 | }, 1265 | { 1266 | "name": "stderr", 1267 | "output_type": "stream", 1268 | "text": [ 1269 | "100%|██████████| 3/3 [00:00<00:00, 18.57it/s]\n" 1270 | ] 1271 | }, 1272 | { 1273 | "name": "stdout", 1274 | "output_type": "stream", 1275 | "text": [ 1276 | "{'ner': 6.255226670428841e-06}\n" 1277 | ] 1278 | }, 1279 | { 1280 | "name": "stderr", 1281 | "output_type": "stream", 1282 | "text": [ 1283 | "100%|██████████| 3/3 [00:00<00:00, 18.99it/s]\n" 1284 | ] 1285 | }, 1286 | { 1287 | "name": "stdout", 1288 | "output_type": "stream", 1289 | "text": [ 1290 | "{'ner': 4.085394059302123e-08}\n" 1291 | ] 1292 | }, 1293 | { 1294 | "name": "stderr", 1295 | "output_type": "stream", 1296 | "text": [ 1297 | "100%|██████████| 3/3 [00:00<00:00, 19.44it/s]\n" 1298 | ] 1299 | }, 1300 | { 1301 | "name": "stdout", 1302 | "output_type": "stream", 1303 | "text": [ 1304 | "{'ner': 6.995940536268303e-07}\n" 1305 | ] 1306 | }, 1307 | { 1308 | "name": "stderr", 1309 | "output_type": "stream", 1310 | "text": [ 1311 | "100%|██████████| 3/3 [00:00<00:00, 20.89it/s]\n" 1312 | ] 1313 | }, 1314 | { 1315 | "name": "stdout", 1316 | "output_type": "stream", 1317 | "text": [ 1318 | "{'ner': 4.706886355837702e-07}\n" 1319 | ] 1320 | }, 1321 | { 1322 | "name": "stderr", 1323 | "output_type": "stream", 1324 | "text": [ 1325 | "100%|██████████| 3/3 [00:00<00:00, 20.32it/s]\n" 1326 | ] 1327 | }, 1328 | { 1329 | "name": "stdout", 1330 | "output_type": "stream", 1331 | "text": [ 1332 | "{'ner': 0.011415514144148941}\n" 1333 | ] 1334 | }, 1335 | { 1336 | "name": "stderr", 1337 | "output_type": "stream", 1338 | "text": [ 1339 | "100%|██████████| 3/3 [00:00<00:00, 21.33it/s]\n" 1340 | ] 1341 | }, 1342 | { 1343 | "name": "stdout", 1344 | "output_type": "stream", 1345 | "text": [ 1346 | "{'ner': 5.458422404451642e-08}\n" 1347 | ] 1348 | }, 1349 
| { 1350 | "name": "stderr", 1351 | "output_type": "stream", 1352 | "text": [ 1353 | "100%|██████████| 3/3 [00:00<00:00, 17.70it/s]\n" 1354 | ] 1355 | }, 1356 | { 1357 | "name": "stdout", 1358 | "output_type": "stream", 1359 | "text": [ 1360 | "{'ner': 2.5626111289965546e-08}\n" 1361 | ] 1362 | }, 1363 | { 1364 | "name": "stderr", 1365 | "output_type": "stream", 1366 | "text": [ 1367 | "100%|██████████| 3/3 [00:00<00:00, 9.47it/s]\n" 1368 | ] 1369 | }, 1370 | { 1371 | "name": "stdout", 1372 | "output_type": "stream", 1373 | "text": [ 1374 | "{'ner': 0.0005705031495488346}\n" 1375 | ] 1376 | }, 1377 | { 1378 | "name": "stderr", 1379 | "output_type": "stream", 1380 | "text": [ 1381 | "100%|██████████| 3/3 [00:00<00:00, 13.14it/s]\n" 1382 | ] 1383 | }, 1384 | { 1385 | "name": "stdout", 1386 | "output_type": "stream", 1387 | "text": [ 1388 | "{'ner': 3.657292176990035e-08}\n" 1389 | ] 1390 | }, 1391 | { 1392 | "name": "stderr", 1393 | "output_type": "stream", 1394 | "text": [ 1395 | "100%|██████████| 3/3 [00:00<00:00, 16.35it/s]\n" 1396 | ] 1397 | }, 1398 | { 1399 | "name": "stdout", 1400 | "output_type": "stream", 1401 | "text": [ 1402 | "{'ner': 5.172763367355009e-06}\n" 1403 | ] 1404 | }, 1405 | { 1406 | "name": "stderr", 1407 | "output_type": "stream", 1408 | "text": [ 1409 | "100%|██████████| 3/3 [00:00<00:00, 17.49it/s]\n" 1410 | ] 1411 | }, 1412 | { 1413 | "name": "stdout", 1414 | "output_type": "stream", 1415 | "text": [ 1416 | "{'ner': 8.243823683565664e-08}\n" 1417 | ] 1418 | }, 1419 | { 1420 | "name": "stderr", 1421 | "output_type": "stream", 1422 | "text": [ 1423 | "100%|██████████| 3/3 [00:00<00:00, 16.80it/s]\n" 1424 | ] 1425 | }, 1426 | { 1427 | "name": "stdout", 1428 | "output_type": "stream", 1429 | "text": [ 1430 | "{'ner': 4.928377747025868e-07}\n" 1431 | ] 1432 | }, 1433 | { 1434 | "name": "stderr", 1435 | "output_type": "stream", 1436 | "text": [ 1437 | "100%|██████████| 3/3 [00:00<00:00, 17.49it/s]\n" 1438 | ] 1439 | }, 1440 | { 1441 | "name": 
"stdout", 1442 | "output_type": "stream", 1443 | "text": [ 1444 | "{'ner': 8.718774975073686e-09}\n" 1445 | ] 1446 | }, 1447 | { 1448 | "name": "stderr", 1449 | "output_type": "stream", 1450 | "text": [ 1451 | "100%|██████████| 3/3 [00:00<00:00, 16.90it/s]\n" 1452 | ] 1453 | }, 1454 | { 1455 | "name": "stdout", 1456 | "output_type": "stream", 1457 | "text": [ 1458 | "{'ner': 1.1960221041722968e-05}\n" 1459 | ] 1460 | }, 1461 | { 1462 | "name": "stderr", 1463 | "output_type": "stream", 1464 | "text": [ 1465 | "100%|██████████| 3/3 [00:00<00:00, 20.32it/s]\n" 1466 | ] 1467 | }, 1468 | { 1469 | "name": "stdout", 1470 | "output_type": "stream", 1471 | "text": [ 1472 | "{'ner': 2.9751551858409105e-05}\n" 1473 | ] 1474 | }, 1475 | { 1476 | "name": "stderr", 1477 | "output_type": "stream", 1478 | "text": [ 1479 | "100%|██████████| 3/3 [00:00<00:00, 19.16it/s]\n" 1480 | ] 1481 | }, 1482 | { 1483 | "name": "stdout", 1484 | "output_type": "stream", 1485 | "text": [ 1486 | "{'ner': 2.96942204058517e-06}\n" 1487 | ] 1488 | }, 1489 | { 1490 | "name": "stderr", 1491 | "output_type": "stream", 1492 | "text": [ 1493 | "100%|██████████| 3/3 [00:00<00:00, 21.49it/s]\n" 1494 | ] 1495 | }, 1496 | { 1497 | "name": "stdout", 1498 | "output_type": "stream", 1499 | "text": [ 1500 | "{'ner': 0.0016165699260966425}\n" 1501 | ] 1502 | }, 1503 | { 1504 | "name": "stderr", 1505 | "output_type": "stream", 1506 | "text": [ 1507 | "100%|██████████| 3/3 [00:00<00:00, 21.64it/s]\n" 1508 | ] 1509 | }, 1510 | { 1511 | "name": "stdout", 1512 | "output_type": "stream", 1513 | "text": [ 1514 | "{'ner': 4.713544226093801e-10}\n" 1515 | ] 1516 | }, 1517 | { 1518 | "name": "stderr", 1519 | "output_type": "stream", 1520 | "text": [ 1521 | "100%|██████████| 3/3 [00:00<00:00, 22.62it/s]\n" 1522 | ] 1523 | }, 1524 | { 1525 | "name": "stdout", 1526 | "output_type": "stream", 1527 | "text": [ 1528 | "{'ner': 0.0031288532863410316}\n" 1529 | ] 1530 | }, 1531 | { 1532 | "name": "stderr", 1533 | "output_type": 
"stream", 1534 | "text": [ 1535 | "100%|██████████| 3/3 [00:00<00:00, 19.92it/s]\n" 1536 | ] 1537 | }, 1538 | { 1539 | "name": "stdout", 1540 | "output_type": "stream", 1541 | "text": [ 1542 | "{'ner': 3.34105816504464e-05}\n" 1543 | ] 1544 | }, 1545 | { 1546 | "name": "stderr", 1547 | "output_type": "stream", 1548 | "text": [ 1549 | "100%|██████████| 3/3 [00:00<00:00, 15.51it/s]\n" 1550 | ] 1551 | }, 1552 | { 1553 | "name": "stdout", 1554 | "output_type": "stream", 1555 | "text": [ 1556 | "{'ner': 5.541132249760118e-10}\n" 1557 | ] 1558 | }, 1559 | { 1560 | "name": "stderr", 1561 | "output_type": "stream", 1562 | "text": [ 1563 | "100%|██████████| 3/3 [00:00<00:00, 14.97it/s]\n" 1564 | ] 1565 | }, 1566 | { 1567 | "name": "stdout", 1568 | "output_type": "stream", 1569 | "text": [ 1570 | "{'ner': 3.6742865249447716e-06}\n" 1571 | ] 1572 | }, 1573 | { 1574 | "name": "stderr", 1575 | "output_type": "stream", 1576 | "text": [ 1577 | "100%|██████████| 3/3 [00:00<00:00, 15.43it/s]\n" 1578 | ] 1579 | }, 1580 | { 1581 | "name": "stdout", 1582 | "output_type": "stream", 1583 | "text": [ 1584 | "{'ner': 1.8795149241263365e-05}\n" 1585 | ] 1586 | }, 1587 | { 1588 | "name": "stderr", 1589 | "output_type": "stream", 1590 | "text": [ 1591 | "100%|██████████| 3/3 [00:00<00:00, 12.23it/s]\n" 1592 | ] 1593 | }, 1594 | { 1595 | "name": "stdout", 1596 | "output_type": "stream", 1597 | "text": [ 1598 | "{'ner': 2.7214211207259498e-09}\n" 1599 | ] 1600 | } 1601 | ], 1602 | "source": [ 1603 | "for _, annotations in TRAIN_DATA:\n", 1604 | " for ent in annotations.get('entities'):\n", 1605 | " ner.add_label(ent[2])\n", 1606 | "\n", 1607 | "other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']\n", 1608 | "with nlp.disable_pipes(*other_pipes): # only train NER\n", 1609 | " optimizer = nlp.begin_training()\n", 1610 | " for itn in range(n_iter):\n", 1611 | " random.shuffle(TRAIN_DATA)\n", 1612 | " losses = {}\n", 1613 | " for text, annotations in tqdm(TRAIN_DATA):\n", 1614 | " 
nlp.update(\n", 1615 | " [text], \n", 1616 | " [annotations], \n", 1617 | " drop=0.5, \n", 1618 | " sgd=optimizer,\n", 1619 | " losses=losses)\n", 1620 | " print(losses)" 1621 | ] 1622 | }, 1623 | { 1624 | "cell_type": "markdown", 1625 | "metadata": {}, 1626 | "source": [ 1627 | "## Test the trained model" 1628 | ] 1629 | }, 1630 | { 1631 | "cell_type": "code", 1632 | "execution_count": 12, 1633 | "metadata": {}, 1634 | "outputs": [ 1635 | { 1636 | "name": "stdout", 1637 | "output_type": "stream", 1638 | "text": [ 1639 | "Entities [('Kamal Khumar', 'PERSON')]\n", 1640 | "Tokens [('Who', '', 2), ('is', '', 2), ('Kamal', 'PERSON', 3), ('Khumar', 'PERSON', 1), ('?', '', 2)]\n", 1641 | "Entities [('Nishanth', 'PERSON')]\n", 1642 | "Tokens [('Who', '', 2), ('is', '', 2), ('Nishanth', 'PERSON', 3), ('?', '', 2)]\n", 1643 | "Entities [('London', 'LOC'), ('Berlin', 'LOC')]\n", 1644 | "Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3), ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]\n" 1645 | ] 1646 | } 1647 | ], 1648 | "source": [ 1649 | "for text, _ in TRAIN_DATA:\n", 1650 | " doc = nlp(text)\n", 1651 | " print('Entities', [(ent.text, ent.label_) for ent in doc.ents])\n", 1652 | " print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])" 1653 | ] 1654 | }, 1655 | { 1656 | "cell_type": "markdown", 1657 | "metadata": {}, 1658 | "source": [ 1659 | "## Save the model" 1660 | ] 1661 | }, 1662 | { 1663 | "cell_type": "code", 1664 | "execution_count": 16, 1665 | "metadata": {}, 1666 | "outputs": [ 1667 | { 1668 | "name": "stdout", 1669 | "output_type": "stream", 1670 | "text": [ 1671 | "Saved model to C:\\Users\\nithi\\Documents\\ner\n" 1672 | ] 1673 | } 1674 | ], 1675 | "source": [ 1676 | "if output_dir is not None:\n", 1677 | " output_dir = Path(output_dir)\n", 1678 | " if not output_dir.exists():\n", 1679 | " output_dir.mkdir()\n", 1680 | " nlp.to_disk(output_dir)\n", 1681 | " print(\"Saved model to\", output_dir) " 1682 | ] 1683 | }, 1684 | { 1685 | 
"cell_type": "markdown", 1686 | "metadata": {}, 1687 | "source": [ 1688 | "## Test the saved model" 1689 | ] 1690 | }, 1691 | { 1692 | "cell_type": "code", 1693 | "execution_count": 14, 1694 | "metadata": {}, 1695 | "outputs": [ 1696 | { 1697 | "name": "stdout", 1698 | "output_type": "stream", 1699 | "text": [ 1700 | "Loading from C:\\Users\\nithi\\Documents\\ner\n", 1701 | "Entities [('Kamal Khumar', 'PERSON')]\n", 1702 | "Tokens [('Who', '', 2), ('is', '', 2), ('Kamal', 'PERSON', 3), ('Khumar', 'PERSON', 1), ('?', '', 2)]\n", 1703 | "Entities [('Nishanth', 'PERSON')]\n", 1704 | "Tokens [('Who', '', 2), ('is', '', 2), ('Nishanth', 'PERSON', 3), ('?', '', 2)]\n", 1705 | "Entities [('London', 'LOC'), ('Berlin', 'LOC')]\n", 1706 | "Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3), ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]\n" 1707 | ] 1708 | } 1709 | ], 1710 | "source": [ 1711 | "print(\"Loading from\", output_dir)\n", 1712 | "nlp2 = spacy.load(output_dir)\n", 1713 | "for text, _ in TRAIN_DATA:\n", 1714 | " doc = nlp2(text)\n", 1715 | " print('Entities', [(ent.text, ent.label_) for ent in doc.ents])\n", 1716 | " print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])" 1717 | ] 1718 | }, 1719 | { 1720 | "cell_type": "code", 1721 | "execution_count": null, 1722 | "metadata": {}, 1723 | "outputs": [], 1724 | "source": [] 1725 | } 1726 | ], 1727 | "metadata": { 1728 | "kernelspec": { 1729 | "display_name": "Python 3", 1730 | "language": "python", 1731 | "name": "python3" 1732 | }, 1733 | "language_info": { 1734 | "codemirror_mode": { 1735 | "name": "ipython", 1736 | "version": 3 1737 | }, 1738 | "file_extension": ".py", 1739 | "mimetype": "text/x-python", 1740 | "name": "python", 1741 | "nbconvert_exporter": "python", 1742 | "pygments_lexer": "ipython3", 1743 | "version": "3.7.4" 1744 | } 1745 | }, 1746 | "nbformat": 4, 1747 | "nbformat_minor": 4 1748 | } 1749 | --------------------------------------------------------------------------------