├── NLP using Deep Learning in Python.ipynb └── README.md /NLP using Deep Learning in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP using Deep Learning in Python - Quora Duplicate Questions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Problem Statement:\n", 15 | "\n", 16 | "Over 100 million people visit Quora every month, so it's no surprise that **many people ask similarly worded questions**. Multiple questions with the same intent **can cause seekers to spend more time finding the best answer to their question**, and **make writers feel they need to answer multiple versions of the same question**. Quora values canonical questions because they **provide a better experience to active seekers and writers**, and offer more value to both of these groups in the long term.\n", 17 | "\n", 18 | "**Reference:** https://www.kaggle.com/c/quora-question-pairs" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 322, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import pandas as pd\n", 28 | "import numpy as np\n", 29 | "import sklearn" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 323, 35 | "metadata": {}, 36 | "outputs": [ 37 | { 38 | "data": { 39 | "text/html": [ 40 | "
\n", 41 | "\n", 54 | "\n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | "
idqid1qid2question1question2is_duplicate
0012What is the step by step guide to invest in sh...What is the step by step guide to invest in sh...0
1134What is the story of Kohinoor (Koh-i-Noor) Dia...What would happen if the Indian government sto...0
2256How can I increase the speed of my internet co...How can Internet speed be increased by hacking...0
3378Why am I mentally very lonely? How can I solve...Find the remainder when [math]23^{24}[/math] i...0
44910Which one dissolve in water quikly sugar, salt...Which fish would survive in salt water?0
\n", 114 | "
" 115 | ], 116 | "text/plain": [ 117 | " id qid1 qid2 question1 \\\n", 118 | "0 0 1 2 What is the step by step guide to invest in sh... \n", 119 | "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n", 120 | "2 2 5 6 How can I increase the speed of my internet co... \n", 121 | "3 3 7 8 Why am I mentally very lonely? How can I solve... \n", 122 | "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n", 123 | "\n", 124 | " question2 is_duplicate \n", 125 | "0 What is the step by step guide to invest in sh... 0 \n", 126 | "1 What would happen if the Indian government sto... 0 \n", 127 | "2 How can Internet speed be increased by hacking... 0 \n", 128 | "3 Find the remainder when [math]23^{24}[/math] i... 0 \n", 129 | "4 Which fish would survive in salt water? 0 " 130 | ] 131 | }, 132 | "execution_count": 323, 133 | "metadata": {}, 134 | "output_type": "execute_result" 135 | } 136 | ], 137 | "source": [ 138 | "question_pairs = pd.read_csv(\"../../raw_data/questions.csv\")\n", 139 | "question_pairs.head()" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 325, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "data": { 149 | "text/plain": [ 150 | "(808702, 2)" 151 | ] 152 | }, 153 | "execution_count": 325, 154 | "metadata": {}, 155 | "output_type": "execute_result" 156 | } 157 | ], 158 | "source": [ 159 | "question_pairs_1 = question_pairs[['qid1', 'question1']]\n", 160 | "question_pairs_1.columns = ['id', 'question']\n", 161 | "question_pairs_2 = question_pairs[['qid2', 'question2']]\n", 162 | "question_pairs_2.columns = ['id', 'question']\n", 163 | "questions_list = pd.concat([question_pairs_1,question_pairs_2]).sort_values('id')\n", 164 | "questions_list.shape" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 326, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "text/plain": [ 175 | "['What is the step by step guide to invest in share market in india?',\n", 176 | " 'What is the step by step guide to invest in share market?',\n", 177 | " 'What is the story of Kohinoor (Koh-i-Noor) Diamond?',\n", 178 | " 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',\n", 179 | " 'How can I increase the speed of my internet connection while using a VPN?',\n", 180 | " 'How can Internet speed be increased by hacking through DNS?',\n", 181 | " 'Why am I mentally very lonely? How can I solve it?',\n", 182 | " 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?',\n", 183 | " 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?',\n", 184 | " 'Which fish would survive in salt water?']" 185 | ] 186 | }, 187 | "execution_count": 326, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "corpus = questions_list['question'].tolist()\n", 194 | "corpus[:10]" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 327, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "corpus = list(np.unique(corpus))" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "-------\n", 211 | "## Feature Extraction" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "### Count Vectorizer" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 79, 224 | "metadata": {}, 225 | "outputs": [ 226 | { 227 | "data": { 228 | "text/html": [ 229 | "
\n", 230 | "\n", 243 | "\n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | "
10002000500aboutaddallamandanyare...waswewhatwhowillwinwithworldyouyour
00000000010...1010001000
11010000200...0010100000
20000100000...0000000000
30000010000...0001110010
41000001000...0000000000
50110000101...0000000000
60100000001...0000001000
70001000001...0001001011
80000000001...0100000100
90000000001...0100000100
\n", 513 | "

10 rows × 96 columns

\n", 514 | "
" 515 | ], 516 | "text/plain": [ 517 | " 1000 2000 500 about add all am and any are ... was we what \\\n", 518 | "0 0 0 0 0 0 0 0 0 1 0 ... 1 0 1 \n", 519 | "1 1 0 1 0 0 0 0 2 0 0 ... 0 0 1 \n", 520 | "2 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 \n", 521 | "3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 \n", 522 | "4 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 \n", 523 | "5 0 1 1 0 0 0 0 1 0 1 ... 0 0 0 \n", 524 | "6 0 1 0 0 0 0 0 0 0 1 ... 0 0 0 \n", 525 | "7 0 0 0 1 0 0 0 0 0 1 ... 0 0 0 \n", 526 | "8 0 0 0 0 0 0 0 0 0 1 ... 0 1 0 \n", 527 | "9 0 0 0 0 0 0 0 0 0 1 ... 0 1 0 \n", 528 | "\n", 529 | " who will win with world you your \n", 530 | "0 0 0 0 1 0 0 0 \n", 531 | "1 0 1 0 0 0 0 0 \n", 532 | "2 0 0 0 0 0 0 0 \n", 533 | "3 1 1 1 0 0 1 0 \n", 534 | "4 0 0 0 0 0 0 0 \n", 535 | "5 0 0 0 0 0 0 0 \n", 536 | "6 0 0 0 1 0 0 0 \n", 537 | "7 1 0 0 1 0 1 1 \n", 538 | "8 0 0 0 0 1 0 0 \n", 539 | "9 0 0 0 0 1 0 0 \n", 540 | "\n", 541 | "[10 rows x 96 columns]" 542 | ] 543 | }, 544 | "execution_count": 79, 545 | "metadata": {}, 546 | "output_type": "execute_result" 547 | } 548 | ], 549 | "source": [ 550 | "from sklearn.feature_extraction.text import CountVectorizer\n", 551 | "count_vect = CountVectorizer()\n", 552 | "\n", 553 | "X_train_counts = count_vect.fit_transform(corpus[:10])\n", 554 | "X_train_counts = pd.DataFrame(X_train_counts.toarray())\n", 555 | "X_train_counts.columns = count_vect.get_feature_names()\n", 556 | "X_train_counts" 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": 80, 562 | "metadata": {}, 563 | "outputs": [ 564 | { 565 | "data": { 566 | "text/plain": [ 567 | "'\"The question was marked as needing improvement\" how to deal with this, what ever I do still this error pops up? Is it Quora bot or any user?'" 568 | ] 569 | }, 570 | "execution_count": 80, 571 | "metadata": {}, 572 | "output_type": "execute_result" 573 | } 574 | ], 575 | "source": [ 576 | "corpus[0]" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 81, 582 | "metadata": {}, 583 | "outputs": [ 584 | { 585 | "data": { 586 | "text/plain": [ 587 | "1000 0\n", 588 | "2000 0\n", 589 | "500 0\n", 590 | "about 0\n", 591 | "add 0\n", 592 | "all 0\n", 593 | "am 0\n", 594 | "and 0\n", 595 | "any 1\n", 596 | "are 0\n", 597 | "as 1\n", 598 | "aside 0\n", 599 | "at 0\n", 600 | "banned 0\n", 601 | "based 0\n", 602 | "be 0\n", 603 | "been 0\n", 604 | "biases 0\n", 605 | "big 0\n", 606 | "bot 1\n", 607 | "can 0\n", 608 | "chip 0\n", 609 | "closer 0\n", 610 | "currency 0\n", 611 | "deal 1\n", 612 | "distance 0\n", 613 | "do 1\n", 614 | "election 0\n", 615 | "embedded 0\n", 616 | "error 1\n", 617 | " ..\n", 618 | "really 0\n", 619 | "relationship 0\n", 620 | "relationships 0\n", 621 | "rs 0\n", 622 | "see 0\n", 623 | "short 0\n", 624 | "starting 0\n", 625 | "still 1\n", 626 | "successful 0\n", 627 | "tell 0\n", 628 | "term 0\n", 629 | "the 1\n", 630 | "there 0\n", 631 | "think 0\n", 632 | "this 2\n", 633 | "time 0\n", 634 | "to 1\n", 635 | "up 1\n", 636 | "user 1\n", 637 | "war 0\n", 638 | "was 1\n", 639 | "we 0\n", 640 | "what 1\n", 641 | "who 0\n", 642 | "will 0\n", 643 | "win 0\n", 644 | "with 1\n", 645 | "world 0\n", 646 | "you 0\n", 647 | "your 0\n", 648 | "Name: 0, Length: 96, dtype: int64" 649 | ] 650 | }, 651 | "execution_count": 81, 652 | "metadata": {}, 653 | "output_type": "execute_result" 654 | } 655 | ], 656 | "source": [ 657 | "X_train_counts.loc[0]" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": {}, 663 | "source": [ 664 | "### Tf-Idf (Term Frequency - Inverse Document 
Frequency)" 665 | ] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "execution_count": 82, 670 | "metadata": {}, 671 | "outputs": [ 672 | { 673 | "data": { 674 | "text/html": [ 675 | "
\n", 676 | "\n", 689 | "\n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | "
10002000500aboutaddallamandanyare...waswewhatwhowillwinwithworldyouyour
00.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.202040.000000...0.202040.0000000.1717530.0000000.0000000.0000000.1502630.0000000.0000000.000000
10.2057760.0000000.2057760.0000000.0000000.0000000.0000000.4115530.000000.000000...0.000000.0000000.2057760.0000000.2057760.0000000.0000000.0000000.0000000.000000
20.0000000.0000000.0000000.0000000.5182910.0000000.0000000.0000000.000000.000000...0.000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
30.0000000.0000000.0000000.0000000.0000000.2630250.0000000.0000000.000000.000000...0.000000.0000000.0000000.2235950.2235950.2630250.0000000.0000000.2235950.000000
40.2665930.0000000.0000000.0000000.0000000.0000000.3136050.0000000.000000.000000...0.000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
50.0000000.2515490.2515490.0000000.0000000.0000000.0000000.2515490.000000.175717...0.000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
60.0000000.3133400.0000000.0000000.0000000.0000000.0000000.0000000.000000.218880...0.000000.0000000.0000000.0000000.0000000.0000000.2741360.0000000.0000000.000000
70.0000000.0000000.0000000.2042760.0000000.0000000.0000000.0000000.000000.121303...0.000000.0000000.0000000.1736530.0000000.0000000.1519260.0000000.1736530.204276
80.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.263628...0.000000.3774000.0000000.0000000.0000000.0000000.0000000.3774000.0000000.000000
90.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.235430...0.000000.3370330.0000000.0000000.0000000.0000000.0000000.3370330.0000000.000000
\n", 959 | "

10 rows × 96 columns

\n", 960 | "
" 961 | ], 962 | "text/plain": [ 963 | " 1000 2000 500 about add all am \\\n", 964 | "0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 965 | "1 0.205776 0.000000 0.205776 0.000000 0.000000 0.000000 0.000000 \n", 966 | "2 0.000000 0.000000 0.000000 0.000000 0.518291 0.000000 0.000000 \n", 967 | "3 0.000000 0.000000 0.000000 0.000000 0.000000 0.263025 0.000000 \n", 968 | "4 0.266593 0.000000 0.000000 0.000000 0.000000 0.000000 0.313605 \n", 969 | "5 0.000000 0.251549 0.251549 0.000000 0.000000 0.000000 0.000000 \n", 970 | "6 0.000000 0.313340 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 971 | "7 0.000000 0.000000 0.000000 0.204276 0.000000 0.000000 0.000000 \n", 972 | "8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 973 | "9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 974 | "\n", 975 | " and any are ... was we what who \\\n", 976 | "0 0.000000 0.20204 0.000000 ... 0.20204 0.000000 0.171753 0.000000 \n", 977 | "1 0.411553 0.00000 0.000000 ... 0.00000 0.000000 0.205776 0.000000 \n", 978 | "2 0.000000 0.00000 0.000000 ... 0.00000 0.000000 0.000000 0.000000 \n", 979 | "3 0.000000 0.00000 0.000000 ... 0.00000 0.000000 0.000000 0.223595 \n", 980 | "4 0.000000 0.00000 0.000000 ... 0.00000 0.000000 0.000000 0.000000 \n", 981 | "5 0.251549 0.00000 0.175717 ... 0.00000 0.000000 0.000000 0.000000 \n", 982 | "6 0.000000 0.00000 0.218880 ... 0.00000 0.000000 0.000000 0.000000 \n", 983 | "7 0.000000 0.00000 0.121303 ... 0.00000 0.000000 0.000000 0.173653 \n", 984 | "8 0.000000 0.00000 0.263628 ... 0.00000 0.377400 0.000000 0.000000 \n", 985 | "9 0.000000 0.00000 0.235430 ... 0.00000 0.337033 0.000000 0.000000 \n", 986 | "\n", 987 | " will win with world you your \n", 988 | "0 0.000000 0.000000 0.150263 0.000000 0.000000 0.000000 \n", 989 | "1 0.205776 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 990 | "2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 991 | "3 0.223595 0.263025 0.000000 0.000000 0.223595 0.000000 \n", 992 | "4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 993 | "5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 994 | "6 0.000000 0.000000 0.274136 0.000000 0.000000 0.000000 \n", 995 | "7 0.000000 0.000000 0.151926 0.000000 0.173653 0.204276 \n", 996 | "8 0.000000 0.000000 0.000000 0.377400 0.000000 0.000000 \n", 997 | "9 0.000000 0.000000 0.000000 0.337033 0.000000 0.000000 \n", 998 | "\n", 999 | "[10 rows x 96 columns]" 1000 | ] 1001 | }, 1002 | "execution_count": 82, 1003 | "metadata": {}, 1004 | "output_type": "execute_result" 1005 | } 1006 | ], 1007 | "source": [ 1008 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1009 | "vectorizer = TfidfVectorizer()\n", 1010 | "\n", 1011 | "X_train_tfidf = vectorizer.fit_transform(corpus[:10])\n", 1012 | "X_train_tfidf = pd.DataFrame(X_train_tfidf.toarray())\n", 1013 | "X_train_tfidf.columns = vectorizer.get_feature_names()\n", 1014 | "X_train_tfidf" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "markdown", 1019 | "metadata": {}, 1020 | "source": [ 1021 | "$$Tf(w, d) = Number \\ of \\ times \\ word \\ `w` \\ appears \\ in \\ document \\ `d`$$\n", 1022 | "\n", 1023 | "$$IDF(w) = \\log \\frac{Total \\ number \\ of \\ documents}{Number \\ of \\ documents \\ with \\ word \\ `w`}$$\n", 1024 | "\n", 1025 | "$$Tfidf(w, d) = Tf(w, d) * IDF(w)$$" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "### Word Vectors\n", 1033 | "\n", 1034 | "Word vectors - also called 
*word embeddings* - are mathematical descriptions of individual words such that words that appear frequently together in the language will have similar values. In this way we can mathematically derive *context*.\n", 1035 | "\n", 1036 | "**There are two possible approaches:**\n", 1037 | "\n", 1038 | "\"Drawing\"\n", 1039 | "\n", 1040 | "**CBOW (Continuous Bag Of Words):** It predicts the word, given context around the word as input\n", 1041 | "\n", 1042 | "**Skip-gram:** It predicts the context, given the word as input" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "code", 1047 | "execution_count": 83, 1048 | "metadata": {}, 1049 | "outputs": [], 1050 | "source": [ 1051 | "import spacy\n", 1052 | "nlp = spacy.load('en_core_web_md')" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "execution_count": 84, 1058 | "metadata": {}, 1059 | "outputs": [ 1060 | { 1061 | "data": { 1062 | "text/plain": [ 1063 | "300" 1064 | ] 1065 | }, 1066 | "execution_count": 84, 1067 | "metadata": {}, 1068 | "output_type": "execute_result" 1069 | } 1070 | ], 1071 | "source": [ 1072 | "len(nlp('dog').vector)" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "code", 1077 | "execution_count": 85, 1078 | "metadata": {}, 1079 | "outputs": [], 1080 | "source": [ 1081 | "def most_similar(word, topn=5):\n", 1082 | " word = nlp.vocab[str(word)]\n", 1083 | " queries = [\n", 1084 | " w for w in word.vocab \n", 1085 | " if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)\n", 1086 | " ]\n", 1087 | "\n", 1088 | " by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)\n", 1089 | " return [(w.lower_,w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "code", 1094 | "execution_count": 86, 1095 | "metadata": {}, 1096 | "outputs": [ 1097 | { 1098 | "data": { 1099 | "text/plain": [ 1100 | "[('princes', 0.7876614),\n", 1101 | " ('kings', 0.7876614),\n", 1102 | " ('prince', 0.73377377),\n", 1103 | " ('queen', 0.72526103),\n", 1104 | " ('scepter', 0.6726005),\n", 1105 | " ('throne', 0.6726005),\n", 1106 | " ('kingdoms', 0.6604046),\n", 1107 | " ('kingdom', 0.6604046),\n", 1108 | " ('lord', 0.6439695),\n", 1109 | " ('royal', 0.6168811)]" 1110 | ] 1111 | }, 1112 | "execution_count": 86, 1113 | "metadata": {}, 1114 | "output_type": "execute_result" 1115 | } 1116 | ], 1117 | "source": [ 1118 | "most_similar(\"king\", topn=10)" 1119 | ] 1120 | }, 1121 | { 1122 | "cell_type": "code", 1123 | "execution_count": 87, 1124 | "metadata": {}, 1125 | "outputs": [ 1126 | { 1127 | "data": { 1128 | "text/plain": [ 1129 | "[('cheetah', 0.9999999),\n", 1130 | " ('lions', 0.7758893),\n", 1131 | " ('tiger', 0.7359829),\n", 1132 | " ('panther', 0.7359829),\n", 1133 | " ('leopard', 0.7359829),\n", 1134 | " ('elephant', 0.71239567),\n", 1135 | " ('hippo', 0.71239567),\n", 1136 | " ('zebra', 0.71239567),\n", 1137 | " ('rhino', 0.71239567),\n", 1138 | " ('giraffe', 0.71239567)]" 1139 | ] 1140 | }, 1141 | "execution_count": 87, 1142 | "metadata": {}, 1143 | "output_type": "execute_result" 1144 | } 1145 | ], 1146 | "source": [ 1147 | "most_similar(\"lion\", topn=10)" 1148 | ] 1149 | }, 1150 | { 1151 | "cell_type": "markdown", 1152 | "metadata": {}, 1153 | "source": [ 1154 | "Sentence (or document) objects have vectors, derived from the averages of individual token vectors. This makes it possible to compare similarities between whole documents." 
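The same document vectors can be used directly for question matching: encode both questions and compare them with cosine similarity. A minimal sketch, assuming the `en_core_web_md` model loaded above (the example pair is adapted from the Quora questions shown earlier, not taken verbatim from the notebook):

```python
import spacy

# Load the medium English model, which ships with 300-d word vectors.
nlp = spacy.load('en_core_web_md')

# Two paraphrased questions (adapted from the Quora pairs above).
doc1 = nlp("How can I increase the speed of my internet connection?")
doc2 = nlp("How can Internet speed be increased?")

# Doc.vector is the average of the token vectors;
# Doc.similarity returns the cosine similarity of the two document vectors.
print(doc1.similarity(doc2))
```

A high score here only means the two questions use similar words in similar contexts; the reranking model later in the notebook is what decides whether they are actual duplicates.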
1155 | ] 1156 | }, 1157 | { 1158 | "cell_type": "code", 1159 | "execution_count": 88, 1160 | "metadata": {}, 1161 | "outputs": [ 1162 | { 1163 | "data": { 1164 | "text/plain": [ 1165 | "300" 1166 | ] 1167 | }, 1168 | "execution_count": 88, 1169 | "metadata": {}, 1170 | "output_type": "execute_result" 1171 | } 1172 | ], 1173 | "source": [ 1174 | "doc = nlp('The quick brown fox jumped over the lazy dogs.')\n", 1175 | "len(doc.vector)" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "metadata": {}, 1181 | "source": [ 1182 | "### Bert Sentence Transformer" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": 89, 1188 | "metadata": {}, 1189 | "outputs": [], 1190 | "source": [ 1191 | "from sentence_transformers import SentenceTransformer\n", 1192 | "import scipy.spatial\n", 1193 | "embedder = SentenceTransformer('bert-base-nli-mean-tokens')" 1194 | ] 1195 | }, 1196 | { 1197 | "cell_type": "code", 1198 | "execution_count": 90, 1199 | "metadata": {}, 1200 | "outputs": [ 1201 | { 1202 | "name": "stdout", 1203 | "output_type": "stream", 1204 | "text": [ 1205 | "CPU times: user 1min 54s, sys: 3.71 s, total: 1min 58s\n", 1206 | "Wall time: 34.9 s\n" 1207 | ] 1208 | } 1209 | ], 1210 | "source": [ 1211 | "%%time\n", 1212 | "corpus_embeddings = embedder.encode(corpus)" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": {}, 1218 | "source": [ 1219 | "----\n", 1220 | "## Candidate Generation using Faiss vector similarity search library" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "markdown", 1225 | "metadata": {}, 1226 | "source": [ 1227 | "Faiss is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors.\n", 1228 | "\n", 1229 | "**References:**\n", 1230 | "\n", 1231 | "1. [Tutorial](https://github.com/facebookresearch/faiss/wiki/Getting-started)\n", 1232 | "2. 
[facebookresearch/faiss](https://github.com/facebookresearch/faiss)" 1233 | ] 1234 | }, 1235 | { 1236 | "cell_type": "code", 1237 | "execution_count": 91, 1238 | "metadata": {}, 1239 | "outputs": [ 1240 | { 1241 | "name": "stdout", 1242 | "output_type": "stream", 1243 | "text": [ 1244 | "True\n", 1245 | "1362\n" 1246 | ] 1247 | } 1248 | ], 1249 | "source": [ 1250 | "import faiss\n", 1251 | "d= 768\n", 1252 | "index = faiss.IndexFlatL2(d)\n", 1253 | "print(index.is_trained)\n", 1254 | "index.add(np.stack(corpus_embeddings, axis=0))\n", 1255 | "print(index.ntotal)" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "code", 1260 | "execution_count": 92, 1261 | "metadata": {}, 1262 | "outputs": [], 1263 | "source": [ 1264 | "# queries = ['What is the step by step guide to invest in share market in india?', 'How can Internet speed be increased by hacking through DNS?']\n", 1265 | "queries = question_pairs['question1'][:3].tolist()\n", 1266 | "query_embeddings = embedder.encode(queries)" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "code", 1271 | "execution_count": 93, 1272 | "metadata": {}, 1273 | "outputs": [ 1274 | { 1275 | "name": "stdout", 1276 | "output_type": "stream", 1277 | "text": [ 1278 | "[[ 858 998 1025 983 73]\n", 1279 | " [ 771 1015 775 1014 1133]\n", 1280 | " [ 436 455 457 463 462]]\n" 1281 | ] 1282 | } 1283 | ], 1284 | "source": [ 1285 | "k = 5 # we want to see 4 nearest neighbors\n", 1286 | "D, I = index.search(np.stack(query_embeddings, axis=0), k) # actual search\n", 1287 | "print(I) # neighbors of the 5 first queries" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "code", 1292 | "execution_count": 94, 1293 | "metadata": {}, 1294 | "outputs": [ 1295 | { 1296 | "name": "stdout", 1297 | "output_type": "stream", 1298 | "text": [ 1299 | "\n", 1300 | "======================\n", 1301 | "\n", 1302 | "Query: What is purpose of life?\n", 1303 | "\n", 1304 | "Top 5 most similar sentences in corpus:\n", 1305 | "What is purpose of life? (Distance: 0.0000)\n", 1306 | "What is the purpose of life? (Distance: 7.9868)\n", 1307 | "What is your purpose of life? (Distance: 12.3884)\n", 1308 | "What is the meaning or purpose of life? (Distance: 13.6231)\n", 1309 | "From your perspective, what is the purpose of life? (Distance: 17.5448)\n", 1310 | "\n", 1311 | "======================\n", 1312 | "\n", 1313 | "Query: What are your New Year's resolutions for 2017?\n", 1314 | "\n", 1315 | "Top 5 most similar sentences in corpus:\n", 1316 | "What are your New Year's resolutions for 2017? (Distance: 0.0000)\n", 1317 | "What is your New Year's resolutions for 2017? (Distance: 0.6093)\n", 1318 | "What are your new year resolutions for 2017? (Distance: 5.8446)\n", 1319 | "What is your New Year's resolution for 2017? (Distance: 6.6011)\n", 1320 | "What's your New Year's resolution for 2017? (Distance: 8.1350)\n", 1321 | "\n", 1322 | "======================\n", 1323 | "\n", 1324 | "Query: How will Indian GDP be affected from banning 500 and 1000 rupees notes?\n", 1325 | "\n", 1326 | "Top 5 most similar sentences in corpus:\n", 1327 | "How will Indian GDP be affected from banning 500 and 1000 rupees notes? (Distance: 0.0000)\n", 1328 | "How will the ban of 1000 and 500 rupee notes affect the Indian economy? (Distance: 24.4171)\n", 1329 | "How will the ban of Rs 500 and Rs 1000 notes affect Indian economy? (Distance: 26.4862)\n", 1330 | "How will the ban on Rs 500 and 1000 notes impact the Indian economy? (Distance: 26.5522)\n", 1331 | "How will the ban on 500₹ and 1000₹ notes impact the Indian economy? 
(Distance: 28.7938)\n" 1332 | ] 1333 | } 1334 | ], 1335 | "source": [ 1336 | "for query, query_embedding in zip(queries, query_embeddings):\n", 1337 | " distances, indices = index.search(np.asarray(query_embedding).reshape(1,768),k)\n", 1338 | " print(\"\\n======================\\n\")\n", 1339 | " print(\"Query:\", query)\n", 1340 | " print(\"\\nTop 5 most similar sentences in corpus:\")\n", 1341 | " for idx in range(0,5):\n", 1342 | " print(corpus[indices[0,idx]], \"(Distance: %.4f)\" % distances[0,idx])" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "code", 1347 | "execution_count": 122, 1348 | "metadata": {}, 1349 | "outputs": [], 1350 | "source": [ 1351 | "query = \"How will Indian GDP be affected from banning 500 and 1000 rupees notes?\"\n", 1352 | "query_embed = embedder.encode(query)\n", 1353 | "distances, indices = index.search(np.asarray(query_embedding).reshape(1,768),50)\n", 1354 | "relevant_docs = [corpus[indices[0,idx]] for idx in range(50)]" 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "markdown", 1359 | "metadata": {}, 1360 | "source": [ 1361 | "----\n", 1362 | "## Reranking using Bidirectional LSTM model" 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "markdown", 1367 | "metadata": {}, 1368 | "source": [ 1369 | "\"Drawing\"\n", 1370 | "\n", 1371 | "**Reference:** https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/" 1372 | ] 1373 | }, 1374 | { 1375 | "cell_type": "code", 1376 | "execution_count": 344, 1377 | "metadata": {}, 1378 | "outputs": [], 1379 | "source": [ 1380 | "import re\n", 1381 | "import nltk\n", 1382 | "from nltk.tokenize.toktok import ToktokTokenizer\n", 1383 | "from nltk.stem import WordNetLemmatizer, SnowballStemmer\n", 1384 | "toko_tokenizer = ToktokTokenizer()\n", 1385 | "wordnet_lemmatizer = WordNetLemmatizer()\n", 1386 | "\n", 1387 | "def normalize_text(text):\n", 1388 | " puncts = ['/', ',', '.', '\"', ':', ')', '(', '-', '!', '?', '|', ';', '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\\\', '•', '~', '@', '£', \n", 1389 | " '·', '_', '{', '}', '©', '^', '®', '`', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', \n", 1390 | " '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', \n", 1391 | " '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', \n", 1392 | " '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]\n", 1393 | "\n", 1394 | " def clean_text(text):\n", 1395 | " text = str(text)\n", 1396 | " text = text.replace('\\n', '')\n", 1397 | " text = text.replace('\\r', '')\n", 1398 | " for punct in puncts:\n", 1399 | " if punct in text:\n", 1400 | " text = text.replace(punct, '')\n", 1401 | " return text.lower()\n", 1402 | "\n", 1403 | " def clean_numbers(text):\n", 1404 | " if bool(re.search(r'\\d', text)):\n", 1405 | " text = re.sub('[0-9]{5,}', '#####', text)\n", 1406 | " text = re.sub('[0-9]{4}', '####', text)\n", 1407 | " text = re.sub('[0-9]{3}', '###', text)\n", 1408 | " text = re.sub('[0-9]{2}', '##', text)\n", 1409 | " return text\n", 1410 | "\n", 1411 | " contraction_dict = {\"ain't\": \"is not\", \"aren't\": \"are not\",\"can't\": \"cannot\", \"'cause\": \"because\", \"could've\": \"could have\", \"couldn't\": \"could not\", \"didn't\": \"did not\", \"doesn't\": \"does not\", \"don't\": \"do not\", \"hadn't\": \"had not\", \"hasn't\": 
\"has not\", \"haven't\": \"have not\", \"he'd\": \"he would\",\"he'll\": \"he will\", \"he's\": \"he is\", \"how'd\": \"how did\", \"how'd'y\": \"how do you\", \"how'll\": \"how will\", \"how's\": \"how is\", \"I'd\": \"I would\", \"I'd've\": \"I would have\", \"I'll\": \"I will\", \"I'll've\": \"I will have\",\"I'm\": \"I am\", \"I've\": \"I have\", \"i'd\": \"i would\", \"i'd've\": \"i would have\", \"i'll\": \"i will\", \"i'll've\": \"i will have\",\"i'm\": \"i am\", \"i've\": \"i have\", \"isn't\": \"is not\", \"it'd\": \"it would\", \"it'd've\": \"it would have\", \"it'll\": \"it will\", \"it'll've\": \"it will have\",\"it's\": \"it is\", \"let's\": \"let us\", \"ma'am\": \"madam\", \"mayn't\": \"may not\", \"might've\": \"might have\",\"mightn't\": \"might not\",\"mightn't've\": \"might not have\", \"must've\": \"must have\", \"mustn't\": \"must not\", \"mustn't've\": \"must not have\", \"needn't\": \"need not\", \"needn't've\": \"need not have\",\"o'clock\": \"of the clock\", \"oughtn't\": \"ought not\", \"oughtn't've\": \"ought not have\", \"shan't\": \"shall not\", \"sha'n't\": \"shall not\", \"shan't've\": \"shall not have\", \"she'd\": \"she would\", \"she'd've\": \"she would have\", \"she'll\": \"she will\", \"she'll've\": \"she will have\", \"she's\": \"she is\", \"should've\": \"should have\", \"shouldn't\": \"should not\", \"shouldn't've\": \"should not have\", \"so've\": \"so have\",\"so's\": \"so as\", \"this's\": \"this is\",\"that'd\": \"that would\", \"that'd've\": \"that would have\", \"that's\": \"that is\", \"there'd\": \"there would\", \"there'd've\": \"there would have\", \"there's\": \"there is\", \"here's\": \"here is\",\"they'd\": \"they would\", \"they'd've\": \"they would have\", \"they'll\": \"they will\", \"they'll've\": \"they will have\", \"they're\": \"they are\", \"they've\": \"they have\", \"to've\": \"to have\", \"wasn't\": \"was not\", \"we'd\": \"we would\", \"we'd've\": \"we would have\", \"we'll\": \"we will\", \"we'll've\": \"we will have\", \"we're\": \"we are\", \"we've\": \"we have\", \"weren't\": \"were not\", \"what'll\": \"what will\", \"what'll've\": \"what will have\", \"what're\": \"what are\", \"what's\": \"what is\", \"what've\": \"what have\", \"when's\": \"when is\", \"when've\": \"when have\", \"where'd\": \"where did\", \"where's\": \"where is\", \"where've\": \"where have\", \"who'll\": \"who will\", \"who'll've\": \"who will have\", \"who's\": \"who is\", \"who've\": \"who have\", \"why's\": \"why is\", \"why've\": \"why have\", \"will've\": \"will have\", \"won't\": \"will not\", \"won't've\": \"will not have\", \"would've\": \"would have\", \"wouldn't\": \"would not\", \"wouldn't've\": \"would not have\", \"y'all\": \"you all\", \"y'all'd\": \"you all would\",\"y'all'd've\": \"you all would have\",\"y'all're\": \"you all are\",\"y'all've\": \"you all have\",\"you'd\": \"you would\", \"you'd've\": \"you would have\", \"you'll\": \"you will\", \"you'll've\": \"you will have\", \"you're\": \"you are\", \"you've\": \"you have\"}\n", 1412 | "\n", 1413 | " def _get_contractions(contraction_dict):\n", 1414 | " contraction_re = re.compile('(%s)' % '|'.join(contraction_dict.keys()))\n", 1415 | " return contraction_dict, contraction_re\n", 1416 | "\n", 1417 | " contractions, contractions_re = _get_contractions(contraction_dict)\n", 1418 | "\n", 1419 | " def replace_contractions(text):\n", 1420 | " def replace(match):\n", 1421 | " return contractions[match.group(0)]\n", 1422 | " return contractions_re.sub(replace, text)\n", 1423 | "\n", 
1424 | " stopword_list = nltk.corpus.stopwords.words('english')\n", 1425 | "\n", 1426 | " def remove_stopwords(text, is_lower_case=True):\n", 1427 | " tokens = toko_tokenizer.tokenize(text)\n", 1428 | " tokens = [token.strip() for token in tokens]\n", 1429 | " if is_lower_case:\n", 1430 | " filtered_tokens = [token for token in tokens if token not in stopword_list]\n", 1431 | " else:\n", 1432 | " filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]\n", 1433 | " filtered_text = ' '.join(filtered_tokens) \n", 1434 | " return filtered_text\n", 1435 | "\n", 1436 | " def lemmatizer(text):\n", 1437 | " tokens = toko_tokenizer.tokenize(text)\n", 1438 | " tokens = [token.strip() for token in tokens]\n", 1439 | " tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens]\n", 1440 | " return ' '.join(tokens)\n", 1441 | "\n", 1442 | " def trim_text(text):\n", 1443 | " tokens = toko_tokenizer.tokenize(text)\n", 1444 | " tokens = [token.strip() for token in tokens]\n", 1445 | " return ' '.join(tokens)\n", 1446 | " \n", 1447 | " def remove_non_english(text):\n", 1448 | " tokens = toko_tokenizer.tokenize(text)\n", 1449 | " tokens = [token.strip() for token in tokens]\n", 1450 | " tokens = [token for token in tokens if d.check(token)]\n", 1451 | " eng_text = ' '.join(tokens)\n", 1452 | " return eng_text\n", 1453 | "\n", 1454 | " text_norm = clean_text(text)\n", 1455 | " text_norm = clean_numbers(text_norm)\n", 1456 | " text_norm = replace_contractions(text_norm)\n", 1457 | "# text_norm = remove_stopwords(text_norm)\n", 1458 | "# text_norm = remove_non_english(text_norm)\n", 1459 | " text_norm = lemmatizer(text_norm)\n", 1460 | " text_norm = trim_text(text_norm)\n", 1461 | " return text_norm" 1462 | ] 1463 | }, 1464 | { 1465 | "cell_type": "code", 1466 | "execution_count": 328, 1467 | "metadata": {}, 1468 | "outputs": [ 1469 | { 1470 | "data": { 1471 | "text/html": [ 1472 | "
\n", 1473 | "\n", 1486 | "\n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | "
idqid1qid2question1question2is_duplicate
0012What is the step by step guide to invest in sh...What is the step by step guide to invest in sh...0
1134What is the story of Kohinoor (Koh-i-Noor) Dia...What would happen if the Indian government sto...0
2256How can I increase the speed of my internet co...How can Internet speed be increased by hacking...0
3378Why am I mentally very lonely? How can I solve...Find the remainder when [math]23^{24}[/math] i...0
44910Which one dissolve in water quikly sugar, salt...Which fish would survive in salt water?0
\n", 1546 | "
" 1547 | ], 1548 | "text/plain": [ 1549 | " id qid1 qid2 question1 \\\n", 1550 | "0 0 1 2 What is the step by step guide to invest in sh... \n", 1551 | "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n", 1552 | "2 2 5 6 How can I increase the speed of my internet co... \n", 1553 | "3 3 7 8 Why am I mentally very lonely? How can I solve... \n", 1554 | "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n", 1555 | "\n", 1556 | " question2 is_duplicate \n", 1557 | "0 What is the step by step guide to invest in sh... 0 \n", 1558 | "1 What would happen if the Indian government sto... 0 \n", 1559 | "2 How can Internet speed be increased by hacking... 0 \n", 1560 | "3 Find the remainder when [math]23^{24}[/math] i... 0 \n", 1561 | "4 Which fish would survive in salt water? 0 " 1562 | ] 1563 | }, 1564 | "execution_count": 328, 1565 | "metadata": {}, 1566 | "output_type": "execute_result" 1567 | } 1568 | ], 1569 | "source": [ 1570 | "question_pairs.head()" 1571 | ] 1572 | }, 1573 | { 1574 | "cell_type": "code", 1575 | "execution_count": 329, 1576 | "metadata": {}, 1577 | "outputs": [ 1578 | { 1579 | "data": { 1580 | "text/plain": [ 1581 | "(404351, 6)" 1582 | ] 1583 | }, 1584 | "execution_count": 329, 1585 | "metadata": {}, 1586 | "output_type": "execute_result" 1587 | } 1588 | ], 1589 | "source": [ 1590 | "question_pairs.shape" 1591 | ] 1592 | }, 1593 | { 1594 | "cell_type": "code", 1595 | "execution_count": 293, 1596 | "metadata": {}, 1597 | "outputs": [], 1598 | "source": [ 1599 | "embedding_path = \"./../../Embeddings/glove.twitter.27B/glove.twitter.27B.200d.txt\"\n", 1600 | "def get_word2vec(file_path):\n", 1601 | " file = open(embedding_path, \"r\")\n", 1602 | " if (file):\n", 1603 | " word2vec = dict()\n", 1604 | " split = file.read().splitlines()\n", 1605 | " for line in split:\n", 1606 | " key = line.split(' ',1)[0]\n", 1607 | " value = np.array([float(val) for val in line.split(' ')[1:]])\n", 1608 | " word2vec[key] = value\n", 1609 | " return (word2vec)\n", 1610 | " else:\n", 1611 | " print(\"invalid fiel path\")\n", 1612 | "w2v = get_word2vec(embedding_path)" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "code", 1617 | "execution_count": 352, 1618 | "metadata": {}, 1619 | "outputs": [], 1620 | "source": [ 1621 | "total_text = pd.concat([question_pairs['question1'], question_pairs['question2']]).reset_index(drop=True)\n", 1622 | "total_text = total_text.apply(lambda x: str(x))\n", 1623 | "total_text = total_text.apply(lambda x: normalize_text(x))\n", 1624 | "max_features = 6000\n", 1625 | "tokenizer = Tokenizer(num_words=max_features)\n", 1626 | "tokenizer.fit_on_texts(total_text)\n", 1627 | "question_1_sequenced = tokenizer.texts_to_sequences(question_pairs['question1'].apply(lambda x: normalize_text(x)))\n", 1628 | "question_2_sequenced = tokenizer.texts_to_sequences(question_pairs['question2'].apply(lambda x: normalize_text(x)))\n", 1629 | "vocab_size = len(tokenizer.word_index) + 1" 1630 | ] 1631 | }, 1632 | { 1633 | "cell_type": "code", 1634 | "execution_count": 353, 1635 | "metadata": {}, 1636 | "outputs": [ 1637 | { 1638 | "data": { 1639 | "text/plain": [ 1640 | "92423" 1641 | ] 1642 | }, 1643 | "execution_count": 353, 1644 | "metadata": {}, 1645 | "output_type": "execute_result" 1646 | } 1647 | ], 1648 | "source": [ 1649 | "vocab_size" 1650 | ] 1651 | }, 1652 | { 1653 | "cell_type": "code", 1654 | "execution_count": 354, 1655 | "metadata": {}, 1656 | "outputs": [], 1657 | "source": [ 1658 | "maxlen = 100\n", 1659 | "question_1_padded = 
pad_sequences(question_1_sequenced, maxlen=maxlen)\n", 1660 | "question_2_padded = pad_sequences(question_2_sequenced, maxlen=maxlen)" 1661 | ] 1662 | }, 1663 | { 1664 | "cell_type": "code", 1665 | "execution_count": 359, 1666 | "metadata": {}, 1667 | "outputs": [], 1668 | "source": [ 1669 | "y = question_pairs['is_duplicate']" 1670 | ] 1671 | }, 1672 | { 1673 | "cell_type": "code", 1674 | "execution_count": 381, 1675 | "metadata": {}, 1676 | "outputs": [], 1677 | "source": [ 1678 | "from tqdm import tqdm" 1679 | ] 1680 | }, 1681 | { 1682 | "cell_type": "code", 1683 | "execution_count": 384, 1684 | "metadata": {}, 1685 | "outputs": [], 1686 | "source": [ 1687 | "from numpy import zeros\n", 1688 | "embedding_matrix = zeros((vocab_size, 768))\n", 1689 | "for word, i in tokenizer.word_index.items():\n", 1690 | " embedding_vector = w2v.get(word)\n", 1691 | " if embedding_vector is not None:\n", 1692 | " embedding_matrix[i] = embedding_vector[0]" 1693 | ] 1694 | }, 1695 | { 1696 | "cell_type": "code", 1697 | "execution_count": 357, 1698 | "metadata": {}, 1699 | "outputs": [], 1700 | "source": [ 1701 | "embedding_size = 128\n", 1702 | "max_len = 100\n", 1703 | "\n", 1704 | "inp1 = Input(shape=(100,))\n", 1705 | "inp2 = Input(shape=(100,))\n", 1706 | "\n", 1707 | "x1 = Embedding(vocab_size, 200, weights=[embedding_matrix], input_length=max_len)(inp1)\n", 1708 | "x2 = Embedding(vocab_size, 200, weights=[embedding_matrix], input_length=max_len)(inp2)\n", 1709 | "\n", 1710 | "x3 = Bidirectional(LSTM(32, return_sequences = True))(x1)\n", 1711 | "x4 = Bidirectional(LSTM(32, return_sequences = True))(x2)\n", 1712 | "\n", 1713 | "x5 = GlobalMaxPool1D()(x3)\n", 1714 | "x6 = GlobalMaxPool1D()(x4)\n", 1715 | "\n", 1716 | "x7 = dot([x5, x6], axes=1)\n", 1717 | "\n", 1718 | "x8 = Dense(40, activation='relu')(x7)\n", 1719 | "x9 = Dropout(0.05)(x8)\n", 1720 | "x10 = Dense(10, activation='relu')(x9)\n", 1721 | "output = Dense(1, activation=\"sigmoid\")(x10)\n", 1722 | "\n", 1723 | "model = Model(inputs=[inp1, inp2], outputs=output)\n", 1724 | "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", 1725 | "batch_size = 256\n", 1726 | "epochs = 4" 1727 | ] 1728 | }, 1729 | { 1730 | "cell_type": "code", 1731 | "execution_count": 360, 1732 | "metadata": {}, 1733 | "outputs": [ 1734 | { 1735 | "name": "stdout", 1736 | "output_type": "stream", 1737 | "text": [ 1738 | "Train on 323480 samples, validate on 80871 samples\n", 1739 | "Epoch 1/4\n", 1740 | "323480/323480 [==============================] - 1096s 3ms/step - loss: 0.5212 - acc: 0.7372 - val_loss: 0.4596 - val_acc: 0.7802\n", 1741 | "Epoch 2/4\n", 1742 | "323480/323480 [==============================] - 1101s 3ms/step - loss: 0.4169 - acc: 0.8038 - val_loss: 0.4300 - val_acc: 0.7978\n", 1743 | "Epoch 3/4\n", 1744 | "323480/323480 [==============================] - 1135s 4ms/step - loss: 0.3536 - acc: 0.8400 - val_loss: 0.4340 - val_acc: 0.7976\n", 1745 | "Epoch 4/4\n", 1746 | "323480/323480 [==============================] - 1077s 3ms/step - loss: 0.2963 - acc: 0.8701 - val_loss: 0.4411 - val_acc: 0.8035\n" 1747 | ] 1748 | }, 1749 | { 1750 | "data": { 1751 | "text/plain": [ 1752 | "" 1753 | ] 1754 | }, 1755 | "execution_count": 360, 1756 | "metadata": {}, 1757 | "output_type": "execute_result" 1758 | } 1759 | ], 1760 | "source": [ 1761 | "model.fit([question_1_padded, question_2_padded], y, batch_size=batch_size, epochs=epochs, validation_split=0.2, )" 1762 | ] 1763 | }, 1764 | { 1765 | "cell_type": "markdown", 1766 | "metadata": 
{}, 1767 | "source": [ 1768 | "--------\n", 1769 | "## Combining candidate generation and reranking" 1770 | ] 1771 | }, 1772 | { 1773 | "cell_type": "code", 1774 | "execution_count": 361, 1775 | "metadata": {}, 1776 | "outputs": [ 1777 | { 1778 | "data": { 1779 | "text/plain": [ 1780 | "'How will Indian GDP be affected from banning 500 and 1000 rupees notes?'" 1781 | ] 1782 | }, 1783 | "execution_count": 361, 1784 | "metadata": {}, 1785 | "output_type": "execute_result" 1786 | } 1787 | ], 1788 | "source": [ 1789 | "query" 1790 | ] 1791 | }, 1792 | { 1793 | "cell_type": "code", 1794 | "execution_count": 363, 1795 | "metadata": {}, 1796 | "outputs": [], 1797 | "source": [ 1798 | "query_copy = [query]*len(relevant_docs)\n", 1799 | "question_1_sequenced_final = tokenizer.texts_to_sequences(query_copy)\n", 1800 | "question_2_sequenced_final = tokenizer.texts_to_sequences(relevant_docs)" 1801 | ] 1802 | }, 1803 | { 1804 | "cell_type": "code", 1805 | "execution_count": 364, 1806 | "metadata": {}, 1807 | "outputs": [], 1808 | "source": [ 1809 | "maxlen = 100\n", 1810 | "question_1_padded_final = pad_sequences(question_1_sequenced_final, maxlen=maxlen)\n", 1811 | "question_2_padded_final = pad_sequences(question_2_sequenced_final, maxlen=maxlen)" 1812 | ] 1813 | }, 1814 | { 1815 | "cell_type": "code", 1816 | "execution_count": 365, 1817 | "metadata": {}, 1818 | "outputs": [], 1819 | "source": [ 1820 | "preds_test = model.predict([question_1_padded_final, question_2_padded_final])\n", 1821 | "preds_test = np.array([x[0] for x in preds_test])" 1822 | ] 1823 | }, 1824 | { 1825 | "cell_type": "code", 1826 | "execution_count": 390, 1827 | "metadata": {}, 1828 | "outputs": [ 1829 | { 1830 | "data": { 1831 | "text/plain": [ 1832 | "['What do you think about banning 500 and 1000 rupee notes in India?',\n", 1833 | " 'What will be the implications of banning 500 and 1000 rupees currency notes on Indian economy?',\n", 1834 | " 'What will be the consequences of 500 and 1000 rupee notes banning?',\n", 1835 | " 'What will be the effects after banning on 500 and 1000 rupee notes?',\n", 1836 | " 'What will be the impact on real estate by banning 500 and 1000 rupee notes from India?',\n", 1837 | " 'How is banning 500 and 1000 INR going to help Indian economy?',\n", 1838 | " 'What are your views on India banning 500 and 1000 notes? 
In what way it will affect Indian economy?',\n", 1839 | " 'How is discontinuing 500 and 1000 rupee note going to put a hold on black money in India?',\n", 1840 | " 'What will be the result of banning 500 and 1000 rupees note in India?',\n", 1841 | " 'What are the economic implications of banning 500 and 1000 rupee notes?']" 1842 | ] 1843 | }, 1844 | "execution_count": 390, 1845 | "metadata": {}, 1846 | "output_type": "execute_result" 1847 | } 1848 | ], 1849 | "source": [ 1850 | "[relevant_docs[x] for x in preds_test.argsort()[::-1]][:10]" 1851 | ] 1852 | }, 1853 | { 1854 | "cell_type": "code", 1855 | "execution_count": null, 1856 | "metadata": {}, 1857 | "outputs": [], 1858 | "source": [] 1859 | } 1860 | ], 1861 | "metadata": { 1862 | "kernelspec": { 1863 | "display_name": "Python 3", 1864 | "language": "python", 1865 | "name": "python3" 1866 | }, 1867 | "language_info": { 1868 | "codemirror_mode": { 1869 | "name": "ipython", 1870 | "version": 3 1871 | }, 1872 | "file_extension": ".py", 1873 | "mimetype": "text/x-python", 1874 | "name": "python", 1875 | "nbconvert_exporter": "python", 1876 | "pygments_lexer": "ipython3", 1877 | "version": "3.6.10" 1878 | } 1879 | }, 1880 | "nbformat": 4, 1881 | "nbformat_minor": 4 1882 | } 1883 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep-Learning-for-Semantic-Text-Matching 2 | 3 | [Deep Learning for Semantic Text Matching](https://medium.com/swlh/deep-learning-for-semantic-text-matching-d4df6c2cf4c5) 4 | --------------------------------------------------------------------------------