├── NLP using Deep Learning in Python.ipynb
└── README.md
/NLP using Deep Learning in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# NLP using Deep Learning in Python - Quora Duplicate Questions"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Problem Statement:\n",
15 | "\n",
16 | "Over 100 million people visit Quora every month, so it's no surprise that **many people ask similarly worded questions**. Multiple questions with the same intent **can cause seekers to spend more time finding the best answer to their question**, and **make writers feel they need to answer multiple versions of the same question**. Quora values canonical questions because they **provide a better experience to active seekers and writers**, and offer more value to both of these groups in the long term.\n",
17 | "\n",
18 | "**Reference:** https://www.kaggle.com/c/quora-question-pairs"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 322,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "import pandas as pd\n",
28 | "import numpy as np\n",
29 | "import sklearn"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 323,
35 | "metadata": {},
36 | "outputs": [
37 | {
38 | "data": {
39 | "text/html": [
40 | "
\n",
41 | "\n",
54 | "
\n",
55 | " \n",
56 | " \n",
57 | " | \n",
58 | " id | \n",
59 | " qid1 | \n",
60 | " qid2 | \n",
61 | " question1 | \n",
62 | " question2 | \n",
63 | " is_duplicate | \n",
64 | "
\n",
65 | " \n",
66 | " \n",
67 | " \n",
68 | " 0 | \n",
69 | " 0 | \n",
70 | " 1 | \n",
71 | " 2 | \n",
72 | " What is the step by step guide to invest in sh... | \n",
73 | " What is the step by step guide to invest in sh... | \n",
74 | " 0 | \n",
75 | "
\n",
76 | " \n",
77 | " 1 | \n",
78 | " 1 | \n",
79 | " 3 | \n",
80 | " 4 | \n",
81 | " What is the story of Kohinoor (Koh-i-Noor) Dia... | \n",
82 | " What would happen if the Indian government sto... | \n",
83 | " 0 | \n",
84 | "
\n",
85 | " \n",
86 | " 2 | \n",
87 | " 2 | \n",
88 | " 5 | \n",
89 | " 6 | \n",
90 | " How can I increase the speed of my internet co... | \n",
91 | " How can Internet speed be increased by hacking... | \n",
92 | " 0 | \n",
93 | "
\n",
94 | " \n",
95 | " 3 | \n",
96 | " 3 | \n",
97 | " 7 | \n",
98 | " 8 | \n",
99 | " Why am I mentally very lonely? How can I solve... | \n",
100 | " Find the remainder when [math]23^{24}[/math] i... | \n",
101 | " 0 | \n",
102 | "
\n",
103 | " \n",
104 | " 4 | \n",
105 | " 4 | \n",
106 | " 9 | \n",
107 | " 10 | \n",
108 | " Which one dissolve in water quikly sugar, salt... | \n",
109 | " Which fish would survive in salt water? | \n",
110 | " 0 | \n",
111 | "
\n",
112 | " \n",
113 | "
\n",
114 | "
"
115 | ],
116 | "text/plain": [
117 | " id qid1 qid2 question1 \\\n",
118 | "0 0 1 2 What is the step by step guide to invest in sh... \n",
119 | "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
120 | "2 2 5 6 How can I increase the speed of my internet co... \n",
121 | "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
122 | "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
123 | "\n",
124 | " question2 is_duplicate \n",
125 | "0 What is the step by step guide to invest in sh... 0 \n",
126 | "1 What would happen if the Indian government sto... 0 \n",
127 | "2 How can Internet speed be increased by hacking... 0 \n",
128 | "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
129 | "4 Which fish would survive in salt water? 0 "
130 | ]
131 | },
132 | "execution_count": 323,
133 | "metadata": {},
134 | "output_type": "execute_result"
135 | }
136 | ],
137 | "source": [
138 | "question_pairs = pd.read_csv(\"../../raw_data/questions.csv\")\n",
139 | "question_pairs.head()"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 325,
145 | "metadata": {},
146 | "outputs": [
147 | {
148 | "data": {
149 | "text/plain": [
150 | "(808702, 2)"
151 | ]
152 | },
153 | "execution_count": 325,
154 | "metadata": {},
155 | "output_type": "execute_result"
156 | }
157 | ],
158 | "source": [
159 | "question_pairs_1 = question_pairs[['qid1', 'question1']]\n",
160 | "question_pairs_1.columns = ['id', 'question']\n",
161 | "question_pairs_2 = question_pairs[['qid2', 'question2']]\n",
162 | "question_pairs_2.columns = ['id', 'question']\n",
163 | "questions_list = pd.concat([question_pairs_1,question_pairs_2]).sort_values('id')\n",
164 | "questions_list.shape"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 326,
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "data": {
174 | "text/plain": [
175 | "['What is the step by step guide to invest in share market in india?',\n",
176 | " 'What is the step by step guide to invest in share market?',\n",
177 | " 'What is the story of Kohinoor (Koh-i-Noor) Diamond?',\n",
178 | " 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',\n",
179 | " 'How can I increase the speed of my internet connection while using a VPN?',\n",
180 | " 'How can Internet speed be increased by hacking through DNS?',\n",
181 | " 'Why am I mentally very lonely? How can I solve it?',\n",
182 | " 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?',\n",
183 | " 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?',\n",
184 | " 'Which fish would survive in salt water?']"
185 | ]
186 | },
187 | "execution_count": 326,
188 | "metadata": {},
189 | "output_type": "execute_result"
190 | }
191 | ],
192 | "source": [
193 | "corpus = questions_list['question'].tolist()\n",
194 | "corpus[:10]"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 327,
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "corpus = list(np.unique(corpus))"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "-------\n",
211 | "## Feature Extraction"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "### Count Vectorizer"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": 79,
224 | "metadata": {},
225 | "outputs": [
226 | {
227 | "data": {
228 | "text/html": [
229 | "\n",
230 | "\n",
243 | "
\n",
244 | " \n",
245 | " \n",
246 | " | \n",
247 | " 1000 | \n",
248 | " 2000 | \n",
249 | " 500 | \n",
250 | " about | \n",
251 | " add | \n",
252 | " all | \n",
253 | " am | \n",
254 | " and | \n",
255 | " any | \n",
256 | " are | \n",
257 | " ... | \n",
258 | " was | \n",
259 | " we | \n",
260 | " what | \n",
261 | " who | \n",
262 | " will | \n",
263 | " win | \n",
264 | " with | \n",
265 | " world | \n",
266 | " you | \n",
267 | " your | \n",
268 | "
\n",
269 | " \n",
270 | " \n",
271 | " \n",
272 | " 0 | \n",
273 | " 0 | \n",
274 | " 0 | \n",
275 | " 0 | \n",
276 | " 0 | \n",
277 | " 0 | \n",
278 | " 0 | \n",
279 | " 0 | \n",
280 | " 0 | \n",
281 | " 1 | \n",
282 | " 0 | \n",
283 | " ... | \n",
284 | " 1 | \n",
285 | " 0 | \n",
286 | " 1 | \n",
287 | " 0 | \n",
288 | " 0 | \n",
289 | " 0 | \n",
290 | " 1 | \n",
291 | " 0 | \n",
292 | " 0 | \n",
293 | " 0 | \n",
294 | "
\n",
295 | " \n",
296 | " 1 | \n",
297 | " 1 | \n",
298 | " 0 | \n",
299 | " 1 | \n",
300 | " 0 | \n",
301 | " 0 | \n",
302 | " 0 | \n",
303 | " 0 | \n",
304 | " 2 | \n",
305 | " 0 | \n",
306 | " 0 | \n",
307 | " ... | \n",
308 | " 0 | \n",
309 | " 0 | \n",
310 | " 1 | \n",
311 | " 0 | \n",
312 | " 1 | \n",
313 | " 0 | \n",
314 | " 0 | \n",
315 | " 0 | \n",
316 | " 0 | \n",
317 | " 0 | \n",
318 | "
\n",
319 | " \n",
320 | " 2 | \n",
321 | " 0 | \n",
322 | " 0 | \n",
323 | " 0 | \n",
324 | " 0 | \n",
325 | " 1 | \n",
326 | " 0 | \n",
327 | " 0 | \n",
328 | " 0 | \n",
329 | " 0 | \n",
330 | " 0 | \n",
331 | " ... | \n",
332 | " 0 | \n",
333 | " 0 | \n",
334 | " 0 | \n",
335 | " 0 | \n",
336 | " 0 | \n",
337 | " 0 | \n",
338 | " 0 | \n",
339 | " 0 | \n",
340 | " 0 | \n",
341 | " 0 | \n",
342 | "
\n",
343 | " \n",
344 | " 3 | \n",
345 | " 0 | \n",
346 | " 0 | \n",
347 | " 0 | \n",
348 | " 0 | \n",
349 | " 0 | \n",
350 | " 1 | \n",
351 | " 0 | \n",
352 | " 0 | \n",
353 | " 0 | \n",
354 | " 0 | \n",
355 | " ... | \n",
356 | " 0 | \n",
357 | " 0 | \n",
358 | " 0 | \n",
359 | " 1 | \n",
360 | " 1 | \n",
361 | " 1 | \n",
362 | " 0 | \n",
363 | " 0 | \n",
364 | " 1 | \n",
365 | " 0 | \n",
366 | "
\n",
367 | " \n",
368 | " 4 | \n",
369 | " 1 | \n",
370 | " 0 | \n",
371 | " 0 | \n",
372 | " 0 | \n",
373 | " 0 | \n",
374 | " 0 | \n",
375 | " 1 | \n",
376 | " 0 | \n",
377 | " 0 | \n",
378 | " 0 | \n",
379 | " ... | \n",
380 | " 0 | \n",
381 | " 0 | \n",
382 | " 0 | \n",
383 | " 0 | \n",
384 | " 0 | \n",
385 | " 0 | \n",
386 | " 0 | \n",
387 | " 0 | \n",
388 | " 0 | \n",
389 | " 0 | \n",
390 | "
\n",
391 | " \n",
392 | " 5 | \n",
393 | " 0 | \n",
394 | " 1 | \n",
395 | " 1 | \n",
396 | " 0 | \n",
397 | " 0 | \n",
398 | " 0 | \n",
399 | " 0 | \n",
400 | " 1 | \n",
401 | " 0 | \n",
402 | " 1 | \n",
403 | " ... | \n",
404 | " 0 | \n",
405 | " 0 | \n",
406 | " 0 | \n",
407 | " 0 | \n",
408 | " 0 | \n",
409 | " 0 | \n",
410 | " 0 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 0 | \n",
414 | "
\n",
415 | " \n",
416 | " 6 | \n",
417 | " 0 | \n",
418 | " 1 | \n",
419 | " 0 | \n",
420 | " 0 | \n",
421 | " 0 | \n",
422 | " 0 | \n",
423 | " 0 | \n",
424 | " 0 | \n",
425 | " 0 | \n",
426 | " 1 | \n",
427 | " ... | \n",
428 | " 0 | \n",
429 | " 0 | \n",
430 | " 0 | \n",
431 | " 0 | \n",
432 | " 0 | \n",
433 | " 0 | \n",
434 | " 1 | \n",
435 | " 0 | \n",
436 | " 0 | \n",
437 | " 0 | \n",
438 | "
\n",
439 | " \n",
440 | " 7 | \n",
441 | " 0 | \n",
442 | " 0 | \n",
443 | " 0 | \n",
444 | " 1 | \n",
445 | " 0 | \n",
446 | " 0 | \n",
447 | " 0 | \n",
448 | " 0 | \n",
449 | " 0 | \n",
450 | " 1 | \n",
451 | " ... | \n",
452 | " 0 | \n",
453 | " 0 | \n",
454 | " 0 | \n",
455 | " 1 | \n",
456 | " 0 | \n",
457 | " 0 | \n",
458 | " 1 | \n",
459 | " 0 | \n",
460 | " 1 | \n",
461 | " 1 | \n",
462 | "
\n",
463 | " \n",
464 | " 8 | \n",
465 | " 0 | \n",
466 | " 0 | \n",
467 | " 0 | \n",
468 | " 0 | \n",
469 | " 0 | \n",
470 | " 0 | \n",
471 | " 0 | \n",
472 | " 0 | \n",
473 | " 0 | \n",
474 | " 1 | \n",
475 | " ... | \n",
476 | " 0 | \n",
477 | " 1 | \n",
478 | " 0 | \n",
479 | " 0 | \n",
480 | " 0 | \n",
481 | " 0 | \n",
482 | " 0 | \n",
483 | " 1 | \n",
484 | " 0 | \n",
485 | " 0 | \n",
486 | "
\n",
487 | " \n",
488 | " 9 | \n",
489 | " 0 | \n",
490 | " 0 | \n",
491 | " 0 | \n",
492 | " 0 | \n",
493 | " 0 | \n",
494 | " 0 | \n",
495 | " 0 | \n",
496 | " 0 | \n",
497 | " 0 | \n",
498 | " 1 | \n",
499 | " ... | \n",
500 | " 0 | \n",
501 | " 1 | \n",
502 | " 0 | \n",
503 | " 0 | \n",
504 | " 0 | \n",
505 | " 0 | \n",
506 | " 0 | \n",
507 | " 1 | \n",
508 | " 0 | \n",
509 | " 0 | \n",
510 | "
\n",
511 | " \n",
512 | "
\n",
513 | "
10 rows × 96 columns
\n",
514 | "
"
515 | ],
516 | "text/plain": [
517 | " 1000 2000 500 about add all am and any are ... was we what \\\n",
518 | "0 0 0 0 0 0 0 0 0 1 0 ... 1 0 1 \n",
519 | "1 1 0 1 0 0 0 0 2 0 0 ... 0 0 1 \n",
520 | "2 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 \n",
521 | "3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 \n",
522 | "4 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 \n",
523 | "5 0 1 1 0 0 0 0 1 0 1 ... 0 0 0 \n",
524 | "6 0 1 0 0 0 0 0 0 0 1 ... 0 0 0 \n",
525 | "7 0 0 0 1 0 0 0 0 0 1 ... 0 0 0 \n",
526 | "8 0 0 0 0 0 0 0 0 0 1 ... 0 1 0 \n",
527 | "9 0 0 0 0 0 0 0 0 0 1 ... 0 1 0 \n",
528 | "\n",
529 | " who will win with world you your \n",
530 | "0 0 0 0 1 0 0 0 \n",
531 | "1 0 1 0 0 0 0 0 \n",
532 | "2 0 0 0 0 0 0 0 \n",
533 | "3 1 1 1 0 0 1 0 \n",
534 | "4 0 0 0 0 0 0 0 \n",
535 | "5 0 0 0 0 0 0 0 \n",
536 | "6 0 0 0 1 0 0 0 \n",
537 | "7 1 0 0 1 0 1 1 \n",
538 | "8 0 0 0 0 1 0 0 \n",
539 | "9 0 0 0 0 1 0 0 \n",
540 | "\n",
541 | "[10 rows x 96 columns]"
542 | ]
543 | },
544 | "execution_count": 79,
545 | "metadata": {},
546 | "output_type": "execute_result"
547 | }
548 | ],
549 | "source": [
550 | "from sklearn.feature_extraction.text import CountVectorizer\n",
551 | "count_vect = CountVectorizer()\n",
552 | "\n",
553 | "X_train_counts = count_vect.fit_transform(corpus[:10])\n",
554 | "X_train_counts = pd.DataFrame(X_train_counts.toarray())\n",
555 | "X_train_counts.columns = count_vect.get_feature_names()\n",
556 | "X_train_counts"
557 | ]
558 | },
559 | {
560 | "cell_type": "code",
561 | "execution_count": 80,
562 | "metadata": {},
563 | "outputs": [
564 | {
565 | "data": {
566 | "text/plain": [
567 | "'\"The question was marked as needing improvement\" how to deal with this, what ever I do still this error pops up? Is it Quora bot or any user?'"
568 | ]
569 | },
570 | "execution_count": 80,
571 | "metadata": {},
572 | "output_type": "execute_result"
573 | }
574 | ],
575 | "source": [
576 | "corpus[0]"
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": 81,
582 | "metadata": {},
583 | "outputs": [
584 | {
585 | "data": {
586 | "text/plain": [
587 | "1000 0\n",
588 | "2000 0\n",
589 | "500 0\n",
590 | "about 0\n",
591 | "add 0\n",
592 | "all 0\n",
593 | "am 0\n",
594 | "and 0\n",
595 | "any 1\n",
596 | "are 0\n",
597 | "as 1\n",
598 | "aside 0\n",
599 | "at 0\n",
600 | "banned 0\n",
601 | "based 0\n",
602 | "be 0\n",
603 | "been 0\n",
604 | "biases 0\n",
605 | "big 0\n",
606 | "bot 1\n",
607 | "can 0\n",
608 | "chip 0\n",
609 | "closer 0\n",
610 | "currency 0\n",
611 | "deal 1\n",
612 | "distance 0\n",
613 | "do 1\n",
614 | "election 0\n",
615 | "embedded 0\n",
616 | "error 1\n",
617 | " ..\n",
618 | "really 0\n",
619 | "relationship 0\n",
620 | "relationships 0\n",
621 | "rs 0\n",
622 | "see 0\n",
623 | "short 0\n",
624 | "starting 0\n",
625 | "still 1\n",
626 | "successful 0\n",
627 | "tell 0\n",
628 | "term 0\n",
629 | "the 1\n",
630 | "there 0\n",
631 | "think 0\n",
632 | "this 2\n",
633 | "time 0\n",
634 | "to 1\n",
635 | "up 1\n",
636 | "user 1\n",
637 | "war 0\n",
638 | "was 1\n",
639 | "we 0\n",
640 | "what 1\n",
641 | "who 0\n",
642 | "will 0\n",
643 | "win 0\n",
644 | "with 1\n",
645 | "world 0\n",
646 | "you 0\n",
647 | "your 0\n",
648 | "Name: 0, Length: 96, dtype: int64"
649 | ]
650 | },
651 | "execution_count": 81,
652 | "metadata": {},
653 | "output_type": "execute_result"
654 | }
655 | ],
656 | "source": [
657 | "X_train_counts.loc[0]"
658 | ]
659 | },
660 | {
661 | "cell_type": "markdown",
662 | "metadata": {},
663 | "source": [
664 | "### Tf-Idf (Term Frequency - Inverse Document Frequency)"
665 | ]
666 | },
667 | {
668 | "cell_type": "code",
669 | "execution_count": 82,
670 | "metadata": {},
671 | "outputs": [
672 | {
673 | "data": {
674 | "text/html": [
675 | "\n",
676 | "\n",
689 | "
\n",
690 | " \n",
691 | " \n",
692 | " | \n",
693 | " 1000 | \n",
694 | " 2000 | \n",
695 | " 500 | \n",
696 | " about | \n",
697 | " add | \n",
698 | " all | \n",
699 | " am | \n",
700 | " and | \n",
701 | " any | \n",
702 | " are | \n",
703 | " ... | \n",
704 | " was | \n",
705 | " we | \n",
706 | " what | \n",
707 | " who | \n",
708 | " will | \n",
709 | " win | \n",
710 | " with | \n",
711 | " world | \n",
712 | " you | \n",
713 | " your | \n",
714 | "
\n",
715 | " \n",
716 | " \n",
717 | " \n",
718 | " 0 | \n",
719 | " 0.000000 | \n",
720 | " 0.000000 | \n",
721 | " 0.000000 | \n",
722 | " 0.000000 | \n",
723 | " 0.000000 | \n",
724 | " 0.000000 | \n",
725 | " 0.000000 | \n",
726 | " 0.000000 | \n",
727 | " 0.20204 | \n",
728 | " 0.000000 | \n",
729 | " ... | \n",
730 | " 0.20204 | \n",
731 | " 0.000000 | \n",
732 | " 0.171753 | \n",
733 | " 0.000000 | \n",
734 | " 0.000000 | \n",
735 | " 0.000000 | \n",
736 | " 0.150263 | \n",
737 | " 0.000000 | \n",
738 | " 0.000000 | \n",
739 | " 0.000000 | \n",
740 | "
\n",
741 | " \n",
742 | " 1 | \n",
743 | " 0.205776 | \n",
744 | " 0.000000 | \n",
745 | " 0.205776 | \n",
746 | " 0.000000 | \n",
747 | " 0.000000 | \n",
748 | " 0.000000 | \n",
749 | " 0.000000 | \n",
750 | " 0.411553 | \n",
751 | " 0.00000 | \n",
752 | " 0.000000 | \n",
753 | " ... | \n",
754 | " 0.00000 | \n",
755 | " 0.000000 | \n",
756 | " 0.205776 | \n",
757 | " 0.000000 | \n",
758 | " 0.205776 | \n",
759 | " 0.000000 | \n",
760 | " 0.000000 | \n",
761 | " 0.000000 | \n",
762 | " 0.000000 | \n",
763 | " 0.000000 | \n",
764 | "
\n",
765 | " \n",
766 | " 2 | \n",
767 | " 0.000000 | \n",
768 | " 0.000000 | \n",
769 | " 0.000000 | \n",
770 | " 0.000000 | \n",
771 | " 0.518291 | \n",
772 | " 0.000000 | \n",
773 | " 0.000000 | \n",
774 | " 0.000000 | \n",
775 | " 0.00000 | \n",
776 | " 0.000000 | \n",
777 | " ... | \n",
778 | " 0.00000 | \n",
779 | " 0.000000 | \n",
780 | " 0.000000 | \n",
781 | " 0.000000 | \n",
782 | " 0.000000 | \n",
783 | " 0.000000 | \n",
784 | " 0.000000 | \n",
785 | " 0.000000 | \n",
786 | " 0.000000 | \n",
787 | " 0.000000 | \n",
788 | "
\n",
789 | " \n",
790 | " 3 | \n",
791 | " 0.000000 | \n",
792 | " 0.000000 | \n",
793 | " 0.000000 | \n",
794 | " 0.000000 | \n",
795 | " 0.000000 | \n",
796 | " 0.263025 | \n",
797 | " 0.000000 | \n",
798 | " 0.000000 | \n",
799 | " 0.00000 | \n",
800 | " 0.000000 | \n",
801 | " ... | \n",
802 | " 0.00000 | \n",
803 | " 0.000000 | \n",
804 | " 0.000000 | \n",
805 | " 0.223595 | \n",
806 | " 0.223595 | \n",
807 | " 0.263025 | \n",
808 | " 0.000000 | \n",
809 | " 0.000000 | \n",
810 | " 0.223595 | \n",
811 | " 0.000000 | \n",
812 | "
\n",
813 | " \n",
814 | " 4 | \n",
815 | " 0.266593 | \n",
816 | " 0.000000 | \n",
817 | " 0.000000 | \n",
818 | " 0.000000 | \n",
819 | " 0.000000 | \n",
820 | " 0.000000 | \n",
821 | " 0.313605 | \n",
822 | " 0.000000 | \n",
823 | " 0.00000 | \n",
824 | " 0.000000 | \n",
825 | " ... | \n",
826 | " 0.00000 | \n",
827 | " 0.000000 | \n",
828 | " 0.000000 | \n",
829 | " 0.000000 | \n",
830 | " 0.000000 | \n",
831 | " 0.000000 | \n",
832 | " 0.000000 | \n",
833 | " 0.000000 | \n",
834 | " 0.000000 | \n",
835 | " 0.000000 | \n",
836 | "
\n",
837 | " \n",
838 | " 5 | \n",
839 | " 0.000000 | \n",
840 | " 0.251549 | \n",
841 | " 0.251549 | \n",
842 | " 0.000000 | \n",
843 | " 0.000000 | \n",
844 | " 0.000000 | \n",
845 | " 0.000000 | \n",
846 | " 0.251549 | \n",
847 | " 0.00000 | \n",
848 | " 0.175717 | \n",
849 | " ... | \n",
850 | " 0.00000 | \n",
851 | " 0.000000 | \n",
852 | " 0.000000 | \n",
853 | " 0.000000 | \n",
854 | " 0.000000 | \n",
855 | " 0.000000 | \n",
856 | " 0.000000 | \n",
857 | " 0.000000 | \n",
858 | " 0.000000 | \n",
859 | " 0.000000 | \n",
860 | "
\n",
861 | " \n",
862 | " 6 | \n",
863 | " 0.000000 | \n",
864 | " 0.313340 | \n",
865 | " 0.000000 | \n",
866 | " 0.000000 | \n",
867 | " 0.000000 | \n",
868 | " 0.000000 | \n",
869 | " 0.000000 | \n",
870 | " 0.000000 | \n",
871 | " 0.00000 | \n",
872 | " 0.218880 | \n",
873 | " ... | \n",
874 | " 0.00000 | \n",
875 | " 0.000000 | \n",
876 | " 0.000000 | \n",
877 | " 0.000000 | \n",
878 | " 0.000000 | \n",
879 | " 0.000000 | \n",
880 | " 0.274136 | \n",
881 | " 0.000000 | \n",
882 | " 0.000000 | \n",
883 | " 0.000000 | \n",
884 | "
\n",
885 | " \n",
886 | " 7 | \n",
887 | " 0.000000 | \n",
888 | " 0.000000 | \n",
889 | " 0.000000 | \n",
890 | " 0.204276 | \n",
891 | " 0.000000 | \n",
892 | " 0.000000 | \n",
893 | " 0.000000 | \n",
894 | " 0.000000 | \n",
895 | " 0.00000 | \n",
896 | " 0.121303 | \n",
897 | " ... | \n",
898 | " 0.00000 | \n",
899 | " 0.000000 | \n",
900 | " 0.000000 | \n",
901 | " 0.173653 | \n",
902 | " 0.000000 | \n",
903 | " 0.000000 | \n",
904 | " 0.151926 | \n",
905 | " 0.000000 | \n",
906 | " 0.173653 | \n",
907 | " 0.204276 | \n",
908 | "
\n",
909 | " \n",
910 | " 8 | \n",
911 | " 0.000000 | \n",
912 | " 0.000000 | \n",
913 | " 0.000000 | \n",
914 | " 0.000000 | \n",
915 | " 0.000000 | \n",
916 | " 0.000000 | \n",
917 | " 0.000000 | \n",
918 | " 0.000000 | \n",
919 | " 0.00000 | \n",
920 | " 0.263628 | \n",
921 | " ... | \n",
922 | " 0.00000 | \n",
923 | " 0.377400 | \n",
924 | " 0.000000 | \n",
925 | " 0.000000 | \n",
926 | " 0.000000 | \n",
927 | " 0.000000 | \n",
928 | " 0.000000 | \n",
929 | " 0.377400 | \n",
930 | " 0.000000 | \n",
931 | " 0.000000 | \n",
932 | "
\n",
933 | " \n",
934 | " 9 | \n",
935 | " 0.000000 | \n",
936 | " 0.000000 | \n",
937 | " 0.000000 | \n",
938 | " 0.000000 | \n",
939 | " 0.000000 | \n",
940 | " 0.000000 | \n",
941 | " 0.000000 | \n",
942 | " 0.000000 | \n",
943 | " 0.00000 | \n",
944 | " 0.235430 | \n",
945 | " ... | \n",
946 | " 0.00000 | \n",
947 | " 0.337033 | \n",
948 | " 0.000000 | \n",
949 | " 0.000000 | \n",
950 | " 0.000000 | \n",
951 | " 0.000000 | \n",
952 | " 0.000000 | \n",
953 | " 0.337033 | \n",
954 | " 0.000000 | \n",
955 | " 0.000000 | \n",
956 | "
\n",
957 | " \n",
958 | "
\n",
959 | "
10 rows × 96 columns
\n",
960 | "
"
961 | ],
962 | "text/plain": [
963 | " 1000 2000 500 about add all am \\\n",
964 | "0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
965 | "1 0.205776 0.000000 0.205776 0.000000 0.000000 0.000000 0.000000 \n",
966 | "2 0.000000 0.000000 0.000000 0.000000 0.518291 0.000000 0.000000 \n",
967 | "3 0.000000 0.000000 0.000000 0.000000 0.000000 0.263025 0.000000 \n",
968 | "4 0.266593 0.000000 0.000000 0.000000 0.000000 0.000000 0.313605 \n",
969 | "5 0.000000 0.251549 0.251549 0.000000 0.000000 0.000000 0.000000 \n",
970 | "6 0.000000 0.313340 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
971 | "7 0.000000 0.000000 0.000000 0.204276 0.000000 0.000000 0.000000 \n",
972 | "8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
973 | "9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
974 | "\n",
975 | " and any are ... was we what who \\\n",
976 | "0 0.000000 0.20204 0.000000 ... 0.20204 0.000000 0.171753 0.000000 \n",
977 | "1 0.411553 0.00000 0.000000 ... 0.00000 0.000000 0.205776 0.000000 \n",
978 | "2 0.000000 0.00000 0.000000 ... 0.00000 0.000000 0.000000 0.000000 \n",
979 | "3 0.000000 0.00000 0.000000 ... 0.00000 0.000000 0.000000 0.223595 \n",
980 | "4 0.000000 0.00000 0.000000 ... 0.00000 0.000000 0.000000 0.000000 \n",
981 | "5 0.251549 0.00000 0.175717 ... 0.00000 0.000000 0.000000 0.000000 \n",
982 | "6 0.000000 0.00000 0.218880 ... 0.00000 0.000000 0.000000 0.000000 \n",
983 | "7 0.000000 0.00000 0.121303 ... 0.00000 0.000000 0.000000 0.173653 \n",
984 | "8 0.000000 0.00000 0.263628 ... 0.00000 0.377400 0.000000 0.000000 \n",
985 | "9 0.000000 0.00000 0.235430 ... 0.00000 0.337033 0.000000 0.000000 \n",
986 | "\n",
987 | " will win with world you your \n",
988 | "0 0.000000 0.000000 0.150263 0.000000 0.000000 0.000000 \n",
989 | "1 0.205776 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
990 | "2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
991 | "3 0.223595 0.263025 0.000000 0.000000 0.223595 0.000000 \n",
992 | "4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
993 | "5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
994 | "6 0.000000 0.000000 0.274136 0.000000 0.000000 0.000000 \n",
995 | "7 0.000000 0.000000 0.151926 0.000000 0.173653 0.204276 \n",
996 | "8 0.000000 0.000000 0.000000 0.377400 0.000000 0.000000 \n",
997 | "9 0.000000 0.000000 0.000000 0.337033 0.000000 0.000000 \n",
998 | "\n",
999 | "[10 rows x 96 columns]"
1000 | ]
1001 | },
1002 | "execution_count": 82,
1003 | "metadata": {},
1004 | "output_type": "execute_result"
1005 | }
1006 | ],
1007 | "source": [
1008 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
1009 | "vectorizer = TfidfVectorizer()\n",
1010 | "\n",
1011 | "X_train_tfidf = vectorizer.fit_transform(corpus[:10])\n",
1012 | "X_train_tfidf = pd.DataFrame(X_train_tfidf.toarray())\n",
1013 | "X_train_tfidf.columns = vectorizer.get_feature_names()\n",
1014 | "X_train_tfidf"
1015 | ]
1016 | },
1017 | {
1018 | "cell_type": "markdown",
1019 | "metadata": {},
1020 | "source": [
1021 | "$$Tf(w, d) = Number \\ of \\ times \\ word \\ `w` \\ appears \\ in \\ document \\ `d`$$\n",
1022 | "\n",
1023 | "$$IDF(w) = \\log \\frac{Total \\ number \\ of \\ documents}{Number \\ of \\ documents \\ with \\ word \\ `w`}$$\n",
1024 | "\n",
1025 | "$$Tfidf(w, d) = Tf(w, d) * IDF(w)$$"
1026 | ]
1027 | },
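{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal illustrative sketch (not part of the original pipeline): it reproduces the tf-idf values by hand from the raw counts, using scikit-learn's default smoothed idf and L2 row normalization, and reuses `X_train_counts` and `X_train_tfidf` from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: recompute tf-idf manually and compare with TfidfVectorizer's output\n",
"n_docs = X_train_counts.shape[0]           # 10 documents\n",
"tf = X_train_counts.values.astype(float)   # raw term counts from CountVectorizer\n",
"df = (tf > 0).sum(axis=0)                  # document frequency of each term\n",
"idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (scikit-learn default)\n",
"tfidf_manual = tf * idf\n",
"tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2-normalize each row\n",
"\n",
"np.allclose(tfidf_manual, X_train_tfidf.values)  # should be True up to floating point error"
]
},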
1028 | {
1029 | "cell_type": "markdown",
1030 | "metadata": {},
1031 | "source": [
1032 | "### Word Vectors\n",
1033 | "\n",
1034 | "Word vectors - also called *word embeddings* - are mathematical descriptions of individual words such that words that appear frequently together in the language will have similar values. In this way we can mathematically derive *context*.\n",
1035 | "\n",
1036 | "**There are two possible approaches:**\n",
1037 | "\n",
1038 | "
\n",
1039 | "\n",
1040 | "**CBOW (Continuous Bag Of Words):** It predicts the word, given context around the word as input\n",
1041 | "\n",
1042 | "**Skip-gram:** It predicts the context, given the word as input"
1043 | ]
1044 | },
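{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration (not used later in the notebook), the sketch below builds the training pairs the two objectives would see for a toy sentence with a context window of 2: CBOW maps *(context words → target word)*, while skip-gram maps *(target word → each context word)*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: training pairs for CBOW vs. skip-gram on a toy sentence\n",
"def training_pairs(tokens, window=2):\n",
"    cbow, skipgram = [], []\n",
"    for i, target in enumerate(tokens):\n",
"        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]\n",
"        cbow.append((context, target))                 # context -> target\n",
"        skipgram.extend((target, c) for c in context)  # target -> each context word\n",
"    return cbow, skipgram\n",
"\n",
"cbow_pairs, skipgram_pairs = training_pairs(\"how can i learn deep learning\".split())\n",
"cbow_pairs[:3], skipgram_pairs[:5]"
]
},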
1045 | {
1046 | "cell_type": "code",
1047 | "execution_count": 83,
1048 | "metadata": {},
1049 | "outputs": [],
1050 | "source": [
1051 | "import spacy\n",
1052 | "nlp = spacy.load('en_core_web_md')"
1053 | ]
1054 | },
1055 | {
1056 | "cell_type": "code",
1057 | "execution_count": 84,
1058 | "metadata": {},
1059 | "outputs": [
1060 | {
1061 | "data": {
1062 | "text/plain": [
1063 | "300"
1064 | ]
1065 | },
1066 | "execution_count": 84,
1067 | "metadata": {},
1068 | "output_type": "execute_result"
1069 | }
1070 | ],
1071 | "source": [
1072 | "len(nlp('dog').vector)"
1073 | ]
1074 | },
1075 | {
1076 | "cell_type": "code",
1077 | "execution_count": 85,
1078 | "metadata": {},
1079 | "outputs": [],
1080 | "source": [
1081 | "def most_similar(word, topn=5):\n",
1082 | " word = nlp.vocab[str(word)]\n",
1083 | " queries = [\n",
1084 | " w for w in word.vocab \n",
1085 | " if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)\n",
1086 | " ]\n",
1087 | "\n",
1088 | " by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)\n",
1089 | " return [(w.lower_,w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]"
1090 | ]
1091 | },
1092 | {
1093 | "cell_type": "code",
1094 | "execution_count": 86,
1095 | "metadata": {},
1096 | "outputs": [
1097 | {
1098 | "data": {
1099 | "text/plain": [
1100 | "[('princes', 0.7876614),\n",
1101 | " ('kings', 0.7876614),\n",
1102 | " ('prince', 0.73377377),\n",
1103 | " ('queen', 0.72526103),\n",
1104 | " ('scepter', 0.6726005),\n",
1105 | " ('throne', 0.6726005),\n",
1106 | " ('kingdoms', 0.6604046),\n",
1107 | " ('kingdom', 0.6604046),\n",
1108 | " ('lord', 0.6439695),\n",
1109 | " ('royal', 0.6168811)]"
1110 | ]
1111 | },
1112 | "execution_count": 86,
1113 | "metadata": {},
1114 | "output_type": "execute_result"
1115 | }
1116 | ],
1117 | "source": [
1118 | "most_similar(\"king\", topn=10)"
1119 | ]
1120 | },
1121 | {
1122 | "cell_type": "code",
1123 | "execution_count": 87,
1124 | "metadata": {},
1125 | "outputs": [
1126 | {
1127 | "data": {
1128 | "text/plain": [
1129 | "[('cheetah', 0.9999999),\n",
1130 | " ('lions', 0.7758893),\n",
1131 | " ('tiger', 0.7359829),\n",
1132 | " ('panther', 0.7359829),\n",
1133 | " ('leopard', 0.7359829),\n",
1134 | " ('elephant', 0.71239567),\n",
1135 | " ('hippo', 0.71239567),\n",
1136 | " ('zebra', 0.71239567),\n",
1137 | " ('rhino', 0.71239567),\n",
1138 | " ('giraffe', 0.71239567)]"
1139 | ]
1140 | },
1141 | "execution_count": 87,
1142 | "metadata": {},
1143 | "output_type": "execute_result"
1144 | }
1145 | ],
1146 | "source": [
1147 | "most_similar(\"lion\", topn=10)"
1148 | ]
1149 | },
1150 | {
1151 | "cell_type": "markdown",
1152 | "metadata": {},
1153 | "source": [
1154 | "Sentence (or document) objects have vectors, derived from the averages of individual token vectors. This makes it possible to compare similarities between whole documents."
1155 | ]
1156 | },
1157 | {
1158 | "cell_type": "code",
1159 | "execution_count": 88,
1160 | "metadata": {},
1161 | "outputs": [
1162 | {
1163 | "data": {
1164 | "text/plain": [
1165 | "300"
1166 | ]
1167 | },
1168 | "execution_count": 88,
1169 | "metadata": {},
1170 | "output_type": "execute_result"
1171 | }
1172 | ],
1173 | "source": [
1174 | "doc = nlp('The quick brown fox jumped over the lazy dogs.')\n",
1175 | "len(doc.vector)"
1176 | ]
1177 | },
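{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of the point above: two questions with little word overlap can still get a high similarity score because their averaged word vectors are close (using the `nlp` model already loaded)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: document-level similarity from averaged word vectors\n",
"doc1 = nlp(\"How do I invest in the stock market?\")\n",
"doc2 = nlp(\"What is the best way to buy shares?\")\n",
"doc3 = nlp(\"Which fish would survive in salt water?\")\n",
"doc1.similarity(doc2), doc1.similarity(doc3)  # the related pair should score higher than the unrelated one"
]
},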
1178 | {
1179 | "cell_type": "markdown",
1180 | "metadata": {},
1181 | "source": [
1182 | "### Bert Sentence Transformer"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "code",
1187 | "execution_count": 89,
1188 | "metadata": {},
1189 | "outputs": [],
1190 | "source": [
1191 | "from sentence_transformers import SentenceTransformer\n",
1192 | "import scipy.spatial\n",
1193 | "embedder = SentenceTransformer('bert-base-nli-mean-tokens')"
1194 | ]
1195 | },
1196 | {
1197 | "cell_type": "code",
1198 | "execution_count": 90,
1199 | "metadata": {},
1200 | "outputs": [
1201 | {
1202 | "name": "stdout",
1203 | "output_type": "stream",
1204 | "text": [
1205 | "CPU times: user 1min 54s, sys: 3.71 s, total: 1min 58s\n",
1206 | "Wall time: 34.9 s\n"
1207 | ]
1208 | }
1209 | ],
1210 | "source": [
1211 | "%%time\n",
1212 | "corpus_embeddings = embedder.encode(corpus)"
1213 | ]
1214 | },
1215 | {
1216 | "cell_type": "markdown",
1217 | "metadata": {},
1218 | "source": [
1219 | "----\n",
1220 | "## Candidate Genration using Faiss vector similarity search library"
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "markdown",
1225 | "metadata": {},
1226 | "source": [
1227 | "Faiss is a library developed by Facebook AI Research. It is for effecient similarity search and clustering of dense vectors.\n",
1228 | "\n",
1229 | "**References:**\n",
1230 | "\n",
1231 | "1. [Tutorial](https://github.com/facebookresearch/faiss/wiki/Getting-started)\n",
1232 | "2. [facebookresearch/faiss](https://github.com/facebookresearch/faiss)"
1233 | ]
1234 | },
1235 | {
1236 | "cell_type": "code",
1237 | "execution_count": 91,
1238 | "metadata": {},
1239 | "outputs": [
1240 | {
1241 | "name": "stdout",
1242 | "output_type": "stream",
1243 | "text": [
1244 | "True\n",
1245 | "1362\n"
1246 | ]
1247 | }
1248 | ],
1249 | "source": [
1250 | "import faiss\n",
1251 | "d= 768\n",
1252 | "index = faiss.IndexFlatL2(d)\n",
1253 | "print(index.is_trained)\n",
1254 | "index.add(np.stack(corpus_embeddings, axis=0))\n",
1255 | "print(index.ntotal)"
1256 | ]
1257 | },
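{
"cell_type": "markdown",
"metadata": {},
"source": [
"`IndexFlatL2` ranks candidates by squared Euclidean distance. If cosine similarity is preferred instead, a common alternative (sketched below, not used in the rest of the notebook) is to L2-normalize the embeddings and use an inner-product index, since the inner product of unit vectors equals their cosine similarity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: cosine-similarity search with Faiss via normalized vectors + inner product\n",
"emb = np.stack(corpus_embeddings, axis=0).astype('float32')\n",
"faiss.normalize_L2(emb)              # in-place L2 normalization of each row\n",
"cosine_index = faiss.IndexFlatIP(d)  # inner product == cosine similarity on unit vectors\n",
"cosine_index.add(emb)"
]
},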
1258 | {
1259 | "cell_type": "code",
1260 | "execution_count": 92,
1261 | "metadata": {},
1262 | "outputs": [],
1263 | "source": [
1264 | "# queries = ['What is the step by step guide to invest in share market in india?', 'How can Internet speed be increased by hacking through DNS?']\n",
1265 | "queries = question_pairs['question1'][:3].tolist()\n",
1266 | "query_embeddings = embedder.encode(queries)"
1267 | ]
1268 | },
1269 | {
1270 | "cell_type": "code",
1271 | "execution_count": 93,
1272 | "metadata": {},
1273 | "outputs": [
1274 | {
1275 | "name": "stdout",
1276 | "output_type": "stream",
1277 | "text": [
1278 | "[[ 858 998 1025 983 73]\n",
1279 | " [ 771 1015 775 1014 1133]\n",
1280 | " [ 436 455 457 463 462]]\n"
1281 | ]
1282 | }
1283 | ],
1284 | "source": [
1285 | "k = 5 # we want to see 4 nearest neighbors\n",
1286 | "D, I = index.search(np.stack(query_embeddings, axis=0), k) # actual search\n",
1287 | "print(I) # neighbors of the 5 first queries"
1288 | ]
1289 | },
1290 | {
1291 | "cell_type": "code",
1292 | "execution_count": 94,
1293 | "metadata": {},
1294 | "outputs": [
1295 | {
1296 | "name": "stdout",
1297 | "output_type": "stream",
1298 | "text": [
1299 | "\n",
1300 | "======================\n",
1301 | "\n",
1302 | "Query: What is purpose of life?\n",
1303 | "\n",
1304 | "Top 5 most similar sentences in corpus:\n",
1305 | "What is purpose of life? (Distance: 0.0000)\n",
1306 | "What is the purpose of life? (Distance: 7.9868)\n",
1307 | "What is your purpose of life? (Distance: 12.3884)\n",
1308 | "What is the meaning or purpose of life? (Distance: 13.6231)\n",
1309 | "From your perspective, what is the purpose of life? (Distance: 17.5448)\n",
1310 | "\n",
1311 | "======================\n",
1312 | "\n",
1313 | "Query: What are your New Year's resolutions for 2017?\n",
1314 | "\n",
1315 | "Top 5 most similar sentences in corpus:\n",
1316 | "What are your New Year's resolutions for 2017? (Distance: 0.0000)\n",
1317 | "What is your New Year's resolutions for 2017? (Distance: 0.6093)\n",
1318 | "What are your new year resolutions for 2017? (Distance: 5.8446)\n",
1319 | "What is your New Year's resolution for 2017? (Distance: 6.6011)\n",
1320 | "What's your New Year's resolution for 2017? (Distance: 8.1350)\n",
1321 | "\n",
1322 | "======================\n",
1323 | "\n",
1324 | "Query: How will Indian GDP be affected from banning 500 and 1000 rupees notes?\n",
1325 | "\n",
1326 | "Top 5 most similar sentences in corpus:\n",
1327 | "How will Indian GDP be affected from banning 500 and 1000 rupees notes? (Distance: 0.0000)\n",
1328 | "How will the ban of 1000 and 500 rupee notes affect the Indian economy? (Distance: 24.4171)\n",
1329 | "How will the ban of Rs 500 and Rs 1000 notes affect Indian economy? (Distance: 26.4862)\n",
1330 | "How will the ban on Rs 500 and 1000 notes impact the Indian economy? (Distance: 26.5522)\n",
1331 | "How will the ban on 500₹ and 1000₹ notes impact the Indian economy? (Distance: 28.7938)\n"
1332 | ]
1333 | }
1334 | ],
1335 | "source": [
1336 | "for query, query_embedding in zip(queries, query_embeddings):\n",
1337 | " distances, indices = index.search(np.asarray(query_embedding).reshape(1,768),k)\n",
1338 | " print(\"\\n======================\\n\")\n",
1339 | " print(\"Query:\", query)\n",
1340 | " print(\"\\nTop 5 most similar sentences in corpus:\")\n",
1341 | " for idx in range(0,5):\n",
1342 | " print(corpus[indices[0,idx]], \"(Distance: %.4f)\" % distances[0,idx])"
1343 | ]
1344 | },
1345 | {
1346 | "cell_type": "code",
1347 | "execution_count": 122,
1348 | "metadata": {},
1349 | "outputs": [],
1350 | "source": [
1351 | "query = \"How will Indian GDP be affected from banning 500 and 1000 rupees notes?\"\n",
1352 | "query_embed = embedder.encode(query)\n",
1353 | "distances, indices = index.search(np.asarray(query_embedding).reshape(1,768),50)\n",
1354 | "relevant_docs = [corpus[indices[0,idx]] for idx in range(50)]"
1355 | ]
1356 | },
1357 | {
1358 | "cell_type": "markdown",
1359 | "metadata": {},
1360 | "source": [
1361 | "----\n",
1362 | "## Reranking using Bidirectional LSTM model"
1363 | ]
1364 | },
1365 | {
1366 | "cell_type": "markdown",
1367 | "metadata": {},
1368 | "source": [
1369 | "
\n",
1370 | "\n",
1371 | "**Reference:** https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/"
1372 | ]
1373 | },
1374 | {
1375 | "cell_type": "code",
1376 | "execution_count": 344,
1377 | "metadata": {},
1378 | "outputs": [],
1379 | "source": [
1380 | "import re\n",
1381 | "import nltk\n",
1382 | "from nltk.tokenize.toktok import ToktokTokenizer\n",
1383 | "from nltk.stem import WordNetLemmatizer, SnowballStemmer\n",
1384 | "toko_tokenizer = ToktokTokenizer()\n",
1385 | "wordnet_lemmatizer = WordNetLemmatizer()\n",
1386 | "\n",
1387 | "def normalize_text(text):\n",
1388 | " puncts = ['/', ',', '.', '\"', ':', ')', '(', '-', '!', '?', '|', ';', '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\\\', '•', '~', '@', '£', \n",
1389 | " '·', '_', '{', '}', '©', '^', '®', '`', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', \n",
1390 | " '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', \n",
1391 | " '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', \n",
1392 | " '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]\n",
1393 | "\n",
1394 | " def clean_text(text):\n",
1395 | " text = str(text)\n",
1396 | " text = text.replace('\\n', '')\n",
1397 | " text = text.replace('\\r', '')\n",
1398 | " for punct in puncts:\n",
1399 | " if punct in text:\n",
1400 | " text = text.replace(punct, '')\n",
1401 | " return text.lower()\n",
1402 | "\n",
1403 | " def clean_numbers(text):\n",
1404 | " if bool(re.search(r'\\d', text)):\n",
1405 | " text = re.sub('[0-9]{5,}', '#####', text)\n",
1406 | " text = re.sub('[0-9]{4}', '####', text)\n",
1407 | " text = re.sub('[0-9]{3}', '###', text)\n",
1408 | " text = re.sub('[0-9]{2}', '##', text)\n",
1409 | " return text\n",
1410 | "\n",
1411 | " contraction_dict = {\"ain't\": \"is not\", \"aren't\": \"are not\",\"can't\": \"cannot\", \"'cause\": \"because\", \"could've\": \"could have\", \"couldn't\": \"could not\", \"didn't\": \"did not\", \"doesn't\": \"does not\", \"don't\": \"do not\", \"hadn't\": \"had not\", \"hasn't\": \"has not\", \"haven't\": \"have not\", \"he'd\": \"he would\",\"he'll\": \"he will\", \"he's\": \"he is\", \"how'd\": \"how did\", \"how'd'y\": \"how do you\", \"how'll\": \"how will\", \"how's\": \"how is\", \"I'd\": \"I would\", \"I'd've\": \"I would have\", \"I'll\": \"I will\", \"I'll've\": \"I will have\",\"I'm\": \"I am\", \"I've\": \"I have\", \"i'd\": \"i would\", \"i'd've\": \"i would have\", \"i'll\": \"i will\", \"i'll've\": \"i will have\",\"i'm\": \"i am\", \"i've\": \"i have\", \"isn't\": \"is not\", \"it'd\": \"it would\", \"it'd've\": \"it would have\", \"it'll\": \"it will\", \"it'll've\": \"it will have\",\"it's\": \"it is\", \"let's\": \"let us\", \"ma'am\": \"madam\", \"mayn't\": \"may not\", \"might've\": \"might have\",\"mightn't\": \"might not\",\"mightn't've\": \"might not have\", \"must've\": \"must have\", \"mustn't\": \"must not\", \"mustn't've\": \"must not have\", \"needn't\": \"need not\", \"needn't've\": \"need not have\",\"o'clock\": \"of the clock\", \"oughtn't\": \"ought not\", \"oughtn't've\": \"ought not have\", \"shan't\": \"shall not\", \"sha'n't\": \"shall not\", \"shan't've\": \"shall not have\", \"she'd\": \"she would\", \"she'd've\": \"she would have\", \"she'll\": \"she will\", \"she'll've\": \"she will have\", \"she's\": \"she is\", \"should've\": \"should have\", \"shouldn't\": \"should not\", \"shouldn't've\": \"should not have\", \"so've\": \"so have\",\"so's\": \"so as\", \"this's\": \"this is\",\"that'd\": \"that would\", \"that'd've\": \"that would have\", \"that's\": \"that is\", \"there'd\": \"there would\", \"there'd've\": \"there would have\", \"there's\": \"there is\", \"here's\": \"here is\",\"they'd\": \"they would\", \"they'd've\": \"they would have\", \"they'll\": \"they will\", \"they'll've\": \"they will have\", \"they're\": \"they are\", \"they've\": \"they have\", \"to've\": \"to have\", \"wasn't\": \"was not\", \"we'd\": \"we would\", \"we'd've\": \"we would have\", \"we'll\": \"we will\", \"we'll've\": \"we will have\", \"we're\": \"we are\", \"we've\": \"we have\", \"weren't\": \"were not\", \"what'll\": \"what will\", \"what'll've\": \"what will have\", \"what're\": \"what are\", \"what's\": \"what is\", \"what've\": \"what have\", \"when's\": \"when is\", \"when've\": \"when have\", \"where'd\": \"where did\", \"where's\": \"where is\", \"where've\": \"where have\", \"who'll\": \"who will\", \"who'll've\": \"who will have\", \"who's\": \"who is\", \"who've\": \"who have\", \"why's\": \"why is\", \"why've\": \"why have\", \"will've\": \"will have\", \"won't\": \"will not\", \"won't've\": \"will not have\", \"would've\": \"would have\", \"wouldn't\": \"would not\", \"wouldn't've\": \"would not have\", \"y'all\": \"you all\", \"y'all'd\": \"you all would\",\"y'all'd've\": \"you all would have\",\"y'all're\": \"you all are\",\"y'all've\": \"you all have\",\"you'd\": \"you would\", \"you'd've\": \"you would have\", \"you'll\": \"you will\", \"you'll've\": \"you will have\", \"you're\": \"you are\", \"you've\": \"you have\"}\n",
1412 | "\n",
1413 | " def _get_contractions(contraction_dict):\n",
1414 | " contraction_re = re.compile('(%s)' % '|'.join(contraction_dict.keys()))\n",
1415 | " return contraction_dict, contraction_re\n",
1416 | "\n",
1417 | " contractions, contractions_re = _get_contractions(contraction_dict)\n",
1418 | "\n",
1419 | " def replace_contractions(text):\n",
1420 | " def replace(match):\n",
1421 | " return contractions[match.group(0)]\n",
1422 | " return contractions_re.sub(replace, text)\n",
1423 | "\n",
1424 | " stopword_list = nltk.corpus.stopwords.words('english')\n",
1425 | "\n",
1426 | " def remove_stopwords(text, is_lower_case=True):\n",
1427 | " tokens = toko_tokenizer.tokenize(text)\n",
1428 | " tokens = [token.strip() for token in tokens]\n",
1429 | " if is_lower_case:\n",
1430 | " filtered_tokens = [token for token in tokens if token not in stopword_list]\n",
1431 | " else:\n",
1432 | " filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]\n",
1433 | " filtered_text = ' '.join(filtered_tokens) \n",
1434 | " return filtered_text\n",
1435 | "\n",
1436 | " def lemmatizer(text):\n",
1437 | " tokens = toko_tokenizer.tokenize(text)\n",
1438 | " tokens = [token.strip() for token in tokens]\n",
1439 | " tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens]\n",
1440 | " return ' '.join(tokens)\n",
1441 | "\n",
1442 | " def trim_text(text):\n",
1443 | " tokens = toko_tokenizer.tokenize(text)\n",
1444 | " tokens = [token.strip() for token in tokens]\n",
1445 | " return ' '.join(tokens)\n",
1446 | " \n",
1447 | " def remove_non_english(text):\n",
1448 | " tokens = toko_tokenizer.tokenize(text)\n",
1449 | " tokens = [token.strip() for token in tokens]\n",
1450 | " tokens = [token for token in tokens if d.check(token)]\n",
1451 | " eng_text = ' '.join(tokens)\n",
1452 | " return eng_text\n",
1453 | "\n",
1454 | " text_norm = clean_text(text)\n",
1455 | " text_norm = clean_numbers(text_norm)\n",
1456 | " text_norm = replace_contractions(text_norm)\n",
1457 | "# text_norm = remove_stopwords(text_norm)\n",
1458 | "# text_norm = remove_non_english(text_norm)\n",
1459 | " text_norm = lemmatizer(text_norm)\n",
1460 | " text_norm = trim_text(text_norm)\n",
1461 | " return text_norm"
1462 | ]
1463 | },
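{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the normalizer on a made-up question (illustrative only): most punctuation is stripped, text is lowercased, contractions are expanded, multi-digit numbers are bucketed into `#` tokens, and tokens are lemmatized."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"normalize_text(\"What's the best way to earn 50000 rupees in 12 months?\")"
]
},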
1464 | {
1465 | "cell_type": "code",
1466 | "execution_count": 328,
1467 | "metadata": {},
1468 | "outputs": [
1469 | {
1470 | "data": {
1471 | "text/html": [
1472 | "\n",
1473 | "\n",
1486 | "
\n",
1487 | " \n",
1488 | " \n",
1489 | " | \n",
1490 | " id | \n",
1491 | " qid1 | \n",
1492 | " qid2 | \n",
1493 | " question1 | \n",
1494 | " question2 | \n",
1495 | " is_duplicate | \n",
1496 | "
\n",
1497 | " \n",
1498 | " \n",
1499 | " \n",
1500 | " 0 | \n",
1501 | " 0 | \n",
1502 | " 1 | \n",
1503 | " 2 | \n",
1504 | " What is the step by step guide to invest in sh... | \n",
1505 | " What is the step by step guide to invest in sh... | \n",
1506 | " 0 | \n",
1507 | "
\n",
1508 | " \n",
1509 | " 1 | \n",
1510 | " 1 | \n",
1511 | " 3 | \n",
1512 | " 4 | \n",
1513 | " What is the story of Kohinoor (Koh-i-Noor) Dia... | \n",
1514 | " What would happen if the Indian government sto... | \n",
1515 | " 0 | \n",
1516 | "
\n",
1517 | " \n",
1518 | " 2 | \n",
1519 | " 2 | \n",
1520 | " 5 | \n",
1521 | " 6 | \n",
1522 | " How can I increase the speed of my internet co... | \n",
1523 | " How can Internet speed be increased by hacking... | \n",
1524 | " 0 | \n",
1525 | "
\n",
1526 | " \n",
1527 | " 3 | \n",
1528 | " 3 | \n",
1529 | " 7 | \n",
1530 | " 8 | \n",
1531 | " Why am I mentally very lonely? How can I solve... | \n",
1532 | " Find the remainder when [math]23^{24}[/math] i... | \n",
1533 | " 0 | \n",
1534 | "
\n",
1535 | " \n",
1536 | " 4 | \n",
1537 | " 4 | \n",
1538 | " 9 | \n",
1539 | " 10 | \n",
1540 | " Which one dissolve in water quikly sugar, salt... | \n",
1541 | " Which fish would survive in salt water? | \n",
1542 | " 0 | \n",
1543 | "
\n",
1544 | " \n",
1545 | "
\n",
1546 | "
"
1547 | ],
1548 | "text/plain": [
1549 | " id qid1 qid2 question1 \\\n",
1550 | "0 0 1 2 What is the step by step guide to invest in sh... \n",
1551 | "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
1552 | "2 2 5 6 How can I increase the speed of my internet co... \n",
1553 | "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
1554 | "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
1555 | "\n",
1556 | " question2 is_duplicate \n",
1557 | "0 What is the step by step guide to invest in sh... 0 \n",
1558 | "1 What would happen if the Indian government sto... 0 \n",
1559 | "2 How can Internet speed be increased by hacking... 0 \n",
1560 | "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
1561 | "4 Which fish would survive in salt water? 0 "
1562 | ]
1563 | },
1564 | "execution_count": 328,
1565 | "metadata": {},
1566 | "output_type": "execute_result"
1567 | }
1568 | ],
1569 | "source": [
1570 | "question_pairs.head()"
1571 | ]
1572 | },
1573 | {
1574 | "cell_type": "code",
1575 | "execution_count": 329,
1576 | "metadata": {},
1577 | "outputs": [
1578 | {
1579 | "data": {
1580 | "text/plain": [
1581 | "(404351, 6)"
1582 | ]
1583 | },
1584 | "execution_count": 329,
1585 | "metadata": {},
1586 | "output_type": "execute_result"
1587 | }
1588 | ],
1589 | "source": [
1590 | "question_pairs.shape"
1591 | ]
1592 | },
1593 | {
1594 | "cell_type": "code",
1595 | "execution_count": 293,
1596 | "metadata": {},
1597 | "outputs": [],
1598 | "source": [
1599 | "embedding_path = \"./../../Embeddings/glove.twitter.27B/glove.twitter.27B.200d.txt\"\n",
1600 | "def get_word2vec(file_path):\n",
1601 | " file = open(embedding_path, \"r\")\n",
1602 | " if (file):\n",
1603 | " word2vec = dict()\n",
1604 | " split = file.read().splitlines()\n",
1605 | " for line in split:\n",
1606 | " key = line.split(' ',1)[0]\n",
1607 | " value = np.array([float(val) for val in line.split(' ')[1:]])\n",
1608 | " word2vec[key] = value\n",
1609 | " return (word2vec)\n",
1610 | " else:\n",
1611 | " print(\"invalid fiel path\")\n",
1612 | "w2v = get_word2vec(embedding_path)"
1613 | ]
1614 | },
1615 | {
1616 | "cell_type": "code",
1617 | "execution_count": 352,
1618 | "metadata": {},
1619 | "outputs": [],
1620 | "source": [
1621 | "total_text = pd.concat([question_pairs['question1'], question_pairs['question2']]).reset_index(drop=True)\n",
1622 | "total_text = total_text.apply(lambda x: str(x))\n",
1623 | "total_text = total_text.apply(lambda x: normalize_text(x))\n",
1624 | "max_features = 6000\n",
1625 | "tokenizer = Tokenizer(num_words=max_features)\n",
1626 | "tokenizer.fit_on_texts(total_text)\n",
1627 | "question_1_sequenced = tokenizer.texts_to_sequences(question_pairs['question1'].apply(lambda x: normalize_text(x)))\n",
1628 | "question_2_sequenced = tokenizer.texts_to_sequences(question_pairs['question2'].apply(lambda x: normalize_text(x)))\n",
1629 | "vocab_size = len(tokenizer.word_index) + 1"
1630 | ]
1631 | },
1632 | {
1633 | "cell_type": "code",
1634 | "execution_count": 353,
1635 | "metadata": {},
1636 | "outputs": [
1637 | {
1638 | "data": {
1639 | "text/plain": [
1640 | "92423"
1641 | ]
1642 | },
1643 | "execution_count": 353,
1644 | "metadata": {},
1645 | "output_type": "execute_result"
1646 | }
1647 | ],
1648 | "source": [
1649 | "vocab_size"
1650 | ]
1651 | },
1652 | {
1653 | "cell_type": "code",
1654 | "execution_count": 354,
1655 | "metadata": {},
1656 | "outputs": [],
1657 | "source": [
1658 | "maxlen = 100\n",
1659 | "question_1_padded = pad_sequences(question_1_sequenced, maxlen=maxlen)\n",
1660 | "question_2_padded = pad_sequences(question_2_sequenced, maxlen=maxlen)"
1661 | ]
1662 | },
1663 | {
1664 | "cell_type": "code",
1665 | "execution_count": 359,
1666 | "metadata": {},
1667 | "outputs": [],
1668 | "source": [
1669 | "y = question_pairs['is_duplicate']"
1670 | ]
1671 | },
1672 | {
1673 | "cell_type": "code",
1674 | "execution_count": 381,
1675 | "metadata": {},
1676 | "outputs": [],
1677 | "source": [
1678 | "from tqdm import tqdm"
1679 | ]
1680 | },
1681 | {
1682 | "cell_type": "code",
1683 | "execution_count": 384,
1684 | "metadata": {},
1685 | "outputs": [],
1686 | "source": [
1687 | "from numpy import zeros\n",
1688 | "embedding_matrix = zeros((vocab_size, 768))\n",
1689 | "for word, i in tokenizer.word_index.items():\n",
1690 | " embedding_vector = w2v.get(word)\n",
1691 | " if embedding_vector is not None:\n",
1692 | " embedding_matrix[i] = embedding_vector[0]"
1693 | ]
1694 | },
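{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small diagnostic sketch (not required for training): check what fraction of the tokenizer's vocabulary actually has a GloVe vector, since out-of-vocabulary words keep all-zero rows in `embedding_matrix`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: GloVe coverage of the question vocabulary\n",
"covered = sum(1 for word in tokenizer.word_index if word in w2v)\n",
"covered / len(tokenizer.word_index)"
]
},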
1695 | {
1696 | "cell_type": "code",
1697 | "execution_count": 357,
1698 | "metadata": {},
1699 | "outputs": [],
1700 | "source": [
1701 | "embedding_size = 128\n",
1702 | "max_len = 100\n",
1703 | "\n",
1704 | "inp1 = Input(shape=(100,))\n",
1705 | "inp2 = Input(shape=(100,))\n",
1706 | "\n",
1707 | "x1 = Embedding(vocab_size, 200, weights=[embedding_matrix], input_length=max_len)(inp1)\n",
1708 | "x2 = Embedding(vocab_size, 200, weights=[embedding_matrix], input_length=max_len)(inp2)\n",
1709 | "\n",
1710 | "x3 = Bidirectional(LSTM(32, return_sequences = True))(x1)\n",
1711 | "x4 = Bidirectional(LSTM(32, return_sequences = True))(x2)\n",
1712 | "\n",
1713 | "x5 = GlobalMaxPool1D()(x3)\n",
1714 | "x6 = GlobalMaxPool1D()(x4)\n",
1715 | "\n",
1716 | "x7 = dot([x5, x6], axes=1)\n",
1717 | "\n",
1718 | "x8 = Dense(40, activation='relu')(x7)\n",
1719 | "x9 = Dropout(0.05)(x8)\n",
1720 | "x10 = Dense(10, activation='relu')(x9)\n",
1721 | "output = Dense(1, activation=\"sigmoid\")(x10)\n",
1722 | "\n",
1723 | "model = Model(inputs=[inp1, inp2], outputs=output)\n",
1724 | "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
1725 | "batch_size = 256\n",
1726 | "epochs = 4"
1727 | ]
1728 | },
1729 | {
1730 | "cell_type": "code",
1731 | "execution_count": 360,
1732 | "metadata": {},
1733 | "outputs": [
1734 | {
1735 | "name": "stdout",
1736 | "output_type": "stream",
1737 | "text": [
1738 | "Train on 323480 samples, validate on 80871 samples\n",
1739 | "Epoch 1/4\n",
1740 | "323480/323480 [==============================] - 1096s 3ms/step - loss: 0.5212 - acc: 0.7372 - val_loss: 0.4596 - val_acc: 0.7802\n",
1741 | "Epoch 2/4\n",
1742 | "323480/323480 [==============================] - 1101s 3ms/step - loss: 0.4169 - acc: 0.8038 - val_loss: 0.4300 - val_acc: 0.7978\n",
1743 | "Epoch 3/4\n",
1744 | "323480/323480 [==============================] - 1135s 4ms/step - loss: 0.3536 - acc: 0.8400 - val_loss: 0.4340 - val_acc: 0.7976\n",
1745 | "Epoch 4/4\n",
1746 | "323480/323480 [==============================] - 1077s 3ms/step - loss: 0.2963 - acc: 0.8701 - val_loss: 0.4411 - val_acc: 0.8035\n"
1747 | ]
1748 | },
1749 | {
1750 | "data": {
1751 | "text/plain": [
1752 | ""
1753 | ]
1754 | },
1755 | "execution_count": 360,
1756 | "metadata": {},
1757 | "output_type": "execute_result"
1758 | }
1759 | ],
1760 | "source": [
1761 | "model.fit([question_1_padded, question_2_padded], y, batch_size=batch_size, epochs=epochs, validation_split=0.2, )"
1762 | ]
1763 | },
1764 | {
1765 | "cell_type": "markdown",
1766 | "metadata": {},
1767 | "source": [
1768 | "--------\n",
1769 | "## Combining candidate generation and reranking"
1770 | ]
1771 | },
1772 | {
1773 | "cell_type": "code",
1774 | "execution_count": 361,
1775 | "metadata": {},
1776 | "outputs": [
1777 | {
1778 | "data": {
1779 | "text/plain": [
1780 | "'How will Indian GDP be affected from banning 500 and 1000 rupees notes?'"
1781 | ]
1782 | },
1783 | "execution_count": 361,
1784 | "metadata": {},
1785 | "output_type": "execute_result"
1786 | }
1787 | ],
1788 | "source": [
1789 | "query"
1790 | ]
1791 | },
1792 | {
1793 | "cell_type": "code",
1794 | "execution_count": 363,
1795 | "metadata": {},
1796 | "outputs": [],
1797 | "source": [
1798 | "query_copy = [query]*len(relevant_docs)\n",
1799 | "question_1_sequenced_final = tokenizer.texts_to_sequences(query_copy)\n",
1800 | "question_2_sequenced_final = tokenizer.texts_to_sequences(relevant_docs)"
1801 | ]
1802 | },
1803 | {
1804 | "cell_type": "code",
1805 | "execution_count": 364,
1806 | "metadata": {},
1807 | "outputs": [],
1808 | "source": [
1809 | "maxlen = 100\n",
1810 | "question_1_padded_final = pad_sequences(question_1_sequenced_final, maxlen=maxlen)\n",
1811 | "question_2_padded_final = pad_sequences(question_2_sequenced_final, maxlen=maxlen)"
1812 | ]
1813 | },
1814 | {
1815 | "cell_type": "code",
1816 | "execution_count": 365,
1817 | "metadata": {},
1818 | "outputs": [],
1819 | "source": [
1820 | "preds_test = model.predict([question_1_padded_final, question_2_padded_final])\n",
1821 | "preds_test = np.array([x[0] for x in preds_test])"
1822 | ]
1823 | },
1824 | {
1825 | "cell_type": "code",
1826 | "execution_count": 390,
1827 | "metadata": {},
1828 | "outputs": [
1829 | {
1830 | "data": {
1831 | "text/plain": [
1832 | "['What do you think about banning 500 and 1000 rupee notes in India?',\n",
1833 | " 'What will be the implications of banning 500 and 1000 rupees currency notes on Indian economy?',\n",
1834 | " 'What will be the consequences of 500 and 1000 rupee notes banning?',\n",
1835 | " 'What will be the effects after banning on 500 and 1000 rupee notes?',\n",
1836 | " 'What will be the impact on real estate by banning 500 and 1000 rupee notes from India?',\n",
1837 | " 'How is banning 500 and 1000 INR going to help Indian economy?',\n",
1838 | " 'What are your views on India banning 500 and 1000 notes? In what way it will affect Indian economy?',\n",
1839 | " 'How is discontinuing 500 and 1000 rupee note going to put a hold on black money in India?',\n",
1840 | " 'What will be the result of banning 500 and 1000 rupees note in India?',\n",
1841 | " 'What are the economic implications of banning 500 and 1000 rupee notes?']"
1842 | ]
1843 | },
1844 | "execution_count": 390,
1845 | "metadata": {},
1846 | "output_type": "execute_result"
1847 | }
1848 | ],
1849 | "source": [
1850 | "[relevant_docs[x] for x in preds_test.argsort()[::-1]][:10]"
1851 | ]
1852 | },
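{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two-stage pipeline reusable, here is a minimal sketch that wraps candidate generation (BERT sentence embeddings + Faiss) and reranking (BiLSTM matcher) into one function. It assumes the objects defined above (`embedder`, `index`, `corpus`, `tokenizer`, `model`, `normalize_text`, `pad_sequences`) are in scope; unlike the cells above, it also normalizes the text before tokenizing, matching how the tokenizer was fit."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def find_similar_questions(query, n_candidates=50, topn=10, maxlen=100):\n",
"    # Stage 1: candidate generation with the Faiss index over BERT sentence embeddings\n",
"    query_vec = embedder.encode([query])\n",
"    _, idx = index.search(np.asarray(query_vec).reshape(1, -1), n_candidates)\n",
"    candidates = [corpus[i] for i in idx[0]]\n",
"\n",
"    # Stage 2: rerank the candidates with the trained BiLSTM matching model\n",
"    q1 = tokenizer.texts_to_sequences([normalize_text(query)] * len(candidates))\n",
"    q2 = tokenizer.texts_to_sequences([normalize_text(c) for c in candidates])\n",
"    scores = model.predict([pad_sequences(q1, maxlen=maxlen), pad_sequences(q2, maxlen=maxlen)])[:, 0]\n",
"    return sorted(zip(candidates, scores), key=lambda t: t[1], reverse=True)[:topn]\n",
"\n",
"find_similar_questions(\"How can I improve my English speaking skills?\")"
]
},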
1853 | {
1854 | "cell_type": "code",
1855 | "execution_count": null,
1856 | "metadata": {},
1857 | "outputs": [],
1858 | "source": []
1859 | }
1860 | ],
1861 | "metadata": {
1862 | "kernelspec": {
1863 | "display_name": "Python 3",
1864 | "language": "python",
1865 | "name": "python3"
1866 | },
1867 | "language_info": {
1868 | "codemirror_mode": {
1869 | "name": "ipython",
1870 | "version": 3
1871 | },
1872 | "file_extension": ".py",
1873 | "mimetype": "text/x-python",
1874 | "name": "python",
1875 | "nbconvert_exporter": "python",
1876 | "pygments_lexer": "ipython3",
1877 | "version": "3.6.10"
1878 | }
1879 | },
1880 | "nbformat": 4,
1881 | "nbformat_minor": 4
1882 | }
1883 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep-Learning-for-Semantic-Text-Matching
2 |
3 | [Deep Learning for Semantic Text Matching](https://medium.com/swlh/deep-learning-for-semantic-text-matching-d4df6c2cf4c5)
4 |
--------------------------------------------------------------------------------