├── Images
│   ├── Classification_Report.JPG
│   ├── prediction.JPG
│   └── top10_tags.JPG
├── README.md
├── Stackoverflow Clean Questions.ipynb
└── Stackoverflow Tags Map & Model.ipynb
/Images/Classification_Report.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/theRajeshReddy/StackOverFlow-Classification/8ba027db0dfa3d09724a62838bebdd979b031af0/Images/Classification_Report.JPG
--------------------------------------------------------------------------------
/Images/prediction.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/theRajeshReddy/StackOverFlow-Classification/8ba027db0dfa3d09724a62838bebdd979b031af0/Images/prediction.JPG
--------------------------------------------------------------------------------
/Images/top10_tags.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/theRajeshReddy/StackOverFlow-Classification/8ba027db0dfa3d09724a62838bebdd979b031af0/Images/top10_tags.JPG
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Multiclass Multilabel Prediction For StackOverflow Questions
2 |
3 | **Data set** : https://www.kaggle.com/therajeshreddy/stackoverflow
4 |
5 | **Objective** : Given the text of questions from Stack Overflow posts, predict the tags associated with them.
6 |
7 | This is a scaled-down version that predicts only the 10 most frequently occurring tags.
8 |
9 | **Programming Language** : Python using nltk & Keras
10 |
11 | **Model Architecture** : Deep Learning using Recurrent Neural Network (RNN)
12 |
13 | **About Data Set**
14 |
15 | The dataset contains the text of questions and answers, along with their corresponding tags, from the Stack Overflow programming Q&A website.
16 |
17 | It is organized as three files:
18 |
19 | 1. Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions.
20 |
21 | 2. Tags contains the tags on each of these questions.
22 |
23 | 3. Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table. *We don't use this file as we want to predict Tags given a question*
24 |
25 | **Data Pre-Processing**
26 |
27 | >Questions File
28 | *Code* : Stackoverflow Clean Questions.ipynb
29 |
30 | 1. Read the Questions file
31 | 2. Drop all columns except Id, Title and Body
32 | 3. The text in the Body column contains many HTML tags. Use a regular expression to clean the Body column by removing them
33 | ```python
34 | import re
35 | def rem_html_tags(body):
36 |     regex = re.compile('<.*?>')
37 |     return re.sub(regex, '', body)
38 | ques['Body'] = ques['Body'].apply(rem_html_tags)
39 | ```
40 | 4. Save the questions file for later use
41 | ```python
42 | ques.to_csv('question_clean.csv',index=False)
43 | ```
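As a quick check of step 3, the sketch below applies the same non-greedy pattern to a made-up snippet (the sample string is an assumption, not taken from the dataset). Note that `.` does not match newlines by default, so a tag spanning several lines would be left in place; passing `re.DOTALL` is one possible way to handle that case.

```python
import re

# Same non-greedy pattern as in step 3, compiled once at module level.
TAG_RE = re.compile(r'<.*?>')

def rem_html_tags(body):
    """Remove anything that looks like an HTML tag from a string."""
    return TAG_RE.sub('', body)

# Hypothetical sample input, for illustration only.
sample = "<p>I have a <strong>little</strong> game written in C#.</p>"
cleaned = rem_html_tags(sample)
print(cleaned)
```

HTML entities such as `&lt;` are not tags and are left untouched by this pattern.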
44 |
45 | >Tags File
46 | *Code* : Stackoverflow Tags Map & Model.ipynb
47 |
48 | 1. Read Tags File
49 | 2. Identify top 10 Tags by count
50 | ```python
51 | tagCount = collections.Counter(list(df_tags['Tag'])).most_common(10)
52 | print(tagCount)
53 |
54 | [('javascript', 124155), ('java', 115212), ('c#', 101186), ('php', 98808), ('android', 90659), ('jquery', 78542), ('python', 64601), ('html', 58976), ('c++', 47591), ('ios', 47009)]
55 | ```
56 |
57 |
58 |
59 | 3. Reshape the tags dataframe so that each row holds the list of all tags for one question (grouped by question Id)
60 |
61 | ```python
62 | def add_tags(question_id):
63 |     return tag_top10[tag_top10['Id'] == question_id['Id']].Tag.values
64 |
65 | top10 = tag_top10.apply(add_tags, axis=1)
66 | ```
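The row-wise `apply` above filters the whole dataframe once per row, which can be slow on a table of this size. The sketch below shows the grouping idea itself in plain Python on a few made-up (Id, Tag) pairs. In pandas, a one-pass alternative would be `tag_top10.groupby('Id')['Tag'].apply(list)`. This is an alternative illustration, not the notebook's code; the name `group_tags` and the sample rows are assumptions.

```python
from collections import defaultdict

# Hypothetical (question Id, tag) pairs, for illustration only.
rows = [(260, 'c#'), (330, 'c++'), (930, 'c#'), (930, 'html'), (330, 'java')]

def group_tags(pairs):
    """Collect all tags of each question Id into one list, in a single pass."""
    grouped = defaultdict(list)
    for question_id, tag in pairs:
        grouped[question_id].append(tag)
    return dict(grouped)

tags_by_id = group_tags(rows)
print(tags_by_id)
```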
67 |
68 |
69 | >Combine the Questions and Tags
70 | *Code* : Stackoverflow Tags Map & Model.ipynb
71 |
72 | Merge the Questions and Tags data frame by ID
73 |
74 | ```python
75 | total=pd.merge(ques, top10_tags, on='Id')
76 | ```
77 |
78 | The dataset now has only Id, Title, Body and Tags
79 |
80 | >Text Preprocessing
81 | *Code* : Stackoverflow Tags Map & Model.ipynb
82 |
83 | We will use nltk, preprocessing from Keras and sklearn to process the text data
84 |
85 | *Tags preprocessing*
86 | Use MultiLabelBinarizer from sklearn on the class labels (Tags)
87 | ```python
88 | from sklearn.preprocessing import MultiLabelBinarizer
89 | multilabel_binarizer = MultiLabelBinarizer()
90 | multilabel_binarizer.fit(total.Tags)
91 | print(multilabel_binarizer.classes_)
92 |
93 | array(['android', 'c#', 'c++', 'html', 'ios', 'java', 'javascript','jquery', 'php', 'python'], dtype=object)
94 | ```
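To make the encoding concrete, the plain-Python sketch below mimics the kind of output `MultiLabelBinarizer` produces: one column per class in sorted order, with a 1 wherever the question carries that tag. The helper name `multi_hot` and the sample tag lists are assumptions for illustration.

```python
# Classes in the sorted order printed above.
classes = ['android', 'c#', 'c++', 'html', 'ios',
           'java', 'javascript', 'jquery', 'php', 'python']

def multi_hot(tags, classes):
    """Return a 0/1 vector with a 1 for every class present in `tags`."""
    present = set(tags)
    return [1 if c in present else 0 for c in classes]

# Hypothetical tag lists for two questions.
single = multi_hot(['c#'], classes)
multiple = multi_hot(['javascript', 'jquery', 'html'], classes)
print(single)
print(multiple)
```

Because a question can carry several tags at once, a row may contain more than one 1, which is what makes the task multilabel rather than plain multiclass.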
95 |
96 | *Title & Body Preprocessing*
97 | 1. Tokenize the words
98 | 2. Convert the tokenized words to sequences
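The two steps can be illustrated without any library. The sketch below builds a word index ordered by frequency (index 0 is kept free for padding), maps each text to integer ids, and left-pads to a fixed length. It only mirrors the general idea behind Keras' `Tokenizer` and `pad_sequences`; the function names, the whitespace tokenization and the sample titles are assumptions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id; more frequent words get lower ids."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}  # 0 is padding

def to_padded_sequence(text, word_index, max_len):
    """Convert text to ids, skip unknown words, keep the last max_len ids, left-pad with 0."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-max_len:]
    return [0] * (max_len - len(ids)) + ids

# Hypothetical titles, for illustration only.
titles = ["update version number", "version number in java"]
word_index = build_word_index(titles)
padded = [to_padded_sequence(t, word_index, max_len=5) for t in titles]
print(word_index)
print(padded)
```

Reserving 0 for padding is the reason the Embedding layers later use the vocabulary length + 1 as their input dimension.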
99 |
100 | **Model Building**
101 |
102 | Implemented a hybrid model in TensorFlow, using Keras as the high-level API. The architecture is an RNN with two input branches: one processes the Title data and the other processes the Body data. The outputs of both branches are concatenated and passed through dense layers before reaching the output layer.
103 |
104 | *RNN Model* : The model uses two GRU layers for the sequence data, one for the Title and the other for the Body.
105 |
106 | RNN for Title has
107 | - 1 Embedding layer with an input dimension of the Title vocabulary length (68969) + 1 (for 0 padding) and an output of 2000 embedding dimensions (for better results use full vocabulary length + 1)
108 | - 1 Gated recurrent unit (GRU) layer
109 | - 1 dense output layer of size 10 (the number of classes (tags) we are trying to predict)
110 |
111 | ```python
112 | # Title Only
113 | title_input = Input(name='title_input',shape=[max_len_t])
114 | title_Embed = Embedding(vocab_len_t+1,2000,input_length=max_len_t,mask_zero=True,name='title_Embed')(title_input)
115 | gru_out_t = GRU(300)(title_Embed)
116 | # auxiliary output to tune GRU weights smoothly
117 | auxiliary_output = Dense(10, activation='sigmoid', name='aux_output')(gru_out_t)
118 | ```
119 |
120 | RNN for Body has
121 | - 1 Embedding layer with an input dimension of the Body vocabulary length (1292018) + 1 (for 0 padding) and an output of 170 embedding dimensions (for better results use full vocabulary length + 1)
122 | - 1 Gated recurrent unit (GRU) layer
123 |
124 | ```python
125 | # Body Only
126 | body_input = Input(name='body_input',shape=[max_len_b])
127 | body_Embed = Embedding(vocab_len_b+1,170,input_length=max_len_b,mask_zero=True,name='body_Embed')(body_input)
128 | gru_out_b = GRU(200)(body_Embed)
129 | ```
130 |
131 | Combine the 2 GRU outputs
132 | ```python
133 | com = concatenate([gru_out_t, gru_out_b])
134 | ```
135 |
136 | The fully connected network has
137 | - 2 Dense Layers
138 | - 1 Dropout layer
139 | - 1 BatchNormalization layer
140 | - 1 Dense Output layer
141 |
142 | ```python
143 | # now the combined data is being fed to dense layers
144 | dense1 = Dense(400,activation='relu')(com)
145 | dp1 = Dropout(0.5)(dense1)
146 | bn = BatchNormalization()(dp1)
147 | dense2 = Dense(150,activation='relu')(bn)
148 | main_output = Dense(10, activation='sigmoid', name='main_output')(dense2)
149 | ```
150 |
151 | *Model compilation with optimizer='adam', loss='categorical_crossentropy', metrics='accuracy'*
152 |
153 | **Model Performance Review**
154 |
155 | *Classification Report to check Precision, Recall and F1 Score*
156 |
157 | The model seems to perform well enough, with a score of 84%. Increasing the size of the Embedding, GRU and dense layers may help in getting better results.
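For reference, the per-tag numbers in a classification report can be computed from the multi-hot label matrices as sketched below. The function name `per_label_scores` and the tiny example matrices are made up for illustration; they are not the model's actual predictions.

```python
def per_label_scores(y_true, y_pred, label_index):
    """Precision, recall and F1 for one label column of 0/1 matrices."""
    pairs = [(t[label_index], p[label_index]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for 4 questions (rows) and 2 tags (columns).
y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 1]]
scores = [per_label_scores(y_true, y_pred, i) for i in range(2)]
print(scores)
```

Looking at precision and recall per tag is usually more informative than a single accuracy figure when some tags are much rarer than others.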
158 |
159 |
160 |
161 | **Random Validation on Test Data**
162 |
163 |
164 |
165 |
166 | **Save the Model & Weights**
167 |
168 | Saving the model for transfer learning or model execution later
169 |
170 | ```python
171 | model.save('./stackoverflow_tags.h5')
172 | ```
173 |
--------------------------------------------------------------------------------
/Stackoverflow Clean Questions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
8 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
9 | },
10 | "outputs": [
11 | {
12 | "name": "stdout",
13 | "output_type": "stream",
14 | "text": [
15 | "['Answers.csv', 'Tags.csv', 'Questions.csv']\n"
16 | ]
17 | }
18 | ],
19 | "source": [
20 | "# This Python 3 environment comes with many helpful analytics libraries installed\n",
21 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n",
22 | "# For example, here's several helpful packages to load in \n",
23 | "\n",
24 | "import numpy as np # linear algebra\n",
25 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
26 | "\n",
27 | "# for counting\n",
28 | "import collections\n",
29 | "\n",
30 | "# Input data files are available in the \"../input/\" directory.\n",
31 | "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n",
32 | "\n",
33 | "import os\n",
34 | "print(os.listdir(\"../input\"))\n",
35 | "\n",
36 | "# Any results you write to the current directory are saved as output."
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 2,
42 | "metadata": {
43 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
44 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
45 | },
46 | "outputs": [
47 | {
48 | "data": {
49 | "text/html": [
50 | "
\n",
51 | "\n",
64 | "
\n",
65 | " \n",
66 | " \n",
67 | " \n",
68 | " Id \n",
69 | " OwnerUserId \n",
70 | " CreationDate \n",
71 | " ClosedDate \n",
72 | " Score \n",
73 | " Title \n",
74 | " Body \n",
75 | " \n",
76 | " \n",
77 | " \n",
78 | " \n",
79 | " 0 \n",
80 | " 80 \n",
81 | " 26.0 \n",
82 | " 2008-08-01T13:57:07Z \n",
83 | " NaN \n",
84 | " 26 \n",
85 | " SQLStatement.execute() - multiple queries in o... \n",
86 | " <p>I've written a database generation script i... \n",
87 | " \n",
88 | " \n",
89 | " 1 \n",
90 | " 90 \n",
91 | " 58.0 \n",
92 | " 2008-08-01T14:41:24Z \n",
93 | " 2012-12-26T03:45:49Z \n",
94 | " 144 \n",
95 | " Good branching and merging tutorials for Torto... \n",
96 | " <p>Are there any really good tutorials explain... \n",
97 | " \n",
98 | " \n",
99 | " 2 \n",
100 | " 120 \n",
101 | " 83.0 \n",
102 | " 2008-08-01T15:50:08Z \n",
103 | " NaN \n",
104 | " 21 \n",
105 | " ASP.NET Site Maps \n",
106 | " <p>Has anyone got experience creating <strong>... \n",
107 | " \n",
108 | " \n",
109 | " 3 \n",
110 | " 180 \n",
111 | " 2089740.0 \n",
112 | " 2008-08-01T18:42:19Z \n",
113 | " NaN \n",
114 | " 53 \n",
115 | " Function for creating color wheels \n",
116 | " <p>This is something I've pseudo-solved many t... \n",
117 | " \n",
118 | " \n",
119 | " 4 \n",
120 | " 260 \n",
121 | " 91.0 \n",
122 | " 2008-08-01T23:22:08Z \n",
123 | " NaN \n",
124 | " 49 \n",
125 | " Adding scripting functionality to .NET applica... \n",
126 | " <p>I have a little game written in C#. It uses... \n",
127 | " \n",
128 | " \n",
129 | " 5 \n",
130 | " 330 \n",
131 | " 63.0 \n",
132 | " 2008-08-02T02:51:36Z \n",
133 | " NaN \n",
134 | " 29 \n",
135 | " Should I use nested classes in this case? \n",
136 | " <p>I am working on a collection of classes use... \n",
137 | " \n",
138 | " \n",
139 | " 6 \n",
140 | " 470 \n",
141 | " 71.0 \n",
142 | " 2008-08-02T15:11:47Z \n",
143 | " 2016-03-26T05:23:29Z \n",
144 | " 13 \n",
145 | " Homegrown consumption of web services \n",
146 | " <p>I've been writing a few web services for a ... \n",
147 | " \n",
148 | " \n",
149 | " 7 \n",
150 | " 580 \n",
151 | " 91.0 \n",
152 | " 2008-08-02T23:30:59Z \n",
153 | " NaN \n",
154 | " 21 \n",
155 | " Deploying SQL Server Databases from Test to Live \n",
156 | " <p>I wonder how you guys manage deployment of ... \n",
157 | " \n",
158 | " \n",
159 | " 8 \n",
160 | " 650 \n",
161 | " 143.0 \n",
162 | " 2008-08-03T11:12:52Z \n",
163 | " NaN \n",
164 | " 79 \n",
165 | " Automatically update version number \n",
166 | " <p>I would like the version property of my app... \n",
167 | " \n",
168 | " \n",
169 | " 9 \n",
170 | " 810 \n",
171 | " 233.0 \n",
172 | " 2008-08-03T20:35:01Z \n",
173 | " NaN \n",
174 | " 9 \n",
175 | " Visual Studio Setup Project - Per User Registr... \n",
176 | " <p>I'm trying to maintain a Setup Project in <... \n",
177 | " \n",
178 | " \n",
179 | "
\n",
180 | "
"
181 | ],
182 | "text/plain": [
183 | " Id ... Body\n",
184 | "0 80 ... I've written a database generation script i...\n",
185 | "1 90 ... Are there any really good tutorials explain...\n",
186 | "2 120 ... Has anyone got experience creating ...\n",
187 | "3 180 ... This is something I've pseudo-solved many t...\n",
188 | "4 260 ... I have a little game written in C#. It uses...\n",
189 | "5 330 ... I am working on a collection of classes use...\n",
190 | "6 470 ... I've been writing a few web services for a ...\n",
191 | "7 580 ... I wonder how you guys manage deployment of ...\n",
192 | "8 650 ... I would like the version property of my app...\n",
193 | "9 810 ... I'm trying to maintain a Setup Project in <...\n",
194 | "\n",
195 | "[10 rows x 7 columns]"
196 | ]
197 | },
198 | "execution_count": 2,
199 | "metadata": {},
200 | "output_type": "execute_result"
201 | }
202 | ],
203 | "source": [
204 | "ques = pd.read_csv('../input/Questions.csv',encoding='iso-8859-1')\n",
205 | "ques.head(10)"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 3,
211 | "metadata": {
212 | "_uuid": "3a963c29707e29d8df14b8f48e45d25b336fcdb0"
213 | },
214 | "outputs": [
215 | {
216 | "data": {
217 | "text/html": [
218 | "
\n",
219 | "\n",
232 | "
\n",
233 | " \n",
234 | " \n",
235 | " \n",
236 | " Id \n",
237 | " Title \n",
238 | " Body \n",
239 | " \n",
240 | " \n",
241 | " \n",
242 | " \n",
243 | " 0 \n",
244 | " 80 \n",
245 | " SQLStatement.execute() - multiple queries in o... \n",
246 | " <p>I've written a database generation script i... \n",
247 | " \n",
248 | " \n",
249 | " 1 \n",
250 | " 90 \n",
251 | " Good branching and merging tutorials for Torto... \n",
252 | " <p>Are there any really good tutorials explain... \n",
253 | " \n",
254 | " \n",
255 | " 2 \n",
256 | " 120 \n",
257 | " ASP.NET Site Maps \n",
258 | " <p>Has anyone got experience creating <strong>... \n",
259 | " \n",
260 | " \n",
261 | " 3 \n",
262 | " 180 \n",
263 | " Function for creating color wheels \n",
264 | " <p>This is something I've pseudo-solved many t... \n",
265 | " \n",
266 | " \n",
267 | " 4 \n",
268 | " 260 \n",
269 | " Adding scripting functionality to .NET applica... \n",
270 | " <p>I have a little game written in C#. It uses... \n",
271 | " \n",
272 | " \n",
273 | " 5 \n",
274 | " 330 \n",
275 | " Should I use nested classes in this case? \n",
276 | " <p>I am working on a collection of classes use... \n",
277 | " \n",
278 | " \n",
279 | " 6 \n",
280 | " 470 \n",
281 | " Homegrown consumption of web services \n",
282 | " <p>I've been writing a few web services for a ... \n",
283 | " \n",
284 | " \n",
285 | " 7 \n",
286 | " 580 \n",
287 | " Deploying SQL Server Databases from Test to Live \n",
288 | " <p>I wonder how you guys manage deployment of ... \n",
289 | " \n",
290 | " \n",
291 | " 8 \n",
292 | " 650 \n",
293 | " Automatically update version number \n",
294 | " <p>I would like the version property of my app... \n",
295 | " \n",
296 | " \n",
297 | " 9 \n",
298 | " 810 \n",
299 | " Visual Studio Setup Project - Per User Registr... \n",
300 | " <p>I'm trying to maintain a Setup Project in <... \n",
301 | " \n",
302 | " \n",
303 | "
\n",
304 | "
"
305 | ],
306 | "text/plain": [
307 | " Id ... Body\n",
308 | "0 80 ... I've written a database generation script i...\n",
309 | "1 90 ... Are there any really good tutorials explain...\n",
310 | "2 120 ... Has anyone got experience creating ...\n",
311 | "3 180 ... This is something I've pseudo-solved many t...\n",
312 | "4 260 ... I have a little game written in C#. It uses...\n",
313 | "5 330 ... I am working on a collection of classes use...\n",
314 | "6 470 ... I've been writing a few web services for a ...\n",
315 | "7 580 ... I wonder how you guys manage deployment of ...\n",
316 | "8 650 ... I would like the version property of my app...\n",
317 | "9 810 ... I'm trying to maintain a Setup Project in <...\n",
318 | "\n",
319 | "[10 rows x 3 columns]"
320 | ]
321 | },
322 | "execution_count": 3,
323 | "metadata": {},
324 | "output_type": "execute_result"
325 | }
326 | ],
327 | "source": [
328 | "ques.drop([\"OwnerUserId\",\"CreationDate\",\"ClosedDate\",\"Score\"], axis=1, inplace=True)\n",
329 | "ques.head(10)"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 4,
335 | "metadata": {
336 | "_uuid": "b1305b3807f362fec50a13a76d68d35fa45c3380"
337 | },
338 | "outputs": [],
339 | "source": [
340 | "import re \n",
341 | "\n",
342 | "def rem_html_tags(body):\n",
343 | "    regex = re.compile('<.*?>')\n",
344 | "    return re.sub(regex, '', body)"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": 5,
350 | "metadata": {
351 | "_uuid": "f0340076f170cc798549f9da6ce46962d9f81c6b"
352 | },
353 | "outputs": [
354 | {
355 | "data": {
356 | "text/html": [
357 | "
\n",
358 | "\n",
371 | "
\n",
372 | " \n",
373 | " \n",
374 | " \n",
375 | " Id \n",
376 | " Title \n",
377 | " Body \n",
378 | " \n",
379 | " \n",
380 | " \n",
381 | " \n",
382 | " 0 \n",
383 | " 80 \n",
384 | " SQLStatement.execute() - multiple queries in o... \n",
385 | " I've written a database generation script in S... \n",
386 | " \n",
387 | " \n",
388 | " 1 \n",
389 | " 90 \n",
390 | " Good branching and merging tutorials for Torto... \n",
391 | " Are there any really good tutorials explaining... \n",
392 | " \n",
393 | " \n",
394 | " 2 \n",
395 | " 120 \n",
396 | " ASP.NET Site Maps \n",
397 | " Has anyone got experience creating SQL-based A... \n",
398 | " \n",
399 | " \n",
400 | " 3 \n",
401 | " 180 \n",
402 | " Function for creating color wheels \n",
403 | " This is something I've pseudo-solved many time... \n",
404 | " \n",
405 | " \n",
406 | " 4 \n",
407 | " 260 \n",
408 | " Adding scripting functionality to .NET applica... \n",
409 | " I have a little game written in C#. It uses a ... \n",
410 | " \n",
411 | " \n",
412 | "
\n",
413 | "
"
414 | ],
415 | "text/plain": [
416 | " Id ... Body\n",
417 | "0 80 ... I've written a database generation script in S...\n",
418 | "1 90 ... Are there any really good tutorials explaining...\n",
419 | "2 120 ... Has anyone got experience creating SQL-based A...\n",
420 | "3 180 ... This is something I've pseudo-solved many time...\n",
421 | "4 260 ... I have a little game written in C#. It uses a ...\n",
422 | "\n",
423 | "[5 rows x 3 columns]"
424 | ]
425 | },
426 | "execution_count": 5,
427 | "metadata": {},
428 | "output_type": "execute_result"
429 | }
430 | ],
431 | "source": [
432 | "ques['Body'] = ques['Body'].apply(rem_html_tags)\n",
433 | "ques.head()"
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": 6,
439 | "metadata": {
440 | "_uuid": "af150557dcf11050e0e7dbc8bcbb0ef2c5015636"
441 | },
442 | "outputs": [],
443 | "source": [
444 | "ques.to_csv('question_clean.csv',index=False)"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": 7,
450 | "metadata": {},
451 | "outputs": [],
452 | "source": []
453 | }
454 | ],
455 | "metadata": {
456 | "kernelspec": {
457 | "display_name": "Python 3",
458 | "language": "python",
459 | "name": "python3"
460 | },
461 | "language_info": {
462 | "codemirror_mode": {
463 | "name": "ipython",
464 | "version": 3
465 | },
466 | "file_extension": ".py",
467 | "mimetype": "text/x-python",
468 | "name": "python",
469 | "nbconvert_exporter": "python",
470 | "pygments_lexer": "ipython3",
471 | "version": "3.6.7"
472 | }
473 | },
474 | "nbformat": 4,
475 | "nbformat_minor": 1
476 | }
477 |
--------------------------------------------------------------------------------
/Stackoverflow Tags Map & Model.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
8 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
9 | },
10 | "outputs": [
11 | {
12 | "name": "stdout",
13 | "output_type": "stream",
14 | "text": [
15 | "['stackoverflow-clean-questions-file-v2', 'stackoverflow']\n"
16 | ]
17 | }
18 | ],
19 | "source": [
20 | "# This Python 3 environment comes with many helpful analytics libraries installed\n",
21 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n",
22 | "# For example, here's several helpful packages to load in \n",
23 | "\n",
24 | "import numpy as np # linear algebra\n",
25 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
26 | "\n",
27 | "# Input data files are available in the \"../input/\" directory.\n",
28 | "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n",
29 | "\n",
30 | "import os\n",
31 | "print(os.listdir(\"../input\"))\n",
32 | "\n",
33 | "# Plotting Libs\n",
34 | "import matplotlib.pyplot as plt\n",
35 | "import matplotlib.cm as cm\n",
36 | "# magic function\n",
37 | "%matplotlib inline\n",
38 | "\n",
39 | "import collections"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 2,
45 | "metadata": {
46 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
47 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
48 | },
49 | "outputs": [],
50 | "source": [
51 | "df_tags = pd.read_csv('../input/stackoverflow/Tags.csv', encoding='iso-8859-1')"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 3,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "def plot_tags(tagCount):\n",
61 | "    \n",
62 | "    x,y = zip(*tagCount)\n",
63 | "\n",
64 | "    colormap = plt.cm.gist_ncar #nipy_spectral, Set1,Paired \n",
65 | "    colors = [colormap(i) for i in np.linspace(0, 0.8,50)] \n",
66 | "\n",
67 | "    area = [i/4000 for i in list(y)] # 0 to 15 point radiuses\n",
68 | "    plt.figure(figsize=(9,8))\n",
69 | "    plt.ylabel(\"Number of question associations\")\n",
70 | "    for i in range(len(y)):\n",
71 | "        plt.plot(i,y[i], marker='o', linestyle='',ms=area[i],label=x[i])\n",
72 | "\n",
73 | "    plt.legend(numpoints=1)\n",
74 | "    plt.show()"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 4,
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "name": "stdout",
84 | "output_type": "stream",
85 | "text": [
86 | "[('javascript', 124155), ('java', 115212), ('c#', 101186), ('php', 98808), ('android', 90659), ('jquery', 78542), ('python', 64601), ('html', 58976), ('c++', 47591), ('ios', 47009)]\n"
87 | ]
88 | },
89 | {
90 | "data": {
91 | "image/png": "\n",
92 | "text/plain": [
93 | ""
94 | ]
95 | },
96 | "metadata": {},
97 | "output_type": "display_data"
98 | }
99 | ],
100 | "source": [
101 | "tagCount = collections.Counter(list(df_tags['Tag'])).most_common(10)\n",
102 | "print(tagCount)\n",
103 | "plot_tags(tagCount)"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 5,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "top10=['javascript','java','c#','php','android','jquery','python','html','c++','ios']"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 6,
118 | "metadata": {},
119 | "outputs": [
120 | {
121 | "name": "stdout",
122 | "output_type": "stream",
123 | "text": [
124 | "(826739, 2)\n"
125 | ]
126 | },
127 | {
128 | "data": {
182 | "text/plain": [
183 | " Id Tag\n",
184 | "14 260 c#\n",
185 | "18 330 c++\n",
186 | "28 650 c#\n",
187 | "35 930 c#\n",
188 | "39 1010 c#"
189 | ]
190 | },
191 | "execution_count": 6,
192 | "metadata": {},
193 | "output_type": "execute_result"
194 | }
195 | ],
196 | "source": [
197 | "tag_top10= df_tags[df_tags.Tag.isin(top10)]\n",
198 | "print (tag_top10.shape)\n",
199 | "tag_top10.head()"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 7,
205 | "metadata": {},
206 | "outputs": [
207 | {
208 | "data": {
209 | "text/plain": [
210 | "30798790 5\n",
211 | "31085960 5\n",
212 | "11648170 5\n",
213 | "35318730 5\n",
214 | "4009250 5\n",
215 | "30289880 5\n",
216 | "23267320 5\n",
217 | "35283570 5\n",
218 | "30991580 5\n",
219 | "23484760 5\n",
220 | "Name: Id, dtype: int64"
221 | ]
222 | },
223 | "execution_count": 7,
224 | "metadata": {},
225 | "output_type": "execute_result"
226 | }
227 | ],
228 | "source": [
229 | "tag_top10['Id'].value_counts().head(10)"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 8,
235 | "metadata": {},
236 | "outputs": [
237 | {
238 | "data": {
292 | "text/plain": [
293 | " Id Tag\n",
294 | "14 260 c#\n",
295 | "18 330 c++\n",
296 | "28 650 c#\n",
297 | "35 930 c#\n",
298 | "39 1010 c#"
299 | ]
300 | },
301 | "execution_count": 8,
302 | "metadata": {},
303 | "output_type": "execute_result"
304 | }
305 | ],
306 | "source": [
307 | "tag_top10.head()"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": 9,
313 | "metadata": {},
314 | "outputs": [],
315 | "source": [
316 | "def add_tags(question_id):\n",
317 | "    return tag_top10[tag_top10['Id'] == question_id['Id']].Tag.values\n",
318 | "\n",
319 | "top10 = tag_top10.apply(add_tags, axis=1)"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": 10,
325 | "metadata": {},
326 | "outputs": [
327 | {
328 | "data": {
329 | "text/plain": [
330 | "(826739, (826739, 2))"
331 | ]
332 | },
333 | "execution_count": 10,
334 | "metadata": {},
335 | "output_type": "execute_result"
336 | }
337 | ],
338 | "source": [
339 | "len(top10),tag_top10.shape"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 11,
345 | "metadata": {},
346 | "outputs": [
347 | {
348 | "data": {
408 | "text/plain": [
409 | " Id Tag Tags\n",
410 | "14 260 c# [c#]\n",
411 | "18 330 c++ [c++]\n",
412 | "28 650 c# [c#]\n",
413 | "35 930 c# [c#]\n",
414 | "39 1010 c# [c#]"
415 | ]
416 | },
417 | "execution_count": 11,
418 | "metadata": {},
419 | "output_type": "execute_result"
420 | }
421 | ],
422 | "source": [
423 | "tag_top10=pd.concat([tag_top10, top10.rename('Tags')], axis=1)\n",
424 | "tag_top10.head()"
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": 12,
430 | "metadata": {},
431 | "outputs": [
432 | {
433 | "data": {
434 | "text/plain": [
435 | "(826739, 2)"
436 | ]
437 | },
438 | "execution_count": 12,
439 | "metadata": {},
440 | "output_type": "execute_result"
441 | }
442 | ],
443 | "source": [
444 | "tag_top10.drop([\"Tag\"], axis=1, inplace=True)\n",
445 | "tag_top10.shape"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": 13,
451 | "metadata": {},
452 | "outputs": [],
453 | "source": [
454 | "top10_tags=tag_top10.loc[tag_top10.astype(str).drop_duplicates().index]"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": 14,
460 | "metadata": {},
461 | "outputs": [
462 | {
463 | "data": {
464 | "text/html": [
465 | "\n",
466 | "\n",
479 | "
\n",
480 | " \n",
481 | " \n",
482 | " \n",
483 | " Id \n",
484 | " Title \n",
485 | " Body \n",
486 | " \n",
487 | " \n",
488 | " \n",
489 | " \n",
490 | " 0 \n",
491 | " 80 \n",
492 | " SQLStatement.execute() - multiple queries in o... \n",
493 | " I've written a database generation script in S... \n",
494 | " \n",
495 | " \n",
496 | " 1 \n",
497 | " 90 \n",
498 | " Good branching and merging tutorials for Torto... \n",
499 | " Are there any really good tutorials explaining... \n",
500 | " \n",
501 | " \n",
502 | " 2 \n",
503 | " 120 \n",
504 | " ASP.NET Site Maps \n",
505 | " Has anyone got experience creating SQL-based A... \n",
506 | " \n",
507 | " \n",
508 | " 3 \n",
509 | " 180 \n",
510 | " Function for creating color wheels \n",
511 | " This is something I've pseudo-solved many time... \n",
512 | " \n",
513 | " \n",
514 | " 4 \n",
515 | " 260 \n",
516 | " Adding scripting functionality to .NET applica... \n",
517 | " I have a little game written in C#. It uses a ... \n",
518 | " \n",
519 | " \n",
520 | "
\n",
521 | "
"
522 | ],
523 | "text/plain": [
524 | " Id ... Body\n",
525 | "0 80 ... I've written a database generation script in S...\n",
526 | "1 90 ... Are there any really good tutorials explaining...\n",
527 | "2 120 ... Has anyone got experience creating SQL-based A...\n",
528 | "3 180 ... This is something I've pseudo-solved many time...\n",
529 | "4 260 ... I have a little game written in C#. It uses a ...\n",
530 | "\n",
531 | "[5 rows x 3 columns]"
532 | ]
533 | },
534 | "execution_count": 14,
535 | "metadata": {},
536 | "output_type": "execute_result"
537 | }
538 | ],
539 | "source": [
540 | "ques = pd.read_csv('../input/stackoverflow-clean-questions-file-v2/question_clean.csv', encoding='iso-8859-1')\n",
541 | "ques.head()"
542 | ]
543 | },
544 | {
545 | "cell_type": "code",
546 | "execution_count": 15,
547 | "metadata": {},
548 | "outputs": [
549 | {
550 | "name": "stdout",
551 | "output_type": "stream",
552 | "text": [
553 | "(706336, 4)\n"
554 | ]
555 | },
556 | {
557 | "data": {
558 | "text/html": [
559 | "\n",
560 | "\n",
573 | "
\n",
574 | " \n",
575 | " \n",
576 | " \n",
577 | " Id \n",
578 | " Title \n",
579 | " Body \n",
580 | " Tags \n",
581 | " \n",
582 | " \n",
583 | " \n",
584 | " \n",
585 | " 0 \n",
586 | " 260 \n",
587 | " Adding scripting functionality to .NET applica... \n",
588 | " I have a little game written in C#. It uses a ... \n",
589 | " [c#] \n",
590 | " \n",
591 | " \n",
592 | " 1 \n",
593 | " 330 \n",
594 | " Should I use nested classes in this case? \n",
595 | " I am working on a collection of classes used f... \n",
596 | " [c++] \n",
597 | " \n",
598 | " \n",
599 | " 2 \n",
600 | " 650 \n",
601 | " Automatically update version number \n",
602 | " I would like the version property of my applic... \n",
603 | " [c#] \n",
604 | " \n",
605 | " \n",
606 | " 3 \n",
607 | " 930 \n",
608 | " How do I connect to a database and loop over a... \n",
609 | " What's the simplest way to connect and query a... \n",
610 | " [c#] \n",
611 | " \n",
612 | " \n",
613 | " 4 \n",
614 | " 1010 \n",
615 | " How to get the value of built, encoded ViewState? \n",
616 | " I need to grab the base64-encoded representati... \n",
617 | " [c#] \n",
618 | " \n",
619 | " \n",
620 | "\n",
621 | ""
622 | ],
623 | "text/plain": [
624 | " Id ... Tags\n",
625 | "0 260 ... [c#]\n",
626 | "1 330 ... [c++]\n",
627 | "2 650 ... [c#]\n",
628 | "3 930 ... [c#]\n",
629 | "4 1010 ... [c#]\n",
630 | "\n",
631 | "[5 rows x 4 columns]"
632 | ]
633 | },
634 | "execution_count": 15,
635 | "metadata": {},
636 | "output_type": "execute_result"
637 | }
638 | ],
639 | "source": [
640 | "total=pd.merge(ques, top10_tags, on='Id')\n",
641 | "print(total.shape)\n",
642 | "total.head()"
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": 16,
648 | "metadata": {},
649 | "outputs": [
650 | {
651 | "name": "stderr",
652 | "output_type": "stream",
653 | "text": [
654 | "Using TensorFlow backend.\n"
655 | ]
656 | }
657 | ],
658 | "source": [
659 | "from sklearn.model_selection import train_test_split\n",
660 | "from sklearn.preprocessing import MultiLabelBinarizer\n",
661 | "from nltk import word_tokenize\n",
662 | "from keras.preprocessing.text import Tokenizer\n",
663 | "from keras.preprocessing import sequence\n",
664 | "from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding, BatchNormalization, GRU ,concatenate\n",
665 | "from keras.models import Model"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 17,
671 | "metadata": {},
672 | "outputs": [
673 | {
674 | "data": {
675 | "text/plain": [
676 | "array(['android', 'c#', 'c++', 'html', 'ios', 'java', 'javascript',\n",
677 | " 'jquery', 'php', 'python'], dtype=object)"
678 | ]
679 | },
680 | "execution_count": 17,
681 | "metadata": {},
682 | "output_type": "execute_result"
683 | }
684 | ],
685 | "source": [
686 | "multilabel_binarizer = MultiLabelBinarizer()\n",
687 | "multilabel_binarizer.fit(total.Tags)\n",
688 | "labels = multilabel_binarizer.classes_\n",
689 | "labels"
690 | ]
691 | },
692 | {
693 | "cell_type": "code",
694 | "execution_count": 18,
695 | "metadata": {},
696 | "outputs": [],
697 | "source": [
698 | "train,test=train_test_split(total[:550000],test_size=0.25,random_state=24)  # first 550,000 of the 706,336 merged rows, 75/25 split"
699 | ]
700 | },
701 | {
702 | "cell_type": "code",
703 | "execution_count": 19,
704 | "metadata": {},
705 | "outputs": [
706 | {
707 | "data": {
708 | "text/plain": [
709 | "((412500, 4), (137500, 4))"
710 | ]
711 | },
712 | "execution_count": 19,
713 | "metadata": {},
714 | "output_type": "execute_result"
715 | }
716 | ],
717 | "source": [
718 | "train.shape,test.shape"
719 | ]
720 | },
721 | {
722 | "cell_type": "code",
723 | "execution_count": 20,
724 | "metadata": {},
725 | "outputs": [],
726 | "source": [
727 | "X_train_t=train['Title']\n",
728 | "X_train_b=train['Body']\n",
729 | "y_train=multilabel_binarizer.transform(train['Tags'])\n",
730 | "X_test_t=test['Title']\n",
731 | "X_test_b=test['Body']\n",
732 | "y_test=multilabel_binarizer.transform(test['Tags'])"
733 | ]
734 | },
735 | {
736 | "cell_type": "code",
737 | "execution_count": 21,
738 | "metadata": {},
739 | "outputs": [
740 | {
741 | "data": {
742 | "text/plain": [
743 | "59"
744 | ]
745 | },
746 | "execution_count": 21,
747 | "metadata": {},
748 | "output_type": "execute_result"
749 | }
750 | ],
751 | "source": [
752 | "sent_lens_t=[]\n",
753 | "for sent in train['Title']:\n",
754 | " sent_lens_t.append(len(word_tokenize(sent)))\n",
755 | "max(sent_lens_t)"
756 | ]
757 | },
758 | {
759 | "cell_type": "code",
760 | "execution_count": 22,
761 | "metadata": {},
762 | "outputs": [
763 | {
764 | "data": {
765 | "text/plain": [
766 | "18.0"
767 | ]
768 | },
769 | "execution_count": 22,
770 | "metadata": {},
771 | "output_type": "execute_result"
772 | }
773 | ],
774 | "source": [
775 | "np.quantile(sent_lens_t,0.97)"
776 | ]
777 | },
778 | {
779 | "cell_type": "code",
780 | "execution_count": 23,
781 | "metadata": {},
782 | "outputs": [],
783 | "source": [
784 | "max_len_t = 18  # 97th percentile of title token counts (word_tokenize), computed above\n",
785 | "tok = Tokenizer(char_level=False,split=' ')\n",
786 | "tok.fit_on_texts(X_train_t)\n",
787 | "sequences_train_t = tok.texts_to_sequences(X_train_t)"
788 | ]
789 | },
790 | {
791 | "cell_type": "code",
792 | "execution_count": 24,
793 | "metadata": {},
794 | "outputs": [
795 | {
796 | "data": {
797 | "text/plain": [
798 | "68969"
799 | ]
800 | },
801 | "execution_count": 24,
802 | "metadata": {},
803 | "output_type": "execute_result"
804 | }
805 | ],
806 | "source": [
807 | "vocab_len_t=len(tok.index_word.keys())\n",
808 | "vocab_len_t"
809 | ]
810 | },
811 | {
812 | "cell_type": "code",
813 | "execution_count": 25,
814 | "metadata": {},
815 | "outputs": [
816 | {
817 | "data": {
818 | "text/plain": [
819 | "array([[ 0, 0, 0, ..., 1, 957, 197],\n",
820 | " [ 0, 0, 0, ..., 9081, 45, 533],\n",
821 | " [ 0, 0, 0, ..., 147, 8, 230],\n",
822 | " ...,\n",
823 | " [ 0, 0, 0, ..., 10, 71, 2985],\n",
824 | " [ 0, 0, 0, ..., 2, 18, 75],\n",
825 | " [ 0, 0, 0, ..., 11009, 809, 267]], dtype=int32)"
826 | ]
827 | },
828 | "execution_count": 25,
829 | "metadata": {},
830 | "output_type": "execute_result"
831 | }
832 | ],
833 | "source": [
834 | "sequences_matrix_train_t = sequence.pad_sequences(sequences_train_t,maxlen=max_len_t)\n",
835 | "sequences_matrix_train_t"
836 | ]
837 | },
838 | {
839 | "cell_type": "code",
840 | "execution_count": 26,
841 | "metadata": {},
842 | "outputs": [],
843 | "source": [
844 | "sequences_test_t = tok.texts_to_sequences(X_test_t)\n",
845 | "sequences_matrix_test_t = sequence.pad_sequences(sequences_test_t,maxlen=max_len_t)"
846 | ]
847 | },
848 | {
849 | "cell_type": "code",
850 | "execution_count": 27,
851 | "metadata": {},
852 | "outputs": [
853 | {
854 | "data": {
855 | "text/plain": [
856 | "((412500, 18), (137500, 18), (412500, 10), (137500, 10))"
857 | ]
858 | },
859 | "execution_count": 27,
860 | "metadata": {},
861 | "output_type": "execute_result"
862 | }
863 | ],
864 | "source": [
865 | "sequences_matrix_train_t.shape,sequences_matrix_test_t.shape,y_train.shape,y_test.shape"
866 | ]
867 | },
868 | {
869 | "cell_type": "code",
870 | "execution_count": 28,
871 | "metadata": {},
872 | "outputs": [
873 | {
874 | "data": {
875 | "text/plain": [
876 | "20853"
877 | ]
878 | },
879 | "execution_count": 28,
880 | "metadata": {},
881 | "output_type": "execute_result"
882 | }
883 | ],
884 | "source": [
885 | "sent_lens_b=[]\n",
886 | "for sent in train['Body']:\n",
887 | " sent_lens_b.append(len(word_tokenize(sent)))\n",
888 | "max(sent_lens_b)"
889 | ]
890 | },
891 | {
892 | "cell_type": "code",
893 | "execution_count": 29,
894 | "metadata": {},
895 | "outputs": [
896 | {
897 | "data": {
898 | "text/plain": [
899 | "575.0"
900 | ]
901 | },
902 | "execution_count": 29,
903 | "metadata": {},
904 | "output_type": "execute_result"
905 | }
906 | ],
907 | "source": [
908 | "np.quantile(sent_lens_b,0.90)"
909 | ]
910 | },
911 | {
912 | "cell_type": "code",
913 | "execution_count": 30,
914 | "metadata": {},
915 | "outputs": [],
916 | "source": [
917 | "max_len_b = 600  # a little above the 90th percentile of body token counts (575), computed above\n",
918 | "tok = Tokenizer(char_level=False,split=' ')\n",
919 | "tok.fit_on_texts(X_train_b)\n",
920 | "sequences_train_b = tok.texts_to_sequences(X_train_b)"
921 | ]
922 | },
923 | {
924 | "cell_type": "code",
925 | "execution_count": 31,
926 | "metadata": {},
927 | "outputs": [
928 | {
929 | "data": {
930 | "text/plain": [
931 | "1292018"
932 | ]
933 | },
934 | "execution_count": 31,
935 | "metadata": {},
936 | "output_type": "execute_result"
937 | }
938 | ],
939 | "source": [
940 | "vocab_len_b =len(tok.index_word.keys())\n",
941 | "vocab_len_b "
942 | ]
943 | },
944 | {
945 | "cell_type": "code",
946 | "execution_count": 32,
947 | "metadata": {},
948 | "outputs": [
949 | {
950 | "data": {
951 | "text/plain": [
952 | "array([[ 0, 0, 0, ..., 51, 2082, 91],\n",
953 | " [ 0, 0, 0, ..., 1408, 203, 825],\n",
954 | " [ 0, 0, 0, ..., 34, 51, 83],\n",
955 | " ...,\n",
956 | " [ 0, 0, 0, ..., 20, 68, 687],\n",
957 | " [ 0, 0, 0, ..., 187, 58, 10],\n",
958 | " [ 0, 0, 0, ..., 194, 197, 10]], dtype=int32)"
959 | ]
960 | },
961 | "execution_count": 32,
962 | "metadata": {},
963 | "output_type": "execute_result"
964 | }
965 | ],
966 | "source": [
967 | "sequences_matrix_train_b = sequence.pad_sequences(sequences_train_b,maxlen=max_len_b)\n",
968 | "sequences_matrix_train_b"
969 | ]
970 | },
971 | {
972 | "cell_type": "code",
973 | "execution_count": 33,
974 | "metadata": {},
975 | "outputs": [],
976 | "source": [
977 | "sequences_test_b = tok.texts_to_sequences(X_test_b)\n",
978 | "sequences_matrix_test_b = sequence.pad_sequences(sequences_test_b,maxlen=max_len_b)"
979 | ]
980 | },
981 | {
982 | "cell_type": "code",
983 | "execution_count": 34,
984 | "metadata": {},
985 | "outputs": [
986 | {
987 | "data": {
988 | "text/plain": [
989 | "((412500, 18), (412500, 600), (412500, 10))"
990 | ]
991 | },
992 | "execution_count": 34,
993 | "metadata": {},
994 | "output_type": "execute_result"
995 | }
996 | ],
997 | "source": [
998 | "sequences_matrix_train_t.shape,sequences_matrix_train_b.shape,y_train.shape"
999 | ]
1000 | },
1001 | {
1002 | "cell_type": "code",
1003 | "execution_count": 35,
1004 | "metadata": {},
1005 | "outputs": [
1006 | {
1007 | "data": {
1008 | "text/plain": [
1009 | "((137500, 18), (137500, 600), (137500, 10))"
1010 | ]
1011 | },
1012 | "execution_count": 35,
1013 | "metadata": {},
1014 | "output_type": "execute_result"
1015 | }
1016 | ],
1017 | "source": [
1018 | "sequences_matrix_test_t.shape,sequences_matrix_test_b.shape,y_test.shape"
1019 | ]
1020 | },
1021 | {
1022 | "cell_type": "code",
1023 | "execution_count": 36,
1024 | "metadata": {},
1025 | "outputs": [],
1026 | "source": [
1027 | "def RNN():\n",
1028 | " # Title Only\n",
1029 | " title_input = Input(name='title_input',shape=[max_len_t])\n",
1030 | " title_Embed = Embedding(vocab_len_t+1,2000,input_length=max_len_t,mask_zero=True,name='title_Embed')(title_input)\n",
1031 | " gru_out_t = GRU(300)(title_Embed)\n",
1032 | " # auxiliary output to tune GRU weights smoothly \n",
1033 | " auxiliary_output = Dense(10, activation='sigmoid', name='aux_output')(gru_out_t) \n",
1034 | " \n",
1035 | " # Body Only\n",
1036 | " body_input = Input(name='body_input',shape=[max_len_b]) \n",
1037 | " body_Embed = Embedding(vocab_len_b+1,170,input_length=max_len_b,mask_zero=True,name='body_Embed')(body_input)\n",
1038 | " gru_out_b = GRU(200)(body_Embed)\n",
1039 | " \n",
1040 | " # combined with GRU output\n",
1041 | " com = concatenate([gru_out_t, gru_out_b])\n",
1042 | " \n",
1043 | " # now the combined data is being fed to dense layers\n",
1044 | " dense1 = Dense(400,activation='relu')(com)\n",
1045 | " dp1 = Dropout(0.5)(dense1)\n",
1046 | " bn = BatchNormalization()(dp1) \n",
1047 | " dense2 = Dense(150,activation='relu')(bn)\n",
1048 | " \n",
1049 | " main_output = Dense(10, activation='sigmoid', name='main_output')(dense2)\n",
1050 | " \n",
1051 | " model = Model(inputs=[title_input, body_input],outputs=[main_output, auxiliary_output])\n",
1052 | " return model"
1053 | ]
1054 | },
1055 | {
1056 | "cell_type": "code",
1057 | "execution_count": 37,
1058 | "metadata": {},
1059 | "outputs": [
1060 | {
1061 | "name": "stdout",
1062 | "output_type": "stream",
1063 | "text": [
1064 | "WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
1065 | "Instructions for updating:\n",
1066 | "Colocations handled automatically by placer.\n",
1067 | "WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
1068 | "Instructions for updating:\n",
1069 | "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n",
1070 | "__________________________________________________________________________________________________\n",
1071 | "Layer (type) Output Shape Param # Connected to \n",
1072 | "==================================================================================================\n",
1073 | "title_input (InputLayer) (None, 18) 0 \n",
1074 | "__________________________________________________________________________________________________\n",
1075 | "body_input (InputLayer) (None, 600) 0 \n",
1076 | "__________________________________________________________________________________________________\n",
1077 | "title_Embed (Embedding) (None, 18, 2000) 137940000 title_input[0][0] \n",
1078 | "__________________________________________________________________________________________________\n",
1079 | "body_Embed (Embedding) (None, 600, 170) 219643230 body_input[0][0] \n",
1080 | "__________________________________________________________________________________________________\n",
1081 | "gru_1 (GRU) (None, 300) 2070900 title_Embed[0][0] \n",
1082 | "__________________________________________________________________________________________________\n",
1083 | "gru_2 (GRU) (None, 200) 222600 body_Embed[0][0] \n",
1084 | "__________________________________________________________________________________________________\n",
1085 | "concatenate_1 (Concatenate) (None, 500) 0 gru_1[0][0] \n",
1086 | " gru_2[0][0] \n",
1087 | "__________________________________________________________________________________________________\n",
1088 | "dense_1 (Dense) (None, 400) 200400 concatenate_1[0][0] \n",
1089 | "__________________________________________________________________________________________________\n",
1090 | "dropout_1 (Dropout) (None, 400) 0 dense_1[0][0] \n",
1091 | "__________________________________________________________________________________________________\n",
1092 | "batch_normalization_1 (BatchNor (None, 400) 1600 dropout_1[0][0] \n",
1093 | "__________________________________________________________________________________________________\n",
1094 | "dense_2 (Dense) (None, 150) 60150 batch_normalization_1[0][0] \n",
1095 | "__________________________________________________________________________________________________\n",
1096 | "main_output (Dense) (None, 10) 1510 dense_2[0][0] \n",
1097 | "__________________________________________________________________________________________________\n",
1098 | "aux_output (Dense) (None, 10) 3010 gru_1[0][0] \n",
1099 | "==================================================================================================\n",
1100 | "Total params: 360,143,400\n",
1101 | "Trainable params: 360,142,600\n",
1102 | "Non-trainable params: 800\n",
1103 | "__________________________________________________________________________________________________\n"
1104 | ]
1105 | }
1106 | ],
1107 | "source": [
1108 | "model = RNN()\n",
1109 | "model.summary()"
1110 | ]
1111 | },
1112 | {
1113 | "cell_type": "code",
1114 | "execution_count": 38,
1115 | "metadata": {},
1116 | "outputs": [],
1117 | "source": [
1118 | "model.compile(optimizer='adam',loss={'main_output': 'binary_crossentropy', 'aux_output': 'binary_crossentropy'},\n",
1119 | " metrics=['accuracy'])"
1120 | ]
1121 | },
1122 | {
1123 | "cell_type": "code",
1124 | "execution_count": 39,
1125 | "metadata": {},
1126 | "outputs": [
1127 | {
1128 | "name": "stdout",
1129 | "output_type": "stream",
1130 | "text": [
1131 | "WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
1132 | "Instructions for updating:\n",
1133 | "Use tf.cast instead.\n"
1134 | ]
1135 | },
1136 | {
1137 | "name": "stderr",
1138 | "output_type": "stream",
1139 | "text": [
1140 | "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:107: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 137940000 elements. This may consume a large amount of memory.\n",
1141 | " num_elements)\n",
1142 | "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:107: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 219643230 elements. This may consume a large amount of memory.\n",
1143 | " num_elements)\n"
1144 | ]
1145 | },
1146 | {
1147 | "name": "stdout",
1148 | "output_type": "stream",
1149 | "text": [
1150 | "Train on 412500 samples, validate on 137500 samples\n",
1151 | "Epoch 1/5\n",
1152 | "412500/412500 [==============================] - 819s 2ms/step - loss: 2.3168 - main_output_loss: 1.0405 - aux_output_loss: 1.2763 - main_output_acc: 0.7285 - aux_output_acc: 0.6705 - val_loss: 1.7714 - val_main_output_loss: 0.7258 - val_aux_output_loss: 1.0456 - val_main_output_acc: 0.8254 - val_aux_output_acc: 0.7285\n",
1153 | "Epoch 2/5\n",
1154 | "412500/412500 [==============================] - 797s 2ms/step - loss: 1.5874 - main_output_loss: 0.6490 - aux_output_loss: 0.9384 - main_output_acc: 0.8443 - aux_output_acc: 0.7582 - val_loss: 1.7043 - val_main_output_loss: 0.6581 - val_aux_output_loss: 1.0462 - val_main_output_acc: 0.8379 - val_aux_output_acc: 0.7286\n",
1155 | "Epoch 3/5\n",
1156 | "412500/412500 [==============================] - 797s 2ms/step - loss: 1.3994 - main_output_loss: 0.5472 - aux_output_loss: 0.8522 - main_output_acc: 0.8650 - aux_output_acc: 0.7774 - val_loss: 1.7369 - val_main_output_loss: 0.6660 - val_aux_output_loss: 1.0708 - val_main_output_acc: 0.8364 - val_aux_output_acc: 0.7263\n",
1157 | "Epoch 4/5\n",
1158 | "412500/412500 [==============================] - 798s 2ms/step - loss: 1.2719 - main_output_loss: 0.4735 - aux_output_loss: 0.7984 - main_output_acc: 0.8774 - aux_output_acc: 0.7885 - val_loss: 1.7988 - val_main_output_loss: 0.6976 - val_aux_output_loss: 1.1012 - val_main_output_acc: 0.8369 - val_aux_output_acc: 0.7240\n",
1159 | "Epoch 5/5\n",
1160 | "412500/412500 [==============================] - 797s 2ms/step - loss: 1.1665 - main_output_loss: 0.4110 - aux_output_loss: 0.7555 - main_output_acc: 0.8868 - aux_output_acc: 0.7976 - val_loss: 1.9099 - val_main_output_loss: 0.7671 - val_aux_output_loss: 1.1428 - val_main_output_acc: 0.8307 - val_aux_output_acc: 0.7237\n"
1161 | ]
1162 | }
1163 | ],
1164 | "source": [
1165 | "results=model.fit({'title_input': sequences_matrix_train_t, 'body_input': sequences_matrix_train_b},\n",
1166 | " {'main_output': y_train, 'aux_output': y_train},\n",
1167 | " validation_data=({'title_input': sequences_matrix_test_t, 'body_input': sequences_matrix_test_b},\n",
1168 | " {'main_output': y_test, 'aux_output': y_test}),\n",
1169 | " epochs=5, batch_size=800)"
1170 | ]
1171 | },
1172 | {
1173 | "cell_type": "code",
1174 | "execution_count": 68,
1175 | "metadata": {},
1176 | "outputs": [
1177 | {
1178 | "name": "stdout",
1179 | "output_type": "stream",
1180 | "text": [
1181 | "137500/137500 [==============================] - 1270s 9ms/step\n"
1182 | ]
1183 | }
1184 | ],
1185 | "source": [
1186 | "(predicted_main, predicted_aux)=model.predict({'title_input': sequences_matrix_test_t, 'body_input': sequences_matrix_test_b},verbose=1)"
1187 | ]
1188 | },
1189 | {
1190 | "cell_type": "code",
1191 | "execution_count": 70,
1192 | "metadata": {},
1193 | "outputs": [],
1194 | "source": [
1195 | "from sklearn.metrics import classification_report,f1_score"
1196 | ]
1197 | },
1198 | {
1199 | "cell_type": "code",
1200 | "execution_count": 138,
1201 | "metadata": {},
1202 | "outputs": [
1203 | {
1204 | "name": "stdout",
1205 | "output_type": "stream",
1206 | "text": [
1207 | "0.8424636536796537\n"
1208 | ]
1209 | },
1210 | {
1211 | "name": "stderr",
1212 | "output_type": "stream",
1213 | "text": [
1214 | "/opt/conda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in samples with no predicted labels.\n",
1215 | " 'precision', 'predicted', average, warn_for)\n"
1216 | ]
1217 | }
1218 | ],
1219 | "source": [
1220 | "print(f1_score(y_test,predicted_main>.55,average='samples'))  # 0.55 is the decision threshold applied to the sigmoid scores"
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "code",
1225 | "execution_count": 137,
1226 | "metadata": {},
1227 | "outputs": [
1228 | {
1229 | "name": "stdout",
1230 | "output_type": "stream",
1231 | "text": [
1232 | " precision recall f1-score support\n",
1233 | "\n",
1234 | " 0 0.97 0.93 0.95 17054\n",
1235 | " 1 0.92 0.84 0.88 20681\n",
1236 | " 2 0.92 0.81 0.86 9700\n",
1237 | " 3 0.69 0.53 0.60 11304\n",
1238 | " 4 0.96 0.91 0.94 8897\n",
1239 | " 5 0.91 0.80 0.85 22472\n",
1240 | " 6 0.82 0.72 0.76 22938\n",
1241 | " 7 0.81 0.83 0.82 16150\n",
1242 | " 8 0.92 0.90 0.91 19659\n",
1243 | " 9 0.97 0.92 0.95 11576\n",
1244 | "\n",
1245 | " micro avg 0.89 0.82 0.85 160431\n",
1246 | " macro avg 0.89 0.82 0.85 160431\n",
1247 | "weighted avg 0.89 0.82 0.85 160431\n",
1248 | " samples avg 0.86 0.85 0.84 160431\n",
1249 | "\n"
1250 | ]
1251 | },
1252 | {
1253 | "name": "stderr",
1254 | "output_type": "stream",
1255 | "text": [
1256 | "/opt/conda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels.\n",
1257 | " 'precision', 'predicted', average, warn_for)\n"
1258 | ]
1259 | }
1260 | ],
1261 | "source": [
1262 | "print(classification_report(y_test,predicted_main>.55))"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": 131,
1268 | "metadata": {},
1269 | "outputs": [
1270 | {
1271 | "data": {
1272 | "text/plain": [
1273 | "Id 16470700\n",
1274 | "Title NetworkOnMainThreadException- Have tried makin...\n",
1275 | "Body I've been trying to get this to work for a whi...\n",
1276 | "Tags [java, android]\n",
1277 | "Name: 250148, dtype: object"
1278 | ]
1279 | },
1280 | "execution_count": 131,
1281 | "metadata": {},
1282 | "output_type": "execute_result"
1283 | }
1284 | ],
1285 | "source": [
1286 | "test.iloc[24]"
1287 | ]
1288 | },
1289 | {
1290 | "cell_type": "code",
1291 | "execution_count": 134,
1292 | "metadata": {},
1293 | "outputs": [
1294 | {
1295 | "data": {
1296 | "text/plain": [
1297 | "array([1. , 0. , 0. , 0. , 0. , 0.84, 0. , 0. , 0. , 0. ],\n",
1298 | " dtype=float32)"
1299 | ]
1300 | },
1301 | "execution_count": 134,
1302 | "metadata": {},
1303 | "output_type": "execute_result"
1304 | }
1305 | ],
1306 | "source": [
1307 | "predicted_main[24].round(decimals = 2)"
1308 | ]
1309 | },
1310 | {
1311 | "cell_type": "code",
1312 | "execution_count": 92,
1313 | "metadata": {},
1314 | "outputs": [
1315 | {
1316 | "data": {
1317 | "text/plain": [
1318 | "array(['android', 'c#', 'c++', 'html', 'ios', 'java', 'javascript',\n",
1319 | " 'jquery', 'php', 'python'], dtype=object)"
1320 | ]
1321 | },
1322 | "execution_count": 92,
1323 | "metadata": {},
1324 | "output_type": "execute_result"
1325 | }
1326 | ],
1327 | "source": [
1328 | "labels"
1329 | ]
1330 | },
1331 | {
1332 | "cell_type": "code",
1333 | "execution_count": 79,
1334 | "metadata": {},
1335 | "outputs": [],
1336 | "source": [
1337 | "model.save('./stackoverflow_tags.h5')"
1338 | ]
1339 | }
1340 | ],
1341 | "metadata": {
1342 | "kernelspec": {
1343 | "display_name": "Python 3",
1344 | "language": "python",
1345 | "name": "python3"
1346 | },
1347 | "language_info": {
1348 | "codemirror_mode": {
1349 | "name": "ipython",
1350 | "version": 3
1351 | },
1352 | "file_extension": ".py",
1353 | "mimetype": "text/x-python",
1354 | "name": "python",
1355 | "nbconvert_exporter": "python",
1356 | "pygments_lexer": "ipython3",
1357 | "version": "3.6.7"
1358 | }
1359 | },
1360 | "nbformat": 4,
1361 | "nbformat_minor": 1
1362 | }
1363 |
--------------------------------------------------------------------------------
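The notebook thresholds `predicted_main` at 0.55 before scoring, but it never maps a row of scores back to tag names. A minimal sketch of that step follows. It assumes the tag order printed by `multilabel_binarizer.classes_` and reuses the rounded scores shown for `predicted_main[24]`. The helper `decode_tags` is hypothetical and not part of the notebook.

```python
# Tag order as printed by `multilabel_binarizer.classes_` in the notebook.
labels = ['android', 'c#', 'c++', 'html', 'ios', 'java',
          'javascript', 'jquery', 'php', 'python']


def decode_tags(scores, labels, threshold=0.55):
    """Return the tag names whose score is strictly above `threshold`.

    Hypothetical helper: 0.55 mirrors the cut-off the notebook applies
    before calling `f1_score` and `classification_report`.
    """
    return [tag for tag, score in zip(labels, scores) if score > threshold]


# Rounded scores printed in the notebook for `predicted_main[24]`.
example_scores = [1.0, 0.0, 0.0, 0.0, 0.0, 0.84, 0.0, 0.0, 0.0, 0.0]

print(decode_tags(example_scores, labels))
```

For this row the sketch prints `['android', 'java']`, which matches the `[java, android]` tags shown for `test.iloc[24]`. The same function also accepts a NumPy row such as `predicted_main[i]`, since it only iterates over the scores.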