├── Images
    ├── Classification_Report.JPG
    ├── prediction.JPG
    └── top10_tags.JPG
├── README.md
├── Stackoverflow Clean Questions.ipynb
└── Stackoverflow Tags Map & Model.ipynb


/Images/Classification_Report.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/theRajeshReddy/StackOverFlow-Classification/8ba027db0dfa3d09724a62838bebdd979b031af0/Images/Classification_Report.JPG

--------------------------------------------------------------------------------
/Images/prediction.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/theRajeshReddy/StackOverFlow-Classification/8ba027db0dfa3d09724a62838bebdd979b031af0/Images/prediction.JPG

--------------------------------------------------------------------------------
/Images/top10_tags.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/theRajeshReddy/StackOverFlow-Classification/8ba027db0dfa3d09724a62838bebdd979b031af0/Images/top10_tags.JPG

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Multiclass Multilabel Prediction For StackOverflow Questions

**Data set** : https://www.kaggle.com/therajeshreddy/stackoverflow

**Objective** : Given the text of questions from Stack Overflow posts, predict the tags associated with them.

This is a scaled-down version that predicts only the 10 most frequently occurring tags.

**Programming Language** : Python using nltk & Keras

**Model Architecture** : Deep Learning using Recurrent Neural Network (RNN)

**About Data Set**

The dataset contains the text of questions and answers, and their corresponding tags, from the Stack Overflow programming Q&A website.

It is organized as three files:

1. Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions.

2. Tags contains the tags on each of these questions.

3. Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table. *We don't use this file, as we want to predict tags given a question.*

**Data Pre-Processing**

>Questions File
*Code* : Stackoverflow Clean Questions.ipynb

1. Read the Questions file
2. Drop all columns except Id, Title and Body
3. The text in the Body column contains many HTML tags. We use a regular expression to clean the Body column by removing them.
```python
import re

def rem_html_tags(body):
    # Non-greedy match of anything between '<' and '>'
    regex = re.compile('<.*?>')
    return re.sub(regex, '', body)

ques['Body'] = ques['Body'].apply(rem_html_tags)
```
4. Save the questions file for later use
```python
ques.to_csv('question_clean.csv',index=False)
```

>Tags File
*Code* : Stackoverflow Tags Map & Model.ipynb

1. Read the Tags file
2. Identify the top 10 tags by count
```python
import collections

tagCount = collections.Counter(list(df_tags['Tag'])).most_common(10)
print(tagCount)

# Output:
# [('javascript', 124155), ('java', 115212), ('c#', 101186), ('php', 98808), ('android', 90659), ('jquery', 78542), ('python', 64601), ('html', 58976), ('c++', 47591), ('ios', 47009)]
```

3. Reshape the tags dataframe so that all the tags for a question ID are held as a list in one row (grouped by question ID)

```python
# For each row, collect every top-10 tag that shares the row's question Id
def add_tags(question_id):
    return tag_top10[tag_top10['Id'] == question_id['Id']].Tag.values

top10 = tag_top10.apply(add_tags, axis=1)
```


>Combine the Questions and Tags
*Code* : Stackoverflow Tags Map & Model.ipynb

Merge the Questions and Tags data frames by Id

```python
total=pd.merge(ques, top10_tags, on='Id')
```

The merged dataset now has only Id, Title, Body and Tags.

>Text Preprocessing
*Code* : Stackoverflow Tags Map & Model.ipynb

We use nltk, the preprocessing utilities from Keras, and sklearn to process the text data.

*Tags preprocessing*
Use MultiLabelBinarizer from sklearn on the class labels (Tags)
```python
from sklearn.preprocessing import MultiLabelBinarizer
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(total.Tags)
print(multilabel_binarizer.classes_)

# classes_ holds:
# array(['android', 'c#', 'c++', 'html', 'ios', 'java', 'javascript','jquery', 'php', 'python'], dtype=object)
```

*Title & Body Preprocessing*
1. Tokenize the words
2. Convert the tokenized words to sequences

**Model Building**

A hybrid model is implemented in TensorFlow, using Keras as the high-level API. The architecture is an RNN with two input branches: one learns from the Title data and the other from the Body data. The outputs of both branches are concatenated and passed through dense layers before reaching the output layer.

*RNN Model* : The model uses two GRU layers for the sequence data, one for Title and one for Body.
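
The Title and Body inputs to this model are padded integer sequences (steps 1 and 2 under *Title & Body Preprocessing*). The sketch below illustrates that conversion in plain Python. It is a simplified stand-in, not the notebook's code: the helper names `tokenize`, `build_word_index` and `to_padded_sequences`, the token regex, the sample titles and the value `max_len_t = 8` are all assumptions made for illustration. It mirrors the usual convention of reserving id 0 for padding and padding or truncating at the front of each sequence.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep letters, digits, '#' and '+'
    # so that tags such as "c#" and "c++" survive as single tokens.
    return re.findall(r"[a-z0-9#+]+", text.lower())

def build_word_index(texts):
    # Most frequent word gets id 1; id 0 is reserved for padding.
    counts = Counter(word for text in texts for word in tokenize(text))
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties: alphabetical
    return {word: i + 1 for i, word in enumerate(ordered)}

def to_padded_sequences(texts, word_index, max_len):
    # Convert each text to word ids, keep the last max_len ids,
    # and left-pad with zeros up to max_len (max_len must be positive).
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in tokenize(text) if w in word_index]
        ids = ids[-max_len:]
        sequences.append([0] * (max_len - len(ids)) + ids)
    return sequences

# Illustrative data only
titles = ["How to sort a list in python", "Sort array in javascript"]
word_index = build_word_index(titles)
vocab_len_t = len(word_index)   # vocabulary size; the Embedding input uses vocab_len_t + 1
max_len_t = 8                   # assumed fixed sequence length
title_seqs = to_padded_sequences(titles, word_index, max_len_t)
print(vocab_len_t, title_seqs)
```

The same steps would be repeated on the Body column to obtain `vocab_len_b`, `max_len_b` and the Body sequences. Reserving id 0 for padding is what makes `vocab_len + 1` the Embedding input size and lets `mask_zero=True` skip the padded positions.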

RNN for Title has
- 1 Embedding layer with an input dimension of the Title vocabulary length (68969) + 1 (for 0 padding) and an output dimension of 2000 (for better results use the full vocabulary length + 1)
- 1 Gated Recurrent Unit (GRU) layer
- 1 dense output layer of shape 10 (the number of classes (tags) we are trying to predict)

```python
from keras.layers import Input, Embedding, GRU, Dense, Dropout, BatchNormalization, concatenate

# Title Only
title_input = Input(name='title_input',shape=[max_len_t])
title_Embed = Embedding(vocab_len_t+1,2000,input_length=max_len_t,mask_zero=True,name='title_Embed')(title_input)
gru_out_t = GRU(300)(title_Embed)
# auxiliary output to tune GRU weights smoothly
auxiliary_output = Dense(10, activation='sigmoid', name='aux_output')(gru_out_t)
```

RNN for Body has
- 1 Embedding layer with an input dimension of the Body vocabulary length (1292018) + 1 (for 0 padding) and an output dimension of 170 (for better results use the full vocabulary length + 1)
- 1 Gated Recurrent Unit (GRU) layer

```python
# Body Only
body_input = Input(name='body_input',shape=[max_len_b])
body_Embed = Embedding(vocab_len_b+1,170,input_length=max_len_b,mask_zero=True,name='body_Embed')(body_input)
gru_out_b = GRU(200)(body_Embed)
```

Combine the 2 GRU outputs
```python
com = concatenate([gru_out_t, gru_out_b])
```

The fully connected network has
- 2 Dense layers
- 1 Dropout layer
- 1 BatchNormalization layer
- 1 Dense output layer

```python
# now the combined data is being fed to dense layers
dense1 = Dense(400,activation='relu')(com)
dp1 = Dropout(0.5)(dense1)
bn = BatchNormalization()(dp1)
dense2 = Dense(150,activation='relu')(bn)
main_output = Dense(10, activation='sigmoid', name='main_output')(dense2)
```

*Model compilation with optimizer='adam', loss='categorical_crossentropy', metrics='accuracy'*

**Model Performance Review**

*Classification Report to check Precision, Recall and F1 Score*

The model performs reasonably well, with a score of 84%. Increasing the size of the Embedding, GRU and dense layers would likely give better results.

**Random Validation on Test Data**

**Save the Model & Weights**

Save the model for transfer learning or for running it later

```python
model.save('./stackoverflow_tags.h5')
```

--------------------------------------------------------------------------------
/Stackoverflow Clean Questions.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 8 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" 9 | }, 10 | "outputs": [ 11 | { 12 | "name": "stdout", 13 | "output_type": "stream", 14 | "text": [ 15 | "['Answers.csv', 'Tags.csv', 'Questions.csv']\n" 16 | ] 17 | } 18 | ], 19 | "source": [ 20 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 21 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", 22 | "# For example, here's several helpful packages to load in \n", 23 | "\n", 24 | "import numpy as np # linear algebra\n", 25 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 26 | "\n", 27 | "# for counting\n", 28 | "import collections\n", 29 | "\n", 30 | "# Input data files are available in the \"../input/\" directory.\n", 31 | "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", 32 | "\n", 33 | "import os\n", 34 | "print(os.listdir(\"../input\"))\n", 35 | "\n", 36 | "# Any results you write to the current directory are saved as output." 
37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": { 43 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 44 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a" 45 | }, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/html": [ 50 | "
|   | Id | OwnerUserId | CreationDate | ClosedDate | Score | Title | Body |
|---|---|---|---|---|---|---|---|
| 0 | 80 | 26.0 | 2008-08-01T13:57:07Z | NaN | 26 | SQLStatement.execute() - multiple queries in o... | <p>I've written a database generation script i... |
| 1 | 90 | 58.0 | 2008-08-01T14:41:24Z | 2012-12-26T03:45:49Z | 144 | Good branching and merging tutorials for Torto... | <p>Are there any really good tutorials explain... |
| 2 | 120 | 83.0 | 2008-08-01T15:50:08Z | NaN | 21 | ASP.NET Site Maps | <p>Has anyone got experience creating <strong>... |
| 3 | 180 | 2089740.0 | 2008-08-01T18:42:19Z | NaN | 53 | Function for creating color wheels | <p>This is something I've pseudo-solved many t... |
| 4 | 260 | 91.0 | 2008-08-01T23:22:08Z | NaN | 49 | Adding scripting functionality to .NET applica... | <p>I have a little game written in C#. It uses... |
| 5 | 330 | 63.0 | 2008-08-02T02:51:36Z | NaN | 29 | Should I use nested classes in this case? | <p>I am working on a collection of classes use... |
| 6 | 470 | 71.0 | 2008-08-02T15:11:47Z | 2016-03-26T05:23:29Z | 13 | Homegrown consumption of web services | <p>I've been writing a few web services for a ... |
| 7 | 580 | 91.0 | 2008-08-02T23:30:59Z | NaN | 21 | Deploying SQL Server Databases from Test to Live | <p>I wonder how you guys manage deployment of ... |
| 8 | 650 | 143.0 | 2008-08-03T11:12:52Z | NaN | 79 | Automatically update version number | <p>I would like the version property of my app... |
| 9 | 810 | 233.0 | 2008-08-03T20:35:01Z | NaN | 9 | Visual Studio Setup Project - Per User Registr... | <p>I'm trying to maintain a Setup Project in <... |
" 181 | ], 182 | "text/plain": [ 183 | " Id ... Body\n", 184 | "0 80 ... <p>I've written a database generation script i...\n", 185 | "1 90 ... <p>Are there any really good tutorials explain...\n", 186 | "2 120 ... <p>Has anyone got experience creating <strong>...\n", 187 | "3 180 ... <p>This is something I've pseudo-solved many t...\n", 188 | "4 260 ... <p>I have a little game written in C#. It uses...\n", 189 | "5 330 ... <p>I am working on a collection of classes use...\n", 190 | "6 470 ... <p>I've been writing a few web services for a ...\n", 191 | "7 580 ... <p>I wonder how you guys manage deployment of ...\n", 192 | "8 650 ... <p>I would like the version property of my app...\n", 193 | "9 810 ... <p>I'm trying to maintain a Setup Project in <...\n", 194 | "\n", 195 | "[10 rows x 7 columns]" 196 | ] 197 | }, 198 | "execution_count": 2, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "ques = pd.read_csv('../input/Questions.csv',encoding='iso-8859-1')\n", 205 | "ques.head(10)" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 3, 211 | "metadata": { 212 | "_uuid": "3a963c29707e29d8df14b8f48e45d25b336fcdb0" 213 | }, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/html": [ 218 | "

|   | Id | Title | Body |
|---|---|---|---|
| 0 | 80 | SQLStatement.execute() - multiple queries in o... | <p>I've written a database generation script i... |
| 1 | 90 | Good branching and merging tutorials for Torto... | <p>Are there any really good tutorials explain... |
| 2 | 120 | ASP.NET Site Maps | <p>Has anyone got experience creating <strong>... |
| 3 | 180 | Function for creating color wheels | <p>This is something I've pseudo-solved many t... |
| 4 | 260 | Adding scripting functionality to .NET applica... | <p>I have a little game written in C#. It uses... |
| 5 | 330 | Should I use nested classes in this case? | <p>I am working on a collection of classes use... |
| 6 | 470 | Homegrown consumption of web services | <p>I've been writing a few web services for a ... |
| 7 | 580 | Deploying SQL Server Databases from Test to Live | <p>I wonder how you guys manage deployment of ... |
| 8 | 650 | Automatically update version number | <p>I would like the version property of my app... |
| 9 | 810 | Visual Studio Setup Project - Per User Registr... | <p>I'm trying to maintain a Setup Project in <... |
" 305 | ], 306 | "text/plain": [ 307 | " Id ... Body\n", 308 | "0 80 ... <p>I've written a database generation script i...\n", 309 | "1 90 ... <p>Are there any really good tutorials explain...\n", 310 | "2 120 ... <p>Has anyone got experience creating <strong>...\n", 311 | "3 180 ... <p>This is something I've pseudo-solved many t...\n", 312 | "4 260 ... <p>I have a little game written in C#. It uses...\n", 313 | "5 330 ... <p>I am working on a collection of classes use...\n", 314 | "6 470 ... <p>I've been writing a few web services for a ...\n", 315 | "7 580 ... <p>I wonder how you guys manage deployment of ...\n", 316 | "8 650 ... <p>I would like the version property of my app...\n", 317 | "9 810 ... <p>I'm trying to maintain a Setup Project in <...\n", 318 | "\n", 319 | "[10 rows x 3 columns]" 320 | ] 321 | }, 322 | "execution_count": 3, 323 | "metadata": {}, 324 | "output_type": "execute_result" 325 | } 326 | ], 327 | "source": [ 328 | "ques.drop([\"OwnerUserId\",\"CreationDate\",\"ClosedDate\",\"Score\"], axis=1, inplace=True)\n", 329 | "ques.head(10)" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 4, 335 | "metadata": { 336 | "_uuid": "b1305b3807f362fec50a13a76d68d35fa45c3380" 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "import re \n", 341 | "\n", 342 | "def rem_html_tags(body):\n", 343 | "    regex = re.compile('<.*?>')\n", 344 | "    return re.sub(regex, '', body)" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 5, 350 | "metadata": { 351 | "_uuid": "f0340076f170cc798549f9da6ce46962d9f81c6b" 352 | }, 353 | "outputs": [ 354 | { 355 | "data": { 356 | "text/html": [ 357 | "

|   | Id | Title | Body |
|---|---|---|---|
| 0 | 80 | SQLStatement.execute() - multiple queries in o... | I've written a database generation script in S... |
| 1 | 90 | Good branching and merging tutorials for Torto... | Are there any really good tutorials explaining... |
| 2 | 120 | ASP.NET Site Maps | Has anyone got experience creating SQL-based A... |
| 3 | 180 | Function for creating color wheels | This is something I've pseudo-solved many time... |
| 4 | 260 | Adding scripting functionality to .NET applica... | I have a little game written in C#. It uses a ... |
" 414 | ], 415 | "text/plain": [ 416 | " Id ... Body\n", 417 | "0 80 ... I've written a database generation script in S...\n", 418 | "1 90 ... Are there any really good tutorials explaining...\n", 419 | "2 120 ... Has anyone got experience creating SQL-based A...\n", 420 | "3 180 ... This is something I've pseudo-solved many time...\n", 421 | "4 260 ... I have a little game written in C#. It uses a ...\n", 422 | "\n", 423 | "[5 rows x 3 columns]" 424 | ] 425 | }, 426 | "execution_count": 5, 427 | "metadata": {}, 428 | "output_type": "execute_result" 429 | } 430 | ], 431 | "source": [ 432 | "ques['Body'] = ques['Body'].apply(rem_html_tags)\n", 433 | "ques.head()" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 6, 439 | "metadata": { 440 | "_uuid": "af150557dcf11050e0e7dbc8bcbb0ef2c5015636" 441 | }, 442 | "outputs": [], 443 | "source": [ 444 | "ques.to_csv('question_clean.csv',index=False)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 7, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [] 453 | } 454 | ], 455 | "metadata": { 456 | "kernelspec": { 457 | "display_name": "Python 3", 458 | "language": "python", 459 | "name": "python3" 460 | }, 461 | "language_info": { 462 | "codemirror_mode": { 463 | "name": "ipython", 464 | "version": 3 465 | }, 466 | "file_extension": ".py", 467 | "mimetype": "text/x-python", 468 | "name": "python", 469 | "nbconvert_exporter": "python", 470 | "pygments_lexer": "ipython3", 471 | "version": "3.6.7" 472 | } 473 | }, 474 | "nbformat": 4, 475 | "nbformat_minor": 1 476 | } 477 | -------------------------------------------------------------------------------- /Stackoverflow Tags Map & Model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 8 | "_uuid": 
"8f2839f25d086af736a60e9eeb907d3b93b6e0e5" 9 | }, 10 | "outputs": [ 11 | { 12 | "name": "stdout", 13 | "output_type": "stream", 14 | "text": [ 15 | "['stackoverflow-clean-questions-file-v2', 'stackoverflow']\n" 16 | ] 17 | } 18 | ], 19 | "source": [ 20 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 21 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", 22 | "# For example, here's several helpful packages to load in \n", 23 | "\n", 24 | "import numpy as np # linear algebra\n", 25 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 26 | "\n", 27 | "# Input data files are available in the \"../input/\" directory.\n", 28 | "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", 29 | "\n", 30 | "import os\n", 31 | "print(os.listdir(\"../input\"))\n", 32 | "\n", 33 | "# Plotting Libs\n", 34 | "import matplotlib.pyplot as plt\n", 35 | "import matplotlib.cm as cm\n", 36 | "# magic function\n", 37 | "%matplotlib inline\n", 38 | "\n", 39 | "import collections" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": { 46 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 47 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a" 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "df_tags = pd.read_csv('../input/stackoverflow/Tags.csv', encoding='iso-8859-1')" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "def plot_tags(tagCount):\n", 61 | " \n", 62 | " x,y = zip(*tagCount)\n", 63 | "\n", 64 | " colormap = plt.cm.gist_ncar #nipy_spectral, Set1,Paired \n", 65 | " colors = [colormap(i) for i in np.linspace(0, 0.8,50)] \n", 66 | "\n", 67 | " area = [i/4000 for i in list(y)] # 0 to 15 point radiuses\n", 68 | " plt.figure(figsize=(9,8))\n", 69 | " plt.ylabel(\"Number of question 
associations\")\n", 70 | " for i in range(len(y)):\n", 71 | " plt.plot(i,y[i], marker='o', linestyle='',ms=area[i],label=x[i])\n", 72 | "\n", 73 | " plt.legend(numpoints=1)\n", 74 | " plt.show()" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 4, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "[('javascript', 124155), ('java', 115212), ('c#', 101186), ('php', 98808), ('android', 90659), ('jquery', 78542), ('python', 64601), ('html', 58976), ('c++', 47591), ('ios', 47009)]\n" 87 | ] 88 | }, 89 | { 90 | "data": { 91 | "image/png": "\n", 92 | "text/plain": [ 93 | "
" 94 | ] 95 | }, 96 | "metadata": {}, 97 | "output_type": "display_data" 98 | } 99 | ], 100 | "source": [ 101 | "tagCount = collections.Counter(list(df_tags['Tag'])).most_common(10)\n", 102 | "print(tagCount)\n", 103 | "plot_tags(tagCount)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 5, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "top10=['javascript','java','c#','php','android','jquery','python','html','c++','ios']" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 6, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "name": "stdout", 122 | "output_type": "stream", 123 | "text": [ 124 | "(826739, 2)\n" 125 | ] 126 | }, 127 | { 128 | "data": { 129 | "text/html": [ 130 | "
|   | Id | Tag |
|---|---|---|
| 14 | 260 | c# |
| 18 | 330 | c++ |
| 28 | 650 | c# |
| 35 | 930 | c# |
| 39 | 1010 | c# |
" 181 | ], 182 | "text/plain": [ 183 | " Id Tag\n", 184 | "14 260 c#\n", 185 | "18 330 c++\n", 186 | "28 650 c#\n", 187 | "35 930 c#\n", 188 | "39 1010 c#" 189 | ] 190 | }, 191 | "execution_count": 6, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "tag_top10= df_tags[df_tags.Tag.isin(top10)]\n", 198 | "print (tag_top10.shape)\n", 199 | "tag_top10.head()" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 7, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "30798790 5\n", 211 | "31085960 5\n", 212 | "11648170 5\n", 213 | "35318730 5\n", 214 | "4009250 5\n", 215 | "30289880 5\n", 216 | "23267320 5\n", 217 | "35283570 5\n", 218 | "30991580 5\n", 219 | "23484760 5\n", 220 | "Name: Id, dtype: int64" 221 | ] 222 | }, 223 | "execution_count": 7, 224 | "metadata": {}, 225 | "output_type": "execute_result" 226 | } 227 | ], 228 | "source": [ 229 | "tag_top10['Id'].value_counts().head(10)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 8, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/html": [ 240 | "
|   | Id | Tag |
|---|---|---|
| 14 | 260 | c# |
| 18 | 330 | c++ |
| 28 | 650 | c# |
| 35 | 930 | c# |
| 39 | 1010 | c# |
" 291 | ], 292 | "text/plain": [ 293 | " Id Tag\n", 294 | "14 260 c#\n", 295 | "18 330 c++\n", 296 | "28 650 c#\n", 297 | "35 930 c#\n", 298 | "39 1010 c#" 299 | ] 300 | }, 301 | "execution_count": 8, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "tag_top10.head()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 9, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "def add_tags(question_id):\n", 317 | " return tag_top10[tag_top10['Id'] == question_id['Id']].Tag.values\n", 318 | "\n", 319 | "top10 = tag_top10.apply(add_tags, axis=1)" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 10, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "data": { 329 | "text/plain": [ 330 | "(826739, (826739, 2))" 331 | ] 332 | }, 333 | "execution_count": 10, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "len(top10),tag_top10.shape" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 11, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/html": [ 350 | "
|   | Id | Tag | Tags |
|---|---|---|---|
| 14 | 260 | c# | [c#] |
| 18 | 330 | c++ | [c++] |
| 28 | 650 | c# | [c#] |
| 35 | 930 | c# | [c#] |
| 39 | 1010 | c# | [c#] |
" 407 | ], 408 | "text/plain": [ 409 | " Id Tag Tags\n", 410 | "14 260 c# [c#]\n", 411 | "18 330 c++ [c++]\n", 412 | "28 650 c# [c#]\n", 413 | "35 930 c# [c#]\n", 414 | "39 1010 c# [c#]" 415 | ] 416 | }, 417 | "execution_count": 11, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "tag_top10=pd.concat([tag_top10, top10.rename('Tags')], axis=1)\n", 424 | "tag_top10.head()" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 12, 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "data": { 434 | "text/plain": [ 435 | "(826739, 2)" 436 | ] 437 | }, 438 | "execution_count": 12, 439 | "metadata": {}, 440 | "output_type": "execute_result" 441 | } 442 | ], 443 | "source": [ 444 | "tag_top10.drop([\"Tag\"], axis=1, inplace=True)\n", 445 | "tag_top10.shape" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 13, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "top10_tags=tag_top10.loc[tag_top10.astype(str).drop_duplicates().index]" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": 14, 460 | "metadata": {}, 461 | "outputs": [ 462 | { 463 | "data": { 464 | "text/html": [ 465 | "
|   | Id | Title | Body |
|---|---|---|---|
| 0 | 80 | SQLStatement.execute() - multiple queries in o... | I've written a database generation script in S... |
| 1 | 90 | Good branching and merging tutorials for Torto... | Are there any really good tutorials explaining... |
| 2 | 120 | ASP.NET Site Maps | Has anyone got experience creating SQL-based A... |
| 3 | 180 | Function for creating color wheels | This is something I've pseudo-solved many time... |
| 4 | 260 | Adding scripting functionality to .NET applica... | I have a little game written in C#. It uses a ... |
" 522 | ], 523 | "text/plain": [ 524 | " Id ... Body\n", 525 | "0 80 ... I've written a database generation script in S...\n", 526 | "1 90 ... Are there any really good tutorials explaining...\n", 527 | "2 120 ... Has anyone got experience creating SQL-based A...\n", 528 | "3 180 ... This is something I've pseudo-solved many time...\n", 529 | "4 260 ... I have a little game written in C#. It uses a ...\n", 530 | "\n", 531 | "[5 rows x 3 columns]" 532 | ] 533 | }, 534 | "execution_count": 14, 535 | "metadata": {}, 536 | "output_type": "execute_result" 537 | } 538 | ], 539 | "source": [ 540 | "ques = pd.read_csv('../input/stackoverflow-clean-questions-file-v2/question_clean.csv', encoding='iso-8859-1')\n", 541 | "ques.head()" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": 15, 547 | "metadata": {}, 548 | "outputs": [ 549 | { 550 | "name": "stdout", 551 | "output_type": "stream", 552 | "text": [ 553 | "(706336, 4)\n" 554 | ] 555 | }, 556 | { 557 | "data": { 558 | "text/html": [ 559 | "
|   | Id | Title | Body | Tags |
|---|---|---|---|---|
| 0 | 260 | Adding scripting functionality to .NET applica... | I have a little game written in C#. It uses a ... | [c#] |
| 1 | 330 | Should I use nested classes in this case? | I am working on a collection of classes used f... | [c++] |
| 2 | 650 | Automatically update version number | I would like the version property of my applic... | [c#] |
| 3 | 930 | How do I connect to a database and loop over a... | What's the simplest way to connect and query a... | [c#] |
| 4 | 1010 | How to get the value of built, encoded ViewState? | I need to grab the base64-encoded representati... | [c#] |
" 622 | ], 623 | "text/plain": [ 624 | " Id ... Tags\n", 625 | "0 260 ... [c#]\n", 626 | "1 330 ... [c++]\n", 627 | "2 650 ... [c#]\n", 628 | "3 930 ... [c#]\n", 629 | "4 1010 ... [c#]\n", 630 | "\n", 631 | "[5 rows x 4 columns]" 632 | ] 633 | }, 634 | "execution_count": 15, 635 | "metadata": {}, 636 | "output_type": "execute_result" 637 | } 638 | ], 639 | "source": [ 640 | "total=pd.merge(ques, top10_tags, on='Id')\n", 641 | "print(total.shape)\n", 642 | "total.head()" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": 16, 648 | "metadata": {}, 649 | "outputs": [ 650 | { 651 | "name": "stderr", 652 | "output_type": "stream", 653 | "text": [ 654 | "Using TensorFlow backend.\n" 655 | ] 656 | } 657 | ], 658 | "source": [ 659 | "from sklearn.model_selection import train_test_split\n", 660 | "from sklearn.preprocessing import MultiLabelBinarizer\n", 661 | "from nltk import word_tokenize\n", 662 | "from keras.preprocessing.text import Tokenizer\n", 663 | "from keras.preprocessing import sequence\n", 664 | "from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding, BatchNormalization, GRU ,concatenate\n", 665 | "from keras.models import Model" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": 17, 671 | "metadata": {}, 672 | "outputs": [ 673 | { 674 | "data": { 675 | "text/plain": [ 676 | "array(['android', 'c#', 'c++', 'html', 'ios', 'java', 'javascript',\n", 677 | " 'jquery', 'php', 'python'], dtype=object)" 678 | ] 679 | }, 680 | "execution_count": 17, 681 | "metadata": {}, 682 | "output_type": "execute_result" 683 | } 684 | ], 685 | "source": [ 686 | "multilabel_binarizer = MultiLabelBinarizer()\n", 687 | "multilabel_binarizer.fit(total.Tags)\n", 688 | "labels = multilabel_binarizer.classes_\n", 689 | "labels" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 18, 695 | "metadata": {}, 696 | "outputs": [], 697 | "source": [ 698 | 
"train,test=train_test_split(total[:550000],test_size=0.25,random_state=24)" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": 19, 704 | "metadata": {}, 705 | "outputs": [ 706 | { 707 | "data": { 708 | "text/plain": [ 709 | "((412500, 4), (137500, 4))" 710 | ] 711 | }, 712 | "execution_count": 19, 713 | "metadata": {}, 714 | "output_type": "execute_result" 715 | } 716 | ], 717 | "source": [ 718 | "train.shape,test.shape" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": 20, 724 | "metadata": {}, 725 | "outputs": [], 726 | "source": [ 727 | "X_train_t=train['Title']\n", 728 | "X_train_b=train['Body']\n", 729 | "y_train=multilabel_binarizer.transform(train['Tags'])\n", 730 | "X_test_t=test['Title']\n", 731 | "X_test_b=test['Body']\n", 732 | "y_test=multilabel_binarizer.transform(test['Tags'])" 733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": 21, 738 | "metadata": {}, 739 | "outputs": [ 740 | { 741 | "data": { 742 | "text/plain": [ 743 | "59" 744 | ] 745 | }, 746 | "execution_count": 21, 747 | "metadata": {}, 748 | "output_type": "execute_result" 749 | } 750 | ], 751 | "source": [ 752 | "sent_lens_t=[]\n", 753 | "for sent in train['Title']:\n", 754 | " sent_lens_t.append(len(word_tokenize(sent)))\n", 755 | "max(sent_lens_t)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": 22, 761 | "metadata": {}, 762 | "outputs": [ 763 | { 764 | "data": { 765 | "text/plain": [ 766 | "18.0" 767 | ] 768 | }, 769 | "execution_count": 22, 770 | "metadata": {}, 771 | "output_type": "execute_result" 772 | } 773 | ], 774 | "source": [ 775 | "np.quantile(sent_lens_t,0.97)" 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": 23, 781 | "metadata": {}, 782 | "outputs": [], 783 | "source": [ 784 | "max_len_t = 18\n", 785 | "tok = Tokenizer(char_level=False,split=' ')\n", 786 | "tok.fit_on_texts(X_train_t)\n", 787 | "sequences_train_t = 
tok.texts_to_sequences(X_train_t)" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 24, 793 | "metadata": {}, 794 | "outputs": [ 795 | { 796 | "data": { 797 | "text/plain": [ 798 | "68969" 799 | ] 800 | }, 801 | "execution_count": 24, 802 | "metadata": {}, 803 | "output_type": "execute_result" 804 | } 805 | ], 806 | "source": [ 807 | "vocab_len_t=len(tok.index_word.keys())\n", 808 | "vocab_len_t" 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": 25, 814 | "metadata": {}, 815 | "outputs": [ 816 | { 817 | "data": { 818 | "text/plain": [ 819 | "array([[ 0, 0, 0, ..., 1, 957, 197],\n", 820 | " [ 0, 0, 0, ..., 9081, 45, 533],\n", 821 | " [ 0, 0, 0, ..., 147, 8, 230],\n", 822 | " ...,\n", 823 | " [ 0, 0, 0, ..., 10, 71, 2985],\n", 824 | " [ 0, 0, 0, ..., 2, 18, 75],\n", 825 | " [ 0, 0, 0, ..., 11009, 809, 267]], dtype=int32)" 826 | ] 827 | }, 828 | "execution_count": 25, 829 | "metadata": {}, 830 | "output_type": "execute_result" 831 | } 832 | ], 833 | "source": [ 834 | "sequences_matrix_train_t = sequence.pad_sequences(sequences_train_t,maxlen=max_len_t)\n", 835 | "sequences_matrix_train_t" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 26, 841 | "metadata": {}, 842 | "outputs": [], 843 | "source": [ 844 | "sequences_test_t = tok.texts_to_sequences(X_test_t)\n", 845 | "sequences_matrix_test_t = sequence.pad_sequences(sequences_test_t,maxlen=max_len_t)" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": 27, 851 | "metadata": {}, 852 | "outputs": [ 853 | { 854 | "data": { 855 | "text/plain": [ 856 | "((412500, 18), (137500, 18), (412500, 10), (137500, 10))" 857 | ] 858 | }, 859 | "execution_count": 27, 860 | "metadata": {}, 861 | "output_type": "execute_result" 862 | } 863 | ], 864 | "source": [ 865 | "sequences_matrix_train_t.shape,sequences_matrix_test_t.shape,y_train.shape,y_test.shape" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | 
"execution_count": 28, 871 | "metadata": {}, 872 | "outputs": [ 873 | { 874 | "data": { 875 | "text/plain": [ 876 | "20853" 877 | ] 878 | }, 879 | "execution_count": 28, 880 | "metadata": {}, 881 | "output_type": "execute_result" 882 | } 883 | ], 884 | "source": [ 885 | "sent_lens_b=[]\n", 886 | "for sent in train['Body']:\n", 887 | " sent_lens_b.append(len(word_tokenize(sent)))\n", 888 | "max(sent_lens_b)" 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": 29, 894 | "metadata": {}, 895 | "outputs": [ 896 | { 897 | "data": { 898 | "text/plain": [ 899 | "575.0" 900 | ] 901 | }, 902 | "execution_count": 29, 903 | "metadata": {}, 904 | "output_type": "execute_result" 905 | } 906 | ], 907 | "source": [ 908 | "np.quantile(sent_lens_b,0.90)" 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "execution_count": 30, 914 | "metadata": {}, 915 | "outputs": [], 916 | "source": [ 917 | "max_len_b = 600\n", 918 | "tok = Tokenizer(char_level=False,split=' ')\n", 919 | "tok.fit_on_texts(X_train_b)\n", 920 | "sequences_train_b = tok.texts_to_sequences(X_train_b)" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": 31, 926 | "metadata": {}, 927 | "outputs": [ 928 | { 929 | "data": { 930 | "text/plain": [ 931 | "1292018" 932 | ] 933 | }, 934 | "execution_count": 31, 935 | "metadata": {}, 936 | "output_type": "execute_result" 937 | } 938 | ], 939 | "source": [ 940 | "vocab_len_b =len(tok.index_word.keys())\n", 941 | "vocab_len_b " 942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "execution_count": 32, 947 | "metadata": {}, 948 | "outputs": [ 949 | { 950 | "data": { 951 | "text/plain": [ 952 | "array([[ 0, 0, 0, ..., 51, 2082, 91],\n", 953 | " [ 0, 0, 0, ..., 1408, 203, 825],\n", 954 | " [ 0, 0, 0, ..., 34, 51, 83],\n", 955 | " ...,\n", 956 | " [ 0, 0, 0, ..., 20, 68, 687],\n", 957 | " [ 0, 0, 0, ..., 187, 58, 10],\n", 958 | " [ 0, 0, 0, ..., 194, 197, 10]], dtype=int32)" 959 | ] 960 | }, 961 | "execution_count": 32, 
962 | "metadata": {}, 963 | "output_type": "execute_result" 964 | } 965 | ], 966 | "source": [ 967 | "sequences_matrix_train_b = sequence.pad_sequences(sequences_train_b,maxlen=max_len_b)\n", 968 | "sequences_matrix_train_b" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 33, 974 | "metadata": {}, 975 | "outputs": [], 976 | "source": [ 977 | "sequences_test_b = tok.texts_to_sequences(X_test_b)\n", 978 | "sequences_matrix_test_b = sequence.pad_sequences(sequences_test_b,maxlen=max_len_b)" 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": 34, 984 | "metadata": {}, 985 | "outputs": [ 986 | { 987 | "data": { 988 | "text/plain": [ 989 | "((412500, 18), (412500, 600), (412500, 10))" 990 | ] 991 | }, 992 | "execution_count": 34, 993 | "metadata": {}, 994 | "output_type": "execute_result" 995 | } 996 | ], 997 | "source": [ 998 | "sequences_matrix_train_t.shape,sequences_matrix_train_b.shape,y_train.shape" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": 35, 1004 | "metadata": {}, 1005 | "outputs": [ 1006 | { 1007 | "data": { 1008 | "text/plain": [ 1009 | "((137500, 18), (137500, 600), (137500, 10))" 1010 | ] 1011 | }, 1012 | "execution_count": 35, 1013 | "metadata": {}, 1014 | "output_type": "execute_result" 1015 | } 1016 | ], 1017 | "source": [ 1018 | "sequences_matrix_test_t.shape,sequences_matrix_test_b.shape,y_test.shape" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "code", 1023 | "execution_count": 36, 1024 | "metadata": {}, 1025 | "outputs": [], 1026 | "source": [ 1027 | "def RNN():\n", 1028 | " # Title Only\n", 1029 | " title_input = Input(name='title_input',shape=[max_len_t])\n", 1030 | " title_Embed = Embedding(vocab_len_t+1,2000,input_length=max_len_t,mask_zero=True,name='title_Embed')(title_input)\n", 1031 | " gru_out_t = GRU(300)(title_Embed)\n", 1032 | " # auxiliary output to tune GRU weights smoothly \n", 1033 | " auxiliary_output = Dense(10, activation='sigmoid', 
name='aux_output')(gru_out_t) \n", 1034 | " \n", 1035 | " # Body Only\n", 1036 | " body_input = Input(name='body_input',shape=[max_len_b]) \n", 1037 | " body_Embed = Embedding(vocab_len_b+1,170,input_length=max_len_b,mask_zero=True,name='body_Embed')(body_input)\n", 1038 | " gru_out_b = GRU(200)(body_Embed)\n", 1039 | " \n", 1040 | " # combined with GRU output\n", 1041 | " com = concatenate([gru_out_t, gru_out_b])\n", 1042 | " \n", 1043 | " # now the combined data is being fed to dense layers\n", 1044 | " dense1 = Dense(400,activation='relu')(com)\n", 1045 | " dp1 = Dropout(0.5)(dense1)\n", 1046 | " bn = BatchNormalization()(dp1) \n", 1047 | " dense2 = Dense(150,activation='relu')(bn)\n", 1048 | " \n", 1049 | " main_output = Dense(10, activation='sigmoid', name='main_output')(dense2)\n", 1050 | " \n", 1051 | " model = Model(inputs=[title_input, body_input],outputs=[main_output, auxiliary_output])\n", 1052 | " return model" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "execution_count": 37, 1058 | "metadata": {}, 1059 | "outputs": [ 1060 | { 1061 | "name": "stdout", 1062 | "output_type": "stream", 1063 | "text": [ 1064 | "WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", 1065 | "Instructions for updating:\n", 1066 | "Colocations handled automatically by placer.\n", 1067 | "WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n", 1068 | "Instructions for updating:\n", 1069 | "Please use `rate` instead of `keep_prob`. 
Rate should be set to `rate = 1 - keep_prob`.\n", 1070 | "__________________________________________________________________________________________________\n", 1071 | "Layer (type) Output Shape Param # Connected to \n", 1072 | "==================================================================================================\n", 1073 | "title_input (InputLayer) (None, 18) 0 \n", 1074 | "__________________________________________________________________________________________________\n", 1075 | "body_input (InputLayer) (None, 600) 0 \n", 1076 | "__________________________________________________________________________________________________\n", 1077 | "title_Embed (Embedding) (None, 18, 2000) 137940000 title_input[0][0] \n", 1078 | "__________________________________________________________________________________________________\n", 1079 | "body_Embed (Embedding) (None, 600, 170) 219643230 body_input[0][0] \n", 1080 | "__________________________________________________________________________________________________\n", 1081 | "gru_1 (GRU) (None, 300) 2070900 title_Embed[0][0] \n", 1082 | "__________________________________________________________________________________________________\n", 1083 | "gru_2 (GRU) (None, 200) 222600 body_Embed[0][0] \n", 1084 | "__________________________________________________________________________________________________\n", 1085 | "concatenate_1 (Concatenate) (None, 500) 0 gru_1[0][0] \n", 1086 | " gru_2[0][0] \n", 1087 | "__________________________________________________________________________________________________\n", 1088 | "dense_1 (Dense) (None, 400) 200400 concatenate_1[0][0] \n", 1089 | "__________________________________________________________________________________________________\n", 1090 | "dropout_1 (Dropout) (None, 400) 0 dense_1[0][0] \n", 1091 | "__________________________________________________________________________________________________\n", 1092 | "batch_normalization_1 (BatchNor (None, 400) 1600 
dropout_1[0][0] \n", 1093 | "__________________________________________________________________________________________________\n", 1094 | "dense_2 (Dense) (None, 150) 60150 batch_normalization_1[0][0] \n", 1095 | "__________________________________________________________________________________________________\n", 1096 | "main_output (Dense) (None, 10) 1510 dense_2[0][0] \n", 1097 | "__________________________________________________________________________________________________\n", 1098 | "aux_output (Dense) (None, 10) 3010 gru_1[0][0] \n", 1099 | "==================================================================================================\n", 1100 | "Total params: 360,143,400\n", 1101 | "Trainable params: 360,142,600\n", 1102 | "Non-trainable params: 800\n", 1103 | "__________________________________________________________________________________________________\n" 1104 | ] 1105 | } 1106 | ], 1107 | "source": [ 1108 | "model = RNN()\n", 1109 | "model.summary()" 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": 38, 1115 | "metadata": {}, 1116 | "outputs": [], 1117 | "source": [ 1118 | "model.compile(optimizer='adam',loss={'main_output': 'categorical_crossentropy', 'aux_output': 'categorical_crossentropy'},\n", 1119 | " metrics=['accuracy'])" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "code", 1124 | "execution_count": 39, 1125 | "metadata": {}, 1126 | "outputs": [ 1127 | { 1128 | "name": "stdout", 1129 | "output_type": "stream", 1130 | "text": [ 1131 | "WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", 1132 | "Instructions for updating:\n", 1133 | "Use tf.cast instead.\n" 1134 | ] 1135 | }, 1136 | { 1137 | "name": "stderr", 1138 | "output_type": "stream", 1139 | "text": [ 1140 | "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:107: 
UserWarning: Converting sparse IndexedSlices to a dense Tensor with 137940000 elements. This may consume a large amount of memory.\n", 1141 | " num_elements)\n", 1142 | "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:107: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 219643230 elements. This may consume a large amount of memory.\n", 1143 | " num_elements)\n" 1144 | ] 1145 | }, 1146 | { 1147 | "name": "stdout", 1148 | "output_type": "stream", 1149 | "text": [ 1150 | "Train on 412500 samples, validate on 137500 samples\n", 1151 | "Epoch 1/5\n", 1152 | "412500/412500 [==============================] - 819s 2ms/step - loss: 2.3168 - main_output_loss: 1.0405 - aux_output_loss: 1.2763 - main_output_acc: 0.7285 - aux_output_acc: 0.6705 - val_loss: 1.7714 - val_main_output_loss: 0.7258 - val_aux_output_loss: 1.0456 - val_main_output_acc: 0.8254 - val_aux_output_acc: 0.7285\n", 1153 | "Epoch 2/5\n", 1154 | "412500/412500 [==============================] - 797s 2ms/step - loss: 1.5874 - main_output_loss: 0.6490 - aux_output_loss: 0.9384 - main_output_acc: 0.8443 - aux_output_acc: 0.7582 - val_loss: 1.7043 - val_main_output_loss: 0.6581 - val_aux_output_loss: 1.0462 - val_main_output_acc: 0.8379 - val_aux_output_acc: 0.7286\n", 1155 | "Epoch 3/5\n", 1156 | "412500/412500 [==============================] - 797s 2ms/step - loss: 1.3994 - main_output_loss: 0.5472 - aux_output_loss: 0.8522 - main_output_acc: 0.8650 - aux_output_acc: 0.7774 - val_loss: 1.7369 - val_main_output_loss: 0.6660 - val_aux_output_loss: 1.0708 - val_main_output_acc: 0.8364 - val_aux_output_acc: 0.7263\n", 1157 | "Epoch 4/5\n", 1158 | "412500/412500 [==============================] - 798s 2ms/step - loss: 1.2719 - main_output_loss: 0.4735 - aux_output_loss: 0.7984 - main_output_acc: 0.8774 - aux_output_acc: 0.7885 - val_loss: 1.7988 - val_main_output_loss: 0.6976 - val_aux_output_loss: 1.1012 - val_main_output_acc: 0.8369 - val_aux_output_acc: 
0.7240\n", 1159 | "Epoch 5/5\n", 1160 | "412500/412500 [==============================] - 797s 2ms/step - loss: 1.1665 - main_output_loss: 0.4110 - aux_output_loss: 0.7555 - main_output_acc: 0.8868 - aux_output_acc: 0.7976 - val_loss: 1.9099 - val_main_output_loss: 0.7671 - val_aux_output_loss: 1.1428 - val_main_output_acc: 0.8307 - val_aux_output_acc: 0.7237\n" 1161 | ] 1162 | } 1163 | ], 1164 | "source": [ 1165 | "results=model.fit({'title_input': sequences_matrix_train_t, 'body_input': sequences_matrix_train_b},\n", 1166 | " {'main_output': y_train, 'aux_output': y_train},\n", 1167 | " validation_data=[{'title_input': sequences_matrix_test_t, 'body_input': sequences_matrix_test_b},\n", 1168 | " {'main_output': y_test, 'aux_output': y_test}],\n", 1169 | " epochs=5, batch_size=800)" 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "code", 1174 | "execution_count": 68, 1175 | "metadata": {}, 1176 | "outputs": [ 1177 | { 1178 | "name": "stdout", 1179 | "output_type": "stream", 1180 | "text": [ 1181 | "137500/137500 [==============================] - 1270s 9ms/step\n" 1182 | ] 1183 | } 1184 | ], 1185 | "source": [ 1186 | "(predicted_main, predicted_aux)=model.predict({'title_input': sequences_matrix_test_t, 'body_input': sequences_matrix_test_b},verbose=1)" 1187 | ] 1188 | }, 1189 | { 1190 | "cell_type": "code", 1191 | "execution_count": 70, 1192 | "metadata": {}, 1193 | "outputs": [], 1194 | "source": [ 1195 | "from sklearn.metrics import classification_report,f1_score" 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "code", 1200 | "execution_count": 138, 1201 | "metadata": {}, 1202 | "outputs": [ 1203 | { 1204 | "name": "stdout", 1205 | "output_type": "stream", 1206 | "text": [ 1207 | "0.8424636536796537\n" 1208 | ] 1209 | }, 1210 | { 1211 | "name": "stderr", 1212 | "output_type": "stream", 1213 | "text": [ 1214 | "/opt/conda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in 
samples with no predicted labels.\n", 1215 | " 'precision', 'predicted', average, warn_for)\n" 1216 | ] 1217 | } 1218 | ], 1219 | "source": [ 1220 | "print(f1_score(y_test,predicted_main>.55,average='samples'))" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": 137, 1226 | "metadata": {}, 1227 | "outputs": [ 1228 | { 1229 | "name": "stdout", 1230 | "output_type": "stream", 1231 | "text": [ 1232 | " precision recall f1-score support\n", 1233 | "\n", 1234 | " 0 0.97 0.93 0.95 17054\n", 1235 | " 1 0.92 0.84 0.88 20681\n", 1236 | " 2 0.92 0.81 0.86 9700\n", 1237 | " 3 0.69 0.53 0.60 11304\n", 1238 | " 4 0.96 0.91 0.94 8897\n", 1239 | " 5 0.91 0.80 0.85 22472\n", 1240 | " 6 0.82 0.72 0.76 22938\n", 1241 | " 7 0.81 0.83 0.82 16150\n", 1242 | " 8 0.92 0.90 0.91 19659\n", 1243 | " 9 0.97 0.92 0.95 11576\n", 1244 | "\n", 1245 | " micro avg 0.89 0.82 0.85 160431\n", 1246 | " macro avg 0.89 0.82 0.85 160431\n", 1247 | "weighted avg 0.89 0.82 0.85 160431\n", 1248 | " samples avg 0.86 0.85 0.84 160431\n", 1249 | "\n" 1250 | ] 1251 | }, 1252 | { 1253 | "name": "stderr", 1254 | "output_type": "stream", 1255 | "text": [ 1256 | "/opt/conda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels.\n", 1257 | " 'precision', 'predicted', average, warn_for)\n" 1258 | ] 1259 | } 1260 | ], 1261 | "source": [ 1262 | "print(classification_report(y_test,predicted_main>.55))" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": 131, 1268 | "metadata": {}, 1269 | "outputs": [ 1270 | { 1271 | "data": { 1272 | "text/plain": [ 1273 | "Id 16470700\n", 1274 | "Title NetworkOnMainThreadException- Have tried makin...\n", 1275 | "Body I've been trying to get this to work for a whi...\n", 1276 | "Tags [java, android]\n", 1277 | "Name: 250148, dtype: object" 1278 | ] 1279 | }, 1280 | "execution_count": 131, 1281 | 
"metadata": {}, 1282 | "output_type": "execute_result" 1283 | } 1284 | ], 1285 | "source": [ 1286 | "test.iloc[24]" 1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "code", 1291 | "execution_count": 134, 1292 | "metadata": {}, 1293 | "outputs": [ 1294 | { 1295 | "data": { 1296 | "text/plain": [ 1297 | "array([1. , 0. , 0. , 0. , 0. , 0.84, 0. , 0. , 0. , 0. ],\n", 1298 | " dtype=float32)" 1299 | ] 1300 | }, 1301 | "execution_count": 134, 1302 | "metadata": {}, 1303 | "output_type": "execute_result" 1304 | } 1305 | ], 1306 | "source": [ 1307 | "predicted_main[24].round(decimals = 2)" 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "code", 1312 | "execution_count": 92, 1313 | "metadata": {}, 1314 | "outputs": [ 1315 | { 1316 | "data": { 1317 | "text/plain": [ 1318 | "array(['android', 'c#', 'c++', 'html', 'ios', 'java', 'javascript',\n", 1319 | " 'jquery', 'php', 'python'], dtype=object)" 1320 | ] 1321 | }, 1322 | "execution_count": 92, 1323 | "metadata": {}, 1324 | "output_type": "execute_result" 1325 | } 1326 | ], 1327 | "source": [ 1328 | "labels" 1329 | ] 1330 | }, 1331 | { 1332 | "cell_type": "code", 1333 | "execution_count": 79, 1334 | "metadata": {}, 1335 | "outputs": [], 1336 | "source": [ 1337 | "model.save('./stackoverflow_tags.h5')" 1338 | ] 1339 | } 1340 | ], 1341 | "metadata": { 1342 | "kernelspec": { 1343 | "display_name": "Python 3", 1344 | "language": "python", 1345 | "name": "python3" 1346 | }, 1347 | "language_info": { 1348 | "codemirror_mode": { 1349 | "name": "ipython", 1350 | "version": 3 1351 | }, 1352 | "file_extension": ".py", 1353 | "mimetype": "text/x-python", 1354 | "name": "python", 1355 | "nbconvert_exporter": "python", 1356 | "pygments_lexer": "ipython3", 1357 | "version": "3.6.7" 1358 | } 1359 | }, 1360 | "nbformat": 4, 1361 | "nbformat_minor": 1 1362 | } 1363 | --------------------------------------------------------------------------------
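The final notebook cells compare `predicted_main[24]` against `labels` by eye. A minimal sketch of that lookup follows, assuming the 0.55 cutoff used in the `f1_score` and `classification_report` cells. `decode_tags` and `LABELS` are hypothetical names that do not appear in the notebook, and the score vector is the rounded output shown for test row 24, not a fresh model prediction.

```python
import numpy as np

# Class order as printed by multilabel_binarizer.classes_ in the notebook.
LABELS = np.array(['android', 'c#', 'c++', 'html', 'ios', 'java',
                   'javascript', 'jquery', 'php', 'python'], dtype=object)


def decode_tags(probs, labels=LABELS, threshold=0.55):
    """Return the tag names whose score is strictly above the threshold.

    Hypothetical helper for illustration; the 0.55 default mirrors the
    `predicted_main > .55` comparison used in the evaluation cells.
    """
    probs = np.asarray(probs, dtype=float)
    if probs.shape != labels.shape:
        raise ValueError("expected exactly one score per label")
    # Boolean mask selects the labels whose sigmoid score clears the cutoff.
    return [str(tag) for tag in labels[probs > threshold]]


# Rounded scores the notebook displays for predicted_main[24].
example = [1.0, 0.0, 0.0, 0.0, 0.0, 0.84, 0.0, 0.0, 0.0, 0.0]
print(decode_tags(example))
```

For row 24 this selects `android` and `java`, which agrees with the tags recorded for that row, `[java, android]`. Because each output unit is an independent sigmoid, any number of tags (including none) can clear the cutoff for a given question.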