├── Capstone_part10.ipynb └── README.md /Capstone_part10.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd \n", 12 | "import numpy as np\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "plt.style.use('fivethirtyeight')\n", 15 | "\n", 16 | "%matplotlib inline\n", 17 | "%config InlineBackend.figure_format = 'retina'" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 3, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/html": [ 28 | "
\n", 29 | "\n", 42 | "\n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | "
texttarget
0awww that bummer you shoulda got david carr of...0
1is upset that he can not update his facebook b...0
2dived many times for the ball managed to save ...0
3my whole body feels itchy and like its on fire0
4no it not behaving at all mad why am here beca...0
\n", 78 | "
" 79 | ], 80 | "text/plain": [ 81 | " text target\n", 82 | "0 awww that bummer you shoulda got david carr of... 0\n", 83 | "1 is upset that he can not update his facebook b... 0\n", 84 | "2 dived many times for the ball managed to save ... 0\n", 85 | "3 my whole body feels itchy and like its on fire 0\n", 86 | "4 no it not behaving at all mad why am here beca... 0" 87 | ] 88 | }, 89 | "execution_count": 3, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "csv = 'clean_tweet.csv'\n", 96 | "my_df = pd.read_csv(csv,index_col=0)\n", 97 | "my_df.head()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 4, 103 | "metadata": {}, 104 | "outputs": [ 105 | { 106 | "name": "stdout", 107 | "output_type": "stream", 108 | "text": [ 109 | "\n", 110 | "RangeIndex: 1596019 entries, 0 to 1596018\n", 111 | "Data columns (total 2 columns):\n", 112 | "text 1596019 non-null object\n", 113 | "target 1596019 non-null int64\n", 114 | "dtypes: int64(1), object(1)\n", 115 | "memory usage: 24.4+ MB\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "my_df.dropna(inplace=True)\n", 121 | "my_df.reset_index(drop=True,inplace=True)\n", 122 | "my_df.info()" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 5, 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "x = my_df.text\n", 134 | "y = my_df.target" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 7, 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "from sklearn.cross_validation import train_test_split\n", 146 | "SEED = 2000\n", 147 | "x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.02, random_state=SEED)\n", 148 | "x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 8, 154 | "metadata": {}, 155 | "outputs": [ 156 | { 157 | "name": "stdout", 158 | "output_type": "stream", 159 | "text": [ 160 | "Train set has total 1564098 entries with 50.00% negative, 50.00% positive\n", 161 | "Validation set has total 15960 entries with 50.40% negative, 49.60% positive\n", 162 | "Test set has total 15961 entries with 50.26% negative, 49.74% positive\n" 163 | ] 164 | } 165 | ], 166 | "source": [ 167 | "print \"Train set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive\".format(len(x_train),\n", 168 | " (len(x_train[y_train == 0]) / (len(x_train)*1.))*100,\n", 169 | " (len(x_train[y_train == 1]) / (len(x_train)*1.))*100)\n", 170 | "print \"Validation set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive\".format(len(x_validation),\n", 171 | " (len(x_validation[y_validation == 0]) / (len(x_validation)*1.))*100,\n", 172 | " (len(x_validation[y_validation == 1]) / (len(x_validation)*1.))*100)\n", 173 | "print \"Test set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive\".format(len(x_test),\n", 174 | " (len(x_test[y_test == 0]) / (len(x_test)*1.))*100,\n", 175 | " (len(x_test[y_test == 1]) / (len(x_test)*1.))*100)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "# Neural Networks with Doc2Vec" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "Before I jump into neural network modelling with the vectors I got 
from Doc2Vec, I would like to give you some background on how I got these document vectors. I implemented Doc2Vec using the Gensim library in the 6th part of this series.\n", 190 | "\n", 191 | "There are three different methods used to train Doc2Vec: Distributed Bag of Words (DBOW), Distributed Memory with mean (DMM), and Distributed Memory with concatenation (DMC). Each model was trained on 1.5 million tweets for 30 epochs, and each outputs a 100-dimensional vector per tweet. After I got the document vectors from each model, I tried concatenating them in pairs (DBOW + DMM, DBOW + DMC), so that the concatenated document vectors have 200 dimensions, and saw an improvement in performance compared with any single pure method. Combining vectors obtained from different training methods to improve performance has already been demonstrated by Le and Mikolov (2014) in their research paper.\n", 192 | "https://cs.stanford.edu/~quocle/paragraph_vector.pdf\n", 193 | "\n", 194 | "Finally, I applied phrase modelling to detect bigram and trigram phrases as a preprocessing step for Doc2Vec training, and tried different combinations across n-grams. When tested with a logistic regression model, I got the best performance from the 'unigram DBOW + trigram DMM' document vectors." 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "I will start by loading Gensim's Doc2Vec, defining a function to extract document vectors, and then loading the Doc2Vec models I trained." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 9, 207 | "metadata": { 208 | "collapsed": true 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "from gensim.models import Doc2Vec\n", 213 | "\n", 214 | "def get_concat_vectors(model1,model2, corpus, size):\n", 215 | " vecs = np.zeros((len(corpus), size))\n", 216 | " n = 0\n", 217 | " for i in corpus.index:\n", 218 | " prefix = 'all_' + str(i)\n", 219 | " vecs[n] = np.append(model1.docvecs[prefix],model2.docvecs[prefix])\n", 220 | " n += 1\n", 221 | " return vecs" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 11, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [ 232 | "model_ug_dbow = Doc2Vec.load('d2v_model_ug_dbow.doc2vec')\n", 233 | "model_tg_dmm = Doc2Vec.load('d2v_model_tg_dmm.doc2vec')\n", 234 | "model_ug_dbow.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)\n", 235 | "model_tg_dmm.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 12, 241 | "metadata": { 242 | "collapsed": true 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "train_vecs_ugdbow_tgdmm = get_concat_vectors(model_ug_dbow,model_tg_dmm, x_train, 200)\n", 247 | "validation_vecs_ugdbow_tgdmm = get_concat_vectors(model_ug_dbow,model_tg_dmm, x_validation, 200)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 13, 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "name": "stdout", 257 | "output_type": "stream", 258 | "text": [ 259 | "CPU times: user 58.9 s, sys: 35 s, total: 1min 33s\n", 260 | "Wall time: 2min 3s\n" 261 | ] 262 | } 263 | ], 264 | "source": [ 265 | "%%time\n", 266 | "from sklearn.linear_model import LogisticRegression\n", 267 | "clf = LogisticRegression()\n", 268 | "clf.fit(train_vecs_ugdbow_tgdmm, y_train)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 |
"execution_count": 14, 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "name": "stdout", 278 | "output_type": "stream", 279 | "text": [ 280 | "CPU times: user 1.04 s, sys: 4.62 s, total: 5.66 s\n", 281 | "Wall time: 8.79 s\n" 282 | ] 283 | }, 284 | { 285 | "data": { 286 | "text/plain": [ 287 | "0.7590662477670836" 288 | ] 289 | }, 290 | "execution_count": 14, 291 | "metadata": {}, 292 | "output_type": "execute_result" 293 | } 294 | ], 295 | "source": [ 296 | "%%time\n", 297 | "clf.score(train_vecs_ugdbow_tgdmm, y_train)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 15, 303 | "metadata": {}, 304 | "outputs": [ 305 | { 306 | "name": "stdout", 307 | "output_type": "stream", 308 | "text": [ 309 | "CPU times: user 11.8 ms, sys: 47.9 ms, total: 59.6 ms\n", 310 | "Wall time: 90.1 ms\n" 311 | ] 312 | }, 313 | { 314 | "data": { 315 | "text/plain": [ 316 | "0.7576441102756892" 317 | ] 318 | }, 319 | "execution_count": 15, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "%%time\n", 326 | "clf.score(validation_vecs_ugdbow_tgdmm, y_validation)" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "When fed to a simple logistic regression, the concatenated document vectors (unigram DBOW + trigram DMM) yield 75.90% training set accuracy and 75.76% validation set accuracy." 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "I will try different numbers of hidden layers and hidden nodes and compare the performance. In the code block below, you will see that I first define the seed as \"7\" but do not set the random seed there; \"np.random.seed()\" is called at the start of each model instead. This is so that the results of the different model structures are reproducible." 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "*Side Note (reproducibility): To be honest, this took me a while to figure out. I first tried setting the random seed before importing Keras and then ran one model after another. However, if I defined the same model structure after a model had already run, I could not get the same result. I also realised that if I restart the kernel and re-run the code blocks from the start, I get the same results as in the previous kernel. So I figured that running a model changes the random state, and that is why I cannot reproduce a result with the same structure if I run the models consecutively in the same kernel. That is why I set the random seed every time I try a different model. For your information, I am running Keras with the Theano backend, using only the CPU (no GPU). If you are on the same setup, this should work. I explicitly specified the backend as Theano by launching Jupyter Notebook from the command line as follows: \"KERAS_BACKEND=theano jupyter notebook\"*" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "Please note that not all of the dependencies loaded in the cell below are used in this post; some are imported for later use." 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 12, 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "name": "stderr", 364 | "output_type": "stream", 365 | "text": [ 366 | "/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 367 | " from ._conv import register_converters as _register_converters\n", 368 | "Using Theano backend.\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "seed = 7\n", 374 | "\n", 375 | "from keras.models import Sequential\n", 376 | "from keras.layers import Dense, Dropout\n", 377 | "from keras.layers import Flatten\n", 378 | "from keras.layers.embeddings import Embedding\n", 379 | "from keras.preprocessing import sequence" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 16, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "name": "stdout", 389 | "output_type": "stream", 390 | "text": [ 391 | "Train on 1564098 samples, validate on 15960 samples\n", 392 | "Epoch 1/10\n", 393 | " - 30s - loss: 0.4791 - acc: 0.7749 - val_loss: 0.4661 - val_acc: 0.7787\n", 394 | "Epoch 2/10\n", 395 | " - 30s - loss: 0.4637 - acc: 0.7829 - val_loss: 0.4717 - val_acc: 0.7816\n", 396 | "Epoch 3/10\n", 397 | " - 30s - loss: 0.4593 - acc: 0.7852 - val_loss: 0.4614 - val_acc: 0.7838\n", 398 | "Epoch 4/10\n", 399 | " - 30s - loss: 0.4567 - acc: 0.7867 - val_loss: 0.4607 - val_acc: 0.7837\n", 400 | "Epoch 5/10\n", 401 | " - 29s - loss: 0.4552 - acc: 0.7878 - val_loss: 0.4586 - val_acc: 0.7862\n", 402 | "Epoch 6/10\n", 403 | " - 29s - loss: 0.4537 - acc: 0.7883 - val_loss: 0.4579 - val_acc: 0.7853\n", 404 | "Epoch 7/10\n", 405 | " - 30s - loss: 0.4527 - acc: 0.7887 - val_loss: 0.4576 - val_acc: 0.7863\n", 406 | "Epoch 8/10\n", 407 | " - 30s - loss: 0.4519 - acc: 0.7891 - val_loss: 0.4566 - val_acc: 0.7866\n", 408 | "Epoch 9/10\n", 409 | " - 30s - loss: 0.4512 - acc: 0.7896 - val_loss: 0.4573 - val_acc: 0.7877\n", 410 | "Epoch 10/10\n", 411 | " - 30s - loss: 0.4507 - acc: 0.7898 - val_loss: 0.4585 - val_acc: 0.7856\n", 412 | "CPU times: user 4min 53s, sys: 16.1 s, total: 5min 9s\n", 413 | "Wall time: 4min 58s\n" 414 | ] 415 | } 416 | ], 417 | "source": [ 418 | "%%time\n", 419 | "np.random.seed(seed)\n", 420 | "model_d2v_01 = Sequential()\n", 421 | "model_d2v_01.add(Dense(64, activation='relu', input_dim=200))\n", 422 | "model_d2v_01.add(Dense(1, activation='sigmoid'))\n", 423 | "model_d2v_01.compile(optimizer='adam',\n", 424 | " loss='binary_crossentropy',\n", 425 | " metrics=['accuracy'])\n", 426 | "\n", 427 | "model_d2v_01.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 19, 433 | "metadata": {}, 434 | "outputs": [ 435 | { 436 | "name": "stdout", 437 | "output_type": "stream", 438 | "text": [ 439 | "Train on 1564098 samples, validate on 15960 samples\n", 440 | "Epoch 1/10\n", 441 | " - 39s - loss: 0.4661 - acc: 0.7777 - val_loss: 0.4541 - val_acc: 0.7826\n", 442 | "Epoch 2/10\n", 443 | " - 39s - loss: 0.4483 - acc: 0.7878 - val_loss: 0.4504 - val_acc: 0.7901\n", 444 | "Epoch 3/10\n", 445 | " - 41s - loss: 0.4431 - acc: 0.7906 - val_loss: 0.4491 - val_acc: 0.7898\n", 446 | "Epoch 4/10\n", 447 | " - 43s - loss: 0.4403 - acc: 0.7923 - val_loss: 0.4472 - val_acc: 0.7927\n", 448 | "Epoch 5/10\n", 449 | " - 41s - loss: 0.4382 - acc: 0.7933 - val_loss: 0.4472 - val_acc: 0.7942\n", 450 | "Epoch 6/10\n", 451 | " - 40s - loss: 0.4369 - acc: 0.7940 - val_loss: 0.4441 - val_acc: 0.7912\n", 452 | "Epoch 7/10\n", 453 | " - 40s - loss: 0.4359 - acc: 0.7946 - val_loss: 0.4465 - val_acc: 0.7910\n", 454 | "Epoch 8/10\n", 455 | " - 40s - loss: 0.4348 - acc: 0.7951 
- val_loss: 0.4495 - val_acc: 0.7955\n", 456 | "Epoch 9/10\n", 457 | " - 41s - loss: 0.4341 - acc: 0.7956 - val_loss: 0.4511 - val_acc: 0.7900\n", 458 | "Epoch 10/10\n", 459 | " - 40s - loss: 0.4336 - acc: 0.7961 - val_loss: 0.4457 - val_acc: 0.7928\n" 460 | ] 461 | }, 462 | { 463 | "data": { 464 | "text/plain": [ 465 | "" 466 | ] 467 | }, 468 | "execution_count": 19, 469 | "metadata": {}, 470 | "output_type": "execute_result" 471 | } 472 | ], 473 | "source": [ 474 | "np.random.seed(seed)\n", 475 | "model_d2v_02 = Sequential()\n", 476 | "model_d2v_02.add(Dense(64, activation='relu', input_dim=200))\n", 477 | "model_d2v_02.add(Dense(64, activation='relu'))\n", 478 | "model_d2v_02.add(Dense(1, activation='sigmoid'))\n", 479 | "model_d2v_02.compile(optimizer='adam',\n", 480 | " loss='binary_crossentropy',\n", 481 | " metrics=['accuracy'])\n", 482 | "\n", 483 | "model_d2v_02.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": 22, 489 | "metadata": {}, 490 | "outputs": [ 491 | { 492 | "name": "stdout", 493 | "output_type": "stream", 494 | "text": [ 495 | "Train on 1564098 samples, validate on 15960 samples\n", 496 | "Epoch 1/10\n", 497 | " - 44s - loss: 0.4655 - acc: 0.7778 - val_loss: 0.4548 - val_acc: 0.7853\n", 498 | "Epoch 2/10\n", 499 | " - 49s - loss: 0.4479 - acc: 0.7883 - val_loss: 0.4484 - val_acc: 0.7905\n", 500 | "Epoch 3/10\n", 501 | " - 52s - loss: 0.4426 - acc: 0.7912 - val_loss: 0.4487 - val_acc: 0.7902\n", 502 | "Epoch 4/10\n", 503 | " - 55s - loss: 0.4395 - acc: 0.7925 - val_loss: 0.4465 - val_acc: 0.7916\n", 504 | "Epoch 5/10\n", 505 | " - 57s - loss: 0.4372 - acc: 0.7939 - val_loss: 0.4459 - val_acc: 0.7925\n", 506 | "Epoch 6/10\n", 507 | " - 58s - loss: 0.4358 - acc: 0.7948 - val_loss: 0.4432 - val_acc: 0.7919\n", 508 | "Epoch 7/10\n", 509 | " - 58s - loss: 0.4345 - acc: 0.7955 - val_loss: 0.4435 - val_acc: 0.7937\n", 510 | "Epoch 8/10\n", 511 | " - 60s - loss: 0.4336 - acc: 0.7960 - val_loss: 0.4433 - val_acc: 0.7934\n", 512 | "Epoch 9/10\n", 513 | " - 59s - loss: 0.4328 - acc: 0.7966 - val_loss: 0.4500 - val_acc: 0.7914\n", 514 | "Epoch 10/10\n", 515 | " - 60s - loss: 0.4322 - acc: 0.7967 - val_loss: 0.4421 - val_acc: 0.7912\n" 516 | ] 517 | }, 518 | { 519 | "data": { 520 | "text/plain": [ 521 | "" 522 | ] 523 | }, 524 | "execution_count": 22, 525 | "metadata": {}, 526 | "output_type": "execute_result" 527 | } 528 | ], 529 | "source": [ 530 | "np.random.seed(seed)\n", 531 | "model_d2v_03 = Sequential()\n", 532 | "model_d2v_03.add(Dense(64, activation='relu', input_dim=200))\n", 533 | "model_d2v_03.add(Dense(64, activation='relu'))\n", 534 | "model_d2v_03.add(Dense(64, activation='relu'))\n", 535 | "model_d2v_03.add(Dense(1, activation='sigmoid'))\n", 536 | "model_d2v_03.compile(optimizer='adam',\n", 537 | " loss='binary_crossentropy',\n", 538 | " metrics=['accuracy'])\n", 539 | "\n", 540 | "model_d2v_03.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": 27, 546 | "metadata": {}, 547 | "outputs": [ 548 | { 549 | "name": "stdout", 550 | "output_type": "stream", 551 | "text": [ 552 | "Train on 1564098 samples, validate on 15960 samples\n", 553 | "Epoch 1/10\n", 554 | " - 37s - loss: 0.4762 - acc: 0.7762 - val_loss: 0.4624 - val_acc: 0.7830\n", 555 | 
"Epoch 2/10\n", 556 | " - 39s - loss: 0.4592 - acc: 0.7851 - val_loss: 0.4640 - val_acc: 0.7843\n", 557 | "Epoch 3/10\n", 558 | " - 36s - loss: 0.4533 - acc: 0.7883 - val_loss: 0.4576 - val_acc: 0.7868\n", 559 | "Epoch 4/10\n", 560 | " - 36s - loss: 0.4497 - acc: 0.7903 - val_loss: 0.4561 - val_acc: 0.7883\n", 561 | "Epoch 5/10\n", 562 | " - 36s - loss: 0.4473 - acc: 0.7915 - val_loss: 0.4555 - val_acc: 0.7865\n", 563 | "Epoch 6/10\n", 564 | " - 38s - loss: 0.4455 - acc: 0.7928 - val_loss: 0.4538 - val_acc: 0.7882\n", 565 | "Epoch 7/10\n", 566 | " - 38s - loss: 0.4440 - acc: 0.7934 - val_loss: 0.4523 - val_acc: 0.7896\n", 567 | "Epoch 8/10\n", 568 | " - 36s - loss: 0.4428 - acc: 0.7940 - val_loss: 0.4537 - val_acc: 0.7911\n", 569 | "Epoch 9/10\n", 570 | " - 36s - loss: 0.4420 - acc: 0.7947 - val_loss: 0.4539 - val_acc: 0.7851\n", 571 | "Epoch 10/10\n", 572 | " - 36s - loss: 0.4412 - acc: 0.7946 - val_loss: 0.4533 - val_acc: 0.7914\n" 573 | ] 574 | }, 575 | { 576 | "data": { 577 | "text/plain": [ 578 | "" 579 | ] 580 | }, 581 | "execution_count": 27, 582 | "metadata": {}, 583 | "output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "np.random.seed(seed)\n", 588 | "model_d2v_04 = Sequential()\n", 589 | "model_d2v_04.add(Dense(128, activation='relu', input_dim=200))\n", 590 | "model_d2v_04.add(Dense(1, activation='sigmoid'))\n", 591 | "model_d2v_04.compile(optimizer='adam',\n", 592 | " loss='binary_crossentropy',\n", 593 | " metrics=['accuracy'])\n", 594 | "\n", 595 | "model_d2v_04.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": 28, 601 | "metadata": {}, 602 | "outputs": [ 603 | { 604 | "name": "stdout", 605 | "output_type": "stream", 606 | "text": [ 607 | "Train on 1564098 samples, validate on 15960 samples\n", 608 | "Epoch 1/10\n", 609 | " - 69s - loss: 0.4609 - acc: 0.7807 - val_loss: 0.4496 - val_acc: 0.7887\n", 610 | "Epoch 2/10\n", 611 | " - 77s - loss: 0.4420 - acc: 0.7912 - val_loss: 0.4427 - val_acc: 0.7928\n", 612 | "Epoch 3/10\n", 613 | " - 88s - loss: 0.4358 - acc: 0.7948 - val_loss: 0.4433 - val_acc: 0.7902\n", 614 | "Epoch 4/10\n", 615 | " - 91s - loss: 0.4319 - acc: 0.7968 - val_loss: 0.4386 - val_acc: 0.7960\n", 616 | "Epoch 5/10\n", 617 | " - 93s - loss: 0.4290 - acc: 0.7983 - val_loss: 0.4398 - val_acc: 0.7950\n", 618 | "Epoch 6/10\n", 619 | " - 94s - loss: 0.4267 - acc: 0.7995 - val_loss: 0.4379 - val_acc: 0.7955\n", 620 | "Epoch 7/10\n", 621 | " - 95s - loss: 0.4251 - acc: 0.8003 - val_loss: 0.4383 - val_acc: 0.7942\n", 622 | "Epoch 8/10\n", 623 | " - 96s - loss: 0.4235 - acc: 0.8013 - val_loss: 0.4416 - val_acc: 0.7944\n", 624 | "Epoch 9/10\n", 625 | " - 96s - loss: 0.4223 - acc: 0.8019 - val_loss: 0.4445 - val_acc: 0.7926\n", 626 | "Epoch 10/10\n", 627 | " - 94s - loss: 0.4213 - acc: 0.8024 - val_loss: 0.4388 - val_acc: 0.7950\n" 628 | ] 629 | }, 630 | { 631 | "data": { 632 | "text/plain": [ 633 | "" 634 | ] 635 | }, 636 | "execution_count": 28, 637 | "metadata": {}, 638 | "output_type": "execute_result" 639 | } 640 | ], 641 | "source": [ 642 | "np.random.seed(seed)\n", 643 | "model_d2v_05 = Sequential()\n", 644 | "model_d2v_05.add(Dense(128, activation='relu', input_dim=200))\n", 645 | "model_d2v_05.add(Dense(128, activation='relu'))\n", 646 | "model_d2v_05.add(Dense(1, activation='sigmoid'))\n", 647 | "model_d2v_05.compile(optimizer='adam',\n", 648 | " loss='binary_crossentropy',\n", 649 
| " metrics=['accuracy'])\n", 650 | "\n", 651 | "model_d2v_05.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": 29, 657 | "metadata": {}, 658 | "outputs": [ 659 | { 660 | "name": "stdout", 661 | "output_type": "stream", 662 | "text": [ 663 | "Train on 1564098 samples, validate on 15960 samples\n", 664 | "Epoch 1/10\n", 665 | " - 124s - loss: 0.4613 - acc: 0.7806 - val_loss: 0.4481 - val_acc: 0.7869\n", 666 | "Epoch 2/10\n", 667 | " - 146s - loss: 0.4420 - acc: 0.7914 - val_loss: 0.4426 - val_acc: 0.7923\n", 668 | "Epoch 3/10\n", 669 | " - 175s - loss: 0.4352 - acc: 0.7951 - val_loss: 0.4425 - val_acc: 0.7931\n", 670 | "Epoch 4/10\n", 671 | " - 184s - loss: 0.4312 - acc: 0.7971 - val_loss: 0.4381 - val_acc: 0.7953\n", 672 | "Epoch 5/10\n", 673 | " - 188s - loss: 0.4281 - acc: 0.7989 - val_loss: 0.4373 - val_acc: 0.7946\n", 674 | "Epoch 6/10\n", 675 | " - 189s - loss: 0.4259 - acc: 0.8000 - val_loss: 0.4370 - val_acc: 0.7975\n", 676 | "Epoch 7/10\n", 677 | " - 196s - loss: 0.4243 - acc: 0.8010 - val_loss: 0.4395 - val_acc: 0.7969\n", 678 | "Epoch 8/10\n", 679 | " - 224s - loss: 0.4228 - acc: 0.8017 - val_loss: 0.4398 - val_acc: 0.7942\n", 680 | "Epoch 9/10\n", 681 | " - 210s - loss: 0.4220 - acc: 0.8022 - val_loss: 0.4455 - val_acc: 0.7948\n", 682 | "Epoch 10/10\n", 683 | " - 221s - loss: 0.4209 - acc: 0.8029 - val_loss: 0.4373 - val_acc: 0.7947\n" 684 | ] 685 | }, 686 | { 687 | "data": { 688 | "text/plain": [ 689 | "" 690 | ] 691 | }, 692 | "execution_count": 29, 693 | "metadata": {}, 694 | "output_type": "execute_result" 695 | } 696 | ], 697 | "source": [ 698 | "np.random.seed(seed)\n", 699 | "model_d2v_06 = Sequential()\n", 700 | "model_d2v_06.add(Dense(128, activation='relu', input_dim=200))\n", 701 | "model_d2v_06.add(Dense(128, activation='relu'))\n", 702 | "model_d2v_06.add(Dense(128, activation='relu'))\n", 703 | "model_d2v_06.add(Dense(1, activation='sigmoid'))\n", 704 | "model_d2v_06.compile(optimizer='adam',\n", 705 | " loss='binary_crossentropy',\n", 706 | " metrics=['accuracy'])\n", 707 | "\n", 708 | "model_d2v_06.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": 30, 714 | "metadata": {}, 715 | "outputs": [ 716 | { 717 | "name": "stdout", 718 | "output_type": "stream", 719 | "text": [ 720 | "Train on 1564098 samples, validate on 15960 samples\n", 721 | "Epoch 1/10\n", 722 | " - 57s - loss: 0.4746 - acc: 0.7770 - val_loss: 0.4627 - val_acc: 0.7821\n", 723 | "Epoch 2/10\n", 724 | " - 73s - loss: 0.4572 - acc: 0.7861 - val_loss: 0.4639 - val_acc: 0.7840\n", 725 | "Epoch 3/10\n", 726 | " - 72s - loss: 0.4505 - acc: 0.7897 - val_loss: 0.4568 - val_acc: 0.7857\n", 727 | "Epoch 4/10\n", 728 | " - 71s - loss: 0.4458 - acc: 0.7923 - val_loss: 0.4541 - val_acc: 0.7860\n", 729 | "Epoch 5/10\n", 730 | " - 67s - loss: 0.4423 - acc: 0.7947 - val_loss: 0.4547 - val_acc: 0.7891\n", 731 | "Epoch 6/10\n", 732 | " - 68s - loss: 0.4395 - acc: 0.7962 - val_loss: 0.4526 - val_acc: 0.7870\n", 733 | "Epoch 7/10\n", 734 | " - 67s - loss: 0.4370 - acc: 0.7978 - val_loss: 0.4516 - val_acc: 0.7912\n", 735 | "Epoch 8/10\n", 736 | " - 67s - loss: 0.4349 - acc: 0.7988 - val_loss: 0.4548 - val_acc: 0.7904\n", 737 | "Epoch 9/10\n", 738 | " - 67s - loss: 0.4332 - acc: 0.7999 - val_loss: 
0.4571 - val_acc: 0.7890\n", 739 | "Epoch 10/10\n", 740 | " - 67s - loss: 0.4318 - acc: 0.8007 - val_loss: 0.4580 - val_acc: 0.7895\n" 741 | ] 742 | }, 743 | { 744 | "data": { 745 | "text/plain": [ 746 | "" 747 | ] 748 | }, 749 | "execution_count": 30, 750 | "metadata": {}, 751 | "output_type": "execute_result" 752 | } 753 | ], 754 | "source": [ 755 | "np.random.seed(seed)\n", 756 | "model_d2v_07 = Sequential()\n", 757 | "model_d2v_07.add(Dense(256, activation='relu', input_dim=200))\n", 758 | "model_d2v_07.add(Dense(1, activation='sigmoid'))\n", 759 | "model_d2v_07.compile(optimizer='adam',\n", 760 | " loss='binary_crossentropy',\n", 761 | " metrics=['accuracy'])\n", 762 | "\n", 763 | "model_d2v_07.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 31, 769 | "metadata": {}, 770 | "outputs": [ 771 | { 772 | "name": "stdout", 773 | "output_type": "stream", 774 | "text": [ 775 | "Train on 1564098 samples, validate on 15960 samples\n", 776 | "Epoch 1/10\n", 777 | " - 172s - loss: 0.4581 - acc: 0.7824 - val_loss: 0.4465 - val_acc: 0.7866\n", 778 | "Epoch 2/10\n", 779 | " - 224s - loss: 0.4383 - acc: 0.7936 - val_loss: 0.4416 - val_acc: 0.7939\n", 780 | "Epoch 3/10\n", 781 | " - 283s - loss: 0.4304 - acc: 0.7979 - val_loss: 0.4431 - val_acc: 0.7927\n", 782 | "Epoch 4/10\n", 783 | " - 308s - loss: 0.4251 - acc: 0.8007 - val_loss: 0.4409 - val_acc: 0.7904\n", 784 | "Epoch 5/10\n", 785 | " - 323s - loss: 0.4209 - acc: 0.8029 - val_loss: 0.4400 - val_acc: 0.7908\n", 786 | "Epoch 6/10\n", 787 | " - 334s - loss: 0.4177 - acc: 0.8047 - val_loss: 0.4386 - val_acc: 0.7937\n", 788 | "Epoch 7/10\n", 789 | " - 341s - loss: 0.4150 - acc: 0.8062 - val_loss: 0.4427 - val_acc: 0.7948\n", 790 | "Epoch 8/10\n", 791 | " - 347s - loss: 0.4126 - acc: 0.8074 - val_loss: 0.4471 - val_acc: 0.7949\n", 792 | "Epoch 9/10\n", 793 | " - 354s - loss: 0.4105 - acc: 0.8083 - val_loss: 0.4449 - val_acc: 0.7926\n", 794 | "Epoch 10/10\n", 795 | " - 358s - loss: 0.4089 - acc: 0.8091 - val_loss: 0.4438 - val_acc: 0.7951\n" 796 | ] 797 | }, 798 | { 799 | "data": { 800 | "text/plain": [ 801 | "" 802 | ] 803 | }, 804 | "execution_count": 31, 805 | "metadata": {}, 806 | "output_type": "execute_result" 807 | } 808 | ], 809 | "source": [ 810 | "np.random.seed(seed)\n", 811 | "model_d2v_08 = Sequential()\n", 812 | "model_d2v_08.add(Dense(256, activation='relu', input_dim=200))\n", 813 | "model_d2v_08.add(Dense(256, activation='relu'))\n", 814 | "model_d2v_08.add(Dense(1, activation='sigmoid'))\n", 815 | "model_d2v_08.compile(optimizer='adam',\n", 816 | " loss='binary_crossentropy',\n", 817 | " metrics=['accuracy'])\n", 818 | "\n", 819 | "model_d2v_08.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 32, 825 | "metadata": {}, 826 | "outputs": [ 827 | { 828 | "name": "stdout", 829 | "output_type": "stream", 830 | "text": [ 831 | "Train on 1564098 samples, validate on 15960 samples\n", 832 | "Epoch 1/10\n", 833 | " - 349s - loss: 0.4579 - acc: 0.7827 - val_loss: 0.4448 - val_acc: 0.7904\n", 834 | "Epoch 2/10\n", 835 | " - 500s - loss: 0.4371 - acc: 0.7944 - val_loss: 0.4401 - val_acc: 0.7967\n", 836 | "Epoch 3/10\n", 837 | " - 649s - loss: 0.4287 - acc: 0.7987 - val_loss: 0.4396 - val_acc: 0.7948\n", 838 | "Epoch 
4/10\n", 839 | " - 672s - loss: 0.4229 - acc: 0.8019 - val_loss: 0.4369 - val_acc: 0.7957\n", 840 | "Epoch 5/10\n", 841 | " - 664s - loss: 0.4182 - acc: 0.8046 - val_loss: 0.4353 - val_acc: 0.7953\n", 842 | "Epoch 6/10\n", 843 | " - 664s - loss: 0.4146 - acc: 0.8063 - val_loss: 0.4363 - val_acc: 0.7974\n", 844 | "Epoch 7/10\n", 845 | " - 670s - loss: 0.4115 - acc: 0.8079 - val_loss: 0.4403 - val_acc: 0.7993\n", 846 | "Epoch 8/10\n", 847 | " - 670s - loss: 0.4087 - acc: 0.8094 - val_loss: 0.4437 - val_acc: 0.7964\n", 848 | "Epoch 9/10\n", 849 | " - 672s - loss: 0.4061 - acc: 0.8107 - val_loss: 0.4435 - val_acc: 0.7926\n", 850 | "Epoch 10/10\n", 851 | " - 672s - loss: 0.4037 - acc: 0.8118 - val_loss: 0.4411 - val_acc: 0.7952\n" 852 | ] 853 | }, 854 | { 855 | "data": { 856 | "text/plain": [ 857 | "" 858 | ] 859 | }, 860 | "execution_count": 32, 861 | "metadata": {}, 862 | "output_type": "execute_result" 863 | } 864 | ], 865 | "source": [ 866 | "np.random.seed(seed)\n", 867 | "model_d2v_09 = Sequential()\n", 868 | "model_d2v_09.add(Dense(256, activation='relu', input_dim=200))\n", 869 | "model_d2v_09.add(Dense(256, activation='relu'))\n", 870 | "model_d2v_09.add(Dense(256, activation='relu'))\n", 871 | "model_d2v_09.add(Dense(1, activation='sigmoid'))\n", 872 | "model_d2v_09.compile(optimizer='adam',\n", 873 | " loss='binary_crossentropy',\n", 874 | " metrics=['accuracy'])\n", 875 | "\n", 876 | "model_d2v_09.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 33, 882 | "metadata": {}, 883 | "outputs": [ 884 | { 885 | "name": "stdout", 886 | "output_type": "stream", 887 | "text": [ 888 | "Train on 1564098 samples, validate on 15960 samples\n", 889 | "Epoch 1/10\n", 890 | " - 89s - loss: 0.4739 - acc: 0.7773 - val_loss: 0.4615 - val_acc: 0.7823\n", 891 | "Epoch 2/10\n", 892 | " - 129s - loss: 0.4556 - acc: 0.7872 - val_loss: 0.4603 - val_acc: 0.7864\n", 893 | "Epoch 3/10\n", 894 | " - 142s - loss: 0.4480 - acc: 0.7914 - val_loss: 0.4570 - val_acc: 0.7874\n", 895 | "Epoch 4/10\n", 896 | " - 154s - loss: 0.4418 - acc: 0.7948 - val_loss: 0.4522 - val_acc: 0.7865\n", 897 | "Epoch 5/10\n", 898 | " - 157s - loss: 0.4367 - acc: 0.7981 - val_loss: 0.4567 - val_acc: 0.7865\n", 899 | "Epoch 6/10\n", 900 | " - 159s - loss: 0.4319 - acc: 0.8009 - val_loss: 0.4577 - val_acc: 0.7872\n", 901 | "Epoch 7/10\n", 902 | " - 156s - loss: 0.4276 - acc: 0.8032 - val_loss: 0.4586 - val_acc: 0.7904\n", 903 | "Epoch 8/10\n", 904 | " - 157s - loss: 0.4237 - acc: 0.8058 - val_loss: 0.4602 - val_acc: 0.7873\n", 905 | "Epoch 9/10\n", 906 | " - 154s - loss: 0.4208 - acc: 0.8073 - val_loss: 0.4645 - val_acc: 0.7857\n", 907 | "Epoch 10/10\n", 908 | " - 154s - loss: 0.4179 - acc: 0.8091 - val_loss: 0.4719 - val_acc: 0.7835\n" 909 | ] 910 | }, 911 | { 912 | "data": { 913 | "text/plain": [ 914 | "" 915 | ] 916 | }, 917 | "execution_count": 33, 918 | "metadata": {}, 919 | "output_type": "execute_result" 920 | } 921 | ], 922 | "source": [ 923 | "np.random.seed(seed)\n", 924 | "model_d2v_10 = Sequential()\n", 925 | "model_d2v_10.add(Dense(512, activation='relu', input_dim=200))\n", 926 | "model_d2v_10.add(Dense(1, activation='sigmoid'))\n", 927 | "model_d2v_10.compile(optimizer='adam',\n", 928 | " loss='binary_crossentropy',\n", 929 | " metrics=['accuracy'])\n", 930 | "\n", 931 | "model_d2v_10.fit(train_vecs_ugdbow_tgdmm, y_train, 
validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 932 | ] 933 | }, 934 | { 935 | "cell_type": "code", 936 | "execution_count": 34, 937 | "metadata": {}, 938 | "outputs": [ 939 | { 940 | "name": "stdout", 941 | "output_type": "stream", 942 | "text": [ 943 | "Train on 1564098 samples, validate on 15960 samples\n", 944 | "Epoch 1/10\n", 945 | " - 623s - loss: 0.4564 - acc: 0.7835 - val_loss: 0.4439 - val_acc: 0.7911\n", 946 | "Epoch 2/10\n", 947 | " - 955s - loss: 0.4355 - acc: 0.7952 - val_loss: 0.4398 - val_acc: 0.7945\n", 948 | "Epoch 3/10\n", 949 | " - 1239s - loss: 0.4264 - acc: 0.8003 - val_loss: 0.4431 - val_acc: 0.7978\n", 950 | "Epoch 4/10\n", 951 | " - 1270s - loss: 0.4190 - acc: 0.8044 - val_loss: 0.4404 - val_acc: 0.7964\n", 952 | "Epoch 5/10\n", 953 | " - 1329s - loss: 0.4131 - acc: 0.8070 - val_loss: 0.4452 - val_acc: 0.7954\n", 954 | "Epoch 6/10\n", 955 | " - 1503s - loss: 0.4080 - acc: 0.8093 - val_loss: 0.4429 - val_acc: 0.7937\n", 956 | "Epoch 7/10\n", 957 | " - 1516s - loss: 0.4034 - acc: 0.8116 - val_loss: 0.4433 - val_acc: 0.7964\n", 958 | "Epoch 8/10\n", 959 | " - 1401s - loss: 0.3995 - acc: 0.8137 - val_loss: 0.4583 - val_acc: 0.7937\n", 960 | "Epoch 9/10\n", 961 | " - 1445s - loss: 0.3961 - acc: 0.8153 - val_loss: 0.4540 - val_acc: 0.7934\n", 962 | "Epoch 10/10\n", 963 | " - 1530s - loss: 0.3930 - acc: 0.8166 - val_loss: 0.4583 - val_acc: 0.7957\n" 964 | ] 965 | }, 966 | { 967 | "data": { 968 | "text/plain": [ 969 | "" 970 | ] 971 | }, 972 | "execution_count": 34, 973 | "metadata": {}, 974 | "output_type": "execute_result" 975 | } 976 | ], 977 | "source": [ 978 | "np.random.seed(seed)\n", 979 | "model_d2v_11 = Sequential()\n", 980 | "model_d2v_11.add(Dense(512, activation='relu', input_dim=200))\n", 981 | "model_d2v_11.add(Dense(512, activation='relu'))\n", 982 | "model_d2v_11.add(Dense(1, activation='sigmoid'))\n", 983 | "model_d2v_11.compile(optimizer='adam',\n", 984 | " loss='binary_crossentropy',\n", 985 | " metrics=['accuracy'])\n", 986 | "\n", 987 | "model_d2v_11.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": 35, 993 | "metadata": {}, 994 | "outputs": [ 995 | { 996 | "name": "stdout", 997 | "output_type": "stream", 998 | "text": [ 999 | "Train on 1564098 samples, validate on 15960 samples\n", 1000 | "Epoch 1/10\n", 1001 | " - 1384s - loss: 0.4558 - acc: 0.7840 - val_loss: 0.4426 - val_acc: 0.7876\n", 1002 | "Epoch 2/10\n", 1003 | " - 2148s - loss: 0.4332 - acc: 0.7966 - val_loss: 0.4344 - val_acc: 0.7965\n", 1004 | "Epoch 3/10\n", 1005 | " - 2682s - loss: 0.4224 - acc: 0.8027 - val_loss: 0.4385 - val_acc: 0.7951\n", 1006 | "Epoch 4/10\n", 1007 | " - 2612s - loss: 0.4140 - acc: 0.8070 - val_loss: 0.4338 - val_acc: 0.7977\n", 1008 | "Epoch 5/10\n", 1009 | " - 2596s - loss: 0.4068 - acc: 0.8109 - val_loss: 0.4341 - val_acc: 0.7962\n", 1010 | "Epoch 6/10\n", 1011 | " - 2618s - loss: 0.4006 - acc: 0.8144 - val_loss: 0.4362 - val_acc: 0.7956\n", 1012 | "Epoch 7/10\n", 1013 | " - 2624s - loss: 0.3948 - acc: 0.8171 - val_loss: 0.4364 - val_acc: 0.7983\n", 1014 | "Epoch 8/10\n", 1015 | " - 2709s - loss: 0.3895 - acc: 0.8201 - val_loss: 0.4442 - val_acc: 0.7965\n", 1016 | "Epoch 9/10\n", 1017 | " - 2730s - loss: 0.3847 - acc: 0.8226 - val_loss: 0.4431 - val_acc: 0.7948\n", 1018 | "Epoch 10/10\n", 1019 | " - 2708s - loss: 0.3804 - acc: 0.8251 - val_loss: 
0.4458 - val_acc: 0.7917\n" 1020 | ] 1021 | }, 1022 | { 1023 | "data": { 1024 | "text/plain": [ 1025 | "" 1026 | ] 1027 | }, 1028 | "execution_count": 35, 1029 | "metadata": {}, 1030 | "output_type": "execute_result" 1031 | } 1032 | ], 1033 | "source": [ 1034 | "np.random.seed(seed)\n", 1035 | "model_d2v_12 = Sequential()\n", 1036 | "model_d2v_12.add(Dense(512, activation='relu', input_dim=200))\n", 1037 | "model_d2v_12.add(Dense(512, activation='relu'))\n", 1038 | "model_d2v_12.add(Dense(512, activation='relu'))\n", 1039 | "model_d2v_12.add(Dense(1, activation='sigmoid'))\n", 1040 | "model_d2v_12.compile(optimizer='adam',\n", 1041 | " loss='binary_crossentropy',\n", 1042 | " metrics=['accuracy'])\n", 1043 | "\n", 1044 | "model_d2v_12.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "markdown", 1049 | "metadata": {}, 1050 | "source": [ 1051 | "After trying 12 different models with a range of hidden layers (from 1 to 3) and a range of hidden nodes for each hidden layer (64, 128, 256, 512), below is the result I got. Best validation accuracy (79.93%) is from \"model_d2v_09\" at epoch 7, which has 3 hidden layers of 256 hidden nodes for each hidden layer." 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "markdown", 1056 | "metadata": {}, 1057 | "source": [ 1058 | "| model | input layer (nodes) | hidden layer (nodes) | output layer (nodes) | best validation accuracy | number of epochs for best validation accuracy |\n", 1059 | "|-------|--------------|--------------|------------------|--------|--------|\n", 1060 | "| model_d2v_01 | 1 (200) | 1 (64) relu | 1 (1) sigmoid | 78.77% | epoch 9 |\n", 1061 | "| model_d2v_02 | 1 (200) | 2 (64) relu | 1 (1) sigmoid | 79.55% | epoch 8 |\n", 1062 | "| model_d2v_03 | 1 (200) | 3 (64) relu | 1 (1) sigmoid | 79.37% | epoch 7 |\n", 1063 | "| model_d2v_04 | 1 (200) | 1 (128) relu | 1 (1) sigmoid | 79.14% | epoch 10 |\n", 1064 | "| model_d2v_05 | 1 (200) | 2 (128) relu | 1 (1) sigmoid | 79.60% | epoch 4 |\n", 1065 | "| model_d2v_06 | 1 (200) | 3 (128) relu | 1 (1) sigmoid | 79.75% | epoch 6 |\n", 1066 | "| model_d2v_07 | 1 (200) | 1 (256) relu | 1 (1) sigmoid | 79.12% | epoch 7 |\n", 1067 | "| model_d2v_08 | 1 (200) | 2 (256) relu | 1 (1) sigmoid | 79.51% | epoch 10 |\n", 1068 | "| model_d2v_09 | 1 (200) | 3 (256) relu | 1 (1) sigmoid | 79.93% | epoch 7 |\n", 1069 | "| model_d2v_10 | 1 (200) | 1 (512) relu | 1 (1) sigmoid | 79.04% | epoch 7 |\n", 1070 | "| model_d2v_11 | 1 (200) | 2 (512) relu | 1 (1) sigmoid | 79.78% | epoch 3 |\n", 1071 | "| model_d2v_12 | 1 (200) | 3 (512) relu | 1 (1) sigmoid | 79.83% | epoch 7 |" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "markdown", 1076 | "metadata": {}, 1077 | "source": [ 1078 | "Now I know which model gives me the best result, I will run the final model of \"model_d2v_09\", but this time with callback functions in Keras. I was not quite familiar with callback functions in Keras before I received a comment in my previous post. After I got the comment, I did some digging and found all the useful functions in Keras callbacks. Thanks to @rcshubha for the comment. With my final model of Doc2Vec below, I used \"checkpoint\" and \"earlystop\". 
You can set the \"checkpoint\" function with options, and with the below parameter setting, \"checkpoint\" will save the best performing model up until the point of running, and only if a new epoch outperforms the saved model it will save it as a new model. And \"early_stop\" I defined it as to monitor validation accuracy, and if it doesn't outperform the best validation accuracy so far for 5 epochs, it will stop." 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "code", 1083 | "execution_count": 36, 1084 | "metadata": {}, 1085 | "outputs": [ 1086 | { 1087 | "name": "stdout", 1088 | "output_type": "stream", 1089 | "text": [ 1090 | "Train on 1564098 samples, validate on 15960 samples\n", 1091 | "Epoch 1/100\n", 1092 | "Epoch 00001: val_acc improved from -inf to 0.79041, saving model to d2v_09_best_weights.01-0.7904.hdf5\n", 1093 | " - 354s - loss: 0.4579 - acc: 0.7827 - val_loss: 0.4448 - val_acc: 0.7904\n", 1094 | "Epoch 2/100\n", 1095 | "Epoch 00002: val_acc improved from 0.79041 to 0.79674, saving model to d2v_09_best_weights.02-0.7967.hdf5\n", 1096 | " - 494s - loss: 0.4371 - acc: 0.7944 - val_loss: 0.4401 - val_acc: 0.7967\n", 1097 | "Epoch 3/100\n", 1098 | "Epoch 00003: val_acc did not improve\n", 1099 | " - 635s - loss: 0.4287 - acc: 0.7987 - val_loss: 0.4396 - val_acc: 0.7948\n", 1100 | "Epoch 4/100\n", 1101 | "Epoch 00004: val_acc did not improve\n", 1102 | " - 656s - loss: 0.4229 - acc: 0.8019 - val_loss: 0.4369 - val_acc: 0.7957\n", 1103 | "Epoch 5/100\n", 1104 | "Epoch 00005: val_acc did not improve\n", 1105 | " - 665s - loss: 0.4182 - acc: 0.8046 - val_loss: 0.4353 - val_acc: 0.7953\n", 1106 | "Epoch 6/100\n", 1107 | "Epoch 00006: val_acc improved from 0.79674 to 0.79743, saving model to d2v_09_best_weights.06-0.7974.hdf5\n", 1108 | " - 670s - loss: 0.4146 - acc: 0.8063 - val_loss: 0.4363 - val_acc: 0.7974\n", 1109 | "Epoch 7/100\n", 1110 | "Epoch 00007: val_acc improved from 0.79743 to 0.79931, saving model to d2v_09_best_weights.07-0.7993.hdf5\n", 1111 | " - 678s - loss: 0.4115 - acc: 0.8079 - val_loss: 0.4403 - val_acc: 0.7993\n", 1112 | "Epoch 8/100\n", 1113 | "Epoch 00008: val_acc did not improve\n", 1114 | " - 678s - loss: 0.4087 - acc: 0.8094 - val_loss: 0.4437 - val_acc: 0.7964\n", 1115 | "Epoch 9/100\n", 1116 | "Epoch 00009: val_acc did not improve\n", 1117 | " - 681s - loss: 0.4061 - acc: 0.8107 - val_loss: 0.4435 - val_acc: 0.7926\n", 1118 | "Epoch 10/100\n", 1119 | "Epoch 00010: val_acc did not improve\n", 1120 | " - 681s - loss: 0.4037 - acc: 0.8118 - val_loss: 0.4411 - val_acc: 0.7952\n", 1121 | "Epoch 11/100\n", 1122 | "Epoch 00011: val_acc did not improve\n", 1123 | " - 681s - loss: 0.4019 - acc: 0.8128 - val_loss: 0.4459 - val_acc: 0.7898\n", 1124 | "Epoch 12/100\n", 1125 | "Epoch 00012: val_acc did not improve\n", 1126 | " - 680s - loss: 0.4001 - acc: 0.8136 - val_loss: 0.4493 - val_acc: 0.7877\n" 1127 | ] 1128 | }, 1129 | { 1130 | "data": { 1131 | "text/plain": [ 1132 | "" 1133 | ] 1134 | }, 1135 | "execution_count": 36, 1136 | "metadata": {}, 1137 | "output_type": "execute_result" 1138 | } 1139 | ], 1140 | "source": [ 1141 | "from keras.callbacks import ModelCheckpoint, EarlyStopping\n", 1142 | "\n", 1143 | "filepath=\"d2v_09_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5\"\n", 1144 | "checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')\n", 1145 | "early_stop = EarlyStopping(monitor='val_acc', patience=5, mode='max') \n", 1146 | "callbacks_list = [checkpoint, early_stop]\n", 1147 | "np.random.seed(seed)\n", 1148 
| "model_d2v_09_es = Sequential()\n", 1149 | "model_d2v_09_es.add(Dense(256, activation='relu', input_dim=200))\n", 1150 | "model_d2v_09_es.add(Dense(256, activation='relu'))\n", 1151 | "model_d2v_09_es.add(Dense(256, activation='relu'))\n", 1152 | "model_d2v_09_es.add(Dense(1, activation='sigmoid'))\n", 1153 | "model_d2v_09_es.compile(optimizer='adam',\n", 1154 | " loss='binary_crossentropy',\n", 1155 | " metrics=['accuracy'])\n", 1156 | "\n", 1157 | "model_d2v_09_es.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), \n", 1158 | " epochs=100, batch_size=32, verbose=2, callbacks=callbacks_list)" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "markdown", 1163 | "metadata": {}, 1164 | "source": [ 1165 | "If I evaluate the model I just run, it will give me the result as same as I got from the last epoch." 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "code", 1170 | "execution_count": 37, 1171 | "metadata": {}, 1172 | "outputs": [ 1173 | { 1174 | "name": "stdout", 1175 | "output_type": "stream", 1176 | "text": [ 1177 | "15960/15960 [==============================] - 1s 33us/step \n" 1178 | ] 1179 | }, 1180 | { 1181 | "data": { 1182 | "text/plain": [ 1183 | "[0.4493457753556713, 0.787719298275491]" 1184 | ] 1185 | }, 1186 | "execution_count": 37, 1187 | "metadata": {}, 1188 | "output_type": "execute_result" 1189 | } 1190 | ], 1191 | "source": [ 1192 | "model_d2v_09_es.evaluate(x=validation_vecs_ugdbow_tgdmm, y=y_validation)" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "markdown", 1197 | "metadata": {}, 1198 | "source": [ 1199 | "But if I load the saved model at the best epoch, then this model will give me the result at that epoch." 1200 | ] 1201 | }, 1202 | { 1203 | "cell_type": "code", 1204 | "execution_count": 38, 1205 | "metadata": { 1206 | "collapsed": true 1207 | }, 1208 | "outputs": [], 1209 | "source": [ 1210 | "from keras.models import load_model\n", 1211 | "loaded_model = load_model('d2v_09_best_weights.07-0.7993.hdf5')" 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "code", 1216 | "execution_count": 39, 1217 | "metadata": {}, 1218 | "outputs": [ 1219 | { 1220 | "name": "stdout", 1221 | "output_type": "stream", 1222 | "text": [ 1223 | "15960/15960 [==============================] - 0s 17us/step\n" 1224 | ] 1225 | }, 1226 | { 1227 | "data": { 1228 | "text/plain": [ 1229 | "[0.4402723977739052, 0.7993107769722329]" 1230 | ] 1231 | }, 1232 | "execution_count": 39, 1233 | "metadata": {}, 1234 | "output_type": "execute_result" 1235 | } 1236 | ], 1237 | "source": [ 1238 | "loaded_model.evaluate(x=validation_vecs_ugdbow_tgdmm, y=y_validation)" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "markdown", 1243 | "metadata": {}, 1244 | "source": [ 1245 | "If you remember the validation accuracy with the same vector representation of the tweets with a logistic regression model (75.76%), you can see that feeding the same information to neural networks yields a significantly better result. It's amazing to see how neural network can boost the performance of dense vectors, but the best validation accuracy is still lower than the Tfidf vectors + logistic regression model, which gave me 82.92% validation accuracy. " 1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "markdown", 1250 | "metadata": {}, 1251 | "source": [ 1252 | "If you have read my posts on Doc2Vec, or familiar with Doc2Vec, you might know that you can also extract word vectors for each word from the trained Doc2Vec model. 
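As a quick illustration, here is a minimal sketch of what that looks like. It assumes the unigram DBOW model loaded above and the unigram DMM model loaded later in this notebook, and uses 'good' purely as a hypothetical example word that happens to be in their vocabulary:

# Word vectors are stored alongside the document vectors in a trained Doc2Vec model.
word = 'good'  # hypothetical example word
if word in model_ug_dbow.wv.vocab:
    dbow_vec = model_ug_dbow[word]           # 100-dimensional word vector from unigram DBOW
    dmm_vec = model_ug_dmm[word]             # 100-dimensional word vector from unigram DMM
    combined = np.append(dbow_vec, dmm_vec)  # 200 dimensions, matching the document vectors used above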
I will now move on to Word2Vec and try different methods to see whether any of them can outperform the Doc2Vec result (79.93%) and, ultimately, the Tfidf + logistic regression model (82.92%)." 1253 | ] 1254 | }, 1255 | { 1256 | "cell_type": "markdown", 1257 | "metadata": {}, 1258 | "source": [ 1259 | "# Word2Vec" 1260 | ] 1261 | }, 1262 | { 1263 | "cell_type": "markdown", 1264 | "metadata": {}, 1265 | "source": [ 1266 | "To make use of the word vectors extracted from the Doc2Vec models, I can no longer use the concatenated vectors of different n-grams, since those models do not share the same vocabulary. Below, I therefore load the unigram DMM model and, for each word in the vocabulary, concatenate its unigram DBOW and unigram DMM vectors into a single 200-dimensional word vector." 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "markdown", 1271 | "metadata": {}, 1272 | "source": [ 1273 | "Before I try neural networks on document representations computed from word vectors, I will first fit a logistic regression with each of the different document representation methods; with the one that gives the best validation accuracy, I will then define the neural network models." 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "markdown", 1278 | "metadata": {}, 1279 | "source": [ 1280 | "I will also summarise the logistic regression results for all of the different word vector representations in a table." 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "markdown", 1285 | "metadata": {}, 1286 | "source": [ 1287 | "## Word vectors extracted from Doc2Vec models (Average/Sum)" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "markdown", 1292 | "metadata": {}, 1293 | "source": [ 1294 | "There are a number of different ways to build a document representation from individual word vectors. One obvious choice is to average them: for every word in a tweet, check whether the trained Doc2Vec model has a word vector for it; if so, add it to a running sum while counting how many words had vectors, and finally divide the summed vector by the count. The averaged word vector for the whole document has the same dimension (200 in this case) as the individual word vectors." 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "markdown", 1299 | "metadata": {}, 1300 | "source": [ 1301 | "Another method is simply to sum the word vectors without averaging. This might distort the document representation, since some tweets have only a few words in the Doc2Vec vocabulary while others have most of their words in it. I will try both summing and averaging and compare the results."
1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "code", 1306 | "execution_count": 13, 1307 | "metadata": { 1308 | "collapsed": true 1309 | }, 1310 | "outputs": [], 1311 | "source": [ 1312 | "from sklearn.linear_model import LogisticRegression\n", 1313 | "from sklearn.preprocessing import scale" 1314 | ] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": 14, 1319 | "metadata": { 1320 | "collapsed": true 1321 | }, 1322 | "outputs": [], 1323 | "source": [ 1324 | "model_ug_dmm = Doc2Vec.load('d2v_model_ug_dmm.doc2vec')\n", 1325 | "model_ug_dmm.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)" 1326 | ] 1327 | }, 1328 | { 1329 | "cell_type": "code", 1330 | "execution_count": 14, 1331 | "metadata": { 1332 | "collapsed": true 1333 | }, 1334 | "outputs": [], 1335 | "source": [ 1336 | "def get_w2v_ugdbowdmm(tweet, size):\n", 1337 | " vec = np.zeros(size).reshape((1, size))\n", 1338 | " count = 0.\n", 1339 | " for word in tweet.split():\n", 1340 | " try:\n", 1341 | " vec += np.append(model_ug_dbow[word],model_ug_dmm[word]).reshape((1, size))\n", 1342 | " count += 1.\n", 1343 | " except KeyError:\n", 1344 | " continue\n", 1345 | " if count != 0:\n", 1346 | " vec /= count\n", 1347 | " return vec" 1348 | ] 1349 | }, 1350 | { 1351 | "cell_type": "code", 1352 | "execution_count": 15, 1353 | "metadata": { 1354 | "collapsed": true 1355 | }, 1356 | "outputs": [], 1357 | "source": [ 1358 | "def get_w2v_ugdbowdmm_sum(tweet, size):\n", 1359 | " vec = np.zeros(size).reshape((1, size))\n", 1360 | " for word in tweet.split():\n", 1361 | " try:\n", 1362 | " vec += np.append(model_ug_dbow[word],model_ug_dmm[word]).reshape((1, size))\n", 1363 | " except KeyError:\n", 1364 | " continue\n", 1365 | " return vec" 1366 | ] 1367 | }, 1368 | { 1369 | "cell_type": "code", 1370 | "execution_count": 24, 1371 | "metadata": { 1372 | "collapsed": true 1373 | }, 1374 | "outputs": [], 1375 | "source": [ 1376 | "train_vecs_w2v_dbowdmm = np.concatenate([get_w2v_ugdbowdmm(z, 200) for z in x_train])\n", 1377 | "validation_vecs_w2v_dbowdmm = np.concatenate([get_w2v_ugdbowdmm(z, 200) for z in x_validation])" 1378 | ] 1379 | }, 1380 | { 1381 | "cell_type": "code", 1382 | "execution_count": 17, 1383 | "metadata": {}, 1384 | "outputs": [ 1385 | { 1386 | "name": "stdout", 1387 | "output_type": "stream", 1388 | "text": [ 1389 | "CPU times: user 6min 13s, sys: 2min 34s, total: 8min 48s\n", 1390 | "Wall time: 10min 58s\n" 1391 | ] 1392 | } 1393 | ], 1394 | "source": [ 1395 | "%%time\n", 1396 | "clf = LogisticRegression()\n", 1397 | "clf.fit(train_vecs_w2v_dbowdmm, y_train)" 1398 | ] 1399 | }, 1400 | { 1401 | "cell_type": "code", 1402 | "execution_count": 18, 1403 | "metadata": {}, 1404 | "outputs": [ 1405 | { 1406 | "data": { 1407 | "text/plain": [ 1408 | "0.7173558897243107" 1409 | ] 1410 | }, 1411 | "execution_count": 18, 1412 | "metadata": {}, 1413 | "output_type": "execute_result" 1414 | } 1415 | ], 1416 | "source": [ 1417 | "clf.score(validation_vecs_w2v_dbowdmm, y_validation)" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "markdown", 1422 | "metadata": {}, 1423 | "source": [ 1424 | "The validation accuracy with averaged word vectors of unigram DBOW + unigram DMM is 71.74%, which is significantly lower than document vectors extracted from unigram DBOW + trigram DMM (75.76%), and also from the results I got from the 6th part of this series, I know that document vectors extracted from unigram DBOW + unigram DMM will give me 75.51% validation accuracy." 
1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "markdown", 1429 | "metadata": {}, 1430 | "source": [ 1431 | "I also tried scaling the vectors using ScikitLearn's scale function, and saw significant improvement in computation time and a slight improvement of the accuracy." 1432 | ] 1433 | }, 1434 | { 1435 | "cell_type": "code", 1436 | "execution_count": 25, 1437 | "metadata": { 1438 | "collapsed": true 1439 | }, 1440 | "outputs": [], 1441 | "source": [ 1442 | "train_vecs_w2v_dbowdmm_s = scale(train_vecs_w2v_dbowdmm)\n", 1443 | "validation_vecs_w2v_dbowdmm_s = scale(validation_vecs_w2v_dbowdmm)" 1444 | ] 1445 | }, 1446 | { 1447 | "cell_type": "code", 1448 | "execution_count": 26, 1449 | "metadata": {}, 1450 | "outputs": [ 1451 | { 1452 | "name": "stdout", 1453 | "output_type": "stream", 1454 | "text": [ 1455 | "CPU times: user 1min 11s, sys: 34.6 s, total: 1min 46s\n", 1456 | "Wall time: 2min 29s\n" 1457 | ] 1458 | } 1459 | ], 1460 | "source": [ 1461 | "%%time\n", 1462 | "clf = LogisticRegression()\n", 1463 | "clf.fit(train_vecs_w2v_dbowdmm_s, y_train)" 1464 | ] 1465 | }, 1466 | { 1467 | "cell_type": "code", 1468 | "execution_count": 27, 1469 | "metadata": {}, 1470 | "outputs": [ 1471 | { 1472 | "data": { 1473 | "text/plain": [ 1474 | "0.7241854636591478" 1475 | ] 1476 | }, 1477 | "execution_count": 27, 1478 | "metadata": {}, 1479 | "output_type": "execute_result" 1480 | } 1481 | ], 1482 | "source": [ 1483 | "clf.score(validation_vecs_w2v_dbowdmm_s, y_validation)" 1484 | ] 1485 | }, 1486 | { 1487 | "cell_type": "markdown", 1488 | "metadata": {}, 1489 | "source": [ 1490 | "Let's see how summed word vectors perform compared to the averaged counter part." 1491 | ] 1492 | }, 1493 | { 1494 | "cell_type": "code", 1495 | "execution_count": 16, 1496 | "metadata": { 1497 | "collapsed": true 1498 | }, 1499 | "outputs": [], 1500 | "source": [ 1501 | "train_vecs_w2v_dbowdmm_sum = np.concatenate([get_w2v_ugdbowdmm_sum(z, 200) for z in x_train])\n", 1502 | "validation_vecs_w2v_dbowdmm_sum = np.concatenate([get_w2v_ugdbowdmm_sum(z, 200) for z in x_validation])" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "code", 1507 | "execution_count": 17, 1508 | "metadata": {}, 1509 | "outputs": [ 1510 | { 1511 | "name": "stdout", 1512 | "output_type": "stream", 1513 | "text": [ 1514 | "CPU times: user 22min 13s, sys: 1h 29min 43s, total: 1h 51min 57s\n", 1515 | "Wall time: 3h 28min 17s\n" 1516 | ] 1517 | } 1518 | ], 1519 | "source": [ 1520 | "%%time\n", 1521 | "clf = LogisticRegression()\n", 1522 | "clf.fit(train_vecs_w2v_dbowdmm_sum, y_train)" 1523 | ] 1524 | }, 1525 | { 1526 | "cell_type": "code", 1527 | "execution_count": 18, 1528 | "metadata": {}, 1529 | "outputs": [ 1530 | { 1531 | "data": { 1532 | "text/plain": [ 1533 | "0.7251253132832081" 1534 | ] 1535 | }, 1536 | "execution_count": 18, 1537 | "metadata": {}, 1538 | "output_type": "execute_result" 1539 | } 1540 | ], 1541 | "source": [ 1542 | "clf.score(validation_vecs_w2v_dbowdmm_sum, y_validation)" 1543 | ] 1544 | }, 1545 | { 1546 | "cell_type": "markdown", 1547 | "metadata": {}, 1548 | "source": [ 1549 | "The summation method gave me higher accuracy without scaling compared to the average method. But the simple logistic regression with the summed vectors took more than 3 hours to run. So again I tried scaling these vectors." 
1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "code", 1554 | "execution_count": 19, 1555 | "metadata": { 1556 | "collapsed": true 1557 | }, 1558 | "outputs": [], 1559 | "source": [ 1560 | "train_vecs_w2v_dbowdmm_sum_s = scale(train_vecs_w2v_dbowdmm_sum)\n", 1561 | "validation_vecs_w2v_dbowdmm_sum_s = scale(validation_vecs_w2v_dbowdmm_sum)" 1562 | ] 1563 | }, 1564 | { 1565 | "cell_type": "code", 1566 | "execution_count": 22, 1567 | "metadata": {}, 1568 | "outputs": [ 1569 | { 1570 | "name": "stdout", 1571 | "output_type": "stream", 1572 | "text": [ 1573 | "CPU times: user 2min 2s, sys: 48.4 s, total: 2min 51s\n", 1574 | "Wall time: 3min 41s\n" 1575 | ] 1576 | } 1577 | ], 1578 | "source": [ 1579 | "%%time\n", 1580 | "clf = LogisticRegression()\n", 1581 | "clf.fit(train_vecs_w2v_dbowdmm_sum_s, y_train)" 1582 | ] 1583 | }, 1584 | { 1585 | "cell_type": "code", 1586 | "execution_count": 23, 1587 | "metadata": {}, 1588 | "outputs": [ 1589 | { 1590 | "data": { 1591 | "text/plain": [ 1592 | "0.725250626566416" 1593 | ] 1594 | }, 1595 | "execution_count": 23, 1596 | "metadata": {}, 1597 | "output_type": "execute_result" 1598 | } 1599 | ], 1600 | "source": [ 1601 | "clf.score(validation_vecs_w2v_dbowdmm_sum_s, y_validation)" 1602 | ] 1603 | }, 1604 | { 1605 | "cell_type": "markdown", 1606 | "metadata": {}, 1607 | "source": [ 1608 | "Surprising! With scaling, logistic regression fitting only took 3 minutes! That's quite a difference." 1609 | ] 1610 | }, 1611 | { 1612 | "cell_type": "markdown", 1613 | "metadata": {}, 1614 | "source": [ 1615 | "## Word vectors extracted from Doc2Vec models with TFIDF weighting (Average/Sum)" 1616 | ] 1617 | }, 1618 | { 1619 | "cell_type": "markdown", 1620 | "metadata": {}, 1621 | "source": [ 1622 | "In the 5th part of this series, I have already explained what TF-IDF is. TF-IDF is a way of weighting each word by calculating the product of relative term frequency and inverse document frequency. Since it gives one scalar value for each word in the vocabulary, this can also be used as a weighting factor of each word vectors. Correa Jr. et al (2017) has implemented this Tf-idf weighting in their paper \"NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment Analysis\" http://www.aclweb.org/anthology/S17-2100" 1623 | ] 1624 | }, 1625 | { 1626 | "cell_type": "markdown", 1627 | "metadata": {}, 1628 | "source": [ 1629 | "In order to get the Tfidf value for each word, I first fit and transform the training set with TfidfVectorizer and create a dictionary containing \"word\", \"tfidf value\" pairs." 1630 | ] 1631 | }, 1632 | { 1633 | "cell_type": "code", 1634 | "execution_count": 21, 1635 | "metadata": {}, 1636 | "outputs": [ 1637 | { 1638 | "name": "stderr", 1639 | "output_type": "stream", 1640 | "text": [ 1641 | "/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:1089: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 1642 | " if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):\n" 1643 | ] 1644 | }, 1645 | { 1646 | "name": "stdout", 1647 | "output_type": "stream", 1648 | "text": [ 1649 | "vocab size : 103691\n" 1650 | ] 1651 | } 1652 | ], 1653 | "source": [ 1654 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1655 | "tvec = TfidfVectorizer(min_df=2)\n", 1656 | "tvec.fit_transform(x_train)\n", 1657 | "tfidf = dict(zip(tvec.get_feature_names(), tvec.idf_))\n", 1658 | "print 'vocab size :', len(tfidf)" 1659 | ] 1660 | }, 1661 | { 1662 | "cell_type": "code", 1663 | "execution_count": 33, 1664 | "metadata": {}, 1665 | "outputs": [ 1666 | { 1667 | "data": { 1668 | "text/plain": [ 1669 | "103691" 1670 | ] 1671 | }, 1672 | "execution_count": 33, 1673 | "metadata": {}, 1674 | "output_type": "execute_result" 1675 | } 1676 | ], 1677 | "source": [ 1678 | "len(set(model_ug_dbow.wv.vocab.keys()) & set(tvec.get_feature_names()))" 1679 | ] 1680 | }, 1681 | { 1682 | "cell_type": "code", 1683 | "execution_count": 1, 1684 | "metadata": { 1685 | "collapsed": true 1686 | }, 1687 | "outputs": [], 1688 | "source": [ 1689 | "def get_w2v_general(tweet, size, vectors, aggregation='mean'):\n", 1690 | " vec = np.zeros(size).reshape((1, size))\n", 1691 | " count = 0.\n", 1692 | " for word in tweet.split():\n", 1693 | " try:\n", 1694 | " vec += vectors[word].reshape((1, size))\n", 1695 | " count += 1.\n", 1696 | " except KeyError:\n", 1697 | " continue\n", 1698 | " if aggregation == 'mean':\n", 1699 | " if count != 0:\n", 1700 | " vec /= count\n", 1701 | " return vec\n", 1702 | " elif aggregation == 'sum':\n", 1703 | " return vec" 1704 | ] 1705 | }, 1706 | { 1707 | "cell_type": "markdown", 1708 | "metadata": {}, 1709 | "source": [ 1710 | "The below code can also be implemented within the word vector averaging or summing function, but it seems like it's taking quite a long time, so I separated this and tried to make a dictionary of word vectors weighted by Tfidf values. To be honest, I am still not sure why it took so long to compute the Tfidf weighting of the word vectors, but after 5 hours it finally finished computing. You can also see later that I tried another method of weighting but that took less than 10 seconds. If you have an answer to this, any insight would be appreciated." 
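One possible explanation for the long runtime of the weighting loop below (an assumption, not verified on the original machine): `tvec.get_feature_names()` rebuilds a Python list on every iteration, and `w in <list>` is a linear scan, so the loop is roughly quadratic in vocabulary size. A precomputed set makes each membership test O(1). The sketch assumes the `tfidf` dictionary and the `model_ug_dbow`/`model_ug_dmm` models defined earlier in the notebook; `tfidf_vocab` and `w2v_tfidf_fast` are hypothetical names.

```python
# Same weighting loop, but with an O(1) membership test.
# `tfidf` already maps each vocabulary word to its IDF value, so its keys can serve as the lookup set.
tfidf_vocab = set(tfidf.keys())

w2v_tfidf_fast = {}
for w in model_ug_dbow.wv.vocab.keys():
    if w in tfidf_vocab:  # set lookup instead of scanning get_feature_names() each time
        w2v_tfidf_fast[w] = np.append(model_ug_dbow[w], model_ug_dmm[w]) * tfidf[w]
```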
1711 | ] 1712 | }, 1713 | { 1714 | "cell_type": "code", 1715 | "execution_count": 22, 1716 | "metadata": {}, 1717 | "outputs": [ 1718 | { 1719 | "name": "stdout", 1720 | "output_type": "stream", 1721 | "text": [ 1722 | "CPU times: user 4h 53min 1s, sys: 6min 1s, total: 4h 59min 2s\n", 1723 | "Wall time: 4h 58min 17s\n" 1724 | ] 1725 | } 1726 | ], 1727 | "source": [ 1728 | "%%time\n", 1729 | "w2v_tfidf = {}\n", 1730 | "for w in model_ug_dbow.wv.vocab.keys():\n", 1731 | " if w in tvec.get_feature_names():\n", 1732 | " w2v_tfidf[w] = np.append(model_ug_dbow[w],model_ug_dmm[w]) * tfidf[w]" 1733 | ] 1734 | }, 1735 | { 1736 | "cell_type": "code", 1737 | "execution_count": 25, 1738 | "metadata": { 1739 | "collapsed": true 1740 | }, 1741 | "outputs": [], 1742 | "source": [ 1743 | "import cPickle as pickle\n", 1744 | "with open('w2v_tfidf.p', 'wb') as fp:\n", 1745 | " pickle.dump(w2v_tfidf, fp, protocol=pickle.HIGHEST_PROTOCOL)" 1746 | ] 1747 | }, 1748 | { 1749 | "cell_type": "code", 1750 | "execution_count": 33, 1751 | "metadata": { 1752 | "collapsed": true 1753 | }, 1754 | "outputs": [], 1755 | "source": [ 1756 | "import cPickle as pickle\n", 1757 | "with open('w2v_tfidf.p', 'rb') as fp:\n", 1758 | " w2v_tfidf = pickle.load(fp)" 1759 | ] 1760 | }, 1761 | { 1762 | "cell_type": "code", 1763 | "execution_count": 37, 1764 | "metadata": {}, 1765 | "outputs": [ 1766 | { 1767 | "name": "stdout", 1768 | "output_type": "stream", 1769 | "text": [ 1770 | "CPU times: user 1min 18s, sys: 22.4 s, total: 1min 40s\n", 1771 | "Wall time: 1min 58s\n" 1772 | ] 1773 | } 1774 | ], 1775 | "source": [ 1776 | "%%time\n", 1777 | "train_vecs_w2v_tfidf_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'mean') for z in x_train]))\n", 1778 | "validation_vecs_w2v_tfidf_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'mean') for z in x_validation]))" 1779 | ] 1780 | }, 1781 | { 1782 | "cell_type": "code", 1783 | "execution_count": 38, 1784 | "metadata": {}, 1785 | "outputs": [ 1786 | { 1787 | "name": "stdout", 1788 | "output_type": "stream", 1789 | "text": [ 1790 | "CPU times: user 52.4 s, sys: 28.7 s, total: 1min 21s\n", 1791 | "Wall time: 1min 52s\n" 1792 | ] 1793 | } 1794 | ], 1795 | "source": [ 1796 | "%%time\n", 1797 | "clf = LogisticRegression()\n", 1798 | "clf.fit(train_vecs_w2v_tfidf_mean, y_train)" 1799 | ] 1800 | }, 1801 | { 1802 | "cell_type": "code", 1803 | "execution_count": 39, 1804 | "metadata": {}, 1805 | "outputs": [ 1806 | { 1807 | "data": { 1808 | "text/plain": [ 1809 | "0.7057017543859649" 1810 | ] 1811 | }, 1812 | "execution_count": 39, 1813 | "metadata": {}, 1814 | "output_type": "execute_result" 1815 | } 1816 | ], 1817 | "source": [ 1818 | "clf.score(validation_vecs_w2v_tfidf_mean, y_validation)" 1819 | ] 1820 | }, 1821 | { 1822 | "cell_type": "code", 1823 | "execution_count": 40, 1824 | "metadata": {}, 1825 | "outputs": [ 1826 | { 1827 | "name": "stdout", 1828 | "output_type": "stream", 1829 | "text": [ 1830 | "CPU times: user 1min 13s, sys: 20.8 s, total: 1min 34s\n", 1831 | "Wall time: 1min 52s\n" 1832 | ] 1833 | } 1834 | ], 1835 | "source": [ 1836 | "%%time\n", 1837 | "train_vecs_w2v_tfidf_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'sum') for z in x_train]))\n", 1838 | "validation_vecs_w2v_tfidf_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'sum') for z in x_validation]))" 1839 | ] 1840 | }, 1841 | { 1842 | "cell_type": "code", 1843 | "execution_count": 41, 1844 | "metadata": {}, 1845 | "outputs": [ 1846 | { 1847 | "name": "stdout", 
1848 | "output_type": "stream", 1849 | "text": [ 1850 | "CPU times: user 1min 16s, sys: 29.7 s, total: 1min 46s\n", 1851 | "Wall time: 2min 18s\n" 1852 | ] 1853 | } 1854 | ], 1855 | "source": [ 1856 | "%%time\n", 1857 | "clf = LogisticRegression()\n", 1858 | "clf.fit(train_vecs_w2v_tfidf_sum, y_train)" 1859 | ] 1860 | }, 1861 | { 1862 | "cell_type": "code", 1863 | "execution_count": 42, 1864 | "metadata": {}, 1865 | "outputs": [ 1866 | { 1867 | "data": { 1868 | "text/plain": [ 1869 | "0.7031954887218045" 1870 | ] 1871 | }, 1872 | "execution_count": 42, 1873 | "metadata": {}, 1874 | "output_type": "execute_result" 1875 | } 1876 | ], 1877 | "source": [ 1878 | "clf.score(validation_vecs_w2v_tfidf_sum, y_validation)" 1879 | ] 1880 | }, 1881 | { 1882 | "cell_type": "markdown", 1883 | "metadata": {}, 1884 | "source": [ 1885 | "The result is not what I expected, especially after 5 hours of waiting. By weighting word vectors with Tfidf values, the validation accuracy dropped around 2% both for averaging and summing." 1886 | ] 1887 | }, 1888 | { 1889 | "cell_type": "markdown", 1890 | "metadata": {}, 1891 | "source": [ 1892 | "## Word vectors extracted from Doc2Vec models with custom weighting (Average/Sum)" 1893 | ] 1894 | }, 1895 | { 1896 | "cell_type": "markdown", 1897 | "metadata": {}, 1898 | "source": [ 1899 | "In the 3rd part of this series, I have defined a custom metric called \"pos_normcdf_hmean\", which is a metric borrowed from the presentation by Jason Kessler in PyData 2017 Seattle. If you want to know more in detail about the calculation, you can either check my previous post or you can also watch Jason Kessler's presentation. To give you a high-level intuition, by calculating harmonic mean of CDF(Cumulative Distribution Function) transformed values of term frequency rate within the whole document and the term frequency within a class, you can get a meaningful metric which shows how each word is related to a certain class." 1900 | ] 1901 | }, 1902 | { 1903 | "cell_type": "markdown", 1904 | "metadata": {}, 1905 | "source": [ 1906 | "I have used this metric to visualise tokens in the 3rd part of the series, and also used this again to create custom lexicon to be used for classification purpose in the 5th part. I will use this again as a weighting factor for the word vectors, and see how it affects the performance." 
1907 | ] 1908 | }, 1909 | { 1910 | "cell_type": "code", 1911 | "execution_count": 53, 1912 | "metadata": {}, 1913 | "outputs": [ 1914 | { 1915 | "data": { 1916 | "text/plain": [ 1917 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 1918 | " dtype=, encoding=u'utf-8', input=u'content',\n", 1919 | " lowercase=True, max_df=1.0, max_features=100000, min_df=1,\n", 1920 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 1921 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 1922 | " tokenizer=None, vocabulary=None)" 1923 | ] 1924 | }, 1925 | "execution_count": 53, 1926 | "metadata": {}, 1927 | "output_type": "execute_result" 1928 | } 1929 | ], 1930 | "source": [ 1931 | "from sklearn.feature_extraction.text import CountVectorizer\n", 1932 | "cvec = CountVectorizer(max_features=100000)\n", 1933 | "cvec.fit(x_train)" 1934 | ] 1935 | }, 1936 | { 1937 | "cell_type": "code", 1938 | "execution_count": 54, 1939 | "metadata": { 1940 | "collapsed": true 1941 | }, 1942 | "outputs": [], 1943 | "source": [ 1944 | "neg_train = x_train[y_train == 0]\n", 1945 | "pos_train = x_train[y_train == 1]\n", 1946 | "neg_doc_matrix = cvec.transform(neg_train)\n", 1947 | "pos_doc_matrix = cvec.transform(pos_train)\n", 1948 | "neg_tf = np.sum(neg_doc_matrix,axis=0)\n", 1949 | "pos_tf = np.sum(pos_doc_matrix,axis=0)" 1950 | ] 1951 | }, 1952 | { 1953 | "cell_type": "code", 1954 | "execution_count": 55, 1955 | "metadata": {}, 1956 | "outputs": [ 1957 | { 1958 | "data": { 1959 | "text/html": [ 1960 | "
\n", 1961 | "\n", 1974 | "\n", 1975 | " \n", 1976 | " \n", 1977 | " \n", 1978 | " \n", 1979 | " \n", 1980 | " \n", 1981 | " \n", 1982 | " \n", 1983 | " \n", 1984 | " \n", 1985 | " \n", 1986 | " \n", 1987 | " \n", 1988 | " \n", 1989 | " \n", 1990 | " \n", 1991 | " \n", 1992 | " \n", 1993 | " \n", 1994 | " \n", 1995 | " \n", 1996 | " \n", 1997 | " \n", 1998 | " \n", 1999 | " \n", 2000 | " \n", 2001 | " \n", 2002 | " \n", 2003 | " \n", 2004 | " \n", 2005 | " \n", 2006 | " \n", 2007 | " \n", 2008 | " \n", 2009 | " \n", 2010 | " \n", 2011 | " \n", 2012 | " \n", 2013 | " \n", 2014 | " \n", 2015 | " \n", 2016 | " \n", 2017 | " \n", 2018 | " \n", 2019 | " \n", 2020 | " \n", 2021 | " \n", 2022 | " \n", 2023 | " \n", 2024 | " \n", 2025 | " \n", 2026 | " \n", 2027 | " \n", 2028 | " \n", 2029 | " \n", 2030 | " \n", 2031 | " \n", 2032 | " \n", 2033 | " \n", 2034 | " \n", 2035 | " \n", 2036 | " \n", 2037 | " \n", 2038 | " \n", 2039 | " \n", 2040 | " \n", 2041 | " \n", 2042 | " \n", 2043 | " \n", 2044 | " \n", 2045 | " \n", 2046 | " \n", 2047 | " \n", 2048 | " \n", 2049 | " \n", 2050 | " \n", 2051 | " \n", 2052 | " \n", 2053 | " \n", 2054 | " \n", 2055 | " \n", 2056 | " \n", 2057 | " \n", 2058 | " \n", 2059 | " \n", 2060 | " \n", 2061 | " \n", 2062 | " \n", 2063 | " \n", 2064 | " \n", 2065 | " \n", 2066 | " \n", 2067 | " \n", 2068 | " \n", 2069 | " \n", 2070 | " \n", 2071 | " \n", 2072 | " \n", 2073 | " \n", 2074 | " \n", 2075 | " \n", 2076 | " \n", 2077 | " \n", 2078 | " \n", 2079 | " \n", 2080 | " \n", 2081 | " \n", 2082 | " \n", 2083 | " \n", 2084 | " \n", 2085 | " \n", 2086 | " \n", 2087 | " \n", 2088 | " \n", 2089 | " \n", 2090 | " \n", 2091 | " \n", 2092 | " \n", 2093 | " \n", 2094 | " \n", 2095 | " \n", 2096 | " \n", 2097 | " \n", 2098 | " \n", 2099 | " \n", 2100 | "
negativepositivetotalpos_ratepos_freq_pctpos_rate_normcdfpos_freq_pct_normcdfpos_normcdf_hmean
welcome610656571750.9149830.0007520.9129720.9994740.954267
thank223415428176620.8735140.0017680.8881811.0000000.940779
thanks564633697393430.8564930.0038620.8766641.0000000.934279
congrats451325437050.8782730.0003730.8912580.9453740.917519
followfriday167266528320.9410310.0003050.9262920.9038280.914922
awesome373514189179240.7916200.0016260.8252971.0000000.904288
hello1104442555290.8003260.0005070.8328850.9858750.902945
hehe960396649260.8051160.0004540.8369690.9750980.900769
glad22258086103110.7842110.0009270.8186680.9999740.900284
follow24988977114750.7823090.0010290.8169420.9999970.899248
\n", 2101 | "
" 2102 | ], 2103 | "text/plain": [ 2104 | " negative positive total pos_rate pos_freq_pct \\\n", 2105 | "welcome 610 6565 7175 0.914983 0.000752 \n", 2106 | "thank 2234 15428 17662 0.873514 0.001768 \n", 2107 | "thanks 5646 33697 39343 0.856493 0.003862 \n", 2108 | "congrats 451 3254 3705 0.878273 0.000373 \n", 2109 | "followfriday 167 2665 2832 0.941031 0.000305 \n", 2110 | "awesome 3735 14189 17924 0.791620 0.001626 \n", 2111 | "hello 1104 4425 5529 0.800326 0.000507 \n", 2112 | "hehe 960 3966 4926 0.805116 0.000454 \n", 2113 | "glad 2225 8086 10311 0.784211 0.000927 \n", 2114 | "follow 2498 8977 11475 0.782309 0.001029 \n", 2115 | "\n", 2116 | " pos_rate_normcdf pos_freq_pct_normcdf pos_normcdf_hmean \n", 2117 | "welcome 0.912972 0.999474 0.954267 \n", 2118 | "thank 0.888181 1.000000 0.940779 \n", 2119 | "thanks 0.876664 1.000000 0.934279 \n", 2120 | "congrats 0.891258 0.945374 0.917519 \n", 2121 | "followfriday 0.926292 0.903828 0.914922 \n", 2122 | "awesome 0.825297 1.000000 0.904288 \n", 2123 | "hello 0.832885 0.985875 0.902945 \n", 2124 | "hehe 0.836969 0.975098 0.900769 \n", 2125 | "glad 0.818668 0.999974 0.900284 \n", 2126 | "follow 0.816942 0.999997 0.899248 " 2127 | ] 2128 | }, 2129 | "execution_count": 55, 2130 | "metadata": {}, 2131 | "output_type": "execute_result" 2132 | } 2133 | ], 2134 | "source": [ 2135 | "from scipy.stats import hmean\n", 2136 | "from scipy.stats import norm\n", 2137 | "def normcdf(x):\n", 2138 | " return norm.cdf(x, x.mean(), x.std())\n", 2139 | "\n", 2140 | "neg = np.squeeze(np.asarray(neg_tf))\n", 2141 | "pos = np.squeeze(np.asarray(pos_tf))\n", 2142 | "term_freq_df2 = pd.DataFrame([neg,pos],columns=cvec.get_feature_names()).transpose()\n", 2143 | "term_freq_df2.columns = ['negative', 'positive']\n", 2144 | "term_freq_df2['total'] = term_freq_df2['negative'] + term_freq_df2['positive']\n", 2145 | "term_freq_df2['pos_rate'] = term_freq_df2['positive'] * 1./term_freq_df2['total']\n", 2146 | "term_freq_df2['pos_freq_pct'] = term_freq_df2['positive'] * 1./term_freq_df2['positive'].sum()\n", 2147 | "term_freq_df2['pos_rate_normcdf'] = normcdf(term_freq_df2['pos_rate'])\n", 2148 | "term_freq_df2['pos_freq_pct_normcdf'] = normcdf(term_freq_df2['pos_freq_pct'])\n", 2149 | "term_freq_df2['pos_normcdf_hmean'] = hmean([term_freq_df2['pos_rate_normcdf'], term_freq_df2['pos_freq_pct_normcdf']])\n", 2150 | "term_freq_df2.sort_values(by='pos_normcdf_hmean', ascending=False).iloc[:10]" 2151 | ] 2152 | }, 2153 | { 2154 | "cell_type": "code", 2155 | "execution_count": 56, 2156 | "metadata": { 2157 | "collapsed": true 2158 | }, 2159 | "outputs": [], 2160 | "source": [ 2161 | "pos_hmean = term_freq_df2.pos_normcdf_hmean" 2162 | ] 2163 | }, 2164 | { 2165 | "cell_type": "code", 2166 | "execution_count": 53, 2167 | "metadata": {}, 2168 | "outputs": [ 2169 | { 2170 | "name": "stdout", 2171 | "output_type": "stream", 2172 | "text": [ 2173 | "CPU times: user 4.81 s, sys: 1.93 s, total: 6.75 s\n", 2174 | "Wall time: 9.51 s\n" 2175 | ] 2176 | } 2177 | ], 2178 | "source": [ 2179 | "%%time\n", 2180 | "w2v_pos_hmean = {}\n", 2181 | "for w in model_ug_dbow.wv.vocab.keys():\n", 2182 | " if w in pos_hmean.keys():\n", 2183 | " w2v_pos_hmean[w] = np.append(model_ug_dbow[w],model_ug_dmm[w]) * pos_hmean[w]" 2184 | ] 2185 | }, 2186 | { 2187 | "cell_type": "code", 2188 | "execution_count": 58, 2189 | "metadata": { 2190 | "collapsed": true 2191 | }, 2192 | "outputs": [], 2193 | "source": [ 2194 | "with open('w2v_hmean.p', 'wb') as fp:\n", 2195 | " pickle.dump(w2v_pos_hmean, fp, 
protocol=pickle.HIGHEST_PROTOCOL)" 2196 | ] 2197 | }, 2198 | { 2199 | "cell_type": "code", 2200 | "execution_count": 43, 2201 | "metadata": { 2202 | "collapsed": true 2203 | }, 2204 | "outputs": [], 2205 | "source": [ 2206 | "import cPickle as pickle\n", 2207 | "with open('w2v_hmean.p', 'rb') as fp:\n", 2208 | " w2v_pos_hmean = pickle.load(fp)" 2209 | ] 2210 | }, 2211 | { 2212 | "cell_type": "code", 2213 | "execution_count": 44, 2214 | "metadata": { 2215 | "collapsed": true 2216 | }, 2217 | "outputs": [], 2218 | "source": [ 2219 | "train_vecs_w2v_poshmean_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'mean') for z in x_train]))\n", 2220 | "validation_vecs_w2v_poshmean_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'mean') for z in x_validation]))" 2221 | ] 2222 | }, 2223 | { 2224 | "cell_type": "code", 2225 | "execution_count": 45, 2226 | "metadata": {}, 2227 | "outputs": [ 2228 | { 2229 | "name": "stdout", 2230 | "output_type": "stream", 2231 | "text": [ 2232 | "CPU times: user 1min 40s, sys: 1min 15s, total: 2min 55s\n", 2233 | "Wall time: 4min 20s\n" 2234 | ] 2235 | } 2236 | ], 2237 | "source": [ 2238 | "%%time\n", 2239 | "clf = LogisticRegression()\n", 2240 | "clf.fit(train_vecs_w2v_poshmean_mean, y_train)" 2241 | ] 2242 | }, 2243 | { 2244 | "cell_type": "code", 2245 | "execution_count": 46, 2246 | "metadata": {}, 2247 | "outputs": [ 2248 | { 2249 | "data": { 2250 | "text/plain": [ 2251 | "0.7327067669172932" 2252 | ] 2253 | }, 2254 | "execution_count": 46, 2255 | "metadata": {}, 2256 | "output_type": "execute_result" 2257 | } 2258 | ], 2259 | "source": [ 2260 | "clf.score(validation_vecs_w2v_poshmean_mean, y_validation)" 2261 | ] 2262 | }, 2263 | { 2264 | "cell_type": "code", 2265 | "execution_count": 47, 2266 | "metadata": { 2267 | "collapsed": true 2268 | }, 2269 | "outputs": [], 2270 | "source": [ 2271 | "train_vecs_w2v_poshmean_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'sum') for z in x_train]))\n", 2272 | "validation_vecs_w2v_poshmean_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'sum') for z in x_validation]))" 2273 | ] 2274 | }, 2275 | { 2276 | "cell_type": "code", 2277 | "execution_count": 48, 2278 | "metadata": {}, 2279 | "outputs": [ 2280 | { 2281 | "name": "stdout", 2282 | "output_type": "stream", 2283 | "text": [ 2284 | "CPU times: user 3min 23s, sys: 3min 51s, total: 7min 14s\n", 2285 | "Wall time: 11min 46s\n" 2286 | ] 2287 | } 2288 | ], 2289 | "source": [ 2290 | "%%time\n", 2291 | "clf = LogisticRegression()\n", 2292 | "clf.fit(train_vecs_w2v_poshmean_sum, y_train)" 2293 | ] 2294 | }, 2295 | { 2296 | "cell_type": "code", 2297 | "execution_count": 49, 2298 | "metadata": {}, 2299 | "outputs": [ 2300 | { 2301 | "data": { 2302 | "text/plain": [ 2303 | "0.7093984962406015" 2304 | ] 2305 | }, 2306 | "execution_count": 49, 2307 | "metadata": {}, 2308 | "output_type": "execute_result" 2309 | } 2310 | ], 2311 | "source": [ 2312 | "clf.score(validation_vecs_w2v_poshmean_sum, y_validation)" 2313 | ] 2314 | }, 2315 | { 2316 | "cell_type": "markdown", 2317 | "metadata": {}, 2318 | "source": [ 2319 | "Unlike Tfidf weighting, this time with custom weighting it actually gave me some performance boost when used with averaging method. But with summing, this weighting has performed no better than the word vectors without weighting." 
2320 | ] 2321 | }, 2322 | { 2323 | "cell_type": "markdown", 2324 | "metadata": {}, 2325 | "source": [ 2326 | "## Word vectors extracted from pre-trained GloVe (Average/Sum)" 2327 | ] 2328 | }, 2329 | { 2330 | "cell_type": "markdown", 2331 | "metadata": {}, 2332 | "source": [ 2333 | "GloVe is another kind of vector representation of words, proposed by Pennington et al. (2014) from the Stanford NLP Group. https://nlp.stanford.edu/pubs/glove.pdf" 2334 | ] 2335 | }, 2336 | { 2337 | "cell_type": "markdown", 2338 | "metadata": {}, 2339 | "source": [ 2340 | "The difference between Word2Vec and GloVe is how the two models compute the word vectors. In Word2Vec, the word vectors you get are a by-product of a shallow neural network that tries to predict either the centre word given its surrounding words or vice versa. With GloVe, the word vectors are the model's learned parameters themselves, computed from a term co-occurrence matrix through a form of dimensionality reduction." 2341 | ] 2342 | }, 2343 | { 2344 | "cell_type": "markdown", 2345 | "metadata": {}, 2346 | "source": [ 2347 | "The good news is that you can now easily load and use pre-trained GloVe vectors from Gensim thanks to its latest update (Gensim 3.2.0). In addition to pre-trained word vectors, new datasets have also been added, and these can be downloaded easily through its downloader API. If you want to know more about this, please check this blog post by RaRe Technologies. https://rare-technologies.com/new-download-api-for-pretrained-nlp-models-and-datasets-in-gensim/" 2348 | ] 2349 | }, 2350 | { 2351 | "cell_type": "markdown", 2352 | "metadata": {}, 2353 | "source": [ 2354 | "The Stanford NLP Group has made their pre-trained GloVe vectors publicly available, and among them are GloVe vectors trained specifically on Tweets. This sounds like something definitely worth trying. They have four versions of Tweet vectors, each with a different dimensionality (25, 50, 100, 200), trained on 2 billion Tweets. You can find more detail on their website. https://nlp.stanford.edu/projects/glove/" 2355 | ] 2356 | }, 2357 | { 2358 | "cell_type": "markdown", 2359 | "metadata": {}, 2360 | "source": [ 2361 | "For this post, I will use the 200-dimension pre-trained GloVe vectors." 
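Before pulling down the full vectors (a sizeable download), the downloader API can be queried for metadata first. A small sketch, assuming Gensim >= 3.2.0 with the `gensim.downloader` module; the exact fields returned by `api.info` may differ between Gensim versions.

```python
# Inspect what the Gensim downloader knows about the Twitter GloVe vectors before loading them.
import gensim.downloader as api

meta = api.info("glove-twitter-200")
print(meta.get("description"))
# Once the vectors are loaded in the next cell, a quick sanity check such as
# glove_twitter.most_similar("happy", topn=5) helps confirm they look reasonable.
```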
2362 | ] 2363 | }, 2364 | { 2365 | "cell_type": "code", 2366 | "execution_count": 16, 2367 | "metadata": { 2368 | "collapsed": true 2369 | }, 2370 | "outputs": [], 2371 | "source": [ 2372 | "import gensim.downloader as api\n", 2373 | "glove_twitter = api.load(\"glove-twitter-200\")" 2374 | ] 2375 | }, 2376 | { 2377 | "cell_type": "code", 2378 | "execution_count": 17, 2379 | "metadata": { 2380 | "collapsed": true 2381 | }, 2382 | "outputs": [], 2383 | "source": [ 2384 | "train_vecs_glove_mean = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'mean') for z in x_train]))\n", 2385 | "validation_vecs_glove_mean = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'mean') for z in x_validation]))" 2386 | ] 2387 | }, 2388 | { 2389 | "cell_type": "code", 2390 | "execution_count": 18, 2391 | "metadata": {}, 2392 | "outputs": [ 2393 | { 2394 | "name": "stdout", 2395 | "output_type": "stream", 2396 | "text": [ 2397 | "CPU times: user 1min 39s, sys: 34 s, total: 2min 13s\n", 2398 | "Wall time: 2min 45s\n" 2399 | ] 2400 | } 2401 | ], 2402 | "source": [ 2403 | "%%time\n", 2404 | "clf = LogisticRegression()\n", 2405 | "clf.fit(train_vecs_glove_mean, y_train)" 2406 | ] 2407 | }, 2408 | { 2409 | "cell_type": "code", 2410 | "execution_count": 19, 2411 | "metadata": {}, 2412 | "outputs": [ 2413 | { 2414 | "data": { 2415 | "text/plain": [ 2416 | "0.76265664160401" 2417 | ] 2418 | }, 2419 | "execution_count": 19, 2420 | "metadata": {}, 2421 | "output_type": "execute_result" 2422 | } 2423 | ], 2424 | "source": [ 2425 | "clf.score(validation_vecs_glove_mean, y_validation)" 2426 | ] 2427 | }, 2428 | { 2429 | "cell_type": "code", 2430 | "execution_count": 20, 2431 | "metadata": { 2432 | "collapsed": true 2433 | }, 2434 | "outputs": [], 2435 | "source": [ 2436 | "train_vecs_glove_sum = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'sum') for z in x_train]))\n", 2437 | "validation_vecs_glove_sum = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'sum') for z in x_validation]))" 2438 | ] 2439 | }, 2440 | { 2441 | "cell_type": "code", 2442 | "execution_count": 21, 2443 | "metadata": {}, 2444 | "outputs": [ 2445 | { 2446 | "name": "stdout", 2447 | "output_type": "stream", 2448 | "text": [ 2449 | "CPU times: user 2min 49s, sys: 43.3 s, total: 3min 32s\n", 2450 | "Wall time: 4min 14s\n" 2451 | ] 2452 | } 2453 | ], 2454 | "source": [ 2455 | "%%time\n", 2456 | "clf = LogisticRegression()\n", 2457 | "clf.fit(train_vecs_glove_sum, y_train)" 2458 | ] 2459 | }, 2460 | { 2461 | "cell_type": "code", 2462 | "execution_count": 22, 2463 | "metadata": {}, 2464 | "outputs": [ 2465 | { 2466 | "data": { 2467 | "text/plain": [ 2468 | "0.7659774436090225" 2469 | ] 2470 | }, 2471 | "execution_count": 22, 2472 | "metadata": {}, 2473 | "output_type": "execute_result" 2474 | } 2475 | ], 2476 | "source": [ 2477 | "clf.score(validation_vecs_glove_sum, y_validation)" 2478 | ] 2479 | }, 2480 | { 2481 | "cell_type": "markdown", 2482 | "metadata": {}, 2483 | "source": [ 2484 | "By using pre-trained GloVe vectors, I can see that the validation accuracy significantly improved. So far the best validation accuracy was from the averaged word vectors with custom weighting, which gave me 73.27% accuracy, and compared to this, GloVe vectors yields 76.27%, 76.60% for average and sum respectively." 
2485 | ] 2486 | }, 2487 | { 2488 | "cell_type": "markdown", 2489 | "metadata": {}, 2490 | "source": [ 2491 | "## Word vectors extracted from pre-trained Google News Word2Vec (Average/Sum)" 2492 | ] 2493 | }, 2494 | { 2495 | "cell_type": "markdown", 2496 | "metadata": {}, 2497 | "source": [ 2498 | "With new updated Gensim, I can also load the famous pre-trained Google News word vectors. These word vectors are trained using Word2Vec model on Google News dataset (about 100 billion words) and published by Google. The model contains 300-dimensional vectors for 3 million words and phrases. You can find more detail in the Google project archive. https://code.google.com/archive/p/word2vec/" 2499 | ] 2500 | }, 2501 | { 2502 | "cell_type": "code", 2503 | "execution_count": 16, 2504 | "metadata": { 2505 | "collapsed": true 2506 | }, 2507 | "outputs": [], 2508 | "source": [ 2509 | "import gensim.downloader as api\n", 2510 | "googlenews = api.load(\"word2vec-google-news-300\")" 2511 | ] 2512 | }, 2513 | { 2514 | "cell_type": "code", 2515 | "execution_count": 17, 2516 | "metadata": { 2517 | "collapsed": true 2518 | }, 2519 | "outputs": [], 2520 | "source": [ 2521 | "train_vecs_googlenews_mean = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'mean') for z in x_train]))\n", 2522 | "validation_vecs_googlenews_mean = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'mean') for z in x_validation]))" 2523 | ] 2524 | }, 2525 | { 2526 | "cell_type": "code", 2527 | "execution_count": 18, 2528 | "metadata": {}, 2529 | "outputs": [ 2530 | { 2531 | "name": "stdout", 2532 | "output_type": "stream", 2533 | "text": [ 2534 | "CPU times: user 5min 55s, sys: 41min 5s, total: 47min\n", 2535 | "Wall time: 1h 23min 5s\n" 2536 | ] 2537 | } 2538 | ], 2539 | "source": [ 2540 | "%%time\n", 2541 | "clf = LogisticRegression()\n", 2542 | "clf.fit(train_vecs_googlenews_mean, y_train)" 2543 | ] 2544 | }, 2545 | { 2546 | "cell_type": "code", 2547 | "execution_count": 19, 2548 | "metadata": {}, 2549 | "outputs": [ 2550 | { 2551 | "data": { 2552 | "text/plain": [ 2553 | "0.749561403508772" 2554 | ] 2555 | }, 2556 | "execution_count": 19, 2557 | "metadata": {}, 2558 | "output_type": "execute_result" 2559 | } 2560 | ], 2561 | "source": [ 2562 | "clf.score(validation_vecs_googlenews_mean, y_validation)" 2563 | ] 2564 | }, 2565 | { 2566 | "cell_type": "code", 2567 | "execution_count": 20, 2568 | "metadata": { 2569 | "collapsed": true 2570 | }, 2571 | "outputs": [], 2572 | "source": [ 2573 | "train_vecs_googlenews_sum = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'sum') for z in x_train]))\n", 2574 | "validation_vecs_googlenews_sum = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'sum') for z in x_validation]))" 2575 | ] 2576 | }, 2577 | { 2578 | "cell_type": "code", 2579 | "execution_count": 21, 2580 | "metadata": {}, 2581 | "outputs": [ 2582 | { 2583 | "name": "stdout", 2584 | "output_type": "stream", 2585 | "text": [ 2586 | "CPU times: user 5min 45s, sys: 39min 17s, total: 45min 2s\n", 2587 | "Wall time: 1h 19min 51s\n" 2588 | ] 2589 | } 2590 | ], 2591 | "source": [ 2592 | "%%time\n", 2593 | "clf = LogisticRegression()\n", 2594 | "clf.fit(train_vecs_googlenews_sum, y_train)" 2595 | ] 2596 | }, 2597 | { 2598 | "cell_type": "code", 2599 | "execution_count": 22, 2600 | "metadata": {}, 2601 | "outputs": [ 2602 | { 2603 | "data": { 2604 | "text/plain": [ 2605 | "0.7491854636591478" 2606 | ] 2607 | }, 2608 | "execution_count": 22, 2609 | "metadata": {}, 2610 | "output_type": "execute_result" 2611 | } 
2612 | ], 2613 | "source": [ 2614 | "clf.score(validation_vecs_googlenews_sum, y_validation)" 2615 | ] 2616 | }, 2617 | { 2618 | "cell_type": "markdown", 2619 | "metadata": {}, 2620 | "source": [ 2621 | "Although this gives a better result than the word vectors extracted from my custom-trained Doc2Vec models, it fails to outperform the GloVe vectors, despite the Google News word vectors having an even larger dimension." 2622 | ] 2623 | }, 2624 | { 2625 | "cell_type": "markdown", 2626 | "metadata": {}, 2627 | "source": [ 2628 | "But this model was trained on Google News, while the GloVe vectors I used were trained specifically on Tweets, so it is hard to compare the two directly. What if Word2Vec were trained specifically on Tweets?" 2629 | ] 2630 | }, 2631 | { 2632 | "cell_type": "markdown", 2633 | "metadata": {}, 2634 | "source": [ 2635 | "## Separately trained Word2Vec (Average/Sum)" 2636 | ] 2637 | }, 2638 | { 2639 | "cell_type": "markdown", 2640 | "metadata": {}, 2641 | "source": [ 2642 | "I have already tried the word vectors extracted from the Doc2Vec models, but what if I train separate Word2Vec models? Even though the Doc2Vec models gave good document-level representations, would pure Word2Vec learn word vectors more efficiently?" 2643 | ] 2644 | }, 2645 | { 2646 | "cell_type": "markdown", 2647 | "metadata": {}, 2648 | "source": [ 2649 | "To answer my own question, I trained two Word2Vec models, one using CBOW (Continuous Bag Of Words) and one using Skip Gram. In terms of parameter settings, I used the same parameters as for Doc2Vec.\n", 2650 | "\n", 2651 | "- size of vectors: 100 dimensions\n", 2652 | "- negative sampling: 5\n", 2653 | "- window: 2\n", 2654 | "- minimum word count: 2\n", 2655 | "- alpha: 0.065 (decrease alpha by 0.002 per epoch)\n", 2656 | "- number of epochs: 30\n", 2657 | "\n", 2658 | "With the above settings, I defined the CBOW model by passing \"sg=0\" and the Skip Gram model by passing \"sg=1\"." 2659 | ] 2660 | }, 2661 | { 2662 | "cell_type": "markdown", 2663 | "metadata": {}, 2664 | "source": [ 2665 | "Once I have the results from the two models, I concatenate the vectors of the two models for each word, so that the concatenated vectors give a 200-dimensional representation of each word." 2666 | ] 2667 | }, 2668 | { 2669 | "cell_type": "markdown", 2670 | "metadata": {}, 2671 | "source": [ 2672 | "Please note that in the 6th part, where I trained Doc2Vec, I used the \"LabeledSentence\" function imported from Gensim. This has since been deprecated, so for this post I used the \"TaggedDocument\" function instead. The usage is the same." 
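To make "the usage is the same" concrete, here is a minimal sketch (the deprecated call is shown only as a comment, since `LabeledSentence` is no longer available in current Gensim; `tokens` and `doc` are illustrative names, and the tag format mirrors the `labelize_tweets_ug` helper in the next cell).

```python
# LabeledSentence (old) and TaggedDocument (current) take the same fields and behave the same in training.
from gensim.models.doc2vec import TaggedDocument

tokens = "awww that bummer".split()

# Older Gensim:  LabeledSentence(words=tokens, tags=['all_0'])
doc = TaggedDocument(words=tokens, tags=['all_0'])
```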
2673 | ] 2674 | }, 2675 | { 2676 | "cell_type": "code", 2677 | "execution_count": 26, 2678 | "metadata": { 2679 | "collapsed": true 2680 | }, 2681 | "outputs": [], 2682 | "source": [ 2683 | "from tqdm import tqdm\n", 2684 | "tqdm.pandas(desc=\"progress-bar\")\n", 2685 | "import gensim\n", 2686 | "from gensim.models.word2vec import Word2Vec\n", 2687 | "from gensim.models.doc2vec import TaggedDocument\n", 2688 | "import multiprocessing\n", 2689 | "from sklearn import utils" 2690 | ] 2691 | }, 2692 | { 2693 | "cell_type": "code", 2694 | "execution_count": 27, 2695 | "metadata": { 2696 | "collapsed": true 2697 | }, 2698 | "outputs": [], 2699 | "source": [ 2700 | "def labelize_tweets_ug(tweets,label):\n", 2701 | " result = []\n", 2702 | " prefix = label\n", 2703 | " for i, t in zip(tweets.index, tweets):\n", 2704 | " result.append(TaggedDocument(t.split(), [prefix + '_%s' % i]))\n", 2705 | " return result" 2706 | ] 2707 | }, 2708 | { 2709 | "cell_type": "code", 2710 | "execution_count": 28, 2711 | "metadata": { 2712 | "collapsed": true 2713 | }, 2714 | "outputs": [], 2715 | "source": [ 2716 | "all_x = pd.concat([x_train,x_validation,x_test])\n", 2717 | "all_x_w2v = labelize_tweets_ug(all_x, 'all')" 2718 | ] 2719 | }, 2720 | { 2721 | "cell_type": "code", 2722 | "execution_count": 32, 2723 | "metadata": {}, 2724 | "outputs": [ 2725 | { 2726 | "name": "stderr", 2727 | "output_type": "stream", 2728 | "text": [ 2729 | "100%|██████████| 1596019/1596019 [00:01<00:00, 974931.80it/s]\n" 2730 | ] 2731 | } 2732 | ], 2733 | "source": [ 2734 | "cores = multiprocessing.cpu_count()\n", 2735 | "model_ug_cbow = Word2Vec(sg=0, size=100, negative=5, window=2, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)\n", 2736 | "model_ug_cbow.build_vocab([x.words for x in tqdm(all_x_w2v)])" 2737 | ] 2738 | }, 2739 | { 2740 | "cell_type": "code", 2741 | "execution_count": 33, 2742 | "metadata": {}, 2743 | "outputs": [ 2744 | { 2745 | "name": "stderr", 2746 | "output_type": "stream", 2747 | "text": [ 2748 | "100%|██████████| 1596019/1596019 [00:01<00:00, 896751.73it/s]\n", 2749 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1132476.16it/s]\n", 2750 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1098657.06it/s]\n", 2751 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1087776.38it/s]\n", 2752 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1093001.25it/s]\n", 2753 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1157588.98it/s]\n", 2754 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1100648.79it/s]\n", 2755 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1119397.57it/s]\n", 2756 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1039300.87it/s]\n", 2757 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1139894.69it/s]\n", 2758 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1115533.23it/s]\n", 2759 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1056060.47it/s]\n", 2760 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1110581.51it/s]\n", 2761 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1120059.66it/s]\n", 2762 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1124520.95it/s]\n", 2763 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1012078.21it/s]\n", 2764 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1065250.54it/s]\n", 2765 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1105911.38it/s]\n", 2766 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1152587.48it/s]\n", 2767 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1151142.59it/s]\n", 2768 | "100%|██████████| 1596019/1596019 
[00:01<00:00, 1082757.94it/s]\n", 2769 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1087360.45it/s]\n", 2770 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1099016.36it/s]\n", 2771 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1102341.08it/s]\n", 2772 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1005251.65it/s]\n", 2773 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1016836.86it/s]\n", 2774 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1129312.57it/s]\n", 2775 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1080032.42it/s]\n", 2776 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1091104.83it/s]\n", 2777 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1118170.79it/s]\n" 2778 | ] 2779 | }, 2780 | { 2781 | "name": "stdout", 2782 | "output_type": "stream", 2783 | "text": [ 2784 | "CPU times: user 22min 1s, sys: 1min 3s, total: 23min 5s\n", 2785 | "Wall time: 9min 37s\n" 2786 | ] 2787 | } 2788 | ], 2789 | "source": [ 2790 | "%%time\n", 2791 | "for epoch in range(30):\n", 2792 | " model_ug_cbow.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)\n", 2793 | " model_ug_cbow.alpha -= 0.002\n", 2794 | " model_ug_cbow.min_alpha = model_ug_cbow.alpha" 2795 | ] 2796 | }, 2797 | { 2798 | "cell_type": "code", 2799 | "execution_count": 35, 2800 | "metadata": {}, 2801 | "outputs": [ 2802 | { 2803 | "name": "stderr", 2804 | "output_type": "stream", 2805 | "text": [ 2806 | "/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:6: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", 2807 | " \n" 2808 | ] 2809 | } 2810 | ], 2811 | "source": [ 2812 | "train_vecs_cbow_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_cbow,'mean') for z in x_train]))\n", 2813 | "validation_vecs_cbow_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_cbow,'mean') for z in x_validation]))" 2814 | ] 2815 | }, 2816 | { 2817 | "cell_type": "code", 2818 | "execution_count": 36, 2819 | "metadata": {}, 2820 | "outputs": [ 2821 | { 2822 | "name": "stdout", 2823 | "output_type": "stream", 2824 | "text": [ 2825 | "CPU times: user 40.8 s, sys: 5.19 s, total: 46 s\n", 2826 | "Wall time: 48.7 s\n" 2827 | ] 2828 | } 2829 | ], 2830 | "source": [ 2831 | "%%time\n", 2832 | "clf = LogisticRegression()\n", 2833 | "clf.fit(train_vecs_cbow_mean, y_train)" 2834 | ] 2835 | }, 2836 | { 2837 | "cell_type": "code", 2838 | "execution_count": 37, 2839 | "metadata": {}, 2840 | "outputs": [ 2841 | { 2842 | "data": { 2843 | "text/plain": [ 2844 | "0.7600250626566416" 2845 | ] 2846 | }, 2847 | "execution_count": 37, 2848 | "metadata": {}, 2849 | "output_type": "execute_result" 2850 | } 2851 | ], 2852 | "source": [ 2853 | "clf.score(validation_vecs_cbow_mean, y_validation)" 2854 | ] 2855 | }, 2856 | { 2857 | "cell_type": "code", 2858 | "execution_count": 38, 2859 | "metadata": {}, 2860 | "outputs": [ 2861 | { 2862 | "name": "stderr", 2863 | "output_type": "stream", 2864 | "text": [ 2865 | "100%|██████████| 1596019/1596019 [00:02<00:00, 533098.47it/s]\n" 2866 | ] 2867 | } 2868 | ], 2869 | "source": [ 2870 | "model_ug_sg = Word2Vec(sg=1, size=100, negative=5, window=2, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)\n", 2871 | "model_ug_sg.build_vocab([x.words for x in tqdm(all_x_w2v)])" 2872 | ] 2873 | }, 2874 | { 2875 | "cell_type": "code", 2876 | "execution_count": 39, 2877 | "metadata": {}, 2878 | "outputs": [ 2879 | { 2880 | "name": "stderr", 2881 | "output_type": "stream", 
2882 | "text": [ 2883 | "100%|██████████| 1596019/1596019 [00:01<00:00, 923343.66it/s]\n", 2884 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1071407.58it/s]\n", 2885 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1084559.36it/s]\n", 2886 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1085515.92it/s]\n", 2887 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1098921.10it/s]\n", 2888 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1084263.01it/s]\n", 2889 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1137634.20it/s]\n", 2890 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1063158.62it/s]\n", 2891 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1095510.09it/s]\n", 2892 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1124627.88it/s]\n", 2893 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1097201.52it/s]\n", 2894 | "100%|██████████| 1596019/1596019 [00:01<00:00, 994637.94it/s]\n", 2895 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1164331.50it/s]\n", 2896 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1117291.03it/s]\n", 2897 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1121341.67it/s]\n", 2898 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1069449.79it/s]\n", 2899 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1084845.33it/s]\n", 2900 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1013267.45it/s]\n", 2901 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1127429.99it/s]\n", 2902 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1169868.72it/s]\n", 2903 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1052711.44it/s]\n", 2904 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1098613.06it/s]\n", 2905 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1176477.52it/s]\n", 2906 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1078878.89it/s]\n", 2907 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1164789.36it/s]\n", 2908 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1132839.33it/s]\n", 2909 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1103489.69it/s]\n", 2910 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1115084.29it/s]\n", 2911 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1165798.53it/s]\n", 2912 | "100%|██████████| 1596019/1596019 [00:01<00:00, 1137897.00it/s]\n" 2913 | ] 2914 | }, 2915 | { 2916 | "name": "stdout", 2917 | "output_type": "stream", 2918 | "text": [ 2919 | "CPU times: user 41min 35s, sys: 31.7 s, total: 42min 7s\n", 2920 | "Wall time: 12min 35s\n" 2921 | ] 2922 | } 2923 | ], 2924 | "source": [ 2925 | "%%time\n", 2926 | "for epoch in range(30):\n", 2927 | " model_ug_sg.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)\n", 2928 | " model_ug_sg.alpha -= 0.002\n", 2929 | " model_ug_sg.min_alpha = model_ug_sg.alpha" 2930 | ] 2931 | }, 2932 | { 2933 | "cell_type": "code", 2934 | "execution_count": 40, 2935 | "metadata": {}, 2936 | "outputs": [ 2937 | { 2938 | "name": "stderr", 2939 | "output_type": "stream", 2940 | "text": [ 2941 | "/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:6: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", 2942 | " \n" 2943 | ] 2944 | } 2945 | ], 2946 | "source": [ 2947 | "train_vecs_sg_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_sg,'mean') for z in x_train]))\n", 2948 | "validation_vecs_sg_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_sg,'mean') for z in x_validation]))" 2949 | ] 2950 | }, 2951 | { 2952 | "cell_type": "code", 
2953 | "execution_count": 41, 2954 | "metadata": {}, 2955 | "outputs": [ 2956 | { 2957 | "name": "stdout", 2958 | "output_type": "stream", 2959 | "text": [ 2960 | "CPU times: user 23.3 s, sys: 4.34 s, total: 27.7 s\n", 2961 | "Wall time: 29.6 s\n" 2962 | ] 2963 | } 2964 | ], 2965 | "source": [ 2966 | "%%time\n", 2967 | "clf = LogisticRegression()\n", 2968 | "clf.fit(train_vecs_sg_mean, y_train)" 2969 | ] 2970 | }, 2971 | { 2972 | "cell_type": "code", 2973 | "execution_count": 42, 2974 | "metadata": {}, 2975 | "outputs": [ 2976 | { 2977 | "data": { 2978 | "text/plain": [ 2979 | "0.7604010025062656" 2980 | ] 2981 | }, 2982 | "execution_count": 42, 2983 | "metadata": {}, 2984 | "output_type": "execute_result" 2985 | } 2986 | ], 2987 | "source": [ 2988 | "clf.score(validation_vecs_sg_mean, y_validation)" 2989 | ] 2990 | }, 2991 | { 2992 | "cell_type": "code", 2993 | "execution_count": 43, 2994 | "metadata": { 2995 | "collapsed": true 2996 | }, 2997 | "outputs": [], 2998 | "source": [ 2999 | "def get_w2v_mean(tweet, size):\n", 3000 | " vec = np.zeros(size).reshape((1, size))\n", 3001 | " count = 0.\n", 3002 | " for word in tweet.split():\n", 3003 | " try:\n", 3004 | " vec += np.append(model_ug_cbow[word],model_ug_sg[word]).reshape((1, size))\n", 3005 | " count += 1.\n", 3006 | " except KeyError:\n", 3007 | " continue\n", 3008 | " if count != 0:\n", 3009 | " vec /= count\n", 3010 | " return vec" 3011 | ] 3012 | }, 3013 | { 3014 | "cell_type": "code", 3015 | "execution_count": 44, 3016 | "metadata": {}, 3017 | "outputs": [ 3018 | { 3019 | "name": "stderr", 3020 | "output_type": "stream", 3021 | "text": [ 3022 | "/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:6: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", 3023 | " \n" 3024 | ] 3025 | } 3026 | ], 3027 | "source": [ 3028 | "train_vecs_cbowsg_mean = scale(np.concatenate([get_w2v_mean(z, 200) for z in x_train]))\n", 3029 | "validation_vecs_cbowsg_mean = scale(np.concatenate([get_w2v_mean(z, 200) for z in x_validation]))" 3030 | ] 3031 | }, 3032 | { 3033 | "cell_type": "code", 3034 | "execution_count": 45, 3035 | "metadata": {}, 3036 | "outputs": [ 3037 | { 3038 | "name": "stdout", 3039 | "output_type": "stream", 3040 | "text": [ 3041 | "CPU times: user 6min 17s, sys: 27min 45s, total: 34min 2s\n", 3042 | "Wall time: 1h 13min 2s\n" 3043 | ] 3044 | } 3045 | ], 3046 | "source": [ 3047 | "%%time\n", 3048 | "clf = LogisticRegression()\n", 3049 | "clf.fit(train_vecs_cbowsg_mean, y_train)" 3050 | ] 3051 | }, 3052 | { 3053 | "cell_type": "code", 3054 | "execution_count": 46, 3055 | "metadata": {}, 3056 | "outputs": [ 3057 | { 3058 | "data": { 3059 | "text/plain": [ 3060 | "0.7650375939849624" 3061 | ] 3062 | }, 3063 | "execution_count": 46, 3064 | "metadata": {}, 3065 | "output_type": "execute_result" 3066 | } 3067 | ], 3068 | "source": [ 3069 | "clf.score(validation_vecs_cbowsg_mean, y_validation)" 3070 | ] 3071 | }, 3072 | { 3073 | "cell_type": "code", 3074 | "execution_count": 47, 3075 | "metadata": { 3076 | "collapsed": true 3077 | }, 3078 | "outputs": [], 3079 | "source": [ 3080 | "def get_w2v_sum(tweet, size):\n", 3081 | " vec = np.zeros(size).reshape((1, size))\n", 3082 | " for word in tweet.split():\n", 3083 | " try:\n", 3084 | " vec += np.append(model_ug_cbow[word],model_ug_sg[word]).reshape((1, size))\n", 3085 | " except KeyError:\n", 3086 | " continue\n", 3087 | " return vec" 3088 | ] 3089 | }, 3090 | { 3091 | "cell_type": "code", 3092 | 
"execution_count": 48, 3093 | "metadata": {}, 3094 | "outputs": [ 3095 | { 3096 | "name": "stderr", 3097 | "output_type": "stream", 3098 | "text": [ 3099 | "/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:5: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", 3100 | " \"\"\"\n" 3101 | ] 3102 | } 3103 | ], 3104 | "source": [ 3105 | "train_vecs_cbowsg_sum = scale(np.concatenate([get_w2v_sum(z, 200) for z in x_train]))\n", 3106 | "validation_vecs_cbowsg_sum = scale(np.concatenate([get_w2v_sum(z, 200) for z in x_validation]))" 3107 | ] 3108 | }, 3109 | { 3110 | "cell_type": "code", 3111 | "execution_count": 49, 3112 | "metadata": {}, 3113 | "outputs": [ 3114 | { 3115 | "name": "stdout", 3116 | "output_type": "stream", 3117 | "text": [ 3118 | "CPU times: user 7min 7s, sys: 28min 32s, total: 35min 40s\n", 3119 | "Wall time: 1h 16min 21s\n" 3120 | ] 3121 | } 3122 | ], 3123 | "source": [ 3124 | "%%time\n", 3125 | "clf = LogisticRegression()\n", 3126 | "clf.fit(train_vecs_cbowsg_sum, y_train)" 3127 | ] 3128 | }, 3129 | { 3130 | "cell_type": "code", 3131 | "execution_count": 50, 3132 | "metadata": {}, 3133 | "outputs": [ 3134 | { 3135 | "data": { 3136 | "text/plain": [ 3137 | "0.7675438596491229" 3138 | ] 3139 | }, 3140 | "execution_count": 50, 3141 | "metadata": {}, 3142 | "output_type": "execute_result" 3143 | } 3144 | ], 3145 | "source": [ 3146 | "clf.score(validation_vecs_cbowsg_sum, y_validation)" 3147 | ] 3148 | }, 3149 | { 3150 | "cell_type": "markdown", 3151 | "metadata": {}, 3152 | "source": [ 3153 | "The concatenated vectors of unigram CBOW and unigram Skip Gram models has yielded 76.50%, 76.75% validation accuracy respectively with mean and sum method. These results are even higher than the results I got from GloVe vectors. " 3154 | ] 3155 | }, 3156 | { 3157 | "cell_type": "markdown", 3158 | "metadata": {}, 3159 | "source": [ 3160 | "But please do not confuse this as a general statement. This is an empirical finding in this particualr setting." 3161 | ] 3162 | }, 3163 | { 3164 | "cell_type": "markdown", 3165 | "metadata": {}, 3166 | "source": [ 3167 | "## Separately trained Word2Vec with custom weighting (Average/Sum)" 3168 | ] 3169 | }, 3170 | { 3171 | "cell_type": "markdown", 3172 | "metadata": {}, 3173 | "source": [ 3174 | "As a final step, I will apply the custom weighting I have implemented above and see if this affects the performance." 
3175 | ] 3176 | }, 3177 | { 3178 | "cell_type": "code", 3179 | "execution_count": 58, 3180 | "metadata": {}, 3181 | "outputs": [ 3182 | { 3183 | "name": "stderr", 3184 | "output_type": "stream", 3185 | "text": [ 3186 | "/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:4: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", 3187 | " after removing the cwd from sys.path.\n" 3188 | ] 3189 | }, 3190 | { 3191 | "name": "stdout", 3192 | "output_type": "stream", 3193 | "text": [ 3194 | "CPU times: user 4.92 s, sys: 631 ms, total: 5.55 s\n", 3195 | "Wall time: 6.19 s\n" 3196 | ] 3197 | } 3198 | ], 3199 | "source": [ 3200 | "%%time\n", 3201 | "w2v_pos_hmean_01 = {}\n", 3202 | "for w in model_ug_cbow.wv.vocab.keys():\n", 3203 | " if w in pos_hmean.keys():\n", 3204 | " w2v_pos_hmean_01[w] = np.append(model_ug_cbow[w],model_ug_sg[w]) * pos_hmean[w]" 3205 | ] 3206 | }, 3207 | { 3208 | "cell_type": "code", 3209 | "execution_count": 59, 3210 | "metadata": { 3211 | "collapsed": true 3212 | }, 3213 | "outputs": [], 3214 | "source": [ 3215 | "train_vecs_w2v_poshmean_mean_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'mean') for z in x_train]))\n", 3216 | "validation_vecs_w2v_poshmean_mean_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'mean') for z in x_validation]))" 3217 | ] 3218 | }, 3219 | { 3220 | "cell_type": "code", 3221 | "execution_count": 60, 3222 | "metadata": {}, 3223 | "outputs": [ 3224 | { 3225 | "name": "stdout", 3226 | "output_type": "stream", 3227 | "text": [ 3228 | "CPU times: user 8min 3s, sys: 32min 13s, total: 40min 17s\n", 3229 | "Wall time: 1h 23min 58s\n" 3230 | ] 3231 | } 3232 | ], 3233 | "source": [ 3234 | "%%time\n", 3235 | "clf = LogisticRegression()\n", 3236 | "clf.fit(train_vecs_w2v_poshmean_mean_01, y_train)" 3237 | ] 3238 | }, 3239 | { 3240 | "cell_type": "code", 3241 | "execution_count": 61, 3242 | "metadata": {}, 3243 | "outputs": [ 3244 | { 3245 | "data": { 3246 | "text/plain": [ 3247 | "0.7797619047619048" 3248 | ] 3249 | }, 3250 | "execution_count": 61, 3251 | "metadata": {}, 3252 | "output_type": "execute_result" 3253 | } 3254 | ], 3255 | "source": [ 3256 | "clf.score(validation_vecs_w2v_poshmean_mean_01, y_validation)" 3257 | ] 3258 | }, 3259 | { 3260 | "cell_type": "code", 3261 | "execution_count": 62, 3262 | "metadata": { 3263 | "collapsed": true 3264 | }, 3265 | "outputs": [], 3266 | "source": [ 3267 | "train_vecs_w2v_poshmean_sum_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'sum') for z in x_train]))\n", 3268 | "validation_vecs_w2v_poshmean_sum_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'sum') for z in x_validation]))" 3269 | ] 3270 | }, 3271 | { 3272 | "cell_type": "code", 3273 | "execution_count": 63, 3274 | "metadata": {}, 3275 | "outputs": [ 3276 | { 3277 | "name": "stdout", 3278 | "output_type": "stream", 3279 | "text": [ 3280 | "CPU times: user 7min 48s, sys: 28min 59s, total: 36min 48s\n", 3281 | "Wall time: 1h 11min 4s\n" 3282 | ] 3283 | } 3284 | ], 3285 | "source": [ 3286 | "%%time\n", 3287 | "clf = LogisticRegression()\n", 3288 | "clf.fit(train_vecs_w2v_poshmean_sum_01, y_train)" 3289 | ] 3290 | }, 3291 | { 3292 | "cell_type": "code", 3293 | "execution_count": 64, 3294 | "metadata": {}, 3295 | "outputs": [ 3296 | { 3297 | "data": { 3298 | "text/plain": [ 3299 | "0.7451754385964913" 3300 | ] 3301 | }, 3302 | "execution_count": 64, 3303 | "metadata": {}, 3304 | "output_type": 
"execute_result" 3305 | } 3306 | ], 3307 | "source": [ 3308 | "clf.score(validation_vecs_w2v_poshmean_sum_01, y_validation)" 3309 | ] 3310 | }, 3311 | { 3312 | "cell_type": "markdown", 3313 | "metadata": {}, 3314 | "source": [ 3315 | "Finally I get the best performing word vectors. Averaged word vectors (separately trained Word2Vec models) weighted with custom metric has yielded the best validation accuray of 77.97%! Below is the table of all the results I tried above." 3316 | ] 3317 | }, 3318 | { 3319 | "cell_type": "markdown", 3320 | "metadata": {}, 3321 | "source": [ 3322 | "| Word vectors extracted from | Vector dimensions | Weightings | Validation Accuracy with mean | Validation accuracy with sum |\n", 3323 | "|---|---|---|---|\n", 3324 | "| Doc2Vec (unigram DBOW + unigram DMM) | 200 | N/A | 72.42% | 72.51% |\n", 3325 | "| Doc2Vec (unigram DBOW + unigram DMM) | 200 | TF-IDF | 70.57% | 70.32% |\n", 3326 | "| Doc2Vec (unigram DBOW + unigram DMM) | 200 | custom | 73.27% | 70.94% |\n", 3327 | "| pre-trained GloVe (Tweets) | 200 | N/A | 76.27% | 76.60% |\n", 3328 | "| pre-trained Word2Vec (Google News) | 300 | N/A | 74.96% | 74.92% |\n", 3329 | "| Word2Vec (unigram CBOW + unigram SG) | 200 | N/A | 76.50% | 76.75% |\n", 3330 | "| Word2Vec (unigram CBOW + unigram SG) | 200 | custom | 77.98% | 74.52% |" 3331 | ] 3332 | }, 3333 | { 3334 | "cell_type": "markdown", 3335 | "metadata": {}, 3336 | "source": [ 3337 | "# Neural Network with Word2Vec" 3338 | ] 3339 | }, 3340 | { 3341 | "cell_type": "markdown", 3342 | "metadata": {}, 3343 | "source": [ 3344 | "The best performing word vectors with logistic regression was chosen to feed to a neural network model. This time I did not try various different architecture. Based on what I have observed during trials of different artchitectures with Doc2Vec document vectors, the best performing architecture was one with 3 hiddel layers with 256 hidden nodes at each hidden layer." 3345 | ] 3346 | }, 3347 | { 3348 | "cell_type": "markdown", 3349 | "metadata": {}, 3350 | "source": [ 3351 | "I will finally fit a neural network with early stopping and checkpoint so that I can save the best performing weights on validation accuracy." 
3352 | ] 3353 | }, 3354 | { 3355 | "cell_type": "code", 3356 | "execution_count": 65, 3357 | "metadata": { 3358 | "collapsed": true 3359 | }, 3360 | "outputs": [], 3361 | "source": [ 3362 | "train_w2v_final = train_vecs_w2v_poshmean_mean_01\n", 3363 | "validation_w2v_final = validation_vecs_w2v_poshmean_mean_01" 3364 | ] 3365 | }, 3366 | { 3367 | "cell_type": "code", 3368 | "execution_count": 66, 3369 | "metadata": {}, 3370 | "outputs": [ 3371 | { 3372 | "name": "stdout", 3373 | "output_type": "stream", 3374 | "text": [ 3375 | "Train on 1564098 samples, validate on 15960 samples\n", 3376 | "Epoch 1/100\n", 3377 | "Epoch 00001: val_acc improved from -inf to 0.79530, saving model to w2v_01_best_weights.01-0.7953.hdf5\n", 3378 | " - 500s - loss: 0.4294 - acc: 0.8017 - val_loss: 0.4392 - val_acc: 0.7953\n", 3379 | "Epoch 2/100\n", 3380 | "Epoch 00002: val_acc improved from 0.79530 to 0.79762, saving model to w2v_01_best_weights.02-0.7976.hdf5\n", 3381 | " - 575s - loss: 0.4144 - acc: 0.8090 - val_loss: 0.4353 - val_acc: 0.7976\n", 3382 | "Epoch 3/100\n", 3383 | "Epoch 00003: val_acc improved from 0.79762 to 0.80056, saving model to w2v_01_best_weights.03-0.8006.hdf5\n", 3384 | " - 753s - loss: 0.4086 - acc: 0.8118 - val_loss: 0.4319 - val_acc: 0.8006\n", 3385 | "Epoch 4/100\n", 3386 | "Epoch 00004: val_acc improved from 0.80056 to 0.80182, saving model to w2v_01_best_weights.04-0.8018.hdf5\n", 3387 | " - 768s - loss: 0.4046 - acc: 0.8138 - val_loss: 0.4331 - val_acc: 0.8018\n", 3388 | "Epoch 5/100\n", 3389 | "Epoch 00005: val_acc did not improve\n", 3390 | " - 787s - loss: 0.4016 - acc: 0.8155 - val_loss: 0.4300 - val_acc: 0.8010\n", 3391 | "Epoch 6/100\n", 3392 | "Epoch 00006: val_acc improved from 0.80182 to 0.80301, saving model to w2v_01_best_weights.06-0.8030.hdf5\n", 3393 | " - 788s - loss: 0.3993 - acc: 0.8168 - val_loss: 0.4270 - val_acc: 0.8030\n", 3394 | "Epoch 7/100\n", 3395 | "Epoch 00007: val_acc did not improve\n", 3396 | " - 787s - loss: 0.3974 - acc: 0.8179 - val_loss: 0.4339 - val_acc: 0.8018\n", 3397 | "Epoch 8/100\n", 3398 | "Epoch 00008: val_acc improved from 0.80301 to 0.80351, saving model to w2v_01_best_weights.08-0.8035.hdf5\n", 3399 | " - 787s - loss: 0.3954 - acc: 0.8186 - val_loss: 0.4340 - val_acc: 0.8035\n", 3400 | "Epoch 9/100\n", 3401 | "Epoch 00009: val_acc did not improve\n", 3402 | " - 774s - loss: 0.3940 - acc: 0.8196 - val_loss: 0.4316 - val_acc: 0.8001\n", 3403 | "Epoch 10/100\n", 3404 | "Epoch 00010: val_acc improved from 0.80351 to 0.80476, saving model to w2v_01_best_weights.10-0.8048.hdf5\n", 3405 | " - 772s - loss: 0.3927 - acc: 0.8202 - val_loss: 0.4245 - val_acc: 0.8048\n", 3406 | "Epoch 11/100\n", 3407 | "Epoch 00011: val_acc did not improve\n", 3408 | " - 770s - loss: 0.3913 - acc: 0.8209 - val_loss: 0.4294 - val_acc: 0.8004\n", 3409 | "Epoch 12/100\n", 3410 | "Epoch 00012: val_acc did not improve\n", 3411 | " - 779s - loss: 0.3901 - acc: 0.8212 - val_loss: 0.4354 - val_acc: 0.7996\n", 3412 | "Epoch 13/100\n", 3413 | "Epoch 00013: val_acc did not improve\n", 3414 | " - 784s - loss: 0.3894 - acc: 0.8219 - val_loss: 0.4263 - val_acc: 0.7999\n", 3415 | "Epoch 14/100\n", 3416 | "Epoch 00014: val_acc did not improve\n", 3417 | " - 807s - loss: 0.3883 - acc: 0.8225 - val_loss: 0.4287 - val_acc: 0.7994\n", 3418 | "Epoch 15/100\n", 3419 | "Epoch 00015: val_acc did not improve\n", 3420 | " - 790s - loss: 0.3876 - acc: 0.8228 - val_loss: 0.4269 - val_acc: 0.8035\n" 3421 | ] 3422 | }, 3423 | { 3424 | "data": { 3425 | "text/plain": [ 3426 | "" 3427 | ] 3428 | 
}, 3429 | "execution_count": 66, 3430 | "metadata": {}, 3431 | "output_type": "execute_result" 3432 | } 3433 | ], 3434 | "source": [ 3435 | "from keras.callbacks import ModelCheckpoint, EarlyStopping\n", 3436 | "\n", 3437 | "filepath=\"w2v_01_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5\"\n", 3438 | "checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')\n", 3439 | "early_stop = EarlyStopping(monitor='val_acc', patience=5, mode='max') \n", 3440 | "callbacks_list = [checkpoint, early_stop]\n", 3441 | "np.random.seed(seed)\n", 3442 | "model_w2v_01 = Sequential()\n", 3443 | "model_w2v_01.add(Dense(256, activation='relu', input_dim=200))\n", 3444 | "model_w2v_01.add(Dense(256, activation='relu'))\n", 3445 | "model_w2v_01.add(Dense(256, activation='relu'))\n", 3446 | "model_w2v_01.add(Dense(1, activation='sigmoid'))\n", 3447 | "model_w2v_01.compile(optimizer='adam',\n", 3448 | " loss='binary_crossentropy',\n", 3449 | " metrics=['accuracy'])\n", 3450 | "\n", 3451 | "model_w2v_01.fit(train_w2v_final, y_train, validation_data=(validation_w2v_final, y_validation), \n", 3452 | " epochs=100, batch_size=32, verbose=2, callbacks=callbacks_list)" 3453 | ] 3454 | }, 3455 | { 3456 | "cell_type": "code", 3457 | "execution_count": 67, 3458 | "metadata": { 3459 | "collapsed": true 3460 | }, 3461 | "outputs": [], 3462 | "source": [ 3463 | "from keras.models import load_model\n", 3464 | "loaded_w2v_model = load_model('w2v_01_best_weights.10-0.8048.hdf5')" 3465 | ] 3466 | }, 3467 | { 3468 | "cell_type": "code", 3469 | "execution_count": 68, 3470 | "metadata": {}, 3471 | "outputs": [ 3472 | { 3473 | "name": "stdout", 3474 | "output_type": "stream", 3475 | "text": [ 3476 | "15960/15960 [==============================] - 0s 20us/step\n" 3477 | ] 3478 | }, 3479 | { 3480 | "data": { 3481 | "text/plain": [ 3482 | "[0.4244666022615026, 0.8047619047619048]" 3483 | ] 3484 | }, 3485 | "execution_count": 68, 3486 | "metadata": {}, 3487 | "output_type": "execute_result" 3488 | } 3489 | ], 3490 | "source": [ 3491 | "loaded_w2v_model.evaluate(x=validation_w2v_final, y=y_validation)" 3492 | ] 3493 | }, 3494 | { 3495 | "cell_type": "markdown", 3496 | "metadata": {}, 3497 | "source": [ 3498 | "The best validation accuracy is 80.48%. Surprisingly, this is even higher than the best accuracy I got by feeding document vectors to the neural network models above." 3499 | ] 3500 | }, 3501 | { 3502 | "cell_type": "markdown", 3503 | "metadata": {}, 3504 | "source": [ 3505 | "It took quite some time to try different settings and different calculations, but I learned some valuable lessons through all the trial and error. Specifically, a custom-trained Word2Vec model with carefully engineered weighting can even outperform Doc2Vec in a classification task." 3506 | ] 3507 | }, 3508 | { 3509 | "cell_type": "markdown", 3510 | "metadata": {}, 3511 | "source": [ 3512 | "In the next post, I will try a more sophisticated neural network model, the Convolutional Neural Network. Again, I hope this will give the performance a further boost."
3513 | ] 3514 | } 3515 | ], 3516 | "metadata": { 3517 | "kernelspec": { 3518 | "display_name": "Python 2", 3519 | "language": "python", 3520 | "name": "python2" 3521 | }, 3522 | "language_info": { 3523 | "codemirror_mode": { 3524 | "name": "ipython", 3525 | "version": 2 3526 | }, 3527 | "file_extension": ".py", 3528 | "mimetype": "text/x-python", 3529 | "name": "python", 3530 | "nbconvert_exporter": "python", 3531 | "pygments_lexer": "ipython2", 3532 | "version": "2.7.13" 3533 | } 3534 | }, 3535 | "nbformat": 4, 3536 | "nbformat_minor": 2 3537 | } 3538 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Another Twitter Sentiment Analysis with Python - Part 10 2 | 3 | The attached Jupyter Notebook is part 10 of the Twitter Sentiment Analysis project I implemented as a capstone project for General Assembly's Data Science Immersive course. 4 | 5 | Accompanying blog posts can be found on my Medium account: 6 | https://medium.com/@rickykim78 7 | 8 | The implementations below can be found in the attached notebook. 9 | 10 | ## Neural Network with Doc2Vec/Word2Vec/GloVe
11 | prerequisite: [TensorFlow](https://github.com/tensorflow/tensorflow) or [Theano](https://github.com/Theano/Theano) as the Keras backend, [Keras](https://github.com/keras-team/keras), [Gensim 3.2.0](https://github.com/RaRe-Technologies/gensim) 12 | ``` 13 | pip install tensorflow 14 | pip install keras 15 | pip install gensim==3.2.0 16 | ``` 17 | 
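18 | For reference, the snippet below is a minimal, self-contained sketch of the final Word2Vec-based classifier trained in the notebook. It assumes the 200-dimensional tweet vectors (built with the weighting scheme engineered in the notebook) are already computed; the random placeholder arrays and the `w2v_best_weights` filename are illustrative only, not the notebook's actual data or output files.
19 | ```
20 | import numpy as np
21 | from keras.models import Sequential
22 | from keras.layers import Dense
23 | from keras.callbacks import ModelCheckpoint, EarlyStopping
24 | 
25 | # Placeholder data: in the notebook these are 200-dimensional tweet vectors derived
26 | # from weighted Word2Vec word vectors; random arrays stand in for them here.
27 | rng = np.random.RandomState(2000)
28 | train_vecs = rng.rand(1000, 200)
29 | train_labels = rng.randint(0, 2, size=1000)
30 | val_vecs = rng.rand(200, 200)
31 | val_labels = rng.randint(0, 2, size=200)
32 | 
33 | # Same architecture as the notebook: three 256-unit ReLU layers and a sigmoid output.
34 | model = Sequential()
35 | model.add(Dense(256, activation='relu', input_dim=200))
36 | model.add(Dense(256, activation='relu'))
37 | model.add(Dense(256, activation='relu'))
38 | model.add(Dense(1, activation='sigmoid'))
39 | model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
40 | 
41 | # Checkpoint the best weights by validation accuracy and stop once it stalls.
42 | # Older Keras versions (as used in the notebook) log this metric as 'val_acc';
43 | # newer versions name it 'val_accuracy'.
44 | checkpoint = ModelCheckpoint('w2v_best_weights.{epoch:02d}.hdf5',
45 |                              monitor='val_acc', save_best_only=True, mode='max')
46 | early_stop = EarlyStopping(monitor='val_acc', patience=5, mode='max')
47 | 
48 | model.fit(train_vecs, train_labels,
49 |           validation_data=(val_vecs, val_labels),
50 |           epochs=100, batch_size=32, verbose=2,
51 |           callbacks=[checkpoint, early_stop])
52 | ```
53 | 
54 | With the real weighted tweet vectors in place of these placeholders, this is the setup that reached the best validation accuracy of 80.48% reported in the notebook.
55 | 
--------------------------------------------------------------------------------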