├── models └── .keep ├── images ├── cnn.png └── sentiment.jpg ├── floyd.yml ├── README.md ├── support.py └── sentiment-movie-review.ipynb /models/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /images/cnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/floydhub/sentiment-analysis-template/master/images/cnn.png -------------------------------------------------------------------------------- /images/sentiment.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/floydhub/sentiment-analysis-template/master/images/sentiment.jpg -------------------------------------------------------------------------------- /floyd.yml: -------------------------------------------------------------------------------- 1 | env: tensorflow-1.7 2 | machine: cpu 3 | data: 4 | - source: floydhub/datasets/imdb-preprocessed/1 5 | destination: imdb 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sentiment Analysis 2 | 3 | [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is one of the most common [NLP](https://en.wikipedia.org/wiki/Natural-language_processing) problems. The goal is to analyze a text and predict whether the underlying sentiment is positive, negative or neutral. 4 | *What can you use it for?* Here are a few ideas - measure sentiment of customer support tickets, survey responses, social media, and movie reviews! 
5 | 6 | ### Try it now 7 | 8 | [![Run on FloydHub](https://static.floydhub.com/button/button.svg)](https://floydhub.com/run?template=https://github.com/floydhub/sentiment-analysis-template) 9 | 10 | Click this button to open a Workspace on FloydHub that will train this model. 11 | 12 | ### Predicting sentiment of movie reviews 13 | 14 | In this notebook we will build a [Convolutional Neural Network](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) (CNN) classifier to predict the sentiment (positive or negative) of movie reviews. 15 | 16 | ![sentiment](images/sentiment.jpg) 17 | 18 | We will use the [Stanford Large Movie Reviews](http://ai.stanford.edu/~amaas/data/sentiment/) dataset for training our model. The dataset is compiled from a collection of 50,000 reviews from IMDB. It contains an equal number of positive and negative reviews. The authors considered only highly polarized reviews. Negative reviews have scores ≤ 4 (out of 10), while positive reviews have scores ≥ 7. Neutral reviews are not included. The dataset is divided evenly into training and test sets. 19 | 20 | We will: 21 | - Preprocess text data for NLP 22 | - Build and train a 1-D CNN using Keras and TensorFlow 23 | - Evaluate our model on the test set 24 | - Run the model on your own movie reviews! 25 | -------------------------------------------------------------------------------- /support.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | import seaborn as sns 4 | 5 | def print_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14): 6 | """Prints a confusion matrix, as returned by sklearn.metrics.confusion_matrix, as a heatmap. 7 | 8 | Arguments 9 | --------- 10 | confusion_matrix: numpy.ndarray 11 | The numpy.ndarray object returned from a call to sklearn.metrics.confusion_matrix. 12 | Similarly constructed ndarrays can also be used. 
13 | class_names: list 14 | An ordered list of class names, in the order they index the given confusion matrix. 15 | figsize: tuple 16 | A 2-long tuple, the first value determining the horizontal size of the outputted figure, 17 | the second determining the vertical size. Defaults to (10,7). 18 | fontsize: int 19 | Font size for axes labels. Defaults to 14. 20 | 21 | Returns 22 | ------- 23 | matplotlib.figure.Figure 24 | The resulting confusion matrix figure 25 | 26 | FROM: https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823 27 | """ 28 | df_cm = pd.DataFrame( 29 | confusion_matrix, index=class_names, columns=class_names, 30 | ) 31 | fig = plt.figure(figsize=figsize) 32 | try: 33 | heatmap = sns.heatmap(df_cm, annot=True, fmt="d") 34 | except ValueError: 35 | raise ValueError("Confusion matrix values must be integers.") 36 | heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize) 37 | heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize) 38 | plt.title('Confusion Matrix') 39 | plt.ylabel('True label') 40 | plt.xlabel('Predicted label') 41 | return fig -------------------------------------------------------------------------------- /sentiment-movie-review.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Sentiment Analysis\n", 8 | "\n", 9 | "Hi 🙂, if you are seeing this notebook, you've successfully started your first project on FloydHub, hooray! 🚀\n", 10 | "\n", 11 | "[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is one of the most common [NLP](https://en.wikipedia.org/wiki/Natural-language_processing) problems. The goal is to analyze a text and predict whether the underlying sentiment is positive, negative or neutral. 
\n", 12 | "*What can you use it for?* Here are a few ideas - measure sentiment of customer support tickets, survey responses, social media, and movie reviews! \n", 13 | "\n", 14 | "### Predicting sentiment of movie reviews\n", 15 | "\n", 16 | "In this notebook we will build a [Convolutional Neural Network](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) (CNN) classifier to predict the sentiment (positive or negative) of movie reviews. \n", 17 | "\n", 18 | "\n", 19 | "\n", 20 | "We will use the [Stanford Large Movie Reviews](http://ai.stanford.edu/~amaas/data/sentiment/) dataset for training our model. The dataset is compiled from a collection of 50,000 reviews from IMDB. It contains an equal number of positive and negative reviews. The authors considered only highly polarized reviews. Negative reviews have scores ≤ 4 (out of 10), while positive reviews have scores ≥ 7. Neutral reviews are not included. The dataset is divided evenly into training and test sets.\n", 21 | "\n", 22 | "We will:\n", 23 | "- Preprocess text data for NLP\n", 24 | "- Build and train a 1-D CNN using Keras and TensorFlow\n", 25 | "- Evaluate our model on the test set\n", 26 | "- Run the model on your own movie reviews!\n", 27 | "\n", 28 | "### Instructions\n", 29 | "\n", 30 | "- To execute a code cell, click on the cell and press `Shift + Enter` (shortcut for Run).\n", 31 | "- To learn more about Workspaces, check out the [Getting Started Notebook](get_started_workspace.ipynb).\n", 32 | "- **Tip**: *Feel free to try this Notebook with your own data and on your own super awesome sentiment classification task.*\n", 33 | "\n", 34 | "Now, let's get started! 
🚀" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "## Initial Setup\n", 42 | "\n", 43 | "Let's start by importing some packages" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 1, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "name": "stderr", 53 | "output_type": "stream", 54 | "text": [ 55 | "Using TensorFlow backend.\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "from keras.preprocessing import sequence\n", 61 | "from keras.models import Sequential\n", 62 | "from keras.layers import Dense, Embedding, GlobalMaxPooling1D, Flatten, Conv1D, Dropout, Activation\n", 63 | "from keras.preprocessing.text import Tokenizer\n", 64 | "\n", 65 | "import tensorflow as tf\n", 66 | "import matplotlib.pyplot as plt\n", 67 | "import numpy as np\n", 68 | "import pandas as pd\n", 69 | "\n", 70 | "import os\n", 71 | "import re\n", 72 | "import string\n", 73 | "\n", 74 | "# For reproducibility\n", 75 | "from tensorflow import set_random_seed\n", 76 | "from numpy.random import seed\n", 77 | "seed(1)\n", 78 | "set_random_seed(2)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## Training Parameters\n", 86 | "\n", 87 | "We'll set the hyperparameters for training our model. 
If you understand what they mean, feel free to play around - otherwise, we recommend keeping the defaults for your first run 🙂" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 2, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Hyperparams if GPU is available\n", 97 | "if tf.test.is_gpu_available():\n", 98 | " # GPU\n", 99 | " BATCH_SIZE = 128 # Number of examples used in each iteration\n", 100 | " EPOCHS = 2 # Number of passes through entire dataset\n", 101 | " VOCAB_SIZE = 30000 # Size of vocabulary dictionary\n", 102 | " MAX_LEN = 500 # Max length of review (in words)\n", 103 | " EMBEDDING_DIM = 40 # Dimension of word embedding vector\n", 104 | "\n", 105 | "# Hyperparams for CPU training\n", 106 | "else:\n", 107 | " # CPU\n", 108 | " BATCH_SIZE = 32\n", 109 | " EPOCHS = 2\n", 110 | " VOCAB_SIZE = 20000\n", 111 | " MAX_LEN = 90\n", 112 | " EMBEDDING_DIM = 40" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## Data\n", 120 | "\n", 121 | "The movie reviews dataset is already attached to your workspace (if you want to attach your own data, [check out our docs](https://docs.floydhub.com/guides/workspace/#attaching-floydhub-datasets)).\n", 122 | "\n", 123 | "Let's take a look at the data. The labels are encoded in the dataset: **0** is for a *negative* and **1** for a *positive* review." 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 3, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "name": "stdout", 133 | "output_type": "stream", 134 | "text": [ 135 | "Train shape (rows, columns): (25000, 2) , Validation shape (rows, columns): (25000, 2)\n", 136 | "\n", 137 | "--- First Sample ---\n", 138 | "Label: 0\n", 139 | "Text: Watch the Original with the same title from 1944! This made for TV movie, is just god-awful! Although it does use (as far as I can tell) almost the same dialog, it just doesn't work! Is it the acting, the poor directing? 
OK so it's made for TV, but why watch a bad copy, when you can get your hands on the superb original? Especially as you'll be spoiled to the plot and won't enjoy the original as much, as if you've watched it first!

There are a few things that are different from the original (it's shorter for once), but all are for the worse! The actors playing the parts here, just don't fit the bill! You just don't believe them and who could top Edward G. Robinsons performance from the original? If you want, only watch it after you've seen the original and even then you'll be very brave, if you watch it through! It's almost sacrilege!\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "DS_PATH = '/floyd/input/imdb/' # ADD path/to/dataset\n", 145 | "LABELS = ['negative', 'positive']\n", 146 | "\n", 147 | "# Load data\n", 148 | "train = pd.read_csv(os.path.join(DS_PATH, \"train.tsv\"), sep='\\t') # EDIT WITH YOUR TRAIN FILE NAME\n", 149 | "val = pd.read_csv(os.path.join(DS_PATH, \"val.tsv\"), sep='\\t') # EDIT WITH YOUR VALIDATION FILE NAME\n", 150 | "\n", 151 | "print(\"Train shape (rows, columns): \", train.shape, \", Validation shape (rows, columns): \", val.shape)\n", 152 | "\n", 153 | "# What a row/sample looks like\n", 154 | "print(\"\\n--- First Sample ---\")\n", 155 | "print('Label:', train['label'][0])\n", 156 | "print('Text:', train['text'][0])" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 4, 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "image/png": "\n", 167 | "text/plain": [ 168 | "
" 169 | ] 170 | }, 171 | "metadata": {}, 172 | "output_type": "display_data" 173 | } 174 | ], 175 | "source": [ 176 | "# Custom Tokenizer\n", 177 | "re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')\n", 178 | "def tokenize(s): return re_tok.sub(r' \\1 ', s).split()\n", 179 | "\n", 180 | "# Plot sentence by lenght\n", 181 | "plt.hist([len(tokenize(s)) for s in train['text'].values], bins=50)\n", 182 | "plt.title('Tokens per sentence')\n", 183 | "plt.xlabel('Len (number of token)')\n", 184 | "plt.ylabel('# samples')\n", 185 | "plt.show()" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "## Data Preprocessing\n", 193 | "\n", 194 | "Before feeding the data into the model, we have to preprocess the text. \n", 195 | "\n", 196 | "- We will use the Keras `Tokenizer` to convert each word to a corresponding integer ID. Representing words as integers saves a lot of memory!\n", 197 | "- In order to feed the text into our CNN, all texts should be the same length. We ensure this using the `sequence.pad_sequences()` method and `MAX_LEN` variable. All texts longer than `MAX_LEN` are truncated and shorter texts are padded to get them to the same length.\n", 198 | "\n", 199 | "The *Tokens per sentence* plot (see above) is useful for setting the `MAX_LEN` training hyperparameter. " 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 5, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "name": "stdout", 209 | "output_type": "stream", 210 | "text": [ 211 | "First sample before preprocessing: \n", 212 | " Watch the Original with the same title from 1944! This made for TV movie, is just god-awful! Although it does use (as far as I can tell) almost the same dialog, it just doesn't work! Is it the acting, the poor directing? OK so it's made for TV, but why watch a bad copy, when you can get your hands on the superb original? 
Especially as you'll be spoiled to the plot and won't enjoy the original as much, as if you've watched it first!

There are a few things that are different from the original (it's shorter for once), but all are for the worse! The actors playing the parts here, just don't fit the bill! You just don't believe them and who could top Edward G. Robinsons performance from the original? If you want, only watch it after you've seen the original and even then you'll be very brave, if you watch it through! It's almost sacrilege! \n", 213 | "\n", 214 | "First sample after preprocessing: \n", 215 | " [ 5 1 111 2 525 354 1 201 14 73 14 44 871 293\n", 216 | " 9 83 7 7 47 23 3 168 180 12 23 272 36 1\n", 217 | " 201 42 5844 15 277 18 29 23 15 1 430 1 153 392\n", 218 | " 1 528 130 40 89 1180 1 985 22 40 89 261 95 2\n", 219 | " 34 97 347 2485 1328 236 36 1 201 44 22 178 61 103\n", 220 | " 9 100 871 107 1 201 2 57 92 487 27 52 2502 44\n", 221 | " 22 103 9 140 42 217]\n" 222 | ] 223 | } 224 | ], 225 | "source": [ 226 | "imdb_tokenizer = Tokenizer(num_words=VOCAB_SIZE)\n", 227 | "imdb_tokenizer.fit_on_texts(train['text'].values)\n", 228 | "\n", 229 | "x_train_seq = imdb_tokenizer.texts_to_sequences(train['text'].values)\n", 230 | "x_val_seq = imdb_tokenizer.texts_to_sequences(val['text'].values)\n", 231 | "\n", 232 | "x_train = sequence.pad_sequences(x_train_seq, maxlen=MAX_LEN, padding=\"post\", value=0)\n", 233 | "x_val = sequence.pad_sequences(x_val_seq, maxlen=MAX_LEN, padding=\"post\", value=0)\n", 234 | "\n", 235 | "y_train, y_val = train['label'].values, val['label'].values\n", 236 | "\n", 237 | "print('First sample before preprocessing: \\n', train['text'].values[0], '\\n')\n", 238 | "print('First sample after preprocessing: \\n', x_train[0])" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "## Model\n", 246 | "\n", 247 | "We will implement a model similar to Yoon Kim’s [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882).\n", 248 | "\n", 249 | "![cnn for 
text](https://github.com/floydhub/sentiment-analysis-template/raw/master/images/cnn.png)\n", 250 | "*Image from [the paper](https://arxiv.org/abs/1408.5882)*" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 6, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "# Model Parameters - You can play with these\n", 260 | "\n", 261 | "NUM_FILTERS = 250\n", 262 | "KERNEL_SIZE = 3\n", 263 | "HIDDEN_DIMS = 250" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 7, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | "Build model...\n", 276 | "WARNING:tensorflow:From /usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.\n", 277 | "Instructions for updating:\n", 278 | "`NHWC` for data_format is deprecated, use `NWC` instead\n", 279 | "_________________________________________________________________\n", 280 | "Layer (type) Output Shape Param # \n", 281 | "=================================================================\n", 282 | "embedding_1 (Embedding) (None, 90, 40) 800000 \n", 283 | "_________________________________________________________________\n", 284 | "dropout_1 (Dropout) (None, 90, 40) 0 \n", 285 | "_________________________________________________________________\n", 286 | "conv1d_1 (Conv1D) (None, 88, 250) 30250 \n", 287 | "_________________________________________________________________\n", 288 | "global_max_pooling1d_1 (Glob (None, 250) 0 \n", 289 | "_________________________________________________________________\n", 290 | "dense_1 (Dense) (None, 250) 62750 \n", 291 | "_________________________________________________________________\n", 292 | "dropout_2 (Dropout) (None, 250) 0 \n", 293 | "_________________________________________________________________\n", 294 | 
"activation_1 (Activation) (None, 250) 0 \n", 295 | "_________________________________________________________________\n", 296 | "dense_2 (Dense) (None, 1) 251 \n", 297 | "_________________________________________________________________\n", 298 | "activation_2 (Activation) (None, 1) 0 \n", 299 | "=================================================================\n", 300 | "Total params: 893,251\n", 301 | "Trainable params: 893,251\n", 302 | "Non-trainable params: 0\n", 303 | "_________________________________________________________________\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "# CNN Model\n", 309 | "print('Build model...')\n", 310 | "model = Sequential()\n", 311 | "\n", 312 | "# we start off with an efficient embedding layer which maps\n", 313 | "# our vocab indices into EMBEDDING_DIM dimensions\n", 314 | "model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LEN))\n", 315 | "model.add(Dropout(0.2))\n", 316 | "\n", 317 | "# we add a Convolution1D, which will learn NUM_FILTERS filters\n", 318 | "model.add(Conv1D(NUM_FILTERS,\n", 319 | " KERNEL_SIZE,\n", 320 | " padding='valid',\n", 321 | " activation='relu',\n", 322 | " strides=1))\n", 323 | "\n", 324 | "# we use max pooling:\n", 325 | "model.add(GlobalMaxPooling1D())\n", 326 | "\n", 327 | "# We add a vanilla hidden layer:\n", 328 | "model.add(Dense(HIDDEN_DIMS))\n", 329 | "model.add(Dropout(0.2))\n", 330 | "model.add(Activation('relu'))\n", 331 | "\n", 332 | "# We project onto a single unit output layer, and squash it with a sigmoid:\n", 333 | "model.add(Dense(1))\n", 334 | "model.add(Activation('sigmoid'))\n", 335 | "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", 336 | "model.summary()" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "## Train & Evaluate\n", 344 | "\n", 345 | "If you left the default hyperpameters in the Notebook untouched, your training should take approximately: \n", 346 | "\n", 
347 | "- On CPU machine: 2 minutes for 2 epochs.\n", 348 | "- On GPU machine: 1 minute for 2 epochs.\n", 349 | "\n", 350 | "You should get an accuracy of > 84%. *Note*: The model will start overfitting after 2 to 3 epochs. " 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 8, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "name": "stdout", 360 | "output_type": "stream", 361 | "text": [ 362 | "Train on 22500 samples, validate on 2500 samples\n", 363 | "Epoch 1/2\n", 364 | "22500/22500 [==============================] - 56s 2ms/step - loss: 0.4780 - acc: 0.7473 - val_loss: 0.3608 - val_acc: 0.8412\n", 365 | "Epoch 2/2\n", 366 | "22500/22500 [==============================] - 53s 2ms/step - loss: 0.2796 - acc: 0.8832 - val_loss: 0.3510 - val_acc: 0.8484\n", 367 | "25000/25000 [==============================] - 5s 201us/step\n", 368 | "\n", 369 | "Accuracy: 85.052\n" 370 | ] 371 | } 372 | ], 373 | "source": [ 374 | "# fit a model\n", 375 | "model.fit(x_train, y_train,\n", 376 | " batch_size=BATCH_SIZE,\n", 377 | " epochs=EPOCHS,\n", 378 | " validation_split=0.1,\n", 379 | " verbose=2)\n", 380 | "\n", 381 | "# Evaluate the model\n", 382 | "score, acc = model.evaluate(x_val, y_val, batch_size=BATCH_SIZE)\n", 383 | "print('\\nAccuracy: ', acc*100)\n", 384 | "\n", 385 | "pred = model.predict_classes(x_val)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 9, 391 | "metadata": {}, 392 | "outputs": [ 393 | { 394 | "name": "stdout", 395 | "output_type": "stream", 396 | "text": [ 397 | " precision recall f1-score support\n", 398 | "\n", 399 | " negative 0.81 0.88 0.84 11573\n", 400 | " positive 0.89 0.83 0.86 13427\n", 401 | "\n", 402 | "avg / total 0.85 0.85 0.85 25000\n", 403 | "\n" 404 | ] 405 | }, 406 | { 407 | "data": { 408 | "image/png": "\n", 409 | "text/plain": [ 410 | "
" 411 | ] 412 | }, 413 | "metadata": {}, 414 | "output_type": "display_data" 415 | } 416 | ], 417 | "source": [ 418 | "# Plot confusion matrix\n", 419 | "from sklearn.metrics import confusion_matrix\n", 420 | "from support import print_confusion_matrix\n", 421 | "cnf_matrix = confusion_matrix(pred, y_val)\n", 422 | "_ = print_confusion_matrix(cnf_matrix, LABELS)\n", 423 | "\n", 424 | "# Print Precision Recall F1-Score Report\n", 425 | "from sklearn.metrics import classification_report\n", 426 | "\n", 427 | "report = classification_report(pred, y_val, target_names=LABELS)\n", 428 | "print(report)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "## It's your turn" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "Test out the model you just trained. Edit the `my_review` variable and Run the Code cell below. Have fun!🎉\n", 443 | "\n", 444 | "Here are some inspirations:\n", 445 | "- Rian Johnson\\'s Star Wars: The Last Jedi is a satisfying, at times transporting entertainment with visual wit and a distinctly human touch. \n", 446 | "- All evidence points to this animated film being contrived as a money-making scheme. The result is worse than crass, it\\'s abominably bad.\n", 447 | "- It was inevitable that there would be the odd turkey in there. What I didn\\'t realise however, was that there could be one THIS bad.\n", 448 | "\n", 449 | "And some wrong predictions:\n", 450 | "- Pulp Fiction: Quentin Tarantino proves that he is the master of witty dialogue and a fast plot that doesn\\'t allow the viewer a moment of boredom or rest.\n", 451 | "\n", 452 | "Can you do better? Play around with the model hyperparameters!" 
453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 11, 458 | "metadata": {}, 459 | "outputs": [ 460 | { 461 | "data": { 462 | "application/vnd.jupyter.widget-view+json": { 463 | "model_id": "1488dc6f57fa43e2a60c177a65a5e5de", 464 | "version_major": 2, 465 | "version_minor": 0 466 | }, 467 | "text/plain": [ 468 | "interactive(children=(Textarea(value='', description='review', placeholder='Type your Review here'), Button(de…" 469 | ] 470 | }, 471 | "metadata": {}, 472 | "output_type": "display_data" 473 | } 474 | ], 475 | "source": [ 476 | "from ipywidgets import interact_manual\n", 477 | "from ipywidgets import widgets\n", 478 | "\n", 479 | "def get_prediction(review):\n", 480 | " # Preprocessing\n", 481 | " review_np_array = imdb_tokenizer.texts_to_sequences([review])\n", 482 | " review_np_array = sequence.pad_sequences(review_np_array, maxlen=MAX_LEN, padding=\"post\", value=0)\n", 483 | " # Prediction\n", 484 | " score = model.predict(review_np_array)[0][0]\n", 485 | " prediction = LABELS[model.predict_classes(review_np_array)[0][0]]\n", 486 | " print('REVIEW:', review, '\\nPREDICTION:', prediction, '\\nSCORE: ', score)\n", 487 | "\n", 488 | "interact_manual(get_prediction, review=widgets.Textarea(placeholder='Type your Review here'));" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "## Save the result" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 12, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "import pickle\n", 505 | "\n", 506 | "# Save the tokenizer\n", 507 | "with open('models/tokenizer.pickle', 'wb') as handle:\n", 508 | " pickle.dump(imdb_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 509 | " \n", 510 | "# Save the model weights\n", 511 | "model.save_weights('models/cnn_sentiment_weights.h5')" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "##### That's all folks - 
don't forget to shut down your workspace once you're done 🙂" 519 | ] 520 | } 521 | ], 522 | "metadata": { 523 | "kernelspec": { 524 | "display_name": "Python 2", 525 | "language": "python", 526 | "name": "python2" 527 | }, 528 | "language_info": { 529 | "codemirror_mode": { 530 | "name": "ipython", 531 | "version": 2 532 | }, 533 | "file_extension": ".py", 534 | "mimetype": "text/x-python", 535 | "name": "python", 536 | "nbconvert_exporter": "python", 537 | "pygments_lexer": "ipython2", 538 | "version": "2.7.10" 539 | } 540 | }, 541 | "nbformat": 4, 542 | "nbformat_minor": 2 543 | } 544 | --------------------------------------------------------------------------------
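A note on the Data Preprocessing cell in sentiment-movie-review.ipynb: the notebook calls `sequence.pad_sequences(..., maxlen=MAX_LEN, padding="post", value=0)` and leaves `truncating` at the Keras default, `"pre"`. So reviews longer than `MAX_LEN` lose tokens from the start, while shorter reviews get zeros appended at the end. The sketch below imitates that behavior in plain Python so it can be checked without Keras installed. `pad_post_truncate_pre` is a hypothetical helper written for this illustration and is not part of the repository, and the token IDs are made up.

```python
# Hypothetical helper (not part of the repository): a plain-Python imitation of
# sequence.pad_sequences(seqs, maxlen=..., padding="post", value=0)
# with Keras' default truncating="pre".
def pad_post_truncate_pre(seqs, maxlen, value=0):
    padded = []
    for seq in seqs:
        # "pre" truncation: keep only the last `maxlen` tokens
        kept = list(seq)[-maxlen:]
        # "post" padding: append the fill value until the row has `maxlen` entries
        kept += [value] * (maxlen - len(kept))
        padded.append(kept)
    return padded


# Made-up token IDs standing in for tokenized reviews
short_review = [5, 1, 111]
long_review = [5, 1, 111, 2, 525, 354, 1, 201]

result = pad_post_truncate_pre([short_review, long_review], maxlen=5)
print(result[0])  # short review: zeros appended at the end
print(result[1])  # long review: only the last 5 tokens survive
```

If the opening words of a review matter more for your task than the closing ones, passing `truncating="post"` to `pad_sequences` keeps the start of each review instead. Which choice works better for this dataset is an open question that the notebook does not test.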