├── Hierarchical Attentional Hybrid Neural Networks.pdf
├── README.md
├── hahnn-for-document-classification.ipynb
├── make_predictions.ipynb
└── track_colab.PNG

/Hierarchical Attentional Hybrid Neural Networks.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luisfredgs/cnn-hierarchical-network-for-document-classification/449c8df5d1daa22bec580c470c9f6e7265842220/Hierarchical Attentional Hybrid Neural Networks.pdf
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
1 | # Hierarchical Attentional Hybrid Neural Networks for Document Classification
2 |
3 | This paper was accepted at **ICANN 2019**.
4 |
5 | J. Abreu, L. Fred, D. Macêdo, C. Zanchettin, "[**Hierarchical Attentional Hybrid Neural Networks for Document Classification**](https://arxiv.org/abs/1901.06610)".
6 |
7 | ## Performance on Yelp Dataset multi-class
8 |
9 | ![Performance on the Yelp multi-class dataset, tracked on Google Colab](track_colab.PNG)
10 |
11 | ## Datasets
12 | | Dataset | Classes | Documents | Download |
13 | |------------------------|:---------:|:-------:|:--------:|
14 | | Yelp Reviews 2018 | 5 | 1569264 | [link](https://www.kaggle.com/luisfredgs/hahnn-for-document-classification) |
15 | | IMDb Movie Review | 2 | 50000 | [link](https://www.kaggle.com/luisfredgs/hahnn-for-document-classification) |
16 |
17 | Do you want to use pre-trained FastText word embeddings? Download them from [https://www.kaggle.com/luisfredgs/wiki-news-300d-1m-subword](https://www.kaggle.com/luisfredgs/wiki-news-300d-1m-subword) and check the source code for more details. Pay attention to Colab's RAM and GPU limits.
18 |
19 | ## Requirements
20 |
21 | * Python 3
22 | * tensorflow 1.10
23 | * Keras 2.x
24 | * spacy 2.0
25 | * gensim
26 | * tqdm
27 | * matplotlib
28 |
29 | A GPU with CUDA support is required to run this code.
30 |
31 | ## Run this code on Google Colab with Free GPU
32 |
33 | On Google Colab, select "**Runtime**" > "**Change runtime type**" and choose Python 3. Ensure "**Hardware accelerator**" is set to GPU (the default is CPU).
34 |
35 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LH7xLroO6QWO9dC6Hipn7xHYxVchJiUt)
36 |
37 | To run this notebook on Google Colab you don't need to download the dataset files: just type your Kaggle username and API key when the corresponding cell runs, and the notebook will fetch everything for you. If you want to make predictions on new text data using a trained model, check [**make_predictions.ipynb**](https://github.com/luisfredgs/cnn-hierarchical-network-for-document-classification/blob/master/make_predictions.ipynb) for more details; a minimal sketch is shown below.
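For quick reference, here is a minimal prediction sketch distilled from the cells of `make_predictions.ipynb`. It is only a sketch: it assumes the notebook's `HAHNetwork` class and `normalize()` helper have already been defined, and that a trained model was saved as `model.h5` (together with its `model.h5.tokenizer` pickle) under `./saved_models`.

```python
import numpy as np

# Rebuild the wrapper and restore the trained Keras model plus its fitted tokenizer.
# load_weights() reads ./saved_models/model.h5 and ./saved_models/model.h5.tokenizer.
model = HAHNetwork()
model.load_weights('./saved_models', './model.h5')

text = "I absolutely love Daughters of the Night Sky. I would give this book six stars if I could."

# normalize() splits the review into cleaned sentences;
# predict() returns one softmax vector per document.
sentences = normalize(text)
probabilities = model.predict([sentences])[0]
predicted_class = int(np.argmax(probabilities))

print("Predicted class:", predicted_class)
print("Class probabilities:", probabilities)
```

The notebook itself additionally computes word- and sentence-level attention maps for the same input via `model.activation_maps(text, websafe=True)`.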
38 | 39 | 40 | ## Please cite 41 | ``` 42 | @article{abreu2019hierarchical, 43 | title={Hierarchical Attentional Hybrid Neural Networks for Document Classification}, 44 | author={Abreu, Jader and Fred, Luis and Mac{\^e}do, David and Zanchettin, Cleber}, 45 | journal={arXiv preprint arXiv:1901.06610}, 46 | year={2019} 47 | } 48 | ``` 49 | -------------------------------------------------------------------------------- /make_predictions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Welcome To Colaboratory", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [], 10 | "toc_visible": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "accelerator": "GPU" 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "code", 21 | "metadata": { 22 | "id": "AoekF5mpfCyE", 23 | "colab_type": "code", 24 | "colab": { 25 | "base_uri": "https://localhost:8080/", 26 | "height": 102 27 | }, 28 | "outputId": "9b6ea3de-a59a-4272-d305-3461810548a3" 29 | }, 30 | "source": [ 31 | "import datetime, pickle, os, codecs, re, string\n", 32 | "import json\n", 33 | "import random\n", 34 | "import numpy as np\n", 35 | "import keras\n", 36 | "from keras.models import *\n", 37 | "from keras.layers import *\n", 38 | "from keras.optimizers import *\n", 39 | "from keras.callbacks import *\n", 40 | "from keras import regularizers\n", 41 | "from keras.preprocessing.text import Tokenizer\n", 42 | "from keras.preprocessing.sequence import pad_sequences\n", 43 | "from keras import backend as K\n", 44 | "from keras.utils import CustomObjectScope\n", 45 | "from keras.engine.topology import Layer\n", 46 | "\n", 47 | "#\n", 48 | "from keras.engine import InputSpec\n", 49 | "\n", 50 | "from keras import initializers\n", 51 | "\n", 52 | "import pandas as pd\n", 53 | "from tqdm import tqdm\n", 54 | "\n", 55 | "import string\n", 56 | "from spacy.lang.en import English\n", 57 | "import gensim, nltk, logging\n", 58 | "\n", 59 | "from nltk.corpus import stopwords\n", 60 | "from nltk import tokenize\n", 61 | "from nltk.stem import SnowballStemmer\n", 62 | "\n", 63 | "nltk.download('punkt')\n", 64 | "nltk.download('stopwords')\n", 65 | "\n", 66 | "from sklearn.manifold import TSNE\n", 67 | "import matplotlib.pyplot as plt\n", 68 | "import en_core_web_sm\n", 69 | "\n", 70 | "from IPython.display import HTML, display\n", 71 | "\n", 72 | "import tensorflow as tf\n", 73 | "\n", 74 | "from numpy.random import seed\n", 75 | "from tensorflow import set_random_seed\n", 76 | "os.environ['PYTHONHASHSEED'] = str(1024)\n", 77 | "set_random_seed(1024)\n", 78 | "seed(1024)\n", 79 | "np.random.seed(1024)\n", 80 | "random.seed(1024)" 81 | ], 82 | "execution_count": 1, 83 | "outputs": [ 84 | { 85 | "output_type": "stream", 86 | "text": [ 87 | "Using TensorFlow backend.\n" 88 | ], 89 | "name": "stderr" 90 | }, 91 | { 92 | "output_type": "stream", 93 | "text": [ 94 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 95 | "[nltk_data] Package punkt is already up-to-date!\n", 96 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 97 | "[nltk_data] Package stopwords is already up-to-date!\n" 98 | ], 99 | "name": "stdout" 100 | } 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": { 106 | "id": "AOC9MzbzfJ0J", 107 | "colab_type": "text" 108 | }, 109 | "source": [ 110 | "# Attention layer" 111 | ] 112 | }, 113 | { 
114 | "cell_type": "code", 115 | "metadata": { 116 | "id": "JWZVSCpQ8q4R", 117 | "colab_type": "code", 118 | "colab": {} 119 | }, 120 | "source": [ 121 | "class Attention(Layer):\n", 122 | " def __init__(self, **kwargs):\n", 123 | " self.init = initializers.get('normal')\n", 124 | " self.supports_masking = True\n", 125 | " self.attention_dim = 50\n", 126 | " super(Attention, self).__init__(**kwargs)\n", 127 | "\n", 128 | " def build(self, input_shape):\n", 129 | " assert len(input_shape) == 3\n", 130 | " self.W = K.variable(self.init((input_shape[-1], 1)))\n", 131 | " self.b = K.variable(self.init((self.attention_dim, )))\n", 132 | " self.u = K.variable(self.init((self.attention_dim, 1)))\n", 133 | " self.trainable_weights = [self.W, self.b, self.u]\n", 134 | " super(Attention, self).build(input_shape)\n", 135 | "\n", 136 | " def compute_mask(self, inputs, mask=None):\n", 137 | " return mask\n", 138 | "\n", 139 | " def call(self, x, mask=None):\n", 140 | " uit = K.tanh(K.bias_add(K.dot(x, self.W), self.b))\n", 141 | " ait = K.dot(uit, self.u)\n", 142 | " ait = K.squeeze(ait, -1)\n", 143 | " ait = K.exp(ait)\n", 144 | "\n", 145 | " if mask is not None:\n", 146 | " ait *= K.cast(mask, K.floatx())\n", 147 | " \n", 148 | " ait /= K.cast(K.sum(ait, axis=1, keepdims=True) + K.epsilon(), K.floatx())\n", 149 | " ait = K.expand_dims(ait)\n", 150 | " weighted_input = x * ait\n", 151 | " output = K.sum(weighted_input, axis=1)\n", 152 | " return output\n", 153 | "\n", 154 | " def compute_output_shape(self, input_shape):\n", 155 | " return (input_shape[0], input_shape[-1])" 156 | ], 157 | "execution_count": 0, 158 | "outputs": [] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": { 163 | "id": "s_0JG25KfRqq", 164 | "colab_type": "text" 165 | }, 166 | "source": [ 167 | "# Model architecture" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "metadata": { 173 | "id": "lPmVsbH8fQwp", 174 | "colab_type": "code", 175 | "colab": {} 176 | }, 177 | "source": [ 178 | "class HAHNetwork():\n", 179 | " def __init__(self):\n", 180 | " self.model = None\n", 181 | " self.MAX_SENTENCE_LENGTH = 0\n", 182 | " self.MAX_SENTENCE_COUNT = 0\n", 183 | " self.VOCABULARY_SIZE = 0\n", 184 | " self.word_embedding = None\n", 185 | " self.model = None\n", 186 | " self.word_attention_model = None\n", 187 | " self.tokenizer = None\n", 188 | " self.class_count = 2\n", 189 | "\n", 190 | " def build_model(self, n_classes=2, embedding_dim=200, embeddings_path=False):\n", 191 | " \n", 192 | " l2_reg = regularizers.l2(0.001)\n", 193 | " \n", 194 | " embedding_weights = np.random.normal(0, 1, (len(self.tokenizer.word_index) + 1, embedding_dim))\n", 195 | " \n", 196 | " if embeddings_path is not None:\n", 197 | "\n", 198 | " if word_embedding_type is 'from_scratch':\n", 199 | " # FastText\n", 200 | " filename = './fasttext_model.txt' \n", 201 | " model = gensim.models.FastText.load(filename)\n", 202 | "\n", 203 | " embeddings_index = model.wv \n", 204 | " embedding_matrix = np.zeros( ( len(self.tokenizer.word_index) + 1, embedding_dim) )\n", 205 | " for word, i in self.tokenizer.word_index.items():\n", 206 | " try:\n", 207 | " embedding_vector = embeddings_index[word]\n", 208 | " if embedding_vector is not None:\n", 209 | " embedding_matrix[i] = embedding_vector\n", 210 | " except Exception as e:\n", 211 | " #print(str(e))\n", 212 | " continue\n", 213 | "\n", 214 | "\n", 215 | " else: \n", 216 | " embedding_dim = 300\n", 217 | " embedding_matrix = load_subword_embedding_300d(self.tokenizer.word_index)\n", 218 | "\n", 
219 | " embedding_weights = embedding_matrix\n", 220 | "\n", 221 | " sentence_in = Input(shape=(self.MAX_SENTENCE_LENGTH,), dtype='int32', name=\"input_1\")\n", 222 | " \n", 223 | " embedding_trainable = True\n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " if word_embedding_type is 'pre_trained':\n", 228 | " embedding_trainable = False\n", 229 | " \n", 230 | " embedded_word_seq = Embedding(\n", 231 | " self.VOCABULARY_SIZE,\n", 232 | " embedding_dim,\n", 233 | " weights=[embedding_weights],\n", 234 | " input_length=self.MAX_SENTENCE_LENGTH,\n", 235 | " trainable=embedding_trainable,\n", 236 | " #mask_zero=True,\n", 237 | " mask_zero=False,\n", 238 | " name='word_embeddings',)(sentence_in) \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " dropout = Dropout(0.2)(embedded_word_seq)\n", 243 | " filter_sizes = [3,4,5]\n", 244 | " convs = []\n", 245 | " for filter_size in filter_sizes:\n", 246 | " conv = Conv1D(filters=64, kernel_size=filter_size, padding='same', activation='relu')(dropout)\n", 247 | " pool = MaxPool1D(filter_size)(conv)\n", 248 | " convs.append(pool)\n", 249 | " \n", 250 | " concatenate = Concatenate(axis=1)(convs)\n", 251 | " \n", 252 | " if rnn_type is 'GRU':\n", 253 | " #word_encoder = Bidirectional(CuDNNGRU(50, return_sequences=True, dropout=0.2))(concatenate) \n", 254 | " dropout = Dropout(0.1)(concatenate)\n", 255 | " word_encoder = Bidirectional(CuDNNGRU(50, return_sequences=True))(dropout) \n", 256 | " else:\n", 257 | " word_encoder = Bidirectional(\n", 258 | " LSTM(50, return_sequences=True, dropout=0.2))(embedded_word_seq)\n", 259 | " \n", 260 | " \n", 261 | " dense_transform_word = Dense(\n", 262 | " 100, \n", 263 | " activation='relu', \n", 264 | " name='dense_transform_word', \n", 265 | " kernel_regularizer=l2_reg)(word_encoder)\n", 266 | " \n", 267 | " # word attention\n", 268 | " attention_weighted_sentence = Model(\n", 269 | " sentence_in, Attention(name=\"word_attention\")(dense_transform_word))\n", 270 | " \n", 271 | " self.word_attention_model = attention_weighted_sentence\n", 272 | " \n", 273 | " attention_weighted_sentence.summary()\n", 274 | "\n", 275 | " # sentence-attention-weighted document scores\n", 276 | " \n", 277 | " texts_in = Input(shape=(self.MAX_SENTENCE_COUNT, self.MAX_SENTENCE_LENGTH), dtype='int32', name=\"input_2\")\n", 278 | " \n", 279 | " attention_weighted_sentences = TimeDistributed(attention_weighted_sentence)(texts_in)\n", 280 | " \n", 281 | " \n", 282 | " if rnn_type is 'GRU':\n", 283 | " #sentence_encoder = Bidirectional(GRU(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.2))(attention_weighted_sentences)\n", 284 | " dropout = Dropout(0.1)(attention_weighted_sentences)\n", 285 | " sentence_encoder = Bidirectional(CuDNNGRU(50, return_sequences=True))(dropout)\n", 286 | " else:\n", 287 | " sentence_encoder = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.2))(attention_weighted_sentences)\n", 288 | " \n", 289 | " \n", 290 | " dense_transform_sentence = Dense(\n", 291 | " 100, \n", 292 | " activation='relu', \n", 293 | " name='dense_transform_sentence',\n", 294 | " kernel_regularizer=l2_reg)(sentence_encoder)\n", 295 | " \n", 296 | " # sentence attention\n", 297 | " attention_weighted_text = Attention(name=\"sentence_attention\")(dense_transform_sentence)\n", 298 | " \n", 299 | " \n", 300 | " prediction = Dense(n_classes, activation='softmax')(attention_weighted_text)\n", 301 | " \n", 302 | " model = Model(texts_in, prediction)\n", 303 | " model.summary()\n", 304 | " \n", 305 | " \n", 306 
| " optimizer=Adam(lr=learning_rate, decay=0.0001)\n", 307 | "\n", 308 | " model.compile(\n", 309 | " optimizer=optimizer,\n", 310 | " loss='categorical_crossentropy',\n", 311 | " metrics=['accuracy'])\n", 312 | "\n", 313 | " return model\n", 314 | "\n", 315 | "\n", 316 | " def get_tokenizer_filename(self, saved_model_filename):\n", 317 | " return saved_model_filename + '.tokenizer'\n", 318 | "\n", 319 | " def create_reverse_word_index(self):\n", 320 | " self.reverse_word_index = {value:key for key,value in self.tokenizer.word_index.items()}\n", 321 | "\n", 322 | " def encode_texts(self, texts):\n", 323 | " encoded_texts = np.zeros((len(texts), self.MAX_SENTENCE_COUNT, self.MAX_SENTENCE_LENGTH))\n", 324 | " for i, text in enumerate(texts):\n", 325 | " encoded_text = np.array(pad_sequences(\n", 326 | " self.tokenizer.texts_to_sequences(text), \n", 327 | " maxlen=self.MAX_SENTENCE_LENGTH))[:self.MAX_SENTENCE_COUNT]\n", 328 | " encoded_texts[i][-len(encoded_text):] = encoded_text\n", 329 | " return encoded_texts\n", 330 | "\n", 331 | "\n", 332 | " def encode_input(self, x, log=False):\n", 333 | " x = np.array(x)\n", 334 | " if not x.shape:\n", 335 | " x = np.expand_dims(x, 0)\n", 336 | " texts = np.array([normalize(text) for text in x])\n", 337 | " return self.encode_texts(texts)\n", 338 | "\n", 339 | "\n", 340 | " def predict(self, x):\n", 341 | " encoded_x = self.encode_texts(x)\n", 342 | " return self.model.predict(encoded_x)\n", 343 | "\n", 344 | " \n", 345 | " def activation_maps(self, text, websafe=False):\n", 346 | " normalized_text = normalize(text)\n", 347 | " \n", 348 | " encoded_text = self.encode_input(text)[0]\n", 349 | "\n", 350 | " # get word activations\n", 351 | " \n", 352 | " hidden_word_encoding_out = Model(\n", 353 | " inputs=self.word_attention_model.input, \n", 354 | " outputs=self.word_attention_model.get_layer('dense_transform_word').output)\n", 355 | " \n", 356 | " \n", 357 | " hidden_word_encodings = hidden_word_encoding_out.predict(encoded_text)\n", 358 | " \n", 359 | " word_context = self.word_attention_model.get_layer('word_attention').get_weights()[0]\n", 360 | "\n", 361 | " \n", 362 | " dot = np.dot(hidden_word_encodings, word_context)\n", 363 | " \n", 364 | " #u_wattention = encoded_text*np.exp(np.squeeze(dot))\n", 365 | " u_wattention = encoded_text\n", 366 | " \n", 367 | " if websafe:\n", 368 | " u_wattention = u_wattention.astype(float)\n", 369 | "\n", 370 | " nopad_encoded_text = encoded_text[-len(normalized_text):]\n", 371 | " nopad_encoded_text = [list(filter(lambda x: x > 0, sentence)) for sentence in nopad_encoded_text]\n", 372 | " reconstructed_texts = [[self.reverse_word_index[int(i)] \n", 373 | " for i in sentence] for sentence in nopad_encoded_text]\n", 374 | " nopad_wattention = u_wattention[-len(normalized_text):]\n", 375 | " nopad_wattention = nopad_wattention/np.expand_dims(np.sum(nopad_wattention, -1), -1)\n", 376 | " nopad_wattention = np.array([attention_seq[-len(sentence):] \n", 377 | " for attention_seq, sentence in zip(nopad_wattention, nopad_encoded_text)])\n", 378 | " word_activation_maps = []\n", 379 | " for i, text in enumerate(reconstructed_texts):\n", 380 | " word_activation_maps.append(list(zip(text, nopad_wattention[i])))\n", 381 | " \n", 382 | " hidden_sentence_encoding_out = Model(inputs=self.model.input,\n", 383 | " outputs=self.model.get_layer('dense_transform_sentence').output)\n", 384 | " hidden_sentence_encodings = np.squeeze(\n", 385 | " hidden_sentence_encoding_out.predict(np.expand_dims(encoded_text, 0)), 0)\n", 386 | " 
sentence_context = self.model.get_layer('sentence_attention').get_weights()[0]\n", 387 | " u_sattention = np.exp(np.squeeze(np.dot(hidden_sentence_encodings, sentence_context), -1))\n", 388 | " if websafe:\n", 389 | " u_sattention = u_sattention.astype(float)\n", 390 | " nopad_sattention = u_sattention[-len(normalized_text):]\n", 391 | "\n", 392 | " nopad_sattention = nopad_sattention/np.expand_dims(np.sum(nopad_sattention, -1), -1)\n", 393 | "\n", 394 | " activation_map = list(zip(word_activation_maps, nopad_sattention)) \n", 395 | "\n", 396 | " return activation_map\n", 397 | " \n", 398 | " \n", 399 | " def load_weights(self, saved_model_dir, saved_model_filename):\n", 400 | " with CustomObjectScope({'Attention': Attention}):\n", 401 | " print(os.path.join(saved_model_dir, saved_model_filename))\n", 402 | " self.model = load_model(os.path.join(saved_model_dir, saved_model_filename)) \n", 403 | " self.word_attention_model = self.model.get_layer('time_distributed_1').layer\n", 404 | " tokenizer_path = os.path.join(\n", 405 | " saved_model_dir, self.get_tokenizer_filename(saved_model_filename))\n", 406 | " tokenizer_state = pickle.load(open(tokenizer_path, \"rb\" ))\n", 407 | " self.tokenizer = tokenizer_state['tokenizer']\n", 408 | " self.MAX_SENTENCE_COUNT = tokenizer_state['maxSentenceCount']\n", 409 | " self.MAX_SENTENCE_LENGTH = tokenizer_state['maxSentenceLength']\n", 410 | " self.VOCABULARY_SIZE = tokenizer_state['vocabularySize']\n", 411 | " self.create_reverse_word_index()" 412 | ], 413 | "execution_count": 0, 414 | "outputs": [] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": { 419 | "id": "Lkkjbzxsatxp", 420 | "colab_type": "text" 421 | }, 422 | "source": [ 423 | "# Normalize texts" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "metadata": { 429 | "id": "ox0OkPFvasId", 430 | "colab_type": "code", 431 | "colab": {} 432 | }, 433 | "source": [ 434 | "nlp = en_core_web_sm.load()\n", 435 | "\n", 436 | "\n", 437 | "puncts = [',', '.', '\"', ':', ')', '(', '-', '!', '?', '|', ';', \"'\", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\\\', '•', '~', '@', '£', \n", 438 | " '·', '_', '{', '}', '©', '^', '®', '`', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', \n", 439 | " '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', \n", 440 | " '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', \n", 441 | " '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', '#', '—–']\n", 442 | "\n", 443 | "\n", 444 | "def clean_str(string):\n", 445 | " string = re.sub(r\"\\'s\", \" \\'s\", string)\n", 446 | " string = re.sub(r\"\\'ve\", \" \\'ve\", string)\n", 447 | " string = re.sub(r\"n\\'t\", \" n\\'t\", string)\n", 448 | " string = re.sub(r\"\\'re\", \" \\'re\", string)\n", 449 | " string = re.sub(r\"\\'d\", \" \\'d\", string)\n", 450 | " string = re.sub(r\"\\'ll\", \" \\'ll\", string)\n", 451 | " string = re.sub(r\",\", \" , \", string)\n", 452 | " string = re.sub(r\"!\", \" ! \", string)\n", 453 | " string = re.sub(r\"\\(\", \" \\( \", string)\n", 454 | " string = re.sub(r\"\\)\", \" \\) \", string)\n", 455 | " string = re.sub(r\"\\?\", \" \\? 
\", string)\n", 456 | " string = re.sub(r\"\\s{2,}\", \" \", string)\n", 457 | "\n", 458 | " cleanr = re.compile('<.*?>')\n", 459 | "\n", 460 | " # string = re.sub(r'\\d+', '', string)\n", 461 | " string = re.sub(cleanr, '', string)\n", 462 | " # string = re.sub(\"'\", '', string)\n", 463 | " # string = re.sub(r'\\W+', ' ', string)\n", 464 | " string = string.replace('_', '')\n", 465 | "\n", 466 | "\n", 467 | " return string.strip().lower()\n", 468 | "\n", 469 | "\n", 470 | "\n", 471 | "def clean_puncts(x):\n", 472 | " x = str(x)\n", 473 | " for punct in puncts:\n", 474 | " x = x.replace(punct, f' {punct} ')\n", 475 | " return x\n", 476 | "\n", 477 | "def remove_stopwords(text):\n", 478 | " text = str(text) \n", 479 | " ## Convert words to lower case and split them\n", 480 | " text = text.lower().split()\n", 481 | " \n", 482 | " ## Remove stop words\n", 483 | " stops = set(stopwords.words(\"english\"))\n", 484 | " text = [w for w in text if not w in stops and len(w) >= 3]\n", 485 | " text = \" \".join(text)\n", 486 | " \n", 487 | " return text\n", 488 | "\n", 489 | "\n", 490 | "def normalize(text):\n", 491 | " text = text.lower().strip()\n", 492 | " doc = nlp(text)\n", 493 | " filtered_sentences = []\n", 494 | " for sentence in doc.sents: \n", 495 | " sentence = clean_puncts(sentence)\n", 496 | " sentence = clean_str(sentence) \n", 497 | " #sentence = remove_stopwords(sentence) \n", 498 | " filtered_sentences.append(sentence)\n", 499 | " return filtered_sentences" 500 | ], 501 | "execution_count": 0, 502 | "outputs": [] 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "metadata": { 507 | "id": "yevgKDr7eOIl", 508 | "colab_type": "text" 509 | }, 510 | "source": [ 511 | "# Prediction" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "metadata": { 517 | "id": "ns44_Hvh8wh3", 518 | "colab_type": "code", 519 | "colab": { 520 | "base_uri": "https://localhost:8080/", 521 | "height": 88 522 | }, 523 | "outputId": "e0e6e823-a0d2-4c2d-cba0-44cdbc18df3c" 524 | }, 525 | "source": [ 526 | "model = HAHNetwork()\n", 527 | "\n", 528 | "model.load_weights('./saved_models', './model.h5')\n", 529 | "\n", 530 | "\n", 531 | "import tensorflow as tf\n", 532 | "graph = tf.get_default_graph()\n", 533 | "\n", 534 | "text = \"I absolutely love Daughters of the Night Sky. This is a book that popped up as a free or low-cost Amazon special deal. I don't often succumb to these offers, but the brief description and, to be honest, the cover art intrigued me. I am so glad I did. 
I would give this book six stars if I could.\"\n", 535 | "ntext = normalize(text)\n", 536 | "\n", 537 | "\n", 538 | "global graph\n", 539 | "with graph.as_default():\n", 540 | " activation_maps = model.activation_maps(text, websafe=True)\n", 541 | " preds = model.predict([ntext])[0]\n", 542 | " prediction = np.argmax(preds).astype(float)\n", 543 | " data = {'activations': activation_maps, 'normalizedText': ntext, 'prediction': prediction}\n", 544 | " print(\"Activations map:\")\n", 545 | " print(json.dumps(data))" 546 | ], 547 | "execution_count": 7, 548 | "outputs": [ 549 | { 550 | "output_type": "stream", 551 | "text": [ 552 | "./saved_models/./model.h5\n", 553 | "Activations map:\n", 554 | "{\"activations\": [[[[\"i\", 0.0006597176408497163], [\"absolutely\", 0.04908299247921889], [\"love\", 0.014117957514183928], [\"daughters\", 0.41944847605224966], [\"of\", 0.001451378809869376], [\"the\", 0.0003958305845098298], [\"night\", 0.022298456260720412], [\"sky\", 0.4922813036020583], [\".\", 0.0002638870563398865]], 0.21974912946052083], [[[\"this\", 0.0017789072426937739], [\"is\", 0.0011859381617958492], [\"a\", 0.0005929690808979246], [\"book\", 0.08115205421431597], [\"that\", 0.0014400677678949598], [\"popped\", 0.3030072003388395], [\"up\", 0.005167301990681914], [\"as\", 0.0034731046166878443], [\"a\", 0.0005929690808979246], [\"free\", 0.023972892842016095], [\"or\", 0.005336721728081322], [\"low\", 0.07005506141465481], [\"cost\", 0.04862346463362982], [\"amazon\", 0.38881829733163914], [\"special\", 0.0265141889030072], [\"deal\", 0.03811944091486658], [\".\", 0.00016941973739940702]], 0.2073157501646788], [[[\"i\", 9.862321985088169e-05], [\"don\", 0.0017949426012860469], [\"'\", 0.0001577971517614107], [\"t\", 0.0004931160992544084], [\"often\", 0.011716438518284744], [\"succumb\", 0.5820545189159335], [\"to\", 0.00011834786382105803], [\"these\", 0.005049508856365142], [\"offers\", 0.027180559390902994], [\"but\", 0.0004536668113140558], [\"the\", 5.9173931910529015e-05], [\"brief\", 0.09984614777703263], [\"description\", 0.060850526647994], [\"and\", 7.889857588070535e-05], [\"to\", 0.00011834786382105803], [\"be\", 0.0006903625389561718], [\"honest\", 0.015227425144976133], [\"the\", 5.9173931910529015e-05], [\"cover\", 0.0272594579667837], [\"art\", 0.02660854471576788], [\"intrigued\", 0.13933488500532565], [\"me\", 0.0007100871829263482], [\".\", 3.9449287940352674e-05]], 0.19126346748854894], [[[\"i\", 0.005330490405117271], [\"am\", 0.15671641791044777], [\"so\", 0.03411513859275053], [\"glad\", 0.6961620469083155], [\"i\", 0.005330490405117271], [\"did\", 0.10021321961620469], [\".\", 0.0021321961620469083]], 0.1976753082412455], [[[\"i\", 0.0015234613040828763], [\"would\", 0.015234613040828763], [\"give\", 0.05088360755636807], [\"this\", 0.006398537477148081], [\"book\", 0.2918951858622791], [\"six\", 0.5027422303473492], [\"stars\", 0.08226691042047532], [\"if\", 0.013711151736745886], [\"i\", 0.0015234613040828763], [\"could\", 0.033211456429006705], [\".\", 0.0006093845216331506]], 0.18399634464500592]], \"normalizedText\": [\"i absolutely love daughters of the night sky .\", \"this is a book that popped up as a free or low - cost amazon special deal .\", \"i don ' t often succumb to these offers , but the brief description and , to be honest , the cover art intrigued me .\", \"i am so glad i did .\", \"i would give this book six stars if i could .\"], \"prediction\": 4.0}\n" 555 | ], 556 | "name": "stdout" 557 | } 558 | ] 559 | }, 560 | { 561 | "cell_type": 
"code", 562 | "metadata": { 563 | "id": "cuiViYnWc63B", 564 | "colab_type": "code", 565 | "colab": { 566 | "base_uri": "https://localhost:8080/", 567 | "height": 700 568 | }, 569 | "outputId": "e3128a74-5615-449f-9309-76c8bfc1058d" 570 | }, 571 | "source": [ 572 | "display(HTML(\"\"\"
\"\"\"), display_id=True)" 573 | ], 574 | "execution_count": 6, 575 | "outputs": [ 576 | { 577 | "output_type": "display_data", 578 | "data": { 579 | "text/html": [ 580 | "" 581 | ], 582 | "text/plain": [ 583 | "