├── Dockerfile ├── LICENSE ├── README.md ├── lib └── imgs │ ├── HierarchicalAttentionNetworksDiagram.png │ ├── graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png │ ├── training_accuracy.png │ └── training_loss.png └── src ├── create_csv.py ├── dataProcessing.py ├── download.py ├── han.py ├── han_tester.py ├── han_trainer.py ├── requirements.txt ├── run_all.py ├── serialize_data.py └── utils.py /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM nvidia/cuda:8.0-cudnn6-runtime-ubuntu16.04 2 | COPY . /home/ 3 | WORKDIR /home/src/ 4 | RUN apt-get update && apt-get install -y \ 5 | vim \ 6 | git-core \ 7 | wget \ 8 | python3 \ 9 | python3-pip \ 10 | && pip3 install -r requirements.txt 11 | 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Michael 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Document Classification Comparisons featuring Hierarchical Attention Network 2 | 3 | The [Hierarchical Attention Network](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) is a novel deep learning architecture that takes advantage of the hierarchical structure of documents to construct a detailed representation of the document. As words form sentences and sentences form the document, the Hierarchical Attention Network’s representation of the document uses this hierarchy in order to determine what sentences and what words in those sentences are most important in the classification of the document as a whole. 4 | 5 |
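To make this importance weighting concrete, the short NumPy sketch below shows the attention pooling that is applied at both the word and the sentence level. It loosely mirrors the `attention` function in [han.py](src/han.py); the function names, shapes, and toy data here are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(encoder_outputs, w_proj, b_proj, context_vector):
    """encoder_outputs: [batch, steps, hidden] outputs of a (bidirectional) encoder.
    Returns one [batch, hidden] vector per example, weighted by learned importance."""
    # u_it = tanh(W h_it + b): hidden representation of every step
    u = np.tanh(encoder_outputs @ w_proj + b_proj)               # [batch, steps, proj]
    # alpha_it = softmax(u_it . u_w): importance of every step
    alpha = softmax((u * context_vector).sum(axis=-1), axis=-1)  # [batch, steps]
    # weighted sum of the step vectors -> sentence (or document) vector
    return (encoder_outputs * alpha[..., None]).sum(axis=1)      # [batch, hidden]

# toy example: a batch of 2 sentences, 5 words each, 8-dim encoder outputs
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5, 8))
sentence_vectors = attention_pool(h, rng.normal(size=(8, 8)), np.zeros(8), rng.normal(size=8))
print(sentence_vectors.shape)  # (2, 8)
```

At the word level this pooling turns each sentence's word vectors into a sentence vector; at the sentence level it turns the sentence vectors into a single document vector that is fed to the classifier.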
6 | ![Hierarchical Attention Network architecture](lib/imgs/HierarchicalAttentionNetworksDiagram.png)
7 | *Figure 1: Hierarchical Attention Network Architecture, Zichao (1)*
8 |
9 |
10 |
11 | This model uses two levels of recurrent encoders (bidirectional GRUs), one at the word level and one at the sentence level, in order to build the word- and sentence-level representations of the document. An attention mechanism is used to attribute importance at both the word and sentence levels.
12 |
13 | The attention mechanism is applied twice: once over the outputs of the word-level encoder and once over the outputs of the sentence-level encoder. This allows the model to construct a representation of the document that attributes greater importance to key sentences and to key words within those sentences.
14 |
15 |
16 | ## IMDB Dataset
17 | All experiments were performed on the Stanford IMDB dataset, a natural language dataset in which each movie review is labeled with its sentiment. It is one of the datasets used in the original paper [Hierarchical Attention Network](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf). The dataset has 8 rating classes, 1-4 for negative sentiment and 7-10 for positive sentiment, which are mapped down to binary labels: 0 for negative and 1 for positive.
18 |
19 | ## Files in this repo
20 | * IMDB download script: [download.py](src/download.py)
21 | * first step of data preprocessing, creates a csv: [create_csv.py](src/create_csv.py)
22 | * second step of data preprocessing, creates the serialized dataset as binary files: [serialize_data.py](src/serialize_data.py)
23 | * IMDB data preprocessing: [dataProcessing.py](src/dataProcessing.py)
24 | * paths shared throughout files: [utils.py](src/utils.py)
25 | * Hierarchical Attention Network model: [han.py](src/han.py)
26 | * train the Hierarchical Attention Network: [han_trainer.py](src/han_trainer.py)
27 | * test the Hierarchical Attention Network: [han_tester.py](src/han_tester.py)
28 |
29 | ## What you need to run the code in this repo
30 | * [Docker](https://www.docker.com/)
31 | * Nvidia GPU with the CUDA driver installed
32 |
33 | ## To run the experiments contained in this repo
34 |
35 | **To run the model**
36 | * build the container image from the Dockerfile: `docker build -t han:1.0 .`
37 | * start the container: `nvidia-docker run -p 6006:6006 -p 8888:8889 -it "IMAGE_ID" bash`
38 | * to download and process all data, run `python3 run_all.py imdb True`, or run the three commands below
39 | * download the imdb dataset: `python3 download.py imdb`
40 | * create the csv file: `python3 create_csv.py imdb True`
41 | * create the serialized dataset as binary files: `python3 serialize_data.py imdb`
42 | * start training the han model with `nohup python3 han_trainer.py --run_type "train" >> train.out &`
43 | * start validating the han model with `nohup python3 han_tester.py --run_type "val" >> val.out &`
44 | * start testing the han model with `nohup python3 han_tester.py --run_type "test" >> test.out &`
45 |
46 | Note: the attention weights consume a large amount of GPU memory, and running validation while the model is training causes an out-of-memory exception.
47 |
48 | **Set up Tensorboard and Jupyter Notebook**
49 | * create another session in the same container: `nvidia-docker exec -it "CONTAINER_ID" bash`
50 | * start jupyter notebook in the container with `jupyter notebook --no-browser --port=8889 --ip=0.0.0.0 --allow-root` and grab the authentication token
51 |
52 | * create another session in the same container: `nvidia-docker exec -it "CONTAINER_ID" bash`
53 | * then run `tensorboard --logdir ../lib/summaries/train/` to start tensorboard in the container
54 |
55 | * go to `localhost:6001` in the browser on your local machine for tensorboard
56 | * go to `localhost:8890` in the browser on your local machine for jupyter notebook
57 |
58 | If you are working on a remote machine, you must set up SSH tunnels for the tensorboard and jupyter tools:
59 | * on your local machine, run `ssh -N -L localhost:6001:localhost:6006 username@ipaddress` to set up the tunnel for tensorboard
60 | * on your local machine, run `ssh -N -L localhost:8890:localhost:8888 username@ipaddress` to set up the tunnel for jupyter notebook
61 |
62 |
63 | ## Graph of operations for this model
64 |
65 | ![Hierarchical Attention Network model graph operations](lib/imgs/graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png)
66 | *Figure 2: Hierarchical Attention Network model graph operations*
68 | 69 | ## Results 70 | 71 | Shown above is the training accuracy achieved during training of the HAN model after 120 thousand training steps on the IMDB dataset where the labels are converted to binary classes. As seen the maximum training accuracy achieved is approximately 64% accuracy, which is significantly less than that reported by the original paper. 72 | 73 | 74 | Shown above is the training loss achieved during training of the HAN model after 120 thousand training steps on the IMDB dataset where the labels are converted to binary classes. The training loss seems to be steadily decreasing. 75 | 76 | ## References 77 | Zichao, Yang. [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) 78 | 79 | ## TODOs 80 | * publish trained model files 81 | * find a way to validate model during model training without causing OOM either by pausing training and validate then return to training 82 | * visualize trained model weights in jupyter notebook over input text document 83 | -------------------------------------------------------------------------------- /lib/imgs/HierarchicalAttentionNetworksDiagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/HierarchicalAttentionNetworksDiagram.png -------------------------------------------------------------------------------- /lib/imgs/graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png -------------------------------------------------------------------------------- /lib/imgs/training_accuracy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/training_accuracy.png -------------------------------------------------------------------------------- /lib/imgs/training_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/training_loss.png -------------------------------------------------------------------------------- /src/create_csv.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import os 6 | import argparse 7 | from dataProcessing import IMDB 8 | from utils import prjPaths 9 | 10 | def get_args(): 11 | """ 12 | desc: get cli arguments 13 | returns: 14 | args: dictionary of cli arguments 15 | """ 16 | 17 | parser = argparse.ArgumentParser(description="this script is used for creating csv datasets for training this implementation of the Hierarchical Attention Networks") 18 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str) 19 | parser.add_argument("binary", default=True, help="coerce to binary classification", type=bool) 20 | args = parser.parse_args() 21 | return args 22 | # end 23 | 24 | def 
create_csv(paths, args): 25 | """ 26 | desc: This function creates a csv file from a downloaded dataset. 27 | Currently this process works on the imdb dataset but other datasets 28 | can be easily added. 29 | args: 30 | args: dictionary of cli arguments 31 | paths: project paths 32 | """ 33 | 34 | if args.dataset == "imdb": 35 | print("creating {} csv".format(args.dataset)) 36 | imdb = IMDB(action="create") 37 | imdb.createManager(args.binary) 38 | print("{} csv created".format(args.dataset)) 39 | # end 40 | 41 | if __name__ == "__main__": 42 | paths = prjPaths() 43 | args = get_args() 44 | create_csv(paths=paths, args=args) 45 | -------------------------------------------------------------------------------- /src/dataProcessing.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import tensorflow as tf 6 | import os 7 | import csv 8 | import re 9 | import itertools 10 | import more_itertools 11 | import pickle 12 | import pandas as pd 13 | import numpy as np 14 | from tqdm import tqdm 15 | from bs4 import BeautifulSoup 16 | from utils import prjPaths 17 | 18 | class IMDB: 19 | 20 | def __init__(self, action): 21 | """ 22 | desc: this class is used to process the imdb dataset 23 | args: 24 | action: specify whether to create or fetch the data using the IMDB class 25 | """ 26 | self.paths = prjPaths() 27 | self.ROOT_DATA_DIR = self.paths.ROOT_DATA_DIR 28 | self.DATASET = "imdb" 29 | 30 | self.CSVFILENAME = os.path.join(self.ROOT_DATA_DIR, self.DATASET, "{}.csv".format(self.DATASET)) 31 | assert(action in ["create", "fetch"]), "invalid action" 32 | 33 | if action == "create": 34 | 35 | # if creating new csv remove old if one exists 36 | if os.path.exists(self.CSVFILENAME): 37 | print("removing existing csv file from {}".format(self.CSVFILENAME)) 38 | os.remove(self.CSVFILENAME) 39 | 40 | # directory structure 41 | train_dir = os.path.join(self.ROOT_DATA_DIR, self.DATASET, "aclImdb", "train") 42 | test_dir = os.path.join(self.ROOT_DATA_DIR, self.DATASET, "aclImdb", "test") 43 | 44 | trainPos_dir = os.path.join(train_dir, "pos") 45 | trainNeg_dir = os.path.join(train_dir, "neg") 46 | 47 | testPos_dir = os.path.join(test_dir, "pos") 48 | testNeg_dir = os.path.join(test_dir, "neg") 49 | 50 | self.data = {"trainPos": self._getDirContents(trainPos_dir), 51 | "trainNeg": self._getDirContents(trainNeg_dir), 52 | "testPos": self._getDirContents(testPos_dir), 53 | "testNeg": self._getDirContents(testNeg_dir)} 54 | # end 55 | 56 | def _getDirContents(self, path): 57 | """ 58 | desc: get all filenames in a specified directory 59 | args: 60 | path: path of directory to get contents of 61 | returns: 62 | dirFiles: list of filenames in a directory 63 | """ 64 | dirFiles = os.listdir(path) 65 | dirFiles = [os.path.join(path, file) for file in dirFiles] 66 | return dirFiles 67 | # end 68 | 69 | def _getID_label(self, file, binary): 70 | """ 71 | desc: get label for a specific filename 72 | args: 73 | file: current file being operated on 74 | binary: specify if data should be recoded as binary or kept in original form for imdb dataset 75 | returns: 76 | list of unique identifier of file, label, and if it is test or training data 77 | """ 78 | splitFile = file.split("/") 79 | testOtrain = splitFile[-3] 80 | filename = os.path.splitext(splitFile[-1])[0] 81 | id, label = filename.split("_") 82 | if binary: 83 | if int(label) < 5: 84 | label = 0 85 | else: 86 | label = 1 87 | 88 | return [id, label, testOtrain] 89 | # end 90 | 
91 | def _loadTxtFiles(self, dirFiles, binary): 92 | """ 93 | desc: load and format all imdb dataset 94 | args: 95 | dirFiles: current file being operated on 96 | binary: specify if data should be recoded as binary or kept in original form for imdb dataset 97 | returns: 98 | list of dictionaries containing all information about imdb dataset 99 | """ 100 | TxtContents = list() 101 | for file in tqdm(dirFiles, desc="process all files in a directory"): 102 | try: 103 | with open(file, encoding="utf8") as txtFile: 104 | content = txtFile.read() 105 | id, label, testOtrain = self._getID_label(file, binary=binary) 106 | TxtContents.append({"id": id, 107 | "content": content, 108 | "label": label, 109 | "testOtrain": testOtrain}) 110 | except: 111 | print("this file threw and error and is being omited: {}".format(file)) 112 | continue 113 | return TxtContents 114 | # end 115 | 116 | def _writeTxtFiles(self, TxtContents): 117 | """ 118 | desc: write imdb content and meta data to csv 119 | args: 120 | TxtContents: list of dictionaries containing all information about imdb dataset 121 | """ 122 | 123 | with open(self.CSVFILENAME, "a") as csvFile: 124 | fieldNames = ["id", "content", "label", "testOtrain"] 125 | writer = csv.DictWriter(csvFile, fieldnames=fieldNames) 126 | writer.writeheader() 127 | 128 | for seq in TxtContents: 129 | try: 130 | writer.writerow({"id": seq["id"], 131 | "content": seq["content"].encode("ascii", "ignore").decode("ascii"), 132 | "label": seq["label"], 133 | "testOtrain": seq["testOtrain"]}) 134 | except: 135 | print("this sequence threw an exception: {}".format(seq["id"])) 136 | continue 137 | # end 138 | 139 | def createManager(self, binary): 140 | """ 141 | desc: This function is called by create_csv.py script. 142 | It manages the loading, formatting, and creation of a csv from the imdb directory structure. 143 | args: 144 | binary: specify if data should be recoded as binary or kept in original form for imdb dataset 145 | """ 146 | 147 | for key in self.data.keys(): 148 | self.data[key] = self._loadTxtFiles(self.data[key], binary) 149 | self._writeTxtFiles(self.data[key]) 150 | # end 151 | 152 | def _clean_str(self, string): 153 | """ 154 | desc: This function cleans a string 155 | adapted from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py 156 | args: 157 | string: the string to be cleaned 158 | returns: 159 | a cleaned string 160 | """ 161 | 162 | string = BeautifulSoup(string, "lxml").text 163 | string = re.sub(r"[^A-Za-z0-9(),!?\"\`]", " ", string) 164 | string = re.sub(r"\"s", " \"s", string) 165 | string = re.sub(r"\"ve", " \"ve", string) 166 | string = re.sub(r"n\"t", " n\"t", string) 167 | string = re.sub(r"\"re", " \"re", string) 168 | string = re.sub(r"\"d", " \"d", string) 169 | string = re.sub(r"\"ll", " \"ll", string) 170 | string = re.sub(r",", " , ", string) 171 | string = re.sub(r"!", " ! ", string) 172 | string = re.sub(r"\(", " \( ", string) 173 | string = re.sub(r"\)", " \) ", string) 174 | string = re.sub(r"\?", " \? 
", string) 175 | string = re.sub(r"\s{2,}", " ", string) 176 | return string.strip().lower().split(" ") 177 | # end 178 | 179 | def _oneHot(self, ys): 180 | """ 181 | desc: one hot encodes labels in dataset 182 | args: 183 | ys: dataset labels 184 | returns: 185 | list of one hot encoded training, testing, and lookup labels 186 | """ 187 | 188 | y_train, y_test = ys 189 | y_train = list(map(int, y_train)) # confirm all type int 190 | y_test = list(map(int, y_test)) # confirm all type int 191 | lookuplabels = {v: k for k, v in enumerate(sorted(list(set(y_train + y_test))))} 192 | recoded_y_train = [lookuplabels[i] for i in y_train] 193 | recoded_y_test = [lookuplabels[i] for i in y_test] 194 | labels_y_train = tf.constant(recoded_y_train) 195 | labels_y_test = tf.constant(recoded_y_test) 196 | max_label = tf.reduce_max(labels_y_train + labels_y_test) 197 | labels_y_train_OHE = tf.one_hot(labels_y_train, max_label+1) 198 | labels_y_test_OHE = tf.one_hot(labels_y_test, max_label+1) 199 | 200 | with tf.Session() as sess: 201 | # Initialize all variables 202 | sess.run(tf.global_variables_initializer()) 203 | #l = sess.run(labels) 204 | y_train_ohe = sess.run(labels_y_train_OHE) 205 | y_test_ohe = sess.run(labels_y_test_OHE) 206 | sess.close() 207 | return [y_train_ohe, y_test_ohe, lookuplabels] 208 | # end 209 | 210 | def _index(self, xs): 211 | """ 212 | desc: apply index to text data and persist unique vocabulary in dataset to pickle file 213 | args: 214 | xs: text data 215 | returns: 216 | list of test, train data after it was indexed, the lookup table for the vocabulary, 217 | and any persisted variables that may be needed 218 | """ 219 | def _apply_index(txt_data): 220 | indexed = [[[unqVoc_LookUp[char] for char in seq] for seq in doc] for doc in txt_data] 221 | return indexed 222 | # end 223 | 224 | x_train, x_test = xs 225 | 226 | # create look up table for all unique vocab in test and train datasets 227 | unqVoc = set(list(more_itertools.collapse(x_train[:] + x_test[:]))) 228 | unqVoc_LookUp = {k: v+1 for v, k in enumerate(unqVoc)} 229 | vocab_size = len(list(unqVoc_LookUp)) 230 | 231 | x_train = _apply_index(txt_data=x_train) 232 | x_test = _apply_index(txt_data=x_test) 233 | 234 | # determine max sequence lengths 235 | max_seq_len = max([len(seq) for seq in itertools.chain.from_iterable(x_train + x_test)]) # max length of sequence across all documents 236 | max_sent_len = max([len(sent) for sent in (x_train + x_test)]) # max length of sentence across all documents 237 | 238 | persisted_vars = {"max_seq_len":max_seq_len, 239 | "max_sent_len":max_sent_len, 240 | "vocab_size":vocab_size} 241 | 242 | return [x_train, x_test, unqVoc_LookUp, persisted_vars] 243 | # end 244 | 245 | def partitionManager(self, dataset): 246 | """ 247 | desc: apply index to text data, one hot encode labels, and persist unique vocabulary in dataset to pickle file 248 | args: 249 | dataset: dataset to be processed 250 | returns: 251 | return list of indexed training, training data along with one hot encoded labels 252 | """ 253 | assert(self.DATASET==dataset), "this function works on {} and is not meant to process {} dataset".format(self.DATASET, dataset) 254 | 255 | # load csv file 256 | df = pd.read_csv(self.CSVFILENAME) 257 | 258 | # partition data 259 | train = df.loc[df["testOtrain"] == "train"] 260 | test = df.loc[df["testOtrain"] == "test"] 261 | 262 | # create 3D list for han model and clean strings 263 | create3DList = lambda df: [[self._clean_str(seq) for seq in "|||".join(re.split("[.?!]", 
docs)).split("|||")] 264 | for docs in df["content"].values] 265 | x_train = create3DList(df=train) 266 | x_test = create3DList(df=test) 267 | 268 | # index and persist unq vocab in pickle file 269 | x_train, x_test, unqVoc_LookUp, persisted_vars = self._index(xs=[x_train[:], x_test[:]]) 270 | 271 | y_train = train["label"].tolist() 272 | y_test = test["label"].tolist() 273 | 274 | #OHE classes 275 | y_train_ohe, y_test_ohe, lookuplabels = self._oneHot(ys=[y_train, y_test]) 276 | 277 | # update persisted vars 278 | persisted_vars["lookuplabels"] = lookuplabels 279 | persisted_vars["num_classes"] = len(lookuplabels.keys()) 280 | 281 | # save lookup table and variables that need to be persisted 282 | if not os.path.exists(os.path.join(self.paths.LIB_DIR, self.DATASET)): 283 | os.mkdir(os.path.join(self.paths.LIB_DIR, self.DATASET)) 284 | pickle._dump(unqVoc_LookUp, open(os.path.join(self.paths.LIB_DIR, self.DATASET, "unqVoc_Lookup.p"), "wb")) 285 | pickle._dump(persisted_vars, open(os.path.join(self.paths.LIB_DIR, self.DATASET, "persisted_vars.p"), "wb")) 286 | 287 | return[x_train, y_train_ohe, x_test, y_test_ohe] 288 | # end 289 | 290 | def get_data(self, type_): 291 | """ 292 | desc: load and return dataset from binary files 293 | args: 294 | type_: type of dataset (train, val, test) 295 | returns: 296 | loaded dataset 297 | """ 298 | 299 | assert(type_ in ["train", "val", "test"]) 300 | 301 | print("loading {} dataset...".format(type_)) 302 | 303 | x = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_x.npy".format(type_))) 304 | y = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_y.npy".format(type_))) 305 | docsize = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_docsize.npy".format(type_))) 306 | sent_size = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_sent_size.npy".format(type_))) 307 | return [x, y, docsize, sent_size] 308 | # end 309 | 310 | def get_batch_iter(self, data, batch_size, num_epochs, shuffle=True): 311 | """ 312 | desc: batch dataset generator 313 | args: 314 | data: dataset to batch as list 315 | batch_size: the batch size used 316 | num_epochs: number of training epochs 317 | shuffle: shuffle dataset 318 | returns: 319 | adapted from Denny Britz https://github.com/dennybritz/cnn-text-classification-tf.git 320 | """ 321 | 322 | data = np.array(data) 323 | data_size = len(data) 324 | num_batches_per_epoch = int((len(data) - 1) / batch_size) + 1 325 | for epoch in range(num_epochs): 326 | # Shuffle the data at each epoch 327 | if shuffle: 328 | shuffle_indices = np.random.permutation(np.arange(data_size)) 329 | next_batch = data[shuffle_indices] 330 | else: 331 | next_batch = data 332 | for batch_num in range(num_batches_per_epoch): 333 | start_index = batch_num * batch_size 334 | end_index = min((batch_num + 1) * batch_size, data_size) 335 | #yield next_batch[start_index:end_index] 336 | yield epoch, next_batch[start_index:end_index] 337 | # end 338 | 339 | def hanformater(self, inputs): 340 | """ 341 | desc: format data specific for hierarchical attention networks 342 | args: 343 | inputs: data 344 | returns: 345 | dataset with corresponding dimensions for document and sentence level 346 | """ 347 | 348 | batch_size = len(inputs) 349 | 350 | document_sizes = np.array([len(doc) for doc in inputs], dtype=np.int32) 351 | document_size = document_sizes.max() 352 | 353 | sentence_sizes_ = [[len(sent) for sent in doc] for doc in inputs] 354 | sentence_size = max(map(max, sentence_sizes_)) 355 | 356 | b = 
np.zeros(shape=[batch_size, document_size, sentence_size], dtype=np.int32) # == PAD 357 | 358 | sentence_sizes = np.zeros(shape=[batch_size, document_size], dtype=np.int32) 359 | for i, document in enumerate(tqdm(inputs, desc="formating data for hierarchical attention networks")): 360 | for j, sentence in enumerate(document): 361 | sentence_sizes[i, j] = sentence_sizes_[i][j] 362 | for k, word in enumerate(sentence): 363 | b[i, j, k] = word 364 | return b, document_sizes, sentence_sizes 365 | # end 366 | # end 367 | -------------------------------------------------------------------------------- /src/download.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import os 6 | import shutil 7 | import platform 8 | import urllib.request 9 | import tarfile 10 | import traceback 11 | import argparse 12 | 13 | from utils import prjPaths 14 | 15 | def get_args(): 16 | """ 17 | desc: get cli arguments 18 | returns: 19 | args: dictionary of cli arguments 20 | """ 21 | 22 | parser = argparse.ArgumentParser(description="this script is used for downloading datasets for training this implementation of the Hierarchical Attention Networks") 23 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str) 24 | args = parser.parse_args() 25 | return args 26 | # end 27 | 28 | def download(paths, args): 29 | """ 30 | desc: download a dataset from url 31 | args: 32 | args: dictionary of cli arguments 33 | paths: project paths 34 | """ 35 | 36 | if args.dataset == "imdb": 37 | resource_loc = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" 38 | osType = platform.system() 39 | if osType == "Windows": 40 | print("manually download data set from {}"\ 41 | " and set getDataset=False when prjPaths is called in *_master.py script".format(resource_loc)) 42 | exit(0) 43 | elif osType is not "Linux": 44 | osType = "OSX" 45 | 46 | filename=os.path.join(paths.ROOT_DATA_DIR, args.dataset, "aclImdb_v1.tar.gz") 47 | ACLIMDB_DIR = os.path.join(paths.ROOT_DATA_DIR, args.dataset) 48 | 49 | # if tar file already exists remove it 50 | if os.path.exists(filename): 51 | os.remove(filename) 52 | # if fclImdb dir already exists remove it 53 | if os.path.exists(os.path.join(ACLIMDB_DIR, "aclImdb")): 54 | shutil.rmtree(os.path.join(ACLIMDB_DIR, "aclImdb")) 55 | else: 56 | os.mkdir(ACLIMDB_DIR) 57 | 58 | print("downloading: {}".format(args.dataset)) 59 | try: 60 | urllib.request.urlretrieve(resource_loc, filename) 61 | except Exception as e: 62 | print("something went wrong downloading: {} at {}".format(args.dataset, resource_loc)) 63 | traceback.print_exc() 64 | 65 | print("unpacking: {}".format(args.dataset)) 66 | if (filename.endswith("tar.gz")): 67 | tar = tarfile.open(filename, "r:gz") 68 | tar.extractall(ACLIMDB_DIR) 69 | tar.close() 70 | elif (filename.endswith("tar")): 71 | tar = tarfile.open(filename, "r:") 72 | tar.extractall(ACLIMDB_DIR) 73 | tar.close() 74 | # end 75 | 76 | if __name__ == "__main__": 77 | paths = prjPaths() 78 | args = get_args() 79 | download(paths=paths, args=args) 80 | print("download complete!") 81 | -------------------------------------------------------------------------------- /src/han.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import numpy as np 6 | 7 | np.set_printoptions(threshold=np.nan) 8 | import tensorflow as tf 9 | from tensorflow.contrib import rnn 10 | import 
tensorflow.contrib.layers as layers 11 | 12 | class HAN: 13 | def __init__(self, max_seq_len, max_sent_len, num_classes, 14 | vocab_size, embedding_size, max_grad_norm, dropout_keep_proba, 15 | learning_rate): 16 | ## Parameters 17 | self.learning_rate = learning_rate 18 | self.vocab_size = vocab_size 19 | self.num_classes = num_classes 20 | self.max_seq_len = max_seq_len 21 | self.embedding_size = embedding_size 22 | self.word_encoder_num_hidden = max_seq_len 23 | self.word_output_size = max_seq_len 24 | self.sentence_encoder_num_hidden = max_sent_len 25 | self.sentence_output_size = max_sent_len 26 | self.max_grad_norm = max_grad_norm 27 | self.dropout_keep_proba = dropout_keep_proba 28 | 29 | # tf graph input 30 | self.input_x = tf.placeholder(shape=[None, None, None], 31 | dtype=tf.int32, 32 | name="input_x") 33 | self.input_y = tf.placeholder(shape=[None, self.num_classes], 34 | dtype=tf.int32, 35 | name="input_y") 36 | self.word_lengths = tf.placeholder(shape=[None, None], 37 | dtype=tf.int32, 38 | name="word_lengths") 39 | self.sentence_lengths = tf.placeholder(shape=[None,], 40 | dtype=tf.int32, 41 | name="sentence_lengths") 42 | self.is_training = tf.placeholder(dtype=tf.bool, 43 | name="is_training") 44 | 45 | # input_x dims 46 | (self.document_size, self.sentence_size, self.word_size) = tf.unstack(tf.shape(self.input_x)) 47 | 48 | with tf.device("/gpu:0"), tf.name_scope("embedding_layer"): 49 | w = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0), 50 | dtype=tf.float32, 51 | name="w") # TODO check if this needs to be marked as untrainable 52 | self.input_x_embedded = tf.nn.embedding_lookup(w, self.input_x) 53 | 54 | # reshape input_x after embedding 55 | self.input_x_embedded = tf.reshape(self.input_x_embedded, 56 | [self.document_size * self.sentence_size, self.word_size, self.embedding_size]) 57 | self.input_x_embedded_lengths = tf.reshape(self.word_lengths, [self.document_size * self.sentence_size]) 58 | 59 | with tf.variable_scope("word_level"): 60 | self.word_encoder_outputs = self.bidirectional_RNN(num_hidden=self.word_encoder_num_hidden, 61 | inputs=self.input_x_embedded) 62 | word_level_output = self.attention(inputs=self.word_encoder_outputs, 63 | output_size=self.word_output_size) 64 | 65 | with tf.variable_scope("dropout"): 66 | print('self.is_training: {}'.format(self.is_training)) 67 | word_level_output = layers.dropout(word_level_output, 68 | keep_prob=self.dropout_keep_proba, 69 | is_training=self.is_training) 70 | 71 | # reshape word_level output 72 | self.sentence_encoder_inputs = tf.reshape(word_level_output, 73 | [self.document_size, self.sentence_size, self.word_output_size]) 74 | 75 | with tf.variable_scope("sentence_level"): 76 | self.sentence_encoder_outputs = self.bidirectional_RNN(num_hidden=self.sentence_encoder_num_hidden, 77 | inputs=self.sentence_encoder_inputs) 78 | sentence_level_output = self.attention(inputs=self.sentence_encoder_outputs, 79 | output_size=self.sentence_output_size) 80 | with tf.variable_scope("dropout"): 81 | sentence_level_output = layers.dropout(sentence_level_output, 82 | keep_prob=self.dropout_keep_proba, 83 | is_training=self.is_training) 84 | 85 | # Final model prediction 86 | with tf.variable_scope("classifier_output"): 87 | self.logits = layers.fully_connected(sentence_level_output, 88 | self.num_classes, 89 | activation_fn=None) 90 | #trainable=self.is_training) 91 | self.predictions = tf.argmax(self.logits, axis=1, name="predictions") 92 | 93 | # Calculate mean cross-entropy loss 94 | 
with tf.variable_scope("loss"): 95 | losses = tf.nn.softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits) 96 | self.loss = tf.reduce_mean(losses) 97 | tf.summary.scalar("Loss", self.loss) 98 | 99 | # Accuracy 100 | with tf.variable_scope("accuracy"): 101 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, axis=1)) 102 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 103 | tf.summary.scalar("Accuracy", self.accuracy) 104 | 105 | def bidirectional_RNN(self, num_hidden, inputs): 106 | """ 107 | desc: create bidirectional rnn layer 108 | args: 109 | num_hidden: number of hidden units 110 | inputs: input word or sentence 111 | returns: 112 | concatenated encoder and decoder outputs 113 | """ 114 | 115 | with tf.name_scope("bidirectional_RNN"): 116 | encoder_fw_cell = rnn.GRUCell(num_hidden) 117 | encoder_bw_cell = rnn.GRUCell(num_hidden) 118 | ((encoder_fw_outputs, encoder_bw_outputs), (_, _)) = tf.nn.bidirectional_dynamic_rnn(cell_fw=encoder_fw_cell, 119 | cell_bw=encoder_bw_cell, 120 | inputs=inputs, 121 | dtype=tf.float32, 122 | time_major=True) 123 | encoder_outputs = tf.concat((encoder_fw_outputs, encoder_bw_outputs), 2) 124 | return encoder_outputs 125 | # end 126 | 127 | def attention(self, inputs, output_size): 128 | """ 129 | desc: create attention mechanism 130 | args: 131 | inputs: input which is sentence or document level output from bidirectional rnn layer 132 | output_size: specify the dimensions of the output 133 | returns: 134 | output from attention distribution 135 | """ 136 | 137 | with tf.variable_scope("attention"): 138 | attention_context_vector_uw = tf.get_variable(name="attention_context_vector", 139 | shape=[output_size], 140 | #trainable=self.is_training, 141 | initializer=layers.xavier_initializer(), 142 | dtype=tf.float32) 143 | input_projection_u = layers.fully_connected(inputs, 144 | output_size, 145 | #trainable=self.is_training, 146 | activation_fn=tf.tanh) 147 | vector_attn = tf.reduce_sum(tf.multiply(input_projection_u, attention_context_vector_uw), axis=2, keep_dims=True) 148 | attention_weights = tf.nn.softmax(vector_attn, dim=1) 149 | weighted_projection = tf.multiply(input_projection_u, attention_weights) 150 | outputs = tf.reduce_sum(weighted_projection, axis=1) 151 | return outputs 152 | # end 153 | # end 154 | -------------------------------------------------------------------------------- /src/han_tester.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import tensorflow as tf 6 | import numpy as np 7 | from tqdm import tqdm 8 | import time 9 | import pickle 10 | from scipy import stats 11 | from collections import Counter 12 | import os 13 | from han import HAN 14 | from utils import prjPaths, get_logger 15 | from dataProcessing import IMDB 16 | 17 | def get_flags(): 18 | """ 19 | desc: get cli arguments 20 | returns: 21 | args: dictionary of cli arguments 22 | """ 23 | 24 | tf.flags.DEFINE_string("dataset", "imdb", 25 | "enter the type of training dataset") 26 | tf.flags.DEFINE_string("run_type", "val", 27 | "enter val or test to specify run_type (default: val)") 28 | tf.flags.DEFINE_integer("log_summaries_every", 30, 29 | "Save model summaries after this many steps (default: 30)") 30 | tf.flags.DEFINE_float("per_process_gpu_memory_fraction", 0.90, 31 | "gpu memory to be used (default: 0.90)") 32 | tf.flags.DEFINE_boolean("wait_for_checkpoint_files", False, 33 | "wait for model 
checkpoint file to be created") 34 | 35 | FLAGS = tf.flags.FLAGS 36 | FLAGS._parse_flags() 37 | 38 | return FLAGS 39 | # end 40 | 41 | def get_most_recently_created_file(files): 42 | return max(files, key=os.path.getctime) # most recently created file in list of files 43 | # end 44 | 45 | if __name__ == '__main__': 46 | 47 | MINUTE = 60 48 | paths = prjPaths() 49 | FLAGS = get_flags() 50 | 51 | print("current version of tf:{}".format(tf.__version__)) 52 | 53 | assert(FLAGS.run_type == "val" or FLAGS.run_type == "test") 54 | 55 | print("loading persisted variables...") 56 | with open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "persisted_vars.p"), "rb") as handle: 57 | persisted_vars = pickle.load(handle) 58 | 59 | # create new graph set as default 60 | with tf.Graph().as_default(): 61 | gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.per_process_gpu_memory_fraction) 62 | session_conf = tf.ConfigProto(allow_soft_placement=True, 63 | log_device_placement=False, 64 | gpu_options=gpu_options) 65 | session_conf.gpu_options.allocator_type = "BFC" 66 | 67 | # create new session set it as default 68 | with tf.Session(config=session_conf) as sess: 69 | 70 | # create han model instance 71 | han = HAN(max_seq_len=persisted_vars["max_seq_len"], 72 | max_sent_len=persisted_vars["max_sent_len"], 73 | num_classes=persisted_vars["num_classes"], 74 | vocab_size=persisted_vars["vocab_size"], 75 | embedding_size=persisted_vars["embedding_dim"], 76 | max_grad_norm=persisted_vars["max_grad_norm"], 77 | dropout_keep_proba=persisted_vars["dropout_keep_proba"], 78 | learning_rate=persisted_vars["learning_rate"]) 79 | 80 | global_step = tf.Variable(0, name="global_step", trainable=False) 81 | tvars = tf.trainable_variables() 82 | grads, global_norm = tf.clip_by_global_norm(tf.gradients(han.loss, tvars), 83 | han.max_grad_norm) 84 | optimizer = tf.train.AdamOptimizer(han.learning_rate) 85 | test_op = optimizer.apply_gradients(zip(grads, tvars), 86 | name="{}_op".format(FLAGS.run_type), 87 | global_step=global_step) 88 | 89 | # write summaries 90 | merge_summary_op = tf.summary.merge_all() 91 | test_summary_writer = tf.summary.FileWriter(os.path.join(paths.SUMMARY_DIR, FLAGS.run_type), sess.graph) 92 | 93 | # give check for checkpoint files directory if none then sleep until a checkpoint is created 94 | #if os.listdir(paths.CHECKPOINT_DIR) == []: 95 | #time.sleep(2*MINUTE) 96 | 97 | meta_file = get_most_recently_created_file([os.path.join(paths.CHECKPOINT_DIR, file) for file in os.listdir(paths.CHECKPOINT_DIR) if file.endswith('.meta')]) 98 | saver = tf.train.import_meta_graph(meta_file) 99 | 100 | # Initialize all variables 101 | sess.run(tf.global_variables_initializer()) 102 | 103 | def test_step(sample_num, x_batch, y_batch, docsize, sent_size, is_training): 104 | 105 | feed_dict = {han.input_x: x_batch, 106 | han.input_y: y_batch, 107 | han.sentence_lengths: docsize, 108 | han.word_lengths: sent_size, 109 | han.is_training: is_training} 110 | 111 | loss, accuracy = sess.run([han.loss, han.accuracy], feed_dict=feed_dict) 112 | return loss, accuracy 113 | # end 114 | 115 | # generate batches on imdb dataset else quit 116 | if FLAGS.dataset == "imdb": 117 | dataset_controller = IMDB(action="fetch") 118 | else: 119 | exit("set dataset flag to appropiate dataset") 120 | 121 | x, y, docsize, sent_size = dataset_controller.get_data(type_=FLAGS.run_type) # fetch dataset 122 | all_evaluated_chkpts = [] # list of all checkpoint files previously evaluated 123 | 124 | # testing loop 125 | while True: 126 | 
127 | if FLAGS.wait_for_checkpoint_files: 128 | time.sleep(2*MINUTE) # wait to allow for creation of new checkpoint file 129 | else: 130 | time.sleep(0*MINUTE) # don't wait for model checkpoint files 131 | 132 | # if checkpoint file already evaluated then continue and wait for a new checkpoint file 133 | if (tf.train.latest_checkpoint(paths.CHECKPOINT_DIR) in all_evaluated_chkpts): 134 | continue 135 | 136 | # restore most recent checkpoint 137 | saver.restore(sess, tf.train.latest_checkpoint(paths.CHECKPOINT_DIR)) # restore most recent checkpoint 138 | all_evaluated_chkpts.append(tf.train.latest_checkpoint(paths.CHECKPOINT_DIR)) # add current checkpoint to list of evaluated checkpoints 139 | 140 | losses = [] # aggregate testing losses on a given checkpoint 141 | accuracies = [] # aggregate testing accuracies on a given checkpoint 142 | 143 | tic = time.time() # start time for step 144 | 145 | # loop to test every sample on a given checkpoint 146 | for i, batch in enumerate(tqdm(list(zip(x, y, docsize, sent_size)))): 147 | 148 | x_batch, y_batch, docsize_batch, sent_size_batch = batch 149 | x_batch = np.expand_dims(x_batch, axis=0) 150 | y_batch = np.expand_dims(y_batch, axis=0) 151 | sent_size_batch = np.expand_dims(sent_size_batch, axis=0) 152 | 153 | # run step 154 | loss, accuracy = test_step(sample_num=i, 155 | x_batch=x_batch, 156 | y_batch=y_batch, 157 | docsize=docsize, 158 | sent_size=sent_size, 159 | is_training=False) 160 | losses.append(loss) 161 | accuracies.append(accuracy) 162 | 163 | time_elapsed = time.time() - tic # end time for step 164 | 165 | losses_accuracies_vars = {"losses": losses, "accuracies": accuracies} 166 | 167 | print("Time taken to complete {} evaluation of {} checkpoint: {}".format(FLAGS.run_type, all_evaluated_chkpts[-1], time_elapsed)) 168 | for k in losses_accuracies_vars.keys(): 169 | print("stats for {}: {}".format(k, stats.describe(losses_accuracies_vars[k]))) 170 | print(Counter(losses_accuracies_vars[k])) 171 | 172 | filename, ext = os.path.splitext(all_evaluated_chkpts[-1]) 173 | pickle._dump(losses_accuracies_vars, open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "losses_accuracies_vars_{}.p".format(filename.split("/")[-1])), "wb")) 174 | 175 | sess.close() 176 | -------------------------------------------------------------------------------- /src/han_trainer.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import tensorflow as tf 6 | import numpy as np 7 | import time 8 | import pickle 9 | import os 10 | from han import HAN 11 | from utils import prjPaths, get_logger 12 | from dataProcessing import IMDB 13 | 14 | def get_flags(): 15 | """ 16 | desc: get cli arguments 17 | returns: 18 | args: dictionary of cli arguments 19 | """ 20 | 21 | tf.flags.DEFINE_string("dataset", "imdb", 22 | "enter the type of training dataset") 23 | tf.flags.DEFINE_string("run_type", "train", 24 | "enter train or test to specify run_type (default: train)") 25 | tf.flags.DEFINE_integer("embedding_dim", 100, 26 | "Dimensionality of character embedding (default: 100)") 27 | tf.flags.DEFINE_integer("batch_size", 2, 28 | "Batch Size (default: 2)") 29 | tf.flags.DEFINE_integer("num_epochs", 25, 30 | "Number of training epochs (default: 25)") 31 | tf.flags.DEFINE_integer("evaluate_every", 100, 32 | "Evaluate model on dev set after this many steps") 33 | tf.flags.DEFINE_integer("log_summaries_every", 30, 34 | "Save model summaries after this many steps (default: 30)") 35 | 
tf.flags.DEFINE_integer("checkpoint_every", 100, 36 | "Save model after this many steps (default: 100)") 37 | tf.flags.DEFINE_integer("num_checkpoints", 5, 38 | "Number of checkpoints to store (default: 5)") 39 | tf.flags.DEFINE_float("max_grad_norm", 5.0, 40 | "maximum permissible norm of the gradient (default: 5.0)") 41 | tf.flags.DEFINE_float("dropout_keep_proba", 0.5, 42 | "probability of neurons turned off (default: 0.5)") 43 | tf.flags.DEFINE_float("learning_rate", 0.001, 44 | "model learning rate (default: 0.001)") 45 | tf.flags.DEFINE_float("per_process_gpu_memory_fraction", 0.90, 46 | "gpu memory to be used (default: 0.90)") 47 | 48 | FLAGS = tf.flags.FLAGS 49 | FLAGS._parse_flags() 50 | 51 | return FLAGS 52 | # end 53 | 54 | if __name__ == '__main__': 55 | 56 | paths = prjPaths() 57 | FLAGS = get_flags() 58 | 59 | print("current version of tf:{}".format(tf.__version__)) 60 | 61 | assert(FLAGS.run_type == "train") 62 | 63 | print("loading persisted variables...") 64 | 65 | with open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "persisted_vars.p"), "rb") as handle: 66 | persisted_vars = pickle.load(handle) 67 | 68 | persisted_vars["embedding_dim"] = FLAGS.embedding_dim 69 | persisted_vars["max_grad_norm"] = FLAGS.max_grad_norm 70 | persisted_vars["dropout_keep_proba"] = FLAGS.dropout_keep_proba 71 | persisted_vars["learning_rate"] = FLAGS.learning_rate 72 | pickle._dump(persisted_vars, open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "persisted_vars.p"), "wb")) 73 | 74 | # create new graph set as default 75 | with tf.Graph().as_default(): 76 | gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.per_process_gpu_memory_fraction) 77 | session_conf = tf.ConfigProto(allow_soft_placement=True, 78 | log_device_placement=False, 79 | gpu_options=gpu_options) 80 | session_conf.gpu_options.allocator_type = "BFC" 81 | 82 | # create new session set it as default 83 | with tf.Session(config=session_conf) as sess: 84 | 85 | # create han model instance 86 | han = HAN(max_seq_len=persisted_vars["max_seq_len"], 87 | max_sent_len=persisted_vars["max_sent_len"], 88 | num_classes=persisted_vars["num_classes"], 89 | vocab_size=persisted_vars["vocab_size"], 90 | embedding_size=persisted_vars["embedding_dim"], 91 | max_grad_norm=persisted_vars["max_grad_norm"], 92 | dropout_keep_proba=persisted_vars["dropout_keep_proba"], 93 | learning_rate=persisted_vars["learning_rate"]) 94 | 95 | global_step = tf.Variable(0, name="global_step", trainable=False) 96 | tvars = tf.trainable_variables() 97 | grads, global_norm = tf.clip_by_global_norm(tf.gradients(han.loss, tvars), 98 | han.max_grad_norm) 99 | optimizer = tf.train.AdamOptimizer(han.learning_rate) 100 | train_op = optimizer.apply_gradients(zip(grads, tvars), 101 | name="train_op", 102 | global_step=global_step) 103 | 104 | # write summaries 105 | merge_summary_op = tf.summary.merge_all() 106 | train_summary_writer = tf.summary.FileWriter(os.path.join(paths.SUMMARY_DIR, FLAGS.run_type), sess.graph) 107 | 108 | # checkpoint model 109 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints) 110 | 111 | # Initialize all variables 112 | sess.run(tf.global_variables_initializer()) 113 | 114 | def train_step(epoch, x_batch, y_batch, docsize, sent_size, is_training): 115 | tic = time.time() # start time for step 116 | 117 | feed_dict = {han.input_x: x_batch, 118 | han.input_y: y_batch, 119 | han.sentence_lengths: docsize, 120 | han.word_lengths: sent_size, 121 | han.is_training: is_training} 122 | 123 | _, step, loss, accuracy, 
summaries = sess.run([train_op, global_step, han.loss, han.accuracy, merge_summary_op], feed_dict=feed_dict) 124 | 125 | time_elapsed = time.time() - tic # end time for step 126 | 127 | if is_training: 128 | print("Training || CurrentEpoch: {} || GlobalStep: {} || ({} sec/step) || Loss {:g} || Accuracy {:g}".format(epoch+1, step, time_elapsed, loss, accuracy)) 129 | 130 | if step % FLAGS.log_summaries_every == 0: 131 | train_summary_writer.add_summary(summaries, step) 132 | print("Saved model summaries to {}\n".format(os.path.join(paths.SUMMARY_DIR, FLAGS.run_type))) 133 | 134 | if step % FLAGS.checkpoint_every == 0: 135 | chkpt_path = saver.save(sess, 136 | os.path.join(paths.CHECKPOINT_DIR, "han"), 137 | global_step=step) 138 | print("Saved model checkpoint to {}\n".format(chkpt_path)) 139 | # end 140 | 141 | # Generate batches 142 | imdb = IMDB(action="fetch") 143 | x_train, y_train, docsize_train, sent_size_train = imdb.get_data(type_=FLAGS.run_type) 144 | 145 | # Training loop. For each batch... 146 | for epoch, batch in imdb.get_batch_iter(data=list(zip(x_train, y_train, docsize_train, sent_size_train)), 147 | batch_size=FLAGS.batch_size, 148 | num_epochs=FLAGS.num_epochs): 149 | 150 | x_batch, y_batch, docsize, sent_size = zip(*batch) 151 | 152 | train_step(epoch=epoch, 153 | x_batch=x_batch, 154 | y_batch=y_batch, 155 | docsize=docsize, 156 | sent_size=sent_size, 157 | is_training=True) 158 | 159 | sess.close() -------------------------------------------------------------------------------- /src/requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow-gpu==1.3 2 | keras==2.2.0 3 | pandas==0.23.3 4 | psutil==5.4.6 5 | tqdm==4.23.4 6 | more_itertools==4.2.0 7 | bs4==0.0.1 8 | lxml==4.2.3 9 | jupyter==1.0.0 -------------------------------------------------------------------------------- /src/run_all.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import os 6 | import argparse 7 | 8 | def get_args(): 9 | """ 10 | desc: get cli arguments 11 | returns: 12 | args: dictionary of cli arguments 13 | """ 14 | 15 | parser = argparse.ArgumentParser(description="this script is used to download and process all data") 16 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str) 17 | parser.add_argument("binary", default=True, help="coerce to binary classification", type=bool) 18 | args = parser.parse_args() 19 | return args 20 | # end 21 | 22 | if __name__ == "__main__": 23 | 24 | args = get_args() 25 | os.system("python3 download.py {}".format(args.dataset)) 26 | os.system("python3 create_csv.py {} {}".format(args.dataset, args.binary)) 27 | os.system("python3 serialize_data.py {}".format(args.dataset)) 28 | -------------------------------------------------------------------------------- /src/serialize_data.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import tensorflow as tf 6 | import numpy as np 7 | import argparse 8 | import time 9 | import os 10 | import sys 11 | import pickle 12 | from tqdm import tqdm 13 | from dataProcessing import IMDB 14 | from utils import prjPaths 15 | 16 | def get_args(): 17 | """ 18 | desc: get cli arguments 19 | returns: 20 | args: dictionary of cli arguments 21 | """ 22 | 23 | parser = argparse.ArgumentParser(description="this script creates tf record files", 24 | 
formatter_class=argparse.ArgumentDefaultsHelpFormatter) 25 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str) 26 | parser.add_argument("--train_data_percentage", default=0.70, help="percent of dataset to use for training", type=float) 27 | parser.add_argument("--validation_data_percentage", default=0.20, help="percent of dataset to use for validation", type=float) 28 | parser.add_argument("--test_data_percentage", default=0.10, help="percent of dataset to use for testing", type=float) 29 | args = parser.parse_args() 30 | return args 31 | # end 32 | 33 | def _write_binaryfile(nparray, filename): 34 | """ 35 | desc: write dataset partition to binary file 36 | args: 37 | nparray: dataset partition as numpy array to write to binary file 38 | filename: name of file to write dataset partition to 39 | """ 40 | 41 | np.save(filename, nparray) 42 | # end 43 | 44 | def serialize_data(paths, args): 45 | """ 46 | desc: write dataset partition to binary file 47 | args: 48 | nparray: dataset partition as numpy array to write to binary file 49 | filename: name of file to write dataset partition to 50 | """ 51 | 52 | if args.dataset == "imdb": 53 | 54 | # fetch imdb dataset 55 | imdb = IMDB(action="fetch") 56 | tic = time.time() # start time of data fetch 57 | x_train, y_train, x_test, y_test = imdb.partitionManager(args.dataset) 58 | 59 | toc = time.time() # end time of data fetch 60 | print("time taken to fetch {} dataset: {}(sec)".format(args.dataset, toc - tic)) 61 | 62 | # kill if shapes don't make sense 63 | assert(len(x_train) == len(y_train)), "x_train length does not match y_train length" 64 | assert(len(x_test) == len(y_test)), "x_test length does not match y_test length" 65 | 66 | # combine datasets 67 | x_all = x_train + x_test 68 | y_all = np.concatenate((y_train, y_test), axis=0) 69 | 70 | # create slices 71 | train_slice_lim = int(round(len(x_all)*args.train_data_percentage)) 72 | validation_slice_lim = int(round((train_slice_lim) + len(x_all)*args.validation_data_percentage)) 73 | 74 | # partition dataset into train, validation, and test sets 75 | x_all, docsize, sent_size = imdb.hanformater(inputs=x_all) 76 | 77 | x_train = x_all[:train_slice_lim] 78 | y_train = y_all[:train_slice_lim] 79 | docsize_train = docsize[:train_slice_lim] 80 | sent_size_train = sent_size[:train_slice_lim] 81 | 82 | x_val = x_all[train_slice_lim+1:validation_slice_lim] 83 | y_val = y_all[train_slice_lim+1:validation_slice_lim] 84 | docsize_val = docsize[train_slice_lim+1:validation_slice_lim] 85 | sent_size_val = sent_size[train_slice_lim+1:validation_slice_lim] 86 | 87 | 88 | x_test = x_all[validation_slice_lim+1:] 89 | y_test = y_all[validation_slice_lim+1:] 90 | docsize_test = docsize[validation_slice_lim+1:] 91 | sent_size_test = sent_size[validation_slice_lim+1:] 92 | 93 | train_bin_filename_x = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_x.npy") 94 | train_bin_filename_y = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_y.npy") 95 | train_bin_filename_docsize = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_docsize.npy") 96 | train_bin_filename_sent_size = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_sent_size.npy") 97 | 98 | val_bin_filename_x = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_x.npy") 99 | val_bin_filename_y = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_y.npy") 100 | val_bin_filename_docsize = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_docsize.npy") 101 | val_bin_filename_sent_size = 
os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_sent_size.npy") 102 | 103 | test_bin_filename_x = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_x.npy") 104 | test_bin_filename_y = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_y.npy") 105 | test_bin_filename_docsize = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_docsize.npy") 106 | test_bin_filename_sent_size = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_sent_size.npy") 107 | 108 | _write_binaryfile(nparray=x_train, filename=train_bin_filename_x) 109 | _write_binaryfile(nparray=y_train, filename=train_bin_filename_y) 110 | _write_binaryfile(nparray=docsize_train, filename=train_bin_filename_docsize) 111 | _write_binaryfile(nparray=sent_size_train, filename=train_bin_filename_sent_size) 112 | 113 | _write_binaryfile(nparray=x_val, filename=val_bin_filename_x) 114 | _write_binaryfile(nparray=y_val, filename=val_bin_filename_y) 115 | _write_binaryfile(nparray=docsize_val, filename=val_bin_filename_docsize) 116 | _write_binaryfile(nparray=sent_size_val, filename=val_bin_filename_sent_size) 117 | 118 | _write_binaryfile(nparray=x_test, filename=test_bin_filename_x) 119 | _write_binaryfile(nparray=y_test, filename=test_bin_filename_y) 120 | _write_binaryfile(nparray=docsize_test, filename=test_bin_filename_docsize) 121 | _write_binaryfile(nparray=sent_size_test, filename=test_bin_filename_sent_size) 122 | # end 123 | 124 | if __name__ == "__main__": 125 | paths = prjPaths() 126 | args = get_args() 127 | serialize_data(paths, args=args) 128 | -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | @author: Michael Guarino 3 | """ 4 | 5 | import os 6 | import datetime 7 | import logging 8 | 9 | class prjPaths: 10 | def __init__(self): 11 | """ 12 | desc: create object containing project paths 13 | """ 14 | 15 | self.SRC_DIR = os.path.abspath(os.path.curdir) 16 | self.ROOT_MOD_DIR = "/".join(self.SRC_DIR.split("/")[:-1]) 17 | self.ROOT_DATA_DIR = os.path.join(self.ROOT_MOD_DIR, "data") 18 | self.LIB_DIR = os.path.join(self.ROOT_MOD_DIR, "lib") 19 | self.CHECKPOINT_DIR = os.path.join(self.LIB_DIR, "chkpts") 20 | self.SUMMARY_DIR = os.path.join(self.LIB_DIR, "summaries") 21 | self.LOGS_DIR = os.path.join(self.LIB_DIR, "logs") 22 | 23 | pth_exists_else_mk = lambda path: os.mkdir(path) if not os.path.exists(path) else None 24 | 25 | pth_exists_else_mk(self.ROOT_DATA_DIR) 26 | pth_exists_else_mk(self.LIB_DIR) 27 | pth_exists_else_mk(self.CHECKPOINT_DIR) 28 | pth_exists_else_mk(self.SUMMARY_DIR) 29 | pth_exists_else_mk(self.LOGS_DIR) 30 | # end 31 | # end 32 | 33 | def get_logger(paths): 34 | # TODO logger not logging to file 35 | currentTime = str(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")) 36 | logFileName = os.path.join(paths.LOGS_DIR, "HAN_TxtClassification_{}.log".format(currentTime)) 37 | 38 | logger = logging.getLogger(__name__) 39 | formatter = logging.Formatter("%(asctime)s:%(name)s:%(message)s") 40 | 41 | fileHandler = logging.FileHandler(logFileName) 42 | fileHandler.setLevel(logging.INFO) 43 | fileHandler.setFormatter(formatter) 44 | 45 | logger.addHandler(fileHandler) 46 | 47 | return logger 48 | # end 49 | 50 | --------------------------------------------------------------------------------
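One note on the `# TODO logger not logging to file` comment in `src/utils.py`: the file handler is set to `INFO`, but the logger object itself is left at its default `WARNING` level, so `INFO` records are filtered out before they ever reach the handler. A minimal sketch of the likely fix (an assumption, not code from this repo):

```python
import datetime
import logging
import os

def get_logger(paths):
    currentTime = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    logFileName = os.path.join(paths.LOGS_DIR, "HAN_TxtClassification_{}.log".format(currentTime))

    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)  # without this the logger stays at WARNING and drops INFO records

    fileHandler = logging.FileHandler(logFileName)
    fileHandler.setLevel(logging.INFO)
    fileHandler.setFormatter(logging.Formatter("%(asctime)s:%(name)s:%(message)s"))
    logger.addHandler(fileHandler)
    return logger
```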