├── Dockerfile
├── LICENSE
├── README.md
├── lib
│   └── imgs
│       ├── HierarchicalAttentionNetworksDiagram.png
│       ├── graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png
│       ├── training_accuracy.png
│       └── training_loss.png
└── src
├── create_csv.py
├── dataProcessing.py
├── download.py
├── han.py
├── han_tester.py
├── han_trainer.py
├── requirements.txt
├── run_all.py
├── serialize_data.py
└── utils.py
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM nvidia/cuda:8.0-cudnn6-runtime-ubuntu16.04
2 | COPY . /home/
3 | WORKDIR /home/src/
4 | RUN apt-get update && apt-get install -y \
5 | vim \
6 | git-core \
7 | wget \
8 | python3 \
9 | python3-pip \
10 | && pip3 install -r requirements.txt
11 |
12 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Michael
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Document Classification Comparisons featuring Hierarchical Attention Network
2 |
3 | The [Hierarchical Attention Network](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) is a novel deep learning architecture that takes advantage of the hierarchical structure of documents to construct a detailed representation of the document. Just as words form sentences and sentences form the document, the Hierarchical Attention Network builds its representation of the document along this same hierarchy, determining which sentences, and which words within those sentences, are most important for classifying the document as a whole.
4 |
5 | ![Hierarchical Attention Network architecture](lib/imgs/HierarchicalAttentionNetworksDiagram.png)
6 | 
7 | Figure 1: Hierarchical Attention Network architecture (Yang et al.)
8 |
9 |
10 |
11 | This model uses two levels of bidirectional GRU encoders, one at the word level and one at the sentence level, to build the word-level and sentence-level representations of the document. An attention mechanism is used to attribute importance at both the word and sentence levels.
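
Roughly, in the notation of Yang et al., the word-level computation for sentence *i* is as follows (the sentence level is analogous, and a softmax classifier sits on top of the final document vector *v*):

```latex
h_{it} = \mathrm{BiGRU}(x_{it})        % word encoder: concatenated forward and backward states
u_{it} = \tanh(W_w h_{it} + b_w)       % one-layer projection of each word state
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t'} \exp(u_{it'}^{\top} u_w)}   % attention weights w.r.t. the context vector u_w
s_i = \sum_{t} \alpha_{it} h_{it}      % sentence vector: attention-weighted sum of word states
p = \mathrm{softmax}(W_c v + b_c)      % class probabilities from the document vector v
```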
12 |
13 | The attention mechanism is applied twice: once over the outputs of the word-level encoder and once over the outputs of the sentence-level encoder. This allows the model to construct a representation of the document that attributes greater importance to the key sentences and words throughout the document.
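
A minimal NumPy sketch of that attention step, mirroring `attention()` in [han.py](src/han.py) (the function and variable names here are illustrative, not part of the repo):

```python
import numpy as np

def attention_sketch(encoder_outputs, W, b, u_w):
    """Toy NumPy version of the attention used at both levels in src/han.py.
    encoder_outputs: [batch, steps, hidden] bidirectional GRU outputs.
    W, b: projection weights and bias; u_w: the learned context vector."""
    u = np.tanh(encoder_outputs @ W + b)              # project each step, [batch, steps, out]
    scores = (u * u_w).sum(axis=2, keepdims=True)     # similarity of each step with the context vector
    scores -= scores.max(axis=1, keepdims=True)       # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # attention weights over steps
    return (alpha * u).sum(axis=1)                    # weighted sum; like the repo, this sums the projected vectors
```

For example, `attention_sketch(np.random.randn(2, 7, 64), np.random.randn(64, 64), np.zeros(64), np.random.randn(64))` returns a `[2, 64]` array: one vector per sentence at the word level, or per document at the sentence level.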
14 |
15 |
16 | ## IMDB Dataset
17 | All experiments were performed on the Stanford IMDB dataset, a natural language dataset of movie reviews labeled with the sentiment of the review. It is one of the datasets used in the original [Hierarchical Attention Network](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) paper. Each review carries one of 8 rating classes, 1-4 for negative sentiment and 7-10 for positive sentiment, which are mapped down to binary labels: 0 for negative and 1 for positive.
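
The binary recoding follows the aclImdb review filenames, which have the form `<id>_<rating>.txt`; schematically, the mapping performed in `_getID_label` of [dataProcessing.py](src/dataProcessing.py) is:

```python
def to_binary_label(rating: int) -> int:
    # aclImdb ratings 1-4 are negative and 7-10 are positive (5 and 6 do not occur)
    return 0 if int(rating) < 5 else 1

assert to_binary_label(3) == 0 and to_binary_label(9) == 1
```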
18 |
19 | ## Files in this repo
20 | * IMDB download script: [download.py](src/download.py)
21 | * first step of data preprocessing and create a csv: [create_csv.py](src/create_csv.py)
22 | * second step of data preprocessing and create serialized dataset as binary files: [serialize_data.py](src/serialize_data.py)
23 | * IMDB data preprocessing: [dataProcessing.py](src/dataProcessing.py)
24 | * Paths shared throughout files: [utils.py](src/utils.py)
25 | * Hierarchical Attention Networks: [han.py](src/han.py)
26 | * Train the Hierarchical Attention Networks: [han_trainer.py](src/han_trainer.py)
27 | * Test the Hierarchical Attention Networks: [han_tester.py](src/han_tester.py)
28 |
29 | ## What you need to run the code in this repo
30 | * [Docker](https://www.docker.com/)
31 | * Nvidia GPU with the CUDA driver installed
32 |
33 | ## To run the experiments contained in this repo
34 |
35 | **To run the model**
36 | * build the container image from the Dockerfile: `docker build -t han:1.0 .`
37 | * start the container: `nvidia-docker run -p 6006:6006 -p 8888:8889 -it "IMAGE_ID" bash`
38 | * to download and process all data, run `python3 run_all.py imdb True`, or run the three commands below
39 | * download the imdb dataset: `python3 download.py imdb`
40 | * create the csv file: `python3 create_csv.py imdb True`
41 | * create the serialized dataset as binary files: `python3 serialize_data.py imdb`
42 | * start training the HAN model with `nohup python3 han_trainer.py --run_type "train" >> train.out &`
43 | * start validating the HAN model with `nohup python3 han_tester.py --run_type "val" >> val.out &`
44 | * start testing the HAN model with `nohup python3 han_tester.py --run_type "test" >> test.out &`
45 |
46 | Note: the attention weights consume a large amount of GPU VRAM, and running validation while the model is training causes an out-of-memory exception.
47 |
48 | **Set up Tensorboard and Jupyter Notebook**
49 | * create another session in the same container `nvidia-docker exec -it "CONTAINER_ID" bash`
50 | * start jupyter notebook in the container with `jupyter notebook --no-browser --port=8889 --ip=0.0.0.0 --allow-root` and grab the authentication token
51 |
52 | * create another session in the same container `nvidia-docker exec -it "CONTAINER_ID" bash`
53 | * then run `tensorboard --logdir ../lib/summaries/train/` to start tensorboard in the container
54 |
55 | * go to `localhost:6001` in the browser on your local machine for tensorboard
56 | * go to `localhost:8890` in the browser on your local machine for jupyter notebook
57 |
58 | If you are working on a remote machine, you must set up SSH tunnels for the tensorboard and jupyter tools:
59 | * on your local machine, set up a tunnel for tensorboard: `ssh -N -L localhost:6001:localhost:6006 username@ipaddress`
60 | * on your local machine, set up a tunnel for jupyter notebook: `ssh -N -L localhost:8890:localhost:8888 username@ipaddress`
61 |
62 |
63 | ## Graph of operations for this model
64 | 
65 | ![HAN TensorFlow graph of operations](lib/imgs/graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png)
66 | Figure 2: Hierarchical Attention Network model graph operations
67 |
68 |
69 | ## Results
70 | ![training accuracy](lib/imgs/training_accuracy.png)
71 | Shown above is the training accuracy achieved while training the HAN model for 120 thousand training steps on the IMDB dataset with the labels converted to binary classes. As shown, the maximum training accuracy reached is approximately 64%, which is significantly lower than the accuracy reported in the original paper.
72 |
73 | ![training loss](lib/imgs/training_loss.png)
74 | Shown above is the training loss over the same 120 thousand training steps on the IMDB dataset with binary labels. The training loss appears to be steadily decreasing.
75 |
76 | ## References
77 | Yang, Zichao, et al. [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf). NAACL 2016.
78 |
79 | ## TODOs
80 | * publish trained model files
81 | * find a way to validate the model during training without causing an OOM, for example by pausing training, running validation, then resuming training
82 | * visualize the trained model's attention weights over an input text document in a jupyter notebook
83 |
--------------------------------------------------------------------------------
/lib/imgs/HierarchicalAttentionNetworksDiagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/HierarchicalAttentionNetworksDiagram.png
--------------------------------------------------------------------------------
/lib/imgs/graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/graph_large_attrs_key=_too_large_attrs&limit_attr_size=1024&run=.png
--------------------------------------------------------------------------------
/lib/imgs/training_accuracy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/training_accuracy.png
--------------------------------------------------------------------------------
/lib/imgs/training_loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mguarin0/HierarchicalAttentionNetworksForDocumentClassification/04c382b52488fc60a8fd4a15f7023efff180cc23/lib/imgs/training_loss.png
--------------------------------------------------------------------------------
/src/create_csv.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import os
6 | import argparse
7 | from dataProcessing import IMDB
8 | from utils import prjPaths
9 |
10 | def get_args():
11 | """
12 | desc: get cli arguments
13 | returns:
14 | args: dictionary of cli arguments
15 | """
16 |
17 | parser = argparse.ArgumentParser(description="this script is used for creating csv datasets for training this implementation of the Hierarchical Attention Networks")
18 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str)
19 |     parser.add_argument("binary", default=True, help="coerce to binary classification", type=lambda v: str(v).lower() in ("true", "1", "yes"))  # a plain type=bool would treat the string "False" as True
20 | args = parser.parse_args()
21 | return args
22 | # end
23 |
24 | def create_csv(paths, args):
25 | """
26 | desc: This function creates a csv file from a downloaded dataset.
27 | Currently this process works on the imdb dataset but other datasets
28 | can be easily added.
29 | args:
30 | args: dictionary of cli arguments
31 | paths: project paths
32 | """
33 |
34 | if args.dataset == "imdb":
35 | print("creating {} csv".format(args.dataset))
36 | imdb = IMDB(action="create")
37 | imdb.createManager(args.binary)
38 | print("{} csv created".format(args.dataset))
39 | # end
40 |
41 | if __name__ == "__main__":
42 | paths = prjPaths()
43 | args = get_args()
44 | create_csv(paths=paths, args=args)
45 |
--------------------------------------------------------------------------------
/src/dataProcessing.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import tensorflow as tf
6 | import os
7 | import csv
8 | import re
9 | import itertools
10 | import more_itertools
11 | import pickle
12 | import pandas as pd
13 | import numpy as np
14 | from tqdm import tqdm
15 | from bs4 import BeautifulSoup
16 | from utils import prjPaths
17 |
18 | class IMDB:
19 |
20 | def __init__(self, action):
21 | """
22 | desc: this class is used to process the imdb dataset
23 | args:
24 | action: specify whether to create or fetch the data using the IMDB class
25 | """
26 | self.paths = prjPaths()
27 | self.ROOT_DATA_DIR = self.paths.ROOT_DATA_DIR
28 | self.DATASET = "imdb"
29 |
30 | self.CSVFILENAME = os.path.join(self.ROOT_DATA_DIR, self.DATASET, "{}.csv".format(self.DATASET))
31 | assert(action in ["create", "fetch"]), "invalid action"
32 |
33 | if action == "create":
34 |
35 | # if creating new csv remove old if one exists
36 | if os.path.exists(self.CSVFILENAME):
37 | print("removing existing csv file from {}".format(self.CSVFILENAME))
38 | os.remove(self.CSVFILENAME)
39 |
40 | # directory structure
41 | train_dir = os.path.join(self.ROOT_DATA_DIR, self.DATASET, "aclImdb", "train")
42 | test_dir = os.path.join(self.ROOT_DATA_DIR, self.DATASET, "aclImdb", "test")
43 |
44 | trainPos_dir = os.path.join(train_dir, "pos")
45 | trainNeg_dir = os.path.join(train_dir, "neg")
46 |
47 | testPos_dir = os.path.join(test_dir, "pos")
48 | testNeg_dir = os.path.join(test_dir, "neg")
49 |
50 | self.data = {"trainPos": self._getDirContents(trainPos_dir),
51 | "trainNeg": self._getDirContents(trainNeg_dir),
52 | "testPos": self._getDirContents(testPos_dir),
53 | "testNeg": self._getDirContents(testNeg_dir)}
54 | # end
55 |
56 | def _getDirContents(self, path):
57 | """
58 | desc: get all filenames in a specified directory
59 | args:
60 | path: path of directory to get contents of
61 | returns:
62 | dirFiles: list of filenames in a directory
63 | """
64 | dirFiles = os.listdir(path)
65 | dirFiles = [os.path.join(path, file) for file in dirFiles]
66 | return dirFiles
67 | # end
68 |
69 | def _getID_label(self, file, binary):
70 | """
71 | desc: get label for a specific filename
72 | args:
73 | file: current file being operated on
74 | binary: specify if data should be recoded as binary or kept in original form for imdb dataset
75 | returns:
76 | list of unique identifier of file, label, and if it is test or training data
77 | """
78 | splitFile = file.split("/")
79 | testOtrain = splitFile[-3]
80 | filename = os.path.splitext(splitFile[-1])[0]
81 | id, label = filename.split("_")
82 | if binary:
83 | if int(label) < 5:
84 | label = 0
85 | else:
86 | label = 1
87 |
88 | return [id, label, testOtrain]
89 | # end
90 |
91 | def _loadTxtFiles(self, dirFiles, binary):
92 | """
93 | desc: load and format all imdb dataset
94 | args:
95 | dirFiles: current file being operated on
96 | binary: specify if data should be recoded as binary or kept in original form for imdb dataset
97 | returns:
98 | list of dictionaries containing all information about imdb dataset
99 | """
100 | TxtContents = list()
101 | for file in tqdm(dirFiles, desc="process all files in a directory"):
102 | try:
103 | with open(file, encoding="utf8") as txtFile:
104 | content = txtFile.read()
105 | id, label, testOtrain = self._getID_label(file, binary=binary)
106 | TxtContents.append({"id": id,
107 | "content": content,
108 | "label": label,
109 | "testOtrain": testOtrain})
110 | except:
111 |                 print("this file threw an error and is being omitted: {}".format(file))
112 | continue
113 | return TxtContents
114 | # end
115 |
116 | def _writeTxtFiles(self, TxtContents):
117 | """
118 | desc: write imdb content and meta data to csv
119 | args:
120 | TxtContents: list of dictionaries containing all information about imdb dataset
121 | """
122 |
123 | with open(self.CSVFILENAME, "a") as csvFile:
124 | fieldNames = ["id", "content", "label", "testOtrain"]
125 | writer = csv.DictWriter(csvFile, fieldnames=fieldNames)
126 | writer.writeheader()
127 |
128 | for seq in TxtContents:
129 | try:
130 | writer.writerow({"id": seq["id"],
131 | "content": seq["content"].encode("ascii", "ignore").decode("ascii"),
132 | "label": seq["label"],
133 | "testOtrain": seq["testOtrain"]})
134 | except:
135 | print("this sequence threw an exception: {}".format(seq["id"]))
136 | continue
137 | # end
138 |
139 | def createManager(self, binary):
140 | """
141 | desc: This function is called by create_csv.py script.
142 | It manages the loading, formatting, and creation of a csv from the imdb directory structure.
143 | args:
144 | binary: specify if data should be recoded as binary or kept in original form for imdb dataset
145 | """
146 |
147 | for key in self.data.keys():
148 | self.data[key] = self._loadTxtFiles(self.data[key], binary)
149 | self._writeTxtFiles(self.data[key])
150 | # end
151 |
152 | def _clean_str(self, string):
153 | """
154 | desc: This function cleans a string
155 | adapted from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
156 | args:
157 | string: the string to be cleaned
158 | returns:
159 | a cleaned string
160 | """
161 |
162 | string = BeautifulSoup(string, "lxml").text
163 | string = re.sub(r"[^A-Za-z0-9(),!?\"\`]", " ", string)
164 | string = re.sub(r"\"s", " \"s", string)
165 | string = re.sub(r"\"ve", " \"ve", string)
166 | string = re.sub(r"n\"t", " n\"t", string)
167 | string = re.sub(r"\"re", " \"re", string)
168 | string = re.sub(r"\"d", " \"d", string)
169 | string = re.sub(r"\"ll", " \"ll", string)
170 | string = re.sub(r",", " , ", string)
171 | string = re.sub(r"!", " ! ", string)
172 | string = re.sub(r"\(", " \( ", string)
173 | string = re.sub(r"\)", " \) ", string)
174 | string = re.sub(r"\?", " \? ", string)
175 | string = re.sub(r"\s{2,}", " ", string)
176 | return string.strip().lower().split(" ")
177 | # end
178 |
179 | def _oneHot(self, ys):
180 | """
181 | desc: one hot encodes labels in dataset
182 | args:
183 | ys: dataset labels
184 | returns:
185 | list of one hot encoded training, testing, and lookup labels
186 | """
187 |
188 | y_train, y_test = ys
189 | y_train = list(map(int, y_train)) # confirm all type int
190 | y_test = list(map(int, y_test)) # confirm all type int
191 | lookuplabels = {v: k for k, v in enumerate(sorted(list(set(y_train + y_test))))}
192 | recoded_y_train = [lookuplabels[i] for i in y_train]
193 | recoded_y_test = [lookuplabels[i] for i in y_test]
194 | labels_y_train = tf.constant(recoded_y_train)
195 | labels_y_test = tf.constant(recoded_y_test)
196 |         max_label = tf.reduce_max(tf.concat([labels_y_train, labels_y_test], axis=0))  # "+" on tensors adds elementwise, it does not concatenate the label lists
197 | labels_y_train_OHE = tf.one_hot(labels_y_train, max_label+1)
198 | labels_y_test_OHE = tf.one_hot(labels_y_test, max_label+1)
199 |
200 | with tf.Session() as sess:
201 | # Initialize all variables
202 | sess.run(tf.global_variables_initializer())
203 | #l = sess.run(labels)
204 | y_train_ohe = sess.run(labels_y_train_OHE)
205 | y_test_ohe = sess.run(labels_y_test_OHE)
206 | sess.close()
207 | return [y_train_ohe, y_test_ohe, lookuplabels]
208 | # end
209 |
210 | def _index(self, xs):
211 | """
212 | desc: apply index to text data and persist unique vocabulary in dataset to pickle file
213 | args:
214 | xs: text data
215 | returns:
216 | list of test, train data after it was indexed, the lookup table for the vocabulary,
217 | and any persisted variables that may be needed
218 | """
219 | def _apply_index(txt_data):
220 | indexed = [[[unqVoc_LookUp[char] for char in seq] for seq in doc] for doc in txt_data]
221 | return indexed
222 | # end
223 |
224 | x_train, x_test = xs
225 |
226 | # create look up table for all unique vocab in test and train datasets
227 | unqVoc = set(list(more_itertools.collapse(x_train[:] + x_test[:])))
228 | unqVoc_LookUp = {k: v+1 for v, k in enumerate(unqVoc)}
229 | vocab_size = len(list(unqVoc_LookUp))
230 |
231 | x_train = _apply_index(txt_data=x_train)
232 | x_test = _apply_index(txt_data=x_test)
233 |
234 | # determine max sequence lengths
235 |         max_seq_len = max([len(seq) for seq in itertools.chain.from_iterable(x_train + x_test)]) # max number of words in any sentence across all documents
236 |         max_sent_len = max([len(sent) for sent in (x_train + x_test)]) # max number of sentences in any document
237 |
238 | persisted_vars = {"max_seq_len":max_seq_len,
239 | "max_sent_len":max_sent_len,
240 | "vocab_size":vocab_size}
241 |
242 | return [x_train, x_test, unqVoc_LookUp, persisted_vars]
243 | # end
244 |
245 | def partitionManager(self, dataset):
246 | """
247 | desc: apply index to text data, one hot encode labels, and persist unique vocabulary in dataset to pickle file
248 | args:
249 | dataset: dataset to be processed
250 | returns:
251 | return list of indexed training, training data along with one hot encoded labels
252 | """
253 | assert(self.DATASET==dataset), "this function works on {} and is not meant to process {} dataset".format(self.DATASET, dataset)
254 |
255 | # load csv file
256 | df = pd.read_csv(self.CSVFILENAME)
257 |
258 | # partition data
259 | train = df.loc[df["testOtrain"] == "train"]
260 | test = df.loc[df["testOtrain"] == "test"]
261 |
262 | # create 3D list for han model and clean strings
263 | create3DList = lambda df: [[self._clean_str(seq) for seq in "|||".join(re.split("[.?!]", docs)).split("|||")]
264 | for docs in df["content"].values]
265 | x_train = create3DList(df=train)
266 | x_test = create3DList(df=test)
267 |
268 | # index and persist unq vocab in pickle file
269 | x_train, x_test, unqVoc_LookUp, persisted_vars = self._index(xs=[x_train[:], x_test[:]])
270 |
271 | y_train = train["label"].tolist()
272 | y_test = test["label"].tolist()
273 |
274 | #OHE classes
275 | y_train_ohe, y_test_ohe, lookuplabels = self._oneHot(ys=[y_train, y_test])
276 |
277 | # update persisted vars
278 | persisted_vars["lookuplabels"] = lookuplabels
279 | persisted_vars["num_classes"] = len(lookuplabels.keys())
280 |
281 | # save lookup table and variables that need to be persisted
282 | if not os.path.exists(os.path.join(self.paths.LIB_DIR, self.DATASET)):
283 | os.mkdir(os.path.join(self.paths.LIB_DIR, self.DATASET))
284 |         pickle.dump(unqVoc_LookUp, open(os.path.join(self.paths.LIB_DIR, self.DATASET, "unqVoc_Lookup.p"), "wb"))
285 |         pickle.dump(persisted_vars, open(os.path.join(self.paths.LIB_DIR, self.DATASET, "persisted_vars.p"), "wb"))
286 |
287 | return[x_train, y_train_ohe, x_test, y_test_ohe]
288 | # end
289 |
290 | def get_data(self, type_):
291 | """
292 | desc: load and return dataset from binary files
293 | args:
294 | type_: type of dataset (train, val, test)
295 | returns:
296 | loaded dataset
297 | """
298 |
299 | assert(type_ in ["train", "val", "test"])
300 |
301 | print("loading {} dataset...".format(type_))
302 |
303 | x = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_x.npy".format(type_)))
304 | y = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_y.npy".format(type_)))
305 | docsize = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_docsize.npy".format(type_)))
306 | sent_size = np.load(os.path.join(self.paths.ROOT_DATA_DIR, self.DATASET, "{}_sent_size.npy".format(type_)))
307 | return [x, y, docsize, sent_size]
308 | # end
309 |
310 | def get_batch_iter(self, data, batch_size, num_epochs, shuffle=True):
311 | """
312 | desc: batch dataset generator
313 | args:
314 | data: dataset to batch as list
315 | batch_size: the batch size used
316 | num_epochs: number of training epochs
317 | shuffle: shuffle dataset
318 |         yields:
319 |             (epoch, batch) tuples; adapted from Denny Britz https://github.com/dennybritz/cnn-text-classification-tf.git
320 | """
321 |
322 | data = np.array(data)
323 | data_size = len(data)
324 | num_batches_per_epoch = int((len(data) - 1) / batch_size) + 1
325 | for epoch in range(num_epochs):
326 | # Shuffle the data at each epoch
327 | if shuffle:
328 | shuffle_indices = np.random.permutation(np.arange(data_size))
329 | next_batch = data[shuffle_indices]
330 | else:
331 | next_batch = data
332 | for batch_num in range(num_batches_per_epoch):
333 | start_index = batch_num * batch_size
334 | end_index = min((batch_num + 1) * batch_size, data_size)
335 | #yield next_batch[start_index:end_index]
336 | yield epoch, next_batch[start_index:end_index]
337 | # end
338 |
339 | def hanformater(self, inputs):
340 | """
341 | desc: format data specific for hierarchical attention networks
342 | args:
343 | inputs: data
344 | returns:
345 | dataset with corresponding dimensions for document and sentence level
346 | """
347 |
348 | batch_size = len(inputs)
349 |
350 | document_sizes = np.array([len(doc) for doc in inputs], dtype=np.int32)
351 | document_size = document_sizes.max()
352 |
353 | sentence_sizes_ = [[len(sent) for sent in doc] for doc in inputs]
354 | sentence_size = max(map(max, sentence_sizes_))
355 |
356 | b = np.zeros(shape=[batch_size, document_size, sentence_size], dtype=np.int32) # == PAD
357 |
358 | sentence_sizes = np.zeros(shape=[batch_size, document_size], dtype=np.int32)
359 | for i, document in enumerate(tqdm(inputs, desc="formating data for hierarchical attention networks")):
360 | for j, sentence in enumerate(document):
361 | sentence_sizes[i, j] = sentence_sizes_[i][j]
362 | for k, word in enumerate(sentence):
363 | b[i, j, k] = word
364 | return b, document_sizes, sentence_sizes
365 | # end
366 | # end
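367 | 
368 | # Illustrative usage sketch (not part of the original pipeline scripts): this is roughly
369 | # how han_trainer.py consumes the class once download.py, create_csv.py, and
370 | # serialize_data.py have been run for the imdb dataset.
371 | if __name__ == "__main__":
372 |     imdb = IMDB(action="fetch")
373 |     x_train, y_train, docsize_train, sent_size_train = imdb.get_data(type_="train")
374 |     batches = imdb.get_batch_iter(data=list(zip(x_train, y_train, docsize_train, sent_size_train)),
375 |                                   batch_size=2, num_epochs=1)
376 |     epoch, first_batch = next(batches)
377 |     print("epoch {}: batch of {} documents".format(epoch, len(first_batch)))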
367 |
--------------------------------------------------------------------------------
/src/download.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import os
6 | import shutil
7 | import platform
8 | import urllib.request
9 | import tarfile
10 | import traceback
11 | import argparse
12 |
13 | from utils import prjPaths
14 |
15 | def get_args():
16 | """
17 | desc: get cli arguments
18 | returns:
19 | args: dictionary of cli arguments
20 | """
21 |
22 | parser = argparse.ArgumentParser(description="this script is used for downloading datasets for training this implementation of the Hierarchical Attention Networks")
23 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str)
24 | args = parser.parse_args()
25 | return args
26 | # end
27 |
28 | def download(paths, args):
29 | """
30 | desc: download a dataset from url
31 | args:
32 | args: dictionary of cli arguments
33 | paths: project paths
34 | """
35 |
36 | if args.dataset == "imdb":
37 | resource_loc = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
38 | osType = platform.system()
39 | if osType == "Windows":
40 |             print("manually download the dataset from {}"\
41 |                   " and extract it into this project's data/imdb directory".format(resource_loc))
42 |             exit(0)
43 |         elif osType != "Linux":
44 | osType = "OSX"
45 |
46 | filename=os.path.join(paths.ROOT_DATA_DIR, args.dataset, "aclImdb_v1.tar.gz")
47 | ACLIMDB_DIR = os.path.join(paths.ROOT_DATA_DIR, args.dataset)
48 |
49 | # if tar file already exists remove it
50 | if os.path.exists(filename):
51 | os.remove(filename)
52 |         # if the aclImdb dir already exists remove it
53 |         if os.path.exists(os.path.join(ACLIMDB_DIR, "aclImdb")):
54 |             shutil.rmtree(os.path.join(ACLIMDB_DIR, "aclImdb"))
55 |         # make sure the data/imdb directory exists (os.mkdir fails if it is already there)
56 |         os.makedirs(ACLIMDB_DIR, exist_ok=True)
57 |
58 | print("downloading: {}".format(args.dataset))
59 | try:
60 | urllib.request.urlretrieve(resource_loc, filename)
61 | except Exception as e:
62 | print("something went wrong downloading: {} at {}".format(args.dataset, resource_loc))
63 | traceback.print_exc()
64 |
65 | print("unpacking: {}".format(args.dataset))
66 | if (filename.endswith("tar.gz")):
67 | tar = tarfile.open(filename, "r:gz")
68 | tar.extractall(ACLIMDB_DIR)
69 | tar.close()
70 | elif (filename.endswith("tar")):
71 | tar = tarfile.open(filename, "r:")
72 | tar.extractall(ACLIMDB_DIR)
73 | tar.close()
74 | # end
75 |
76 | if __name__ == "__main__":
77 | paths = prjPaths()
78 | args = get_args()
79 | download(paths=paths, args=args)
80 | print("download complete!")
81 |
--------------------------------------------------------------------------------
/src/han.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import numpy as np
6 |
7 | np.set_printoptions(threshold=np.nan)
8 | import tensorflow as tf
9 | from tensorflow.contrib import rnn
10 | import tensorflow.contrib.layers as layers
11 |
12 | class HAN:
13 | def __init__(self, max_seq_len, max_sent_len, num_classes,
14 | vocab_size, embedding_size, max_grad_norm, dropout_keep_proba,
15 | learning_rate):
16 | ## Parameters
17 | self.learning_rate = learning_rate
18 | self.vocab_size = vocab_size
19 | self.num_classes = num_classes
20 | self.max_seq_len = max_seq_len
21 | self.embedding_size = embedding_size
22 | self.word_encoder_num_hidden = max_seq_len
23 | self.word_output_size = max_seq_len
24 | self.sentence_encoder_num_hidden = max_sent_len
25 | self.sentence_output_size = max_sent_len
26 | self.max_grad_norm = max_grad_norm
27 | self.dropout_keep_proba = dropout_keep_proba
28 |
29 | # tf graph input
30 | self.input_x = tf.placeholder(shape=[None, None, None],
31 | dtype=tf.int32,
32 | name="input_x")
33 | self.input_y = tf.placeholder(shape=[None, self.num_classes],
34 | dtype=tf.int32,
35 | name="input_y")
36 | self.word_lengths = tf.placeholder(shape=[None, None],
37 | dtype=tf.int32,
38 | name="word_lengths")
39 | self.sentence_lengths = tf.placeholder(shape=[None,],
40 | dtype=tf.int32,
41 | name="sentence_lengths")
42 | self.is_training = tf.placeholder(dtype=tf.bool,
43 | name="is_training")
44 |
45 | # input_x dims
46 | (self.document_size, self.sentence_size, self.word_size) = tf.unstack(tf.shape(self.input_x))
47 |
48 | with tf.device("/gpu:0"), tf.name_scope("embedding_layer"):
49 | w = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0),
50 | dtype=tf.float32,
51 | name="w") # TODO check if this needs to be marked as untrainable
52 | self.input_x_embedded = tf.nn.embedding_lookup(w, self.input_x)
53 |
54 | # reshape input_x after embedding
55 | self.input_x_embedded = tf.reshape(self.input_x_embedded,
56 | [self.document_size * self.sentence_size, self.word_size, self.embedding_size])
57 | self.input_x_embedded_lengths = tf.reshape(self.word_lengths, [self.document_size * self.sentence_size])
58 |
59 | with tf.variable_scope("word_level"):
60 | self.word_encoder_outputs = self.bidirectional_RNN(num_hidden=self.word_encoder_num_hidden,
61 | inputs=self.input_x_embedded)
62 | word_level_output = self.attention(inputs=self.word_encoder_outputs,
63 | output_size=self.word_output_size)
64 |
65 | with tf.variable_scope("dropout"):
66 | print('self.is_training: {}'.format(self.is_training))
67 | word_level_output = layers.dropout(word_level_output,
68 | keep_prob=self.dropout_keep_proba,
69 | is_training=self.is_training)
70 |
71 | # reshape word_level output
72 | self.sentence_encoder_inputs = tf.reshape(word_level_output,
73 | [self.document_size, self.sentence_size, self.word_output_size])
74 |
75 | with tf.variable_scope("sentence_level"):
76 | self.sentence_encoder_outputs = self.bidirectional_RNN(num_hidden=self.sentence_encoder_num_hidden,
77 | inputs=self.sentence_encoder_inputs)
78 | sentence_level_output = self.attention(inputs=self.sentence_encoder_outputs,
79 | output_size=self.sentence_output_size)
80 | with tf.variable_scope("dropout"):
81 | sentence_level_output = layers.dropout(sentence_level_output,
82 | keep_prob=self.dropout_keep_proba,
83 | is_training=self.is_training)
84 |
85 | # Final model prediction
86 | with tf.variable_scope("classifier_output"):
87 | self.logits = layers.fully_connected(sentence_level_output,
88 | self.num_classes,
89 | activation_fn=None)
90 | #trainable=self.is_training)
91 | self.predictions = tf.argmax(self.logits, axis=1, name="predictions")
92 |
93 | # Calculate mean cross-entropy loss
94 | with tf.variable_scope("loss"):
95 | losses = tf.nn.softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
96 | self.loss = tf.reduce_mean(losses)
97 | tf.summary.scalar("Loss", self.loss)
98 |
99 | # Accuracy
100 | with tf.variable_scope("accuracy"):
101 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, axis=1))
102 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
103 | tf.summary.scalar("Accuracy", self.accuracy)
104 |
105 | def bidirectional_RNN(self, num_hidden, inputs):
106 | """
107 | desc: create bidirectional rnn layer
108 | args:
109 | num_hidden: number of hidden units
110 | inputs: input word or sentence
111 | returns:
112 |             concatenated forward and backward encoder outputs
113 | """
114 |
115 | with tf.name_scope("bidirectional_RNN"):
116 | encoder_fw_cell = rnn.GRUCell(num_hidden)
117 | encoder_bw_cell = rnn.GRUCell(num_hidden)
118 | ((encoder_fw_outputs, encoder_bw_outputs), (_, _)) = tf.nn.bidirectional_dynamic_rnn(cell_fw=encoder_fw_cell,
119 | cell_bw=encoder_bw_cell,
120 | inputs=inputs,
121 | dtype=tf.float32,
122 | time_major=True)
123 | encoder_outputs = tf.concat((encoder_fw_outputs, encoder_bw_outputs), 2)
124 | return encoder_outputs
125 | # end
126 |
127 | def attention(self, inputs, output_size):
128 | """
129 | desc: create attention mechanism
130 | args:
131 | inputs: input which is sentence or document level output from bidirectional rnn layer
132 | output_size: specify the dimensions of the output
133 | returns:
134 | output from attention distribution
135 | """
136 |
137 | with tf.variable_scope("attention"):
138 | attention_context_vector_uw = tf.get_variable(name="attention_context_vector",
139 | shape=[output_size],
140 | #trainable=self.is_training,
141 | initializer=layers.xavier_initializer(),
142 | dtype=tf.float32)
143 | input_projection_u = layers.fully_connected(inputs,
144 | output_size,
145 | #trainable=self.is_training,
146 | activation_fn=tf.tanh)
147 | vector_attn = tf.reduce_sum(tf.multiply(input_projection_u, attention_context_vector_uw), axis=2, keep_dims=True)
148 | attention_weights = tf.nn.softmax(vector_attn, dim=1)
149 | weighted_projection = tf.multiply(input_projection_u, attention_weights)
150 | outputs = tf.reduce_sum(weighted_projection, axis=1)
151 | return outputs
152 | # end
153 | # end
154 |
--------------------------------------------------------------------------------
/src/han_tester.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import tensorflow as tf
6 | import numpy as np
7 | from tqdm import tqdm
8 | import time
9 | import pickle
10 | from scipy import stats
11 | from collections import Counter
12 | import os
13 | from han import HAN
14 | from utils import prjPaths, get_logger
15 | from dataProcessing import IMDB
16 |
17 | def get_flags():
18 | """
19 | desc: get cli arguments
20 | returns:
21 | args: dictionary of cli arguments
22 | """
23 |
24 | tf.flags.DEFINE_string("dataset", "imdb",
25 | "enter the type of training dataset")
26 | tf.flags.DEFINE_string("run_type", "val",
27 | "enter val or test to specify run_type (default: val)")
28 | tf.flags.DEFINE_integer("log_summaries_every", 30,
29 | "Save model summaries after this many steps (default: 30)")
30 | tf.flags.DEFINE_float("per_process_gpu_memory_fraction", 0.90,
31 | "gpu memory to be used (default: 0.90)")
32 | tf.flags.DEFINE_boolean("wait_for_checkpoint_files", False,
33 | "wait for model checkpoint file to be created")
34 |
35 | FLAGS = tf.flags.FLAGS
36 | FLAGS._parse_flags()
37 |
38 | return FLAGS
39 | # end
40 |
41 | def get_most_recently_created_file(files):
42 | return max(files, key=os.path.getctime) # most recently created file in list of files
43 | # end
44 |
45 | if __name__ == '__main__':
46 |
47 | MINUTE = 60
48 | paths = prjPaths()
49 | FLAGS = get_flags()
50 |
51 | print("current version of tf:{}".format(tf.__version__))
52 |
53 | assert(FLAGS.run_type == "val" or FLAGS.run_type == "test")
54 |
55 | print("loading persisted variables...")
56 | with open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "persisted_vars.p"), "rb") as handle:
57 | persisted_vars = pickle.load(handle)
58 |
59 | # create new graph set as default
60 | with tf.Graph().as_default():
61 | gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.per_process_gpu_memory_fraction)
62 | session_conf = tf.ConfigProto(allow_soft_placement=True,
63 | log_device_placement=False,
64 | gpu_options=gpu_options)
65 | session_conf.gpu_options.allocator_type = "BFC"
66 |
67 | # create new session set it as default
68 | with tf.Session(config=session_conf) as sess:
69 |
70 | # create han model instance
71 | han = HAN(max_seq_len=persisted_vars["max_seq_len"],
72 | max_sent_len=persisted_vars["max_sent_len"],
73 | num_classes=persisted_vars["num_classes"],
74 | vocab_size=persisted_vars["vocab_size"],
75 | embedding_size=persisted_vars["embedding_dim"],
76 | max_grad_norm=persisted_vars["max_grad_norm"],
77 | dropout_keep_proba=persisted_vars["dropout_keep_proba"],
78 | learning_rate=persisted_vars["learning_rate"])
79 |
80 | global_step = tf.Variable(0, name="global_step", trainable=False)
81 | tvars = tf.trainable_variables()
82 | grads, global_norm = tf.clip_by_global_norm(tf.gradients(han.loss, tvars),
83 | han.max_grad_norm)
84 | optimizer = tf.train.AdamOptimizer(han.learning_rate)
85 | test_op = optimizer.apply_gradients(zip(grads, tvars),
86 | name="{}_op".format(FLAGS.run_type),
87 | global_step=global_step)
88 |
89 | # write summaries
90 | merge_summary_op = tf.summary.merge_all()
91 | test_summary_writer = tf.summary.FileWriter(os.path.join(paths.SUMMARY_DIR, FLAGS.run_type), sess.graph)
92 |
93 | # give check for checkpoint files directory if none then sleep until a checkpoint is created
94 | #if os.listdir(paths.CHECKPOINT_DIR) == []:
95 | #time.sleep(2*MINUTE)
96 |
97 | meta_file = get_most_recently_created_file([os.path.join(paths.CHECKPOINT_DIR, file) for file in os.listdir(paths.CHECKPOINT_DIR) if file.endswith('.meta')])
98 | saver = tf.train.import_meta_graph(meta_file)
99 |
100 | # Initialize all variables
101 | sess.run(tf.global_variables_initializer())
102 |
103 | def test_step(sample_num, x_batch, y_batch, docsize, sent_size, is_training):
104 |
105 | feed_dict = {han.input_x: x_batch,
106 | han.input_y: y_batch,
107 | han.sentence_lengths: docsize,
108 | han.word_lengths: sent_size,
109 | han.is_training: is_training}
110 |
111 | loss, accuracy = sess.run([han.loss, han.accuracy], feed_dict=feed_dict)
112 | return loss, accuracy
113 | # end
114 |
115 | # generate batches on imdb dataset else quit
116 | if FLAGS.dataset == "imdb":
117 | dataset_controller = IMDB(action="fetch")
118 | else:
119 |                 exit("set dataset flag to appropriate dataset")
120 |
121 | x, y, docsize, sent_size = dataset_controller.get_data(type_=FLAGS.run_type) # fetch dataset
122 | all_evaluated_chkpts = [] # list of all checkpoint files previously evaluated
123 |
124 | # testing loop
125 | while True:
126 |
127 | if FLAGS.wait_for_checkpoint_files:
128 | time.sleep(2*MINUTE) # wait to allow for creation of new checkpoint file
129 | else:
130 | time.sleep(0*MINUTE) # don't wait for model checkpoint files
131 |
132 | # if checkpoint file already evaluated then continue and wait for a new checkpoint file
133 | if (tf.train.latest_checkpoint(paths.CHECKPOINT_DIR) in all_evaluated_chkpts):
134 | continue
135 |
136 | # restore most recent checkpoint
137 | saver.restore(sess, tf.train.latest_checkpoint(paths.CHECKPOINT_DIR)) # restore most recent checkpoint
138 | all_evaluated_chkpts.append(tf.train.latest_checkpoint(paths.CHECKPOINT_DIR)) # add current checkpoint to list of evaluated checkpoints
139 |
140 | losses = [] # aggregate testing losses on a given checkpoint
141 | accuracies = [] # aggregate testing accuracies on a given checkpoint
142 |
143 | tic = time.time() # start time for step
144 |
145 | # loop to test every sample on a given checkpoint
146 | for i, batch in enumerate(tqdm(list(zip(x, y, docsize, sent_size)))):
147 |
148 | x_batch, y_batch, docsize_batch, sent_size_batch = batch
149 |                     x_batch = np.expand_dims(x_batch, axis=0)
150 |                     y_batch = np.expand_dims(y_batch, axis=0)
151 |                     docsize_batch = np.expand_dims(docsize_batch, axis=0)
152 |                     sent_size_batch = np.expand_dims(sent_size_batch, axis=0)
153 |                     # run step on this single-sample batch (feed the per-sample values, not the full arrays)
154 |                     loss, accuracy = test_step(sample_num=i,
155 |                                                x_batch=x_batch,
156 |                                                y_batch=y_batch,
157 |                                                docsize=docsize_batch,
158 |                                                sent_size=sent_size_batch,
159 |                                                is_training=False)
160 | losses.append(loss)
161 | accuracies.append(accuracy)
162 |
163 | time_elapsed = time.time() - tic # end time for step
164 |
165 | losses_accuracies_vars = {"losses": losses, "accuracies": accuracies}
166 |
167 | print("Time taken to complete {} evaluation of {} checkpoint: {}".format(FLAGS.run_type, all_evaluated_chkpts[-1], time_elapsed))
168 | for k in losses_accuracies_vars.keys():
169 | print("stats for {}: {}".format(k, stats.describe(losses_accuracies_vars[k])))
170 | print(Counter(losses_accuracies_vars[k]))
171 |
172 | filename, ext = os.path.splitext(all_evaluated_chkpts[-1])
173 |                 pickle.dump(losses_accuracies_vars, open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "losses_accuracies_vars_{}.p".format(filename.split("/")[-1])), "wb"))
174 |
175 | sess.close()
176 |
--------------------------------------------------------------------------------
/src/han_trainer.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import tensorflow as tf
6 | import numpy as np
7 | import time
8 | import pickle
9 | import os
10 | from han import HAN
11 | from utils import prjPaths, get_logger
12 | from dataProcessing import IMDB
13 |
14 | def get_flags():
15 | """
16 | desc: get cli arguments
17 | returns:
18 | args: dictionary of cli arguments
19 | """
20 |
21 | tf.flags.DEFINE_string("dataset", "imdb",
22 | "enter the type of training dataset")
23 | tf.flags.DEFINE_string("run_type", "train",
24 |                            "enter train to specify run_type (default: train)")
25 | tf.flags.DEFINE_integer("embedding_dim", 100,
26 | "Dimensionality of character embedding (default: 100)")
27 | tf.flags.DEFINE_integer("batch_size", 2,
28 | "Batch Size (default: 2)")
29 | tf.flags.DEFINE_integer("num_epochs", 25,
30 | "Number of training epochs (default: 25)")
31 | tf.flags.DEFINE_integer("evaluate_every", 100,
32 | "Evaluate model on dev set after this many steps")
33 | tf.flags.DEFINE_integer("log_summaries_every", 30,
34 | "Save model summaries after this many steps (default: 30)")
35 | tf.flags.DEFINE_integer("checkpoint_every", 100,
36 | "Save model after this many steps (default: 100)")
37 | tf.flags.DEFINE_integer("num_checkpoints", 5,
38 | "Number of checkpoints to store (default: 5)")
39 | tf.flags.DEFINE_float("max_grad_norm", 5.0,
40 | "maximum permissible norm of the gradient (default: 5.0)")
41 | tf.flags.DEFINE_float("dropout_keep_proba", 0.5,
42 | "probability of neurons turned off (default: 0.5)")
43 | tf.flags.DEFINE_float("learning_rate", 0.001,
44 | "model learning rate (default: 0.001)")
45 | tf.flags.DEFINE_float("per_process_gpu_memory_fraction", 0.90,
46 | "gpu memory to be used (default: 0.90)")
47 |
48 | FLAGS = tf.flags.FLAGS
49 | FLAGS._parse_flags()
50 |
51 | return FLAGS
52 | # end
53 |
54 | if __name__ == '__main__':
55 |
56 | paths = prjPaths()
57 | FLAGS = get_flags()
58 |
59 | print("current version of tf:{}".format(tf.__version__))
60 |
61 | assert(FLAGS.run_type == "train")
62 |
63 | print("loading persisted variables...")
64 |
65 | with open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "persisted_vars.p"), "rb") as handle:
66 | persisted_vars = pickle.load(handle)
67 |
68 | persisted_vars["embedding_dim"] = FLAGS.embedding_dim
69 | persisted_vars["max_grad_norm"] = FLAGS.max_grad_norm
70 | persisted_vars["dropout_keep_proba"] = FLAGS.dropout_keep_proba
71 | persisted_vars["learning_rate"] = FLAGS.learning_rate
72 |     pickle.dump(persisted_vars, open(os.path.join(paths.LIB_DIR, FLAGS.dataset, "persisted_vars.p"), "wb"))
73 |
74 | # create new graph set as default
75 | with tf.Graph().as_default():
76 | gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.per_process_gpu_memory_fraction)
77 | session_conf = tf.ConfigProto(allow_soft_placement=True,
78 | log_device_placement=False,
79 | gpu_options=gpu_options)
80 | session_conf.gpu_options.allocator_type = "BFC"
81 |
82 | # create new session set it as default
83 | with tf.Session(config=session_conf) as sess:
84 |
85 | # create han model instance
86 | han = HAN(max_seq_len=persisted_vars["max_seq_len"],
87 | max_sent_len=persisted_vars["max_sent_len"],
88 | num_classes=persisted_vars["num_classes"],
89 | vocab_size=persisted_vars["vocab_size"],
90 | embedding_size=persisted_vars["embedding_dim"],
91 | max_grad_norm=persisted_vars["max_grad_norm"],
92 | dropout_keep_proba=persisted_vars["dropout_keep_proba"],
93 | learning_rate=persisted_vars["learning_rate"])
94 |
95 | global_step = tf.Variable(0, name="global_step", trainable=False)
96 | tvars = tf.trainable_variables()
97 | grads, global_norm = tf.clip_by_global_norm(tf.gradients(han.loss, tvars),
98 | han.max_grad_norm)
99 | optimizer = tf.train.AdamOptimizer(han.learning_rate)
100 | train_op = optimizer.apply_gradients(zip(grads, tvars),
101 | name="train_op",
102 | global_step=global_step)
103 |
104 | # write summaries
105 | merge_summary_op = tf.summary.merge_all()
106 | train_summary_writer = tf.summary.FileWriter(os.path.join(paths.SUMMARY_DIR, FLAGS.run_type), sess.graph)
107 |
108 | # checkpoint model
109 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)
110 |
111 | # Initialize all variables
112 | sess.run(tf.global_variables_initializer())
113 |
114 | def train_step(epoch, x_batch, y_batch, docsize, sent_size, is_training):
115 | tic = time.time() # start time for step
116 |
117 | feed_dict = {han.input_x: x_batch,
118 | han.input_y: y_batch,
119 | han.sentence_lengths: docsize,
120 | han.word_lengths: sent_size,
121 | han.is_training: is_training}
122 |
123 | _, step, loss, accuracy, summaries = sess.run([train_op, global_step, han.loss, han.accuracy, merge_summary_op], feed_dict=feed_dict)
124 |
125 | time_elapsed = time.time() - tic # end time for step
126 |
127 | if is_training:
128 | print("Training || CurrentEpoch: {} || GlobalStep: {} || ({} sec/step) || Loss {:g} || Accuracy {:g}".format(epoch+1, step, time_elapsed, loss, accuracy))
129 |
130 | if step % FLAGS.log_summaries_every == 0:
131 | train_summary_writer.add_summary(summaries, step)
132 | print("Saved model summaries to {}\n".format(os.path.join(paths.SUMMARY_DIR, FLAGS.run_type)))
133 |
134 | if step % FLAGS.checkpoint_every == 0:
135 | chkpt_path = saver.save(sess,
136 | os.path.join(paths.CHECKPOINT_DIR, "han"),
137 | global_step=step)
138 | print("Saved model checkpoint to {}\n".format(chkpt_path))
139 | # end
140 |
141 | # Generate batches
142 | imdb = IMDB(action="fetch")
143 | x_train, y_train, docsize_train, sent_size_train = imdb.get_data(type_=FLAGS.run_type)
144 |
145 | # Training loop. For each batch...
146 | for epoch, batch in imdb.get_batch_iter(data=list(zip(x_train, y_train, docsize_train, sent_size_train)),
147 | batch_size=FLAGS.batch_size,
148 | num_epochs=FLAGS.num_epochs):
149 |
150 | x_batch, y_batch, docsize, sent_size = zip(*batch)
151 |
152 | train_step(epoch=epoch,
153 | x_batch=x_batch,
154 | y_batch=y_batch,
155 | docsize=docsize,
156 | sent_size=sent_size,
157 | is_training=True)
158 |
159 | sess.close()
--------------------------------------------------------------------------------
/src/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow-gpu==1.3
2 | keras==2.2.0
3 | pandas==0.23.3
4 | psutil==5.4.6
5 | tqdm==4.23.4
6 | more_itertools==4.2.0
7 | bs4==0.0.1
8 | lxml==4.2.3
9 | jupyter==1.0.0
--------------------------------------------------------------------------------
/src/run_all.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import os
6 | import argparse
7 |
8 | def get_args():
9 | """
10 | desc: get cli arguments
11 | returns:
12 | args: dictionary of cli arguments
13 | """
14 |
15 | parser = argparse.ArgumentParser(description="this script is used to download and process all data")
16 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str)
17 |     parser.add_argument("binary", default=True, help="coerce to binary classification", type=lambda v: str(v).lower() in ("true", "1", "yes"))  # a plain type=bool would treat the string "False" as True
18 | args = parser.parse_args()
19 | return args
20 | # end
21 |
22 | if __name__ == "__main__":
23 |
24 | args = get_args()
25 | os.system("python3 download.py {}".format(args.dataset))
26 | os.system("python3 create_csv.py {} {}".format(args.dataset, args.binary))
27 | os.system("python3 serialize_data.py {}".format(args.dataset))
28 |
--------------------------------------------------------------------------------
/src/serialize_data.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import tensorflow as tf
6 | import numpy as np
7 | import argparse
8 | import time
9 | import os
10 | import sys
11 | import pickle
12 | from tqdm import tqdm
13 | from dataProcessing import IMDB
14 | from utils import prjPaths
15 |
16 | def get_args():
17 | """
18 | desc: get cli arguments
19 | returns:
20 | args: dictionary of cli arguments
21 | """
22 |
23 |     parser = argparse.ArgumentParser(description="this script partitions the processed dataset and serializes it to binary .npy files",
24 | formatter_class=argparse.ArgumentDefaultsHelpFormatter)
25 | parser.add_argument("dataset", choices=["imdb"], default="imdb", help="dataset to use", type=str)
26 | parser.add_argument("--train_data_percentage", default=0.70, help="percent of dataset to use for training", type=float)
27 | parser.add_argument("--validation_data_percentage", default=0.20, help="percent of dataset to use for validation", type=float)
28 | parser.add_argument("--test_data_percentage", default=0.10, help="percent of dataset to use for testing", type=float)
29 | args = parser.parse_args()
30 | return args
31 | # end
32 |
33 | def _write_binaryfile(nparray, filename):
34 | """
35 | desc: write dataset partition to binary file
36 | args:
37 | nparray: dataset partition as numpy array to write to binary file
38 | filename: name of file to write dataset partition to
39 | """
40 |
41 | np.save(filename, nparray)
42 | # end
43 |
44 | def serialize_data(paths, args):
45 | """
46 |     desc: partition the dataset into train, validation, and test sets and write each split to binary .npy files
47 |     args:
48 |         paths: project paths
49 |         args: dictionary of cli arguments
50 | """
51 |
52 | if args.dataset == "imdb":
53 |
54 | # fetch imdb dataset
55 | imdb = IMDB(action="fetch")
56 | tic = time.time() # start time of data fetch
57 | x_train, y_train, x_test, y_test = imdb.partitionManager(args.dataset)
58 |
59 | toc = time.time() # end time of data fetch
60 | print("time taken to fetch {} dataset: {}(sec)".format(args.dataset, toc - tic))
61 |
62 | # kill if shapes don't make sense
63 | assert(len(x_train) == len(y_train)), "x_train length does not match y_train length"
64 | assert(len(x_test) == len(y_test)), "x_test length does not match y_test length"
65 |
66 | # combine datasets
67 | x_all = x_train + x_test
68 | y_all = np.concatenate((y_train, y_test), axis=0)
69 |
70 | # create slices
71 | train_slice_lim = int(round(len(x_all)*args.train_data_percentage))
72 | validation_slice_lim = int(round((train_slice_lim) + len(x_all)*args.validation_data_percentage))
73 |
74 | # partition dataset into train, validation, and test sets
75 | x_all, docsize, sent_size = imdb.hanformater(inputs=x_all)
76 |
77 | x_train = x_all[:train_slice_lim]
78 | y_train = y_all[:train_slice_lim]
79 | docsize_train = docsize[:train_slice_lim]
80 | sent_size_train = sent_size[:train_slice_lim]
81 |
82 |         x_val = x_all[train_slice_lim:validation_slice_lim]  # slice from train_slice_lim (not +1) so no sample is dropped at the boundary
83 |         y_val = y_all[train_slice_lim:validation_slice_lim]
84 |         docsize_val = docsize[train_slice_lim:validation_slice_lim]
85 |         sent_size_val = sent_size[train_slice_lim:validation_slice_lim]
86 | 
87 | 
88 |         x_test = x_all[validation_slice_lim:]
89 |         y_test = y_all[validation_slice_lim:]
90 |         docsize_test = docsize[validation_slice_lim:]
91 |         sent_size_test = sent_size[validation_slice_lim:]
92 |
93 | train_bin_filename_x = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_x.npy")
94 | train_bin_filename_y = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_y.npy")
95 | train_bin_filename_docsize = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_docsize.npy")
96 | train_bin_filename_sent_size = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "train_sent_size.npy")
97 |
98 | val_bin_filename_x = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_x.npy")
99 | val_bin_filename_y = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_y.npy")
100 | val_bin_filename_docsize = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_docsize.npy")
101 | val_bin_filename_sent_size = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "val_sent_size.npy")
102 |
103 | test_bin_filename_x = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_x.npy")
104 | test_bin_filename_y = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_y.npy")
105 | test_bin_filename_docsize = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_docsize.npy")
106 | test_bin_filename_sent_size = os.path.join(paths.ROOT_DATA_DIR, args.dataset, "test_sent_size.npy")
107 |
108 | _write_binaryfile(nparray=x_train, filename=train_bin_filename_x)
109 | _write_binaryfile(nparray=y_train, filename=train_bin_filename_y)
110 | _write_binaryfile(nparray=docsize_train, filename=train_bin_filename_docsize)
111 | _write_binaryfile(nparray=sent_size_train, filename=train_bin_filename_sent_size)
112 |
113 | _write_binaryfile(nparray=x_val, filename=val_bin_filename_x)
114 | _write_binaryfile(nparray=y_val, filename=val_bin_filename_y)
115 | _write_binaryfile(nparray=docsize_val, filename=val_bin_filename_docsize)
116 | _write_binaryfile(nparray=sent_size_val, filename=val_bin_filename_sent_size)
117 |
118 | _write_binaryfile(nparray=x_test, filename=test_bin_filename_x)
119 | _write_binaryfile(nparray=y_test, filename=test_bin_filename_y)
120 | _write_binaryfile(nparray=docsize_test, filename=test_bin_filename_docsize)
121 | _write_binaryfile(nparray=sent_size_test, filename=test_bin_filename_sent_size)
122 | # end
123 |
124 | if __name__ == "__main__":
125 | paths = prjPaths()
126 | args = get_args()
127 | serialize_data(paths, args=args)
128 |
--------------------------------------------------------------------------------
/src/utils.py:
--------------------------------------------------------------------------------
1 | """
2 | @author: Michael Guarino
3 | """
4 |
5 | import os
6 | import datetime
7 | import logging
8 |
9 | class prjPaths:
10 | def __init__(self):
11 | """
12 | desc: create object containing project paths
13 | """
14 |
15 | self.SRC_DIR = os.path.abspath(os.path.curdir)
16 | self.ROOT_MOD_DIR = "/".join(self.SRC_DIR.split("/")[:-1])
17 | self.ROOT_DATA_DIR = os.path.join(self.ROOT_MOD_DIR, "data")
18 | self.LIB_DIR = os.path.join(self.ROOT_MOD_DIR, "lib")
19 | self.CHECKPOINT_DIR = os.path.join(self.LIB_DIR, "chkpts")
20 | self.SUMMARY_DIR = os.path.join(self.LIB_DIR, "summaries")
21 | self.LOGS_DIR = os.path.join(self.LIB_DIR, "logs")
22 |
23 | pth_exists_else_mk = lambda path: os.mkdir(path) if not os.path.exists(path) else None
24 |
25 | pth_exists_else_mk(self.ROOT_DATA_DIR)
26 | pth_exists_else_mk(self.LIB_DIR)
27 | pth_exists_else_mk(self.CHECKPOINT_DIR)
28 | pth_exists_else_mk(self.SUMMARY_DIR)
29 | pth_exists_else_mk(self.LOGS_DIR)
30 | # end
31 | # end
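
# For reference (paths created by prjPaths above, relative to the repo root):
#   data/           -> ROOT_DATA_DIR: downloaded archive, generated csv, and serialized .npy files
#   lib/            -> LIB_DIR: pickled vocabulary lookup and persisted variables (per dataset)
#   lib/chkpts/     -> CHECKPOINT_DIR: model checkpoints written by han_trainer.py
#   lib/summaries/  -> SUMMARY_DIR: tensorboard summaries (train/val/test subdirectories)
#   lib/logs/       -> LOGS_DIR: log files created by get_logger below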
32 |
33 | def get_logger(paths):
34 |     # note: the logger level must be set explicitly (see below) or records below WARNING never reach the file
35 | currentTime = str(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S"))
36 | logFileName = os.path.join(paths.LOGS_DIR, "HAN_TxtClassification_{}.log".format(currentTime))
37 |
38 |     logger = logging.getLogger(__name__)
39 |     logger.setLevel(logging.INFO)  # the default effective level is WARNING, which silently drops info-level records
40 |     formatter = logging.Formatter("%(asctime)s:%(name)s:%(message)s")
41 | fileHandler = logging.FileHandler(logFileName)
42 | fileHandler.setLevel(logging.INFO)
43 | fileHandler.setFormatter(formatter)
44 |
45 | logger.addHandler(fileHandler)
46 |
47 | return logger
48 | # end
49 |
50 |
--------------------------------------------------------------------------------