├── README.md
├── code
│   ├── components.py
│   ├── gen_word_embeddings.py
│   ├── models.py
│   └── preprocess_reviews.py
├── data.sh
├── images
│   └── hier-atten-net.png
└── visualization
    ├── sents_in_review_visualization_0.html
    ├── sents_in_review_visualization_100.html
    ├── sents_in_review_visualization_1000.html
    ├── sents_in_review_visualization_20000.html
    ├── sents_in_review_visualization_20010.html
    ├── sents_in_review_visualization_21000.html
    └── sents_in_review_visualization_23000.html

/README.md:
--------------------------------------------------------------------------------
1 | # hierarchical-attention-model
2 | hierarchical attention model
3 | 
4 | This repo implements "Hierarchical Attention Networks for Document Classification" by Zichao Yang et al.
5 | 
6 | It benefited greatly from two resources. The foremost is Ilya Ivanov's repo on the hierarchical attention model: https://github.com/ilivans/tf-rnn-attention . I followed Ilya's approach to implementing attention and visualization; the difference is that this implementation also has sentence-level attention. The other is r2rt's code for generating batch samples for dynamic RNNs: https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html
7 | 
8 | The code was tested on the IMDB data (with only positive and negative labels).
9 | 
10 | To prepare the data:
11 | 
12 | 1. bash data.sh
13 | 
14 | This downloads the raw IMDB data and uncompresses it into the ./data/aclImdb folder, with positive samples under the 'pos' subdirectory and negative ones under 'neg'.
15 | 
16 | 2. pretrain word embeddings
17 | 
18 | I've tried training word embeddings both in a supervised fashion and in an unsupervised (pretraining) fashion. The former took more computational resources and was also prone to overfitting.
19 | 
20 | cd ./code
21 | python gen_word_embeddings.py
22 | (By default, the embedding size is 50.)
23 | 
24 | 3. preprocess reviews
25 | 
26 | Preprocess reviews: each review will be composed of max_rev_len sentences. If the original review is longer than that, we truncate it; if shorter, we append empty sentences to it. Each sentence will be composed of sent_length words. If the original sentence is longer than that, we truncate it; if shorter, we pad it with the word 'STOP'. We also keep track of the actual number of sentences each review contains.
27 | We directly read in the pre-trained embeddings. Here we take the default dictionary size to be 10000, with words indexed from 1 to 10000.
28 | Any word not included in the dictionary is marked as "UNK", and the index for "UNK" is 0. The index for "STOP" is 10001.
29 | 
30 | python preprocess_reviews.py --sent_length 70 --max_rev_length 15
31 | 
32 | 4. run the model
33 | 
34 | Train the model and evaluate it on the test set.
35 | 
36 | python models.py
37 | 
38 | --batch_size batch size (default 512)
39 | 
40 | --resume pick up the latest checkpoint and resume running
41 | 
42 | --epochs epochs (default 10)
43 | 
44 | Note:
45 | if you just want to build the model and evaluate it, you can run it in the default mode:
46 | 
47 | python models.py
48 | 
49 | if you want to pick up the latest checkpoint and resume the computation:
50 | 
51 | python models.py -r True -e 5
52 | (-e 5 means 5 more epochs after the checkpoint)
53 | 
54 | if you only want to use the latest checkpoint to do the evaluation:
55 | 
56 | python models.py -r True -e 0
57 | 
58 | 5. visualization
59 | 
60 | The visualization module is embedded in models.py.
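To make that concrete, here is a minimal, hypothetical sketch (not the repo's exact code) of how per-word attention weights can be turned into the colored HTML shown in the example files; it mirrors what visualize() in code/components.py does, with background opacity encoding the normalized word weight:

```python
# minimal sketch, assuming the per-word attention weights are already available
def words_to_html(words, word_alphas):
    m = max(word_alphas)  # scale so the most attended word gets full opacity
    spans = ['<font style="background: rgba(255, 255, 0, %f)">%s</font>' % (a / m, w)
             for w, a in zip(words, word_alphas)]
    return ' '.join(spans) + '<br>'

# example usage: "magic" gets the strongest highlight
print(words_to_html(['no', 'magic', 'here'], [0.1, 0.7, 0.2]))
```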
A few examples are contained in the visualization folder. Use any html reader to display the results. 61 | -------------------------------------------------------------------------------- /code/components.py: -------------------------------------------------------------------------------- 1 | import os, re 2 | import tensorflow as tf 3 | import numpy as np 4 | 5 | class BucketedDataIterator(): 6 | ## bucketed data iterator uses R2RT's implementation(https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html) 7 | def __init__(self, df, num_buckets = 3): 8 | df = df.sort_values('length').reset_index(drop=True) 9 | self.size = int(len(df) / num_buckets) 10 | self.dfs = [] 11 | for bucket in range(num_buckets): 12 | self.dfs.append(df.iloc[bucket*self.size: (bucket+1)*self.size]) 13 | self.num_buckets = num_buckets 14 | 15 | # cursor[i] will be the cursor for the ith bucket 16 | self.cursor = np.array([0] * num_buckets) 17 | self.shuffle() 18 | 19 | self.epochs = 0 20 | 21 | def shuffle(self): 22 | #sorts dataframe by sequence length, but keeps it random within the same length 23 | for i in range(self.num_buckets): 24 | self.dfs[i] = self.dfs[i].sample(frac=1).reset_index(drop=True) 25 | self.cursor[i] = 0 26 | 27 | def next_batch(self, n): 28 | if np.any(self.cursor+n > self.size): 29 | self.epochs += 1 30 | self.shuffle() 31 | 32 | i = np.random.randint(0, self.num_buckets) 33 | 34 | res = self.dfs[i].iloc[self.cursor[i]:self.cursor[i]+n] 35 | self.cursor[i] += n 36 | return np.asarray(res['review'].tolist()), res['label'].tolist(), res['length'].tolist() 37 | 38 | def get_sentence(vocabulary_inv, sen_index): 39 | return ' '.join([vocabulary_inv[index] for index in sen_index]) 40 | 41 | def sequence(rnn_inputs, hidden_size, seq_lens): 42 | cell_fw = tf.nn.rnn_cell.GRUCell(hidden_size) 43 | print('build fw cell: '+str(cell_fw)) 44 | cell_bw = tf.nn.rnn_cell.GRUCell(hidden_size) 45 | print('build bw cell: '+str(cell_bw)) 46 | rnn_outputs, final_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 47 | cell_bw, 48 | inputs=rnn_inputs, 49 | sequence_length=seq_lens, 50 | dtype=tf.float32 51 | ) 52 | print('rnn outputs: '+str(rnn_outputs)) 53 | print('final state: '+str(final_state)) 54 | 55 | return rnn_outputs 56 | 57 | def attention(atten_inputs, atten_size): 58 | ## attention mechanism uses Ilya Ivanov's implementation(https://github.com/ilivans/tf-rnn-attention) 59 | print('attention inputs: '+str(atten_inputs)) 60 | max_time = int(atten_inputs.shape[1]) 61 | print("max time length: "+str(max_time)) 62 | combined_hidden_size = int(atten_inputs.shape[2]) 63 | print("combined hidden size: "+str(combined_hidden_size)) 64 | W_omega = tf.Variable(tf.random_normal([combined_hidden_size, atten_size], stddev=0.1, dtype=tf.float32)) 65 | b_omega = tf.Variable(tf.random_normal([atten_size], stddev=0.1, dtype=tf.float32)) 66 | u_omega = tf.Variable(tf.random_normal([atten_size], stddev=0.1, dtype=tf.float32)) 67 | 68 | v = tf.tanh(tf.matmul(tf.reshape(atten_inputs, [-1, combined_hidden_size]), W_omega) + tf.reshape(b_omega, [1, -1])) 69 | print("v: "+str(v)) 70 | # u_omega is the summarizing question vector 71 | vu = tf.matmul(v, tf.reshape(u_omega, [-1, 1])) 72 | print("vu: "+str(vu)) 73 | exps = tf.reshape(tf.exp(vu), [-1, max_time]) 74 | print("exps: "+str(exps)) 75 | alphas = exps / tf.reshape(tf.reduce_sum(exps, 1), [-1, 1]) 76 | print("alphas: "+str(alphas)) 77 | atten_outs = tf.reduce_sum(atten_inputs * tf.reshape(alphas, [-1, max_time, 1]), 1) 78 | print("atten 
outs: "+str(atten_outs)) 79 | return atten_outs, alphas 80 | 81 | def visualize_sentence_format(sent): 82 | ## remove the trailing 'STOP' symbols from sent 83 | visual_sent = ' '.join(re.sub('STOP', '', sent).split()) 84 | return visual_sent 85 | 86 | def visualize(sess, inputs, revlens, max_rev_length, keep_probs, index2word, alphas_words, alphas_sents, x_test, y_test, y_predict, visual_sample_index): 87 | visual_dir = "../visualization" 88 | # visualization 89 | sents_visual_file = os.path.join(visual_dir, "sents_in_review_visualization_{}.html".format(visual_sample_index)) 90 | x_test_sample = x_test[visual_sample_index:visual_sample_index+1] 91 | y_test_sample = y_test[visual_sample_index:visual_sample_index+1] 92 | test_dict = {inputs:x_test_sample, revlens: [max_rev_length], keep_probs: [1.0, 1.0]} 93 | alphas_words_test, alphas_sents_test = sess.run([alphas_words, alphas_sents], feed_dict=test_dict) 94 | y_test_predict = sess.run(y_predict, feed_dict=test_dict) 95 | print("test sample is {}".format(y_test_sample[0])) 96 | print("test sample is predicted as {}".format(y_test_predict[0])) 97 | print(alphas_words_test.shape) 98 | 99 | # visualize a review 100 | sents = [get_sentence(index2word, x_test_sample[0][i]) for i in range(max_rev_length)] 101 | index_sent = 0 102 | print("sents size is {}".format(len(sents))) 103 | with open(sents_visual_file, "w") as html_file: 104 | html_file.write('actual label: %f, predicted label: %f
<br>' % (y_test_sample[0], y_test_predict[0])) 105 | for sent, alpha in zip(sents, alphas_sents_test[0] / alphas_sents_test[0].max()): 106 | if len(set(sent.split(' '))) == 1: 107 | index_sent += 1 108 | continue 109 | visual_sent = visualize_sentence_format(sent) 110 | # display each sent's importance as a colored block (background opacity = sentence weight) 111 | html_file.write('<font style="background: rgba(255, 0, 0, %f)">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</font>' % (alpha)) 112 | visual_words = visual_sent.split() 113 | visual_words_alphas = alphas_words_test[index_sent][:len(visual_words)] 114 | # for each sent, display each word's importance by color (background opacity = word weight) 115 | for word, alpha_w in zip(visual_words, visual_words_alphas / visual_words_alphas.max()): 116 | html_file.write('<font style="background: rgba(255, 255, 0, %f)">%s</font> ' % (alpha_w, word)) 117 | html_file.write('<br>
') 118 | index_sent += 1 119 | 120 | if __name__ == '__main__': 121 | pass 122 | -------------------------------------------------------------------------------- /code/gen_word_embeddings.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import walk 3 | from nltk.tokenize import RegexpTokenizer 4 | from gensim.models import Word2Vec 5 | 6 | def gen_formatted_review(data_dir, tokenizer = RegexpTokenizer(r'\w+') ): 7 | data = [] 8 | for filename in os.listdir(data_dir): 9 | file = os.path.join(data_dir, filename) 10 | with open(file) as f: 11 | content = f.readline().lower() 12 | content_formatted = tokenizer.tokenize(content) 13 | data.append(content_formatted) 14 | return data 15 | 16 | 17 | if __name__ == "__main__": 18 | working_dir = "../data/aclImdb" 19 | train_dir = os.path.join(working_dir, "train") 20 | train_pos_dir = os.path.join(train_dir, "pos") 21 | train_neg_dir = os.path.join(train_dir, "neg") 22 | test_dir = os.path.join(working_dir, "test") 23 | test_pos_dir = os.path.join(test_dir, "pos") 24 | test_neg_dir = os.path.join(test_dir, "neg") 25 | train = gen_formatted_review(train_pos_dir) 26 | train2 = gen_formatted_review(train_neg_dir) 27 | train.extend(train2) 28 | test = gen_formatted_review(test_pos_dir) 29 | test2 = gen_formatted_review(test_neg_dir) 30 | test.extend(test2) 31 | train.extend(test) 32 | embedding_size = 50 33 | fname = os.path.join(working_dir, "imdb_embedding") 34 | if os.path.isfile(fname): 35 | embedding_model = Word2Vec.load(fname) 36 | else: 37 | embedding_model = Word2Vec(train, size=embedding_size, window=5, min_count=5) 38 | embedding_model.save(fname) 39 | word1 = "great" 40 | word2 = "horrible" 41 | print("similar words of {}:".format(word1)) 42 | print(embedding_model.most_similar('great')) 43 | print("similar words of {}:".format(word2)) 44 | print(embedding_model.most_similar('horrible')) 45 | pass 46 | -------------------------------------------------------------------------------- /code/models.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import argparse 4 | import numpy as np 5 | import pickle as pl 6 | import pandas as pd 7 | import tensorflow as tf 8 | from components import sequence, attention, BucketedDataIterator, get_sentence, visualize_sentence_format, visualize 9 | 10 | def build_graph( 11 | inputs, 12 | revlens, 13 | keep_probs, 14 | hidden_size = 50, 15 | atten_size = 50, 16 | nclasses = 2, 17 | embeddings = None 18 | ): 19 | 20 | # Placeholders 21 | print(inputs.shape) 22 | print(revlens.shape) 23 | 24 | max_rev_length = int(inputs.shape[1]) 25 | sent_length = int(inputs.shape[2]) 26 | print(max_rev_length, sent_length) 27 | 28 | _, embedding_size = embeddings.shape 29 | word_rnn_inputs = tf.nn.embedding_lookup( tf.convert_to_tensor(embeddings, np.float32), inputs) 30 | print("word rnn inputs: "+str(word_rnn_inputs)) 31 | word_rnn_inputs_formatted = tf.reshape(word_rnn_inputs, [-1, sent_length, embedding_size]) 32 | print('word rnn inputs formatted: '+str(word_rnn_inputs_formatted)) 33 | 34 | reuse_value = None 35 | 36 | with tf.variable_scope("word_rnn"): 37 | word_rnn_outputs = sequence(word_rnn_inputs_formatted, hidden_size, None) 38 | 39 | # now add the attention mech on words: 40 | # Attention mechanism at word level 41 | 42 | atten_inputs = tf.concat(word_rnn_outputs, 2) 43 | combined_hidden_size = int(atten_inputs.shape[2]) 44 | 45 | atten_inputs = tf.nn.dropout(atten_inputs, keep_probs[0]) 
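# shape note: inputs is [batch, max_rev_length, sent_length]; word_rnn_inputs_formatted is [batch*max_rev_length, sent_length, embedding_size], so the word-level GRU and attention run over every sentence of every review at once; sent_outs is reshaped back to [batch, max_rev_length, 2*hidden_size] below before the sentence-level GRU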
46 | with tf.variable_scope("word_atten"): 47 | sent_outs, alphas_words = attention(atten_inputs, atten_size) 48 | 49 | sent_outs_formatted = tf.reshape(sent_outs, [-1, max_rev_length, combined_hidden_size]) 50 | print("sent outs formatted: "+str(sent_outs_formatted)) 51 | sent_rnn_inputs_formatted = sent_outs_formatted 52 | print('sent rnn inputs formatted: '+str(sent_rnn_inputs_formatted)) 53 | 54 | with tf.variable_scope("sent_rnn"): 55 | sent_rnn_outputs = sequence(sent_rnn_inputs_formatted, hidden_size, revlens) 56 | 57 | # attention at sentence level: 58 | sent_atten_inputs = tf.concat(sent_rnn_outputs, 2) 59 | sent_atten_inputs = tf.nn.dropout(sent_atten_inputs, keep_probs[1]) 60 | 61 | with tf.variable_scope("sent_atten"): 62 | rev_outs, alphas_sents = attention(sent_atten_inputs, atten_size) 63 | 64 | with tf.variable_scope("out_weights1", reuse=reuse_value) as out: 65 | weights_out = tf.get_variable(name="output_w", dtype=tf.float32, shape=[hidden_size*2, nclasses]) 66 | biases_out = tf.get_variable(name="output_bias", dtype=tf.float32, shape=[nclasses]) 67 | dense = tf.matmul(rev_outs, weights_out) + biases_out 68 | print(dense) 69 | 70 | return dense, alphas_words, alphas_sents 71 | 72 | 73 | if __name__=="__main__": 74 | parser = argparse.ArgumentParser(description='Parameters for building the model.') 75 | parser.add_argument('-b', '--batch_size', type=int, default=512, 76 | help='training batch size') 77 | parser.add_argument('-r', '--resume', type=bool, default=False, 78 | help='pick up the latest check point and resume') 79 | parser.add_argument('-e', '--epochs', type=int, default=10, 80 | help='epochs for training') 81 | 82 | args = parser.parse_args() 83 | train_batch_size = args.batch_size 84 | resume = args.resume 85 | epochs = args.epochs 86 | 87 | working_dir = "../data/aclImdb" 88 | log_dir = "../logs" 89 | train_filename = os.path.join(working_dir, "train_df_file") 90 | test_filename = os.path.join(working_dir, "test_df_file") 91 | emb_filename = os.path.join(working_dir, "emb_matrix") 92 | print("load dataframe for training...") 93 | df_train = pd.read_pickle(train_filename) 94 | max_rev_length, sent_length = df_train['review'][0].shape 95 | print("load dataframe for testing...") 96 | df_test = pd.read_pickle(test_filename) 97 | print(df_test.shape) 98 | print("load embedding matrix...") 99 | (emb_matrix, word2index, index2word) = pl.load(open(emb_filename, "rb")) 100 | 101 | nclasses = 2 102 | y_ = tf.placeholder(tf.int32, shape=[None, nclasses]) 103 | inputs = tf.placeholder(tf.int32, [None, max_rev_length, sent_length]) 104 | revlens = tf.placeholder(tf.int32, [None]) 105 | keep_probs = tf.placeholder(tf.float32, [2]) 106 | 107 | dense, alphas_words, alphas_sents = build_graph(inputs, revlens, keep_probs, embeddings=emb_matrix, nclasses=nclasses) 108 | cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=dense)) 109 | with tf.variable_scope('optimizers', reuse=None): 110 | optimizer = tf.train.AdamOptimizer(0.01).minimize(cross_entropy) 111 | y_predict = tf.argmax(dense, 1) 112 | correct_prediction = tf.equal(y_predict, tf.argmax(y_, 1)) 113 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 114 | saver = tf.train.Saver() 115 | tf.summary.scalar("cost", cross_entropy) 116 | tf.summary.scalar("accuracy", accuracy) 117 | summary_op = tf.summary.merge_all() 118 | 119 | total_batch = int(len(df_train)/(train_batch_size)) 120 | 121 | num_buckets = 3 122 | data = BucketedDataIterator(df_train, num_buckets) 123 | 
124 | depth = nclasses 125 | on_value = 1 126 | off_value = 0 127 | 128 | with tf.Session() as sess: 129 | train_writer = tf.summary.FileWriter(log_dir, sess.graph) 130 | sess.run(tf.global_variables_initializer()) 131 | sess.run(tf.local_variables_initializer()) 132 | # insert this snippet to restore a model: 133 | resume_from_epoch = -1 134 | if resume: 135 | latest_cpt_file = tf.train.latest_checkpoint('../logs') 136 | print("the code pick up from lateset checkpoint file: {}".format(latest_cpt_file)) 137 | resume_from_epoch = int(str(latest_cpt_file).split('-')[1]) 138 | print("it resumes from previous epoch of {}".format(resume_from_epoch)) 139 | saver.restore(sess, latest_cpt_file) 140 | for epoch in range(resume_from_epoch+1, resume_from_epoch+epochs+1): 141 | avg_cost = 0.0 142 | print("epoch {}".format(epoch)) 143 | for i in range(total_batch): 144 | batch_data, batch_label, seqlens = data.next_batch(train_batch_size) 145 | batch_label_formatted = tf.one_hot(indices=batch_label, depth=depth, on_value=on_value, off_value=off_value, axis=-1) 146 | 147 | batch_labels = sess.run(batch_label_formatted) 148 | feed = {inputs: batch_data, revlens: seqlens, y_: batch_labels, keep_probs: [0.9, 0.9]} 149 | _, c, summary_in_batch_train = sess.run([optimizer, cross_entropy, summary_op], feed_dict=feed) 150 | avg_cost += c/total_batch 151 | train_writer.add_summary(summary_in_batch_train, epoch*total_batch + i) 152 | saver.save(sess, os.path.join(log_dir, "model.ckpt"), epoch, write_meta_graph=False) 153 | print("avg cost in the training phase epoch {}: {}".format(epoch, avg_cost)) 154 | 155 | print("evaluating...") 156 | 157 | x_test = np.asarray(df_test['review'].tolist()) 158 | y_test = df_test['label'].values.tolist() 159 | test_review_lens = df_test['length'].tolist() 160 | test_batch_size = 1000 161 | total_batch2 = int(len(x_test)/(test_batch_size)) 162 | avg_accu = 0.0 163 | 164 | for i in range(total_batch2): 165 | #for i in range(0): 166 | batch_x = x_test[i*test_batch_size:(i+1)*test_batch_size] 167 | batch_y = y_test[i*test_batch_size:(i+1)*test_batch_size] 168 | batch_seqlen = test_review_lens[i*test_batch_size:(i+1)*test_batch_size] 169 | 170 | batch_label_formatted2 =tf.one_hot(indices=batch_y, depth=depth, on_value=on_value, off_value=off_value, axis=-1) 171 | 172 | batch_labels2 = sess.run(batch_label_formatted2) 173 | feed = {inputs: batch_x, revlens: batch_seqlen, y_: batch_labels2, keep_probs: [1.0, 1.0]} 174 | accu = sess.run(accuracy, feed_dict=feed) 175 | avg_accu += 1.0*accu/total_batch2 176 | 177 | print("prediction accuracy on test set is {}".format(avg_accu)) 178 | visual_sample_index = 99 179 | visualize(sess, inputs, revlens, max_rev_length, keep_probs, index2word, alphas_words, alphas_sents, x_test, y_test, y_predict, visual_sample_index) 180 | -------------------------------------------------------------------------------- /code/preprocess_reviews.py: -------------------------------------------------------------------------------- 1 | import os, re 2 | import argparse 3 | import numpy as np 4 | import pickle as pl 5 | from os import walk 6 | from gensim.models import Word2Vec 7 | from nltk.tokenize import RegexpTokenizer 8 | import pandas as pd 9 | from tensorflow.contrib.keras import preprocessing 10 | 11 | def build_emb_matrix_and_vocab(embedding_model, keep_in_dict=10000, embedding_size=50): 12 | # 0 th element is the default vector for unknowns. 
13 | emb_matrix = np.zeros((keep_in_dict+2, embedding_size)) 14 | word2index = {} 15 | index2word = {} 16 | for k in range(1, keep_in_dict+1): 17 | word = embedding_model.wv.index2word[k-1] 18 | emb_matrix[k] = embedding_model[word] 19 | word2index[word] = k 20 | index2word[k] = word 21 | word2index['UNK'] = 0 22 | index2word[0] = 'UNK' 23 | word2index['STOP'] = keep_in_dict+1 24 | index2word[keep_in_dict+1] = 'STOP' 25 | return emb_matrix, word2index, index2word 26 | 27 | def sent2index(sent, word2index): 28 | words = sent.strip().split(' ') 29 | sent_index = [word2index[word] if word in word2index else 0 for word in words] 30 | return sent_index 31 | 32 | def get_sentence(index2word, sen_index): 33 | return ' '.join([index2word[index] for index in sen_index]) 34 | 35 | def gen_data(data_dir, word2index): 36 | data = [] 37 | tokenizer = RegexpTokenizer(r'\w+|[?!]|\.{1,}') 38 | for filename in os.listdir(data_dir): 39 | file = os.path.join(data_dir, filename) 40 | with open(file) as f: 41 | content = f.readline().lower() 42 | content_formatted = ' '.join(tokenizer.tokenize(content))[:-1] 43 | sents = re.compile("[?!]|\.{1,}").split(content_formatted) 44 | sents_index = [sent2index(sent, word2index) for sent in sents] 45 | data.append(sents_index) 46 | return data 47 | 48 | def preprocess_review(data, sent_length, max_rev_len, keep_in_dict=10000): 49 | ## As a result, each review will be composed of max_rev_len sentences. If the original review is longer than that, we truncate it, and if shorter than that, we append empty sentences to it. And each sentence will be composed of sent_length words. If the original sentence is longer than that, we truncate it, and if shorter, we pad it with the word 'STOP'. Also, we keep track of the actual number of sentences each review contains. 
50 | data_formatted = [] 51 | review_lens = [] 52 | for review in data: 53 | review_formatted = preprocessing.sequence.pad_sequences(review, maxlen=sent_length, padding="post", truncating="post", value=keep_in_dict+1) 54 | review_len = review_formatted.shape[0] 55 | review_lens.append(review_len if review_len<=max_rev_len else max_rev_len) 56 | lack_len = max_rev_length - review_len 57 | review_formatted_right_len = review_formatted 58 | if lack_len > 0: 59 | #extra_rows = np.zeros([lack_len, sent_length], dtype=np.int32) 60 | extra_rows = np.full((lack_len, sent_length), keep_in_dict+1) 61 | review_formatted_right_len = np.append(review_formatted, extra_rows, axis=0) 62 | elif lack_len < 0: 63 | row_index = [max_rev_length+i for i in list(range(0, -lack_len))] 64 | review_formatted_right_len = np.delete(review_formatted, row_index, axis=0) 65 | data_formatted.append(review_formatted_right_len) 66 | return data_formatted, review_lens 67 | 68 | if __name__ == "__main__": 69 | parser = argparse.ArgumentParser(description='Process some important parameters.') 70 | parser.add_argument('-s', '--sent_length', type=int, default=70, 71 | help='fix the sentence length in all reviews') 72 | parser.add_argument('-r', '--max_rev_length', type=int, default=15, 73 | help='fix the maximum review length') 74 | 75 | args = parser.parse_args() 76 | sent_length = args.sent_length 77 | max_rev_length = args.max_rev_length 78 | 79 | print('sent length is set as {}'.format(sent_length)) 80 | print('rev length is set as {}'.format(max_rev_length)) 81 | working_dir = "../data/aclImdb" 82 | fname = os.path.join(working_dir, "imdb_embedding") 83 | train_dir = os.path.join(working_dir, "train") 84 | train_pos_dir = os.path.join(train_dir, "pos") 85 | train_neg_dir = os.path.join(train_dir, "neg") 86 | test_dir = os.path.join(working_dir, "test") 87 | test_pos_dir = os.path.join(test_dir, "pos") 88 | test_neg_dir = os.path.join(test_dir, "neg") 89 | 90 | if os.path.isfile(fname): 91 | embedding_model = Word2Vec.load(fname) 92 | else: 93 | print("please run gen_word_embeddings.py first to generate embeddings!") 94 | exit(1) 95 | print("generate word to index dictionary and inverse dictionary...") 96 | emb_matrix, word2index, index2word = build_emb_matrix_and_vocab(embedding_model) 97 | print("format each review into sentences, and also represent each word by index...") 98 | train_pos_data = gen_data(train_pos_dir, word2index) 99 | train_neg_data = gen_data(train_neg_dir, word2index) 100 | train_data = train_neg_data + train_pos_data 101 | test_pos_data = gen_data(test_pos_dir, word2index) 102 | test_neg_data = gen_data(test_neg_dir, word2index) 103 | test_data = test_neg_data + test_pos_data 104 | 105 | print("preprocess each review...") 106 | x_train, train_review_lens = preprocess_review(train_data, sent_length, max_rev_length) 107 | x_test, test_review_lens = preprocess_review(test_data, sent_length, max_rev_length) 108 | y_train = [0]*len(train_neg_data)+[1]*len(train_pos_data) 109 | y_test = [0]*len(test_neg_data)+[1]*len(test_pos_data) 110 | 111 | print("save word embedding matrix ...") 112 | emb_filename = os.path.join(working_dir, "emb_matrix") 113 | #emb_matrix.dump(emb_filename) 114 | pl.dump([emb_matrix, word2index, index2word], open(emb_filename, "wb")) 115 | 116 | print("save review data for training...") 117 | df_train = pd.DataFrame({'review':x_train, 'label':y_train, 'length':train_review_lens}) 118 | train_filename = os.path.join(working_dir, "train_df_file") 119 | df_train.to_pickle(train_filename) 
120 | 121 | print("save review data for testing...") 122 | df_test = pd.DataFrame({'review':x_test, 'label':y_test, 'length':test_review_lens}) 123 | test_filename = os.path.join(working_dir, "test_df_file") 124 | df_test.to_pickle(test_filename) 125 | -------------------------------------------------------------------------------- /data.sh: -------------------------------------------------------------------------------- 1 | mkdir ./data 2 | cd ./data 3 | wget http://ai.stanford.edu/\~amaas/data/sentiment/aclImdb_v1.tar.gz 4 | tar -xzvf aclImdb_v1.tar.gz 5 | 6 | -------------------------------------------------------------------------------- /images/hier-atten-net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/triplemeng/hierarchical-attention-model/d665a33e83b2e92754bf9ccd2ba0db1c67ebddc1/images/hier-atten-net.png -------------------------------------------------------------------------------- /visualization/sents_in_review_visualization_0.html: -------------------------------------------------------------------------------- 1 | actual label: 0.000000, predicted label: 0.000000
     once again mr
     costner has dragged out a movie for far longer than necessary
     aside from the terrific sea rescue sequences of which there are very few i just did not care about any of the characters
     most of us have ghosts in the closet and costner s character are realized early on and then forgotten until much later by which time i did not care
     the character we should really care about is a very cocky UNK ashton kutcher
     the problem is he comes off as kid who thinks he s better than anyone else around him and shows no signs of a UNK closet
     his only UNK appears to be winning over costner
     finally when we are well past the half way point of this stinker costner tells us all about kutcher s ghosts
     we are told why kutcher is driven to be the best with no prior UNK or UNK
     no magic here it was all i could do to keep from turning it off an hour in
-------------------------------------------------------------------------------- /visualization/sents_in_review_visualization_100.html: -------------------------------------------------------------------------------- 1 | actual label: 0.000000, predicted label: 0.000000
     imagine every stereotypical UNK cliche from every movie and tv show set on the streets of brooklyn between 1930 and 1980
     UNK it with a cast of UNK caricatures instead of actual characters
     throw in a mix of period music and UNK electric UNK during the rumble scenes
     then pass the time trying to figure out or care which of the UNK is going to be killed in the anti climactic final rumble
     br br i ll give this movie points for not being just another romantic comedy teen slasher explosive action movie teen sex comedy kiddie musical or oscar nomination vehicle
     but bringing something new or interesting to the street gang tragedy genre might ve been nice
-------------------------------------------------------------------------------- /visualization/sents_in_review_visualization_1000.html: -------------------------------------------------------------------------------- 1 | actual label: 0.000000, predicted label: 0.000000
     but it s not
     the plot isn t all that bad the actors aren t all terrible so it should be decent
     instead though despite a good starting point the plot just drags on and suffers from a lot of those i can t believe he she is so dumb moments so often used in horror movies to keep things going
     it frustrated me at times watching some of the decision made by the lead character
     also it took way too long to get to the good part of the movie
     anticipation is great but you can t spend over half the movie building it up
     a shame too since it got decent exposure upon release and hit right before the big halloween season
     even so i have a feeling this is going to get at least one sequel if not more so maybe they ll be able to build on the strong general plot to eventually release something decent
-------------------------------------------------------------------------------- /visualization/sents_in_review_visualization_20000.html: -------------------------------------------------------------------------------- 1 | actual label: 1.000000, predicted label: 1.000000
     in the line of fire is one of those hollywood films that shows up on tv quite a bit but although i ve seen it a few times i usually end up sitting through the whole thing again
     why
     it s good
     clint eastwood is great as usual and the character he plays is interesting and more fleshed out than usual
     the character secret service agent frank UNK is haunted by the fact that he was on the detail that failed to protect president kennedy in dallas and now he s forced to match wits with a professional assassin that is openly UNK that he will kill the president
     however the film doesn t make him a depressed brooding and obsessed character
     he s charming and UNK and is realistic as a guy that has experienced a lot in life and is comfortable in his own skin
     he s even quite convincing when he UNK with the pretty younger agent played by rene russo
     the killer played by john malkovich at his best is cerebral deliberate and enjoys playing high stakes games of life and death
     he even goes by the name of another presidential assassin john booth
     br br the film is consistently enjoyable and it delivers all the goods suspense action romance and drama all in their proper amounts
     it s a fun film that is really helped by the great actors in it
-------------------------------------------------------------------------------- /visualization/sents_in_review_visualization_20010.html: -------------------------------------------------------------------------------- 1 | actual label: 1.000000, predicted label: 1.000000
     my one line summary hints that this is not a good film but this is not true
     i did enjoy the movie but was probably expecting too much
     br br adele who is UNK portrayed by susan sarandon did not come off as a very likable character
     she is UNK and irresponsible to what would be an unforgivable degree were it not for the tremendous love she has for her daughter
     this is the one thing she knows how to do without fail
     adele s daughter anna is a sad girl who is so busy making up for her mother s shortcomings that she does not seem to be only 14 17 years old
     this of course makes natalie portman the perfect choice to play anna since she never seems to be 14 17 years old either
     portman pulls this role off with such ease that you almost forget that she has not been making movies for 20 years
     yet even with the two solid leads wayne wang never seems to quite draw the audience in as he did with the joy luck luck and even more so with smoke
     though i have not read the book the film feels as if it has made necessary changes to the story to bring it to the big screen changes which may drain the emotional UNK of the story
     i enjoyed the film for the fun of watching two wonderful actresses do their work but i never got lost in the experience and i never related to their plight
-------------------------------------------------------------------------------- /visualization/sents_in_review_visualization_21000.html: -------------------------------------------------------------------------------- 1 | actual label: 1.000000, predicted label: 1.000000
     i think that key west might do well as a dvd
     there probably are a lot of failed star UNK that just never had a chance to succeed
     we will never know if this could have been a great series
     i would love to know if there is a way to see older shows like this or are they just another hollywood UNK
     is it possible to find copies of these shows so that we loyal UNK can enjoy them again
     the show had a great writing talent and some if not all of the episodes left you with a feel for the characters that is often missing in todays hit shows
     i often came away with a sense of learning something from the story lines and greatly entertained by the very unique characters
     thank you
-------------------------------------------------------------------------------- /visualization/sents_in_review_visualization_23000.html: -------------------------------------------------------------------------------- 1 | actual label: 1.000000, predicted label: 1.000000
     i ve read some UNK about the court scenes
     these people UNK their ignorance
     this production went to simply amazing lengths to recreate all aspects of the period in which the story occurred
     UNK manners are something few people outside the court ever see
     while the acting may appear highly stylized it is in fact as close a UNK as possible of the behavior of individuals in their particular stations as the director could create
     the actor s facial expressions are a marvel particularly the UNK UNK UNK and the king s mother
     br br there are of course UNK of both greek and UNK tragedy in the relationship between the king his parents and his love
     the UNK of the king UNK from good to bad and the assassin from bad to good provides much food for thought on the evolution of an individual s nature
     this movie would provide much to ponder in a college course on the UNK
     br br at the same time it almost rushes along even in the UNK scenes heading towards an UNK denouement
     one suspects the involvement of large portions of the UNK movements which were quite awesome
     it makes the lord of the rings battle scenes pale by comparison
     few directors have the ability to literally field thousands of humans on the field of battle just for art s sake
     i recall one scene in which at least 30 000 troops can be seen moving across a huge plain
     the UNK for such a shot would have been UNK
--------------------------------------------------------------------------------