├── .gitignore ├── LICENSE ├── README.md ├── alt-version ├── data_helpers.py ├── eval.py ├── model.py └── train.py ├── cnn-model ├── cnn_model.py ├── data_helpers.py ├── eval.py └── train.py ├── data └── rt-polaritydata │ ├── rt-polarity.neg │ └── rt-polarity.pos ├── data_helpers.py ├── eval.py ├── model.py ├── res ├── acc-val.png ├── acc.png ├── bidirectional-rnn.png ├── cnn-128.png ├── loss-val.png ├── loss.png ├── lstm+cnn-128.png └── lstm+cnn-300.png ├── runs ├── cnn-128 │ └── events.out.tfevents.1483714098.FYP6 ├── lstm+cnn-128 │ └── events.out.tfevents.1483625861.FYP6 └── lstm+cnn-300 │ └── events.out.tfevents.1483786544.FYP6 ├── tflearn ├── cnn.py └── model.py └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | runs/1484149652 2 | runs/1484150035 3 | runs/1484150236 4 | runs/1484151924 5 | GoogleNews-vectors-negative300.bin 6 | imdb.pkl 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | env/ 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *,cover 54 | .hypothesis/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # pyenv 81 | .python-version 82 | 83 | # celery beat schedule file 84 | celerybeat-schedule 85 | 86 | # dotenv 87 | .env 88 | 89 | # virtualenv 90 | .venv/ 91 | venv/ 92 | ENV/ 93 | 94 | # Spyder project settings 95 | .spyderproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Chaitanya Joshi 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | Presented here is a method to modify the word embeddings of a word in a sentence with its surrounding context using a bidirectional Recurrent Neural Network (RNN). The hypothesis is that these modified embeddings are a better input for performing text classification tasks like sentiment analysis or polarity detection. 3 | 4 | **Read the full blog post here: [chaitjo.github.io/context-embeddings](https://chaitjo.github.io/context-embeddings/)** 5 | 6 | --- 7 | 8 | ![Bidirectional RNN layer](res/bidirectional-rnn.png) 9 | 10 | # Implementation 11 | The code implements the proposed model as a pre-processing layer before feeding it into a [Convolutional Neural Network for Sentence Classification](https://arxiv.org/pdf/1408.5882v2.pdf) (Kim, 2014). Two implementations are provided to run experiments: one with [tensorflow](https://www.tensorflow.org/) and one with [tflearn](http://tflearn.org/) (A high-level API for tensorflow). Training happens end-to-end in a supervised manner: the RNN layer is simply inserted as part of the existing model's architecture for text classification. 12 | 13 | The tensorflow version is built on top of [Denny Britz's implementation of Kim's CNN](https://github.com/dennybritz/cnn-text-classification-tf), and also allows loading pre-trained word2vec embeddings. 14 | 15 | Although both versions work exactly as intended, results in the blog post are from experiments with the tflearn version only. 16 | 17 | # Usage 18 | I used Python 3.6 and Tensorflow 0.12.1 for my experiments. 19 | Tensorflow code is divided into `model.py` which abstracts the model as a class, and `train.py` which is used to train the model. It can be executed by running the `train.py` script (with optional flags to set hyperparameters)- 20 | ``` 21 | $ python train.py [--flag=1] 22 | ``` 23 | (Tensorflow code for Kim's baseline CNN can be found in `/cnn-model`.) 
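For example, to train with pre-trained word2vec vectors, the hyperparameter flags defined in the training scripts can be set explicitly. The values below are purely illustrative, and the word2vec path assumes the GoogleNews binary has been downloaded into the repository root-
```
$ python train.py --word2vec=GoogleNews-vectors-negative300.bin --embedding_dim=300 --num_filters=100 --dropout_keep_prob=0.5 --batch_size=50 --num_epochs=25
```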
24 | 25 | Tflearn code can be found in the `/tflearn` folder and can be run directly to start training (with optional flags to set hyperparameters)- 26 | ``` 27 | $ python tflearn/model.py [--flag=1] 28 | ``` 29 | 30 | The summaries generated during training (saved in `/runs` by default) can be used to visualize results using tensorboard with the following command- 31 | ``` 32 | $ tensorboard --logdir= 33 | ``` 34 | -------------------------------------------------------------------------------- /alt-version/data_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | 6 | 7 | def clean_str(string): 8 | """ 9 | Tokenization/string cleaning 10 | """ 11 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 12 | string = re.sub(r"\'s", " \'s", string) 13 | string = re.sub(r"\'ve", " \'ve", string) 14 | string = re.sub(r"n\'t", " n\'t", string) 15 | string = re.sub(r"\'re", " \'re", string) 16 | string = re.sub(r"\'d", " \'d", string) 17 | string = re.sub(r"\'ll", " \'ll", string) 18 | string = re.sub(r",", " , ", string) 19 | string = re.sub(r"!", " ! ", string) 20 | string = re.sub(r"\(", " \( ", string) 21 | string = re.sub(r"\)", " \) ", string) 22 | string = re.sub(r"\?", " \? ", string) 23 | string = re.sub(r"\s{2,}", " ", string) 24 | 25 | return string.strip().lower() 26 | 27 | 28 | def load_data_and_labels(): 29 | """ 30 | Loads polarity data from files, splits the data into words and generates labels. 31 | Returns split sentences and labels. 32 | """ 33 | 34 | # Load data from files 35 | positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines()) 36 | positive_examples = [s.strip() for s in positive_examples] 37 | negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines()) 38 | negative_examples = [s.strip() for s in negative_examples] 39 | 40 | # Split by words 41 | x_text = positive_examples + negative_examples 42 | x_text = [clean_str(sent) for sent in x_text] 43 | 44 | # Generate labels 45 | positive_labels = [[0, 1] for _ in positive_examples] 46 | negative_labels = [[1, 0] for _ in negative_examples] 47 | y = np.concatenate([positive_labels, negative_labels], 0) 48 | 49 | return [x_text, y] 50 | 51 | 52 | def batch_iter(data, batch_size, num_epochs, shuffle=True): 53 | """ 54 | Generates a batch iterator for a dataset. 55 | """ 56 | data = np.array(data) 57 | data_size = len(data) 58 | num_batches_per_epoch = int(len(data)/batch_size) + 1 59 | 60 | for epoch in range(num_epochs): 61 | # Shuffle the data at each epoch 62 | if shuffle: 63 | shuffle_indices = np.random.permutation(np.arange(data_size)) 64 | shuffled_data = data[shuffle_indices] 65 | else: 66 | shuffled_data = data 67 | 68 | for batch_num in range(num_batches_per_epoch): 69 | start_index = batch_num * batch_size 70 | end_index = min((batch_num + 1) * batch_size, data_size) 71 | yield shuffled_data[start_index:end_index] 72 | 73 | 74 | def pad_sentences(sentences, padding_word="", max_filter=5): 75 | """ 76 | Pads all sentences to the same length. The length is defined by the longest sentence. 77 | Returns padded sentences. 78 | """ 79 | 80 | # Using this might improve accuracy... 
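    # Example: with the default max_filter=5, pad_filter below is 4, so every
    # sentence gets max_filter padding tokens on the left and enough on the
    # right for all padded sentences to end up the same length. If the longest
    # sentence has 20 tokens, sequence_length is 28; a 10-token sentence then
    # gets 5 left pads and 14 right pads (total 29), matching the longest
    # sentence's 5 left pads and 4 right pads.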
81 | 82 | pad_filter = max_filter -1 83 | sequence_length = max(len(x) for x in sentences) + 2*pad_filter 84 | 85 | padded_sentences = [] 86 | for i in range(len(sentences)): 87 | sentence = sentences[i] 88 | num_padding = sequence_length - len(sentence) - pad_filter 89 | new_sentence = [padding_word]*max_filter + sentence + [padding_word] * num_padding 90 | padded_sentences.append(new_sentence) 91 | 92 | return padded_sentences 93 | 94 | -------------------------------------------------------------------------------- /alt-version/eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from text_lstm import TextLSTM 10 | from tensorflow.contrib import learn 11 | 12 | 13 | # Parameters 14 | # ================================================== 15 | 16 | # Eval Parameters 17 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 18 | tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run") 19 | tf.flags.DEFINE_boolean("eval_train", False, "Evaluate on all training data") 20 | 21 | # Misc Parameters 22 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 23 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 24 | 25 | 26 | FLAGS = tf.flags.FLAGS 27 | FLAGS._parse_flags() 28 | print("\nParameters:") 29 | for attr, value in sorted(FLAGS.__flags.items()): 30 | print("{}={}".format(attr.upper(), value)) 31 | print("") 32 | 33 | # Load new datasets here 34 | if FLAGS.eval_train: 35 | x_raw, y_test = data_helpers.load_data_and_labels() 36 | y_test = np.argmax(y_test, axis=1) 37 | else: 38 | x_raw = ["a masterpiece four years in the making", "everything is off."] 39 | y_test = [1, 0] 40 | 41 | # Map data into vocabulary 42 | vocab_path = os.path.join(FLAGS.checkpoint_dir, "..", "vocab") 43 | vocab_processor = learn.preprocessing.VocabularyProcessor.restore(vocab_path) 44 | x_test = np.array(list(vocab_processor.transform(x_raw))) 45 | 46 | print("\nEvaluating...\n") 47 | 48 | # Evaluation 49 | # ================================================== 50 | checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir) 51 | graph = tf.Graph() 52 | with graph.as_default(): 53 | session_conf = tf.ConfigProto( 54 | allow_soft_placement=FLAGS.allow_soft_placement, 55 | log_device_placement=FLAGS.log_device_placement) 56 | sess = tf.Session(config=session_conf) 57 | 58 | with sess.as_default(): 59 | # Load the saved meta graph and restore variables 60 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 61 | saver.restore(sess, checkpoint_file) 62 | 63 | # Get the placeholders from the graph by name 64 | input_x = graph.get_operation_by_name("input_x").outputs[0] 65 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 66 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 67 | 68 | # Tensors we want to evaluate 69 | predictions = graph.get_operation_by_name("output/predictions").outputs[0] 70 | 71 | # Generate batches for one epoch 72 | batches = data_helpers.batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False) 73 | 74 | # Collect the predictions here 75 | all_predictions = [] 76 | 77 | for x_test_batch in batches: 78 | batch_predictions = sess.run( 79 | predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0}) 80 | 
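        # batch_predictions contains one integer class index per example (the
        # model's argmax over the class scores); the batches are concatenated
        # below and compared against y_test to report accuracy.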
all_predictions = np.concatenate([all_predictions, batch_predictions]) 81 | 82 | # Print accuracy if y_test is defined 83 | if y_test is not None: 84 | correct_predictions = float(sum(all_predictions == y_test)) 85 | print("Total number of test examples: {}".format(len(y_test))) 86 | print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) 87 | -------------------------------------------------------------------------------- /alt-version/model.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | class Model(object): 5 | def __init__( 6 | self, sequence_length, num_classes, vocab_size, 7 | embedding_size, hidden_size, 8 | filter_sizes, num_filters, l2_reg_lambda=0.0): 9 | 10 | # Placeholders for input, output and dropout 11 | self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x") 12 | self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y") 13 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 14 | 15 | # Keeping track of l2 regularization loss (optional) 16 | l2_loss = tf.constant(0.0) 17 | 18 | # Embedding layer 19 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 20 | self.W = tf.Variable( 21 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 22 | trainable=True, 23 | name="W") 24 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 25 | 26 | with tf.name_scope("bidirectional-lstm"): 27 | b = tf.Variable(tf.constant(0.1, shape=[hidden_size]), name="b") 28 | lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 29 | lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 30 | 31 | self.lstm_outputs, _, _ = tf.nn.bidirectional_dynamic_rnn(lstm_fw_cell, lstm_bw_cell, self.embedded_chars, dtype=tf.float32) 32 | self.lstm_outputs = tf.nn.bias_add(self.lstm_outputs, b) 33 | lstm_outputs_fw, lstm_outputs_bw = tf.split(value=self.lstm_outputs, split_dim=2, num_split=2) 34 | self.lstm_outputs = tf.add(lstm_outputs_fw, lstm_outputs_bw, name="lstm_outputs") 35 | 36 | self.lstm_outputs_expanded = tf.expand_dims(self.lstm_outputs, -1) 37 | 38 | # Create a convolution + maxpool layer for each filter size 39 | pooled_outputs = [] 40 | for i, filter_size in enumerate(filter_sizes): 41 | with tf.name_scope("conv-maxpool-%s" % filter_size): 42 | # Convolution Layer 43 | filter_shape = [filter_size, hidden_size, 1, num_filters] 44 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 45 | b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") 46 | 47 | conv = tf.nn.conv2d( 48 | self.lstm_outputs_expanded, 49 | W, 50 | strides=[1, 1, 1, 1], 51 | padding="VALID", 52 | name="conv") 53 | 54 | # Apply nonlinearity 55 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 56 | 57 | # Maxpooling over the outputs 58 | pooled = tf.nn.max_pool( 59 | h, 60 | ksize=[1, sequence_length - filter_size + 1, 1, 1], 61 | strides=[1, 1, 1, 1], 62 | padding='VALID', 63 | name="pool") 64 | pooled_outputs.append(pooled) 65 | 66 | # Combine all the pooled features 67 | num_filters_total = num_filters * len(filter_sizes) 68 | self.h_pool = tf.concat(3, pooled_outputs) 69 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 70 | 71 | # Add dropout 72 | with tf.name_scope("dropout"): 73 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 74 | 75 | # Final (unnormalized) scores and predictions 76 | with
tf.name_scope("output"): 77 | # Standard output weights initialization 78 | W = tf.get_variable( 79 | "W", 80 | shape=[num_filters_total, num_classes], 81 | initializer=tf.contrib.layers.xavier_initializer()) 82 | b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b") 83 | 84 | # Initialized output weights to 0.0, might improve accuracy 85 | # W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W") 86 | # b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b") 87 | 88 | l2_loss += tf.nn.l2_loss(W) 89 | l2_loss += tf.nn.l2_loss(b) 90 | 91 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 92 | self.predictions = tf.argmax(self.scores, 1, name="predictions") 93 | 94 | # Calculate mean cross-entropy loss 95 | with tf.name_scope("loss"): 96 | losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y) 97 | self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss 98 | 99 | # Accuracy 100 | with tf.name_scope("accuracy"): 101 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1)) 102 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 103 | -------------------------------------------------------------------------------- /alt-version/train.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from model import Model 10 | from tensorflow.contrib import learn 11 | 12 | # Parameters 13 | # ================================================== 14 | 15 | # Model Hyperparameters 16 | tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings (default: None)") 17 | tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of character embedding (default: 300)") 18 | tf.flags.DEFINE_integer("hidden_dim", 300, "Dimensionality of hidden layer in LSTM (default: 300") 19 | tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')") 20 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 100)") 21 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)") 22 | tf.flags.DEFINE_float("l2_reg_lambda", 0, "L2 regularizaion lambda (default: 0.15)") 23 | 24 | # Training parameters 25 | tf.flags.DEFINE_integer("batch_size", 50, "Batch Size (default: 50)") 26 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of training epochs (default: 25)") 27 | tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)") 28 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)") 29 | 30 | # Misc Parameters 31 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 32 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 33 | 34 | FLAGS = tf.flags.FLAGS 35 | FLAGS._parse_flags() 36 | print("\nParameters:") 37 | for attr, value in sorted(FLAGS.__flags.items()): 38 | print("{}={}".format(attr.upper(), value)) 39 | print("") 40 | 41 | # Data Preparatopn 42 | # ================================================== 43 | 44 | # Load data 45 | print("Loading data...") 46 | x_text, y = data_helpers.load_data_and_labels() 47 | 48 | # Build vocabulary 49 | max_document_length = max([len(x.split(" ")) for x in 
x_text]) 50 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length) 51 | x = np.array(list(vocab_processor.fit_transform(x_text))) 52 | 53 | # Randomly shuffle data 54 | np.random.seed(10) 55 | shuffle_indices = np.random.permutation(np.arange(len(y))) 56 | x_shuffled = x[shuffle_indices] 57 | y_shuffled = y[shuffle_indices] 58 | 59 | # Split train/test set 60 | # TODO: This is very crude, should use cross-validation 61 | x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:] 62 | y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:] 63 | 64 | print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_))) 65 | print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev))) 66 | 67 | # Training 68 | # ================================================== 69 | 70 | with tf.Graph().as_default(): 71 | session_conf = tf.ConfigProto( 72 | allow_soft_placement=FLAGS.allow_soft_placement, 73 | log_device_placement=FLAGS.log_device_placement) 74 | sess = tf.Session(config=session_conf) 75 | 76 | with sess.as_default(): 77 | model = Model( 78 | sequence_length=x_train.shape[1], 79 | num_classes=2, 80 | vocab_size=len(vocab_processor.vocabulary_), 81 | embedding_size=FLAGS.embedding_dim, 82 | hidden_size=FLAGS.hidden_dim, 83 | filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 84 | num_filters=FLAGS.num_filters, 85 | l2_reg_lambda=FLAGS.l2_reg_lambda) 86 | 87 | # Define Training procedure 88 | global_step = tf.Variable(0, name="global_step", trainable=False) 89 | optimizer = tf.train.AdamOptimizer(0.001) 90 | grads_and_vars = optimizer.compute_gradients(model.loss) 91 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 92 | 93 | # Keep track of gradient values and sparsity (optional) 94 | grad_summaries = [] 95 | for g, v in grads_and_vars: 96 | if g is not None: 97 | grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g) 98 | sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 99 | grad_summaries.append(grad_hist_summary) 100 | grad_summaries.append(sparsity_summary) 101 | grad_summaries_merged = tf.merge_summary(grad_summaries) 102 | 103 | # Output directory for models and summaries 104 | timestamp = str(int(time.time())) 105 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 106 | print("Writing to {}\n".format(out_dir)) 107 | 108 | # Summaries for loss and accuracy 109 | loss_summary = tf.scalar_summary("loss", model.loss) 110 | acc_summary = tf.scalar_summary("accuracy", model.accuracy) 111 | 112 | # Train Summaries 113 | train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged]) 114 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 115 | train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph) 116 | 117 | # Dev summaries 118 | dev_summary_op = tf.merge_summary([loss_summary, acc_summary]) 119 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 120 | dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph) 121 | 122 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 123 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 124 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 125 | if not os.path.exists(checkpoint_dir): 126 | os.makedirs(checkpoint_dir) 127 | saver = tf.train.Saver(tf.all_variables()) 128 | 129 | # Write vocabulary 130 | vocab_processor.save(os.path.join(out_dir, "vocab")) 131 | 132 | # Initialize all variables 133 | sess.run(tf.initialize_all_variables()) 134 | 135 | if FLAGS.word2vec: 136 | # Initialize matrix with random uniform distribution 137 | initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim)) 138 | # Load any vectors from word2vec 139 | print("Load word2vec file {}\n".format(FLAGS.word2vec)) 140 | with open(FLAGS.word2vec, "rb") as f: 141 | header = f.readline() 142 | vocab_size, layer1_size = map(int, header.split()) 143 | binary_len = np.dtype('float32').itemsize * layer1_size 144 | 145 | for line in xrange(vocab_size): 146 | word = [] 147 | while True: 148 | ch = f.read(1) 149 | if ch == ' ': 150 | word = ''.join(word) 151 | break 152 | if ch != '\n': 153 | word.append(ch) 154 | 155 | idx = vocab_processor.vocabulary_.get(word) 156 | if idx != 0: 157 | initW[idx] = np.fromstring(f.read(binary_len), dtype='float32') 158 | else: 159 | f.read(binary_len) 160 | 161 | sess.run(model.W.assign(initW)) 162 | 163 | def train_step(x_batch, y_batch): 164 | """ 165 | A single training step 166 | """ 167 | feed_dict = { 168 | model.input_x: x_batch, 169 | model.input_y: y_batch, 170 | model.dropout_keep_prob: FLAGS.dropout_keep_prob 171 | } 172 | _, step, summaries, loss, accuracy = sess.run( 173 | [train_op, global_step, train_summary_op, model.loss, model.accuracy], 174 | feed_dict) 175 | 176 | time_str = datetime.datetime.now().isoformat() 177 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 178 | train_summary_writer.add_summary(summaries, step) 179 | 180 | def dev_step(x_batch, y_batch, writer=None): 181 | """ 182 | Evaluates model on a dev set 183 | """ 184 | feed_dict = { 185 | model.input_x: x_batch, 186 | model.input_y: y_batch, 187 | model.dropout_keep_prob: 1.0 188 | } 189 | step, summaries, loss, accuracy = sess.run( 190 | [global_step, dev_summary_op, model.loss, model.accuracy], 191 | feed_dict) 192 | 193 | time_str = datetime.datetime.now().isoformat() 194 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 195 | if writer: 196 | writer.add_summary(summaries, step) 197 | 198 | # Generate batches 199 | batches = data_helpers.batch_iter( 200 | list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs) 201 | 202 | # Training loop 203 | for batch in batches: 204 | x_batch, y_batch = zip(*batch) 205 | train_step(x_batch, y_batch) 206 | current_step = tf.train.global_step(sess, global_step) 207 | 208 | if current_step % FLAGS.evaluate_every == 0: 209 | print("\nEvaluation:") 210 | dev_step(x_dev, y_dev, writer=dev_summary_writer) 211 | print("") 212 | 213 | if current_step % FLAGS.checkpoint_every == 0: 214 | path = saver.save(sess, checkpoint_prefix, global_step=current_step) 215 | print("Saved model checkpoint to {}\n".format(path)) 216 | -------------------------------------------------------------------------------- /cnn-model/cnn_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | 5 | class TextCNN(object): 6 | """ 
7 | A CNN for text classification. 8 | Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer. 9 | """ 10 | def __init__( 11 | self, sequence_length, num_classes, vocab_size, 12 | embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0): 13 | 14 | # Placeholders for input, output and dropout 15 | self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x") 16 | self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y") 17 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 18 | 19 | # Keeping track of l2 regularization loss (optional) 20 | l2_loss = tf.constant(0.0) 21 | 22 | # Embedding layer 23 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 24 | self.W = tf.Variable( 25 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 26 | trainable=True, 27 | name="W") 28 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 29 | self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1) 30 | 31 | # Create a convolution + maxpool layer for each filter size 32 | pooled_outputs = [] 33 | for i, filter_size in enumerate(filter_sizes): 34 | with tf.name_scope("conv-maxpool-%s" % filter_size): 35 | # Convolution Layer 36 | filter_shape = [filter_size, embedding_size, 1, num_filters] 37 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 38 | b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") 39 | conv = tf.nn.conv2d( 40 | self.embedded_chars_expanded, 41 | W, 42 | strides=[1, 1, 1, 1], 43 | padding="VALID", 44 | name="conv") 45 | # Apply nonlinearity 46 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 47 | # Maxpooling over the outputs 48 | pooled = tf.nn.max_pool( 49 | h, 50 | ksize=[1, sequence_length - filter_size + 1, 1, 1], 51 | strides=[1, 1, 1, 1], 52 | padding='VALID', 53 | name="pool") 54 | pooled_outputs.append(pooled) 55 | 56 | # Combine all the pooled features 57 | num_filters_total = num_filters * len(filter_sizes) 58 | self.h_pool = tf.concat(3, pooled_outputs) 59 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 60 | 61 | # Add dropout 62 | with tf.name_scope("dropout"): 63 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 64 | 65 | # Final (unnormalized) scores and predictions 66 | with tf.name_scope("output"): 67 | # Standard output weights initialization 68 | W = tf.get_variable( 69 | "W", 70 | shape=[num_filters_total, num_classes], 71 | initializer=tf.contrib.layers.xavier_initializer()) 72 | b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b") 73 | 74 | # # Initialized output weights to 0.0, might improve accuracy 75 | # W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W") 76 | # b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b") 77 | 78 | l2_loss += tf.nn.l2_loss(W) 79 | l2_loss += tf.nn.l2_loss(b) 80 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 81 | self.predictions = tf.argmax(self.scores, 1, name="predictions") 82 | 83 | # CalculateMean cross-entropy loss 84 | with tf.name_scope("loss"): 85 | losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y) 86 | self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss 87 | 88 | # Accuracy 89 | with tf.name_scope("accuracy"): 90 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1)) 91 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 92 | 
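# Minimal usage sketch, mirroring how cnn-model/train.py drives this class:
# build the graph, then feed integer word-id matrices and one-hot labels
# through the placeholders. The hyperparameter values and the random toy
# batch are illustrative only, not part of the original training setup.
if __name__ == "__main__":
    sequence_length, num_classes, vocab_size = 56, 2, 20000

    with tf.Graph().as_default(), tf.Session() as sess:
        cnn = TextCNN(
            sequence_length=sequence_length,
            num_classes=num_classes,
            vocab_size=vocab_size,
            embedding_size=128,
            filter_sizes=[3, 4, 5],
            num_filters=100,
            l2_reg_lambda=0.0)
        sess.run(tf.initialize_all_variables())

        # Toy batch: random word ids and one-hot labels, shaped like real data
        x_batch = np.random.randint(0, vocab_size, size=(4, sequence_length))
        y_batch = np.eye(num_classes)[np.random.randint(0, num_classes, size=4)]

        loss, acc = sess.run(
            [cnn.loss, cnn.accuracy],
            {cnn.input_x: x_batch,
             cnn.input_y: y_batch,
             cnn.dropout_keep_prob: 1.0})
        print("toy batch: loss {:g}, acc {:g}".format(loss, acc))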
-------------------------------------------------------------------------------- /cnn-model/data_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | 6 | 7 | def clean_str(string): 8 | """ 9 | Tokenization/string cleaning for all datasets except for SST. 10 | Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py 11 | """ 12 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 13 | string = re.sub(r"\'s", " \'s", string) 14 | string = re.sub(r"\'ve", " \'ve", string) 15 | string = re.sub(r"n\'t", " n\'t", string) 16 | string = re.sub(r"\'re", " \'re", string) 17 | string = re.sub(r"\'d", " \'d", string) 18 | string = re.sub(r"\'ll", " \'ll", string) 19 | string = re.sub(r",", " , ", string) 20 | string = re.sub(r"!", " ! ", string) 21 | string = re.sub(r"\(", " \( ", string) 22 | string = re.sub(r"\)", " \) ", string) 23 | string = re.sub(r"\?", " \? ", string) 24 | string = re.sub(r"\s{2,}", " ", string) 25 | return string.strip().lower() 26 | 27 | 28 | def load_data_and_labels(): 29 | """ 30 | Loads MR polarity data from files, splits the data into words and generates labels. 31 | Returns split sentences and labels. 32 | """ 33 | # Load data from files 34 | positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines()) 35 | positive_examples = [s.strip() for s in positive_examples] 36 | negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines()) 37 | negative_examples = [s.strip() for s in negative_examples] 38 | # Split by words 39 | x_text = positive_examples + negative_examples 40 | x_text = [clean_str(sent) for sent in x_text] 41 | # Generate labels 42 | positive_labels = [[0, 1] for _ in positive_examples] 43 | negative_labels = [[1, 0] for _ in negative_examples] 44 | y = np.concatenate([positive_labels, negative_labels], 0) 45 | return [x_text, y] 46 | 47 | 48 | def batch_iter(data, batch_size, num_epochs, shuffle=True): 49 | """ 50 | Generates a batch iterator for a dataset. 51 | """ 52 | data = np.array(data) 53 | data_size = len(data) 54 | num_batches_per_epoch = int(len(data)/batch_size) + 1 55 | for epoch in range(num_epochs): 56 | # Shuffle the data at each epoch 57 | if shuffle: 58 | shuffle_indices = np.random.permutation(np.arange(data_size)) 59 | shuffled_data = data[shuffle_indices] 60 | else: 61 | shuffled_data = data 62 | for batch_num in range(num_batches_per_epoch): 63 | start_index = batch_num * batch_size 64 | end_index = min((batch_num + 1) * batch_size, data_size) 65 | yield shuffled_data[start_index:end_index] 66 | -------------------------------------------------------------------------------- /cnn-model/eval.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from cnn_model import TextCNN 10 | from tensorflow.contrib import learn 11 | 12 | # Parameters 13 | # ================================================== 14 | 15 | # Eval Parameters 16 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 17 | tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run") 18 | tf.flags.DEFINE_boolean("eval_train", False, "Evaluate on all training data") 19 | 20 | # Misc Parameters 21 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 22 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 23 | 24 | 25 | FLAGS = tf.flags.FLAGS 26 | FLAGS._parse_flags() 27 | print("\nParameters:") 28 | for attr, value in sorted(FLAGS.__flags.items()): 29 | print("{}={}".format(attr.upper(), value)) 30 | print("") 31 | 32 | # CHANGE THIS: Load data. Load your own data here 33 | if FLAGS.eval_train: 34 | x_raw, y_test = data_helpers.load_data_and_labels() 35 | y_test = np.argmax(y_test, axis=1) 36 | else: 37 | x_raw = ["a masterpiece four years in the making", "everything is off."] 38 | y_test = [1, 0] 39 | 40 | # Map data into vocabulary 41 | vocab_path = os.path.join(FLAGS.checkpoint_dir, "..", "vocab") 42 | vocab_processor = learn.preprocessing.VocabularyProcessor.restore(vocab_path) 43 | x_test = np.array(list(vocab_processor.transform(x_raw))) 44 | 45 | print("\nEvaluating...\n") 46 | 47 | # Evaluation 48 | # ================================================== 49 | checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir) 50 | graph = tf.Graph() 51 | with graph.as_default(): 52 | session_conf = tf.ConfigProto( 53 | allow_soft_placement=FLAGS.allow_soft_placement, 54 | log_device_placement=FLAGS.log_device_placement) 55 | sess = tf.Session(config=session_conf) 56 | with sess.as_default(): 57 | # Load the saved meta graph and restore variables 58 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 59 | saver.restore(sess, checkpoint_file) 60 | 61 | # Get the placeholders from the graph by name 62 | input_x = graph.get_operation_by_name("input_x").outputs[0] 63 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 64 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 65 | 66 | # Tensors we want to evaluate 67 | predictions = graph.get_operation_by_name("output/predictions").outputs[0] 68 | 69 | # Generate batches for one epoch 70 | batches = data_helpers.batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False) 71 | 72 | # Collect the predictions here 73 | all_predictions = [] 74 | 75 | for x_test_batch in batches: 76 | batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0}) 77 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 78 | 79 | # Print accuracy if y_test is defined 80 | if y_test is not None: 81 | correct_predictions = float(sum(all_predictions == y_test)) 82 | print("Total number of test examples: {}".format(len(y_test))) 83 | print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) 84 | -------------------------------------------------------------------------------- /cnn-model/train.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from cnn_model import TextCNN 10 | from tensorflow.contrib import learn 11 | 12 | # Parameters 13 | # ================================================== 14 | 15 | # Model Hyperparameters 16 | tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings (default: None)") 17 | tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of character embedding (default: 300)") 18 | tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')") 19 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 100)") 20 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)") 21 | tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularizaion lambda (default: 0.15)") 22 | 23 | # Training parameters 24 | tf.flags.DEFINE_integer("batch_size", 50, "Batch Size (default: 50)") 25 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of training epochs (default: 25)") 26 | tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)") 27 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)") 28 | # Misc Parameters 29 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 30 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 31 | 32 | FLAGS = tf.flags.FLAGS 33 | FLAGS._parse_flags() 34 | print("\nParameters:") 35 | for attr, value in sorted(FLAGS.__flags.items()): 36 | print("{}={}".format(attr.upper(), value)) 37 | print("") 38 | 39 | 40 | # Data Preparatopn 41 | # ================================================== 42 | 43 | # Load data 44 | print("Loading data...") 45 | x_text, y = data_helpers.load_data_and_labels() 46 | 47 | # Build vocabulary 48 | max_document_length = max([len(x.split(" ")) for x in x_text]) 49 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length) 50 | x = np.array(list(vocab_processor.fit_transform(x_text))) 51 | 52 | # Randomly shuffle data 53 | np.random.seed(10) 54 | shuffle_indices = np.random.permutation(np.arange(len(y))) 55 | x_shuffled = x[shuffle_indices] 56 | y_shuffled = y[shuffle_indices] 57 | 58 | # Split train/test set 59 | # TODO: This is very crude, should use cross-validation 60 | x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:] 61 | y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:] 62 | print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_))) 63 | print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev))) 64 | 65 | 66 | # Training 67 | # ================================================== 68 | 69 | with tf.Graph().as_default(): 70 | session_conf = tf.ConfigProto( 71 | allow_soft_placement=FLAGS.allow_soft_placement, 72 | log_device_placement=FLAGS.log_device_placement) 73 | sess = tf.Session(config=session_conf) 74 | with sess.as_default(): 75 | cnn = TextCNN( 76 | sequence_length=x_train.shape[1], 77 | num_classes=2, 78 | vocab_size=len(vocab_processor.vocabulary_), 79 | embedding_size=FLAGS.embedding_dim, 80 | filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 81 | num_filters=FLAGS.num_filters, 82 | l2_reg_lambda=FLAGS.l2_reg_lambda) 83 | 84 | # Define Training procedure 85 | global_step = tf.Variable(0, name="global_step", trainable=False) 86 
| optimizer = tf.train.AdamOptimizer(0.001) 87 | grads_and_vars = optimizer.compute_gradients(cnn.loss) 88 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 89 | 90 | # Keep track of gradient values and sparsity (optional) 91 | grad_summaries = [] 92 | for g, v in grads_and_vars: 93 | if g is not None: 94 | grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g) 95 | sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 96 | grad_summaries.append(grad_hist_summary) 97 | grad_summaries.append(sparsity_summary) 98 | grad_summaries_merged = tf.merge_summary(grad_summaries) 99 | 100 | # Output directory for models and summaries 101 | timestamp = str(int(time.time())) 102 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 103 | print("Writing to {}\n".format(out_dir)) 104 | 105 | # Summaries for loss and accuracy 106 | loss_summary = tf.scalar_summary("loss", cnn.loss) 107 | acc_summary = tf.scalar_summary("accuracy", cnn.accuracy) 108 | 109 | # Train Summaries 110 | train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged]) 111 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 112 | train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph) 113 | 114 | # Dev summaries 115 | dev_summary_op = tf.merge_summary([loss_summary, acc_summary]) 116 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 117 | dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph) 118 | 119 | # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it 120 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 121 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 122 | if not os.path.exists(checkpoint_dir): 123 | os.makedirs(checkpoint_dir) 124 | saver = tf.train.Saver(tf.all_variables()) 125 | 126 | # Write vocabulary 127 | vocab_processor.save(os.path.join(out_dir, "vocab")) 128 | 129 | # Initialize all variables 130 | sess.run(tf.initialize_all_variables()) 131 | 132 | if FLAGS.word2vec: 133 | # Initialize matrix with random uniform distribution 134 | initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim)) 135 | # Load any vectors from word2vec 136 | print("Load word2vec file {}\n".format(FLAGS.word2vec)) 137 | with open(FLAGS.word2vec, "rb") as f: 138 | header = f.readline() 139 | vocab_size, layer1_size = map(int, header.split()) 140 | binary_len = np.dtype('float32').itemsize * layer1_size 141 | 142 | for line in xrange(vocab_size): 143 | word = [] 144 | while True: 145 | ch = f.read(1) 146 | if ch == ' ': 147 | word = ''.join(word) 148 | break 149 | if ch != '\n': 150 | word.append(ch) 151 | 152 | idx = vocab_processor.vocabulary_.get(word) 153 | if idx != 0: 154 | initW[idx] = np.fromstring(f.read(binary_len), dtype='float32') 155 | else: 156 | f.read(binary_len) 157 | 158 | sess.run(cnn.W.assign(initW)) 159 | 160 | def train_step(x_batch, y_batch): 161 | """ 162 | A single training step 163 | """ 164 | feed_dict = { 165 | cnn.input_x: x_batch, 166 | cnn.input_y: y_batch, 167 | cnn.dropout_keep_prob: FLAGS.dropout_keep_prob 168 | } 169 | _, step, summaries, loss, accuracy = sess.run( 170 | [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy], 171 | feed_dict) 172 | time_str = datetime.datetime.now().isoformat() 173 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 
174 | train_summary_writer.add_summary(summaries, step) 175 | 176 | def dev_step(x_batch, y_batch, writer=None): 177 | """ 178 | Evaluates model on a dev set 179 | """ 180 | feed_dict = { 181 | cnn.input_x: x_batch, 182 | cnn.input_y: y_batch, 183 | cnn.dropout_keep_prob: 1.0 184 | } 185 | step, summaries, loss, accuracy = sess.run( 186 | [global_step, dev_summary_op, cnn.loss, cnn.accuracy], 187 | feed_dict) 188 | time_str = datetime.datetime.now().isoformat() 189 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 190 | if writer: 191 | writer.add_summary(summaries, step) 192 | 193 | # Generate batches 194 | batches = data_helpers.batch_iter( 195 | list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs) 196 | # Training loop. For each batch... 197 | for batch in batches: 198 | x_batch, y_batch = zip(*batch) 199 | train_step(x_batch, y_batch) 200 | current_step = tf.train.global_step(sess, global_step) 201 | if current_step % FLAGS.evaluate_every == 0: 202 | print("\nEvaluation:") 203 | dev_step(x_dev, y_dev, writer=dev_summary_writer) 204 | print("") 205 | if current_step % FLAGS.checkpoint_every == 0: 206 | path = saver.save(sess, checkpoint_prefix, global_step=current_step) 207 | print("Saved model checkpoint to {}\n".format(path)) 208 | -------------------------------------------------------------------------------- /data_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | 6 | 7 | def clean_str(string): 8 | """ 9 | Tokenization/string cleaning 10 | """ 11 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 12 | string = re.sub(r"\'s", " \'s", string) 13 | string = re.sub(r"\'ve", " \'ve", string) 14 | string = re.sub(r"n\'t", " n\'t", string) 15 | string = re.sub(r"\'re", " \'re", string) 16 | string = re.sub(r"\'d", " \'d", string) 17 | string = re.sub(r"\'ll", " \'ll", string) 18 | string = re.sub(r",", " , ", string) 19 | string = re.sub(r"!", " ! ", string) 20 | string = re.sub(r"\(", " \( ", string) 21 | string = re.sub(r"\)", " \) ", string) 22 | string = re.sub(r"\?", " \? ", string) 23 | string = re.sub(r"\s{2,}", " ", string) 24 | 25 | return string.strip().lower() 26 | 27 | 28 | def load_data_and_labels(): 29 | """ 30 | Loads polarity data from files, splits the data into words and generates labels. 31 | Returns split sentences and labels. 32 | """ 33 | 34 | # Load data from files 35 | positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines()) 36 | positive_examples = [s.strip() for s in positive_examples] 37 | negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines()) 38 | negative_examples = [s.strip() for s in negative_examples] 39 | 40 | # Split by words 41 | x_text = positive_examples + negative_examples 42 | x_text = [clean_str(sent) for sent in x_text] 43 | 44 | # Generate labels 45 | positive_labels = [[0, 1] for _ in positive_examples] 46 | negative_labels = [[1, 0] for _ in negative_examples] 47 | y = np.concatenate([positive_labels, negative_labels], 0) 48 | 49 | # Generate sequence lengths 50 | seqlen = np.array([len(sent.split(" ")) for sent in x_text]) 51 | 52 | return [x_text, y, seqlen] 53 | 54 | 55 | def batch_iter(data, seqlen_data, batch_size, num_epochs, shuffle=True): 56 | """ 57 | Generates a batch iterator for a dataset. 
58 | """ 59 | 60 | data = np.array(data) 61 | data_size = len(data) 62 | num_batches_per_epoch = int(len(data)/batch_size) + 1 63 | 64 | for epoch in range(num_epochs): 65 | # Shuffle the data at each epoch 66 | if shuffle: 67 | shuffle_indices = np.random.permutation(np.arange(data_size)) 68 | shuffled_data = data[shuffle_indices] 69 | else: 70 | shuffled_data = data 71 | 72 | for batch_num in range(num_batches_per_epoch): 73 | start_index = batch_num * batch_size 74 | end_index = min((batch_num + 1) * batch_size, data_size) 75 | 76 | seqlen_batch = seqlen_data[start_index:end_index] 77 | 78 | yield shuffled_data[start_index:end_index], seqlen_batch 79 | #TODO: Problem with seqlens 80 | 81 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from text_lstm import TextLSTM 10 | from tensorflow.contrib import learn 11 | 12 | 13 | # Parameters 14 | # ================================================== 15 | 16 | # Eval Parameters 17 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 18 | tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run") 19 | tf.flags.DEFINE_boolean("eval_train", False, "Evaluate on all training data") 20 | 21 | # Misc Parameters 22 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 23 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 24 | 25 | 26 | FLAGS = tf.flags.FLAGS 27 | FLAGS._parse_flags() 28 | print("\nParameters:") 29 | for attr, value in sorted(FLAGS.__flags.items()): 30 | print("{}={}".format(attr.upper(), value)) 31 | print("") 32 | 33 | # Load new datasets here 34 | if FLAGS.eval_train: 35 | x_raw, y_test, seqlen_test = data_helpers.load_data_and_labels() 36 | y_test = np.argmax(y_test, axis=1) 37 | else: 38 | x_raw = ["a masterpiece four years in the making", "everything is off."] 39 | y_test = [1, 0] 40 | seqlen_test = [7, 3] 41 | 42 | # Map data into vocabulary 43 | vocab_path = os.path.join(FLAGS.checkpoint_dir, "..", "vocab") 44 | vocab_processor = learn.preprocessing.VocabularyProcessor.restore(vocab_path) 45 | x_test = np.array(list(vocab_processor.transform(x_raw))) 46 | 47 | print("\nEvaluating...\n") 48 | 49 | # Evaluation 50 | # ================================================== 51 | checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir) 52 | graph = tf.Graph() 53 | with graph.as_default(): 54 | session_conf = tf.ConfigProto( 55 | allow_soft_placement=FLAGS.allow_soft_placement, 56 | log_device_placement=FLAGS.log_device_placement) 57 | sess = tf.Session(config=session_conf) 58 | 59 | with sess.as_default(): 60 | # Load the saved meta graph and restore variables 61 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 62 | saver.restore(sess, checkpoint_file) 63 | 64 | # Get the placeholders from the graph by name 65 | input_x = graph.get_operation_by_name("input_x").outputs[0] 66 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 67 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 68 | 69 | # Tensors we want to evaluate 70 | predictions = graph.get_operation_by_name("output/predictions").outputs[0] 71 | 72 | # Generate batches for one epoch 73 | batches = 
data_helpers.batch_iter(list(x_test), seqlen_test, FLAGS.batch_size, 1, shuffle=False) 74 | 75 | # Collect the predictions here 76 | all_predictions = [] 77 | 78 | for x_test_batch, seqlen_batch in batches: 79 | batch_predictions = sess.run( 80 | predictions, {input_x: x_test_batch, seqlen: seqlen_batch, dropout_keep_prob: 1.0}) 81 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 82 | 83 | # Print accuracy if y_test is defined 84 | if y_test is not None: 85 | correct_predictions = float(sum(all_predictions == y_test)) 86 | print("Total number of test examples: {}".format(len(y_test))) 87 | print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) 88 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | from tensorflow.python.ops import array_ops 4 | 5 | 6 | class Model(object): 7 | def __init__( 8 | self, 9 | sequence_length, 10 | num_classes, 11 | vocab_size, 12 | embedding_size, 13 | hidden_size, 14 | filter_sizes, 15 | num_filters, 16 | l2_reg_lambda=0.0): 17 | 18 | # Placeholders for input, sequence length, output and dropout 19 | self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x") 20 | self.seqlen = tf.placeholder(tf.int64, [None], name="seqlen") 21 | self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y") 22 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 23 | 24 | # Keeping track of l2 regularization loss (optional) 25 | l2_loss = tf.constant(0.0) 26 | 27 | 28 | # Embedding layer 29 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 30 | self.W = tf.Variable( 31 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 32 | trainable=True, 33 | name="W") 34 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 35 | #TODO: Embeddings process ignores commas etc. so seqlens might not be accurate for sentences with commas... 
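        # Shape note: embedded_chars is [batch_size, sequence_length, embedding_size].
        # seqlen holds each sentence's unpadded token count so the dynamic RNNs
        # below can stop at the true sentence end instead of running over padding.
        # The TODO above refers to seqlen being computed from the cleaned text in
        # data_helpers.py, which may not line up token-for-token with the id
        # sequence produced by the vocabulary processor.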
36 | 37 | 38 | # Bidirectional LSTM layer 39 | with tf.name_scope("bidirectional-lstm"): 40 | lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 41 | lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 42 | 43 | # self.lstm_outputs, _, _ = tf.nn.bidirectional_dynamic_rnn( 44 | # lstm_fw_cell, 45 | # lstm_bw_cell, 46 | # self.embedded_chars, 47 | # sequence_length=self.seqlen, 48 | # dtype=tf.float32) 49 | # lstm_outputs_fw, lstm_outputs_bw = tf.split(value=self.lstm_outputs, split_dim=2, num_split=2) 50 | # self.lstm_outputs = tf.add(lstm_outputs_fw, lstm_outputs_bw, name="lstm_outputs") 51 | 52 | with tf.variable_scope("lstm-output-fw"): 53 | self.lstm_outputs_fw, _ = tf.nn.dynamic_rnn( 54 | lstm_fw_cell, 55 | self.embedded_chars, 56 | sequence_length=self.seqlen, 57 | dtype=tf.float32) 58 | 59 | with tf.variable_scope("lstm-output-bw"): 60 | self.embedded_chars_rev = array_ops.reverse_sequence(self.embedded_chars, seq_lengths=self.seqlen, seq_dim=1) 61 | tmp, _ = tf.nn.dynamic_rnn( 62 | lstm_bw_cell, 63 | self.embedded_chars_rev, 64 | sequence_length=self.seqlen, 65 | dtype=tf.float32) 66 | self.lstm_outputs_bw = array_ops.reverse_sequence(tmp, seq_lengths=self.seqlen, seq_dim=1) 67 | 68 | # Concatenate outputs 69 | self.lstm_outputs = tf.add(self.lstm_outputs_fw, self.lstm_outputs_bw, name="lstm_outputs") 70 | 71 | self.lstm_outputs_expanded = tf.expand_dims(self.lstm_outputs, -1) 72 | 73 | 74 | # Convolution + maxpool layer for each filter size 75 | pooled_outputs = [] 76 | for i, filter_size in enumerate(filter_sizes): 77 | with tf.name_scope("conv-maxpool-%s" % filter_size): 78 | # Convolution Layer 79 | filter_shape = [filter_size, hidden_size, 1, num_filters] 80 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 81 | b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") 82 | 83 | conv = tf.nn.conv2d( 84 | self.lstm_outputs_expanded, 85 | W, 86 | strides=[1, 1, 1, 1], 87 | padding="VALID", 88 | name="conv") 89 | 90 | # Apply nonlinearity 91 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 92 | 93 | # Maxpooling over the outputs 94 | pooled = tf.nn.max_pool( 95 | h, 96 | ksize=[1, sequence_length - filter_size + 1, 1, 1], 97 | strides=[1, 1, 1, 1], 98 | padding='VALID', 99 | name="pool") 100 | pooled_outputs.append(pooled) 101 | 102 | # Combine all the pooled features 103 | num_filters_total = num_filters * len(filter_sizes) 104 | self.h_pool = tf.concat(3, pooled_outputs) 105 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 106 | 107 | 108 | # Dropout layer 109 | with tf.name_scope("dropout"): 110 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 111 | 112 | 113 | # Final (unnormalized) scores and predictions 114 | with tf.name_scope("output"): 115 | # Standard output weights initialization 116 | W = tf.get_variable( 117 | "W", 118 | shape=[num_filters_total, num_classes], 119 | initializer=tf.contrib.layers.xavier_initializer()) 120 | b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b") 121 | 122 | # # Initialized output weights to 0.0, might improve accuracy 123 | # W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W") 124 | # b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b") 125 | 126 | l2_loss += tf.nn.l2_loss(W) 127 | l2_loss += tf.nn.l2_loss(b) 128 | 129 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 130 | self.predictions = tf.argmax(self.scores, 1, name="predictions") 131 | 
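        # Shape summary: each conv-maxpool branch maps lstm_outputs_expanded
        # [batch, sequence_length, hidden_size, 1] to
        # [batch, sequence_length - filter_size + 1, 1, num_filters] via the
        # VALID convolution, then max-pools over time to [batch, 1, 1, num_filters].
        # Concatenating the branches and flattening gives h_pool_flat of shape
        # [batch, num_filters * len(filter_sizes)], which the output layer maps
        # to scores [batch, num_classes] and integer predictions [batch].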
132 | # Calculate mean cross-entropy loss 133 | with tf.name_scope("loss"): 134 | losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y) 135 | self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss 136 | 137 | # Accuracy 138 | with tf.name_scope("accuracy"): 139 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1)) 140 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 141 | -------------------------------------------------------------------------------- /res/acc-val.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/acc-val.png -------------------------------------------------------------------------------- /res/acc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/acc.png -------------------------------------------------------------------------------- /res/bidirectional-rnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/bidirectional-rnn.png -------------------------------------------------------------------------------- /res/cnn-128.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/cnn-128.png -------------------------------------------------------------------------------- /res/loss-val.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/loss-val.png -------------------------------------------------------------------------------- /res/loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/loss.png -------------------------------------------------------------------------------- /res/lstm+cnn-128.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/lstm+cnn-128.png -------------------------------------------------------------------------------- /res/lstm+cnn-300.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/lstm+cnn-300.png -------------------------------------------------------------------------------- /runs/cnn-128/events.out.tfevents.1483714098.FYP6: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/runs/cnn-128/events.out.tfevents.1483714098.FYP6 -------------------------------------------------------------------------------- /runs/lstm+cnn-128/events.out.tfevents.1483625861.FYP6: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/runs/lstm+cnn-128/events.out.tfevents.1483625861.FYP6 -------------------------------------------------------------------------------- /runs/lstm+cnn-300/events.out.tfevents.1483786544.FYP6: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/runs/lstm+cnn-300/events.out.tfevents.1483786544.FYP6 -------------------------------------------------------------------------------- /tflearn/cnn.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function, absolute_import 2 | 3 | import tensorflow as tf 4 | import tflearn 5 | from tflearn.data_utils import to_categorical, pad_sequences 6 | from tflearn.datasets import imdb 7 | from tflearn.layers.core import input_data, dropout, fully_connected 8 | from tflearn.layers.embedding_ops import embedding 9 | # from tflearn.layers.recurrent import bidirectional_rnn, BasicLSTMCell 10 | from tflearn.layers.merge_ops import merge 11 | from tflearn.layers.conv import conv_1d, global_max_pool 12 | from tflearn.layers.estimator import regression 13 | 14 | 15 | tf.flags.DEFINE_integer("maxlen", 100, "Maximum Sentence Length") 16 | tf.flags.DEFINE_integer("vocab_size", 10000, "Size of Vocabulary") 17 | tf.flags.DEFINE_integer("embedding_dim", 128, "Word Embedding Size") 18 | # tf.flags.DEFINE_integer("rnn_hidden_size", 128, "Size of biRNN hidden layer") 19 | tf.flags.DEFINE_integer("num_filters", 128, "Number of CNN filters") 20 | tf.flags.DEFINE_float("dropout_prob", 0.5, "Dropout Probability") 21 | tf.flags.DEFINE_float("learning_rate", 0.001, "Learning Rate") 22 | tf.flags.DEFINE_integer("batch_size", 32, "Batch Size") 23 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of Training Epochs") 24 | 25 | FLAGS = tf.flags.FLAGS 26 | FLAGS._parse_flags() 27 | print("\nParameters:") 28 | for attr, value in sorted(FLAGS.__flags.items()): 29 | print("{}={}".format(attr.upper(), value)) 30 | print("") 31 | 32 | maxlen = FLAGS.maxlen 33 | vocab_size = FLAGS.vocab_size 34 | embedding_dim = FLAGS.embedding_dim 35 | # rnn_hidden_size = FLAGS.rnn_hidden_size 36 | num_filters = FLAGS.num_filters 37 | dropout_prob = FLAGS.dropout_prob 38 | learning_rate = FLAGS.learning_rate 39 | batch_size = FLAGS.batch_size 40 | num_epochs = FLAGS.num_epochs 41 | 42 | 43 | # IMDB Dataset loading 44 | train, test, _ = imdb.load_data(path='imdb.pkl', n_words=vocab_size, valid_portion=0.1) 45 | trainX, trainY = train 46 | testX, testY = test 47 | 48 | # Sequence padding 49 | trainX = pad_sequences(trainX, maxlen=maxlen, value=0.) 50 | testX = pad_sequences(testX, maxlen=maxlen, value=0.) 
51 | 52 | # Converting labels to binary vectors 53 | trainY = to_categorical(trainY, nb_classes=2) 54 | testY = to_categorical(testY, nb_classes=2) 55 | 56 | 57 | # Building network 58 | network = input_data(shape=[None, maxlen], name='input') 59 | 60 | network = embedding( 61 | network, 62 | input_dim=vocab_size, 63 | output_dim=embedding_dim, 64 | trainable=True) 65 | 66 | # network = bidirectional_rnn( 67 | # network, 68 | # BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 69 | # BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 70 | # return_seq=True, 71 | # dynamic=True) 72 | # network = tf.pack(network, axis=1) 73 | 74 | # fw_outputs, bw_outputs = tf.split(split_dim=2, num_split=2, value=network) 75 | # network = tf.add(fw_outputs, bw_outputs) 76 | 77 | branch1 = conv_1d(network, num_filters, 3, padding='valid', activation='relu', regularizer="L2") 78 | branch2 = conv_1d(network, num_filters, 4, padding='valid', activation='relu', regularizer="L2") 79 | branch3 = conv_1d(network, num_filters, 5, padding='valid', activation='relu', regularizer="L2") 80 | 81 | network = merge([branch1, branch2, branch3], mode='concat', axis=1) 82 | 83 | network = tf.expand_dims(network, 2) 84 | 85 | network = global_max_pool(network) 86 | 87 | network = dropout(network, dropout_prob) 88 | 89 | network = fully_connected(network, 2, activation='softmax') 90 | 91 | network = regression( 92 | network, 93 | optimizer='adam', 94 | learning_rate=learning_rate, 95 | loss='categorical_crossentropy', 96 | name='target') 97 | 98 | 99 | # Training 100 | model = tflearn.DNN(network, tensorboard_verbose=0, tensorboard_dir='runs') 101 | model.fit( 102 | trainX, 103 | trainY, 104 | validation_set=(testX, testY), 105 | n_epoch = num_epochs, 106 | shuffle=True, 107 | show_metric=True, 108 | batch_size=batch_size) 109 | -------------------------------------------------------------------------------- /tflearn/model.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function, absolute_import 2 | 3 | import tensorflow as tf 4 | import tflearn 5 | from tflearn.data_utils import to_categorical, pad_sequences 6 | from tflearn.datasets import imdb 7 | from tflearn.layers.core import input_data, dropout, fully_connected 8 | from tflearn.layers.embedding_ops import embedding 9 | from tflearn.layers.recurrent import bidirectional_rnn, BasicLSTMCell 10 | from tflearn.layers.merge_ops import merge 11 | from tflearn.layers.conv import conv_1d, global_max_pool 12 | from tflearn.layers.estimator import regression 13 | 14 | 15 | tf.flags.DEFINE_integer("maxlen", 100, "Maximum Sentence Length") 16 | tf.flags.DEFINE_integer("vocab_size", 10000, "Size of Vocabulary") 17 | tf.flags.DEFINE_integer("embedding_dim", 128, "Word Embedding Size") 18 | tf.flags.DEFINE_integer("rnn_hidden_size", 128, "Size of biRNN hidden layer") 19 | tf.flags.DEFINE_integer("num_filters", 128, "Number of CNN filters") 20 | tf.flags.DEFINE_float("dropout_prob", 0.5, "Dropout Probability") 21 | tf.flags.DEFINE_float("learning_rate", 0.001, "Learning Rate") 22 | tf.flags.DEFINE_integer("batch_size", 32, "Batch Size") 23 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of Training Epochs") 24 | 25 | FLAGS = tf.flags.FLAGS 26 | FLAGS._parse_flags() 27 | print("\nParameters:") 28 | for attr, value in sorted(FLAGS.__flags.items()): 29 | print("{}={}".format(attr.upper(), value)) 30 | print("") 31 | 32 | maxlen = FLAGS.maxlen 33 | 
vocab_size = FLAGS.vocab_size 34 | embedding_dim = FLAGS.embedding_dim 35 | rnn_hidden_size = FLAGS.rnn_hidden_size 36 | num_filters = FLAGS.num_filters 37 | dropout_prob = FLAGS.dropout_prob 38 | learning_rate = FLAGS.learning_rate 39 | batch_size = FLAGS.batch_size 40 | num_epochs = FLAGS.num_epochs 41 | 42 | 43 | # IMDB Dataset loading 44 | train, test, _ = imdb.load_data(path='imdb.pkl', n_words=vocab_size, valid_portion=0.1) 45 | trainX, trainY = train 46 | testX, testY = test 47 | 48 | # Sequence padding 49 | trainX = pad_sequences(trainX, maxlen=maxlen, value=0.) 50 | testX = pad_sequences(testX, maxlen=maxlen, value=0.) 51 | 52 | # Converting labels to binary vectors 53 | trainY = to_categorical(trainY, nb_classes=2) 54 | testY = to_categorical(testY, nb_classes=2) 55 | 56 | 57 | # Building network 58 | network = input_data(shape=[None, maxlen], name='input') 59 | 60 | network = embedding( 61 | network, 62 | input_dim=vocab_size, 63 | output_dim=embedding_dim, 64 | trainable=True) 65 | 66 | network = bidirectional_rnn( 67 | network, 68 | BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 69 | BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 70 | return_seq=True, 71 | dynamic=True) 72 | network = tf.pack(network, axis=1) 73 | 74 | fw_outputs, bw_outputs = tf.split(split_dim=2, num_split=2, value=network) 75 | network = tf.add(fw_outputs, bw_outputs) 76 | 77 | branch1 = conv_1d(network, num_filters, 3, padding='valid', activation='relu', regularizer="L2") 78 | branch2 = conv_1d(network, num_filters, 4, padding='valid', activation='relu', regularizer="L2") 79 | branch3 = conv_1d(network, num_filters, 5, padding='valid', activation='relu', regularizer="L2") 80 | 81 | network = merge([branch1, branch2, branch3], mode='concat', axis=1) 82 | 83 | network = tf.expand_dims(network, 2) 84 | 85 | network = global_max_pool(network) 86 | 87 | network = dropout(network, dropout_prob) 88 | 89 | network = fully_connected(network, 2, activation='softmax') 90 | 91 | network = regression( 92 | network, 93 | optimizer='adam', 94 | learning_rate=learning_rate, 95 | loss='categorical_crossentropy', 96 | name='target') 97 | 98 | 99 | # Training 100 | model = tflearn.DNN(network, tensorboard_verbose=0, tensorboard_dir='runs') 101 | model.fit( 102 | trainX, 103 | trainY, 104 | validation_set=(testX, testY), 105 | n_epoch = num_epochs, 106 | shuffle=True, 107 | show_metric=True, 108 | batch_size=batch_size) 109 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python
2 |
3 | import tensorflow as tf
4 | import numpy as np
5 | import os
6 | import time
7 | import datetime
8 | import data_helpers
9 | from model import Model
10 | from tensorflow.contrib import learn
11 |
12 | # Parameters
13 | # ==================================================
14 |
15 | # Model Hyperparameters
16 | tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings (default: None)")
17 | tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of word embedding (default: 300)")
18 | tf.flags.DEFINE_integer("hidden_dim", 150, "Dimensionality of hidden layer in LSTM (default: 150)")
19 | tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
20 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 100)")
21 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
22 | tf.flags.DEFINE_float("l2_reg_lambda", 0.15, "L2 regularization lambda (default: 0.15)")
23 |
24 | # Training parameters
25 | tf.flags.DEFINE_integer("batch_size", 50, "Batch Size (default: 50)")
26 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of training epochs (default: 25)")
27 | tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")
28 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")
29 |
30 | # Misc Parameters
31 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow soft device placement")
32 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")
33 |
34 | FLAGS = tf.flags.FLAGS
35 | FLAGS._parse_flags()
36 | print("\nParameters:")
37 | for attr, value in sorted(FLAGS.__flags.items()):
38 | print("{}={}".format(attr.upper(), value))
39 | print("")
40 |
41 | # Data Preparation
42 | # ==================================================
43 |
44 | # Load data
45 | print("Loading data...")
46 | x_text, y, seqlen = data_helpers.load_data_and_labels()
47 |
48 | # Build vocabulary
49 | max_document_length = max(seqlen)
50 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
51 | x = np.array(list(vocab_processor.fit_transform(x_text)))
52 |
53 | # Randomly shuffle data
54 | np.random.seed(10)
55 | shuffle_indices = np.random.permutation(np.arange(len(y)))
56 | x_shuffled = x[shuffle_indices]
57 | y_shuffled = y[shuffle_indices]
58 | seqlen_shuffled = seqlen[shuffle_indices]
59 |
60 | # Split train/dev set
61 | # TODO: This is very crude, should use cross-validation
62 | x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:]
63 | y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:]
64 | seqlen_train, seqlen_dev = seqlen_shuffled[:-1000], seqlen_shuffled[-1000:]
65 |
66 | print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))
67 | print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))
68 |
69 | # Training
70 | # ==================================================
71 |
72 | with tf.Graph().as_default():
73 | session_conf = tf.ConfigProto(
74 | allow_soft_placement=FLAGS.allow_soft_placement,
75 | log_device_placement=FLAGS.log_device_placement)
76 | sess = tf.Session(config=session_conf)
77 |
78 | with sess.as_default():
79 | model = Model(
80 | sequence_length=x_train.shape[1],
81 | num_classes=2,
82 | vocab_size=len(vocab_processor.vocabulary_),
83 | embedding_size=FLAGS.embedding_dim,
84 | hidden_size=FLAGS.hidden_dim,
85 | 
filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 86 | num_filters=FLAGS.num_filters, 87 | l2_reg_lambda=FLAGS.l2_reg_lambda) 88 | 89 | # Define Training procedure 90 | global_step = tf.Variable(0, name="global_step", trainable=False) 91 | optimizer = tf.train.AdamOptimizer(0.001) 92 | grads_and_vars = optimizer.compute_gradients(model.loss) 93 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 94 | 95 | # Keep track of gradient values and sparsity (optional) 96 | grad_summaries = [] 97 | for g, v in grads_and_vars: 98 | if g is not None: 99 | grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g) 100 | sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 101 | grad_summaries.append(grad_hist_summary) 102 | grad_summaries.append(sparsity_summary) 103 | grad_summaries_merged = tf.merge_summary(grad_summaries) 104 | 105 | # Output directory for models and summaries 106 | timestamp = str(int(time.time())) 107 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 108 | print("Writing to {}\n".format(out_dir)) 109 | 110 | # Summaries for loss and accuracy 111 | loss_summary = tf.scalar_summary("loss", model.loss) 112 | acc_summary = tf.scalar_summary("accuracy", model.accuracy) 113 | 114 | # Train Summaries 115 | train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged]) 116 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 117 | train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph) 118 | 119 | # Dev summaries 120 | dev_summary_op = tf.merge_summary([loss_summary, acc_summary]) 121 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 122 | dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph) 123 | 124 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 125 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 126 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 127 | if not os.path.exists(checkpoint_dir): 128 | os.makedirs(checkpoint_dir) 129 | saver = tf.train.Saver(tf.global_variables()) 130 | 131 | # Write vocabulary 132 | vocab_processor.save(os.path.join(out_dir, "vocab")) 133 | 134 | # Initialize all variables 135 | sess.run(tf.global_variables_initializer()) 136 | 137 | if FLAGS.word2vec: 138 | # Initialize matrix with random uniform distribution 139 | initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim)) 140 | # Load any vectors from word2vec 141 | print("Load word2vec file {}\n".format(FLAGS.word2vec)) 142 | with open(FLAGS.word2vec, "rb") as f: 143 | header = f.readline() 144 | vocab_size, layer1_size = map(int, header.split()) 145 | binary_len = np.dtype('float32').itemsize * layer1_size 146 | 147 | for line in range(vocab_size): 148 | word = [] 149 | while True: 150 | ch = f.read(1) 151 | if ch == ' ': 152 | word = ''.join(word) 153 | break 154 | if ch != '\n': 155 | word.append(ch) 156 | 157 | idx = vocab_processor.vocabulary_.get(word) 158 | if idx != 0: 159 | initW[idx] = np.fromstring(f.read(binary_len), dtype='float32') 160 | else: 161 | f.read(binary_len) 162 | 163 | sess.run(model.W.assign(initW)) 164 | 165 | def train_step(x_batch, seqlen_batch, y_batch): 166 | """ 167 | A single training step 168 | """ 169 | feed_dict = { 170 | model.input_x: x_batch, 171 | model.seqlen: seqlen_batch, 172 | model.input_y: y_batch, 173 | model.dropout_keep_prob: FLAGS.dropout_keep_prob 174 | } 175 | _, step, summaries, loss, accuracy = sess.run( 176 | [train_op, global_step, train_summary_op, model.loss, model.accuracy], 177 | feed_dict) 178 | 179 | time_str = datetime.datetime.now().isoformat() 180 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 181 | train_summary_writer.add_summary(summaries, step) 182 | 183 | def dev_step(x_batch, seqlen_batch, y_batch, writer=None): 184 | """ 185 | Evaluates model on a dev set 186 | """ 187 | feed_dict = { 188 | model.input_x: x_batch, 189 | model.seqlen: seqlen_batch, 190 | model.input_y: y_batch, 191 | model.dropout_keep_prob: 1.0 192 | } 193 | step, summaries, loss, accuracy = sess.run( 194 | [global_step, dev_summary_op, model.loss, model.accuracy], 195 | feed_dict) 196 | 197 | time_str = datetime.datetime.now().isoformat() 198 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 199 | if writer: 200 | writer.add_summary(summaries, step) 201 | 202 | # Generate batches 203 | batches = data_helpers.batch_iter( 204 | list(zip(x_train, y_train)), seqlen_train, FLAGS.batch_size, FLAGS.num_epochs) 205 | 206 | # Training loop. For each batch... 207 | for batch, seqlen_batch in batches: 208 | x_batch, y_batch = zip(*batch) 209 | train_step(x_batch, seqlen_batch, y_batch) 210 | current_step = tf.train.global_step(sess, global_step) 211 | 212 | if current_step % FLAGS.evaluate_every == 0: 213 | print("\nEvaluation:") 214 | dev_step(x_dev, seqlen_dev, y_dev, writer=dev_summary_writer) 215 | print("") 216 | 217 | if current_step % FLAGS.checkpoint_every == 0: 218 | path = saver.save(sess, checkpoint_prefix, global_step=current_step) 219 | print("Saved model checkpoint to {}\n".format(path)) 220 | --------------------------------------------------------------------------------
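A note on the word2vec loading block in `train.py` above: the file is opened in binary mode, so if the script is run under Python 3, `f.read(1)` returns `bytes` and the comparisons against `' '` and `'\n'` never match, which keeps the inner word-reading loop from terminating. Below is a minimal byte-safe sketch of that loading routine under the same binary word2vec format; `load_word2vec_binary`, `vocab_index`, and `vocab_len` are illustrative names introduced here, not helpers from this repository.
```
import numpy as np

def load_word2vec_binary(path, vocab_index, vocab_len, embedding_dim):
    """Byte-safe sketch of a binary word2vec reader (hypothetical helper, not part of this repo)."""
    # Start from the same random-uniform initialization used in train.py, so words
    # absent from the word2vec file keep a random embedding.
    initW = np.random.uniform(-0.25, 0.25, (vocab_len, embedding_dim)).astype(np.float32)
    with open(path, "rb") as f:
        header = f.readline()
        w2v_vocab_size, layer1_size = map(int, header.split())
        binary_len = np.dtype("float32").itemsize * layer1_size
        for _ in range(w2v_vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" ":   # compare against bytes, not str
                    break
                if ch != b"\n":
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="ignore")
            vector = np.frombuffer(f.read(binary_len), dtype=np.float32)
            idx = vocab_index(word)
            if idx != 0:         # mirrors the idx != 0 check in train.py (0 = unknown/padding row)
                initW[idx] = vector
    return initW
```
In `train.py`, `vocab_index` would correspond to `vocab_processor.vocabulary_.get` and `vocab_len` to `len(vocab_processor.vocabulary_)`, after which the existing `sess.run(model.W.assign(initW))` call can be reused unchanged.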