├── .gitignore ├── LICENSE ├── README.md ├── alt-version ├── data_helpers.py ├── eval.py ├── model.py └── train.py ├── cnn-model ├── cnn_model.py ├── data_helpers.py ├── eval.py └── train.py ├── data └── rt-polaritydata │ ├── rt-polarity.neg │ └── rt-polarity.pos ├── data_helpers.py ├── eval.py ├── model.py ├── res ├── acc-val.png ├── acc.png ├── bidirectional-rnn.png ├── cnn-128.png ├── loss-val.png ├── loss.png ├── lstm+cnn-128.png └── lstm+cnn-300.png ├── runs ├── cnn-128 │ └── events.out.tfevents.1483714098.FYP6 ├── lstm+cnn-128 │ └── events.out.tfevents.1483625861.FYP6 └── lstm+cnn-300 │ └── events.out.tfevents.1483786544.FYP6 ├── tflearn ├── cnn.py └── model.py └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | runs/1484149652 2 | runs/1484150035 3 | runs/1484150236 4 | runs/1484151924 5 | GoogleNews-vectors-negative300.bin 6 | imdb.pkl 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | env/ 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *,cover 54 | .hypothesis/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # pyenv 81 | .python-version 82 | 83 | # celery beat schedule file 84 | celerybeat-schedule 85 | 86 | # dotenv 87 | .env 88 | 89 | # virtualenv 90 | .venv/ 91 | venv/ 92 | ENV/ 93 | 94 | # Spyder project settings 95 | .spyderproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Chaitanya Joshi 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | Presented here is a method to modify the word embeddings of a word in a sentence with its surrounding context using a bidirectional Recurrent Neural Network (RNN). The hypothesis is that these modified embeddings are a better input for performing text classification tasks like sentiment analysis or polarity detection. 3 | 4 | **Read the full blog post here: [chaitjo.github.io/context-embeddings](https://chaitjo.github.io/context-embeddings/)** 5 | 6 | --- 7 | 8 | ![Bidirectional RNN layer](res/bidirectional-rnn.png) 9 | 10 | # Implementation 11 | The code implements the proposed model as a pre-processing layer before feeding it into a [Convolutional Neural Network for Sentence Classification](https://arxiv.org/pdf/1408.5882v2.pdf) (Kim, 2014). Two implementations are provided to run experiments: one with [tensorflow](https://www.tensorflow.org/) and one with [tflearn](http://tflearn.org/) (A high-level API for tensorflow). Training happens end-to-end in a supervised manner: the RNN layer is simply inserted as part of the existing model's architecture for text classification. 12 | 13 | The tensorflow version is built on top of [Denny Britz's implementation of Kim's CNN](https://github.com/dennybritz/cnn-text-classification-tf), and also allows loading pre-trained word2vec embeddings. 14 | 15 | Although both versions work exactly as intended, results in the blog post are from experiments with the tflearn version only. 16 | 17 | # Usage 18 | I used Python 3.6 and Tensorflow 0.12.1 for my experiments. 19 | Tensorflow code is divided into `model.py` which abstracts the model as a class, and `train.py` which is used to train the model. It can be executed by running the `train.py` script (with optional flags to set hyperparameters)- 20 | ``` 21 | $ python train.py [--flag=1] 22 | ``` 23 | (Tensorflow code for Kim's baseline CNN can be found in `/cnn-model`.) 
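For example, to train with pre-trained word2vec vectors, the hyperparameter flags defined in the training scripts can be set explicitly. The values below are purely illustrative, and the word2vec path assumes the GoogleNews binary has been downloaded into the repository root-
```
$ python train.py --word2vec=GoogleNews-vectors-negative300.bin --embedding_dim=300 --num_filters=100 --dropout_keep_prob=0.5 --batch_size=50 --num_epochs=25
```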
24 | 25 | Tflearn code can be found in the `/tflearn` folder and can be run directly to start training (with optional flags to set hyperparameters)- 26 | ``` 27 | $ python tflearn/model.py [--flag=1] 28 | ``` 29 | 30 | The summaries generated during training (saved in `/runs` by default) can be used to visualize results using tensorboard with the following command- 31 | ``` 32 | $ tensorboard --logdir= 33 | ``` 34 | -------------------------------------------------------------------------------- /alt-version/data_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | 6 | 7 | def clean_str(string): 8 | """ 9 | Tokenization/string cleaning 10 | """ 11 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 12 | string = re.sub(r"\'s", " \'s", string) 13 | string = re.sub(r"\'ve", " \'ve", string) 14 | string = re.sub(r"n\'t", " n\'t", string) 15 | string = re.sub(r"\'re", " \'re", string) 16 | string = re.sub(r"\'d", " \'d", string) 17 | string = re.sub(r"\'ll", " \'ll", string) 18 | string = re.sub(r",", " , ", string) 19 | string = re.sub(r"!", " ! ", string) 20 | string = re.sub(r"\(", " \( ", string) 21 | string = re.sub(r"\)", " \) ", string) 22 | string = re.sub(r"\?", " \? ", string) 23 | string = re.sub(r"\s{2,}", " ", string) 24 | 25 | return string.strip().lower() 26 | 27 | 28 | def load_data_and_labels(): 29 | """ 30 | Loads polarity data from files, splits the data into words and generates labels. 31 | Returns split sentences and labels. 32 | """ 33 | 34 | # Load data from files 35 | positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines()) 36 | positive_examples = [s.strip() for s in positive_examples] 37 | negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines()) 38 | negative_examples = [s.strip() for s in negative_examples] 39 | 40 | # Split by words 41 | x_text = positive_examples + negative_examples 42 | x_text = [clean_str(sent) for sent in x_text] 43 | 44 | # Generate labels 45 | positive_labels = [[0, 1] for _ in positive_examples] 46 | negative_labels = [[1, 0] for _ in negative_examples] 47 | y = np.concatenate([positive_labels, negative_labels], 0) 48 | 49 | return [x_text, y] 50 | 51 | 52 | def batch_iter(data, batch_size, num_epochs, shuffle=True): 53 | """ 54 | Generates a batch iterator for a dataset. 55 | """ 56 | data = np.array(data) 57 | data_size = len(data) 58 | num_batches_per_epoch = int(len(data)/batch_size) + 1 59 | 60 | for epoch in range(num_epochs): 61 | # Shuffle the data at each epoch 62 | if shuffle: 63 | shuffle_indices = np.random.permutation(np.arange(data_size)) 64 | shuffled_data = data[shuffle_indices] 65 | else: 66 | shuffled_data = data 67 | 68 | for batch_num in range(num_batches_per_epoch): 69 | start_index = batch_num * batch_size 70 | end_index = min((batch_num + 1) * batch_size, data_size) 71 | yield shuffled_data[start_index:end_index] 72 | 73 | 74 | def pad_sentences(sentences, padding_word="", max_filter=5): 75 | """ 76 | Pads all sentences to the same length. The length is defined by the longest sentence. 77 | Returns padded sentences. 78 | """ 79 | 80 | # Using this might improve accuracy... 
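    # Example: with the default max_filter=5, pad_filter below is 4, so every
    # sentence gets max_filter padding tokens on the left and enough on the
    # right for all padded sentences to end up the same length. If the longest
    # sentence has 20 tokens, sequence_length is 28; a 10-token sentence then
    # gets 5 left pads and 14 right pads (total 29), matching the longest
    # sentence's 5 left pads and 4 right pads.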
81 | 82 | pad_filter = max_filter -1 83 | sequence_length = max(len(x) for x in sentences) + 2*pad_filter 84 | 85 | padded_sentences = [] 86 | for i in range(len(sentences)): 87 | sentence = sentences[i] 88 | num_padding = sequence_length - len(sentence) - pad_filter 89 | new_sentence = [padding_word]*max_filter + sentence + [padding_word] * num_padding 90 | padded_sentences.append(new_sentence) 91 | 92 | return padded_sentences 93 | 94 | -------------------------------------------------------------------------------- /alt-version/eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from text_lstm import TextLSTM 10 | from tensorflow.contrib import learn 11 | 12 | 13 | # Parameters 14 | # ================================================== 15 | 16 | # Eval Parameters 17 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 18 | tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run") 19 | tf.flags.DEFINE_boolean("eval_train", False, "Evaluate on all training data") 20 | 21 | # Misc Parameters 22 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 23 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 24 | 25 | 26 | FLAGS = tf.flags.FLAGS 27 | FLAGS._parse_flags() 28 | print("\nParameters:") 29 | for attr, value in sorted(FLAGS.__flags.items()): 30 | print("{}={}".format(attr.upper(), value)) 31 | print("") 32 | 33 | # Load new datasets here 34 | if FLAGS.eval_train: 35 | x_raw, y_test = data_helpers.load_data_and_labels() 36 | y_test = np.argmax(y_test, axis=1) 37 | else: 38 | x_raw = ["a masterpiece four years in the making", "everything is off."] 39 | y_test = [1, 0] 40 | 41 | # Map data into vocabulary 42 | vocab_path = os.path.join(FLAGS.checkpoint_dir, "..", "vocab") 43 | vocab_processor = learn.preprocessing.VocabularyProcessor.restore(vocab_path) 44 | x_test = np.array(list(vocab_processor.transform(x_raw))) 45 | 46 | print("\nEvaluating...\n") 47 | 48 | # Evaluation 49 | # ================================================== 50 | checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir) 51 | graph = tf.Graph() 52 | with graph.as_default(): 53 | session_conf = tf.ConfigProto( 54 | allow_soft_placement=FLAGS.allow_soft_placement, 55 | log_device_placement=FLAGS.log_device_placement) 56 | sess = tf.Session(config=session_conf) 57 | 58 | with sess.as_default(): 59 | # Load the saved meta graph and restore variables 60 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 61 | saver.restore(sess, checkpoint_file) 62 | 63 | # Get the placeholders from the graph by name 64 | input_x = graph.get_operation_by_name("input_x").outputs[0] 65 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 66 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 67 | 68 | # Tensors we want to evaluate 69 | predictions = graph.get_operation_by_name("output/predictions").outputs[0] 70 | 71 | # Generate batches for one epoch 72 | batches = data_helpers.batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False) 73 | 74 | # Collect the predictions here 75 | all_predictions = [] 76 | 77 | for x_test_batch in batches: 78 | batch_predictions = sess.run( 79 | predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0}) 80 | 
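        # batch_predictions contains one integer class index per example (the
        # model's argmax over the class scores); the batches are concatenated
        # below and compared against y_test to report accuracy.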
all_predictions = np.concatenate([all_predictions, batch_predictions]) 81 | 82 | # Print accuracy if y_test is defined 83 | if y_test is not None: 84 | correct_predictions = float(sum(all_predictions == y_test)) 85 | print("Total number of test examples: {}".format(len(y_test))) 86 | print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) 87 | -------------------------------------------------------------------------------- /alt-version/model.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | class Model(object): 5 | def __init__( 6 | self, sequence_length, num_classes, vocab_size, 7 | embedding_size, hidden_size, 8 | filter_sizes, num_filters, l2_reg_lambda=0.0): 9 | 10 | # Placeholders for input, output and dropout 11 | self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x") 12 | self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y") 13 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 14 | 15 | # Keeping track of l2 regularization loss (optional) 16 | l2_loss = tf.constant(0.0) 17 | 18 | # Embedding layer 19 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 20 | self.W = tf.Variable( 21 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 22 | trainable=True, 23 | name="W") 24 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 25 | 26 | with tf.name_scope("bidirectional-lstm"): 27 | b = tf.Variable(tf.constant(0.1, shape=[hidden_size]), name="b") 28 | lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 29 | lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 30 | 31 | self.lstm_outputs, _, _ = tf.nn.bidirectional_dynamic_rnn(lstm_fw_cell, lstm_bw_cell, self.embedded_chars, dtype=tf.float32) 32 | self.lstm_outputs = tf.nn.bias_add(self.lstm_outputs, b) 33 | lstm_outputs_fw, lstm_outputs_bw = tf.split(value=self.lstm_outputs, split_dim=2, num_split=2) 34 | self.lstm_outputs = tf.add(lstm_outputs_fw, lstm_outputs_bw, name="lstm_outputs") 35 | 36 | self.lstm_outputs_expanded = tf.expand_dims(self.lstm_outputs, -1) 37 | 38 | # Create a convolution + maxpool layer for each filter size 39 | pooled_outputs = [] 40 | for i, filter_size in enumerate(filter_sizes): 41 | with tf.name_scope("conv-maxpool-%s" % filter_size): 42 | # Convolution Layer 43 | filter_shape = [filter_size, hidden_size, 1, num_filters] 44 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 45 | b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") 46 | 47 | conv = tf.nn.conv2d( 48 | self.lstm_outputs_expanded, 49 | W, 50 | strides=[1, 1, 1, 1], 51 | padding="VALID", 52 | name="conv") 53 | 54 | # Apply nonlinearity 55 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 56 | 57 | # Maxpooling over the outputs 58 | pooled = tf.nn.max_pool( 59 | h, 60 | ksize=[1, sequence_length - filter_size + 1, 1, 1], 61 | strides=[1, 1, 1, 1], 62 | padding='VALID', 63 | name="pool") 64 | pooled_outputs.append(pooled) 65 | 66 | # Combine all the pooled features 67 | num_filters_total = num_filters * len(filter_sizes) 68 | self.h_pool = tf.concat(3, pooled_outputs) 69 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 70 | 71 | # Add dropout 72 | with tf.name_scope("dropout"): 73 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 74 | 75 | # Final (unnormalized) scores and predictions 76 | with
tf.name_scope("output"): 77 | # Standard output weights initialization 78 | W = tf.get_variable( 79 | "W", 80 | shape=[num_filters_total, num_classes], 81 | initializer=tf.contrib.layers.xavier_initializer()) 82 | b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b") 83 | 84 | # Initialized output weights to 0.0, might improve accuracy 85 | # W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W") 86 | # b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b") 87 | 88 | l2_loss += tf.nn.l2_loss(W) 89 | l2_loss += tf.nn.l2_loss(b) 90 | 91 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 92 | self.predictions = tf.argmax(self.scores, 1, name="predictions") 93 | 94 | # Calculate mean cross-entropy loss 95 | with tf.name_scope("loss"): 96 | losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y) 97 | self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss 98 | 99 | # Accuracy 100 | with tf.name_scope("accuracy"): 101 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1)) 102 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 103 | -------------------------------------------------------------------------------- /alt-version/train.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from model import Model 10 | from tensorflow.contrib import learn 11 | 12 | # Parameters 13 | # ================================================== 14 | 15 | # Model Hyperparameters 16 | tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings (default: None)") 17 | tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of character embedding (default: 300)") 18 | tf.flags.DEFINE_integer("hidden_dim", 300, "Dimensionality of hidden layer in LSTM (default: 300") 19 | tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')") 20 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 100)") 21 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)") 22 | tf.flags.DEFINE_float("l2_reg_lambda", 0, "L2 regularizaion lambda (default: 0.15)") 23 | 24 | # Training parameters 25 | tf.flags.DEFINE_integer("batch_size", 50, "Batch Size (default: 50)") 26 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of training epochs (default: 25)") 27 | tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)") 28 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)") 29 | 30 | # Misc Parameters 31 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 32 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 33 | 34 | FLAGS = tf.flags.FLAGS 35 | FLAGS._parse_flags() 36 | print("\nParameters:") 37 | for attr, value in sorted(FLAGS.__flags.items()): 38 | print("{}={}".format(attr.upper(), value)) 39 | print("") 40 | 41 | # Data Preparatopn 42 | # ================================================== 43 | 44 | # Load data 45 | print("Loading data...") 46 | x_text, y = data_helpers.load_data_and_labels() 47 | 48 | # Build vocabulary 49 | max_document_length = max([len(x.split(" ")) for x in 
x_text]) 50 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length) 51 | x = np.array(list(vocab_processor.fit_transform(x_text))) 52 | 53 | # Randomly shuffle data 54 | np.random.seed(10) 55 | shuffle_indices = np.random.permutation(np.arange(len(y))) 56 | x_shuffled = x[shuffle_indices] 57 | y_shuffled = y[shuffle_indices] 58 | 59 | # Split train/test set 60 | # TODO: This is very crude, should use cross-validation 61 | x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:] 62 | y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:] 63 | 64 | print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_))) 65 | print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev))) 66 | 67 | # Training 68 | # ================================================== 69 | 70 | with tf.Graph().as_default(): 71 | session_conf = tf.ConfigProto( 72 | allow_soft_placement=FLAGS.allow_soft_placement, 73 | log_device_placement=FLAGS.log_device_placement) 74 | sess = tf.Session(config=session_conf) 75 | 76 | with sess.as_default(): 77 | model = Model( 78 | sequence_length=x_train.shape[1], 79 | num_classes=2, 80 | vocab_size=len(vocab_processor.vocabulary_), 81 | embedding_size=FLAGS.embedding_dim, 82 | hidden_size=FLAGS.hidden_dim, 83 | filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 84 | num_filters=FLAGS.num_filters, 85 | l2_reg_lambda=FLAGS.l2_reg_lambda) 86 | 87 | # Define Training procedure 88 | global_step = tf.Variable(0, name="global_step", trainable=False) 89 | optimizer = tf.train.AdamOptimizer(0.001) 90 | grads_and_vars = optimizer.compute_gradients(model.loss) 91 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 92 | 93 | # Keep track of gradient values and sparsity (optional) 94 | grad_summaries = [] 95 | for g, v in grads_and_vars: 96 | if g is not None: 97 | grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g) 98 | sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 99 | grad_summaries.append(grad_hist_summary) 100 | grad_summaries.append(sparsity_summary) 101 | grad_summaries_merged = tf.merge_summary(grad_summaries) 102 | 103 | # Output directory for models and summaries 104 | timestamp = str(int(time.time())) 105 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 106 | print("Writing to {}\n".format(out_dir)) 107 | 108 | # Summaries for loss and accuracy 109 | loss_summary = tf.scalar_summary("loss", model.loss) 110 | acc_summary = tf.scalar_summary("accuracy", model.accuracy) 111 | 112 | # Train Summaries 113 | train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged]) 114 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 115 | train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph) 116 | 117 | # Dev summaries 118 | dev_summary_op = tf.merge_summary([loss_summary, acc_summary]) 119 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 120 | dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph) 121 | 122 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 123 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 124 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 125 | if not os.path.exists(checkpoint_dir): 126 | os.makedirs(checkpoint_dir) 127 | saver = tf.train.Saver(tf.all_variables()) 128 | 129 | # Write vocabulary 130 | vocab_processor.save(os.path.join(out_dir, "vocab")) 131 | 132 | # Initialize all variables 133 | sess.run(tf.initialize_all_variables()) 134 | 135 | if FLAGS.word2vec: 136 | # Initialize matrix with random uniform distribution 137 | initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim)) 138 | # Load any vectors from word2vec 139 | print("Load word2vec file {}\n".format(FLAGS.word2vec)) 140 | with open(FLAGS.word2vec, "rb") as f: 141 | header = f.readline() 142 | vocab_size, layer1_size = map(int, header.split()) 143 | binary_len = np.dtype('float32').itemsize * layer1_size 144 | 145 | for line in xrange(vocab_size): 146 | word = [] 147 | while True: 148 | ch = f.read(1) 149 | if ch == ' ': 150 | word = ''.join(word) 151 | break 152 | if ch != '\n': 153 | word.append(ch) 154 | 155 | idx = vocab_processor.vocabulary_.get(word) 156 | if idx != 0: 157 | initW[idx] = np.fromstring(f.read(binary_len), dtype='float32') 158 | else: 159 | f.read(binary_len) 160 | 161 | sess.run(model.W.assign(initW)) 162 | 163 | def train_step(x_batch, y_batch): 164 | """ 165 | A single training step 166 | """ 167 | feed_dict = { 168 | model.input_x: x_batch, 169 | model.input_y: y_batch, 170 | model.dropout_keep_prob: FLAGS.dropout_keep_prob 171 | } 172 | _, step, summaries, loss, accuracy = sess.run( 173 | [train_op, global_step, train_summary_op, model.loss, model.accuracy], 174 | feed_dict) 175 | 176 | time_str = datetime.datetime.now().isoformat() 177 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 178 | train_summary_writer.add_summary(summaries, step) 179 | 180 | def dev_step(x_batch, y_batch, writer=None): 181 | """ 182 | Evaluates model on a dev set 183 | """ 184 | feed_dict = { 185 | model.input_x: x_batch, 186 | model.input_y: y_batch, 187 | model.dropout_keep_prob: 1.0 188 | } 189 | step, summaries, loss, accuracy = sess.run( 190 | [global_step, dev_summary_op, model.loss, model.accuracy], 191 | feed_dict) 192 | 193 | time_str = datetime.datetime.now().isoformat() 194 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 195 | if writer: 196 | writer.add_summary(summaries, step) 197 | 198 | # Generate batches 199 | batches = data_helpers.batch_iter( 200 | list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs) 201 | 202 | # Training loop 203 | for batch in batches: 204 | x_batch, y_batch = zip(*batch) 205 | train_step(x_batch, y_batch) 206 | current_step = tf.train.global_step(sess, global_step) 207 | 208 | if current_step % FLAGS.evaluate_every == 0: 209 | print("\nEvaluation:") 210 | dev_step(x_dev, y_dev, writer=dev_summary_writer) 211 | print("") 212 | 213 | if current_step % FLAGS.checkpoint_every == 0: 214 | path = saver.save(sess, checkpoint_prefix, global_step=current_step) 215 | print("Saved model checkpoint to {}\n".format(path)) 216 | -------------------------------------------------------------------------------- /cnn-model/cnn_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | 5 | class TextCNN(object): 6 | """ 
7 | A CNN for text classification. 8 | Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer. 9 | """ 10 | def __init__( 11 | self, sequence_length, num_classes, vocab_size, 12 | embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0): 13 | 14 | # Placeholders for input, output and dropout 15 | self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x") 16 | self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y") 17 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 18 | 19 | # Keeping track of l2 regularization loss (optional) 20 | l2_loss = tf.constant(0.0) 21 | 22 | # Embedding layer 23 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 24 | self.W = tf.Variable( 25 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 26 | trainable=True, 27 | name="W") 28 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 29 | self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1) 30 | 31 | # Create a convolution + maxpool layer for each filter size 32 | pooled_outputs = [] 33 | for i, filter_size in enumerate(filter_sizes): 34 | with tf.name_scope("conv-maxpool-%s" % filter_size): 35 | # Convolution Layer 36 | filter_shape = [filter_size, embedding_size, 1, num_filters] 37 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 38 | b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") 39 | conv = tf.nn.conv2d( 40 | self.embedded_chars_expanded, 41 | W, 42 | strides=[1, 1, 1, 1], 43 | padding="VALID", 44 | name="conv") 45 | # Apply nonlinearity 46 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 47 | # Maxpooling over the outputs 48 | pooled = tf.nn.max_pool( 49 | h, 50 | ksize=[1, sequence_length - filter_size + 1, 1, 1], 51 | strides=[1, 1, 1, 1], 52 | padding='VALID', 53 | name="pool") 54 | pooled_outputs.append(pooled) 55 | 56 | # Combine all the pooled features 57 | num_filters_total = num_filters * len(filter_sizes) 58 | self.h_pool = tf.concat(3, pooled_outputs) 59 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 60 | 61 | # Add dropout 62 | with tf.name_scope("dropout"): 63 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 64 | 65 | # Final (unnormalized) scores and predictions 66 | with tf.name_scope("output"): 67 | # Standard output weights initialization 68 | W = tf.get_variable( 69 | "W", 70 | shape=[num_filters_total, num_classes], 71 | initializer=tf.contrib.layers.xavier_initializer()) 72 | b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b") 73 | 74 | # # Initialized output weights to 0.0, might improve accuracy 75 | # W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W") 76 | # b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b") 77 | 78 | l2_loss += tf.nn.l2_loss(W) 79 | l2_loss += tf.nn.l2_loss(b) 80 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 81 | self.predictions = tf.argmax(self.scores, 1, name="predictions") 82 | 83 | # CalculateMean cross-entropy loss 84 | with tf.name_scope("loss"): 85 | losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y) 86 | self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss 87 | 88 | # Accuracy 89 | with tf.name_scope("accuracy"): 90 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1)) 91 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 92 | 
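# Minimal usage sketch, mirroring how cnn-model/train.py drives this class:
# build the graph, then feed integer word-id matrices and one-hot labels
# through the placeholders. The hyperparameter values and the random toy
# batch are illustrative only, not part of the original training setup.
if __name__ == "__main__":
    sequence_length, num_classes, vocab_size = 56, 2, 20000

    with tf.Graph().as_default(), tf.Session() as sess:
        cnn = TextCNN(
            sequence_length=sequence_length,
            num_classes=num_classes,
            vocab_size=vocab_size,
            embedding_size=128,
            filter_sizes=[3, 4, 5],
            num_filters=100,
            l2_reg_lambda=0.0)
        sess.run(tf.initialize_all_variables())

        # Toy batch: random word ids and one-hot labels, shaped like real data
        x_batch = np.random.randint(0, vocab_size, size=(4, sequence_length))
        y_batch = np.eye(num_classes)[np.random.randint(0, num_classes, size=4)]

        loss, acc = sess.run(
            [cnn.loss, cnn.accuracy],
            {cnn.input_x: x_batch,
             cnn.input_y: y_batch,
             cnn.dropout_keep_prob: 1.0})
        print("toy batch: loss {:g}, acc {:g}".format(loss, acc))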
-------------------------------------------------------------------------------- /cnn-model/data_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | 6 | 7 | def clean_str(string): 8 | """ 9 | Tokenization/string cleaning for all datasets except for SST. 10 | Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py 11 | """ 12 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 13 | string = re.sub(r"\'s", " \'s", string) 14 | string = re.sub(r"\'ve", " \'ve", string) 15 | string = re.sub(r"n\'t", " n\'t", string) 16 | string = re.sub(r"\'re", " \'re", string) 17 | string = re.sub(r"\'d", " \'d", string) 18 | string = re.sub(r"\'ll", " \'ll", string) 19 | string = re.sub(r",", " , ", string) 20 | string = re.sub(r"!", " ! ", string) 21 | string = re.sub(r"\(", " \( ", string) 22 | string = re.sub(r"\)", " \) ", string) 23 | string = re.sub(r"\?", " \? ", string) 24 | string = re.sub(r"\s{2,}", " ", string) 25 | return string.strip().lower() 26 | 27 | 28 | def load_data_and_labels(): 29 | """ 30 | Loads MR polarity data from files, splits the data into words and generates labels. 31 | Returns split sentences and labels. 32 | """ 33 | # Load data from files 34 | positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines()) 35 | positive_examples = [s.strip() for s in positive_examples] 36 | negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines()) 37 | negative_examples = [s.strip() for s in negative_examples] 38 | # Split by words 39 | x_text = positive_examples + negative_examples 40 | x_text = [clean_str(sent) for sent in x_text] 41 | # Generate labels 42 | positive_labels = [[0, 1] for _ in positive_examples] 43 | negative_labels = [[1, 0] for _ in negative_examples] 44 | y = np.concatenate([positive_labels, negative_labels], 0) 45 | return [x_text, y] 46 | 47 | 48 | def batch_iter(data, batch_size, num_epochs, shuffle=True): 49 | """ 50 | Generates a batch iterator for a dataset. 51 | """ 52 | data = np.array(data) 53 | data_size = len(data) 54 | num_batches_per_epoch = int(len(data)/batch_size) + 1 55 | for epoch in range(num_epochs): 56 | # Shuffle the data at each epoch 57 | if shuffle: 58 | shuffle_indices = np.random.permutation(np.arange(data_size)) 59 | shuffled_data = data[shuffle_indices] 60 | else: 61 | shuffled_data = data 62 | for batch_num in range(num_batches_per_epoch): 63 | start_index = batch_num * batch_size 64 | end_index = min((batch_num + 1) * batch_size, data_size) 65 | yield shuffled_data[start_index:end_index] 66 | -------------------------------------------------------------------------------- /cnn-model/eval.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from cnn_model import TextCNN 10 | from tensorflow.contrib import learn 11 | 12 | # Parameters 13 | # ================================================== 14 | 15 | # Eval Parameters 16 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 17 | tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run") 18 | tf.flags.DEFINE_boolean("eval_train", False, "Evaluate on all training data") 19 | 20 | # Misc Parameters 21 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 22 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 23 | 24 | 25 | FLAGS = tf.flags.FLAGS 26 | FLAGS._parse_flags() 27 | print("\nParameters:") 28 | for attr, value in sorted(FLAGS.__flags.items()): 29 | print("{}={}".format(attr.upper(), value)) 30 | print("") 31 | 32 | # CHANGE THIS: Load data. Load your own data here 33 | if FLAGS.eval_train: 34 | x_raw, y_test = data_helpers.load_data_and_labels() 35 | y_test = np.argmax(y_test, axis=1) 36 | else: 37 | x_raw = ["a masterpiece four years in the making", "everything is off."] 38 | y_test = [1, 0] 39 | 40 | # Map data into vocabulary 41 | vocab_path = os.path.join(FLAGS.checkpoint_dir, "..", "vocab") 42 | vocab_processor = learn.preprocessing.VocabularyProcessor.restore(vocab_path) 43 | x_test = np.array(list(vocab_processor.transform(x_raw))) 44 | 45 | print("\nEvaluating...\n") 46 | 47 | # Evaluation 48 | # ================================================== 49 | checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir) 50 | graph = tf.Graph() 51 | with graph.as_default(): 52 | session_conf = tf.ConfigProto( 53 | allow_soft_placement=FLAGS.allow_soft_placement, 54 | log_device_placement=FLAGS.log_device_placement) 55 | sess = tf.Session(config=session_conf) 56 | with sess.as_default(): 57 | # Load the saved meta graph and restore variables 58 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 59 | saver.restore(sess, checkpoint_file) 60 | 61 | # Get the placeholders from the graph by name 62 | input_x = graph.get_operation_by_name("input_x").outputs[0] 63 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 64 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 65 | 66 | # Tensors we want to evaluate 67 | predictions = graph.get_operation_by_name("output/predictions").outputs[0] 68 | 69 | # Generate batches for one epoch 70 | batches = data_helpers.batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False) 71 | 72 | # Collect the predictions here 73 | all_predictions = [] 74 | 75 | for x_test_batch in batches: 76 | batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0}) 77 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 78 | 79 | # Print accuracy if y_test is defined 80 | if y_test is not None: 81 | correct_predictions = float(sum(all_predictions == y_test)) 82 | print("Total number of test examples: {}".format(len(y_test))) 83 | print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) 84 | -------------------------------------------------------------------------------- /cnn-model/train.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from cnn_model import TextCNN 10 | from tensorflow.contrib import learn 11 | 12 | # Parameters 13 | # ================================================== 14 | 15 | # Model Hyperparameters 16 | tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings (default: None)") 17 | tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of character embedding (default: 300)") 18 | tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')") 19 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 100)") 20 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)") 21 | tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularizaion lambda (default: 0.15)") 22 | 23 | # Training parameters 24 | tf.flags.DEFINE_integer("batch_size", 50, "Batch Size (default: 50)") 25 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of training epochs (default: 25)") 26 | tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)") 27 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)") 28 | # Misc Parameters 29 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 30 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 31 | 32 | FLAGS = tf.flags.FLAGS 33 | FLAGS._parse_flags() 34 | print("\nParameters:") 35 | for attr, value in sorted(FLAGS.__flags.items()): 36 | print("{}={}".format(attr.upper(), value)) 37 | print("") 38 | 39 | 40 | # Data Preparatopn 41 | # ================================================== 42 | 43 | # Load data 44 | print("Loading data...") 45 | x_text, y = data_helpers.load_data_and_labels() 46 | 47 | # Build vocabulary 48 | max_document_length = max([len(x.split(" ")) for x in x_text]) 49 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length) 50 | x = np.array(list(vocab_processor.fit_transform(x_text))) 51 | 52 | # Randomly shuffle data 53 | np.random.seed(10) 54 | shuffle_indices = np.random.permutation(np.arange(len(y))) 55 | x_shuffled = x[shuffle_indices] 56 | y_shuffled = y[shuffle_indices] 57 | 58 | # Split train/test set 59 | # TODO: This is very crude, should use cross-validation 60 | x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:] 61 | y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:] 62 | print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_))) 63 | print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev))) 64 | 65 | 66 | # Training 67 | # ================================================== 68 | 69 | with tf.Graph().as_default(): 70 | session_conf = tf.ConfigProto( 71 | allow_soft_placement=FLAGS.allow_soft_placement, 72 | log_device_placement=FLAGS.log_device_placement) 73 | sess = tf.Session(config=session_conf) 74 | with sess.as_default(): 75 | cnn = TextCNN( 76 | sequence_length=x_train.shape[1], 77 | num_classes=2, 78 | vocab_size=len(vocab_processor.vocabulary_), 79 | embedding_size=FLAGS.embedding_dim, 80 | filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 81 | num_filters=FLAGS.num_filters, 82 | l2_reg_lambda=FLAGS.l2_reg_lambda) 83 | 84 | # Define Training procedure 85 | global_step = tf.Variable(0, name="global_step", trainable=False) 86 
| optimizer = tf.train.AdamOptimizer(0.001) 87 | grads_and_vars = optimizer.compute_gradients(cnn.loss) 88 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 89 | 90 | # Keep track of gradient values and sparsity (optional) 91 | grad_summaries = [] 92 | for g, v in grads_and_vars: 93 | if g is not None: 94 | grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g) 95 | sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 96 | grad_summaries.append(grad_hist_summary) 97 | grad_summaries.append(sparsity_summary) 98 | grad_summaries_merged = tf.merge_summary(grad_summaries) 99 | 100 | # Output directory for models and summaries 101 | timestamp = str(int(time.time())) 102 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 103 | print("Writing to {}\n".format(out_dir)) 104 | 105 | # Summaries for loss and accuracy 106 | loss_summary = tf.scalar_summary("loss", cnn.loss) 107 | acc_summary = tf.scalar_summary("accuracy", cnn.accuracy) 108 | 109 | # Train Summaries 110 | train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged]) 111 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 112 | train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph) 113 | 114 | # Dev summaries 115 | dev_summary_op = tf.merge_summary([loss_summary, acc_summary]) 116 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 117 | dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph) 118 | 119 | # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it 120 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 121 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 122 | if not os.path.exists(checkpoint_dir): 123 | os.makedirs(checkpoint_dir) 124 | saver = tf.train.Saver(tf.all_variables()) 125 | 126 | # Write vocabulary 127 | vocab_processor.save(os.path.join(out_dir, "vocab")) 128 | 129 | # Initialize all variables 130 | sess.run(tf.initialize_all_variables()) 131 | 132 | if FLAGS.word2vec: 133 | # Initialize matrix with random uniform distribution 134 | initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim)) 135 | # Load any vectors from word2vec 136 | print("Load word2vec file {}\n".format(FLAGS.word2vec)) 137 | with open(FLAGS.word2vec, "rb") as f: 138 | header = f.readline() 139 | vocab_size, layer1_size = map(int, header.split()) 140 | binary_len = np.dtype('float32').itemsize * layer1_size 141 | 142 | for line in xrange(vocab_size): 143 | word = [] 144 | while True: 145 | ch = f.read(1) 146 | if ch == ' ': 147 | word = ''.join(word) 148 | break 149 | if ch != '\n': 150 | word.append(ch) 151 | 152 | idx = vocab_processor.vocabulary_.get(word) 153 | if idx != 0: 154 | initW[idx] = np.fromstring(f.read(binary_len), dtype='float32') 155 | else: 156 | f.read(binary_len) 157 | 158 | sess.run(cnn.W.assign(initW)) 159 | 160 | def train_step(x_batch, y_batch): 161 | """ 162 | A single training step 163 | """ 164 | feed_dict = { 165 | cnn.input_x: x_batch, 166 | cnn.input_y: y_batch, 167 | cnn.dropout_keep_prob: FLAGS.dropout_keep_prob 168 | } 169 | _, step, summaries, loss, accuracy = sess.run( 170 | [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy], 171 | feed_dict) 172 | time_str = datetime.datetime.now().isoformat() 173 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 
174 | train_summary_writer.add_summary(summaries, step) 175 | 176 | def dev_step(x_batch, y_batch, writer=None): 177 | """ 178 | Evaluates model on a dev set 179 | """ 180 | feed_dict = { 181 | cnn.input_x: x_batch, 182 | cnn.input_y: y_batch, 183 | cnn.dropout_keep_prob: 1.0 184 | } 185 | step, summaries, loss, accuracy = sess.run( 186 | [global_step, dev_summary_op, cnn.loss, cnn.accuracy], 187 | feed_dict) 188 | time_str = datetime.datetime.now().isoformat() 189 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 190 | if writer: 191 | writer.add_summary(summaries, step) 192 | 193 | # Generate batches 194 | batches = data_helpers.batch_iter( 195 | list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs) 196 | # Training loop. For each batch... 197 | for batch in batches: 198 | x_batch, y_batch = zip(*batch) 199 | train_step(x_batch, y_batch) 200 | current_step = tf.train.global_step(sess, global_step) 201 | if current_step % FLAGS.evaluate_every == 0: 202 | print("\nEvaluation:") 203 | dev_step(x_dev, y_dev, writer=dev_summary_writer) 204 | print("") 205 | if current_step % FLAGS.checkpoint_every == 0: 206 | path = saver.save(sess, checkpoint_prefix, global_step=current_step) 207 | print("Saved model checkpoint to {}\n".format(path)) 208 | -------------------------------------------------------------------------------- /data_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | 6 | 7 | def clean_str(string): 8 | """ 9 | Tokenization/string cleaning 10 | """ 11 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 12 | string = re.sub(r"\'s", " \'s", string) 13 | string = re.sub(r"\'ve", " \'ve", string) 14 | string = re.sub(r"n\'t", " n\'t", string) 15 | string = re.sub(r"\'re", " \'re", string) 16 | string = re.sub(r"\'d", " \'d", string) 17 | string = re.sub(r"\'ll", " \'ll", string) 18 | string = re.sub(r",", " , ", string) 19 | string = re.sub(r"!", " ! ", string) 20 | string = re.sub(r"\(", " \( ", string) 21 | string = re.sub(r"\)", " \) ", string) 22 | string = re.sub(r"\?", " \? ", string) 23 | string = re.sub(r"\s{2,}", " ", string) 24 | 25 | return string.strip().lower() 26 | 27 | 28 | def load_data_and_labels(): 29 | """ 30 | Loads polarity data from files, splits the data into words and generates labels. 31 | Returns split sentences and labels. 32 | """ 33 | 34 | # Load data from files 35 | positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines()) 36 | positive_examples = [s.strip() for s in positive_examples] 37 | negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines()) 38 | negative_examples = [s.strip() for s in negative_examples] 39 | 40 | # Split by words 41 | x_text = positive_examples + negative_examples 42 | x_text = [clean_str(sent) for sent in x_text] 43 | 44 | # Generate labels 45 | positive_labels = [[0, 1] for _ in positive_examples] 46 | negative_labels = [[1, 0] for _ in negative_examples] 47 | y = np.concatenate([positive_labels, negative_labels], 0) 48 | 49 | # Generate sequence lengths 50 | seqlen = np.array([len(sent.split(" ")) for sent in x_text]) 51 | 52 | return [x_text, y, seqlen] 53 | 54 | 55 | def batch_iter(data, seqlen_data, batch_size, num_epochs, shuffle=True): 56 | """ 57 | Generates a batch iterator for a dataset. 
58 | """ 59 | 60 | data = np.array(data) 61 | data_size = len(data) 62 | num_batches_per_epoch = int(len(data)/batch_size) + 1 63 | 64 | for epoch in range(num_epochs): 65 | # Shuffle the data at each epoch 66 | if shuffle: 67 | shuffle_indices = np.random.permutation(np.arange(data_size)) 68 | shuffled_data = data[shuffle_indices] 69 | else: 70 | shuffled_data = data 71 | 72 | for batch_num in range(num_batches_per_epoch): 73 | start_index = batch_num * batch_size 74 | end_index = min((batch_num + 1) * batch_size, data_size) 75 | 76 | seqlen_batch = seqlen_data[start_index:end_index] 77 | 78 | yield shuffled_data[start_index:end_index], seqlen_batch 79 | #TODO: Problem with seqlens 80 | 81 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from text_lstm import TextLSTM 10 | from tensorflow.contrib import learn 11 | 12 | 13 | # Parameters 14 | # ================================================== 15 | 16 | # Eval Parameters 17 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 18 | tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run") 19 | tf.flags.DEFINE_boolean("eval_train", False, "Evaluate on all training data") 20 | 21 | # Misc Parameters 22 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 23 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 24 | 25 | 26 | FLAGS = tf.flags.FLAGS 27 | FLAGS._parse_flags() 28 | print("\nParameters:") 29 | for attr, value in sorted(FLAGS.__flags.items()): 30 | print("{}={}".format(attr.upper(), value)) 31 | print("") 32 | 33 | # Load new datasets here 34 | if FLAGS.eval_train: 35 | x_raw, y_test, seqlen_test = data_helpers.load_data_and_labels() 36 | y_test = np.argmax(y_test, axis=1) 37 | else: 38 | x_raw = ["a masterpiece four years in the making", "everything is off."] 39 | y_test = [1, 0] 40 | seqlen_test = [7, 3] 41 | 42 | # Map data into vocabulary 43 | vocab_path = os.path.join(FLAGS.checkpoint_dir, "..", "vocab") 44 | vocab_processor = learn.preprocessing.VocabularyProcessor.restore(vocab_path) 45 | x_test = np.array(list(vocab_processor.transform(x_raw))) 46 | 47 | print("\nEvaluating...\n") 48 | 49 | # Evaluation 50 | # ================================================== 51 | checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir) 52 | graph = tf.Graph() 53 | with graph.as_default(): 54 | session_conf = tf.ConfigProto( 55 | allow_soft_placement=FLAGS.allow_soft_placement, 56 | log_device_placement=FLAGS.log_device_placement) 57 | sess = tf.Session(config=session_conf) 58 | 59 | with sess.as_default(): 60 | # Load the saved meta graph and restore variables 61 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 62 | saver.restore(sess, checkpoint_file) 63 | 64 | # Get the placeholders from the graph by name 65 | input_x = graph.get_operation_by_name("input_x").outputs[0] 66 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 67 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 68 | 69 | # Tensors we want to evaluate 70 | predictions = graph.get_operation_by_name("output/predictions").outputs[0] 71 | 72 | # Generate batches for one epoch 73 | batches = 
data_helpers.batch_iter(list(x_test), seqlen_test, FLAGS.batch_size, 1, shuffle=False) 74 | 75 | # Collect the predictions here 76 | all_predictions = [] 77 | 78 | for x_test_batch, seqlen_batch in batches: 79 | batch_predictions = sess.run( 80 | predictions, {input_x: x_test_batch, seqlen: seqlen_batch, dropout_keep_prob: 1.0}) 81 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 82 | 83 | # Print accuracy if y_test is defined 84 | if y_test is not None: 85 | correct_predictions = float(sum(all_predictions == y_test)) 86 | print("Total number of test examples: {}".format(len(y_test))) 87 | print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) 88 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | from tensorflow.python.ops import array_ops 4 | 5 | 6 | class Model(object): 7 | def __init__( 8 | self, 9 | sequence_length, 10 | num_classes, 11 | vocab_size, 12 | embedding_size, 13 | hidden_size, 14 | filter_sizes, 15 | num_filters, 16 | l2_reg_lambda=0.0): 17 | 18 | # Placeholders for input, sequence length, output and dropout 19 | self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x") 20 | self.seqlen = tf.placeholder(tf.int64, [None], name="seqlen") 21 | self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y") 22 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 23 | 24 | # Keeping track of l2 regularization loss (optional) 25 | l2_loss = tf.constant(0.0) 26 | 27 | 28 | # Embedding layer 29 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 30 | self.W = tf.Variable( 31 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 32 | trainable=True, 33 | name="W") 34 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 35 | #TODO: Embeddings process ignores commas etc. so seqlens might not be accurate for sentences with commas... 
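        # Shape note: embedded_chars is [batch_size, sequence_length, embedding_size].
        # seqlen holds each sentence's unpadded token count so the dynamic RNNs
        # below can stop at the true sentence end instead of running over padding.
        # The TODO above refers to seqlen being computed from the cleaned text in
        # data_helpers.py, which may not line up token-for-token with the id
        # sequence produced by the vocabulary processor.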
36 | 37 | 38 | # Bidirectional LSTM layer 39 | with tf.name_scope("bidirectional-lstm"): 40 | lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 41 | lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0) 42 | 43 | # self.lstm_outputs, _, _ = tf.nn.bidirectional_dynamic_rnn( 44 | # lstm_fw_cell, 45 | # lstm_bw_cell, 46 | # self.embedded_chars, 47 | # sequence_length=self.seqlen, 48 | # dtype=tf.float32) 49 | # lstm_outputs_fw, lstm_outputs_bw = tf.split(value=self.lstm_outputs, split_dim=2, num_split=2) 50 | # self.lstm_outputs = tf.add(lstm_outputs_fw, lstm_outputs_bw, name="lstm_outputs") 51 | 52 | with tf.variable_scope("lstm-output-fw"): 53 | self.lstm_outputs_fw, _ = tf.nn.dynamic_rnn( 54 | lstm_fw_cell, 55 | self.embedded_chars, 56 | sequence_length=self.seqlen, 57 | dtype=tf.float32) 58 | 59 | with tf.variable_scope("lstm-output-bw"): 60 | self.embedded_chars_rev = array_ops.reverse_sequence(self.embedded_chars, seq_lengths=self.seqlen, seq_dim=1) 61 | tmp, _ = tf.nn.dynamic_rnn( 62 | lstm_bw_cell, 63 | self.embedded_chars_rev, 64 | sequence_length=self.seqlen, 65 | dtype=tf.float32) 66 | self.lstm_outputs_bw = array_ops.reverse_sequence(tmp, seq_lengths=self.seqlen, seq_dim=1) 67 | 68 | # Concatenate outputs 69 | self.lstm_outputs = tf.add(self.lstm_outputs_fw, self.lstm_outputs_bw, name="lstm_outputs") 70 | 71 | self.lstm_outputs_expanded = tf.expand_dims(self.lstm_outputs, -1) 72 | 73 | 74 | # Convolution + maxpool layer for each filter size 75 | pooled_outputs = [] 76 | for i, filter_size in enumerate(filter_sizes): 77 | with tf.name_scope("conv-maxpool-%s" % filter_size): 78 | # Convolution Layer 79 | filter_shape = [filter_size, hidden_size, 1, num_filters] 80 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 81 | b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") 82 | 83 | conv = tf.nn.conv2d( 84 | self.lstm_outputs_expanded, 85 | W, 86 | strides=[1, 1, 1, 1], 87 | padding="VALID", 88 | name="conv") 89 | 90 | # Apply nonlinearity 91 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 92 | 93 | # Maxpooling over the outputs 94 | pooled = tf.nn.max_pool( 95 | h, 96 | ksize=[1, sequence_length - filter_size + 1, 1, 1], 97 | strides=[1, 1, 1, 1], 98 | padding='VALID', 99 | name="pool") 100 | pooled_outputs.append(pooled) 101 | 102 | # Combine all the pooled features 103 | num_filters_total = num_filters * len(filter_sizes) 104 | self.h_pool = tf.concat(3, pooled_outputs) 105 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 106 | 107 | 108 | # Dropout layer 109 | with tf.name_scope("dropout"): 110 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 111 | 112 | 113 | # Final (unnormalized) scores and predictions 114 | with tf.name_scope("output"): 115 | # Standard output weights initialization 116 | W = tf.get_variable( 117 | "W", 118 | shape=[num_filters_total, num_classes], 119 | initializer=tf.contrib.layers.xavier_initializer()) 120 | b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b") 121 | 122 | # # Initialized output weights to 0.0, might improve accuracy 123 | # W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W") 124 | # b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b") 125 | 126 | l2_loss += tf.nn.l2_loss(W) 127 | l2_loss += tf.nn.l2_loss(b) 128 | 129 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 130 | self.predictions = tf.argmax(self.scores, 1, name="predictions") 131 | 
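        # Shape summary: each conv-maxpool branch maps lstm_outputs_expanded
        # [batch, sequence_length, hidden_size, 1] to
        # [batch, sequence_length - filter_size + 1, 1, num_filters] via the
        # VALID convolution, then max-pools over time to [batch, 1, 1, num_filters].
        # Concatenating the branches and flattening gives h_pool_flat of shape
        # [batch, num_filters * len(filter_sizes)], which the output layer maps
        # to scores [batch, num_classes] and integer predictions [batch].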
132 | # Calculate mean cross-entropy loss 133 | with tf.name_scope("loss"): 134 | losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y) 135 | self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss 136 | 137 | # Accuracy 138 | with tf.name_scope("accuracy"): 139 | correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1)) 140 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 141 | -------------------------------------------------------------------------------- /res/acc-val.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/acc-val.png -------------------------------------------------------------------------------- /res/acc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/acc.png -------------------------------------------------------------------------------- /res/bidirectional-rnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/bidirectional-rnn.png -------------------------------------------------------------------------------- /res/cnn-128.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/cnn-128.png -------------------------------------------------------------------------------- /res/loss-val.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/loss-val.png -------------------------------------------------------------------------------- /res/loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/loss.png -------------------------------------------------------------------------------- /res/lstm+cnn-128.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/lstm+cnn-128.png -------------------------------------------------------------------------------- /res/lstm+cnn-300.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/res/lstm+cnn-300.png -------------------------------------------------------------------------------- /runs/cnn-128/events.out.tfevents.1483714098.FYP6: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/runs/cnn-128/events.out.tfevents.1483714098.FYP6 -------------------------------------------------------------------------------- /runs/lstm+cnn-128/events.out.tfevents.1483625861.FYP6: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/runs/lstm+cnn-128/events.out.tfevents.1483625861.FYP6 -------------------------------------------------------------------------------- /runs/lstm+cnn-300/events.out.tfevents.1483786544.FYP6: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaitjo/lstm-context-embeddings/ab5894eb727ede8daa394ebe6e87735a6207f292/runs/lstm+cnn-300/events.out.tfevents.1483786544.FYP6 -------------------------------------------------------------------------------- /tflearn/cnn.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function, absolute_import 2 | 3 | import tensorflow as tf 4 | import tflearn 5 | from tflearn.data_utils import to_categorical, pad_sequences 6 | from tflearn.datasets import imdb 7 | from tflearn.layers.core import input_data, dropout, fully_connected 8 | from tflearn.layers.embedding_ops import embedding 9 | # from tflearn.layers.recurrent import bidirectional_rnn, BasicLSTMCell 10 | from tflearn.layers.merge_ops import merge 11 | from tflearn.layers.conv import conv_1d, global_max_pool 12 | from tflearn.layers.estimator import regression 13 | 14 | 15 | tf.flags.DEFINE_integer("maxlen", 100, "Maximum Sentence Length") 16 | tf.flags.DEFINE_integer("vocab_size", 10000, "Size of Vocabulary") 17 | tf.flags.DEFINE_integer("embedding_dim", 128, "Word Embedding Size") 18 | # tf.flags.DEFINE_integer("rnn_hidden_size", 128, "Size of biRNN hidden layer") 19 | tf.flags.DEFINE_integer("num_filters", 128, "Number of CNN filters") 20 | tf.flags.DEFINE_float("dropout_prob", 0.5, "Dropout Probability") 21 | tf.flags.DEFINE_float("learning_rate", 0.001, "Learning Rate") 22 | tf.flags.DEFINE_integer("batch_size", 32, "Batch Size") 23 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of Training Epochs") 24 | 25 | FLAGS = tf.flags.FLAGS 26 | FLAGS._parse_flags() 27 | print("\nParameters:") 28 | for attr, value in sorted(FLAGS.__flags.items()): 29 | print("{}={}".format(attr.upper(), value)) 30 | print("") 31 | 32 | maxlen = FLAGS.maxlen 33 | vocab_size = FLAGS.vocab_size 34 | embedding_dim = FLAGS.embedding_dim 35 | # rnn_hidden_size = FLAGS.rnn_hidden_size 36 | num_filters = FLAGS.num_filters 37 | dropout_prob = FLAGS.dropout_prob 38 | learning_rate = FLAGS.learning_rate 39 | batch_size = FLAGS.batch_size 40 | num_epochs = FLAGS.num_epochs 41 | 42 | 43 | # IMDB Dataset loading 44 | train, test, _ = imdb.load_data(path='imdb.pkl', n_words=vocab_size, valid_portion=0.1) 45 | trainX, trainY = train 46 | testX, testY = test 47 | 48 | # Sequence padding 49 | trainX = pad_sequences(trainX, maxlen=maxlen, value=0.) 50 | testX = pad_sequences(testX, maxlen=maxlen, value=0.) 
51 | 52 | # Converting labels to binary vectors 53 | trainY = to_categorical(trainY, nb_classes=2) 54 | testY = to_categorical(testY, nb_classes=2) 55 | 56 | 57 | # Building network 58 | network = input_data(shape=[None, maxlen], name='input') 59 | 60 | network = embedding( 61 | network, 62 | input_dim=vocab_size, 63 | output_dim=embedding_dim, 64 | trainable=True) 65 | 66 | # network = bidirectional_rnn( 67 | # network, 68 | # BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 69 | # BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 70 | # return_seq=True, 71 | # dynamic=True) 72 | # network = tf.pack(network, axis=1) 73 | 74 | # fw_outputs, bw_outputs = tf.split(split_dim=2, num_split=2, value=network) 75 | # network = tf.add(fw_outputs, bw_outputs) 76 | 77 | branch1 = conv_1d(network, num_filters, 3, padding='valid', activation='relu', regularizer="L2") 78 | branch2 = conv_1d(network, num_filters, 4, padding='valid', activation='relu', regularizer="L2") 79 | branch3 = conv_1d(network, num_filters, 5, padding='valid', activation='relu', regularizer="L2") 80 | 81 | network = merge([branch1, branch2, branch3], mode='concat', axis=1) 82 | 83 | network = tf.expand_dims(network, 2) 84 | 85 | network = global_max_pool(network) 86 | 87 | network = dropout(network, dropout_prob) 88 | 89 | network = fully_connected(network, 2, activation='softmax') 90 | 91 | network = regression( 92 | network, 93 | optimizer='adam', 94 | learning_rate=learning_rate, 95 | loss='categorical_crossentropy', 96 | name='target') 97 | 98 | 99 | # Training 100 | model = tflearn.DNN(network, tensorboard_verbose=0, tensorboard_dir='runs') 101 | model.fit( 102 | trainX, 103 | trainY, 104 | validation_set=(testX, testY), 105 | n_epoch = num_epochs, 106 | shuffle=True, 107 | show_metric=True, 108 | batch_size=batch_size) 109 | -------------------------------------------------------------------------------- /tflearn/model.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function, absolute_import 2 | 3 | import tensorflow as tf 4 | import tflearn 5 | from tflearn.data_utils import to_categorical, pad_sequences 6 | from tflearn.datasets import imdb 7 | from tflearn.layers.core import input_data, dropout, fully_connected 8 | from tflearn.layers.embedding_ops import embedding 9 | from tflearn.layers.recurrent import bidirectional_rnn, BasicLSTMCell 10 | from tflearn.layers.merge_ops import merge 11 | from tflearn.layers.conv import conv_1d, global_max_pool 12 | from tflearn.layers.estimator import regression 13 | 14 | 15 | tf.flags.DEFINE_integer("maxlen", 100, "Maximum Sentence Length") 16 | tf.flags.DEFINE_integer("vocab_size", 10000, "Size of Vocabulary") 17 | tf.flags.DEFINE_integer("embedding_dim", 128, "Word Embedding Size") 18 | tf.flags.DEFINE_integer("rnn_hidden_size", 128, "Size of biRNN hidden layer") 19 | tf.flags.DEFINE_integer("num_filters", 128, "Number of CNN filters") 20 | tf.flags.DEFINE_float("dropout_prob", 0.5, "Dropout Probability") 21 | tf.flags.DEFINE_float("learning_rate", 0.001, "Learning Rate") 22 | tf.flags.DEFINE_integer("batch_size", 32, "Batch Size") 23 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of Training Epochs") 24 | 25 | FLAGS = tf.flags.FLAGS 26 | FLAGS._parse_flags() 27 | print("\nParameters:") 28 | for attr, value in sorted(FLAGS.__flags.items()): 29 | print("{}={}".format(attr.upper(), value)) 30 | print("") 31 | 32 | maxlen = FLAGS.maxlen 33 | 
vocab_size = FLAGS.vocab_size 34 | embedding_dim = FLAGS.embedding_dim 35 | rnn_hidden_size = FLAGS.rnn_hidden_size 36 | num_filters = FLAGS.num_filters 37 | dropout_prob = FLAGS.dropout_prob 38 | learning_rate = FLAGS.learning_rate 39 | batch_size = FLAGS.batch_size 40 | num_epochs = FLAGS.num_epochs 41 | 42 | 43 | # IMDB Dataset loading 44 | train, test, _ = imdb.load_data(path='imdb.pkl', n_words=vocab_size, valid_portion=0.1) 45 | trainX, trainY = train 46 | testX, testY = test 47 | 48 | # Sequence padding 49 | trainX = pad_sequences(trainX, maxlen=maxlen, value=0.) 50 | testX = pad_sequences(testX, maxlen=maxlen, value=0.) 51 | 52 | # Converting labels to binary vectors 53 | trainY = to_categorical(trainY, nb_classes=2) 54 | testY = to_categorical(testY, nb_classes=2) 55 | 56 | 57 | # Building network 58 | network = input_data(shape=[None, maxlen], name='input') 59 | 60 | network = embedding( 61 | network, 62 | input_dim=vocab_size, 63 | output_dim=embedding_dim, 64 | trainable=True) 65 | 66 | network = bidirectional_rnn( 67 | network, 68 | BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 69 | BasicLSTMCell(rnn_hidden_size, activation='tanh', inner_activation='sigmoid'), 70 | return_seq=True, 71 | dynamic=True) 72 | network = tf.pack(network, axis=1) 73 | 74 | fw_outputs, bw_outputs = tf.split(split_dim=2, num_split=2, value=network) 75 | network = tf.add(fw_outputs, bw_outputs) 76 | 77 | branch1 = conv_1d(network, num_filters, 3, padding='valid', activation='relu', regularizer="L2") 78 | branch2 = conv_1d(network, num_filters, 4, padding='valid', activation='relu', regularizer="L2") 79 | branch3 = conv_1d(network, num_filters, 5, padding='valid', activation='relu', regularizer="L2") 80 | 81 | network = merge([branch1, branch2, branch3], mode='concat', axis=1) 82 | 83 | network = tf.expand_dims(network, 2) 84 | 85 | network = global_max_pool(network) 86 | 87 | network = dropout(network, dropout_prob) 88 | 89 | network = fully_connected(network, 2, activation='softmax') 90 | 91 | network = regression( 92 | network, 93 | optimizer='adam', 94 | learning_rate=learning_rate, 95 | loss='categorical_crossentropy', 96 | name='target') 97 | 98 | 99 | # Training 100 | model = tflearn.DNN(network, tensorboard_verbose=0, tensorboard_dir='runs') 101 | model.fit( 102 | trainX, 103 | trainY, 104 | validation_set=(testX, testY), 105 | n_epoch = num_epochs, 106 | shuffle=True, 107 | show_metric=True, 108 | batch_size=batch_size) 109 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python
2 |
3 | import tensorflow as tf
4 | import numpy as np
5 | import os
6 | import time
7 | import datetime
8 | import data_helpers
9 | from model import Model
10 | from tensorflow.contrib import learn
11 |
12 | # Parameters
13 | # ==================================================
14 |
15 | # Model Hyperparameters
16 | tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings (default: None)")
17 | tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of word embedding (default: 300)")
18 | tf.flags.DEFINE_integer("hidden_dim", 150, "Dimensionality of hidden layer in LSTM (default: 150)")
19 | tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
20 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 100)")
21 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
22 | tf.flags.DEFINE_float("l2_reg_lambda", 0.15, "L2 regularization lambda (default: 0.15)")
23 |
24 | # Training parameters
25 | tf.flags.DEFINE_integer("batch_size", 50, "Batch Size (default: 50)")
26 | tf.flags.DEFINE_integer("num_epochs", 25, "Number of training epochs (default: 25)")
27 | tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")
28 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")
29 |
30 | # Misc Parameters
31 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow soft device placement")
32 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")
33 |
34 | FLAGS = tf.flags.FLAGS
35 | FLAGS._parse_flags()
36 | print("\nParameters:")
37 | for attr, value in sorted(FLAGS.__flags.items()):
38 | print("{}={}".format(attr.upper(), value))
39 | print("")
40 |
41 | # Data Preparation
42 | # ==================================================
43 |
44 | # Load data
45 | print("Loading data...")
46 | x_text, y, seqlen = data_helpers.load_data_and_labels()
47 |
48 | # Build vocabulary
49 | max_document_length = max(seqlen)
50 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
51 | x = np.array(list(vocab_processor.fit_transform(x_text)))
52 |
53 | # Randomly shuffle data
54 | np.random.seed(10)
55 | shuffle_indices = np.random.permutation(np.arange(len(y)))
56 | x_shuffled = x[shuffle_indices]
57 | y_shuffled = y[shuffle_indices]
58 | seqlen_shuffled = seqlen[shuffle_indices]
59 |
60 | # Split train/dev set
61 | # TODO: This is very crude, should use cross-validation
62 | x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:]
63 | y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:]
64 | seqlen_train, seqlen_dev = seqlen_shuffled[:-1000], seqlen_shuffled[-1000:]
65 |
66 | print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))
67 | print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))
68 |
69 | # Training
70 | # ==================================================
71 |
72 | with tf.Graph().as_default():
73 | session_conf = tf.ConfigProto(
74 | allow_soft_placement=FLAGS.allow_soft_placement,
75 | log_device_placement=FLAGS.log_device_placement)
76 | sess = tf.Session(config=session_conf)
77 |
78 | with sess.as_default():
79 | model = Model(
80 | sequence_length=x_train.shape[1],
81 | num_classes=2,
82 | vocab_size=len(vocab_processor.vocabulary_),
83 | embedding_size=FLAGS.embedding_dim,
84 | hidden_size=FLAGS.hidden_dim,
85 | 
filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 86 | num_filters=FLAGS.num_filters, 87 | l2_reg_lambda=FLAGS.l2_reg_lambda) 88 | 89 | # Define Training procedure 90 | global_step = tf.Variable(0, name="global_step", trainable=False) 91 | optimizer = tf.train.AdamOptimizer(0.001) 92 | grads_and_vars = optimizer.compute_gradients(model.loss) 93 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 94 | 95 | # Keep track of gradient values and sparsity (optional) 96 | grad_summaries = [] 97 | for g, v in grads_and_vars: 98 | if g is not None: 99 | grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g) 100 | sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 101 | grad_summaries.append(grad_hist_summary) 102 | grad_summaries.append(sparsity_summary) 103 | grad_summaries_merged = tf.merge_summary(grad_summaries) 104 | 105 | # Output directory for models and summaries 106 | timestamp = str(int(time.time())) 107 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 108 | print("Writing to {}\n".format(out_dir)) 109 | 110 | # Summaries for loss and accuracy 111 | loss_summary = tf.scalar_summary("loss", model.loss) 112 | acc_summary = tf.scalar_summary("accuracy", model.accuracy) 113 | 114 | # Train Summaries 115 | train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged]) 116 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 117 | train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph) 118 | 119 | # Dev summaries 120 | dev_summary_op = tf.merge_summary([loss_summary, acc_summary]) 121 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 122 | dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph) 123 | 124 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 125 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 126 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 127 | if not os.path.exists(checkpoint_dir): 128 | os.makedirs(checkpoint_dir) 129 | saver = tf.train.Saver(tf.global_variables()) 130 | 131 | # Write vocabulary 132 | vocab_processor.save(os.path.join(out_dir, "vocab")) 133 | 134 | # Initialize all variables 135 | sess.run(tf.global_variables_initializer()) 136 | 137 | if FLAGS.word2vec: 138 | # Initialize matrix with random uniform distribution 139 | initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim)) 140 | # Load any vectors from word2vec 141 | print("Load word2vec file {}\n".format(FLAGS.word2vec)) 142 | with open(FLAGS.word2vec, "rb") as f: 143 | header = f.readline() 144 | vocab_size, layer1_size = map(int, header.split()) 145 | binary_len = np.dtype('float32').itemsize * layer1_size 146 | 147 | for line in range(vocab_size): 148 | word = [] 149 | while True: 150 | ch = f.read(1) 151 | if ch == ' ': 152 | word = ''.join(word) 153 | break 154 | if ch != '\n': 155 | word.append(ch) 156 | 157 | idx = vocab_processor.vocabulary_.get(word) 158 | if idx != 0: 159 | initW[idx] = np.fromstring(f.read(binary_len), dtype='float32') 160 | else: 161 | f.read(binary_len) 162 | 163 | sess.run(model.W.assign(initW)) 164 | 165 | def train_step(x_batch, seqlen_batch, y_batch): 166 | """ 167 | A single training step 168 | """ 169 | feed_dict = { 170 | model.input_x: x_batch, 171 | model.seqlen: seqlen_batch, 172 | model.input_y: y_batch, 173 | model.dropout_keep_prob: FLAGS.dropout_keep_prob 174 | } 175 | _, step, summaries, loss, accuracy = sess.run( 176 | [train_op, global_step, train_summary_op, model.loss, model.accuracy], 177 | feed_dict) 178 | 179 | time_str = datetime.datetime.now().isoformat() 180 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 181 | train_summary_writer.add_summary(summaries, step) 182 | 183 | def dev_step(x_batch, seqlen_batch, y_batch, writer=None): 184 | """ 185 | Evaluates model on a dev set 186 | """ 187 | feed_dict = { 188 | model.input_x: x_batch, 189 | model.seqlen: seqlen_batch, 190 | model.input_y: y_batch, 191 | model.dropout_keep_prob: 1.0 192 | } 193 | step, summaries, loss, accuracy = sess.run( 194 | [global_step, dev_summary_op, model.loss, model.accuracy], 195 | feed_dict) 196 | 197 | time_str = datetime.datetime.now().isoformat() 198 | print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy)) 199 | if writer: 200 | writer.add_summary(summaries, step) 201 | 202 | # Generate batches 203 | batches = data_helpers.batch_iter( 204 | list(zip(x_train, y_train)), seqlen_train, FLAGS.batch_size, FLAGS.num_epochs) 205 | 206 | # Training loop. For each batch... 207 | for batch, seqlen_batch in batches: 208 | x_batch, y_batch = zip(*batch) 209 | train_step(x_batch, seqlen_batch, y_batch) 210 | current_step = tf.train.global_step(sess, global_step) 211 | 212 | if current_step % FLAGS.evaluate_every == 0: 213 | print("\nEvaluation:") 214 | dev_step(x_dev, seqlen_dev, y_dev, writer=dev_summary_writer) 215 | print("") 216 | 217 | if current_step % FLAGS.checkpoint_every == 0: 218 | path = saver.save(sess, checkpoint_prefix, global_step=current_step) 219 | print("Saved model checkpoint to {}\n".format(path)) 220 | --------------------------------------------------------------------------------
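A note on the word2vec loading block in `train.py` above: the file is opened in binary mode, so if the script is run under Python 3, `f.read(1)` returns `bytes` and the comparisons against `' '` and `'\n'` never match, which keeps the inner word-reading loop from terminating. Below is a minimal byte-safe sketch of that loading routine under the same binary word2vec format; `load_word2vec_binary`, `vocab_index`, and `vocab_len` are illustrative names introduced here, not helpers from this repository.
```
import numpy as np

def load_word2vec_binary(path, vocab_index, vocab_len, embedding_dim):
    """Byte-safe sketch of a binary word2vec reader (hypothetical helper, not part of this repo)."""
    # Start from the same random-uniform initialization used in train.py, so words
    # absent from the word2vec file keep a random embedding.
    initW = np.random.uniform(-0.25, 0.25, (vocab_len, embedding_dim)).astype(np.float32)
    with open(path, "rb") as f:
        header = f.readline()
        w2v_vocab_size, layer1_size = map(int, header.split())
        binary_len = np.dtype("float32").itemsize * layer1_size
        for _ in range(w2v_vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" ":   # compare against bytes, not str
                    break
                if ch != b"\n":
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="ignore")
            vector = np.frombuffer(f.read(binary_len), dtype=np.float32)
            idx = vocab_index(word)
            if idx != 0:         # mirrors the idx != 0 check in train.py (0 = unknown/padding row)
                initW[idx] = vector
    return initW
```
In `train.py`, `vocab_index` would correspond to `vocab_processor.vocabulary_.get` and `vocab_len` to `len(vocab_processor.vocabulary_)`, after which the existing `sess.run(model.W.assign(initW))` call can be reused unchanged.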