├── LICENSE
├── README.md
├── data_helpers.py
├── eval.py
├── graphpb.txt
├── input_helpers.py
├── text_cnn.py
└── train.py

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2016 Dhwaj Raj
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
23 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | This is a TensorFlow implementation, using MULTI-TASK LEARNING, of Kim's [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882) paper.
2 | 
3 | Core methods are derived from the [dennybritz implementation](https://github.com/dennybritz/cnn-text-classification-tf).
4 | The major refactoring was done to incorporate the following:
5 | - Loading pre-trained word embeddings
6 | - Loading tab-separated training text (format: `label<TAB>text`)
7 | - Training multiple binary classification tasks at once (alternating multi-task learning)
8 | 
9 | 
10 | ## CNN text classifier
11 | The following diagram depicts the deep architecture for a single binary text classification task using Convolutional Neural Networks. Image taken from Ye Zhang's paper.
12 | ![deep text classifier CNN](https://cloud.githubusercontent.com/assets/9861437/18117883/233370b8-6f6f-11e6-8409-15e7ca5a7541.png)
13 | 
14 | ## Multi-Task Learning
15 | 
16 | In alternating multi-task training, the same model is trained in turn on multiple binary classification tasks in the same language.
17 | ![multi task learning](https://cloud.githubusercontent.com/assets/9861437/18118503/d087e66a-6f72-11e6-9fd8-d157d529e2b2.png)
18 | 
19 | Multi-task training exploits the fact that different tasks in one language share language-specific regularities. The basic idea is to share part of the architecture and parameters between tasks, and to alternately train multiple objective functions, one per task. TensorFlow automatically figures out which calculations are needed for the operation you request and runs only those. This means that if we define an optimiser on only one of the tasks, it will only train the parameters required to compute that task and will leave the rest alone. Since Task 1 relies only on the Task 1 layers and the shared layers, the Task 2 layers will be untouched. A minimal code sketch of this one-optimiser-per-task setup is included below, just before the References.
20 | 
21 | 
22 | 
23 | ## Requirements
24 | 
25 | - Python 2.7 (the scripts use Python 2 `print` statements and `xrange`)
26 | - TensorFlow > 0.8, pre-1.0 (the code uses 0.x-era APIs such as `tf.initialize_all_variables` and `tf.merge_summary`)
27 | - Numpy; gensim is needed by `input_helpers.py` for loading binary word2vec files
28 | 
29 | ## Training
30 | 
31 | Print parameters:
32 | 
33 | ```bash
34 | ./train.py --help
35 | ```
36 | 
37 | ```
38 | usage: train.py [-h] [--word2vec WORD2VEC] [--embedding_dim EMBEDDING_DIM]
39 |                 [--filter_sizes FILTER_SIZES] [--filter_h_pad FILTER_H_PAD]
40 |                 [--num_filters NUM_FILTERS]
41 |                 [--dropout_keep_prob DROPOUT_KEEP_PROB]
42 |                 [--l2_reg_lambda L2_REG_LAMBDA]
43 |                 [--max_document_words MAX_DOCUMENT_WORDS]
44 |                 [--training_files TRAINING_FILES]
45 |                 [--hidden_units HIDDEN_UNITS] [--batch_size BATCH_SIZE]
46 |                 [--num_epochs NUM_EPOCHS] [--evaluate_every EVALUATE_EVERY]
47 |                 [--checkpoint_every CHECKPOINT_EVERY]
48 |                 [--allow_soft_placement [ALLOW_SOFT_PLACEMENT]]
49 |                 [--noallow_soft_placement]
50 |                 [--log_device_placement [LOG_DEVICE_PLACEMENT]]
51 |                 [--nolog_device_placement]
52 | 
53 | optional arguments:
54 |   -h, --help            show this help message and exit
55 |   --word2vec WORD2VEC   Word2vec file with pre-trained embeddings (default:
56 |                         None)
57 |   --embedding_dim EMBEDDING_DIM
58 |                         Dimensionality of character embedding (default: 300)
59 |   --filter_sizes FILTER_SIZES
60 |                         Comma-separated filter sizes (default: '2,3,4')
61 |   --filter_h_pad FILTER_H_PAD
62 |                         Pre-padding for each filter (default: 5)
63 |   --num_filters NUM_FILTERS
64 |                         Number of filters per filter size (default: 128)
65 |   --dropout_keep_prob DROPOUT_KEEP_PROB
66 |                         Dropout keep probability (default: 0.5)
67 |   --l2_reg_lambda L2_REG_LAMBDA
68 |                         L2 regularizaion lambda (default: 0.0)
69 |   --max_document_words MAX_DOCUMENT_WORDS
70 |                         Max length (left to right max words to consider) in
71 |                         every doc, else pad 0 (default: 100)
72 |   --training_files TRAINING_FILES
73 |                         Comma-separated list of training files (each file is
74 |                         tab separated format) (default: None)
75 |   --hidden_units HIDDEN_UNITS
76 |                         Number of hidden units in softmax regression layer
77 |                         (default:50)
78 |   --batch_size BATCH_SIZE
79 |                         Batch Size (default: 64)
80 |   --num_epochs NUM_EPOCHS
81 |                         Number of training epochs (default: 200)
82 |   --evaluate_every EVALUATE_EVERY
83 |                         Evaluate model on dev set after this many steps
84 |                         (default: 100)
85 |   --checkpoint_every CHECKPOINT_EVERY
86 |                         Save model after this many steps (default: 100)
87 |   --allow_soft_placement [ALLOW_SOFT_PLACEMENT]
88 |                         Allow device soft device placement
89 |   --noallow_soft_placement
90 |   --log_device_placement [LOG_DEVICE_PLACEMENT]
91 |                         Log placement of ops on devices
92 |   --nolog_device_placement
93 | 
94 | ```
95 | 
96 | Train:
97 | 
98 | ```bash
99 | ./train.py --training_files /mnt/train_task1.txt,/mnt/train_task2.txt
100 | ```
101 | 
102 | ## Evaluating
103 | 
104 | ```bash
105 | ./eval.py --eval_filepath=/mnt/test_task1.txt --vocab_filepath=./runs/1472534740/checkpoints/vocab --model=./runs/1472534740/checkpoints/model-<step>
106 | ```
107 | 
108 | Replace the paths with your own tab-separated test file and with the vocab file and model checkpoint written by your training run (`eval.py` restores the checkpoint given by `--model` and the vocabulary saved by `train.py`). To use your own data, change the `--eval_filepath` argument or adapt the `eval.py` script.
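
## Multi-task training sketch

The snippet below is a minimal, illustrative sketch of the one-optimiser-per-task pattern described in the Multi-Task Learning section above. It is not the code in this repository (`train.py` and `text_cnn.py` implement the same idea with a shared CNN and per-task softmax layers); it assumes a TensorFlow 1.x graph/session API (slightly newer than the 0.x-era calls used elsewhere in this repo), and all sizes and names are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Illustrative sizes only (not the repo's real dimensions).
feat_dim, shared_dim, num_classes, batch = 128, 64, 2, 32

x  = tf.placeholder(tf.float32, [None, feat_dim], name="x")
y1 = tf.placeholder(tf.float32, [None, num_classes], name="y1")  # Task 1 labels
y2 = tf.placeholder(tf.float32, [None, num_classes], name="y2")  # Task 2 labels

# Shared layer: its weights receive gradients from both tasks.
W_s = tf.Variable(tf.truncated_normal([feat_dim, shared_dim], stddev=0.1))
b_s = tf.Variable(tf.zeros([shared_dim]))
shared = tf.nn.relu(tf.matmul(x, W_s) + b_s)

# Task-specific output layers: each one is touched only by its own optimiser.
W1 = tf.Variable(tf.truncated_normal([shared_dim, num_classes], stddev=0.1))
b1 = tf.Variable(tf.zeros([num_classes]))
W2 = tf.Variable(tf.truncated_normal([shared_dim, num_classes], stddev=0.1))
b2 = tf.Variable(tf.zeros([num_classes]))
logits1 = tf.matmul(shared, W1) + b1
logits2 = tf.matmul(shared, W2) + b2

loss1 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y1, logits=logits1))
loss2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y2, logits=logits2))

# One train op per task: minimizing loss1 updates only W_s, b_s, W1, b1,
# because loss1 does not depend on the Task 2 layer at all.
train_op1 = tf.train.AdamOptimizer(1e-3).minimize(loss1)
train_op2 = tf.train.AdamOptimizer(1e-3).minimize(loss2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        # Alternate between tasks: feed a batch for one task, run only its train op.
        task = np.random.randint(2)
        xb = np.random.rand(batch, feat_dim).astype(np.float32)  # dummy features
        yb = np.eye(num_classes)[np.random.randint(num_classes, size=batch)].astype(np.float32)  # dummy one-hot labels
        if task == 0:
            sess.run(train_op1, {x: xb, y1: yb})
        else:
            sess.run(train_op2, {x: xb, y2: yb})
```

Because each train op is built from a single task's loss, running it updates only the shared layer plus that task's own output layer; this is also why `train.py` can alternate between the per-task ops in `tr_op_set` while the embedding and convolution layers stay shared.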
109 | 110 | 111 | ## References 112 | 113 | - [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882) 114 | - [A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1510.03820) 115 | - [Jonathan Godwin's explanation of multi-task learning in Tensorflow](http://www.kdnuggets.com/2016/07/multi-task-learning-tensorflow-part-1.html) 116 | - [Nice tutorial, step wise step ](https://github.com/amygdala/tensorflow-workshop/blob/master/workshop_sections/cnn_text_classification/README.md#using-convolutional-nns-for-text-classification-and-tensorboard) 117 | -------------------------------------------------------------------------------- /data_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | 6 | 7 | def clean_str(string): 8 | """ 9 | Tokenization/string cleaning for all datasets except for SST. 10 | Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py 11 | """ 12 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 13 | string = re.sub(r"\'s", " \'s", string) 14 | string = re.sub(r"\'ve", " \'ve", string) 15 | string = re.sub(r"n\'t", " n\'t", string) 16 | string = re.sub(r"\'re", " \'re", string) 17 | string = re.sub(r"\'d", " \'d", string) 18 | string = re.sub(r"\'ll", " \'ll", string) 19 | string = re.sub(r",", " , ", string) 20 | string = re.sub(r"!", " ! ", string) 21 | string = re.sub(r"\(", " \( ", string) 22 | string = re.sub(r"\)", " \) ", string) 23 | string = re.sub(r"\?", " \? ", string) 24 | string = re.sub(r"\s{2,}", " ", string) 25 | return string.strip().lower() 26 | 27 | 28 | def load_data_and_labels(): 29 | """ 30 | Loads MR polarity data from files, splits the data into words and generates labels. 31 | Returns split sentences and labels. 32 | """ 33 | # Load data from files 34 | positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines()) 35 | positive_examples = [s.strip() for s in positive_examples] 36 | negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines()) 37 | negative_examples = [s.strip() for s in negative_examples] 38 | # Split by words 39 | x_text = positive_examples + negative_examples 40 | x_text = [clean_str(sent) for sent in x_text] 41 | # Generate labels 42 | positive_labels = [[0, 1] for _ in positive_examples] 43 | negative_labels = [[1, 0] for _ in negative_examples] 44 | y = np.concatenate([positive_labels, negative_labels], 0) 45 | return [x_text, y] 46 | 47 | 48 | def batch_iter(data, batch_size, num_epochs, shuffle=True): 49 | """ 50 | Generates a batch iterator for a dataset. 
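    Shuffles the data at the start of each epoch (when shuffle=True) and yields slices
    of at most batch_size items per step. Note that num_batches_per_epoch is computed as
    int(len(data)/batch_size) + 1, so when len(data) is an exact multiple of batch_size
    the last slice of an epoch is empty; callers such as train.py guard against this
    with a `len(batch) < 1` check.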
51 | """ 52 | data = np.array(data) 53 | data_size = len(data) 54 | num_batches_per_epoch = int(len(data)/batch_size) + 1 55 | for epoch in range(num_epochs): 56 | # Shuffle the data at each epoch 57 | if shuffle: 58 | shuffle_indices = np.random.permutation(np.arange(data_size)) 59 | shuffled_data = data[shuffle_indices] 60 | else: 61 | shuffled_data = data 62 | for batch_num in range(num_batches_per_epoch): 63 | start_index = batch_num * batch_size 64 | end_index = min((batch_num + 1) * batch_size, data_size) 65 | yield shuffled_data[start_index:end_index] 66 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | import data_helpers 9 | from text_cnn import TextCNN 10 | from tensorflow.contrib import learn 11 | from input_helpers import InputHelper 12 | # Parameters 13 | # ================================================== 14 | 15 | # Eval Parameters 16 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 17 | tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run") 18 | tf.flags.DEFINE_string("eval_filepath", "/Users/dhwaj/model_dual_20400/validation.txt1", "Evaluate on this data (Default: None)") 19 | tf.flags.DEFINE_string("vocab_filepath", "/Users/dhwaj/model_dual_20400/vocab", "Load training time vocabulary (Default: None)") 20 | tf.flags.DEFINE_string("model", "/Users/dhwaj/model_dual_20400/model-142200", "Load trained model checkpoint (Default: None)") 21 | 22 | # Misc Parameters 23 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 24 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 25 | 26 | 27 | FLAGS = tf.flags.FLAGS 28 | FLAGS._parse_flags() 29 | print("\nParameters:") 30 | for attr, value in sorted(FLAGS.__flags.items()): 31 | print("{}={}".format(attr.upper(), value)) 32 | print("") 33 | 34 | if FLAGS.eval_filepath==None or FLAGS.vocab_filepath==None or FLAGS.model==None : 35 | print("Eval or Vocab filepaths are empty.") 36 | exit() 37 | 38 | # load data and map id-transform based on training time vocabulary 39 | inpH = InputHelper() 40 | x_test,y_test = inpH.getTestDataSet(FLAGS.eval_filepath, FLAGS.vocab_filepath, 600, 5) 41 | 42 | print("\nEvaluating...\n") 43 | 44 | # Evaluation 45 | # ================================================== 46 | checkpoint_file = FLAGS.model 47 | print checkpoint_file 48 | graph = tf.Graph() 49 | with graph.as_default(): 50 | session_conf = tf.ConfigProto( 51 | allow_soft_placement=FLAGS.allow_soft_placement, 52 | log_device_placement=FLAGS.log_device_placement) 53 | sess = tf.Session(config=session_conf) 54 | with sess.as_default(): 55 | # Load the saved meta graph and restore variables 56 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 57 | sess.run(tf.initialize_all_variables()) 58 | saver.restore(sess, checkpoint_file) 59 | 60 | # Get the placeholders from the graph by name 61 | input_x = graph.get_operation_by_name("input_x").outputs[0] 62 | input_y = graph.get_operation_by_name("input_y1").outputs[0] 63 | 64 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 65 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 66 | 67 | # Tensors we want to evaluate 68 | predictions = 
graph.get_operation_by_name("output1/predictions1").outputs[0] 69 | 70 | accuracy = graph.get_operation_by_name("accuracy1/accuracy1").outputs[0] 71 | #emb = graph.get_operation_by_name("embedding/W").outputs[0] 72 | #embedded_chars = tf.nn.embedding_lookup(emb,input_x) 73 | # Generate batches for one epoch 74 | batches = data_helpers.batch_iter(list(zip(x_test,y_test)), 2*FLAGS.batch_size, 1, shuffle=False) 75 | # Collect the predictions here 76 | all_predictions = [] 77 | 78 | for db in batches: 79 | x_dev_b,y_dev_b = zip(*db) 80 | batch_predictions, batch_acc = sess.run([predictions,accuracy], {input_x: x_dev_b, input_y:y_dev_b, dropout_keep_prob: 1.0}) 81 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 82 | print("DEV acc {}".format(batch_acc)) 83 | print np.argmax(y_dev_b, 1), batch_predictions 84 | 85 | 86 | y_simple = np.argmax(y_test, 1) 87 | correct_predictions = float(np.sum(all_predictions == y_simple)) 88 | print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) 89 | -------------------------------------------------------------------------------- /input_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import itertools 4 | from collections import Counter 5 | import numpy as np 6 | import time 7 | import data_helpers 8 | import gc 9 | from tensorflow.contrib import learn 10 | from gensim.models.word2vec import Word2Vec 11 | import gzip 12 | 13 | 14 | class InputHelper(object): 15 | pre_emb = dict() 16 | 17 | def loadW2V(self,emb_path, type="textgz"): 18 | print("Loading W2V data...") 19 | num_keys = 0 20 | if type=="textgz": 21 | # this seems faster than gensim non-binary load 22 | for line in gzip.open(emb_path): 23 | l = line.strip().split() 24 | self.pre_emb[l[0]]=np.asarray(l[1:]) 25 | num_keys=len(self.pre_emb) 26 | else: 27 | self.pre_emb = Word2Vec.load_word2vec_format(emb_path,binary=True) 28 | self.pre_emb.init_sims(replace=True) 29 | num_keys=len(self.pre_emb.vocab) 30 | print("loaded word2vec len ", num_keys) 31 | gc.collect() 32 | 33 | def deletePreEmb(self): 34 | self.pre_emb=dict() 35 | gc.collect() 36 | 37 | def getTsvData(self, filepath): 38 | print("Loading training data from "+filepath) 39 | x=[] 40 | y=[] 41 | for line in open(filepath): 42 | l=line.strip().split("\t") 43 | if len(l)<2: 44 | continue 45 | x.append(l[1]) 46 | v=np.array([0,1]) 47 | if l[0]=="-1" or l[0]=="0": 48 | v=np.array([1,0]) 49 | y.append(v) 50 | return np.asarray(x),np.asarray(y) 51 | 52 | def getUnlabelData(self, filepath): 53 | print("Loading unlabelled data from "+filepath) 54 | x=[] 55 | for line in open(filepath): 56 | l=line.strip() 57 | if len(l)<1: 58 | continue 59 | x.append(l) 60 | return np.asarray(x) 61 | 62 | def dumpValidation(self,x_text,y,shuffled_index,dev_idx,i): 63 | print("dumping validation "+str(i)) 64 | x_shuffled=x_text[shuffled_index] 65 | y_shuffled=y[shuffled_index] 66 | x_dev=x_shuffled[dev_idx:] 67 | y_dev=y_shuffled[dev_idx:] 68 | del x_shuffled 69 | del y_shuffled 70 | with open('validation.txt'+str(i),'w') as f: 71 | for text,label in zip(x_dev,y_dev): 72 | f.write(str(label)+"\t"+text+"\n") 73 | f.close() 74 | del x_dev 75 | del y_dev 76 | 77 | # Data Preparatopn 78 | # ================================================== 79 | 80 | 81 | def getDataSets(self, training_paths, max_document_length, filter_h_pad, percent_dev, batch_size): 82 | x_list=[] 83 | y_list=[] 84 | multi_train_size = len(training_paths) 85 | for i in 
xrange(multi_train_size): 86 | x_temp,y_temp = self.getTsvData(training_paths[i]) 87 | x_list.append(x_temp) 88 | y_list.append(y_temp) 89 | del x_temp 90 | del y_temp 91 | gc.collect() 92 | # Build vocabulary 93 | print("Building vocabulary") 94 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length-filter_h_pad,min_frequency=1) 95 | vocab_processor.fit_transform(np.concatenate(x_list,axis=0)) 96 | print("Length of loaded vocabulary ={}".format( len(vocab_processor.vocabulary_))) 97 | i1=0 98 | train_set=[] 99 | dev_set=[] 100 | sum_no_of_batches = 0 101 | for x_text,y in zip(x_list, y_list): 102 | x = np.asarray(list(vocab_processor.transform(x_text))) 103 | x = np.concatenate((np.zeros((len(x),filter_h_pad)),x),axis=1) 104 | # Randomly shuffle data 105 | np.random.seed(10) 106 | shuffle_indices = np.random.permutation(np.arange(len(y))) 107 | x_shuffled = x[shuffle_indices] 108 | y_shuffled = y[shuffle_indices] 109 | dev_idx = -1*len(y_shuffled)//percent_dev 110 | self.dumpValidation(x_text,y,shuffle_indices,dev_idx,i1) 111 | del x 112 | del x_text 113 | del y 114 | # Split train/test set 115 | # TODO: This is very crude, should use cross-validation 116 | x_train, x_dev = x_shuffled[:dev_idx], x_shuffled[dev_idx:] 117 | y_train, y_dev = y_shuffled[:dev_idx], y_shuffled[dev_idx:] 118 | print("Train/Dev split for {}: {:d}/{:d}".format(training_paths[i1], len(y_train), len(y_dev))) 119 | sum_no_of_batches = sum_no_of_batches+(len(y_train)//batch_size) 120 | train_set.append((x_train,y_train)) 121 | dev_set.append((x_dev,y_dev)) 122 | del x_shuffled 123 | del y_shuffled 124 | del x_train 125 | del x_dev 126 | i1=i1+1 127 | del x_list 128 | del y_list 129 | gc.collect() 130 | return train_set,dev_set,vocab_processor,sum_no_of_batches 131 | 132 | def getTestDataSet(self, data_path, vocab_path, max_document_length, filter_h_pad): 133 | x_temp,y = self.getTsvData(data_path) 134 | 135 | # Build vocabulary 136 | vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length-filter_h_pad,min_frequency=1) 137 | vocab_processor = vocab_processor.restore(vocab_path) 138 | print len(vocab_processor.vocabulary_) 139 | 140 | x = np.asarray(list(vocab_processor.transform(x_temp))) 141 | x = np.concatenate((np.zeros((len(x),filter_h_pad)),x),axis=1) 142 | # Randomly shuffle data 143 | del x_temp 144 | del vocab_processor 145 | gc.collect() 146 | return x, y 147 | 148 | -------------------------------------------------------------------------------- /text_cnn.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | 5 | class TextCNN(object): 6 | """ 7 | A CNN for text classification. 8 | Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer. 
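    For multi-task use, the embedding and convolution/max-pooling layers are shared
    across all `multi_size` tasks, while each task i gets its own label placeholder
    ("input_y<i>"), its own hidden and softmax output layers (name scope "output<i>"),
    and its own loss ("loss<i>") and accuracy ("accuracy<i>") ops.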
9 | """ 10 | def __init__( 11 | self, sequence_length, num_classes, multi_size, vocab_size, 12 | embedding_size, filter_sizes, num_filters, hidden_units, l2_reg_lambda, retrain_emb): 13 | 14 | # Placeholders for input, output and dropout 15 | self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x") 16 | self.input_y=[] 17 | for i in xrange(multi_size): 18 | self.input_y.append(tf.placeholder(tf.float32, [None, num_classes], name="input_y"+str(i))) 19 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 20 | 21 | # Keeping track of l2 regularization loss (optional) 22 | l2_loss = [] 23 | for i in xrange(multi_size): 24 | l2_loss.append(tf.constant(0.0, name="l2_loss"+str(i))) 25 | # Embedding layer 26 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 27 | self.W = tf.Variable( 28 | tf.constant(0.0, shape=[vocab_size, embedding_size]), 29 | trainable=False,name="W") 30 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 31 | self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1) 32 | 33 | # Create a convolution + maxpool layer for each filter size 34 | pooled_outputs = [] 35 | for i, filter_size in enumerate(filter_sizes): 36 | with tf.name_scope("conv-maxpool-%s" % filter_size): 37 | # Convolution Layer 38 | filter_shape = [filter_size, embedding_size, 1, num_filters] 39 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 40 | b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") 41 | conv = tf.nn.conv2d( 42 | self.embedded_chars_expanded, 43 | W, 44 | strides=[1, 1, 1, 1], 45 | padding="VALID", 46 | name="conv") 47 | # Apply nonlinearity 48 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 49 | # Maxpooling over the outputs 50 | pooled = tf.nn.max_pool( 51 | h, 52 | ksize=[1, sequence_length - filter_size + 1, 1, 1], 53 | strides=[1, 1, 1, 1], 54 | padding='VALID', 55 | name="pool") 56 | pooled_outputs.append(pooled) 57 | 58 | # Combine all the pooled features 59 | num_filters_total = num_filters * len(filter_sizes) 60 | self.h_pool = tf.concat(3, pooled_outputs) 61 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 62 | 63 | # Add dropout 64 | with tf.name_scope("dropout"): 65 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 66 | 67 | # Final (unnormalized) scores and predictions 68 | self.scores=[] 69 | self.predictions=[] 70 | self.loss=[] 71 | self.accuracy=[] 72 | 73 | for i in xrange(multi_size): 74 | with tf.name_scope("output"+str(i)): 75 | W = tf.get_variable( 76 | "W"+str(i), 77 | shape=[num_filters_total, hidden_units], 78 | initializer=tf.contrib.layers.xavier_initializer()) 79 | b = tf.Variable(tf.constant(0.1, shape=[hidden_units]), name="b"+str(i)) 80 | l2_loss[i] += tf.nn.l2_loss(W) 81 | l2_loss[i] += tf.nn.l2_loss(b) 82 | inference =tf.nn.softmax(tf.nn.bias_add(tf.matmul(self.h_drop,W), b), name="softmax"+str(i)) 83 | inference = tf.nn.dropout(inference, self.dropout_keep_prob) 84 | W2 = tf.get_variable( 85 | "W2"+str(i), 86 | shape=[hidden_units, num_classes], 87 | initializer=tf.contrib.layers.xavier_initializer()) 88 | b2 = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b2"+str(i)) 89 | l2_loss[i] += tf.nn.l2_loss(W2) 90 | l2_loss[i] += tf.nn.l2_loss(b2) 91 | 92 | self.scores.append(tf.nn.xw_plus_b(inference, W2, b2, name="scores"+str(i))) 93 | self.predictions.append(tf.argmax(tf.nn.softmax(self.scores[i]), 1, name="predictions"+str(i))) 94 | 95 | for i in xrange(multi_size): 96 | with 
tf.name_scope("loss"+str(i)): 97 | losses = tf.nn.softmax_cross_entropy_with_logits(self.scores[i], self.input_y[i]) 98 | self.loss.append(tf.reduce_mean(losses) + l2_reg_lambda * l2_loss[i]) 99 | 100 | # Accuracy 101 | for i in xrange(multi_size): 102 | with tf.name_scope("accuracy"+str(i)): 103 | correct_predictions = tf.equal(self.predictions[i], tf.argmax(self.input_y[i], 1)) 104 | self.accuracy.append(tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy"+str(i))) 105 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | import re 6 | import os 7 | import time 8 | import datetime 9 | import data_helpers 10 | import gc 11 | from input_helpers import InputHelper 12 | from text_cnn import TextCNN 13 | from tensorflow.contrib import learn 14 | from gensim.models.word2vec import Word2Vec 15 | import gzip 16 | # Parameters 17 | # ================================================== 18 | 19 | tf.flags.DEFINE_string("word2vec", "GoogleNews-vectors-negative300.singles.gz", "Word2vec file with pre-trained embeddings (default: None)") 20 | tf.flags.DEFINE_string("word2vec_format", "textgz", "Word2vec pretrained file format. textgz: gzipped text | bin: binary format (default: textgz)") 21 | tf.flags.DEFINE_boolean("word2vec_trainable", False, "Allow modification of w2v embedding weights (True/False)") 22 | tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of character embedding (default: 300)") 23 | tf.flags.DEFINE_string("filter_sizes", "2,3,4", "Comma-separated filter sizes (default: '2,3,4')") 24 | tf.flags.DEFINE_string("filter_h_pad", 5, "Pre-padding for each filter (default: 5)") 25 | tf.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)") 26 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)") 27 | tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularizaion lambda (default: 0.0)") 28 | tf.flags.DEFINE_integer("max_document_words", 600, "Max length (left to right max words to consider) in every doc, else pad 0 (default: 100)") 29 | tf.flags.DEFINE_string("training_files", None, "Comma-separated list of training files (each file is tab separated format) (default: None)") 30 | tf.flags.DEFINE_integer("hidden_units", 50, "Number of hidden units in softmax regression layer (default:50)") 31 | 32 | # Training parameters 33 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 34 | tf.flags.DEFINE_integer("num_epochs", 300, "Number of training epochs (default: 200)") 35 | tf.flags.DEFINE_integer("evaluate_every", 200, "Evaluate model on dev set after this many steps (default: 100)") 36 | tf.flags.DEFINE_integer("checkpoint_every", 200, "Save model after this many steps (default: 100)") 37 | # Misc Parameters 38 | tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement") 39 | tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices") 40 | 41 | FLAGS = tf.flags.FLAGS 42 | FLAGS._parse_flags() 43 | print("\nParameters:") 44 | for attr, value in sorted(FLAGS.__flags.items()): 45 | print("{}={}".format(attr.upper(), value)) 46 | print("") 47 | 48 | if FLAGS.training_files==None: 49 | print "Input Files List is empty. use --training_files argument." 
50 | exit() 51 | 52 | training_paths=FLAGS.training_files.split(",") 53 | 54 | multi_train_size = len(training_paths) 55 | max_document_length = FLAGS.max_document_words 56 | 57 | inpH = InputHelper() 58 | train_set, dev_set, vocab_processor,sum_no_of_batches = inpH.getDataSets(training_paths, max_document_length, FLAGS.filter_h_pad, 10, FLAGS.batch_size) 59 | inpH.loadW2V(FLAGS.word2vec, FLAGS.word2vec_format) 60 | # Training 61 | # ================================================== 62 | print("starting graph def") 63 | with tf.Graph().as_default(): 64 | session_conf = tf.ConfigProto( 65 | allow_soft_placement=FLAGS.allow_soft_placement, 66 | log_device_placement=FLAGS.log_device_placement) 67 | sess = tf.Session(config=session_conf) 68 | print("started session") 69 | with sess.as_default(): 70 | cnn = TextCNN( 71 | sequence_length=max_document_length, 72 | num_classes=2, 73 | multi_size = multi_train_size, 74 | vocab_size=len(vocab_processor.vocabulary_), 75 | embedding_size=FLAGS.embedding_dim, 76 | filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 77 | num_filters=FLAGS.num_filters, 78 | hidden_units=FLAGS.hidden_units, 79 | l2_reg_lambda=FLAGS.l2_reg_lambda, 80 | retrain_emb=FLAGS.word2vec_trainable) 81 | 82 | # Define Training procedure 83 | global_step = tf.Variable(0, name="global_step", trainable=False) 84 | optimizer = tf.train.AdamOptimizer(1e-3) 85 | print("initialized cnn object") 86 | grad_set=[] 87 | tr_op_set=[] 88 | for i2 in xrange(multi_train_size): 89 | #optimizer = tf.train.AdamOptimizer(1e-3) 90 | grads_and_vars=optimizer.compute_gradients(cnn.loss[i2]) 91 | tr_op_set.append(optimizer.apply_gradients(grads_and_vars, global_step=global_step)) 92 | print("defined training_ops") 93 | # Keep track of gradient values and sparsity (optional) 94 | grad_summaries = [] 95 | for g, v in grads_and_vars: 96 | if g is not None: 97 | grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g) 98 | sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 99 | grad_summaries.append(grad_hist_summary) 100 | grad_summaries.append(sparsity_summary) 101 | grad_summaries_merged = tf.merge_summary(grad_summaries) 102 | print("defined gradient summaries") 103 | # Output directory for models and summaries 104 | timestamp = str(int(time.time())) 105 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 106 | print("Writing to {}\n".format(out_dir)) 107 | 108 | # Summaries for loss and accuracy 109 | 110 | """ 111 | loss_summary = tf.scalar_summary("loss", cnn.loss) 112 | acc_summary = tf.scalar_summary("accuracy", cnn.accuracy) 113 | 114 | # Train Summaries 115 | train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged]) 116 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 117 | train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph) 118 | train_summary_writer.flush() 119 | print("init train summary") 120 | # Dev summaries 121 | dev_summary_op = tf.merge_summary([loss_summary, acc_summary]) 122 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 123 | dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph) 124 | print("init dev summary") 125 | """ 126 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 127 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 128 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 129 | if not os.path.exists(checkpoint_dir): 130 | os.makedirs(checkpoint_dir) 131 | saver = tf.train.Saver(tf.all_variables(), max_to_keep=100) 132 | 133 | # Write vocabulary 134 | vocab_processor.save(os.path.join(checkpoint_dir, "vocab")) 135 | 136 | # Initialize all variables 137 | sess.run(tf.initialize_all_variables()) 138 | 139 | print("init all variables") 140 | graph_def = tf.get_default_graph().as_graph_def() 141 | graphpb_txt = str(graph_def) 142 | with open(os.path.join(checkpoint_dir, "graphpb.txt"), 'w') as f: 143 | f.write(graphpb_txt) 144 | 145 | if FLAGS.word2vec: 146 | # initial matrix with random uniform 147 | initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim)) 148 | # load any vectors from the word2vec 149 | print("init initW cnn.W in FLAG") 150 | for w in vocab_processor.vocabulary_._mapping: 151 | arr=[] 152 | s = re.sub('[^0-9a-zA-Z]+', '', w) 153 | if w in inpH.pre_emb: 154 | arr=inpH.pre_emb[w] 155 | elif w.lower() in inpH.pre_emb: 156 | arr=inpH.pre_emb[w.lower()] 157 | elif s in inpH.pre_emb: 158 | arr=inpH.pre_emb[s] 159 | elif s.isdigit(): 160 | arr=inpH.pre_emb["1"] 161 | if len(arr)>0: 162 | idx = vocab_processor.vocabulary_.get(w) 163 | initW[idx]=np.asarray(arr).astype(np.float32) 164 | print("assigning initW to cnn. len="+str(len(initW))) 165 | inpH.deletePreEmb() 166 | gc.collect() 167 | sess.run(cnn.W.assign(initW)) 168 | 169 | def train_step(x_batch, y_batch, typeIdx): 170 | """ 171 | A single training step 172 | """ 173 | feed_dict = { 174 | cnn.input_x: x_batch, 175 | cnn.dropout_keep_prob: FLAGS.dropout_keep_prob, 176 | } 177 | for i in xrange(multi_train_size): 178 | if i==typeIdx: 179 | feed_dict[cnn.input_y[i]] = y_batch 180 | else: 181 | feed_dict[cnn.input_y[i]] = np.zeros((len(x_batch),2)) 182 | 183 | _, step, loss, accuracy, pred = sess.run([tr_op_set[typeIdx], global_step, cnn.loss[typeIdx], cnn.accuracy[typeIdx], cnn.predictions[typeIdx]], feed_dict) 184 | time_str = datetime.datetime.now().isoformat() 185 | print("TRAIN {}: type {}, step {}, loss {:g}, acc {:g}".format(time_str, typeIdx, step, loss, accuracy)) 186 | print np.argmax(y_batch, 1), pred 187 | #train_summary_writer.add_summary(summaries, step) 188 | 189 | def dev_step(x_batch, y_batch, typeIdx, writer=None): 190 | """ 191 | Evaluates model on a dev set 192 | """ 193 | feed_dict = { 194 | cnn.input_x: x_batch, 195 | cnn.dropout_keep_prob: 1.0, 196 | } 197 | for i in xrange(multi_train_size): 198 | if i==typeIdx: 199 | feed_dict[cnn.input_y[i]] = y_batch 200 | else: 201 | feed_dict[cnn.input_y[i]] = np.zeros((len(x_batch),2)) 202 | 203 | step, loss, accuracy, pred = sess.run([global_step, cnn.loss[typeIdx], cnn.accuracy[typeIdx], cnn.predictions[typeIdx]], feed_dict) 204 | 205 | time_str = datetime.datetime.now().isoformat() 206 | print("DEV {}: type {}, step {}, loss {:g}, acc {:g}".format(time_str, typeIdx, step, loss, accuracy)) 207 | print np.argmax(y_batch, 1), pred 208 | #if writer: 209 | # writer.add_summary(summaries, step) 210 | return accuracy 211 | 212 | # Generate batches 213 | batches=[] 214 | for i in xrange(multi_train_size): 215 | batches.append(data_helpers.batch_iter( 216 | list(zip(train_set[i][0], train_set[i][1])), FLAGS.batch_size, FLAGS.num_epochs)) 217 | 218 | ptr=0 219 | max_validation_acc=0.0 220 | for 
nn in xrange(sum_no_of_batches*FLAGS.num_epochs): 221 | idx=round(np.random.uniform(low=0, high=multi_train_size)) 222 | if idx<0 or idx>multi_train_size-1: 223 | continue 224 | typeIdx = int(idx) 225 | print typeIdx 226 | batch = batches[typeIdx].next() 227 | if len(batch)<1: 228 | continue 229 | x_batch, y_batch = zip(*batch) 230 | if len(y_batch)<1: 231 | continue 232 | train_step(x_batch, y_batch,typeIdx) 233 | current_step = tf.train.global_step(sess, global_step) 234 | sum_acc=0.0 235 | if current_step % FLAGS.evaluate_every == 0: 236 | for dtypeIdx in xrange(multi_train_size): 237 | 238 | print("\nEvaluation:") 239 | dev_batches = data_helpers.batch_iter(list(zip(dev_set[dtypeIdx][0],dev_set[dtypeIdx][1])), 2*FLAGS.batch_size, 1) 240 | for db in dev_batches: 241 | if len(db)<1: 242 | continue 243 | x_dev_b,y_dev_b = zip(*db) 244 | if len(y_dev_b)<1: 245 | continue 246 | acc = dev_step(x_dev_b, y_dev_b, dtypeIdx) 247 | sum_acc = sum_acc + acc 248 | print("") 249 | if current_step % FLAGS.checkpoint_every == 0: 250 | if sum_acc >= max_validation_acc: 251 | max_validation_acc = sum_acc 252 | saver.save(sess, checkpoint_prefix, global_step=current_step) 253 | tf.train.write_graph(sess.graph.as_graph_def(), checkpoint_prefix, "graph"+str(nn)+".pb", as_text=False) 254 | print("Saved model {} with sum_accuracy={} checkpoint to {}\n".format(nn, max_validation_acc, checkpoint_prefix)) 255 | --------------------------------------------------------------------------------