├── README.md
├── conv_net_classes.py
├── conv_net_sentence.py
├── process_data.py
├── rt-polarity.neg
└── rt-polarity.pos

/README.md:
--------------------------------------------------------------------------------
## Convolutional Neural Networks for Sentence Classification
Code for the paper [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882) (EMNLP 2014).

Runs the model on Pang and Lee's movie review dataset (MR in the paper).
Please cite the original paper when using the data.

### Requirements
Code is written in Python (2.7) and requires Theano (0.7).

Using the pre-trained `word2vec` vectors also requires downloading the binary file from
https://code.google.com/p/word2vec/


### Data Preprocessing
To process the raw data, run

```
python process_data.py path
```

where `path` points to the word2vec binary file (i.e. the `GoogleNews-vectors-negative300.bin` file).
This will create a pickle object called `mr.p` in the same folder, which contains the dataset
in the right format.

Note: This will create the dataset with different fold assignments than those used in the paper.
You should still get a CV score of >81% with the CNN-nonstatic model, though.
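
To sanity-check the processed data, you can load the pickle and inspect it. The snippet below is a minimal sketch; it assumes `process_data.py` stores the parsed reviews, the two word-vector matrices, the word-index map, and the vocabulary in that order, so check the script itself if the unpacking does not match.

```
import cPickle

# assumes mr.p stores [revs, W, W2, word_idx_map, vocab] -- verify against process_data.py
with open("mr.p", "rb") as f:
    revs, W, W2, word_idx_map, vocab = cPickle.load(f)

print "number of sentences: %d" % len(revs)
print "vocab size: %d" % len(vocab)
print "word2vec matrix shape:", W.shape
```
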
### Running the models (CPU)
Example commands:

```
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -rand
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -static -word2vec
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec
```

This will run the CNN-rand, CNN-static, and CNN-nonstatic models from the paper, respectively.

### Using the GPU
Using a GPU gives a good 10x to 20x speed-up, so it is highly recommended.
To use the GPU, simply change `device=cpu` to `device=gpu` (or whichever GPU you are using).
For example:
```
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec
```

### Example output
CPU output:
```
epoch: 1, training time: 219.72 secs, train perf: 81.79 %, val perf: 79.26 %
epoch: 2, training time: 219.55 secs, train perf: 82.64 %, val perf: 76.84 %
epoch: 3, training time: 219.54 secs, train perf: 92.06 %, val perf: 80.95 %
```
GPU output:
```
epoch: 1, training time: 16.49 secs, train perf: 81.80 %, val perf: 78.32 %
epoch: 2, training time: 16.12 secs, train perf: 82.53 %, val perf: 76.74 %
epoch: 3, training time: 16.16 secs, train perf: 91.87 %, val perf: 81.37 %
```

### Other Implementations
#### TensorFlow
[Denny Britz](http://www.wildml.com) has an implementation of the model in TensorFlow:

https://github.com/dennybritz/cnn-text-classification-tf

He also wrote a [nice tutorial](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow) on it, as well as a general tutorial on [CNNs for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp).

#### Torch
The [HarvardNLP](http://harvardnlp.github.io/) group has an implementation in Torch.

https://github.com/harvardnlp/sent-conv-torch

### Hyperparameters
At the time of my original experiments I did not have access to a GPU, so I could not run many different experiments.
Hence the paper is missing things such as ablation studies and performance variance, and some of the conclusions
were premature (e.g. regularization does not always seem to help).

Ye Zhang has written a [very nice paper](http://arxiv.org/abs/1510.03820) doing an extensive analysis of model variants (e.g. filter widths, k-max pooling, word2vec vs. GloVe, etc.) and their effect on performance.

--------------------------------------------------------------------------------
/conv_net_classes.py:
--------------------------------------------------------------------------------
"""
Sample code for
Convolutional Neural Networks for Sentence Classification
http://arxiv.org/pdf/1408.5882v2.pdf

Much of the code is modified from
- deeplearning.net (for ConvNet classes)
- https://github.com/mdenil/dropout (for dropout)
- https://groups.google.com/forum/#!topic/pylearn-dev/3QbKtCumAW4 (for Adadelta)
"""

import numpy
import theano.tensor.shared_randomstreams
import theano
import theano.tensor as T
from theano.tensor.signal import downsample
from theano.tensor.nnet import conv

def ReLU(x):
    y = T.maximum(0.0, x)
    return(y)
def Sigmoid(x):
    y = T.nnet.sigmoid(x)
    return(y)
def Tanh(x):
    y = T.tanh(x)
    return(y)
def Iden(x):
    y = x
    return(y)

class HiddenLayer(object):
    """
    Class for HiddenLayer
    """
    def __init__(self, rng, input, n_in, n_out, activation, W=None, b=None,
                 use_bias=False):

        self.input = input
        self.activation = activation

        if W is None:
            if activation.func_name == "ReLU":
                W_values = numpy.asarray(0.01 * rng.standard_normal(size=(n_in, n_out)), dtype=theano.config.floatX)
            else:
                W_values = numpy.asarray(rng.uniform(low=-numpy.sqrt(6. / (n_in + n_out)),
                                                     high=numpy.sqrt(6. / (n_in + n_out)),
                                                     size=(n_in, n_out)), dtype=theano.config.floatX)
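                # The +/- sqrt(6. / (n_in + n_out)) bound above is the Glorot & Bengio (2010)
                # "Xavier" uniform initialization, which keeps activation and gradient variance
                # roughly constant across layers for tanh/sigmoid units; the ReLU branch uses
                # small Gaussian weights instead.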
            W = theano.shared(value=W_values, name='W')
        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b')

        self.W = W
        self.b = b

        if use_bias:
            lin_output = T.dot(input, self.W) + self.b
        else:
            lin_output = T.dot(input, self.W)

        self.output = (lin_output if activation is None else activation(lin_output))

        # parameters of the model
        if use_bias:
            self.params = [self.W, self.b]
        else:
            self.params = [self.W]

def _dropout_from_layer(rng, layer, p):
    """p is the probability of dropping a unit
    """
    srng = theano.tensor.shared_randomstreams.RandomStreams(rng.randint(999999))
    # p=1-p because 1's indicate keep and p is prob of dropping
    mask = srng.binomial(n=1, p=1-p, size=layer.shape)
    # The cast is important because
    # int * float32 = float64 which pulls things off the gpu
    output = layer * T.cast(mask, theano.config.floatX)
    return output

class DropoutHiddenLayer(HiddenLayer):
    def __init__(self, rng, input, n_in, n_out,
                 activation, dropout_rate, use_bias, W=None, b=None):
        super(DropoutHiddenLayer, self).__init__(
                rng=rng, input=input, n_in=n_in, n_out=n_out, W=W, b=b,
                activation=activation, use_bias=use_bias)

        self.output = _dropout_from_layer(rng, self.output, p=dropout_rate)

class MLPDropout(object):
    """A multilayer perceptron with dropout"""
    def __init__(self, rng, input, layer_sizes, dropout_rates, activations, use_bias=True):

        #rectified_linear_activation = lambda x: T.maximum(0.0, x)

        # Set up all the hidden layers
        self.weight_matrix_sizes = zip(layer_sizes, layer_sizes[1:])
        self.layers = []
        self.dropout_layers = []
        self.activations = activations
        next_layer_input = input
        #first_layer = True
        # dropout the input
        next_dropout_layer_input = _dropout_from_layer(rng, input, p=dropout_rates[0])
        layer_counter = 0
        for n_in, n_out in self.weight_matrix_sizes[:-1]:
            next_dropout_layer = DropoutHiddenLayer(rng=rng,
                    input=next_dropout_layer_input,
                    activation=activations[layer_counter],
                    n_in=n_in, n_out=n_out, use_bias=use_bias,
                    dropout_rate=dropout_rates[layer_counter])
            self.dropout_layers.append(next_dropout_layer)
            next_dropout_layer_input = next_dropout_layer.output

            # Reuse the parameters from the dropout layer here, in a different
            # path through the graph.
            next_layer = HiddenLayer(rng=rng,
                    input=next_layer_input,
                    activation=activations[layer_counter],
                    # scale the weight matrix W with (1-p)
                    W=next_dropout_layer.W * (1 - dropout_rates[layer_counter]),
                    b=next_dropout_layer.b,
                    n_in=n_in, n_out=n_out,
                    use_bias=use_bias)
            self.layers.append(next_layer)
            next_layer_input = next_layer.output
            #first_layer = False
            layer_counter += 1

        # Set up the output layer
        n_in, n_out = self.weight_matrix_sizes[-1]
        dropout_output_layer = LogisticRegression(
                input=next_dropout_layer_input,
                n_in=n_in, n_out=n_out)
        self.dropout_layers.append(dropout_output_layer)

        # Again, reuse parameters in the dropout output.
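        # Why the (1 - p) scaling: on the dropout path each unit feeding a layer is kept
        # with probability (1 - p), so the expected pre-activation there is (1 - p) * W^T x.
        # Multiplying the shared weights by (1 - p) on this deterministic path reproduces
        # that expectation at test time (the standard weight-scaling approximation for
        # dropout).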
        output_layer = LogisticRegression(
            input=next_layer_input,
            # scale the weight matrix W with (1-p)
            W=dropout_output_layer.W * (1 - dropout_rates[-1]),
            b=dropout_output_layer.b,
            n_in=n_in, n_out=n_out)
        self.layers.append(output_layer)

        # Use the negative log likelihood of the logistic regression layer as
        # the objective.
        self.dropout_negative_log_likelihood = self.dropout_layers[-1].negative_log_likelihood
        self.dropout_errors = self.dropout_layers[-1].errors

        self.negative_log_likelihood = self.layers[-1].negative_log_likelihood
        self.errors = self.layers[-1].errors

        # Grab all the parameters together.
        self.params = [ param for layer in self.dropout_layers for param in layer.params ]

    def predict(self, new_data):
        next_layer_input = new_data
        for i, layer in enumerate(self.layers):
            if i