├── .gitignore ├── README.md ├── about.md ├── data ├── __init__.py ├── mnist.py └── text.py ├── layers ├── __init__.py ├── activations.py ├── base.py ├── cwrnn.py ├── cwrnn_l1.py ├── cwrnn_norm.py ├── layer_utils.py ├── linear.py ├── softmax_ce_loss.py └── tests.py ├── network.py ├── train_mnist.py └── train_ptb.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | *.pyc 3 | 4 | results/ 5 | 6 | datasets/mnist.pkl.gz 7 | 8 | data/mnist.pkl.gz 9 | 10 | data/ptb.txt 11 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # lrh 2 | Learning RNN Hierarchies 3 | 4 | `about.md` has the basic details of what the code is trying to do and why. 5 | 6 | * Layers are defined in `layers/` along with tests 7 | * All layers derive from the abstract base class defined in `base.py` 8 | * All data and data-prepping scripts are in the `data/` folder 9 | * `network.py` has tools for taking in a list as a model and doing a layer-by-layer forward/backward pass, getting gradients, and setting/getting parameters 10 | * `train_ptb.py` trains a model on the Penn Tree Bank text file, which has to be placed in the `data/` folder 11 | * `train_mnist.py` trains a model on Sequential MNIST. `mnist.pkl.gz` has to be placed in the `data/` folder 12 | * As the network trains, logs are generated. Final logs and models are stored as pickle objects in `results/experiment_name`, where `experiment_name` is a string defined in the `train_` scripts 13 | 14 | ## Requires: 15 | * NumPy 16 | * SciPy (for one special function to calculate entropy) 17 | * matplotlib 18 | * climin 19 | 20 | -------------------------------------------------------------------------------- /about.md: -------------------------------------------------------------------------------- 1 | # Learning RNN Hierarchies 2 | 3 | ## Introduction 4 | 5 | Recognizing patterns in sequences requires detecting temporal **events** that occur at different levels of an abstraction hierarchy. An event is a simple, specific pattern observed over time in the sequence that is useful for recognizing a much more complex pattern. Lower-level events are due to local structure in the input stream (example: phonemes). Higher-level events can be a combination of several lower-level events and even higher-level events from the past (example: the mood of a speaker depends on the conversational tone and the words they speak). RNNs can not only see a composition of such events like a regular NN, but they can also see the overall variation of these events over arbitrary gaps in time, and hence are very powerful. 6 | 7 | In general, vanilla RNNs are not that useful because they forget events from the past (belonging to any level of abstraction). This is because the multiplicative update rule for the hidden state, repeated over all time steps, causes the memories of events to decay. The common and now successful approach to tackling this problem is to use the LSTM family of RNNs, which replaces the multiplicative update rule with an additive one. This makes the RNN prone to explosion and instability, so a protective gating mechanism is put in place. While this solves the vanishing gradients problem, a single LSTM layer won't give the best performance.
There is abundant empirical evidence that stacking LSTMs (and RNNs in general) offers better performance than a single LSTM layer with the same total memory size. If LSTMs can remember everything from the past and are already very deep in time, why stack them at all? 8 | 9 | The most intuitive and commonly given reason is that the lower RNNs specialize in local events, while the higher-level RNNs can focus on more abstract events. For example, the seq2seq architecture uses four stacked LSTMs in the encoder to compress the input sequence into a fixed-length vector. Other possible reasons include ease of optimization, a reduction in the number of parameters required per cell of memory, increased non-linear depth per time step, and more (it is still an open research question). But it is clear that stacking RNNs is essential for good performance on complex tasks. 10 | 11 | If we can simultaneously do both of the following: 12 | 13 | 1. solve the vanishing gradients problem and, at the same time, 14 | 15 | 2. make our models better at handling events at multiple levels of abstraction 16 | 17 | using a single, simpler model, such a system would be more efficient than an LSTM. The main objective of this work is to find such a model, taking inspiration from previous methods and combining them with our novel contributions. 18 | 19 | ## Background 20 | 21 |
We can split up our big RNN into multiple smaller RNN modules. A module can be either active or inactive at a particular time step. The more often a module stays inactive, the more memory retention capability it possesses - these are slow modules. The more often a module is active, the less memory retention capability it possesses - these are fast modules. Thus a combination of slower and faster RNNs can together retain memory over longer durations and make it possible to recognize patterns based on temporally distant events. 22 | 23 | There have been a few attempts to do this in the past. Here we discuss their approaches, strengths and weaknesses: 24 | 25 | 1. __Chunker/Neural History Compressor (1991):__ A stack of simple RNNs. The lowest RNN layer receives the actual inputs. Higher layers take inputs only from the layer below and feed their outputs to the layer above. Each RNN layer, starting from the lowest, is trained to predict the input it will receive at the next timestep, based on the history of inputs it has received so far. This is an unsupervised step similar to greedy autoencoder training. The main trick is to activate an RNN at a given level in proportion to the extent to which the RNN layer below it fails to predict its current input. If the lower RNN's predictions are frequently correct, the higher RNN is rarely on and thus has longer memory. The higher-level RNN is then trained to predict its own inputs from the layer below, which arrive only at the timesteps where the lower layer failed to predict. This is done iteratively over all RNNs in the stack. *Each RNN layer has now learned to expect what is unexpected to the RNN below it*. Schmidhuber calls this history compression, as predictability increases with more layers. 26 |
__Pros__: Unsupervised. Triggers for higher RNNs are event driven, meaning an RNN in a higher layer can come in when an unexpected event occurs and gather all the data it needs. 27 |
__Cons__: Local predictability is necessary, which is not always available. Cannot combine information from multiple levels effectively. Needs layer-wise pretraining followed by supervised fine-tuning. 28 | 29 | 2. __Clockwork RNN (2014)__: A supervised variant of the Chunker. The RNN modules form a cluster, and each module has a dedicated timer or clock that activates it only once every ___P___ time steps. P is chosen to form a hierarchy (for example P_i = 2^i, where i is the layer index). Further, connections are restricted to go only from slow RNNs to fast RNNs and not vice versa. This is similar to the Chunker, except that events do not trigger the RNNs; clock pulses do, and modules activate according to a predefined period (see the sketch after this list). This allows for supervised training, as the RNNs specialize automatically at their own level. 30 |
__Pros__: Supervised. Has lower complexity than a vanilla RNN due to the restricted activity and connectivity scheme. 31 |
__Cons__: The major problem is that it requires hand-engineered clock periods, which depend on the memory requirements of the task and vary widely from task to task. A lot of domain knowledge is therefore required to set up the initial hierarchy, and even then it is not trivial. If there is a large gap between the activities of two RNN modules, the slower RNN can miss the contents of the faster RNN's memory, as they decay with time. Lastly, the connection scheme between modules is weak, both intuitively and in practice. 32 | 33 |
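To make the clocking scheme concrete, here is a minimal NumPy sketch of a Clockwork-style binary activation mask. It only illustrates the scheme described above; the period rule `P_i = 2^i`, the sizes and the function name are illustrative assumptions, not code from this repo:

```python
import numpy as np

def clockwork_mask(T_max, n_modules):
    """Row t holds the on/off flags of all modules: module i fires at step t iff t % 2**i == 0."""
    periods = 2 ** np.arange(n_modules)       # P_i = 2^i, i.e. [1, 2, 4, ...]
    t = np.arange(T_max).reshape(-1, 1)       # column vector of timesteps
    return (t % periods == 0).astype(float)   # shape (T_max, n_modules)

# With 3 modules over 8 steps: module 0 is always active, module 1 fires every
# 2 steps, module 2 every 4 steps; inactive modules simply keep their old state.
mask = clockwork_mask(T_max=8, n_modules=3)
```

Such a mask is what the leaky update `H[t] = A[t] * H_new[t] + (1 - A[t]) * H_prev` in this repo's CWRNN layers consumes, except that there the activation waves are produced from learnable coefficients rather than fixed, hand-picked periods.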
34 | 35 | ## Proposed Method 36 | 37 | We cast the learning process as a combination of 38 | 39 | 1. learning to do the task and 40 | 2. learning the hierarchical, interconnected RNN architecture 41 | 42 | That is, we want to come up with a model that can: 43 | 44 | 1. Learn the hierarchy 45 | 46 | 2. Learn how the modules are interconnected 47 | 48 | 3. Learn to activate based on events 49 | 50 | The methods for the last two have been developed and tested. The first one is tricky, and this repo tries to do only that. How it is done is described below. 51 | 52 | 53 | ### Learning the hierarchy 54 |
55 | This is the most important and most challenging aspect of designing the model. Clock frequencies (the inverse of clock periods) are a very good characteristic of an RNN's position in the hierarchy. An RNN with a low frequency naturally has to depend on the contents stored in other (faster) RNNs, as it rarely gets any input. Accordingly, this low-frequency RNN learns to operate on a more abstract form of the inputs, thus forming the higher levels of the hierarchy. Conversely, fast clocks form the lower levels of the hierarchy. This makes the clock frequency a sufficient parameter to describe an RNN's position in the hierarchy. Thus, learning the clock periods of a set of RNNs is equivalent to learning the hierarchy. 56 | 57 | This is more powerful than it seems. Learning a symmetric set of clock frequencies between two sets of RNNs is, for example, equivalent to learning the seq2seq model itself (see figure below). A stack of RNNs with continuously decreasing frequencies forms an abstraction pyramid. If this is combined with another set of RNNs with continuously increasing frequencies, connected only to the topmost RNN, we have a crude seq2seq model. 58 | 59 | ![](https://cloud.githubusercontent.com/assets/8753078/11612806/46e59834-9c2e-11e5-8309-7a93aa72383c.png) 60 | 61 | (We are not claiming that this learning capability has been achieved; we are just showing the representational power.) 62 | 63 | Learning clock frequencies is not as trivial as learning just another parameter. The clocks used in the Clockwork RNN were binary clocks. If we move to a smoother version, i.e. a sine wave, we now have to learn the frequency of this sine wave. The sine wave represents the activation pattern of an RNN module in the hierarchy. 64 | 65 | Unfortunately, learning the frequency directly is not possible, because of the extremely large number of local minima. Consider the following example: the current wave frequency is 1/4, but the required wave frequency is 1/8. If the frequency decreases slightly, to say 1/5, it is actually worse than 1/4, since there is less agreement between 1/8 and 1/5 than between 1/8 and 1/4. That is, a local minimum appears wherever the clock periods align at a common multiple. Thus learning the frequency directly is not possible (we learnt this the hard way weeks before the ICML deadline). 66 | 67 | Instead of operating in the amplitude-time domain, we move to the amplitude-frequency domain. That is, we express the wave we want as a set of DCT coefficients, perform an inverse DCT to get the wave, and use it to activate the modules. The error derivatives are transferred back to the frequency domain during the backward pass using the DCT. This does not suffer from the above problem of minima (a small sketch of this transform follows the list below). 68 | 69 | This can be viewed as regularization of activities in the frequency domain. There are many ways to restrict the learnt wave to have just one major frequency component: 70 | 71 | 1. An L1 penalty over the coefficients 72 | 73 | 2. A softmax over the coefficients for discriminative choosing of frequencies. 74 | 75 |
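As a concrete illustration, here is a minimal, standalone sketch of the coefficient-to-wave transform as the softmax variant (`layers/cwrnn.py`) implements it: the learnable per-module coefficients `d` go through a softmax over periods, the result is multiplied by a fixed basis matrix `C` of binary clock waves (the repo's stand-in for DCT bases, see the note below), and the gradient is carried back through `C.T` and the softmax Jacobian. The sizes and the placeholder gradient below are illustrative; the full layer also handles the recurrent state.

```python
import numpy as np

T_max, n_modules = 16, 4                        # illustrative sizes

# Fixed basis of binary clock waves: C[t, p-1] = 1 iff timestep t is a tick of period p.
C = (np.arange(T_max).reshape(1, -1) % np.arange(1, T_max + 1).reshape(-1, 1) == 0).T * 1.0

d = np.zeros((T_max, n_modules))                # learnable coefficients, one column per module

# Forward: softmax over periods gives a distribution per module,
# and the basis turns it into an activation wave over time.
D = np.exp(d) / np.exp(d).sum(axis=0, keepdims=True)   # (T_max x n_modules)
a = C.dot(D)                                           # activation waves, (T_max x n_modules)

# Backward: given da (gradient w.r.t. the waves), map it back to the coefficients.
da = np.random.randn(*a.shape)                         # placeholder upstream gradient
dD = C.T.dot(da)                                       # transpose of the basis "transform"
dd = D * (dD - (D * dD).sum(axis=0, keepdims=True))    # softmax Jacobian-vector product
```

Swapping in actual DCT bases would, under this formulation, only change how `C` is constructed; the forward and backward passes through the transform stay the same.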
76 | 77 | The code in this repo is only for this purpose. The others are not here, but in a separate repo. They have been independently tested to work, but not as a whole unit. 78 | 79 | Note: Due to some reasons, binary clocks seemed like a better fit. So instead of DCT bases, binary bases are used and this whole "transform" is just implemented as a dot product of a vector and a matrix. 80 | 81 | -------------------------------------------------------------------------------- /data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pranv/lrh/fec5fc6355c1fee3456ef35568815759867474f8/data/__init__.py -------------------------------------------------------------------------------- /data/mnist.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import gzip, cPickle, sys 4 | 5 | def to_categorical(y): 6 | y = np.asarray(y) 7 | Y = np.zeros((len(y), 10)) 8 | for i in range(len(y)): 9 | Y[i, y[i]] = 1. 10 | return Y 11 | 12 | class loader(object): 13 | def __init__(self, batch_size=50, permuted=False): 14 | path = 'data/mnist.pkl.gz' 15 | f = gzip.open(path, 'rb') 16 | (X_train, y_train), (X_val, y_val), (X_test, y_test) = cPickle.load(f) 17 | f.close() 18 | 19 | X_train = X_train.reshape(X_train.shape[0], -1, 1) 20 | X_val = X_val.reshape(X_val.shape[0], -1, 1) 21 | X_test = X_test.reshape(X_test.shape[0], -1, 1) 22 | 23 | X_train = X_train.swapaxes(0, 1).swapaxes(1, 2) 24 | X_val = X_val.swapaxes(0, 1).swapaxes(1, 2) 25 | X_test = X_test.swapaxes(0, 1).swapaxes(1, 2) 26 | 27 | if permuted: 28 | p = range(28*28) 29 | np.random.shuffle(p) 30 | X_train = X_train[p] 31 | X_val = X_val[p] 32 | X_test = X_test[p] 33 | 34 | self.i = 0 35 | 36 | self.X_train = X_train 37 | self.X_val = X_val 38 | self.X_test = X_test 39 | self.y_train = to_categorical(y_train).T.reshape(1, 10, -1) 40 | self.y_val = to_categorical(y_val ).T.reshape(1, 10, -1) 41 | self.y_test = to_categorical(y_test).T.reshape(1, 10, -1) 42 | 43 | self.batch_size = batch_size 44 | self.permuted = permuted 45 | self.epoch = 1 46 | self.epoch_complete = False 47 | 48 | def fetch_train(self): 49 | X = self.X_train[:, :, self.i * self.batch_size: (self.i + 1) * self.batch_size] 50 | y = self.y_train[:, :, self.i * self.batch_size: (self.i + 1) * self.batch_size] 51 | self.i = (self.i + 1) 52 | if (self.i * self.batch_size) >= self.X_train.shape[2]: 53 | self.epoch_complete = True 54 | self.epoch += 1 55 | self.i = self.i % (self.X_train.shape[2] / self.batch_size) 56 | return (X, y) 57 | 58 | def fetch_val(self): 59 | return self.X_val, self.y_val 60 | 61 | def fetch_test(self): 62 | return self.X_test, self.y_test 63 | -------------------------------------------------------------------------------- /data/text.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class OneHot(object): 4 | def __init__(self, alphabet_size, char_to_i): 5 | self.alphabet_size = alphabet_size 6 | self.matrix = np.eye(alphabet_size, dtype='uint8') 7 | self.to_i = char_to_i 8 | 9 | def __call__(self, chars): 10 | I = np.zeros((len(chars), self.alphabet_size)) 11 | for c in range(len(chars)): 12 | i = self.to_i[chars[c]] 13 | I[c][i] = 1 14 | return I 15 | 16 | 17 | class UnOneHot(object): 18 | def __init__(self, i_to_char): 19 | self.to_c = i_to_char 20 | 21 | def __call__(self, vectors): 22 | chars = '' 23 | for vector in vectors: 24 | i = vector.argmax() 25 | 
chars += self.to_c[i] 26 | return chars 27 | 28 | class loader(object): 29 | def __init__(self, filename, sequence_length, batch_size): 30 | f = open(filename, 'r') 31 | lines = f.readlines() 32 | 33 | string = ''.join(lines) 34 | 35 | vocabulary = list(set(string)) 36 | vocabulary_size = len(vocabulary) 37 | data_size = len(string) 38 | 39 | char_to_i = {ch:i for i,ch in enumerate(vocabulary)} 40 | i_to_char = {i:ch for i,ch in enumerate(vocabulary)} 41 | 42 | encoder = OneHot(vocabulary_size, char_to_i) 43 | decoder = UnOneHot(i_to_char) 44 | 45 | chars_per_batch = data_size / batch_size 46 | total_used_chars = (data_size / chars_per_batch) * chars_per_batch 47 | string = string[:total_used_chars] 48 | data_size = len(string) 49 | chars_per_batch = data_size / batch_size 50 | iterators = range(0, total_used_chars, chars_per_batch) 51 | 52 | self.sequence_length = sequence_length 53 | self.batch_size = batch_size 54 | self.string = string 55 | self.vocabulary = vocabulary 56 | self.vocabulary_size = vocabulary_size 57 | self.data_size = data_size 58 | self.char_to_i = char_to_i 59 | self.i_to_char = i_to_char 60 | self.encoder = encoder 61 | self.decoder = decoder 62 | self.chars_per_batch = chars_per_batch 63 | self.total_used_chars = total_used_chars 64 | self.string = string 65 | self.chars_per_batch = chars_per_batch 66 | self.iterators = iterators 67 | 68 | def fetch_train(self): 69 | T = self.sequence_length 70 | batch_string = '' 71 | 72 | for i in range(len(self.iterators)): 73 | batch_string += self.string[self.iterators[i]:self.iterators[i] + T] 74 | self.iterators[i] += T 75 | 76 | if self.iterators[0] + T >= self.chars_per_batch: 77 | self.iterators = range(0, self.total_used_chars, self.chars_per_batch) 78 | 79 | batch_x = self.encoder(batch_string) 80 | batch_y = self.encoder(batch_string[1:] + batch_string[0]) 81 | 82 | X = batch_x.reshape((self.batch_size, T, self.vocabulary_size), order='C').swapaxes(1, 2).swapaxes(0, 2) 83 | Y = batch_y.reshape((self.batch_size, T, self.vocabulary_size), order='C').swapaxes(1, 2).swapaxes(0, 2) 84 | 85 | return X, Y -------------------------------------------------------------------------------- /layers/__init__.py: -------------------------------------------------------------------------------- 1 | from linear import Linear 2 | from softmax_ce_loss import SoftmaxCrossEntropyLoss 3 | from cwrnn import CWRNN 4 | from cwrnn_norm import CWRNN_NORM 5 | from cwrnn_l1 import CWRNN_L1 6 | from activations import * 7 | -------------------------------------------------------------------------------- /layers/activations.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from base import Layer 4 | 5 | class TanH(Layer): 6 | def forward(self, X): 7 | Y = np.tanh(X) 8 | self.Y = Y 9 | return Y 10 | 11 | def backward(self, dY): 12 | Y = self.Y 13 | dX = (1.0 - Y ** 2) * dY 14 | return dX 15 | 16 | 17 | class Sigmoid(Layer): 18 | def forward(self, X): 19 | Y = 1.0 / (1.0 + np.exp(-X)) 20 | self.Y = Y 21 | return Y 22 | 23 | def backward(self, dY): 24 | Y = self.Y 25 | dX = Y * (1.0 - Y) * dY 26 | return dX 27 | 28 | 29 | -------------------------------------------------------------------------------- /layers/base.py: -------------------------------------------------------------------------------- 1 | class Layer(object): 2 | def __init__(self): 3 | pass 4 | 5 | def forward(self): 6 | pass 7 | 8 | def backward(self): 9 | pass 10 | 11 | def get_params(self): 12 | pass 13 | 14 | def 
set_params(self): 15 | pass 16 | 17 | def get_grads(self): 18 | pass 19 | 20 | def clear_grads(self): 21 | pass 22 | 23 | def forget(self): 24 | pass 25 | 26 | def remember(self): 27 | pass 28 | 29 | def print_info(self): 30 | pass -------------------------------------------------------------------------------- /layers/cwrnn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from base import Layer 4 | from layer_utils import glorotize, orthogonalize 5 | 6 | 7 | class Softmax(Layer): 8 | def forward(self, X): 9 | exp = np.exp(X) 10 | probs = exp / np.sum(exp, axis=0, keepdims=True) 11 | self.probs = probs 12 | return probs 13 | 14 | def backward(self, dY): 15 | Y = self.probs 16 | dX = Y * dY 17 | sumdX = np.sum(dX, axis=0, keepdims=True) 18 | dX -= Y * sumdX 19 | return dX 20 | 21 | 22 | class CWRNN(Layer): 23 | def __init__(self, n_input, n_hidden, n_modules, T_max, last_state_only=False): 24 | assert(n_hidden % n_modules == 0) 25 | 26 | W = np.random.randn(n_hidden, n_input + n_hidden + 1) # +1 for bias, single combined matrix 27 | # for recurrent and input projections 28 | 29 | # glorotize and orthogonalize the non recurrent and recurrent aspects respectively 30 | W[:, :n_input] = glorotize(W[:, :n_input]) 31 | W[:, n_input:-1] = orthogonalize(W[:, n_input:-1]) 32 | 33 | # time kernel (T_max x n_clocks) 34 | C = np.repeat(np.arange(T_max).reshape(1, -1), T_max, axis=0) 35 | C = ((C % np.arange(1, T_max + 1).reshape(-1, 1)) == 0) * 1.0 36 | C = C.T 37 | 38 | # distribution over clocks for each module (T_max x n_modules) 39 | d = np.zeros((T_max, n_modules)) 40 | 41 | self.softmax = Softmax() 42 | 43 | self.W = W 44 | self.d = d 45 | self.C = C 46 | self.n_input, self.n_hidden, self.n_modules, self.T_max, self.last_state_only = n_input, n_hidden, n_modules, T_max, last_state_only 47 | 48 | 49 | def forward(self, X): 50 | T, n, B = X.shape 51 | n_input = self.n_input 52 | n_hidden = self.n_hidden 53 | n_modules = self.n_modules 54 | 55 | D = self.softmax.forward(self.d) # get activations 56 | a = np.dot(self.C, D) 57 | A = np.repeat(a, n_hidden / n_modules, axis=1) # for each state in a module 58 | A = A[:, :, np.newaxis] 59 | 60 | V = np.zeros((T, n_input + n_hidden + 1, B)) 61 | h_new = np.zeros((T, n_hidden, B)) 62 | H_new = np.zeros((T, n_hidden, B)) 63 | H = np.zeros((T, n_hidden, B)) 64 | 65 | H_prev = np.zeros((n_hidden, B)) 66 | 67 | for t in xrange(T): 68 | V[t] = np.concatenate([X[t], H_prev, np.ones((1, B))], axis=0) 69 | h_new[t] = np.dot(self.W, V[t]) 70 | H_new[t] = np.tanh(h_new[t]) 71 | H[t] = A[t] * H_new[t] + (1 - A[t]) * H_prev # leaky update 72 | H_prev = H[t] 73 | 74 | self.A, self.a = A, a 75 | self.V, self.h_new, self.H_new, self.H = V, h_new, H_new, H 76 | 77 | if self.last_state_only: 78 | return H[-1:] 79 | else: 80 | return H 81 | 82 | def backward(self, dH): 83 | if self.last_state_only: 84 | last_step_error = dH.copy() 85 | dH = np.zeros_like(self.H) 86 | dH[-1:] = last_step_error[:] 87 | 88 | T, _, B = dH.shape 89 | n_input = self.n_input 90 | n_hidden = self.n_hidden 91 | n_modules = self.n_modules 92 | 93 | A = self.A 94 | V, h_new, H_new, H = self.V, self.h_new, self.H_new, self.H 95 | dA, dH_prev, dW, dX = np.zeros_like(A), np.zeros((n_hidden, B)), \ 96 | np.zeros_like(self.W), np.zeros((T, n_input, B)) 97 | 98 | for t in reversed(xrange(T)): 99 | if t == 0: 100 | H_prev = np.zeros((n_hidden, B)) 101 | else: 102 | H_prev = H[t - 1] 103 | 104 | dH_t = dH[t] + dH_prev 105 | 106 | dH_new = A[t] 
* dH_t 107 | dH_prev = (1 - A[t]) * dH_t 108 | dA[t] = np.sum((H_new[t] - H_prev) * dH_t, axis=1, keepdims=True) 109 | 110 | dh_new = (1.0 - H_new[t] ** 2) * dH_new 111 | 112 | dW += np.dot(dh_new, V[t].T) 113 | dV = np.dot(self.W.T, dh_new) 114 | 115 | dX[t] = dV[:n_input] 116 | dH_prev += dV[n_input:-1] 117 | 118 | dA = dA[:, :, 0] 119 | da = dA.reshape(self.T_max, -1, n_hidden / n_modules).sum(axis=-1) 120 | dD = np.dot(self.C.T, da) 121 | dd = self.softmax.backward(dD) 122 | 123 | 124 | self.dW = dW 125 | self.dd = dd 126 | 127 | return dX 128 | 129 | def get_params(self): 130 | W = self.W.flatten() 131 | d = self.d.flatten() 132 | return np.concatenate([W, d]) 133 | 134 | def set_params(self, P): 135 | a, b = self.W.size, self.d.size 136 | W, d = np.split(P, [a]) 137 | self.W = W.reshape(self.W.shape) 138 | self.d = d.reshape(self.d.shape) 139 | 140 | def get_grads(self): 141 | dW = self.dW.flatten() 142 | dd = self.dd.flatten() 143 | return np.concatenate([dW, dd]) 144 | 145 | def clear_grads(self): 146 | self.dW = None 147 | self.dd = None 148 | 149 | def forget(self): 150 | pass 151 | 152 | def remember(self): 153 | pass 154 | 155 | def print_info(self): 156 | print 'dominant wave period: ', self.d.argmax(axis=0) + 1 157 | print 'avg. power (all coefficients): ', np.abs(self.d).mean() 158 | print 'avg. power activation waves: ', self.A.mean() 159 | -------------------------------------------------------------------------------- /layers/cwrnn_l1.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from base import Layer 4 | from layer_utils import glorotize, orthogonalize 5 | 6 | 7 | class CWRNN_L1(Layer): 8 | def __init__(self, n_input, n_hidden, n_modules, T_max, last_state_only=False): 9 | assert(n_hidden % n_modules == 0) 10 | 11 | W = np.random.randn(n_hidden, n_input + n_hidden + 1) # +1 for bias, single combined matrix 12 | # for recurrent and input projections 13 | 14 | # glorotize and orthogonalize the non recurrent and recurrent aspects respectively 15 | W[:, :n_input] = glorotize(W[:, :n_input]) 16 | W[:, n_input:-1] = orthogonalize(W[:, n_input:-1]) 17 | 18 | # time kernel (T_max x n_clocks) 19 | C = np.repeat(np.arange(T_max).reshape(1, -1), T_max, axis=0) 20 | C = ((C % np.arange(1, T_max + 1).reshape(-1, 1)) == 0) * 1.0 21 | C = C.T 22 | 23 | # distribution over clocks for each module (T_max x n_modules) 24 | d = np.zeros((T_max, n_modules)) 25 | 26 | self.W = W 27 | self.d = d 28 | self.C = C 29 | self.n_input, self.n_hidden, self.n_modules, self.T_max, self.last_state_only = n_input, n_hidden, n_modules, T_max, last_state_only 30 | 31 | 32 | def forward(self, X): 33 | T, n, B = X.shape 34 | n_input = self.n_input 35 | n_hidden = self.n_hidden 36 | n_modules = self.n_modules 37 | 38 | D = self.d # get activations 39 | a = np.dot(self.C, D) 40 | a = np.clip(a, 0.0, 1.0) 41 | A = np.repeat(a, n_hidden / n_modules, axis=1) # for each state in a module 42 | A = A[:, :, np.newaxis] 43 | 44 | V = np.zeros((T, n_input + n_hidden + 1, B)) 45 | h_new = np.zeros((T, n_hidden, B)) 46 | H_new = np.zeros((T, n_hidden, B)) 47 | H = np.zeros((T, n_hidden, B)) 48 | 49 | H_prev = np.zeros((n_hidden, B)) 50 | 51 | for t in xrange(T): 52 | V[t] = np.concatenate([X[t], H_prev, np.ones((1, B))], axis=0) 53 | h_new[t] = np.dot(self.W, V[t]) 54 | H_new[t] = np.tanh(h_new[t]) 55 | H[t] = A[t] * H_new[t] + (1 - A[t]) * H_prev # leaky update 56 | H_prev = H[t] 57 | 58 | self.A, self.a = A, a 59 | self.V, self.h_new, self.H_new, 
self.H = V, h_new, H_new, H 60 | 61 | if self.last_state_only: 62 | return H[-1:] 63 | else: 64 | return H 65 | 66 | def backward(self, dH): 67 | if self.last_state_only: 68 | last_step_error = dH.copy() 69 | dH = np.zeros_like(self.H) 70 | dH[-1:] = last_step_error[:] 71 | 72 | T, _, B = dH.shape 73 | n_input = self.n_input 74 | n_hidden = self.n_hidden 75 | n_modules = self.n_modules 76 | 77 | A = self.A 78 | V, h_new, H_new, H = self.V, self.h_new, self.H_new, self.H 79 | dA, dH_prev, dW, dX = np.zeros_like(A), np.zeros((n_hidden, B)), \ 80 | np.zeros_like(self.W), np.zeros((T, n_input, B)) 81 | 82 | for t in reversed(xrange(T)): 83 | if t == 0: 84 | H_prev = np.zeros((n_hidden, B)) 85 | else: 86 | H_prev = H[t - 1] 87 | 88 | dH_t = dH[t] + dH_prev 89 | 90 | dH_new = A[t] * dH_t 91 | dH_prev = (1 - A[t]) * dH_t 92 | dA[t] = np.sum((H_new[t] - H_prev) * dH_t, axis=1, keepdims=True) 93 | 94 | dh_new = (1.0 - H_new[t] ** 2) * dH_new 95 | 96 | dW += np.dot(dh_new, V[t].T) 97 | dV = np.dot(self.W.T, dh_new) 98 | 99 | dX[t] = dV[:n_input] 100 | dH_prev += dV[n_input:-1] 101 | 102 | dA = dA[:, :, 0] 103 | da = dA.reshape(self.T_max, -1, n_hidden / n_modules).sum(axis=-1) 104 | dD = np.dot(self.C.T, da) 105 | dd = dD 106 | 107 | self.dW = dW 108 | self.dd = dd + 0.01 * np.sign(self.d) 109 | 110 | return dX 111 | 112 | def get_params(self): 113 | W = self.W.flatten() 114 | d = self.d.flatten() 115 | return np.concatenate([W, d]) 116 | 117 | def set_params(self, P): 118 | a, b = self.W.size, self.d.size 119 | W, d = np.split(P, [a]) 120 | self.W = W.reshape(self.W.shape) 121 | self.d = d.reshape(self.d.shape) 122 | 123 | def get_grads(self): 124 | dW = self.dW.flatten() 125 | dd = self.dd.flatten() 126 | return np.concatenate([dW, dd]) 127 | 128 | def clear_grads(self): 129 | self.dW = None 130 | self.dd = None 131 | 132 | def print_info(self): 133 | print 'dominant wave period: ', self.d.argmax(axis=0) + 1 134 | print 'avg. power (all coefficients): ', np.abs(self.d).mean() 135 | print 'avg. 
power activation waves: ', self.A.mean() 136 | -------------------------------------------------------------------------------- /layers/cwrnn_norm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from base import Layer 4 | from layer_utils import glorotize, orthogonalize 5 | 6 | from scipy import special 7 | 8 | 9 | class Normalize(Layer): 10 | def __init__(self, axis=0): 11 | self.axis = axis 12 | 13 | def forward(self, X): 14 | mn = X.min(axis=self.axis) 15 | mx = X.max(axis=self.axis) 16 | 17 | amn = X.argmin(axis=self.axis) 18 | amx = X.argmax(axis=self.axis) 19 | 20 | Y = (X - mn) / (mx - mn) 21 | 22 | self.mn = mn 23 | self.mx = mx 24 | self.amn = amn 25 | self.amx = amx 26 | self.X = X 27 | self.Y = Y 28 | return Y 29 | 30 | def backward(self, dY): 31 | dY[self.amx, range(dY.shape[1])] = 0.0 32 | dY[self.amn, range(dY.shape[1])] = 0.0 33 | dX = dY / (self.mx - self.mn) 34 | dX[self.amx, range(dY.shape[1])] = np.sum(-self.Y * dY / (self.mx - self.mn), axis=self.axis) 35 | dX[self.amn, range(dY.shape[1])] = np.sum((self.X - self.mx) * dY / (self.mx - self.mn) ** 2, axis=self.axis) 36 | 37 | return dX 38 | 39 | 40 | class Softmax(Layer): 41 | def forward(self, X): 42 | exp = np.exp(X) 43 | probs = exp / np.sum(exp, axis=0, keepdims=True) 44 | self.probs = probs 45 | return probs 46 | 47 | def backward(self, dY): 48 | Y = self.probs 49 | dX = Y * dY 50 | sumdX = np.sum(dX, axis=0, keepdims=True) 51 | dX -= Y * sumdX 52 | return dX 53 | 54 | 55 | class CWRNN_NORM(Layer): 56 | def __init__(self, n_input, n_hidden, n_modules, T_max, last_state_only=False): 57 | assert(n_hidden % n_modules == 0) 58 | 59 | W = np.random.randn(n_hidden, n_input + n_hidden + 1) # +1 for bias, single combined matrix 60 | # for recurrent and input projections 61 | 62 | # glorotize and orthogonalize the recurrent and non recurrent aspects respectively 63 | W[:, :n_input] = glorotize(W[:, :n_input]) 64 | W[:, n_input:-1] = orthogonalize(W[:, n_input:-1]) 65 | 66 | # time kernel (T_max x T_max) 67 | C = np.repeat(np.arange(1, T_max + 1).reshape(1, -1), T_max, axis=0) 68 | C = ((C % np.arange(1, T_max + 1).reshape(-1, 1)) == 0) * 1.0 69 | C = C.T 70 | 71 | # distribution over clocks for each module (T_max x n_modules) 72 | d = np.random.randn(T_max, n_modules) 73 | 74 | self.softmax = Softmax() 75 | self.norm = Normalize() 76 | 77 | self.W = W 78 | self.d = d 79 | self.C = C 80 | self.n_input, self.n_hidden, self.n_modules, self.T_max, self.last_state_only = n_input, n_hidden, n_modules, T_max, last_state_only 81 | 82 | 83 | def forward(self, X): 84 | T, n, B = X.shape 85 | n_input = self.n_input 86 | n_hidden = self.n_hidden 87 | n_modules = self.n_modules 88 | 89 | D = self.softmax.forward(self.d) # get activations 90 | a = np.dot(self.C, D) 91 | a = self.norm.forward(a) 92 | A = np.repeat(a, n_hidden / n_modules, axis=1) # for each state in a module 93 | A = A[:, :, np.newaxis] 94 | 95 | V = np.zeros((T, n_input + n_hidden + 1, B)) 96 | h_new = np.zeros((T, n_hidden, B)) 97 | H_new = np.zeros((T, n_hidden, B)) 98 | H = np.zeros((T, n_hidden, B)) 99 | 100 | H_prev = np.zeros((n_hidden, B)) 101 | 102 | for t in xrange(T): 103 | V[t] = np.concatenate([X[t], H_prev, np.ones((1, B))], axis=0) 104 | h_new[t] = np.dot(self.W, V[t]) 105 | H_new[t] = np.tanh(h_new[t]) 106 | H[t] = A[t] * H_new[t] + (1 - A[t]) * H_prev # leaky update 107 | H_prev = H[t] 108 | 109 | self.D, self.A, self.a = D, A, a 110 | self.V, self.h_new, self.H_new, self.H = V, h_new, 
H_new, H 111 | 112 | if self.last_state_only: 113 | return H[-1:] 114 | else: 115 | return H 116 | 117 | def backward(self, dH): 118 | if self.last_state_only: 119 | last_step_error = dH.copy() 120 | dH = np.zeros_like(self.H) 121 | dH[-1:] = last_step_error[:] 122 | 123 | T, _, B = dH.shape 124 | n_input = self.n_input 125 | n_hidden = self.n_hidden 126 | n_modules = self.n_modules 127 | 128 | A = self.A 129 | V, h_new, H_new, H = self.V, self.h_new, self.H_new, self.H 130 | dA, dH_prev, dW, dX = np.zeros_like(A), np.zeros((n_hidden, B)), \ 131 | np.zeros_like(self.W), np.zeros((T, n_input, B)) 132 | 133 | for t in reversed(xrange(T)): 134 | if t == 0: 135 | H_prev = np.zeros((n_hidden, B)) 136 | else: 137 | H_prev = H[t - 1] 138 | 139 | dH_t = dH[t] + dH_prev 140 | 141 | dH_new = A[t] * dH_t 142 | dH_prev = (1 - A[t]) * dH_t 143 | dA[t] = np.sum((H_new[t] - H_prev) * dH_t, axis=1, keepdims=True) 144 | 145 | dh_new = (1.0 - H_new[t] ** 2) * dH_new 146 | 147 | dW += np.dot(dh_new, V[t].T) 148 | dV = np.dot(self.W.T, dh_new) 149 | 150 | dX[t] = dV[:n_input] 151 | dH_prev += dV[n_input:-1] 152 | 153 | dA = dA[:, :, 0] 154 | da = dA.reshape(self.T_max, -1, n_hidden / n_modules).sum(axis=-1) 155 | da = self.norm.backward(da) 156 | dD = np.dot(self.C.T, da) 157 | dd = self.softmax.backward(dD) 158 | 159 | self.dW = dW 160 | self.dd = dd 161 | 162 | return dX 163 | 164 | def get_params(self): 165 | W = self.W.flatten() 166 | d = self.d.flatten() 167 | return np.concatenate([W, d]) 168 | 169 | def set_params(self, P): 170 | a, b = self.W.size, self.d.size 171 | W, d = np.split(P, [a]) 172 | self.W = W.reshape(self.W.shape) 173 | self.d = d.reshape(self.d.shape) 174 | 175 | def get_grads(self): 176 | dW = self.dW.flatten() 177 | dd = self.dd.flatten() 178 | return np.concatenate([dW, dd]) 179 | 180 | def clear_grads(self): 181 | self.dW = None 182 | self.dd = None 183 | 184 | def forget(self): 185 | pass 186 | 187 | def remember(self): 188 | pass 189 | 190 | def print_info(self): 191 | _D = self.d.copy() 192 | print 'dominant wave period: \n', _D.argmax(axis=0) + 1 193 | print '\n\navg. power (all):\t', np.abs(_D).mean() 194 | print 'avg. 
power waves:\t', self.A.mean() 195 | print '\n\nentropy:\n', special.entr(self.D).sum(axis=0) 196 | print '\n\n mean entropy: ', np.mean(special.entr(self.D).sum(axis=0)) 197 | d = self.d.copy() 198 | maxx = d.max(axis=0) 199 | d[d.argmax(axis=0), range(d.shape[1])] = -10.0 200 | print '\n\ndiff b/w top 2:\n', maxx - d.max(axis=0) 201 | 202 | 203 | 204 | -------------------------------------------------------------------------------- /layers/layer_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | def glorotize(W): 5 | W *= np.sqrt(6) 6 | W /= np.sqrt(np.sum(W.shape)) 7 | return W 8 | 9 | 10 | def orthogonalize(W): 11 | W, _, _ = np.linalg.svd(W) 12 | return W 13 | -------------------------------------------------------------------------------- /layers/linear.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from base import Layer 4 | from layer_utils import glorotize 5 | 6 | 7 | class Linear(Layer): 8 | def __init__(self, n_input, n_output): 9 | W = np.random.randn(n_output, n_input + 1) # +1 for the bias 10 | W = glorotize(W) 11 | self.W = W 12 | 13 | self.n_input = n_input 14 | self.n_output = n_output 15 | 16 | def forward(self, X): 17 | T, n, B = X.shape 18 | 19 | X_flat = X.swapaxes(0, 1).reshape(n, -1) # flatten over time and batch 20 | X_flat = np.concatenate([X_flat, np.ones((1, B * T))], axis=0) # add bias 21 | 22 | Y = np.dot(self.W, X_flat) 23 | Y = Y.reshape((-1, T, B)).swapaxes(0,1) 24 | 25 | self.X_flat = X_flat 26 | 27 | return Y 28 | 29 | def backward(self, dY): 30 | T, n, B = dY.shape 31 | 32 | dY = dY.swapaxes(0,1).reshape(n, -1) 33 | 34 | self.dW = np.dot(dY, self.X_flat.T) 35 | 36 | dX = np.dot(self.W.T, dY) 37 | 38 | dX = dX[:-1] # skip the bias we added above 39 | dX = dX.reshape((-1, T, B)).swapaxes(0,1) 40 | 41 | return dX 42 | 43 | def get_params(self): 44 | return self.W.flatten() 45 | 46 | def set_params(self, W): 47 | self.W = W.reshape(self.W.shape) 48 | 49 | def get_grads(self): 50 | return self.dW.flatten() 51 | 52 | def clear_grads(self): 53 | self.dW = None 54 | -------------------------------------------------------------------------------- /layers/softmax_ce_loss.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from base import Layer 4 | 5 | 6 | class SoftmaxCrossEntropyLoss(Layer): 7 | def forward(self, X, target): 8 | T, _, B = X.shape 9 | 10 | exp = np.exp(X) 11 | probs = exp / np.sum(exp, axis=1, keepdims=True) 12 | 13 | loss = -1.0 * np.sum(target * np.log(probs)) / (T * B) 14 | 15 | self.probs = probs 16 | self.target, self.T, self.B = target, T, B 17 | 18 | return loss 19 | 20 | def backward(self): 21 | target, T, B = self.target, self.T, self.B 22 | 23 | dX = self.probs - target 24 | 25 | return dX / (T * B) -------------------------------------------------------------------------------- /layers/tests.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from __init__ import * 4 | 5 | 6 | def finite_difference_check(layer, fwd, all_values, backpropagated_gradients, names, delta, error_threshold): 7 | error_count = 0 8 | for v in range(len(names)): 9 | values = all_values[v] 10 | dvalues = backpropagated_gradients[v] 11 | name = names[v] 12 | 13 | for i in range(values.size): 14 | actual = values.flat[i] 15 | values.flat[i] = actual + delta 16 | loss_minus = fwd() 17 | 
values.flat[i] = actual - delta 18 | loss_plus = fwd() 19 | values.flat[i] = actual 20 | backpropagated_gradient = dvalues.flat[i] 21 | numerical_gradient = (loss_minus - loss_plus) / (2 * delta) 22 | 23 | if numerical_gradient == 0 and backpropagated_gradient == 0: 24 | error = 0 25 | elif abs(numerical_gradient) < 1e-7 and abs(backpropagated_gradient) < 1e-7: 26 | error = 0 27 | else: 28 | error = abs(backpropagated_gradient - numerical_gradient) / abs(numerical_gradient + backpropagated_gradient) 29 | 30 | if error > error_threshold: 31 | print 'FAILURE!!!\n' 32 | print '\tparameter: ', name, '\tindex: ', np.unravel_index(i, values.shape) 33 | print '\tvalues: ', actual 34 | print '\tbackpropagated_gradient: ', backpropagated_gradient 35 | print '\tnumerical_gradient', numerical_gradient 36 | print '\terror: ', error 37 | print '----' * 20 38 | print '\n\n' 39 | 40 | error_count += 1 41 | 42 | if error_count == 0: 43 | print layer.__class__.__name__, 'Layer Gradient Check Passed' 44 | else: 45 | param_count = 0 46 | for val in all_values: 47 | param_count += val.size 48 | print layer.__class__.__name__, 'Layer Gradient Check Failed for {}/{} parameters'.format(error_count, param_count) 49 | 50 | 51 | def test_layer(layer): 52 | P = layer.get_params() 53 | Y = layer.forward(X) 54 | T_rand = np.random.randn(*Y.shape) # random target for a multiplicative loss 55 | loss = np.sum(Y * T_rand) # loss 56 | 57 | dX = layer.backward(T_rand) 58 | dP = layer.get_grads() 59 | 60 | def fwd(): 61 | layer.forget() 62 | layer.set_params(P) 63 | return np.sum(layer.forward(X) * T_rand) 64 | 65 | all_values = [X, P] 66 | backpropagated_gradients = [dX, dP] 67 | names = ['X', 'P'] 68 | 69 | 70 | finite_difference_check(layer, fwd, all_values, backpropagated_gradients, names, delta, error_threshold) 71 | 72 | 73 | def test_loss(layer): 74 | 75 | exp = np.exp(np.random.random(X.shape)) 76 | target = exp / np.sum(exp, axis=1, keepdims=True) # random target for a multiplicative loss 77 | loss = layer.forward(X, target) 78 | 79 | dX = layer.backward() 80 | 81 | def fwd(): 82 | return layer.forward(X, target) 83 | 84 | all_values = [X] 85 | backpropagated_gradients = [dX] 86 | names = ['X'] 87 | 88 | finite_difference_check(layer, fwd, all_values, backpropagated_gradients, names, delta, error_threshold) 89 | 90 | 91 | def test_activation(layer): 92 | Y = layer.forward(X) 93 | T_rand = np.random.randn(*Y.shape) # random target for a multiplicative loss 94 | loss = np.sum(Y * T_rand) # loss 95 | 96 | dX = layer.backward(T_rand) 97 | 98 | def fwd(): 99 | return np.sum(layer.forward(X) * T_rand) 100 | 101 | all_values = [X] 102 | backpropagated_gradients = [dX] 103 | names = ['X'] 104 | 105 | finite_difference_check(layer, fwd, all_values, backpropagated_gradients, names, delta, error_threshold) 106 | 107 | 108 | delta = 1e-4 109 | error_threshold = 1e-3 110 | time_steps = 5 111 | n_input = 3 112 | batch_size = 7 113 | 114 | X = np.random.randn(time_steps, n_input, batch_size) 115 | 116 | n_output = 20 117 | layer = Linear(n_input, n_output) 118 | test_layer(layer=layer) 119 | 120 | layer = SoftmaxCrossEntropyLoss() 121 | test_loss(layer=layer) 122 | 123 | layer = CWRNN(n_input=n_input, n_hidden=8, n_modules=4, T_max=time_steps, last_state_only=False) 124 | test_layer(layer=layer) 125 | 126 | layer = CWRNN_NORM(n_input=n_input, n_hidden=8, n_modules=4, T_max=time_steps, last_state_only=False) 127 | test_layer(layer=layer) 128 | 129 | layer = CWRNN_L1(n_input=n_input, n_hidden=8, n_modules=4, T_max=time_steps, 
last_state_only=False) 130 | test_layer(layer=layer) 131 | 132 | layer = TanH() 133 | test_activation(layer=layer) 134 | 135 | layer = Sigmoid() 136 | test_activation(layer=layer) 137 | -------------------------------------------------------------------------------- /network.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | def forward(model, input, target): 5 | for layer in model[:-1]: 6 | input = layer.forward(input) 7 | output = model[-1].forward(input, target) 8 | return output 9 | 10 | 11 | def backward(model): 12 | gradient = model[-1].backward() 13 | for layer in reversed(model[:-1]): 14 | gradient = layer.backward(gradient) 15 | return gradient 16 | 17 | 18 | def load_params(model, W): 19 | for layer in model: 20 | w = layer.get_params() 21 | if w is None: 22 | continue 23 | w_shape = w.shape 24 | w, W = np.split(W, [np.prod(w_shape)]) 25 | layer.set_params(w.reshape(w_shape)) 26 | 27 | 28 | def extract_params(model): 29 | params = [] 30 | for layer in model: 31 | w = layer.get_params() 32 | if w is None: 33 | continue 34 | params.append(w) 35 | W = np.concatenate(params) 36 | return np.array(W) 37 | 38 | 39 | def extract_grads(model): 40 | grads = [] 41 | for layer in model: 42 | g = layer.get_grads() 43 | if g is None: 44 | continue 45 | grads.append(g) 46 | dW = np.concatenate(grads) 47 | return np.array(dW) 48 | 49 | 50 | def forget(model): 51 | for layer in model: 52 | layer.forget() 53 | 54 | 55 | def print_info(model): 56 | for layer in model: 57 | layer.print_info() 58 | -------------------------------------------------------------------------------- /train_mnist.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from layers import * 4 | from network import * 5 | from climin import RmsProp, Adam, GradientDescent 6 | from data.mnist import loader 7 | 8 | import time 9 | import pickle 10 | import os 11 | 12 | import matplotlib.pyplot as plt 13 | plt.ion() 14 | plt.style.use('kosh') 15 | plt.figure(figsize=(12, 7)) 16 | 17 | 18 | np.random.seed(np.random.randint(1213)) 19 | 20 | experiment_name = 'plot_freq_softmax' 21 | 22 | permuted = False 23 | 24 | n_input = 1 25 | n_hidden = 128 26 | n_modules = 8 27 | n_output = 10 28 | 29 | batch_size = 50 30 | learning_rate = 1e-3 31 | niterations = 20000 32 | momentum = 0.9 33 | 34 | gradient_clip = (-1.0, 1.0) 35 | 36 | save_every = 1000 37 | plot_every = 100 38 | 39 | logs = {} 40 | 41 | data = loader(batch_size=batch_size, permuted=permuted) 42 | 43 | def dW(W): 44 | load_params(model, W) 45 | forget(model) 46 | inputs, targets = data.fetch_train() 47 | loss = forward(model, inputs, targets) 48 | backward(model) 49 | 50 | gradients = extract_grads(model) 51 | clipped_gradients = np.clip(gradients, gradient_clip[0], gradient_clip[1]) 52 | 53 | gradient_norm = (gradients ** 2).sum() / gradients.size 54 | clipped_gradient_norm = (clipped_gradients ** 2).sum() / gradients.size 55 | 56 | logs['loss'].append(loss) 57 | logs['smooth_loss'].append(loss * 0.01 + logs['smooth_loss'][-1] * 0.99) 58 | logs['gradient_norm'].append(gradient_norm) 59 | logs['clipped_gradient_norm'].append(clipped_gradient_norm) 60 | 61 | return clipped_gradients 62 | 63 | 64 | os.system('mkdir results/' + experiment_name) 65 | path = 'results/' + experiment_name + '/' 66 | 67 | logs['loss'] = [] 68 | logs['val_loss'] = [] 69 | logs['accuracy'] = [] 70 | logs['smooth_loss'] = [np.log(10)] 71 | logs['gradient_norm'] = [] 72 | 
logs['clipped_gradient_norm'] = [] 73 | 74 | 75 | model = [ 76 | CWRNN_L1(n_input=n_input, n_hidden=n_hidden, n_modules=n_modules, T_max=784, last_state_only=True), 77 | Linear(n_hidden, n_output), 78 | SoftmaxCrossEntropyLoss() 79 | ] 80 | 81 | W = extract_params(model) 82 | 83 | optimizer = Adam(W, dW, learning_rate, momentum=momentum) 84 | 85 | print 'Approx. Parameters: ', W.size 86 | 87 | for i in optimizer: 88 | if i['n_iter'] > niterations: 89 | break 90 | 91 | print '\n\n' 92 | print str(data.epoch) + '\t' + str(i['n_iter']), '\t', 93 | print logs['loss'][-1], '\t', 94 | print logs['gradient_norm'][-1] 95 | print_info(model) 96 | print '\n', '----' * 20, '\n' 97 | 98 | if data.epoch_complete: 99 | inputs, labels = data.fetch_val() 100 | nsamples = inputs.shape[2] 101 | inputs = np.split(inputs, nsamples / batch_size, axis=2) 102 | labels = np.split(labels, nsamples / batch_size, axis=2) 103 | val_loss = 0 104 | for j in range(len(inputs)): 105 | forget(model) 106 | input = inputs[j] 107 | label = labels[j] 108 | val_loss += forward(model, input, label) 109 | val_loss /= len(inputs) 110 | logs['val_loss'].append(val_loss) 111 | print '..' * 20 112 | print 'validation loss: ', val_loss 113 | 114 | inputs, labels = data.fetch_test() 115 | nsamples = inputs.shape[2] 116 | inputs = np.split(inputs, nsamples / batch_size, axis=2) 117 | labels = np.split(labels, nsamples / batch_size, axis=2) 118 | 119 | correct = 0 120 | for j in range(len(inputs)): 121 | forget(model) 122 | input = inputs[j] 123 | label = labels[j] 124 | _ = forward(model, input, label) 125 | pred = model[-1].probs 126 | good = np.sum(label.argmax(axis=1) == pred.argmax(axis=1)) 127 | correct += good 128 | 129 | accuracy = correct / float(nsamples) 130 | logs['accuracy'].append(accuracy) 131 | print 'accuracy: ', accuracy 132 | print '..' * 20 133 | 134 | data.epoch_complete = False 135 | 136 | plt.figure(2) 137 | plt.clf() 138 | plt.plot(logs['val_loss'], label='validation') 139 | plt.legend() 140 | plt.draw() 141 | plt.figure(3) 142 | plt.clf() 143 | plt.plot(logs['accuracy'], label='accuracy') 144 | plt.legend() 145 | plt.draw() 146 | 147 | 148 | if i['n_iter'] % save_every == 0: 149 | print 'serializing model... ' 150 | f = open(path + 'iter_' + str(i['n_iter']) +'.model', 'w') 151 | pickle.dump(model, f) 152 | f.close() 153 | 154 | if i['n_iter'] % plot_every == 0: 155 | plt.figure(1) 156 | plt.clf() 157 | plt.plot(logs['smooth_loss'], label='training') 158 | plt.legend() 159 | plt.draw() 160 | 161 | 162 | print 'serializing logs... ' 163 | f = open(path + 'logs.logs', 'w') 164 | pickle.dump(logs, f) 165 | f.close() 166 | 167 | print 'serializing final model... 
' 168 | f = open(path + 'final.model', 'w') 169 | pickle.dump(model, f) 170 | f.close() 171 | 172 | plt.savefig(path + 'loss_curve') 173 | -------------------------------------------------------------------------------- /train_ptb.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from layers import * 4 | from network import * 5 | from climin import RmsProp, Adam, GradientDescent 6 | from data.text import loader 7 | 8 | import time 9 | import pickle 10 | import os 11 | 12 | import matplotlib.pyplot as plt 13 | plt.ion() 14 | plt.style.use('kosh') 15 | plt.figure(figsize=(12, 7)) 16 | 17 | 18 | np.random.seed(np.random.randint(1213)) 19 | 20 | experiment_name = 'penn_random_init_noscale_0.5binaryActi_0.01WDecay' 21 | 22 | text_file = 'ptb.txt' 23 | 24 | vocabulary_size = 49 25 | 26 | n_output = n_input = vocabulary_size 27 | n_hidden = 1024 28 | n_modules = 8 29 | noutputs = vocabulary_size 30 | 31 | sequence_length = 100 32 | 33 | batch_size = 64 34 | learning_rate = 2e-3 35 | niterations = 100000 36 | momentum = 0.9 37 | 38 | forget_every = 100 39 | gradient_clip = (-1.0, 1.0) 40 | 41 | sample_every = 1000 42 | save_every = 1000 43 | plot_every = 100 44 | 45 | logs = {} 46 | 47 | data = loader('data/' + text_file, sequence_length, batch_size) 48 | 49 | 50 | def dW(W): 51 | load_params(model, W) 52 | forget(model) 53 | inputs, targets = data.fetch_train() 54 | loss = forward(model, inputs, targets) 55 | backward(model) 56 | 57 | gradients = extract_grads(model) 58 | clipped_gradients = np.clip(gradients, gradient_clip[0], gradient_clip[1]) 59 | 60 | gradient_norm = (gradients ** 2).sum() / gradients.size 61 | clipped_gradient_norm = (clipped_gradients ** 2).sum() / gradients.size 62 | 63 | logs['loss'].append(loss) 64 | logs['smooth_loss'].append(loss * 0.01 + logs['smooth_loss'][-1] * 0.99) 65 | logs['gradient_norm'].append(gradient_norm) 66 | logs['clipped_gradient_norm'].append(clipped_gradient_norm) 67 | 68 | return clipped_gradients 69 | 70 | 71 | os.system('mkdir results/' + experiment_name) 72 | path = 'results/' + experiment_name + '/' 73 | 74 | logs['loss'] = [] 75 | logs['val_loss'] = [] 76 | logs['accuracy'] = [] 77 | logs['smooth_loss'] = [np.log(49)] 78 | logs['gradient_norm'] = [] 79 | logs['clipped_gradient_norm'] = [] 80 | 81 | model = [ 82 | CWRNN2(n_input=n_input, n_hidden=n_hidden, n_modules=n_modules, T_max=sequence_length), 83 | Linear(n_hidden, n_output), 84 | SoftmaxCrossEntropyLoss() 85 | ] 86 | 87 | W = extract_params(model) 88 | 89 | optimizer = Adam(W, dW, learning_rate, momentum=momentum) 90 | 91 | print 'Approx. 
Parameters: ', W.size 92 | 93 | for i in optimizer: 94 | if i['n_iter'] > niterations: 95 | break 96 | 97 | print '\n\n' 98 | print str(i['n_iter']), '\t', 99 | print logs['loss'][-1], '\t', 100 | print logs['gradient_norm'][-1] 101 | print_info(model) 102 | print '\n', '----' * 20, '\n' 103 | 104 | if i['n_iter'] % sample_every == 0: 105 | forget(model) 106 | x = np.zeros((20, vocabulary_size, 1)) 107 | input, _ = data.fetch_train() 108 | x[:20, :, :] = input[:20, :, 0:1] 109 | ixes = [] 110 | for t in xrange(1000): 111 | forward(model, np.array(x), 1.0) 112 | p = model[-1].probs 113 | p = p[-1] 114 | ix = np.random.choice(range(vocabulary_size), p=p.ravel()) 115 | x = np.zeros((1, vocabulary_size, 1)) 116 | x[0, ix, 0] = 1 117 | ixes.append(ix) 118 | sample = ''.join(data.decoder.to_c[ix] for ix in ixes) 119 | print '----' * 20 120 | print sample 121 | print '----' * 20 122 | forget(model) 123 | 124 | if i['n_iter'] % save_every == 0: 125 | print 'serializing model... ' 126 | f = open(path + 'iter_' + str(i['n_iter']) +'.model', 'w') 127 | pickle.dump(model, f) 128 | f.close() 129 | 130 | if i['n_iter'] % plot_every == 0: 131 | plt.clf() 132 | plt.plot(logs['smooth_loss']) 133 | plt.draw() 134 | 135 | print 'serializing logs... ' 136 | f = open(path + 'logs.logs', 'w') 137 | pickle.dump(logs, f) 138 | f.close() 139 | 140 | print 'serializing final model... ' 141 | f = open(path + 'final.model', 'w') 142 | pickle.dump(model, f) 143 | f.close() 144 | 145 | plt.savefig(path + 'loss_curve') 146 | --------------------------------------------------------------------------------