├── .gitignore
├── README.md
├── oov_vec.py
├── oov_vec_nlc.py
├── LICENCE
├── semlp_rate_l2_dpout.py
├── lstmmlp_rate_l2_dpout.py
├── dataset.py
├── segae_gaereg.py
├── segae_l2_dpout.py
└── util_layers.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | *~
 2 | *.swo
 3 | *.swp
 4 | 
 5 | *.pyc
 6 | *.ipynb
 7 | 
 8 | *.npy
 9 | *.pkl
10 | *.html
11 | *.log
12 | 
13 | SMART_DISPATCH_LOGS/*
14 | html/*
15 | params/*
16 | screenlog.0
17 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Self Attentive Sentence Embedding
 2 | This is the implementation of the paper **A Structured Self-Attentive Sentence Embedding**, published at ICLR 2017: https://arxiv.org/abs/1703.03130 . We provide reproductions of the results on the Yelp, Age and SNLI datasets, as well as their baselines.
 3 | 
 4 | Thanks to the community, researchers from several groups had already
 5 | reimplemented this work before we released this version of the code. Some of
 6 | them even achieved higher performance than the results we reported in the
 7 | paper. We would like to thank them here, and we list those third-party
 8 | implementations at the end of this README. They also make our model
 9 | available in other frameworks (TensorFlow, PyTorch).
10 | 
11 | 
12 | ## Requirements:
13 | [Theano](http://deeplearning.net/software/theano/)  
14 | [Lasagne](http://lasagne.readthedocs.io/en/latest/)  
15 | [scikit-learn](http://scikit-learn.org/stable/)  
16 | [NLTK](http://www.nltk.org/)
17 | 
18 | 
19 | ## Datasets and Preprocessing
20 | The SNLI dataset can be downloaded from https://nlp.stanford.edu/projects/snli/ .
21 | The file ``oov_vec.py`` preprocesses this dataset and takes no command line arguments. The dataset and GloVe paths are hard-coded at the top of the script, so edit them to match your setup first.
22 | 
23 | The [Yelp](https://www.yelp.com/dataset_challenge) and [Age](http://pan.webis.de/clef16/pan16-web/author-profiling.html) datasets are both preprocessed by ``oov_vec_nlc.py``, which takes the word embedding as its first argument and the dataset as its second:
24 | ```
25 | python oov_vec_nlc.py glove age2
26 | python oov_vec_nlc.py glove yelp
27 | ```
28 | The first argument can be either `word2vec` or `glove`.
29 | 
30 | 
31 | ## Word Embeddings
32 | Our experiments are mainly based on GloVe embeddings (https://nlp.stanford.edu/projects/glove/), but we have also tested `word2vec` (https://code.google.com/archive/p/word2vec/) on the Age and Yelp datasets.
33 | 
34 | 
35 | ## Training Baselines
36 | After running the preprocessing scripts, the baseline results on the Age and Yelp datasets can be reproduced with the following configurations:
37 | 
38 | ```
39 | python lstmmlp_rate_l2_dpout.py 300 3000 0.06 0.0001 0.5 word2vec 100 16 0.5 300 0.1 1 age2
40 | python lstmmlp_rate_l2_dpout.py 300 3000 0.06 0.0001 0.5 word2vec 100 32 0.5 300 0.1 1 yelp
41 | ```
42 | 
43 | ## Training the Proposed Model
44 | 
45 | To reproduce the results in our paper on Age and Yelp, please run:
46 | ```
47 | python semlp_rate_l2_dpout.py 300 350 2000 30 0.001 0.3 0.0001 1. glove 300 50 0.5 100 0.1 1 age2
48 | python semlp_rate_l2_dpout.py 300 350 3000 30 0.001 0.3 0.0001 1. glove 300 50 0.5 100 0.1 1 yelp
49 | ```
50 | 
51 | And on the SNLI dataset:
52 | ```
53 | python segae_gaereg.py 300 150 4000 30 0.01 0.1 0.5 300 50 100 12 0.1
54 | ```
55 | 
56 | ## Third Party Implementations
57 | * PyTorch implementation by Haoyue Shi (@ExplorerFreda): https://github.com/ExplorerFreda/Structured-Self-Attentive-Sentence-Embedding
58 | 
59 | * PyTorch implementation by Yufeng Ma (@yufengm): https://github.com/yufengm/SelfAttentive
60 | 
61 | * TensorFlow implementation by Diego Antognini (@Diego999): https://github.com/Diego999/SelfSent
62 | 


--------------------------------------------------------------------------------
/oov_vec.py:
--------------------------------------------------------------------------------
  1 | import sys
  2 | import string
  3 | import numpy
  4 | import cPickle
  5 | import numpy as np
  6 | import nltk
  7 | 
  8 | import pdb
  9 | 
 10 | print "loading GloVe..."
 11 | w1 = {}
 12 | vec = open('/Users/johanlin/Datasets/wordembeddings/glove.840B.300d.txt', 'r')
 13 | for line in vec.readlines():
 14 |     line=line.split(' ')
 15 |     w1[line[0]] = np.asarray([float(x) for x in line[1:]]).astype('float32')
 16 | vec.close()
 17 | 
 18 | classname = {'entailment':0, 'neutral': 1, 'contradiction': 2, '-': 3}
 19 | f1 = open('/Users/johanlin/Datasets/snli_1.0/snli_1.0_train.txt', 'r')
 20 | f2 = open('/Users/johanlin/Datasets/snli_1.0/snli_1.0_dev.txt', 'r')
 21 | f3 = open('/Users/johanlin/Datasets/snli_1.0/snli_1.0_test.txt', 'r')
 22 | f = [f1, f2, f3]
 23 | 
 24 | 
 25 | print "processing dataset: 3 dots to punch: ",
 26 | sys.stdout.flush()
 27 | w2 = {}
 28 | w_referred = {0: 0}  # reserve 0 for future padding
 29 | vocab_count = 1  # 0 is reserved for future padding
 30 | train_valid_test = []
 31 | for file in f:
 32 |     print ".",
 33 |     sys.stdout.flush()
 34 |     pairs = []
 35 |     filehead = file.readline()  # strip the file head
 36 |     for line in file.readlines():
 37 |         line=line.split('\t')
 38 |         s1 = nltk.word_tokenize(line[5])
 39 |         s1[0]=s1[0].lower()
 40 |         s2 = nltk.word_tokenize(line[6])
 41 |         s2[0]=s2[0].lower()
 42 | 
 43 |         truth = classname[line[0]]
 44 |         
 45 |         if truth != 3:  # exclude those '-' tags
 46 |             s1_words = []
 47 |             for word in s1:
 48 |                 # strip some possible weird punctuations
 49 |                 word = word.strip(string.punctuation)
 50 |                 if not w_referred.has_key(word):
 51 |                     w_referred[word] = vocab_count
 52 |                     vocab_count += 1
 53 |                 s1_words.append(w_referred[word])
 54 |                 if not w1.has_key(word):
 55 |                     if not w2.has_key(word):
 56 |                         w2[word]=[]
 57 |                     # find the word embeddings of its surrounding words
 58 |                     for neighbor in s1:
 59 |                         if w1.has_key(neighbor):
 60 |                             w2[word].append(w1[neighbor])
 61 | 
 62 |             s2_words = []
 63 |             for word in s2:
 64 |                 word = word.strip(string.punctuation)
 65 |                 if not w_referred.has_key(word):
 66 |                     w_referred[word] = vocab_count
 67 |                     vocab_count += 1
 68 |                 s2_words.append(w_referred[word])
 69 |                 if not w1.has_key(word):
 70 |                     if not w2.has_key(word):
 71 |                         w2[word]=[]
 72 |                     for neighbor in s2:
 73 |                         if w1.has_key(neighbor):
 74 |                             w2[word].append(w1[neighbor])
 75 | 
 76 |             pairs.append((numpy.asarray(s1_words).astype('int32'),
 77 |                           numpy.asarray(s2_words).astype('int32'),
 78 |                           numpy.asarray(truth).astype('int32')))
 79 | 
 80 |     train_valid_test.append(pairs)
 81 |     file.close()
 82 | 
 83 | 
 84 | print "\naugmenting word embedding vocabulary..."
 85 | # this block causes a memory error on an 8 GB machine, so fixed statistics are used instead.
 86 | # all_sentences = [w2[x] for x in w2.iterkeys()]
 87 | # all_words = [item for sublist in all_sentences for item in sublist]
 88 | # mean_words = np.mean(all_words)
 89 | # mean_words_std = np.std(all_words)
 90 | mean_words = np.zeros((300,))
 91 | mean_words_std = 1e-1
 92 | 
 93 | npy_rng = np.random.RandomState(123)
 94 | for k in w2.iterkeys():
 95 |     if len(w2[k]) != 0:
 96 |         w2[k] = sum(w2[k]) / len(w2[k])  # mean of all surrounding words
 97 |     else:
 98 |         # len(w2[k]) == 0 cases: ['cantunderstans', 'motocyckes', 'arefun']
 99 |         # I hate those silly guys...
100 |         w2[k] = mean_words + npy_rng.randn(mean_words.shape[0]) * \
101 |                              mean_words_std * 0.1
102 | 
103 | w2.update(w1)
104 | 
105 | print "generating weight values..."
106 | # reverse w_referred's key-value;
107 | inv_w_referred = {v: k for k, v in w_referred.items()}
108 | 
109 | # number   --inv_w_referred-->   word   --w2-->   embedding
110 | ordered_word_embedding = [numpy.zeros((1, 300), dtype='float32'), ] + \
111 |     [w2[inv_w_referred[n]].reshape(1, -1) for n in range(1, len(inv_w_referred))]
112 | 
113 | # to get the matrix
114 | weight = numpy.concatenate(ordered_word_embedding, axis=0)
115 | 
116 | 
117 | print "dumping converted datasets..."
118 | save_file = open('/Users/johanlin/Datasets/snli_1.0/SNLI_GloVe_converted', 'wb')
119 | cPickle.dump("dict: truth values and their corresponding class name\n"
120 |              "the whole dataset, in list of list of tuples: list of train/valid/test set -> "
121 |                 "list of sentence pairs -> tuple with structure:"
122 |                 "(premise, hypothesis, truth class), all entries in numbers\n"
123 |              "numpy.ndarray: a matrix with all referred words' embedding in its rows,"
124 |                 "embeddings are ordered by their corresponding word numbers.\n"
125 |              "dict: the augmented GloVe word embedding. contains all possible tokens in SNLI."
126 |                 "All initial GloVe entries are included.\n"
127 |              "dict w_referred: word to their corresponding number\n"
128 |              "inverse of w_referred, number to words\n",
129 |              save_file)
130 | cPickle.dump(classname, save_file)
131 | cPickle.dump(train_valid_test, save_file)
132 | cPickle.dump(weight, save_file)
133 | cPickle.dump(w2, save_file)
134 | cPickle.dump(w_referred, save_file)
135 | cPickle.dump(inv_w_referred, save_file)
136 | save_file.close()
137 | 
138 | 
139 | # check:
140 | def reconstruct_sentence(sent_nums):
141 |     sent_words = [inv_w_referred[n] for n in sent_nums]
142 |     return sent_words
143 | 
144 | def check_word_embed(sent_nums):
145 |     sent_words = reconstruct_sentence(sent_nums)
146 | 
147 |     word_embeds_from_nums = [weight[n] for n in sent_nums]
148 |     word_embeds_from_words = [w2[n] for n in sent_words]
149 | 
150 |     error = 0.
151 |     for i, j in zip(word_embeds_from_nums, word_embeds_from_words):
152 |         error += numpy.sum(numpy.abs(i - j))  # abs, so opposite-signed errors cannot cancel
153 |     
154 |     if error == 0.:
155 |         return True
156 |     else:
157 |         return False
158 | 
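The out-of-vocabulary handling in the script above can be condensed into one function. The following is a minimal sketch, not the script itself: it is written to also run on Python 3, and the names `build_embedding_matrix`, `sentences`, `pretrained` and `dim` are assumptions made for illustration. As in the script, an unknown token receives the mean of the pretrained vectors of the tokens sharing a sentence with it, a token with no known neighbours receives small Gaussian noise (standard deviation 0.01), and row 0 of the matrix is kept at zero for padding.

```python
import numpy as np


def build_embedding_matrix(sentences, pretrained, dim, seed=123):
    """Hypothetical helper: sentences is a list of token lists,
    pretrained maps a token to its embedding vector of length dim."""
    word_to_id = {}   # token -> row index; index 0 is reserved for padding
    neighbors = {}    # OOV token -> pretrained vectors seen in its sentences
    for tokens in sentences:
        known = [pretrained[t] for t in tokens if t in pretrained]
        for tok in tokens:
            if tok not in word_to_id:
                word_to_id[tok] = len(word_to_id) + 1
            if tok not in pretrained:
                neighbors.setdefault(tok, []).extend(known)

    rng = np.random.RandomState(seed)
    weight = np.zeros((len(word_to_id) + 1, dim), dtype='float32')
    for tok, idx in word_to_id.items():
        if tok in pretrained:
            weight[idx] = pretrained[tok]
        elif neighbors[tok]:
            # mean of all surrounding in-vocabulary words
            weight[idx] = np.mean(neighbors[tok], axis=0)
        else:
            # no known neighbour at all: small random vector
            weight[idx] = rng.randn(dim) * 0.01
    return word_to_id, weight


# Toy usage with two-dimensional "embeddings".
toy_pretrained = {'a': np.array([1.0, 0.0]), 'b': np.array([0.0, 1.0])}
toy_ids, toy_weight = build_embedding_matrix(
    [['a', 'zzz', 'b'], ['qqq']], toy_pretrained, dim=2)
```

In the toy usage, `zzz` sits between `a` and `b` and therefore ends up halfway between their vectors, while `qqq` has no known neighbour and falls back to noise.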


--------------------------------------------------------------------------------
/oov_vec_nlc.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import sys
  3 | import string
  4 | import numpy
  5 | import cPickle
  6 | import numpy as np
  7 | import nltk
  8 | 
  9 | import pdb
 10 | 
 11 | 
 12 | ###################################################
 13 | #                Overall stats
 14 | ###################################################
 15 | #                       # entries            # dims
 16 | # adapted word2vec         530158               100
 17 | # glove                   2196016               300
 18 | ###################################################
 19 | #                            age1     age2     yelp
 20 | # train data                    -    68485   500000
 21 | # dev/test data                 -     4000     2000
 22 | # word2vec known token              126862   233293
 23 | #            UNK token               93124   184427
 24 | # glove    known token              126862   233293
 25 | #            UNK token               49268   104717 
 26 | ###################################################
 27 | #                                   120739
 28 | #                                    43256    88984
 29 | 
 30 | wdembed = sys.argv[1]      # word2vec, glove
 31 | data_choice = sys.argv[2]  # age1, age2, yelp
 32 | 
 33 | 
 34 | # load word embedding:
 35 | if wdembed == 'word2vec':
 36 |     print "loading adapted word2vec..."
 37 |     fname = '/home/hantek/datasets/NLC_data/embedding'
 38 |     w1 = {}
 39 |     vec = open(fname, 'r')
 40 |     for line in vec.readlines():
 41 |         line=line.split()
 42 |         w1[line[0]] = np.asarray([float(x) for x in line[1:]]).astype('float32')
 43 |     vec.close()
 44 | elif wdembed == 'glove':
 45 |     print "loading GloVe..."
 46 |     fname = '/home/hantek/datasets/glove/glove.840B.300d.dict.pkl'
 47 |     if os.path.isfile(fname):
 48 |         w1 = cPickle.load(open(fname, 'rb'))
 49 |     else:
 50 |         w1 = {}
 51 |         vec = open('/home/hantek/datasets/glove/glove.840B.300d.txt', 'r')
 52 |         for line in vec.readlines():
 53 |             line=line.split(' ')
 54 |             w1[line[0]] = np.asarray([float(x) for x in line[1:]]).astype('float32')
 55 |         vec.close()
 56 |         save_file = open(fname, 'wb')
 57 |         cPickle.dump(w1, save_file)
 58 |         save_file.close()
 59 | else:
 60 |     raise ValueError("cmd args 1 has to be either 'word2vec' or 'glove'.")
 61 | 
 62 | 
 63 | # load data:
 64 | if data_choice == 'age1':
 65 |     f1 = open('/home/hantek/datasets/NLC_data/age1/age1_train', 'r')
 66 |     f2 = open('/home/hantek/datasets/NLC_data/age1/age1_valid', 'r')
 67 |     f3 = open('/home/hantek/datasets/NLC_data/age1/age1_test', 'r')
 68 |     classname = {}
 69 | elif data_choice == 'age2':
 70 |     f1 = open('/home/hantek/datasets/NLC_data/age2/age2_train', 'r')
 71 |     f2 = open('/home/hantek/datasets/NLC_data/age2/age2_valid', 'r')
 72 |     f3 = open('/home/hantek/datasets/NLC_data/age2/age2_test', 'r')
 73 |     # note that class No. = rating -1
 74 |     classname = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4}
 75 | elif data_choice == 'yelp':
 76 |     f1 = open('/home/hantek/datasets/NLC_data/yelp/yelp_train_500k', 'r')
 77 |     f2 = open('/home/hantek/datasets/NLC_data/yelp/yelp_valid_2000', 'r')
 78 |     f3 = open('/home/hantek/datasets/NLC_data/yelp/yelp_test_2000', 'r')
 79 |     # note that class No. = rating -1
 80 |     classname = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4}
 81 | else:
 82 |     raise ValueError("cmd args 2 has to be either 'age1', 'age2' or 'yelp'.")
 83 | f = [f1, f2, f3]
 84 | 
 85 | 
 86 | print "processing dataset, 3 dots to punch: ",
 87 | sys.stdout.flush()
 88 | w2 = {}
 89 | w_referred = {0: 0}  # reserve 0 for future padding
 90 | vocab_count = 1  # 0 is reserved for future padding
 91 | train_dev_test = []
 92 | for file in f:
 93 |     print ".",
 94 |     sys.stdout.flush()
 95 |     pairs = []
 96 |     for line in file.readlines():
 97 |         line=line.decode('utf-8').split()
 98 |         s1 = line[1:]
 99 |         s1[0]=s1[0].lower()
100 | 
101 |         rate_score = classname[line[0]]
102 |         # rate_score = line[0]
103 | 
104 |         s1_words = []
105 |         for word in s1:
106 |             if not w_referred.has_key(word):
107 |                 w_referred[word] = vocab_count
108 |                 vocab_count += 1
109 |             s1_words.append(w_referred[word])
110 |             if not w1.has_key(word):
111 |                 if not w2.has_key(word):
112 |                     w2[word]=[]
113 |                 # find the word embeddings of its surrounding words
114 |                 for neighbor in s1:
115 |                     if w1.has_key(neighbor):
116 |                         w2[word].append(w1[neighbor])
117 | 
118 |         pairs.append((numpy.asarray(s1_words).astype('int32'),
119 |                       rate_score))
120 |                       # numpy.asarray(rate_score).astype('int32')))
121 | 
122 |     train_dev_test.append(pairs)
123 |     file.close()
124 | 
125 | # pdb.set_trace()  # leftover debugging breakpoint; uncomment to inspect the parsed data
126 | 
127 | print "\naugmenting word embedding vocabulary..."
128 | # this block causes a memory error on an 8 GB machine, so fixed statistics are used instead.
129 | # all_sentences = [w2[x] for x in w2.iterkeys()]
130 | # all_words = [item for sublist in all_sentences for item in sublist]
131 | # mean_words = np.mean(all_words)
132 | # mean_words_std = np.std(all_words)
133 | mean_words = np.zeros((len(w1['the']),))
134 | mean_words_std = 1e-1
135 | 
136 | npy_rng = np.random.RandomState(123)
137 | for k in w2.iterkeys():
138 |     if len(w2[k]) != 0:
139 |         w2[k] = sum(w2[k]) / len(w2[k])  # mean of all surrounding words
140 |     else:
141 |         # len(w2[k]) == 0 cases: ['cantunderstans', 'motocyckes', 'arefun']
142 |         # I hate those silly guys...
143 |         w2[k] = mean_words + npy_rng.randn(mean_words.shape[0]) * \
144 |                              mean_words_std * 0.1
145 | 
146 | w2.update(w1)
147 | 
148 | print "generating weight values..."
149 | # reverse w_referred's key-value;
150 | inv_w_referred = {v: k for k, v in w_referred.items()}
151 | 
152 | # number   --inv_w_referred-->   word   --w2-->   embedding
153 | ordered_word_embedding = [numpy.zeros((1, len(w1['the'])), dtype='float32'), ] + \
154 |     [w2[inv_w_referred[n]].reshape(1, -1) for n in range(1, len(inv_w_referred))]
155 | 
156 | # to get the matrix
157 | weight = numpy.concatenate(ordered_word_embedding, axis=0)
158 | 
159 | 
160 | print "dumping converted datasets..."
161 | if data_choice == 'age1':
162 |     save_file = open('/home/hantek/datasets/NLC_data/age1/' + wdembed + '_age1.pkl', 'wb')
163 | elif data_choice == 'age2':
164 |     save_file = open('/home/hantek/datasets/NLC_data/age2/' + wdembed + '_age2.pkl', 'wb')
165 | elif data_choice == 'yelp':
166 |     save_file = open('/home/hantek/datasets/NLC_data/yelp/' + wdembed + '_yelp.pkl', 'wb')
167 | 
168 | cPickle.dump("dict: truth values and their corresponding class name\n"
169 |              "the whole dataset, in list of list of tuples: list of train/valid/test set -> "
170 |                 "list of sentence pairs -> tuple with structure:"
171 |                 "(review, truth rate), all entries in numbers\n"
172 |              "numpy.ndarray: a matrix with all referred words' embedding in its rows,"
173 |                 "embeddings are ordered by their corresponding word numbers.\n"
174 |              "None: placeholder for the augmented word embedding dict, which is not dumped here."
175 |                 "Dump w2 instead if you need it.\n"
176 |              "dict w_referred: word to their corresponding number\n"
177 |              "inverse of w_referred, number to words\n",
178 |              save_file)
179 | cPickle.dump(classname, save_file)
180 | cPickle.dump(train_dev_test, save_file)
181 | cPickle.dump(weight, save_file)
182 | fake_w2 = None; cPickle.dump(fake_w2, save_file)
183 | # cPickle.dump(w2, save_file)  # this is a huge dictionary, delete it if you don't need.
184 | cPickle.dump(w_referred, save_file)
185 | cPickle.dump(inv_w_referred, save_file)
186 | save_file.close()
187 | print "Done."
188 | 
189 | 
190 | # check:
191 | def reconstruct_sentence(sent_nums):
192 |     sent_words = [inv_w_referred[n] for n in sent_nums]
193 |     return sent_words
194 | 
195 | def check_word_embed(sent_nums):
196 |     sent_words = reconstruct_sentence(sent_nums)
197 | 
198 |     word_embeds_from_nums = [weight[n] for n in sent_nums]
199 |     word_embeds_from_words = [w2[n] for n in sent_words]
200 | 
201 |     error = 0.
202 |     for i, j in zip(word_embeds_from_nums, word_embeds_from_words):
203 |         error += numpy.sum(numpy.abs(i - j))  # abs, so opposite-signed errors cannot cancel
204 |     
205 |     if error == 0.:
206 |         return True
207 |     else:
208 |         return False
209 | 
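Both preprocessing scripts write seven objects into one file with consecutive `cPickle.dump` calls, so a reader has to call `load` seven times in the same order. The sketch below shows that pattern on an in-memory buffer. It targets Python 3, and the function name `load_converted` and the dictionary keys are assumptions made for illustration. The `encoding='latin1'` argument is commonly needed when Python 3 reads pickles containing NumPy arrays that were written by Python 2; whether it is required for your files is an assumption to verify.

```python
import io
import pickle

# Order in which the preprocessing scripts dump their objects.
FIELD_NAMES = ['description', 'classname', 'splits', 'weight',
               'w2', 'w_referred', 'inv_w_referred']


def load_converted(fileobj, encoding='latin1'):
    """Hypothetical helper: read the sequentially pickled objects
    from an open binary file and return them in a dict."""
    return {name: pickle.load(fileobj, encoding=encoding)
            for name in FIELD_NAMES}


# Toy usage: write seven small stand-in objects, then read them back.
buf = io.BytesIO()
for obj in ['doc string', {'1': 0}, [[], [], []], [[0.0, 0.0]],
            None, {0: 0}, {0: 0}]:
    pickle.dump(obj, buf)
buf.seek(0)
data = load_converted(buf)
```

With a real file, open it in binary mode (`open(path, 'rb')`) and pass the file object to the helper. The `splits` entry then holds the train, validation and test lists, and `weight` holds the embedding matrix.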


--------------------------------------------------------------------------------
/LICENCE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "{}"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright {yyyy} {name of copyright owner}
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 
203 | 


--------------------------------------------------------------------------------
/semlp_rate_l2_dpout.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: utf-8 -*-
  3 | 
  4 | from __future__ import print_function
  5 | 
  6 | import time
  7 | import os
  8 | import sys
  9 | import numpy
 10 | import cPickle
 11 | import theano
 12 | import theano.tensor as T
 13 | import lasagne
 14 | from lasagne.layers.recurrent import Gate
 15 | from lasagne.layers import (DropoutLayer, LSTMLayer, EmbeddingLayer,
 16 |                             ConcatLayer, DenseLayer)
 17 | from lasagne import init, nonlinearities
 18 | 
 19 | from util_layers import (DenseLayer3DInput, Softmax3D, ApplyAttention,
 20 |                          GatedEncoder3D)
 21 | from dataset import YELP, AGE2
 22 | 
 23 | import pdb
 24 | theano.config.compute_test_value = 'warn'  # Set to 'off' to disable test values
 25 | 
 26 | 
 27 | LSTMHID = int(sys.argv[1])          # 300 Hidden unit numbers in LSTM
 28 | ATTHID = int(sys.argv[2])           # 350 Hidden unit numbers in attention MLP
 29 | OUTHID = int(sys.argv[3])           # 3000 Hidden unit numbers in output MLP
 30 | NROW = int(sys.argv[4])             # 30 Number of rows in matrix representation
 31 | LR = float(sys.argv[5])             # 0.001
 32 | DPOUT = float(sys.argv[6])          # 0.3 dropout rate
 33 | L2REG = float(sys.argv[7])          # 0.0001 L2 regularization
 34 | ATTPNLT = float(sys.argv[8])        # 0.
 35 | WE = str(sys.argv[9])               # either `word2vec` or `glove`
 36 | WEDIM = int(sys.argv[10])           # either  100       or  300   Dim
 37 | BSIZE = int(sys.argv[11])           # 50 Minibatch size
 38 | GCLIP = float(sys.argv[12])         # 0.5 Gradients with magnitude above this are clipped
 39 | NEPOCH = int(sys.argv[13])          # 100 Number of epochs to train the net
 40 | STD = float(sys.argv[14])           # 0.1 Standard deviation of weights in init
 41 | UPDATEWE = bool(int(sys.argv[15]))  # 0 (False) or 1 (True): whether to update word embeddings
 42 | DSET = str(sys.argv[16])            # dataset, either `yelp` or `age2`
 43 | 
 44 | filename = os.path.splitext(os.path.basename(__file__))[0] + \
 45 |            '_LSTMHID' + str(LSTMHID) + \
 46 |            '_ATTHID' + str(ATTHID) + \
 47 |            '_OUTHID' + str(OUTHID) + \
 48 |            '_NROW' + str(NROW) + \
 49 |            '_LR' + str(LR) + \
 50 |            '_DPOUT' + str(DPOUT) + \
 51 |            '_L2REG' + str(L2REG) + \
 52 |            '_ATTPNLT' + str(ATTPNLT) + \
 53 |            '_WE' + str(WE) + \
 54 |            '_WEDIM' + str(WEDIM) + \
 55 |            '_BSIZE' + str(BSIZE) + \
 56 |            '_GCLIP' + str(GCLIP) + \
 57 |            '_NEPOCH' + str(NEPOCH) + \
 58 |            '_STD' + str(STD) + \
 59 |            '_UPDATEWE' + str(UPDATEWE) + \
 60 |            '_DSET' + DSET
 61 | 
 62 | def main(num_epochs=NEPOCH):
 63 |     if DSET == 'yelp':
 64 |         print("Loading yelp dataset ...")
 65 |         loaded_dataset = YELP(
 66 |             batch_size=BSIZE,
 67 |             datapath="/home/hantek/datasets/NLC_data/yelp/word2vec_yelp.pkl")
 68 |     elif DSET == 'age2':
 69 |         print("Loading age2 dataset ...")
 70 |         loaded_dataset = AGE2(
 71 |             batch_size=BSIZE,
 72 |             datapath="/home/hantek/datasets/NLC_data/age2/word2vec_age2.pkl")
 73 |     else:
 74 |         raise ValueError("DSET was set incorrectly. Check your cmd args.")
 75 |     #                     yelp     age2
 76 |     # train data        500000    68450
 77 |     # dev/test data       2000     4000
 78 |     # vocab                      ~1.2e5
 79 |     # 
 80 | 
 81 |     train_batches = list(loaded_dataset.train_minibatch_generator())
 82 |     dev_batches = list(loaded_dataset.dev_minibatch_generator())
 83 |     test_batches = list(loaded_dataset.test_minibatch_generator())
 84 |     W_word_embedding = loaded_dataset.weight  # W shape: (# vocab size, WE_DIM)
 85 |     del loaded_dataset
 86 | 
 87 |     print("Building network ...")
 88 |     ########### sentence embedding encoder ###########
 89 |     # sentence vector; each entry is a word index
 90 |     input_var = T.TensorType('int32', [False, False])('sentence_vector')
 91 |     input_var.tag.test_value = numpy.hstack((
 92 |         numpy.random.randint(1, 10000, (BSIZE, 20), 'int32'),
 93 |         numpy.zeros((BSIZE, 5)).astype('int32')))
 94 |     input_var.tag.test_value[1, 20:22] = (413, 45)
 95 |     l_in = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var)
 96 |     
 97 |     input_mask = T.TensorType('int32', [False, False])('sentence_mask')
 98 |     input_mask.tag.test_value = numpy.hstack((
 99 |         numpy.ones((BSIZE, 20), dtype='int32'),
100 |         numpy.zeros((BSIZE, 5), dtype='int32')))
101 |     input_mask.tag.test_value[1, 20:22] = 1
102 |     l_mask = lasagne.layers.InputLayer(shape=(BSIZE, None),
103 |                                        input_var=input_mask)
104 | 
105 |     # output shape (BSIZE, None, WEDIM)
106 |     l_word_embed = lasagne.layers.EmbeddingLayer(
107 |         l_in,
108 |         input_size=W_word_embedding.shape[0],
109 |         output_size=W_word_embedding.shape[1],
110 |         W=W_word_embedding)
111 | 
112 |     # bidirectional LSTM
113 |     l_forward = lasagne.layers.LSTMLayer(
114 |         l_word_embed, mask_input=l_mask, num_units=LSTMHID,
115 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
116 |                     W_cell=init.Normal(STD)),
117 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
118 |                     W_cell=init.Normal(STD)),
119 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
120 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
121 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
122 |                     W_cell=init.Normal(STD)),
123 |         nonlinearity=lasagne.nonlinearities.tanh,
124 |         peepholes = False,
125 |         grad_clipping=GCLIP)
126 | 
127 |     l_backward = lasagne.layers.LSTMLayer(
128 |         l_word_embed, mask_input=l_mask, num_units=LSTMHID,
129 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
130 |                     W_cell=init.Normal(STD)),
131 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
132 |                     W_cell=init.Normal(STD)),
133 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
134 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
135 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
136 |                     W_cell=init.Normal(STD)),
137 |         nonlinearity=lasagne.nonlinearities.tanh,
138 |         peepholes = False,
139 |         grad_clipping=GCLIP, backwards=True)
140 |     
141 |     # output dim: (BSIZE, None, 2*LSTMHID)
142 |     l_concat = lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2)
143 |     l_concat_dpout = lasagne.layers.DropoutLayer(l_concat, p=DPOUT, rescale=True)
144 | 
145 |     # Attention mechanism to get sentence embedding
146 |     # output dim: (BSIZE, None, ATTHID)
147 |     l_ws1 = DenseLayer3DInput(l_concat_dpout, num_units=ATTHID)
148 |     l_ws1_dpout = lasagne.layers.DropoutLayer(l_ws1, p=DPOUT, rescale=True)
149 |     # output dim: (BSIZE, None, NROW)
150 |     l_ws2 = DenseLayer3DInput(l_ws1_dpout, num_units=NROW, nonlinearity=None)
151 |     l_annotations = Softmax3D(l_ws2, mask=l_mask)
152 |     # output dim: (BSIZE, 2*LSTMHID, NROW)
153 |     l_sentence_embedding = ApplyAttention([l_annotations, l_concat])
154 |     l_sentence_embedding_dpout = lasagne.layers.DropoutLayer(
155 |         l_sentence_embedding, p=DPOUT, rescale=True)
156 | 
157 |     l_outhid = lasagne.layers.DenseLayer(
158 |         l_sentence_embedding_dpout, num_units=OUTHID,
159 |         nonlinearity=lasagne.nonlinearities.rectify)
160 |     l_outhid_dpout = lasagne.layers.DropoutLayer(l_outhid, p=DPOUT, rescale=True)
161 | 
162 |     l_output = lasagne.layers.DenseLayer(
163 |         l_outhid_dpout, num_units=5, nonlinearity=lasagne.nonlinearities.softmax)
164 | 
165 | 
166 |     ########### target, cost, validation, etc. ##########
167 |     target_values = T.ivector('target_output')
168 |     target_values.tag.test_value = numpy.asarray([1,] * BSIZE, dtype='int32')
169 | 
170 |     network_output, annotation = lasagne.layers.get_output(
171 |         [l_output, l_annotations])
172 |     network_prediction = T.argmax(network_output, axis=1)
173 |     accuracy = T.mean(T.eq(network_prediction, target_values))
174 | 
175 |     network_output_clean, annotation_clean = lasagne.layers.get_output(
176 |         [l_output, l_annotations], deterministic=True)
177 |     network_prediction_clean = T.argmax(network_output_clean, axis=1)
178 |     accuracy_clean = T.mean(T.eq(network_prediction_clean, target_values))
179 | 
180 |     L2_attentionmlp = (l_ws1.W ** 2).sum() + (l_ws2.W ** 2).sum()
181 |     L2_outputhid = (l_outhid.W ** 2).sum()
182 |     L2_softmax = (l_output.W ** 2).sum()
183 |     L2 = L2_attentionmlp + L2_outputhid + L2_softmax 
184 | 
185 |     # penalty term and cost; the clean cost uses the deterministic annotation
186 |     def attention_penalty_of(a):
187 |         gram = T.batched_dot(a, a.dimshuffle(0, 2, 1))
188 |         return T.mean((gram - T.eye(a.shape[1]).dimshuffle('x', 0, 1)) ** 2)
189 |     attention_penalty = attention_penalty_of(annotation)
190 |     attention_penalty_clean = attention_penalty_of(annotation_clean)
191 |     cost = T.mean(T.nnet.categorical_crossentropy(network_output,
192 |                                                   target_values)) + \
193 |            ATTPNLT * attention_penalty + L2REG * L2
194 |     cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean,
195 |                                                         target_values)) + \
196 |                  ATTPNLT * attention_penalty_clean + L2REG * L2
197 | 
198 |     # Retrieve all parameters from the network
199 |     all_params = lasagne.layers.get_all_params(l_output)
200 |     if not UPDATEWE:
201 |         all_params.remove(l_word_embed.W)
202 | 
203 |     numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]])
204 |     print("Number of params: {}\nName\t\t\tShape\t\t\tSize".format(numparams))
205 |     print("-----------------------------------------------------------------")
206 |     for item in all_params:
207 |         print("{0:24}{1:24}{2}".format(item, item.shape.eval(), numpy.prod(item.shape.eval())))
208 | 
209 |     # If a parameter file exists, load the parameters from it
210 |     look_for = 'params' + os.sep + 'params_' + filename + '.pkl'
211 |     if os.path.isfile(look_for):
212 |         print("Resuming from file: " + look_for)
213 |         all_param_values = cPickle.load(open(look_for, 'rb'))
214 |         for p, v in zip(all_params, all_param_values):
215 |             p.set_value(v)
216 |    
217 |     # Compute SGD updates for training
218 |     print("Computing updates ...")
219 |     updates = lasagne.updates.sgd(cost, all_params, LR)
220 | 
221 |     # Theano functions for training and computing cost
222 |     print("Compiling functions ...")
223 |     train = theano.function(
224 |         [l_in.input_var, l_mask.input_var, target_values],
225 |         [cost, accuracy], updates=updates)
226 |     compute_cost = theano.function(
227 |         [l_in.input_var, l_mask.input_var, target_values],
228 |         [cost_clean, accuracy_clean])
229 | 
230 |     def evaluate(mode):
231 |         if mode == 'dev':
232 |             data = dev_batches
233 |         if mode == 'test':
234 |             data = test_batches
235 |         
236 |         set_cost = 0.
237 |         set_accuracy = 0.
238 |         for batches_seen, (hypo, hm, truth) in enumerate(data, 1):
239 |             _cost, _accuracy = compute_cost(hypo, hm, truth)
240 |             set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \
241 |                        1.0 / batches_seen * _cost
242 |             set_accuracy = (1.0 - 1.0 / batches_seen) * set_accuracy + \
243 |                            1.0 / batches_seen * _accuracy
244 |         
245 |         return set_cost, set_accuracy
246 |     
247 |     print("Done. Evaluating scratch model ...")
248 |     test_set_cost,  test_set_accuracy  = evaluate('test')
249 |     print("BEFORE TRAINING: test cost %f, accuracy %f" % (
250 |         test_set_cost,  test_set_accuracy))
251 |     print("Training ...")
252 |     try:
253 |         for epoch in range(num_epochs):
254 |             train_set_cost = 0.
255 |             train_set_accuracy = 0.
256 |             start = time.time()
257 |             
258 |             for batches_seen, (hypo, hm, truth) in enumerate(train_batches, 1):
259 |                 _cost, _accuracy = train(hypo, hm, truth)
260 |                 train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \
261 |                                  1.0 / batches_seen * _cost
262 |                 train_set_accuracy = \
263 |                     (1.0 - 1.0 / batches_seen) * train_set_accuracy + \
264 |                     1.0 / batches_seen * _accuracy
265 |                 if batches_seen % 100 == 0:
266 |                     end = time.time()
267 |                     print("Sample %d %.2fs, lr %.4f, train cost %f, accuracy %f" % (
268 |                         batches_seen * BSIZE,
269 |                         end - start,
270 |                         LR,
271 |                         train_set_cost,
272 |                         train_set_accuracy))
273 |                     start = end
274 | 
275 |                 if batches_seen % 2000 == 0:
276 |                     dev_set_cost,  dev_set_accuracy  = evaluate('dev')
277 |                     test_set_cost, test_set_accuracy = evaluate('test')
278 |                     print("RECORD: cost: train %f dev %f test %f\n"
279 |                           "        accu: train %f dev %f test %f" % (
280 |                         train_set_cost,     dev_set_cost,     test_set_cost,
281 |                         train_set_accuracy, dev_set_accuracy, test_set_accuracy))
282 | 
283 |             # save parameters
284 |             all_param_values = [p.get_value() for p in all_params]
285 |             cPickle.dump(all_param_values,
286 |                          open('params' + os.sep + 'params_' + filename + '.pkl', 'wb'))
287 | 
288 |             dev_set_cost,  dev_set_accuracy  = evaluate('dev')
289 |             test_set_cost, test_set_accuracy = evaluate('test')
290 |             print("RECORD:epoch %d, cost: train %f dev %f test %f\n"
291 |                   "                 accu: train %f dev %f test %f" % (
292 |                 epoch,
293 |                 train_set_cost,     dev_set_cost,   test_set_cost,
294 |                 train_set_accuracy, dev_set_accuracy, test_set_accuracy))
295 |     except KeyboardInterrupt:
296 |         pdb.set_trace()
297 |         pass
298 | 
299 | if __name__ == '__main__':
300 |     main()
301 | 
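The attention penalty built in `semlp_rate_l2_dpout.py` can be illustrated without Theano. The sketch below is a plain-Python restatement, assuming each batch item is a small matrix `A` given as a list of rows; the function name `attention_penalty` and the nested-list layout are illustrative choices, not part of the repository. Like the Theano expression, it averages the squared entries of `A A^T - I` over the whole batch.

```python
def attention_penalty(annotations):
    """Mean of the squared entries of (A A^T - I) over a batch of matrices.

    `annotations` is a list of matrices; each matrix is a list of rows.
    This layout is an assumption made for illustration only.
    """
    total = 0.0
    count = 0
    for a in annotations:
        n = len(a)
        for i in range(n):
            for j in range(n):
                # Entry (i, j) of A A^T is the dot product of rows i and j.
                dot = sum(x * y for x, y in zip(a[i], a[j]))
                target = 1.0 if i == j else 0.0
                total += (dot - target) ** 2
                count += 1
    return total / count


# Rows that attend to different positions give no penalty;
# two identical rows are penalised.
distinct = [[1.0, 0.0], [0.0, 1.0]]
redundant = [[1.0, 0.0], [1.0, 0.0]]
penalty_distinct = attention_penalty([distinct])
penalty_redundant = attention_penalty([redundant])
```

The penalty is zero only when the rows are orthonormal, which is what pushes different rows towards attending to different parts of the sentence.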
302 | 
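The cost and accuracy accumulators in `evaluate` and in the training loop use an incremental mean: after the n-th minibatch the stored value is scaled by `(1 - 1/n)` and the new value is added with weight `1/n`. A minimal sketch of that update, using a hypothetical helper `running_mean` that is not part of the repository:

```python
def running_mean(values):
    """Incremental mean, using the same update rule as the training scripts."""
    mean = 0.0
    for n, value in enumerate(values, 1):
        mean = (1.0 - 1.0 / n) * mean + 1.0 / n * value
    return mean


batch_costs = [1.0, 2.0, 3.0, 4.0]
average_cost = running_mean(batch_costs)
plain_average = sum(batch_costs) / len(batch_costs)
```

Up to floating-point rounding, the result equals the plain arithmetic mean of the per-batch values. It matches the mean over individual examples only when every minibatch has the same size.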


--------------------------------------------------------------------------------
/lstmmlp_rate_l2_dpout.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: utf-8 -*-
  3 | 
  4 | from __future__ import print_function
  5 | 
  6 | import time
  7 | import os
  8 | import sys
  9 | import numpy
 10 | import cPickle
 11 | import theano
 12 | import theano.tensor as T
 13 | from sklearn.metrics import confusion_matrix
 14 | import lasagne
 15 | from lasagne.layers.recurrent import Gate
 16 | from lasagne import init, nonlinearities
 17 | 
 18 | from util_layers import (DenseLayer3DInput, Softmax3D, ApplyAttention,
 19 |                          GatedEncoder3D, Maxpooling)
 20 | from dataset import YELP, AGE2
 21 | 
 22 | import pdb
 23 | theano.config.compute_test_value = 'off'  # Use 'warn' to activate test values
 24 | 
 25 | """
 26 | BEST test set result:
 27 | yelp    77.575% L2REG=0.0001,  DPOUT=0.3
 28 | age2    63.65%  L2REG=0.00001, DPOUT=0.2
 29 | """
 30 | LSTMHID = int(sys.argv[1])          # 500 Hidden unit numbers in LSTM
 31 | OUTHID = int(sys.argv[2])           # 1000 Hidden unit numbers in output MLP
 32 | LR = float(sys.argv[3])             # 0.01 Smaller than 0.04.
 33 | L2REG = float(sys.argv[4])          # 0.0001 L2 regularization
 34 | DPOUT = float(sys.argv[5])          # 0.3 dropout rate
 35 | WE = str(sys.argv[6])               # either `word2vec` or `glove`
 36 | WEDIM = int(sys.argv[7])            # either  100       or  300   Dim
 37 | BSIZE = int(sys.argv[8])            # 16 Minibatch size
 38 | GCLIP = float(sys.argv[9])          # 0.5 Gradients with magnitude above this are clipped
 39 | NEPOCH = int(sys.argv[10])          # 300 Number of epochs to train the net
 40 | STD = float(sys.argv[11])           # 0.1 Standard deviation of weights in init
 41 |                                     # very slightly better than 0.01
 42 | UPDATEWE = bool(int(sys.argv[12]))  # 1 Update word embeddings: 0 (False) or 1 (True)
 43 | DSET = str(sys.argv[13])            # dataset, either `yelp` or `age2`
 44 | 
 45 | filename = os.path.splitext(os.path.basename(__file__))[0] + \
 46 |            '_LSTMHID' + str(LSTMHID) + \
 47 |            '_OUTHID' + str(OUTHID) + \
 48 |            '_LR' + str(LR) + \
 49 |            '_L2REG' + str(L2REG) + \
 50 |            '_DPOUT' + str(DPOUT) + \
 51 |            '_WE' + str(WE) + \
 52 |            '_WEDIM' + str(WEDIM) + \
 53 |            '_BSIZE' + str(BSIZE) + \
 54 |            '_GCLIP' + str(GCLIP) + \
 55 |            '_NEPOCH' + str(NEPOCH) + \
 56 |            '_STD' + str(STD) + \
 57 |            '_UPDATEWE' + str(UPDATEWE) + \
 58 |            '_DSET' + DSET
 59 | 
 60 | def main(num_epochs=NEPOCH):
 61 |     if DSET == 'yelp':
 62 |         print("Loading yelp dataset ...")
 63 |         loaded_dataset = YELP(
 64 |             batch_size=BSIZE,
 65 |             datapath="/home/hantek/datasets/NLC_data/yelp/word2vec_yelp.pkl")
 66 |     elif DSET == 'age2':
 67 |         print("Loading age2 dataset ...")
 68 |         loaded_dataset = AGE2(
 69 |             batch_size=BSIZE,
 70 |             datapath="/home/hantek/datasets/NLC_data/age2/word2vec_age2.pkl")
 71 |     else:
 72 |         raise ValueError("DSET was set incorrectly. Check your cmd args.")
 73 |     #                     yelp     age2
 74 |     # train data        500000    68450
 75 |     # dev/test data       2000     4000
 76 |     # vocab                      ~1.2e5
 77 |     # 
 78 | 
 79 |     train_batches = list(loaded_dataset.train_minibatch_generator())
 80 |     dev_batches = list(loaded_dataset.dev_minibatch_generator())
 81 |     test_batches = list(loaded_dataset.test_minibatch_generator())
 82 |     W_word_embedding = loaded_dataset.weight  # W shape: (# vocab size, WE_DIM)
 83 |     del loaded_dataset
 84 | 
 85 |     print("Building network ...")
 86 |     ########### sentence embedding encoder ###########
 87 |     # sentence vector; each entry is a word index
 88 |     input_var = T.TensorType('int32', [False, False])('sentence_vector')
 89 |     input_var.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 20), 'int32'),
 90 |                                              numpy.zeros((BSIZE, 5)).astype('int32')))
 91 |     input_var.tag.test_value[1, 20:22] = (413, 45)
 92 |     l_in = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var)
 93 |     
 94 |     input_mask = T.TensorType('int32', [False, False])('sentence_mask')
 95 |     input_mask.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 20), dtype='int32'),
 96 |                                              numpy.zeros((BSIZE, 5), dtype='int32')))
 97 |     input_mask.tag.test_value[1, 20:22] = 1
 98 |     l_mask = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask)
 99 | 
100 |     # output shape (BSIZE, None, WEDIM)
101 |     l_word_embed = lasagne.layers.EmbeddingLayer(
102 |         l_in,
103 |         input_size=W_word_embedding.shape[0],
104 |         output_size=W_word_embedding.shape[1],
105 |         W=W_word_embedding)
106 | 
107 |     # bidirectional LSTM
108 |     l_forward = lasagne.layers.LSTMLayer(
109 |         l_word_embed, mask_input=l_mask, num_units=LSTMHID,
110 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
111 |                     W_cell=init.Normal(STD)),
112 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
113 |                     W_cell=init.Normal(STD)),
114 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
115 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
116 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
117 |                     W_cell=init.Normal(STD)),
118 |         nonlinearity=lasagne.nonlinearities.tanh,
119 |         peepholes = False,
120 |         only_return_final=False,
121 |         grad_clipping=GCLIP)
122 | 
123 |     l_backward = lasagne.layers.LSTMLayer(
124 |         l_word_embed, mask_input=l_mask, num_units=LSTMHID,
125 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
126 |                     W_cell=init.Normal(STD)),
127 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
128 |                     W_cell=init.Normal(STD)),
129 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
130 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
131 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
132 |                     W_cell=init.Normal(STD)),
133 |         nonlinearity=lasagne.nonlinearities.tanh,
134 |         peepholes = False,
135 |         only_return_final=False,
136 |         grad_clipping=GCLIP, backwards=True)
137 |     
138 |     # output dim: (BSIZE, None, 2*LSTMHID)
139 |     l_concat = lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2)
140 | 
141 |     # output dim: (BSIZE, 2*LSTMHID)
142 |     l_maxpool = Maxpooling(l_concat, axis=1)
143 |     l_maxpool_dpout = lasagne.layers.DropoutLayer(l_maxpool, p=DPOUT, rescale=True)
144 | 
145 |     l_outhid = lasagne.layers.DenseLayer(
146 |         l_maxpool_dpout, num_units=OUTHID,
147 |         nonlinearity=lasagne.nonlinearities.rectify)
148 |     l_outhid_dpout = lasagne.layers.DropoutLayer(l_outhid, p=DPOUT, rescale=True)
149 | 
150 |     l_output = lasagne.layers.DenseLayer(
151 |         l_outhid_dpout, num_units=5, nonlinearity=lasagne.nonlinearities.softmax)
152 | 
153 | 
154 |     ########### target, cost, validation, etc. ##########
155 |     target_values = T.ivector('target_output')
156 |     target_values.tag.test_value = numpy.asarray([1,] * BSIZE, dtype='int32')
157 | 
158 |     network_output = lasagne.layers.get_output(l_output)
159 |     network_prediction = T.argmax(network_output, axis=1)
160 |     accuracy = T.mean(T.eq(network_prediction, target_values))
161 | 
162 |     network_output_clean = lasagne.layers.get_output(l_output, deterministic=True)
163 |     network_prediction_clean = T.argmax(network_output_clean, axis=1)
164 |     accuracy_clean = T.mean(T.eq(network_prediction_clean, target_values))
165 |     
166 |     L2_lstm = ((l_forward.W_in_to_ingate ** 2).sum() + \
167 |                (l_forward.W_hid_to_ingate ** 2).sum() + \
168 |                (l_forward.W_in_to_forgetgate ** 2).sum() + \
169 |                (l_forward.W_hid_to_forgetgate ** 2).sum() + \
170 |                (l_forward.W_in_to_cell ** 2).sum() + \
171 |                (l_forward.W_hid_to_cell ** 2).sum() + \
172 |                (l_forward.W_in_to_outgate ** 2).sum() + \
173 |                (l_forward.W_hid_to_outgate ** 2).sum() + \
174 |                (l_backward.W_in_to_ingate ** 2).sum() + \
175 |                (l_backward.W_hid_to_ingate ** 2).sum() + \
176 |                (l_backward.W_in_to_forgetgate ** 2).sum() + \
177 |                (l_backward.W_hid_to_forgetgate ** 2).sum() + \
178 |                (l_backward.W_in_to_cell ** 2).sum() + \
179 |                (l_backward.W_hid_to_cell ** 2).sum() + \
180 |                (l_backward.W_in_to_outgate ** 2).sum() + \
181 |                (l_backward.W_hid_to_outgate ** 2).sum())
182 |     L2_outputhid = (l_outhid.W ** 2).sum()
183 |     L2_softmax = (l_output.W ** 2).sum()
184 |     L2 = L2_lstm + L2_outputhid + L2_softmax 
185 |     
186 |     cost = T.mean(T.nnet.categorical_crossentropy(network_output,
187 |                                                   target_values)) + \
188 |            L2REG * L2
189 |     cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean,
190 |                                                         target_values)) + \
191 |                  L2REG * L2
192 | 
193 |     # Retrieve all parameters from the network
194 |     all_params = lasagne.layers.get_all_params(l_output)
195 |     if not UPDATEWE:
196 |         all_params.remove(l_word_embed.W)
197 | 
198 |     numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]])
199 |     print("Number of params: {}\nName\t\t\tShape\t\t\tSize".format(numparams))
200 |     print("-----------------------------------------------------------------")
201 |     for item in all_params:
202 |         print("{0:24}{1:24}{2}".format(item, item.shape.eval(), numpy.prod(item.shape.eval())))
203 | 
204 |     # If a parameter file exists, load the parameters from it
205 |     look_for = 'params' + os.sep + 'params_' + filename + '.pkl'
206 |     if os.path.isfile(look_for):
207 |         print("Resuming from file: " + look_for)
208 |         all_param_values = cPickle.load(open(look_for, 'rb'))
209 |         for p, v in zip(all_params, all_param_values):
210 |             p.set_value(v)
211 |    
212 |     # Compute Adagrad updates for training
213 |     print("Computing updates ...")
214 |     updates = lasagne.updates.adagrad(cost, all_params, LR)
215 | 
216 |     # Theano functions for training and computing cost
217 |     print("Compiling functions ...")
218 |     train = theano.function(
219 |         [l_in.input_var, l_mask.input_var, target_values],
220 |         [cost, accuracy], updates=updates)
221 |     compute_cost = theano.function(
222 |         [l_in.input_var, l_mask.input_var, target_values],
223 |         [cost_clean, accuracy_clean])
224 |     predict = theano.function(
225 |         [l_in.input_var, l_mask.input_var],
226 |         network_prediction_clean)
227 | 
228 |     def evaluate(mode, verbose=False):
229 |         if mode == 'dev':
230 |             data = dev_batches
231 |         if mode == 'test':
232 |             data = test_batches
233 |         
234 |         set_cost = 0.
235 |         set_accuracy = 0.
236 |         for batches_seen, (hypo, hm, truth) in enumerate(data, 1):
237 |             _cost, _accuracy = compute_cost(hypo, hm, truth)
238 |             set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \
239 |                        1.0 / batches_seen * _cost
240 |             set_accuracy = (1.0 - 1.0 / batches_seen) * set_accuracy + \
241 |                            1.0 / batches_seen * _accuracy
242 |         if verbose:
243 |             predicted = []
244 |             truth = []
245 |             for batches_seen, (sent, mask, th) in enumerate(data, 1):
246 |                 predicted.append(predict(sent, mask))
247 |                 truth.append(th)
248 |             truth = numpy.concatenate(truth)
249 |             predicted = numpy.concatenate(predicted)
250 |             cm = confusion_matrix(truth, predicted)
251 |             pr_a = cm.trace()*1.0 / truth.size
252 |             pr_e = ((cm.sum(axis=0)*1.0/truth.size) * \
253 |                     (cm.sum(axis=1)*1.0/truth.size)).sum()
254 |             k = (pr_a - pr_e) / (1 - pr_e)
255 |             print(mode + " set statistics:")
256 |             print("kappa index of agreement: %f" % k)
257 |             print("confusion matrix:")
258 |             print(cm)
259 | 
260 |         return set_cost, set_accuracy
261 |     
262 | 
263 |     print("Done. Evaluating scratch model ...")
264 |     test_set_cost,  test_set_accuracy  = evaluate('test', verbose=True)
265 |     print("BEFORE TRAINING: test cost %f, accuracy %f" % (
266 |         test_set_cost, test_set_accuracy))
267 |     print("Training ...")
268 |     try:
269 |         for epoch in range(num_epochs):
270 |             train_set_cost = 0.
271 |             train_set_accuracy = 0.
272 |             start = time.time()
273 |             
274 |             for batches_seen, (hypo, hm, truth) in enumerate(train_batches, 1):
275 |                 _cost, _accuracy = train(hypo, hm, truth)
276 |                 train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \
277 |                                  1.0 / batches_seen * _cost
278 |                 train_set_accuracy = (1.0 - 1.0 / batches_seen) * train_set_accuracy + \
279 |                                   1.0 / batches_seen * _accuracy
280 |                 if batches_seen % 100 == 0:
281 |                     end = time.time()
282 |                     print("Sample %d %.2fs, lr %.4f, train cost %f, accuracy %f"  % (
283 |                         batches_seen * BSIZE,
284 |                         end - start,
285 |                         LR,
286 |                         train_set_cost,
287 |                         train_set_accuracy))
288 |                     start = end
289 | 
290 |                 if batches_seen % 2000 == 0:
291 |                     dev_set_cost,  dev_set_accuracy = evaluate('dev')
292 |                     test_set_cost, test_set_accuracy = evaluate('test')
293 |                     print("RECORD: cost: train %f dev %f test %f\n"
294 |                           "        accu: train %f dev %f test %f" % (
295 |                         train_set_cost,     dev_set_cost,     test_set_cost,
296 |                         train_set_accuracy, dev_set_accuracy, test_set_accuracy))
297 | 
298 |             # save parameters
299 |             all_param_values = [p.get_value() for p in all_params]
300 |             cPickle.dump(all_param_values,
301 |                          open('params' + os.sep + 'params_' + filename + '.pkl', 'wb'))
302 | 
303 |             dev_set_cost,  dev_set_accuracy  = evaluate('dev')
304 |             test_set_cost, test_set_accuracy = evaluate('test', verbose=True)
305 |             print("RECORD:epoch %d, cost: train %f dev %f test %f\n"
306 |                   "         accu: train %f dev %f test %f" % (
307 |                 epoch,
308 |                 train_set_cost,     dev_set_cost,     test_set_cost,
309 |                 train_set_accuracy, dev_set_accuracy, test_set_accuracy))
310 |     except KeyboardInterrupt:
311 |         pdb.set_trace()
312 |         pass
313 | 
314 | if __name__ == '__main__':
315 |     main()
316 | 
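The verbose branch of `evaluate` in `lstmmlp_rate_l2_dpout.py` derives a kappa index of agreement from the confusion matrix: observed agreement is the trace over the total, and chance agreement is the sum of products of the row and column marginals. The sketch below restates that calculation in plain Python, assuming the confusion matrix is a square list of lists; `cohen_kappa` is an illustrative name, not a function from the repository.

```python
def cohen_kappa(cm):
    """Kappa index of agreement from a square confusion matrix (list of rows)."""
    total = float(sum(sum(row) for row in cm))
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    # Chance agreement: product of matching row and column marginals, summed.
    row_marginals = [sum(row) / total for row in cm]
    col_marginals = [sum(col) / total for col in zip(*cm)]
    expected = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (observed - expected) / (1.0 - expected)


example_cm = [[20, 5],
              [10, 15]]
kappa = cohen_kappa(example_cm)
```

For `example_cm`, observed agreement is 0.7 and chance agreement is 0.5, so kappa is about 0.4. The formula is undefined when chance agreement equals 1, for example when every sample falls into a single class.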
317 | 
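Both training scripts read their hyperparameters positionally from `sys.argv`, so every value must be supplied in the right order. One possible alternative, shown only as a sketch, is to use named flags with `argparse`. The flag names and the `build_parser` helper are hypothetical; the defaults are taken from the inline comments in `lstmmlp_rate_l2_dpout.py`, except the `word2vec` and 100-dimension defaults, which are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical named-flag version of the positional sys.argv parsing."""
    p = argparse.ArgumentParser(description="BiLSTM + max-pooling baseline")
    p.add_argument("--lstmhid", type=int, default=500)     # LSTM hidden units
    p.add_argument("--outhid", type=int, default=1000)     # output MLP hidden units
    p.add_argument("--lr", type=float, default=0.01)       # learning rate
    p.add_argument("--l2reg", type=float, default=0.0001)  # L2 regularization
    p.add_argument("--dpout", type=float, default=0.3)     # dropout rate
    p.add_argument("--we", choices=["word2vec", "glove"], default="word2vec")
    p.add_argument("--wedim", type=int, default=100)       # embedding dimension
    p.add_argument("--bsize", type=int, default=16)        # minibatch size
    p.add_argument("--gclip", type=float, default=0.5)     # gradient clipping
    p.add_argument("--nepoch", type=int, default=300)      # training epochs
    p.add_argument("--std", type=float, default=0.1)       # init std of weights
    p.add_argument("--updatewe", type=int, choices=[0, 1], default=1)
    p.add_argument("--dset", choices=["yelp", "age2"], required=True)
    return p


# Parse an explicit list here; a script would call parse_args() with no arguments.
args = build_parser().parse_args(["--dset", "age2", "--lr", "0.005"])
```

With named flags, unspecified values fall back to their defaults and invalid dataset names are rejected by `argparse` before any data is loaded.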


--------------------------------------------------------------------------------
/dataset.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import cPickle
  3 | import theano
  4 | import numpy
  5 | import warnings
  6 | 
  7 | import pdb
  8 | 
  9 | 
 10 | class SNLI(object):
 11 |     def __init__(self, batch_size=50, loadall=False,
 12 |                  datapath="/home/hantek/datasets/SNLI_GloVe_converted"):
 13 |         self.batch_size = batch_size
 14 |         self.datapath = datapath
 15 |         
 16 |         data_file = open(self.datapath, 'rb')
 17 |         cPickle.load(data_file)
 18 |         cPickle.load(data_file)
 19 |         self.train_set, self.dev_set, self.test_set = cPickle.load(data_file)
 20 |         self.weight = cPickle.load(data_file).astype(theano.config.floatX)
 21 |         if loadall:
 22 |             self.word2embed = cPickle.load(data_file)   # key: word, value: embedding
 23 |             self.word2num = cPickle.load(data_file)     # key: word, value: number
 24 |             self.num2word = cPickle.load(data_file)     # key: number, value: word
 25 |         data_file.close()
 26 | 
 27 |         self.train_size = len(self.train_set)
 28 |         self.dev_size = len(self.dev_set)
 29 |         self.test_size = len(self.test_set)
 30 |         self.train_ptr = 0
 31 |         self.dev_ptr = 0
 32 |         self.test_ptr = 0
 33 | 
 34 |     def train_minibatch_generator(self):
 35 |         while self.train_ptr <= self.train_size - self.batch_size:
 36 |             self.train_ptr += self.batch_size
 37 |             minibatch = self.train_set[self.train_ptr - self.batch_size : self.train_ptr]
 38 |             if len (minibatch) < self.batch_size:
 39 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
 40 |             
 41 |             longest_hypo, longest_premise = \
 42 |                 numpy.max(map(lambda x: (len(x[0]), len(x[1])), minibatch), axis=0)
 43 | 
 44 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
 45 |             premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32')
 46 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
 47 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
 48 |             mask_premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32')
 49 |             for i, (h, p, t) in enumerate(minibatch):
 50 |                 hypos[i, :len(h)] = h
 51 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
 52 |                 premises[i, :len(p)] = p
 53 |                 mask_premises[i, :len(p)] = (1,) * len(p)
 54 |                 truth[i] = t
 55 |             
 56 |             yield hypos, mask_hypos, premises, mask_premises, truth
 57 | 
 58 |         else:
 59 |             self.train_ptr = 0
 60 |             return  # ends the generator; safe under PEP 479 as well
 61 | 
 62 |     def dev_minibatch_generator(self, ):
 63 |         while self.dev_ptr <= self.dev_size - self.batch_size:
 64 |             self.dev_ptr += self.batch_size
 65 |             minibatch = self.dev_set[self.dev_ptr - self.batch_size : self.dev_ptr]
 66 |             if len (minibatch) < self.batch_size:
 67 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
 68 | 
 69 |             longest_hypo, longest_premise = \
 70 |                 numpy.max(map(lambda x: (len(x[0]), len(x[1])), minibatch), axis=0)
 71 | 
 72 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
 73 |             premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32')
 74 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
 75 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
 76 |             mask_premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32')
 77 |             for i, (h, p, t) in enumerate(minibatch):
 78 |                 hypos[i, :len(h)] = h
 79 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
 80 |                 premises[i, :len(p)] = p
 81 |                 mask_premises[i, :len(p)] = (1,) * len(p)
 82 |                 truth[i] = t
 83 | 
 84 |             yield hypos, mask_hypos, premises, mask_premises, truth
 85 | 
 86 |         else:
 87 |             self.dev_ptr = 0
 88 |             return  # ends the generator; safe under PEP 479 as well
 89 | 
 90 |     def test_minibatch_generator(self, ):
 91 |         while self.test_ptr <= self.test_size - self.batch_size:
 92 |             self.test_ptr += self.batch_size
 93 |             minibatch = self.test_set[self.test_ptr - self.batch_size : self.test_ptr]
 94 |             if len (minibatch) < self.batch_size:
 95 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
 96 | 
 97 |             longest_hypo, longest_premise = \
 98 |                 numpy.max(map(lambda x: (len(x[0]), len(x[1])), minibatch), axis=0)
 99 | 
100 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
101 |             premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32')
102 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
103 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
104 |             mask_premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32')
105 |             for i, (h, p, t) in enumerate(minibatch):
106 |                 hypos[i, :len(h)] = h
107 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
108 |                 premises[i, :len(p)] = p
109 |                 mask_premises[i, :len(p)] = (1,) * len(p)
110 |                 truth[i] = t
111 | 
112 |             yield hypos, mask_hypos, premises, mask_premises, truth
113 | 
114 |         else:
115 |             self.test_ptr = 0
116 |             return  # ends the generator; safe under PEP 479 as well
117 | 
118 | 
119 | class SICK(SNLI):
120 |     def __init__(self, batch_size=50, loadall=False, augment=False,
121 |                  datapath="/Users/johanlin/Datasets/SICK/"):
122 |         self.batch_size = batch_size
123 |         if augment:
124 |             self.datapath = os.path.join(datapath, 'SICK_augmented.pkl')
125 |         else:
126 |             self.datapath = os.path.join(datapath, 'SICK.pkl')
127 |         super(SICK, self).__init__(batch_size, loadall, self.datapath)
128 | 
129 | 
130 | class YELP(object):
131 |     def __init__(self, batch_size=50, loadall=False,
132 |                  datapath="/home/hantek/datasets/NLC_data/yelp/yelp.pkl"):
133 |         self.batch_size = batch_size
134 |         self.datapath = datapath
135 |         
136 |         data_file = open(self.datapath, 'rb')
137 |         cPickle.load(data_file)
138 |         cPickle.load(data_file)
139 |         self.train_set, self.dev_set, self.test_set = cPickle.load(data_file)
140 |         self.weight = cPickle.load(data_file).astype(theano.config.floatX)
141 |         if loadall:
142 |             self.word2embed = cPickle.load(data_file)   # key: word, value: embedding
143 |             self.word2num = cPickle.load(data_file)     # key: word, value: number
144 |             self.num2word = cPickle.load(data_file)     # key: number, value: word
145 |         data_file.close()
146 | 
147 |         self.train_size = len(self.train_set)
148 |         self.dev_size = len(self.dev_set)
149 |         self.test_size = len(self.test_set)
150 |         self.train_ptr = 0
151 |         self.dev_ptr = 0
152 |         self.test_ptr = 0
153 | 
154 |     def train_minibatch_generator(self):
155 |         while self.train_ptr <= self.train_size - self.batch_size:
156 |             self.train_ptr += self.batch_size
157 |             minibatch = self.train_set[self.train_ptr - self.batch_size : self.train_ptr]
158 |             if len (minibatch) < self.batch_size:
159 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
160 |             
161 |             longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0)
162 | 
163 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
164 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
165 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
166 |             for i, (h, t) in enumerate(minibatch):
167 |                 hypos[i, :len(h)] = h
168 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
169 |                 truth[i] = t
170 |             
171 |             yield hypos, mask_hypos, truth
172 | 
173 |         else:
174 |             self.train_ptr = 0
175 |             return  # ends the generator; safe under PEP 479 as well
176 | 
177 |     def dev_minibatch_generator(self, ):
178 |         while self.dev_ptr <= self.dev_size - self.batch_size:
179 |             self.dev_ptr += self.batch_size
180 |             minibatch = self.dev_set[self.dev_ptr - self.batch_size : self.dev_ptr]
181 |             if len (minibatch) < self.batch_size:
182 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
183 | 
184 |             longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0)
185 | 
186 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
187 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
188 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
189 |             for i, (h, t) in enumerate(minibatch):
190 |                 hypos[i, :len(h)] = h
191 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
192 |                 truth[i] = t
193 |             
194 |             yield hypos, mask_hypos, truth
195 | 
196 |         else:
197 |             self.dev_ptr = 0
198 |             return  # ends the generator; safe under PEP 479 as well
199 | 
200 |     def test_minibatch_generator(self, ):
201 |         while self.test_ptr <= self.test_size - self.batch_size:
202 |             self.test_ptr += self.batch_size
203 |             minibatch = self.test_set[self.test_ptr - self.batch_size : self.test_ptr]
204 |             if len (minibatch) < self.batch_size:
205 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
206 | 
207 |             longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0)
208 | 
209 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
210 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
211 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
212 |             for i, (h, t) in enumerate(minibatch):
213 |                 hypos[i, :len(h)] = h
214 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
215 |                 truth[i] = t
216 |             
217 |             yield hypos, mask_hypos, truth
218 | 
219 |         else:
220 |             self.test_ptr = 0
221 |             return  # ends the generator; safe under PEP 479 as well
222 | 
223 | 
224 | class AGE2(YELP):
225 |     def __init__(self, batch_size=50, loadall=False,
226 |                  datapath="/home/hantek/datasets/NLC_data/age2/age2.pkl"):
227 |         super(AGE2, self).__init__(batch_size, loadall, datapath)
228 | 
229 | 
230 | class STANFORDSENTIMENTTREEBANK(object):
231 |     def __init__(self, batch_size=50, loadext=False, loadhelper=False, wordembed='word2vec',
232 |                  datapath="/home/hantek/datasets/StanfordSentimentTreebank"):
233 |         self.batch_size = batch_size
234 |         self.datapath = datapath
235 |         
236 |         save_file = open(self.datapath + os.sep + 'sst_' + wordembed + '.pkl', 'rb')
237 |         cPickle.load(save_file)
238 |         self.train_set, self.dev_set, self.test_set = cPickle.load(save_file)
239 |         self.weight = cPickle.load(save_file).astype(theano.config.floatX)
240 |         save_file.close()
241 |         
242 |         if loadext:
243 |             save_file_ext = open(self.datapath + os.sep + 'sst_' + wordembed + '_ext.pkl', 'rb')
244 |             train_set, dev_set, test_set = cPickle.load(save_file_ext)
245 |             self.train_set += train_set
246 |             self.dev_set += dev_set
247 |             self.test_set += test_set
248 |             save_file_ext.close()
249 |         
250 |         if loadhelper:
251 |             helper = open(self.datapath + os.sep + 'sst_' + wordembed + '_helper.pkl', 'rb')
252 |             self.word2embed = cPickle.load(helper)   # key: word, value: embedding
253 |             self.word2num = cPickle.load(helper)     # key: word, value: number
254 |             self.num2word = cPickle.load(helper)     # key: number, value: word
255 |             helper.close()
256 | 
257 |         self.train_size = len(self.train_set)
258 |         self.dev_size = len(self.dev_set)
259 |         self.test_size = len(self.test_set)
260 |         self.train_ptr = 0
261 |         self.dev_ptr = 0
262 |         self.test_ptr = 0
263 | 
264 |     def train_minibatch_generator(self):
265 |         while self.train_ptr <= self.train_size - self.batch_size:
266 |             self.train_ptr += self.batch_size
267 |             minibatch = self.train_set[self.train_ptr - self.batch_size : self.train_ptr]
268 |             if len (minibatch) < self.batch_size:
269 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
270 |             
271 |             longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0)
272 | 
273 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
274 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
275 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
276 |             for i, (h, t) in enumerate(minibatch):
277 |                 hypos[i, :len(h)] = h
278 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
279 |                 truth[i] = t
280 |             
281 |             yield hypos, mask_hypos, truth
282 | 
283 |         else:
284 |             self.train_ptr = 0
285 |             return  # ends the generator; safe under PEP 479 as well
286 | 
287 |     def dev_minibatch_generator(self, ):
288 |         while self.dev_ptr <= self.dev_size - self.batch_size:
289 |             self.dev_ptr += self.batch_size
290 |             minibatch = self.dev_set[self.dev_ptr - self.batch_size : self.dev_ptr]
291 |             if len (minibatch) < self.batch_size:
292 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
293 | 
294 |             longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0)
295 | 
296 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
297 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
298 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
299 |             for i, (h, t) in enumerate(minibatch):
300 |                 hypos[i, :len(h)] = h
301 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
302 |                 truth[i] = t
303 |             
304 |             yield hypos, mask_hypos, truth
305 | 
306 |         else:
307 |             self.dev_ptr = 0
308 |             return  # ends the generator; safe under PEP 479 as well
309 | 
310 |     def test_minibatch_generator(self, ):
311 |         while self.test_ptr <= self.test_size - self.batch_size:
312 |             self.test_ptr += self.batch_size
313 |             minibatch = self.test_set[self.test_ptr - self.batch_size : self.test_ptr]
314 |             if len (minibatch) < self.batch_size:
315 |                 warnings.warn("There will be empty slots in minibatch data.", UserWarning)
316 | 
317 |             longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0)
318 | 
319 |             hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
320 |             truth = numpy.zeros((self.batch_size,), dtype='int32')
321 |             mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32')
322 |             for i, (h, t) in enumerate(minibatch):
323 |                 hypos[i, :len(h)] = h
324 |                 mask_hypos[i, :len(h)] = (1,) * len(h)
325 |                 truth[i] = t
326 |             
327 |             yield hypos, mask_hypos, truth
328 | 
329 |         else:
330 |             self.test_ptr = 0
331 |             return  # ends the generator; safe under PEP 479 as well
332 | 
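
The generators in `dataset.py` all repeat one pattern: take a fixed-size slice of examples, pad every token sequence with zeros up to the longest one in the slice, and build a matching 0/1 mask. The following is a minimal standalone sketch of that pattern, assuming examples are `(token_ids, label)` pairs as in the `YELP` class. The helper name `pad_minibatch` is hypothetical and not part of the repository, and unlike the original it sizes the arrays by `len(minibatch)` rather than by `self.batch_size`.

```python
import numpy


def pad_minibatch(minibatch):
    """Pad (token_ids, label) pairs into dense int32 arrays plus a 0/1 mask.

    Hypothetical helper, not part of dataset.py; it mirrors the body of the
    YELP-style generators for a single minibatch.
    """
    batch_size = len(minibatch)
    longest = max(len(tokens) for tokens, _ in minibatch)

    tokens_padded = numpy.zeros((batch_size, longest), dtype='int32')
    mask = numpy.zeros((batch_size, longest), dtype='int32')
    labels = numpy.zeros((batch_size,), dtype='int32')
    for i, (tokens, label) in enumerate(minibatch):
        tokens_padded[i, :len(tokens)] = tokens   # real tokens go on the left
        mask[i, :len(tokens)] = 1                 # 1 marks a token, 0 marks padding
        labels[i] = label
    return tokens_padded, mask, labels


# Two toy examples of different lengths.
batch = [([4, 9, 2], 1), ([7], 0)]
tokens_padded, mask, labels = pad_minibatch(batch)
```

A shared helper along these lines could replace the nine near-identical loop bodies, with the SNLI variant calling it once for hypotheses and once for premises; whether that refactoring is worthwhile is a design choice left to the maintainers.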


--------------------------------------------------------------------------------
/segae_gaereg.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: utf-8 -*-
  3 | 
  4 | from __future__ import print_function
  5 | 
  6 | import time
  7 | import os
  8 | import sys
  9 | import numpy
 10 | import cPickle
 11 | import theano
 12 | import theano.tensor as T
 13 | import lasagne
 14 | from lasagne.layers.recurrent import Gate
 15 | from lasagne import init, nonlinearities
 16 | 
 17 | from util_layers import DenseLayer3DInput, Softmax3D, ApplyAttention, GatedEncoder3D
 18 | from dataset import SNLI
 19 | 
 20 | import pdb
 21 | theano.config.compute_test_value = 'warn'  # 'off' # Use 'warn' to activate this feature
 22 | 
 23 | 
 24 | LSTM_HIDDEN = int(sys.argv[1])          # 150 Hidden unit numbers in LSTM
 25 | ATTENTION_HIDDEN = int(sys.argv[2])     # 350 Hidden unit numbers in attention MLP
 26 | OUT_HIDDEN = int(sys.argv[3])           # 3000 Hidden unit numbers in output MLP
 27 | N_ROWS = int(sys.argv[4])               # 10 Number of rows in matrix representation
 28 | LEARNING_RATE = float(sys.argv[5])      # 0.01
 29 | ATTENTION_PENALTY = float(sys.argv[6])  # 1.
 30 | GAEREG = float(sys.argv[7])             # 0.5 Dropout in GAE
 31 | WE_DIM = int(sys.argv[8])               # 300 Dim of word embedding
 32 | BATCH_SIZE = int(sys.argv[9])           # 50 Minibatch size
 33 | GRAD_CLIP = int(sys.argv[10])           # 100 All gradients above this will be clipped
 34 | NUM_EPOCHS = int(sys.argv[11])          # 12 Number of epochs to train the net
 35 | STD = float(sys.argv[12])               # 0.1 Standard deviation of weights in initialization
 36 | filename = __file__.split('.')[0] + \
 37 |            '_LSTMHIDDEN' + str(LSTM_HIDDEN) + \
 38 |            '_ATTENTIONHIDDEN' + str(ATTENTION_HIDDEN) + \
 39 |            '_OUTHIDDEN' + str(OUT_HIDDEN) + \
 40 |            '_NROWS' + str(N_ROWS) + \
 41 |            '_LEARNINGRATE' + str(LEARNING_RATE) + \
 42 |            '_ATTENTIONPENALTY' + str(ATTENTION_PENALTY) + \
 43 |            '_GAEREG' + str(GAEREG) + \
 44 |            '_WEDIM' + str(WE_DIM) + \
 45 |            '_BATCHSIZE' + str(BATCH_SIZE) + \
 46 |            '_GRADCLIP' + str(GRAD_CLIP) + \
 47 |            '_NUMEPOCHS' + str(NUM_EPOCHS) + \
 48 |            '_STD' + str(STD)
 49 | 
 50 | 
 51 | def main(num_epochs=NUM_EPOCHS):
 52 |     print("Loading data ...")
 53 |     snli = SNLI(batch_size=BATCH_SIZE)
 54 |     train_batches = list(snli.train_minibatch_generator())
 55 |     dev_batches = list(snli.dev_minibatch_generator())
 56 |     test_batches = list(snli.test_minibatch_generator())
 57 |     W_word_embedding = snli.weight  # W shape: (# vocab size, WE_DIM)
 58 |     del snli
 59 | 
 60 |     print("Building network ...")
 61 |     ########### sentence embedding encoder ###########
 62 |     # sentence vector, with each number standing for a word number
 63 |     input_var = T.TensorType('int32', [False, False])('sentence_vector')
 64 |     input_var.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (50, 20), 'int32'),
 65 |                                              numpy.zeros((50, 5)).astype('int32')))
 66 |     input_var.tag.test_value[1, 20:22] = (413, 45)
 67 |     l_in = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_var)
 68 |     
 69 |     input_mask = T.TensorType('int32', [False, False])('sentence_mask')
 70 |     input_mask.tag.test_value = numpy.hstack((numpy.ones((50, 20), dtype='int32'),
 71 |                                              numpy.zeros((50, 5), dtype='int32')))
 72 |     input_mask.tag.test_value[1, 20:22] = 1
 73 |     l_mask = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_mask)
 74 | 
 75 |     # output shape (BATCH_SIZE, None, WE_DIM)
 76 |     l_word_embed = lasagne.layers.EmbeddingLayer(
 77 |         l_in,
 78 |         input_size=W_word_embedding.shape[0],
 79 |         output_size=W_word_embedding.shape[1],
 80 |         W=W_word_embedding)  # how to set it to be non-trainable?
 81 | 
 82 | 
 83 |     # bidirectional LSTM
 84 |     l_forward = lasagne.layers.LSTMLayer(
 85 |         l_word_embed, mask_input=l_mask, num_units=LSTM_HIDDEN,
 86 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
 87 |                     W_cell=init.Normal(STD)),
 88 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
 89 |                     W_cell=init.Normal(STD)),
 90 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
 91 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
 92 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
 93 |                     W_cell=init.Normal(STD)),
 94 |         nonlinearity=lasagne.nonlinearities.tanh,
 95 |         peepholes = False,
 96 |         grad_clipping=GRAD_CLIP)
 97 | 
 98 |     l_backward = lasagne.layers.LSTMLayer(
 99 |         l_word_embed, mask_input=l_mask, num_units=LSTM_HIDDEN,
100 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
101 |                     W_cell=init.Normal(STD)),
102 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
103 |                     W_cell=init.Normal(STD)),
104 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
105 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
106 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
107 |                     W_cell=init.Normal(STD)),
108 |         nonlinearity=lasagne.nonlinearities.tanh,
109 |         peepholes = False,
110 |         grad_clipping=GRAD_CLIP, backwards=True)
111 |     
112 |     # output dim: (BATCH_SIZE, None, 2*LSTM_HIDDEN)
113 |     l_concat = lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2)
114 | 
115 |     # Attention mechanism to get sentence embedding
116 |     # output dim: (BATCH_SIZE, None, ATTENTION_HIDDEN)
117 |     l_ws1 = DenseLayer3DInput(l_concat, num_units=ATTENTION_HIDDEN)
118 |     # output dim: (BATCH_SIZE, None, N_ROWS)
119 |     l_ws2 = DenseLayer3DInput(l_ws1, num_units=N_ROWS, nonlinearity=None)
120 |     l_annotations = Softmax3D(l_ws2, mask=l_mask)
121 |     # output dim: (BATCH_SIZE, 2*LSTM_HIDDEN, N_ROWS)
122 |     l_sentence_embedding = ApplyAttention([l_annotations, l_concat])
123 | 
124 |     # beam search? Bi lstm in the sentence embedding layer? etc.
125 | 
126 | 
127 |     ########### get embeddings for hypothesis and premise ###########
128 |     # hypothesis
129 |     input_var_h = T.TensorType('int32', [False, False])('hypothesis_vector')
130 |     input_var_h.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (50, 18), 'int32'),
131 |                                                numpy.zeros((50, 6)).astype('int32')))
132 |     l_in_h = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_var_h)
133 |     
134 |     input_mask_h = T.TensorType('int32', [False, False])('hypo_mask')
135 |     input_mask_h.tag.test_value = numpy.hstack((numpy.ones((50, 18), dtype='int32'),
136 |                                                 numpy.zeros((50, 6), dtype='int32')))
137 |     input_mask_h.tag.test_value[1, 18:22] = 1
138 |     l_mask_h = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_mask_h)
139 |     
140 |     # premise
141 |     input_var_p = T.TensorType('int32', [False, False])('premise_vector')
142 |     input_var_p.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (50, 16), 'int32'),
143 |                                                numpy.zeros((50, 3)).astype('int32')))
144 |     l_in_p = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_var_p)
145 |     
146 |     input_mask_p = T.TensorType('int32', [False, False])('premise_mask')
147 |     input_mask_p.tag.test_value = numpy.hstack((numpy.ones((50, 16), dtype='int32'),
148 |                                                 numpy.zeros((50, 3), dtype='int32')))
149 |     input_mask_p.tag.test_value[1, 16:18] = 1
150 |     l_mask_p = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_mask_p)
151 |     
152 |     
153 |     hypothesis_embedding, hypothesis_annotation = lasagne.layers.get_output(
154 |         [l_sentence_embedding, l_annotations],
155 |         {l_in: l_in_h.input_var, l_mask: l_mask_h.input_var})
156 |     premise_embedding, premise_annotation = lasagne.layers.get_output(
157 |         [l_sentence_embedding, l_annotations],
158 |         {l_in: l_in_p.input_var, l_mask: l_mask_p.input_var})
159 | 
160 | 
161 |     ########### gated encoder and output MLP ##########
162 |     l_hypo_embed = lasagne.layers.InputLayer(
163 |         shape=(BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN), input_var=hypothesis_embedding)
164 |     l_pre_embed = lasagne.layers.InputLayer(
165 |         shape=(BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN), input_var=premise_embedding)
166 |    
167 |     # output dim: (BATCH_SIZE, 2*LSTM_HIDDEN, N_ROWS)
168 |     l_factors = GatedEncoder3D([l_hypo_embed, l_pre_embed], num_hfactors=2*LSTM_HIDDEN)
169 | 
170 |     # Dropout:
171 |     l_factors_noise = lasagne.layers.DropoutLayer(l_factors, p=GAEREG, rescale=True)
172 | 
173 |     # l_hids = DenseLayer3DWeight()
174 | 
175 |     l_outhid = lasagne.layers.DenseLayer(
176 |         l_factors_noise, num_units=OUT_HIDDEN, nonlinearity=lasagne.nonlinearities.rectify)
177 | 
178 |     # Dropout:
179 |     l_outhid_noise = lasagne.layers.DropoutLayer(l_outhid, p=GAEREG, rescale=True)
180 |     
181 |     l_output = lasagne.layers.DenseLayer(
182 |         l_outhid_noise, num_units=3, nonlinearity=lasagne.nonlinearities.softmax)
183 | 
184 | 
185 |     ########### target, cost, validation, etc. ##########
186 |     target_values = T.ivector('target_output')
187 |     target_values.tag.test_value = numpy.asarray([1,] * 50, dtype='int32')
188 | 
189 |     network_output = lasagne.layers.get_output(l_output)
190 |     network_output_clean = lasagne.layers.get_output(l_output, deterministic=True)
191 | 
192 |     # penalty term and cost
193 |     attention_penalty = T.mean((T.batched_dot(
194 |         hypothesis_annotation,
195 |         # pay attention to this line:
196 |         # T.extra_ops.cpu_contiguous(hypothesis_annotation.dimshuffle(0, 2, 1))
197 |         hypothesis_annotation.dimshuffle(0, 2, 1)
198 |     ) - T.eye(hypothesis_annotation.shape[1]).dimshuffle('x', 0, 1)
199 |     )**2, axis=(0, 1, 2)) + T.mean((T.batched_dot(
200 |         premise_annotation,
201 |         # T.extra_ops.cpu_contiguous(premise_annotation.dimshuffle(0, 2, 1))  # ditto.
202 |         premise_annotation.dimshuffle(0, 2, 1)  # ditto.
203 |     ) - T.eye(premise_annotation.shape[1]).dimshuffle('x', 0, 1))**2, axis=(0, 1, 2))
204 |     
205 |     cost = T.mean(T.nnet.categorical_crossentropy(network_output, target_values) + \
206 |                   ATTENTION_PENALTY * attention_penalty)
207 |     cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean, target_values) + \
208 |                         ATTENTION_PENALTY * attention_penalty)
209 | 
210 |     # Retrieve all parameters from the network
211 |     all_params = lasagne.layers.get_all_params(l_output) + \
212 |                  lasagne.layers.get_all_params(l_sentence_embedding)
213 |     numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]])
214 |     print("Number of params: {}".format(numparams))
215 |     
216 |     # if a saved param file exists, resume from it
217 |     look_for = 'params' + os.sep + 'params_' + filename + '.pkl'
218 |     if os.path.isfile(look_for):
219 |         print("Resuming from file: " + look_for)
220 |         all_param_values = cPickle.load(open(look_for, 'rb'))
221 |         for p, v in zip(all_params, all_param_values):
222 |             p.set_value(v)
223 | 
224 |     # withoutwe_params = all_params + [l_word_embed.W]
225 |     
226 |     # Compute updates for training
227 |     print("Computing updates ...")
228 |     updates = lasagne.updates.adagrad(cost, all_params, LEARNING_RATE)
229 | 
230 |     # Theano functions for training and computing cost
231 |     print("Compiling functions ...")
232 |     network_prediction = T.argmax(network_output, axis=1)
233 |     error_rate = T.mean(T.neq(network_prediction, target_values))
234 |     network_prediction_clean = T.argmax(network_output_clean, axis=1)
235 |     error_rate_clean = T.mean(T.neq(network_prediction_clean, target_values))
236 |     
237 |     train = theano.function(
238 |         [l_in_h.input_var, l_mask_h.input_var,
239 |          l_in_p.input_var, l_mask_p.input_var, target_values],
240 |         [cost, error_rate], updates=updates)
241 |     compute_cost = theano.function(
242 |         [l_in_h.input_var, l_mask_h.input_var,
243 |          l_in_p.input_var, l_mask_p.input_var, target_values],
244 |         [cost_clean, error_rate_clean])
245 | 
246 |     def evaluate(mode):
247 |         if mode == 'dev':
248 |             data = dev_batches
249 |         elif mode == 'test':
250 |             data = test_batches
251 |         
252 |         set_cost = 0.
253 |         set_error_rate = 0.
254 |         for batches_seen, (hypo, hm, premise, pm, truth) in enumerate(data, 1):
255 |             _cost, _error = compute_cost(hypo, hm, premise, pm, truth)
256 |             set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \
257 |                        1.0 / batches_seen * _cost
258 |             set_error_rate = (1.0 - 1.0 / batches_seen) * set_error_rate + \
259 |                              1.0 / batches_seen * _error
260 |         
261 |         return set_cost, set_error_rate
262 |     
263 |     dev_set_cost,  dev_set_error  = evaluate('dev')
264 |     print("BEFORE TRAINING: dev cost %f, error %f" % (dev_set_cost,  dev_set_error))
265 |     print("Training ...")
266 |     try:
267 |         for epoch in range(num_epochs):
268 |             train_set_cost = 0.
269 |             train_set_error = 0.
270 |             start = time.time()
271 |             
272 |             for batches_seen, (hypo, hm, premise, pm, truth) in enumerate(
273 |                     train_batches, 1):
274 |                 _cost, _error = train(hypo, hm, premise, pm, truth)
275 |                 train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \
276 |                                  1.0 / batches_seen * _cost
277 |                 train_set_error = (1.0 - 1.0 / batches_seen) * train_set_error + \
278 |                                        1.0 / batches_seen * _error
279 |                 if batches_seen % 100 == 0:
280 |                     end = time.time()
281 |                     print("Sample %d %.2fs, lr %.4f, train cost %f, error %f"  % (
282 |                         batches_seen * BATCH_SIZE,
283 |                         end - start,
284 |                         LEARNING_RATE,
285 |                         train_set_cost,
286 |                         train_set_error))
287 |                     start = end
288 | 
289 |                 if batches_seen % 2000 == 0:
290 |                     dev_set_cost,  dev_set_error  = evaluate('dev')
291 |                     test_set_cost, test_set_error = evaluate('test')
292 |                     print("***dev  cost %f, error %f" % (dev_set_cost,  dev_set_error))
293 |                     print("***test cost %f, error %f" % (test_set_cost,  test_set_error))
294 | 
295 |             # save parameters
296 |             all_param_values = [p.get_value() for p in all_params]
297 |             with open('params' + os.sep + 'params_' + filename + '.pkl', 'wb') as param_file:
298 |                 cPickle.dump(all_param_values, param_file)
299 | 
300 |             # load params
301 |             # all_param_values = cPickle.load(open('params' + os.sep + 'params_' + filename, 'rb'))
302 |             # for p, v in zip(all_params, all_param_values):
303 |             #     p.set_value(v)
304 | 
305 |             dev_set_cost,  dev_set_error  = evaluate('dev')
306 |             test_set_cost, test_set_error = evaluate('test')
307 | 
308 |             print("epoch %d, cost: train %f dev %f test %f;\n"
309 |                   "         error train %f dev %f test %f" % (
310 |                 epoch,
311 |                 train_set_cost,     dev_set_cost,   test_set_cost,
312 |                 train_set_error,    dev_set_error,  test_set_error))
313 |     except KeyboardInterrupt:
314 |         pdb.set_trace()
315 |         pass
316 | 
317 | if __name__ == '__main__':
318 |     main()
319 | 
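
The loop above tracks `train_set_cost` and `train_set_error` as incremental means over the batches seen so far. As a minimal standalone sketch of that update, assuming plain Python floats instead of Theano outputs (`update_running_mean` and `running_mean` are hypothetical helper names, not part of this repository):

```python
def update_running_mean(mean, value, n):
    """Return the mean of n values, given the mean of the first n - 1
    values and the n-th value (n is 1-based)."""
    return (1.0 - 1.0 / n) * mean + 1.0 / n * value


def running_mean(values):
    """Fold a sequence into its mean one element at a time."""
    mean = 0.0
    # enumerate(..., 1) mirrors the 1-based batches_seen counter in the loop.
    for n, value in enumerate(values, 1):
        mean = update_running_mean(mean, value, n)
    return mean


if __name__ == '__main__':
    batch_costs = [0.9, 0.7, 0.8, 0.4]  # made-up per-batch costs
    print(running_mean(batch_costs))
```

Up to floating-point rounding, the result matches the ordinary arithmetic mean of the inputs, so an epoch-level average can be reported without storing per-batch values.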


--------------------------------------------------------------------------------
/segae_l2_dpout.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: utf-8 -*-
  3 | 
  4 | from __future__ import print_function
  5 | 
  6 | import time
  7 | import os
  8 | import sys
  9 | import numpy
 10 | import cPickle
 11 | import theano
 12 | import theano.tensor as T
 13 | from sklearn.metrics import confusion_matrix
 14 | import lasagne
 15 | from lasagne.layers.recurrent import Gate
 16 | from lasagne import init, nonlinearities
 17 | 
 18 | from util_layers import DenseLayer3DInput, Softmax3D, ApplyAttention, GatedEncoder3D
 19 | from dataset import SNLI
 20 | 
 21 | import pdb
 22 | theano.config.compute_test_value = 'warn'  # 'warn' activates test values; set to 'off' to disable
 23 | 
 24 | 
 25 | LSTMHID = int(sys.argv[1])          # 150 Hidden unit numbers in LSTM
 26 | ATTHID = int(sys.argv[2])           # 350 Hidden unit numbers in attention MLP
 27 | OUTHID = int(sys.argv[3])           # 3000 Hidden unit numbers in output MLP
 28 | NROW = int(sys.argv[4])             # 10 Number of rows in matrix representation
 29 | LR = float(sys.argv[5])             # 0.01
 30 | L2REG = float(sys.argv[6])          # 0.0001 L2 regularization
 31 | DPOUT = float(sys.argv[7])          # 0.3 dropout rate
 32 | ATTPENALTY = float(sys.argv[8])     # 1.
 33 | WEDIM = int(sys.argv[9])            # 300 Dim of word embedding
 34 | BSIZE = int(sys.argv[10])           # 50 Minibatch size
 35 | GCLIP = float(sys.argv[11])         # 0.5 All gradients above this will be clipped
 36 | NEPOCH = int(sys.argv[12])          # 12 Number of epochs to train the net
 37 | STD = float(sys.argv[13])           # 0.1 Standard deviation of weights in initialization
 38 | UPDATEWE = bool(int(sys.argv[14]))  # 0 for False and 1 for True. Update word embedding in training
 39 | filename = os.path.splitext(os.path.basename(__file__))[0] + \
 40 |            '_LSTMHID' + str(LSTMHID) + \
 41 |            '_ATTHID' + str(ATTHID) + \
 42 |            '_OUTHID' + str(OUTHID) + \
 43 |            '_NROWS' + str(NROW) + \
 44 |            '_LR' + str(LR) + \
 45 |            '_L2REG' + str(L2REG) + \
 46 |            '_DPOUT' + str(DPOUT) + \
 47 |            '_ATTPENALTY' + str(ATTPENALTY) + \
 48 |            '_WEDIM' + str(WEDIM) + \
 49 |            '_BSIZE' + str(BSIZE) + \
 50 |            '_GCLIP' + str(GCLIP) + \
 51 |            '_NEPOCH' + str(NEPOCH) + \
 52 |            '_STD' + str(STD) + \
 53 |            '_UPDATEWE' + str(UPDATEWE)
 54 | 
 55 | def main(num_epochs=NEPOCH):
 56 |     print("Loading data ...")
 57 |     snli = SNLI(batch_size=BSIZE)
 58 |     train_batches = list(snli.train_minibatch_generator())
 59 |     dev_batches = list(snli.dev_minibatch_generator())
 60 |     test_batches = list(snli.test_minibatch_generator())
 61 |     W_word_embedding = snli.weight  # W shape: (vocab size, WEDIM)
 62 |     del snli
 63 | 
 64 |     print("Building network ...")
 65 |     ########### sentence embedding encoder ###########
 66 |     # sentence vector, with each number standing for a word number
 67 |     input_var = T.TensorType('int32', [False, False])('sentence_vector')
 68 |     input_var.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 20), 'int32'),
 69 |                                              numpy.zeros((BSIZE, 5)).astype('int32')))
 70 |     input_var.tag.test_value[1, 20:22] = (413, 45)
 71 |     l_in = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var)
 72 |     
 73 |     input_mask = T.TensorType('int32', [False, False])('sentence_mask')
 74 |     input_mask.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 20), dtype='int32'),
 75 |                                              numpy.zeros((BSIZE, 5), dtype='int32')))
 76 |     input_mask.tag.test_value[1, 20:22] = 1
 77 |     l_mask = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask)
 78 | 
 79 |     # output shape (BSIZE, None, WEDIM)
 80 |     l_word_embed = lasagne.layers.EmbeddingLayer(
 81 |         l_in,
 82 |         input_size=W_word_embedding.shape[0],
 83 |         output_size=W_word_embedding.shape[1],
 84 |         W=W_word_embedding)
 85 | 
 86 |     # bidirectional LSTM
 87 |     l_forward = lasagne.layers.LSTMLayer(
 88 |         l_word_embed, mask_input=l_mask, num_units=LSTMHID,
 89 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
 90 |                     W_cell=init.Normal(STD)),
 91 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
 92 |                     W_cell=init.Normal(STD)),
 93 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
 94 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
 95 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
 96 |                     W_cell=init.Normal(STD)),
 97 |         nonlinearity=lasagne.nonlinearities.tanh,
 98 |         peepholes = False,
 99 |         grad_clipping=GCLIP)
100 | 
101 |     l_backward = lasagne.layers.LSTMLayer(
102 |         l_word_embed, mask_input=l_mask, num_units=LSTMHID,
103 |         ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
104 |                     W_cell=init.Normal(STD)),
105 |         forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
106 |                     W_cell=init.Normal(STD)),
107 |         cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD),
108 |                   W_cell=None, nonlinearity=nonlinearities.tanh),
109 |         outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 
110 |                     W_cell=init.Normal(STD)),
111 |         nonlinearity=lasagne.nonlinearities.tanh,
112 |         peepholes = False,
113 |         grad_clipping=GCLIP, backwards=True)
114 |     
115 |     # output dim: (BSIZE, None, 2*LSTMHID)
116 |     l_concat = lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2)
117 |     l_concat_dpout = lasagne.layers.DropoutLayer(l_concat, p=DPOUT, rescale=True)  # might not need this line
118 | 
119 |     # Attention mechanism to get sentence embedding
120 |     # output dim: (BSIZE, None, ATTHID)
121 |     l_ws1 = DenseLayer3DInput(l_concat_dpout, num_units=ATTHID)
122 |     l_ws1_dpout = lasagne.layers.DropoutLayer(l_ws1, p=DPOUT, rescale=True)
123 | 
124 |     # output dim: (BSIZE, None, NROW)
125 |     l_ws2 = DenseLayer3DInput(l_ws1_dpout, num_units=NROW, nonlinearity=None)
126 |     l_annotations = Softmax3D(l_ws2, mask=l_mask)
127 |     # output dim: (BSIZE, NROW, 2*LSTMHID)
128 |     l_sentence_embedding = ApplyAttention([l_annotations, l_concat])
129 | 
130 |     # beam search? Bi lstm in the sentence embedding layer? etc.
131 | 
132 | 
133 |     ########### get embeddings for hypothesis and premise ###########
134 |     # hypothesis
135 |     input_var_h = T.TensorType('int32', [False, False])('hypothesis_vector')
136 |     input_var_h.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 18), 'int32'),
137 |                                                numpy.zeros((BSIZE, 6)).astype('int32')))
138 |     l_in_h = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var_h)
139 |     
140 |     input_mask_h = T.TensorType('int32', [False, False])('hypo_mask')
141 |     input_mask_h.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 18), dtype='int32'),
142 |                                                 numpy.zeros((BSIZE, 6), dtype='int32')))
143 |     input_mask_h.tag.test_value[1, 18:22] = 1
144 |     l_mask_h = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask_h)
145 |     
146 |     # premise
147 |     input_var_p = T.TensorType('int32', [False, False])('premise_vector')
148 |     input_var_p.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 16), 'int32'),
149 |                                                numpy.zeros((BSIZE, 3)).astype('int32')))
150 |     l_in_p = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var_p)
151 |     
152 |     input_mask_p = T.TensorType('int32', [False, False])('premise_mask')
153 |     input_mask_p.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 16), dtype='int32'),
154 |                                                 numpy.zeros((BSIZE, 3), dtype='int32')))
155 |     input_mask_p.tag.test_value[1, 16:18] = 1
156 |     l_mask_p = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask_p)
157 |     
158 |     
159 |     hypothesis_embedding, hypothesis_annotation = lasagne.layers.get_output(
160 |         [l_sentence_embedding, l_annotations],
161 |         {l_in: l_in_h.input_var, l_mask: l_mask_h.input_var})
162 |     premise_embedding, premise_annotation = lasagne.layers.get_output(
163 |         [l_sentence_embedding, l_annotations],
164 |         {l_in: l_in_p.input_var, l_mask: l_mask_p.input_var})
165 | 
166 |     hypothesis_embedding_clean, hypothesis_annotation_clean = lasagne.layers.get_output(
167 |         [l_sentence_embedding, l_annotations],
168 |         {l_in: l_in_h.input_var, l_mask: l_mask_h.input_var}, deterministic=True)
169 |     premise_embedding_clean, premise_annotation_clean = lasagne.layers.get_output(
170 |         [l_sentence_embedding, l_annotations],
171 |         {l_in: l_in_p.input_var, l_mask: l_mask_p.input_var}, deterministic=True)
172 | 
173 |     ########### gated encoder and output MLP ##########
174 |     l_hypo_embed = lasagne.layers.InputLayer(
175 |         shape=(BSIZE, NROW, 2*LSTMHID), input_var=hypothesis_embedding)
176 |     l_hypo_embed_dpout = lasagne.layers.DropoutLayer(l_hypo_embed, p=DPOUT, rescale=True)
177 |     l_pre_embed = lasagne.layers.InputLayer(
178 |         shape=(BSIZE, NROW, 2*LSTMHID), input_var=premise_embedding)
179 |     l_pre_embed_dpout = lasagne.layers.DropoutLayer(l_pre_embed, p=DPOUT, rescale=True)
180 |    
181 |     # output dim: (BSIZE, NROW, 2*LSTMHID)
182 |     l_factors = GatedEncoder3D([l_hypo_embed_dpout, l_pre_embed_dpout], num_hfactors=2*LSTMHID)
183 |     l_factors_dpout = lasagne.layers.DropoutLayer(l_factors, p=DPOUT, rescale=True)
184 |     
185 |     # l_hids = DenseLayer3DWeight()
186 | 
187 |     l_outhid = lasagne.layers.DenseLayer(
188 |         l_factors_dpout, num_units=OUTHID, nonlinearity=lasagne.nonlinearities.rectify)
189 |     l_outhid_dpout = lasagne.layers.DropoutLayer(l_outhid, p=DPOUT, rescale=True)
190 | 
191 |     l_output = lasagne.layers.DenseLayer(
192 |         l_outhid_dpout, num_units=3, nonlinearity=lasagne.nonlinearities.softmax)
193 | 
194 | 
195 |     ########### target, cost, validation, etc. ##########
196 |     target_values = T.ivector('target_output')
197 |     target_values.tag.test_value = numpy.asarray([1,] * BSIZE, dtype='int32')
198 | 
199 |     network_output = lasagne.layers.get_output(l_output)
200 |     network_prediction = T.argmax(network_output, axis=1)
201 |     accuracy = T.mean(T.eq(network_prediction, target_values))
202 | 
203 |     network_output_clean = lasagne.layers.get_output(
204 |         l_output,
205 |         {l_hypo_embed: hypothesis_embedding_clean,
206 |          l_pre_embed: premise_embedding_clean},
207 |         deterministic=True)
208 |     network_prediction_clean = T.argmax(network_output_clean, axis=1)
209 |     accuracy_clean = T.mean(T.eq(network_prediction_clean, target_values))
210 |     
211 |     # penalty term and cost
212 |     attention_penalty = T.mean((T.batched_dot(
213 |         hypothesis_annotation,
214 |         hypothesis_annotation.dimshuffle(0, 2, 1)
215 |     ) - T.eye(hypothesis_annotation.shape[1]).dimshuffle('x', 0, 1)
216 |     )**2, axis=(0, 1, 2)) + T.mean((T.batched_dot(
217 |         premise_annotation,
218 |         premise_annotation.dimshuffle(0, 2, 1)
219 |     ) - T.eye(premise_annotation.shape[1]).dimshuffle('x', 0, 1))**2, axis=(0, 1, 2))
220 |    
221 |     L2_lstm = ((l_forward.W_in_to_ingate ** 2).sum() + \
222 |                (l_forward.W_hid_to_ingate ** 2).sum() + \
223 |                (l_forward.W_in_to_forgetgate ** 2).sum() + \
224 |                (l_forward.W_hid_to_forgetgate ** 2).sum() + \
225 |                (l_forward.W_in_to_cell ** 2).sum() + \
226 |                (l_forward.W_hid_to_cell ** 2).sum() + \
227 |                (l_forward.W_in_to_outgate ** 2).sum() + \
228 |                (l_forward.W_hid_to_outgate ** 2).sum() + \
229 |                (l_backward.W_in_to_ingate ** 2).sum() + \
230 |                (l_backward.W_hid_to_ingate ** 2).sum() + \
231 |                (l_backward.W_in_to_forgetgate ** 2).sum() + \
232 |                (l_backward.W_hid_to_forgetgate ** 2).sum() + \
233 |                (l_backward.W_in_to_cell ** 2).sum() + \
234 |                (l_backward.W_hid_to_cell ** 2).sum() + \
235 |                (l_backward.W_in_to_outgate ** 2).sum() + \
236 |                (l_backward.W_hid_to_outgate ** 2).sum())
237 |     L2_attention = (l_ws1.W ** 2).sum() + (l_ws2.W ** 2).sum()
238 |     L2_gae = (l_factors.Wxf ** 2).sum() + (l_factors.Wyf ** 2).sum()
239 |     L2_outputhid = (l_outhid.W ** 2).sum()
240 |     L2_softmax = (l_output.W ** 2).sum()
241 |     L2 = L2_lstm + L2_attention + L2_gae + L2_outputhid + L2_softmax 
242 | 
243 |     cost = T.mean(T.nnet.categorical_crossentropy(network_output, target_values)) + \
244 |            L2REG * L2
245 |     cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean, target_values)) + \
246 |                  L2REG * L2
247 |     if ATTPENALTY != 0.:
248 |         cost = cost + ATTPENALTY * attention_penalty
249 |         cost_clean = cost_clean + ATTPENALTY * attention_penalty
250 | 
251 |     # Retrieve all parameters from the network
252 |     all_params = lasagne.layers.get_all_params(l_output) + \
253 |                  lasagne.layers.get_all_params(l_sentence_embedding)
254 |     if not UPDATEWE:
255 |         all_params.remove(l_word_embed.W)
256 | 
257 |     numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]])
258 |     print("Number of params: {}\nName\t\t\tShape\t\t\tSize".format(numparams))
259 |     print("-----------------------------------------------------------------")
260 |     for item in all_params:
261 |         print("{0:24}{1:24}{2}".format(str(item), str(item.shape.eval()), numpy.prod(item.shape.eval())))
262 | 
263 |     # if exist param file then load params
264 |     look_for = 'params' + os.sep + 'params_' + filename + '.pkl'
265 |     if os.path.isfile(look_for):
266 |         print("Resuming from file: " + look_for)
267 |         all_param_values = cPickle.load(open(look_for, 'rb'))
268 |         for p, v in zip(all_params, all_param_values):
269 |             p.set_value(v)
270 |    
271 |     # Compute AdaGrad updates for training
272 |     print("Computing updates ...")
273 |     updates = lasagne.updates.adagrad(cost, all_params, LR)
274 | 
275 |     # Theano functions for training and computing cost
276 |     print("Compiling functions ...")
277 |     train = theano.function(
278 |         [l_in_h.input_var, l_mask_h.input_var,
279 |          l_in_p.input_var, l_mask_p.input_var, target_values],
280 |         [cost, accuracy], updates=updates)
281 |     compute_cost = theano.function(
282 |         [l_in_h.input_var, l_mask_h.input_var,
283 |          l_in_p.input_var, l_mask_p.input_var, target_values],
284 |         [cost_clean, accuracy_clean])
285 |     predict = theano.function(
286 |         [l_in_h.input_var, l_mask_h.input_var,
287 |          l_in_p.input_var, l_mask_p.input_var],
288 |         network_prediction_clean)
289 | 
290 |     def evaluate(mode, verbose=False):
291 |         if mode == 'dev':
292 |             data = dev_batches
293 |         if mode == 'test':
294 |             data = test_batches
295 |         
296 |         set_cost = 0.
297 |         set_accuracy = 0.
298 |         for batches_seen, (hypo, hm, premise, pm, truth) in enumerate(data, 1):
299 |             _cost, _accuracy = compute_cost(hypo, hm, premise, pm, truth)
300 |             set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \
301 |                        1.0 / batches_seen * _cost
302 |             set_accuracy = (1.0 - 1.0 / batches_seen) * set_accuracy + \
303 |                              1.0 / batches_seen * _accuracy
304 |         
305 |         if verbose:
306 |             predicted = []
307 |             truth = []
308 |             for batches_seen, (hypo, hm, premise, pm, th) in enumerate(data, 1):
309 |                 predicted.append(predict(hypo, hm, premise, pm))
310 |                 truth.append(th)
311 |             truth = numpy.concatenate(truth)
312 |             predicted = numpy.concatenate(predicted)
313 |             cm = confusion_matrix(truth, predicted)
314 |             pr_a = cm.trace()*1.0 / truth.size
315 |             pr_e = ((cm.sum(axis=0)*1.0/truth.size) * \
316 |                     (cm.sum(axis=1)*1.0/truth.size)).sum()
317 |             k = (pr_a - pr_e) / (1 - pr_e)
318 |             print(mode + " set statistics:")
319 |             print("kappa index of agreement: %f" % k)
320 |             print("confusion matrix:")
321 |             print(cm)
322 | 
323 |         return set_cost, set_accuracy
324 |     
325 |     print("Done. Evaluating scratch model ...")
326 |     test_set_cost,  test_set_accuracy  = evaluate('test', verbose=True)
327 |     print("BEFORE TRAINING: test cost %f, accuracy %f" % (test_set_cost,  test_set_accuracy))
328 |     print("Training ...")
329 |     try:
330 |         for epoch in range(num_epochs):
331 |             train_set_cost = 0.
332 |             train_set_accuracy = 0.
333 |             start = time.time()
334 |             
335 |             for batches_seen, (hypo, hm, premise, pm, truth) in enumerate(
336 |                     train_batches, 1):
337 |                 _cost, _accuracy = train(hypo, hm, premise, pm, truth)
338 |                 train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \
339 |                                  1.0 / batches_seen * _cost
340 |                 train_set_accuracy = (1.0 - 1.0 / batches_seen) * train_set_accuracy + \
341 |                                   1.0 / batches_seen * _accuracy
342 |                 if batches_seen % 100 == 0:
343 |                     end = time.time()
344 |                     print("Sample %d %.2fs, lr %.4f, train cost %f, accuracy %f"  % (
345 |                         batches_seen * BSIZE,
346 |                         end - start,
347 |                         LR,
348 |                         train_set_cost,
349 |                         train_set_accuracy))
350 |                     start = end
351 | 
352 |                 if batches_seen % 2000 == 0:
353 |                     dev_set_cost,  dev_set_accuracy = evaluate('dev')
354 |                     print("***dev cost %f, accuracy %f" % (dev_set_cost,  dev_set_accuracy))
355 | 
356 |             # save parameters
357 |             all_param_values = [p.get_value() for p in all_params]
358 |             with open('params' + os.sep + 'params_' + filename + '.pkl', 'wb') as param_file:
359 |                 cPickle.dump(all_param_values, param_file)
360 | 
361 |             dev_set_cost,  dev_set_accuracy  = evaluate('dev')
362 |             test_set_cost, test_set_accuracy = evaluate('test', verbose=True)
363 | 
364 |             print("epoch %d, cost: train %f dev %f test %f;\n"
365 |                   "         accu: train %f dev %f test %f" % (
366 |                 epoch,
367 |                 train_set_cost,     dev_set_cost,     test_set_cost,
368 |                 train_set_accuracy, dev_set_accuracy, test_set_accuracy))
369 |     except KeyboardInterrupt:
370 |         pdb.set_trace()
371 |         pass
372 | 
373 | if __name__ == '__main__':
374 |     main()
375 | 
376 | 
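
The verbose branch of `evaluate` in `segae_l2_dpout.py` derives a kappa index of agreement from the confusion matrix: `pr_a` is the observed agreement (trace over total) and `pr_e` is the agreement expected from the row and column marginals. A dependency-free sketch of the same arithmetic, assuming the matrix is given as a list of rows (`kappa_from_confusion` is a hypothetical name; the script itself operates on the NumPy array returned by `confusion_matrix`):

```python
def kappa_from_confusion(cm):
    """Cohen's kappa for a square confusion matrix given as a list of rows."""
    size = len(cm)
    total = float(sum(sum(row) for row in cm))
    # Observed agreement: fraction of samples on the diagonal (pr_a).
    observed = sum(cm[i][i] for i in range(size)) / total
    # Marginal distributions of true labels (rows) and predictions (columns).
    row_marginals = [sum(row) / total for row in cm]
    col_marginals = [sum(row[j] for row in cm) / total for j in range(size)]
    # Agreement expected by chance if both marginals were independent (pr_e).
    expected = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (observed - expected) / (1.0 - expected)


if __name__ == '__main__':
    toy_cm = [[20, 5],
              [10, 15]]  # made-up counts, rows = truth, columns = prediction
    print(kappa_from_confusion(toy_cm))
```

A value near 1 indicates agreement well beyond chance, while a value near 0 means the predictions agree with the labels no more often than the marginals alone would suggest. The expression is undefined when `expected` equals 1, a case this sketch does not guard against.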


--------------------------------------------------------------------------------
/util_layers.py:
--------------------------------------------------------------------------------
  1 | import numpy
  2 | import theano
  3 | import theano.tensor as T
  4 | 
  5 | from lasagne import nonlinearities, init
  6 | from lasagne.layers.base import Layer, MergeLayer
  7 | 
  8 | import pdb
  9 | 
 10 | 
 11 | class FlatConcat(MergeLayer):
 12 |     """
 13 |     ConcatLayer, but flattened to 2 dims before concatenation.
 14 |     Accepts more than 2 inputs, but all inputs must have the same size in
 15 |     the first dimension. This layer flattens every input to a 2-D matrix and
 16 |     concatenates them along the second dimension.
 17 | 
 18 |     """
 19 |     def get_output_shape_for(self, input_shapes):
 20 |         output_shapes = []
 21 |         for shape in input_shapes:
 22 |             output_shapes.append((shape[0], numpy.prod(shape[1:])))
 23 |         return (output_shapes[0][0], sum([i[-1] for i in output_shapes]))
 24 | 
 25 |     def get_output_for(self, inputs, **kwargs):
 26 |         inputs = [i.flatten(2) for i in inputs]
 27 |         return T.concatenate(inputs, axis=1)
 28 | 
 29 | 
 30 | class DenseLayerTensorDot(Layer):
 31 |     """
 32 |     Multiply N 3D weight tensors with a 3D input along two of its dimensions,
 33 |     and produce a 3D output. In the batch training case, this setting corresponds to:
 34 |     
 35 |     Input shape:    (dim1, dim2, dim3, dim4)  # (BATCH_SIZE, num_inputslices, N_ROWS, num_inputfeatures)
 36 |     weight shape:   There are two types of weight dims:
 37 |                     'col': (num_slices, num_features, dim2, dim4)
 38 |                     'row': (num_slices, num_features, dim2, dim3)
 39 |     Output shape:   There are two types of output shapes:
 40 |                     'col': (dim1, num_slices, dim3, num_features)
 41 |                          # (BSIZE, num_slices, N_ROWS, num_features)
 42 |                     'row': (dim1, num_slices, num_features, num_inputfeatures)
 43 |                          # (BSIZE, num_slices, num_features, num_inputfeatures)
 44 | 
 45 |     direction: 'row': you are modifying along the row direction, so num_inputfeatures stays intact.
 46 |             or 'col': you are modifying along the col direction (the number of features),
 47 |                       so N_ROWS stays constant.
 48 |     """
 49 |     def __init__(self, incoming, num_slices, num_features, direction='col',
 50 |                  W=init.GlorotUniform(gain='relu'), nonlinearity=nonlinearities.rectify,
 51 |                  **kwargs):
 52 |         super(DenseLayerTensorDot, self).__init__(incoming, **kwargs)
 53 |         self.nonlinearity = (nonlinearities.identity if nonlinearity is None
 54 |                              else nonlinearity)
 55 |         self.num_inputslices = self.input_shape[1]
 56 |         self.num_slices = num_slices
 57 |         self.num_inputfeatures = self.input_shape[3]
 58 |         self.num_features = num_features
 59 |         self.batch_size = self.input_shape[0]
 60 |         self.num_rows = self.input_shape[2]
 61 | 
 62 |         self.direction = direction
 63 |         if direction == 'col':
 64 |             self.W = self.add_param(
 65 |                 W,
 66 |                 (num_slices, num_features, self.num_inputslices, self.num_inputfeatures),
 67 |                 name="W4D_TensorDot_col")
 68 |             self.axes = [[1, 3], [2, 3]]
 69 |         elif direction == 'row':
 70 |             self.W = self.add_param(
 71 |                 W,
 72 |                 (num_slices, num_features, self.num_inputslices, self.num_rows),
 73 |                 name="W4D_TensorDot_row")
 74 |             self.axes = [[1, 2], [2, 3]]
 75 |         else:
 76 |             raise ValueError("`direction` has to be either `row` or `col`.")
 77 | 
 78 |     def get_output_shape_for(self, input_shape):
 79 |         num_inputfeatures = input_shape[3]
 80 |         batch_size = input_shape[0]
 81 |         num_rows = input_shape[2]
 82 | 
 83 |         # this may change according to the dims you choose to multiply
 84 |         if self.direction == 'col':
 85 |             return (batch_size, self.num_slices, num_rows, self.num_features)
 86 |         elif self.direction == 'row':
 87 |             return (batch_size, self.num_slices, self.num_features, num_inputfeatures)
 88 |         
 89 |     def get_output_for(self, input, **kwargs):
 90 |         x = input
 91 |         if self.direction == 'col':
 92 |             preactivation = T.tensordot(x, self.W, axes=self.axes).dimshuffle(0, 2, 1, 3)
 93 |         elif self.direction == 'row':
 94 |             preactivation = T.tensordot(x, self.W, axes=self.axes).dimshuffle(0, 2, 3, 1)
 95 |         return self.nonlinearity(preactivation)
 96 | 
 97 | 
 98 | class DenseLayerTensorBatcheddot(Layer):
 99 |     """
100 |     """
101 |     def __init__(self):
102 |         pass
103 |     def get_output_shape_for(self):
104 |         pass
105 |     def get_output_for(self):
106 |         pass
107 | 
108 | 
109 | class DenseLayer3DWeight(Layer):
110 |     """
111 |     Apply a 3D matrix to a 3D input, basically it is just batched dot.
112 | 
113 |     Input: (BATCH_SIZE, inputs_per_row, N_ROWS)
114 | 
115 |     Weight: 
116 |     Depending on whether the weight is multiplied from left side of input,
117 |     there are two shapes:
118 |         right multiply case: (N_ROWS, inputs_per_row, units_per_row)
119 |         left multiply case:  (inputs_per_row, N_ROWS, units_per_row)
120 | 
121 |     Output:
122 |         right multiply case: (BATCH_SIZE, units_per_row, N_ROWS)
123 |         left multiply case:  (BATCH_SIZE, inputs_per_row, units_per_row)
124 |     
125 |     Params:
126 |         incoming,
127 |         units_per_row,
128 |         W
129 |         b
130 |         leftmul : True if the weight is left multiplied to the input.
131 |         nonlinearity
132 |         **kwargs
133 |     """
134 |     def __init__(self, incoming, units_per_row, W=init.GlorotUniform(),
135 |                  b=init.Constant(0.), leftmul=False, nonlinearity=nonlinearities.tanh,
136 |                  **kwargs):
137 |         super(DenseLayer3DWeight, self).__init__(incoming, **kwargs)
138 |         self.nonlinearity = (nonlinearities.identity if nonlinearity is None
139 |                              else nonlinearity)
140 | 
141 |         self.units_per_row = units_per_row
142 |         self.inputs_per_row = self.input_shape[1]
143 |         self.num_rows = self.input_shape[2]
144 |         self.leftmul = leftmul
145 |         
146 |         if leftmul:
147 |             self.W = self.add_param(
148 |                 W, (self.inputs_per_row, self.num_rows, self.units_per_row), name='W3D')
149 |         else:
150 |             self.W = self.add_param(
151 |                 W, (self.num_rows, self.inputs_per_row, self.units_per_row), name='W3D')
152 |         
153 |         if b is None:
154 |             self.b = None
155 |         else:
156 |             if self.leftmul:
157 |                 b = theano.shared(
158 |                     numpy.zeros((1, self.inputs_per_row, self.units_per_row),
159 |                                 dtype=theano.config.floatX),
160 |                     broadcastable=(True, False, False), 
161 |                     name="b3D")
162 |                 self.b = self.add_param(spec=b,
163 |                                         shape=(1, self.inputs_per_row, self.units_per_row),
164 |                                         regularizable=False)
165 |             else:
166 |                 b = theano.shared(
167 |                     numpy.zeros((1, self.units_per_row, self.num_rows),
168 |                                 dtype=theano.config.floatX),
169 |                     broadcastable=(True, False, False), 
170 |                     name="b3D")
171 |                 self.b = self.add_param(spec=b,
172 |                                         shape=(1, self.units_per_row, self.num_rows),
173 |                                         regularizable=False)
174 | 
175 |     def get_output_shape_for(self, input_shape):
176 |         if self.leftmul:
177 |             return (input_shape[0], input_shape[1], self.units_per_row)
178 |         else:
179 |             return (input_shape[0], self.units_per_row, input_shape[2])
180 | 
181 |     def get_output_for(self, input, **kwargs):
182 |         if self.leftmul:
183 |             preact = T.batched_dot(T.extra_ops.cpu_contiguous(input.dimshuffle(1, 0, 2)),
184 |                                    self.W).dimshuffle(1, 0, 2)
185 |         else:
186 |             preact = T.batched_dot(T.extra_ops.cpu_contiguous(input.dimshuffle(2, 0, 1)),
187 |                                    self.W).dimshuffle(1, 2, 0)
188 |         if self.b is not None:
189 |             preact = preact + self.b
190 |         return self.nonlinearity(preact)
191 | 
192 | 
193 | class DenseLayer3DInput(Layer):
194 |     """
195 |     Apply a 2D matrix to a 3D input, so it's a batched dot with shared slices.
196 |     
197 |     Input: (BATCH_SIZE, inputdim1, inputdim2)
198 | 
199 |     Weight: 
200 |     The weight is always multiplied from the right side of the input and is
201 |     shared across the batch and the second dimension:
202 |         (inputdim2, num_units)
203 | 
204 |     Output: (BATCH_SIZE, inputdim1, num_units)
205 |     
206 |     Params:
207 |         incoming,
208 |         num_units : size of the last output dimension.
209 |         W
210 |         b : bias of shape (num_units,); pass None to disable it.
211 |         nonlinearity : applied elementwise; None is replaced by the
212 |                        identity.
213 |         **kwargs
214 |     """
215 |     def __init__(self, incoming, num_units, W=init.GlorotUniform(),
216 |                  b=init.Constant(0.), nonlinearity=nonlinearities.tanh,
217 |                  **kwargs):
218 |         super(DenseLayer3DInput, self).__init__(incoming, **kwargs)
219 |         self.nonlinearity = (nonlinearities.identity if nonlinearity is None
220 |                              else nonlinearity)
221 | 
222 |         self.num_units = num_units
223 | 
224 |         num_inputs = self.input_shape[2]
225 | 
226 |         self.W = self.add_param(W, (num_inputs, num_units), name="W2D")
227 |         if b is None:
228 |             self.b = None
229 |         else:
230 |             self.b = self.add_param(b, (num_units,), name="b2D",
231 |                                     regularizable=False)
232 | 
233 |     def get_output_shape_for(self, input_shape):
234 |         return (input_shape[0], input_shape[1], self.num_units)
235 | 
236 |     def get_output_for(self, input, **kwargs):
237 |         
238 |         # pdb.set_trace()
239 | 
240 |         activation = T.dot(input, self.W)
241 |         if self.b is not None:
242 |             activation = activation + self.b.dimshuffle('x', 'x', 0)
243 |         return self.nonlinearity(activation)
244 | 
245 | 
246 | class Softmax3D(MergeLayer):
247 |     """Softmax is conducted on the middle dimension of a 3D tensor."""
248 |     def __init__(self, incoming, mask=None, **kwargs):
249 |         """
250 |         mask: a lasagne layer.
251 |         """
252 |         incomings = [incoming]
253 |         self.have_mask = False
254 |         if mask is not None:
255 |             incomings.append(mask)
256 |             self.have_mask = True
257 |         super(Softmax3D, self).__init__(incomings, **kwargs)
258 | 
259 |     def get_output_shape_for(self, input_shapes):
260 |         return input_shapes[0]
261 | 
262 |     def get_output_for(self, inputs, **kwargs):
263 |         preactivations = inputs[0]
264 |         if self.have_mask:
265 |             mask = inputs[1]
266 |             preactivations = \
267 |                 preactivations * mask.dimshuffle(0, 1, 'x').astype(theano.config.floatX) - \
268 |                 numpy.asarray(1e36).astype(theano.config.floatX) * \
269 |                 (1 - mask).dimshuffle(0, 1, 'x').astype(theano.config.floatX)
270 |             
271 |         annotation = T.nnet.softmax(
272 |             preactivations.dimshuffle(0, 2, 1).reshape((
273 |                 preactivations.shape[0] * preactivations.shape[2],
274 |                 preactivations.shape[1]))
275 |         ).reshape((
276 |             preactivations.shape[0],
277 |             preactivations.shape[2],
278 |             preactivations.shape[1]
279 |         )).dimshuffle(0, 2, 1)
280 |         return annotation
281 | 
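The masking trick in `Softmax3D` can be tried outside Theano. Below is a minimal NumPy sketch, assuming a `(batch, time, rows)` score tensor and a `(batch, time)` 0/1 mask; the function name `masked_softmax_middle` and the toy data are illustrative and not part of this repository.

```python
import numpy as np

def masked_softmax_middle(preact, mask=None):
    """Softmax over axis 1 of a (batch, time, rows) array.

    mask, if given, has shape (batch, time): 1 for real tokens, 0 for padding.
    """
    preact = np.asarray(preact, dtype=np.float64)
    if mask is not None:
        m = np.asarray(mask, dtype=np.float64)[:, :, None]
        # Same idea as the layer: keep real positions, push padding to -1e36.
        preact = preact * m - 1e36 * (1.0 - m)
    # Subtract the max along the softmax axis for numerical stability.
    shifted = preact - preact.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 4, 3))             # (batch, time, rows)
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])   # (batch, time)
attn = masked_softmax_middle(scores, mask)
```

Each attention row sums to one over the time axis, and padded time steps receive zero weight.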
282 | 
283 | class ApplyAttention(MergeLayer):
284 |     def get_output_shape_for(self, input_shapes):
285 |         return (input_shapes[0][0], input_shapes[0][2], input_shapes[1][2])
286 | 
287 |     def get_output_for(self, inputs, **kwargs):
288 |         annotation, sentence = inputs[0], inputs[1]  # (B, T, R), (B, T, D) -> output (B, R, D)
289 |         return T.batched_dot(sentence.dimshuffle(0, 2, 1), annotation).dimshuffle(0, 2, 1)
290 | 
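`ApplyAttention` multiplies the transposed annotation matrix with the sentence states for every example in the batch. A small NumPy sketch of the same contraction, with the hypothetical helper name `apply_attention` and hand-made inputs:

```python
import numpy as np

def apply_attention(annotation, sentence):
    # annotation: (batch, time, rows) attention weights
    # sentence:   (batch, time, dim)  hidden states
    # result:     (batch, rows, dim), i.e. A^T H for each example
    return np.einsum('btr,btd->brd', annotation, sentence)

annotation = np.zeros((1, 3, 2))
annotation[0, 0, 0] = 1.0    # row 0 attends only to time step 0
annotation[0, 1:, 1] = 0.5   # row 1 averages time steps 1 and 2
sentence = np.arange(12, dtype=float).reshape(1, 3, 4)
emb = apply_attention(annotation, sentence)
```

Here `emb[0, 0]` is the first hidden state and `emb[0, 1]` is the mean of the last two.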
291 | 
292 | class AugmentFeature(MergeLayer):
293 |     """
294 |     Input:
295 |     x: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
296 |     y: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
297 | 
298 |     Output: (BATCH_SIZE, N_ROWS, 8*LSTM_HIDDEN)
299 |     """
300 |     def get_output_shape_for(self, input_shapes):
301 |         assert input_shapes[0] == input_shapes[1], (
302 |             "The two inputs to the AugmentFeature layer must have the same shape.")
303 |         batch_size = input_shapes[0][0]
304 |         num_rows = input_shapes[0][1]
305 |         num_dim = input_shapes[0][2]
306 |         return (batch_size, num_rows, 4 * num_dim)
307 |     
308 |     def get_output_for(self, inputs, **kwargs):
309 |         x, y = inputs[0], inputs[1]
310 |         return T.concatenate([x, y, x - y, x * y], axis=2)
311 | 
312 | 
313 | class GatedEncoder3D(MergeLayer):
314 |     """
315 |     An implementation of the encoder part of a 3D Gated Autoencoder. It has
316 |     the encoder only. 
317 |     
318 |     It just returns the factor of H, not H. To get the real H, add
319 |     another dense layer on top of the output.
320 | 
321 |     See __paper__ for more info.
322 | 
323 |     Input:
324 |     x: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
325 |     y: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
326 | 
327 |     Output:
328 |     hfactors = (BATCH_SIZE, N_ROWS, num_hfactors)
329 |     
330 |     """
331 |     def __init__(self, incomings, num_hfactors,
332 |                  Wxf=init.GlorotUniform(),
333 |                  Wyf=init.GlorotUniform(),
334 |                  **kwargs):
335 |         super(GatedEncoder3D, self).__init__(incomings, **kwargs)
336 |         self.num_xfactors = self.input_shapes[0][2]
337 |         self.num_yfactors = self.input_shapes[1][2]
338 |         self.num_rows = self.input_shapes[0][1]
339 |         self.num_hfactors = num_hfactors
340 |         self.Wxf = self.add_param(
341 |             Wxf, (self.num_rows, self.num_xfactors, self.num_hfactors), name='Wxf')
342 |         self.Wyf = self.add_param(
343 |             Wyf, (self.num_rows, self.num_yfactors, self.num_hfactors), name='Wyf')
344 | 
345 |     def get_output_shape_for(self, input_shapes):
346 |         batch_size = input_shapes[0][0]
347 |         return (batch_size, self.num_rows, self.num_hfactors)
348 | 
349 |     def get_output_for(self, inputs, **kwargs):
350 |         x, y = inputs[0], inputs[1]
351 |         # Put rows on the leading axis so that batched_dot applies a separate weight
352 |         # matrix per row: (N_ROWS, BATCH, DIM) x (N_ROWS, DIM, H) -> (N_ROWS, BATCH, H).
353 |         xfactor = T.batched_dot(
354 |             T.extra_ops.cpu_contiguous(x.dimshuffle(1, 0, 2)), self.Wxf).dimshuffle(1, 0, 2)
355 |         yfactor = T.batched_dot(
356 |             T.extra_ops.cpu_contiguous(y.dimshuffle(1, 0, 2)), self.Wyf).dimshuffle(1, 0, 2)
357 |         return xfactor * yfactor
358 | 
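The per-row projection in `GatedEncoder3D` is easier to read as an einsum. The following NumPy sketch assumes the shapes from the docstring; `gated_factors` and the random inputs are illustrative names, not code from this file.

```python
import numpy as np

def gated_factors(x, y, Wxf, Wyf):
    # x, y:     (batch, rows, dim)
    # Wxf, Wyf: (rows, dim, hfactors), one weight matrix per row
    xfactor = np.einsum('brd,rdh->brh', x, Wxf)
    yfactor = np.einsum('brd,rdh->brh', y, Wyf)
    # Multiplicative interaction between the two factor streams.
    return xfactor * yfactor

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))
y = rng.normal(size=(2, 3, 4))
Wxf = rng.normal(size=(3, 4, 5))
Wyf = rng.normal(size=(3, 4, 5))
h = gated_factors(x, y, Wxf, Wyf)

# Cross-check row 1 against plain matrix products.
expected_row1 = (x[:, 1] @ Wxf[1]) * (y[:, 1] @ Wyf[1])
```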
359 | 
360 | class StackedGatedEncoder3D(MergeLayer):
361 |     """
362 |     A stacked (two-layer) variant of the encoder part of a 3D Gated
363 |     Autoencoder. Each input goes through a tanh layer and then a linear layer.
364 |     
365 |     It just returns the factor of H, not H. To get the real H, add
366 |     another dense layer on top of the output.
367 | 
368 |     See __paper__ for more info.
369 | 
370 |     Input:
371 |     x: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
372 |     y: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
373 | 
374 |     Output:
375 |     hfactors = (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
376 |     
377 |     """
378 |     def __init__(self, incomings,
379 |                  Wxf1=init.GlorotUniform(),
380 |                  Wyf1=init.GlorotUniform(),
381 |                  Wxf2=init.GlorotUniform(),
382 |                  Wyf2=init.GlorotUniform(),
383 |                  **kwargs):
384 |         super(StackedGatedEncoder3D, self).__init__(incomings, **kwargs)
385 |         self.num_xfactors = self.input_shapes[0][2]
386 |         self.num_yfactors = self.input_shapes[1][2]
387 |         assert self.num_xfactors == self.num_yfactors
388 |         self.num_rows = self.input_shapes[0][1]
389 |         self.Wxf1 = self.add_param(
390 |             Wxf1, (self.num_rows, self.num_xfactors, self.num_xfactors), name='Wxf1')
391 |         self.Wyf1 = self.add_param(
392 |             Wyf1, (self.num_rows, self.num_yfactors, self.num_yfactors), name='Wyf1')
393 |         self.Wxf2 = self.add_param(
394 |             Wxf2, (self.num_rows, self.num_xfactors, self.num_xfactors), name='Wxf2')
395 |         self.Wyf2 = self.add_param(
396 |             Wyf2, (self.num_rows, self.num_yfactors, self.num_yfactors), name='Wyf2')
397 | 
398 |     def get_output_shape_for(self, input_shapes):
399 |         batch_size = input_shapes[0][0]
400 |         return (batch_size, self.num_rows, self.num_xfactors)
401 | 
402 |     def get_output_for(self, inputs, **kwargs):
403 |         x, y = inputs[0], inputs[1]
404 |         # First layer: per-row projection followed by tanh.
405 |         # Second layer: per-row linear projection; the two streams are then multiplied.
406 |         xfactor1 = T.tanh(T.batched_dot(
407 |             T.extra_ops.cpu_contiguous(x.dimshuffle(1, 0, 2)), self.Wxf1).dimshuffle(1, 0, 2))
408 |         yfactor1 = T.tanh(T.batched_dot(
409 |             T.extra_ops.cpu_contiguous(y.dimshuffle(1, 0, 2)), self.Wyf1).dimshuffle(1, 0, 2))
410 |         xfactor2 = T.batched_dot(
411 |             T.extra_ops.cpu_contiguous(xfactor1.dimshuffle(1, 0, 2)), self.Wxf2).dimshuffle(1, 0, 2)
412 |         yfactor2 = T.batched_dot(
413 |             T.extra_ops.cpu_contiguous(yfactor1.dimshuffle(1, 0, 2)), self.Wyf2).dimshuffle(1, 0, 2)
414 |         return xfactor2 * yfactor2
415 | 
416 | 
417 | class GatedEncoder3DSharedW(MergeLayer):
418 |     """
419 |     An implementation of the encoder part of a 3D Gated Autoencoder.
420 | 
421 |     It has the encoder only. 
422 |     
423 |     It just returns the factor of H, not H. To get the real H, add
424 |     another dense layer on top of the output.
425 | 
426 |     See __paper__ for more info.
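427 |     The two inputs, x and y, must have the same shape and share the weight Wf.
428 |     Input:  (BATCH_SIZE, num_factors, N_ROWS)
429 |     Output: (BATCH_SIZE, num_hfactors, N_ROWS)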
430 |     """
431 |     def __init__(self, incomings, num_hfactors,
432 |                  Wf=init.GlorotUniform(),
433 |                  **kwargs):
434 |         super(GatedEncoder3DSharedW, self).__init__(incomings, **kwargs)
435 |         self.num_factors = self.input_shapes[0][1]
436 |         self.num_rows = self.input_shapes[0][2]
437 |         self.num_hfactors = num_hfactors
438 |         self.Wf = self.add_param(
439 |             Wf, (self.num_rows, self.num_factors, self.num_hfactors), name='Wf')
440 | 
441 |     def get_output_shape_for(self, input_shapes):
442 |         batch_size = input_shapes[0][0]
443 |         return (batch_size, self.num_hfactors, self.num_rows)
444 | 
445 |     def get_output_for(self, inputs, **kwargs):
446 |         x, y = inputs[0], inputs[1]
447 |         # Rows sit on the last axis here, so they are moved to the front for batched_dot;
448 |         # the same weight Wf is applied to both x and y.
449 |         xfactor = T.batched_dot(T.extra_ops.cpu_contiguous(x.dimshuffle(2, 0, 1)), self.Wf).dimshuffle(1, 2, 0)
450 |         yfactor = T.batched_dot(T.extra_ops.cpu_contiguous(y.dimshuffle(2, 0, 1)), self.Wf).dimshuffle(1, 2, 0)
451 |         return xfactor * yfactor
452 | 
453 | 
454 | class GatedEncoder4D(MergeLayer):
455 |     """
456 |     An implementation of the encoder part of a 4D Gated Autoencoder.
457 | 
458 |     It has the encoder only. 
459 |     
460 |     It just returns the factor of H, not H. To get the real H, add
461 |     another dense layer on top of the output.
462 | 
463 |     The two inputs, x and y, must have the same shape.
464 |     
465 |     Input shape:    (dim1, dim2, num_factors)               # (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN)
466 |     weight shape:   (num_slices, num_factors, num_hfactors) # (N_SLICES, 2*LSTM_HIDDEN, num_hfactors)
467 |     Output shape:   (dim1, num_slices, dim2, num_hfactors)  # (BATCH_SIZE, N_SLICES, N_ROWS, num_hfactors)
468 | 
469 |     """
470 |     def __init__(self, incomings, num_slices, num_hfactors,
471 |                  Wf=init.GlorotUniform(),
472 |                  **kwargs):
473 |         super(GatedEncoder4D, self).__init__(incomings, **kwargs)
474 |         self.num_slices = num_slices
475 |         self.num_factors = self.input_shapes[0][2]
476 |         self.num_rows = self.input_shapes[0][1]
477 |         self.num_hfactors = num_hfactors
478 |         self.Wf = self.add_param(
479 |             Wf, (self.num_slices, self.num_factors, self.num_hfactors), name='Wf')
480 | 
481 |     def get_output_shape_for(self, input_shapes):
482 |         batch_size = input_shapes[0][0]
483 |         return (batch_size, self.num_slices, self.num_rows, self.num_hfactors)
484 | 
485 |     def get_output_for(self, inputs, **kwargs):
486 |         x, y = inputs[0], inputs[1]
487 |         xfactor = T.tensordot(x, self.Wf, axes=(2, 1)).dimshuffle(0, 2, 1, 3)
488 |         yfactor = T.tensordot(y, self.Wf, axes=(2, 1)).dimshuffle(0, 2, 1, 3)
489 |         return xfactor * yfactor
490 | 
491 | 
492 | class APAttentionBatch(MergeLayer):
493 |     """
494 |     Attention Pooling mechanism. Computes the soft alignment G = tanh(Q U A^T) between input sentences Q and A, with one U per row.
495 | 
496 |     input: Q & A:     (BSIZE, dim1(dim2), DIM)
497 |            Q & A mask (BSIZE, dim1(dim2))
498 |     U:                (NROW, DIM, DIM)
499 |     output: G:        (BSIZE, NROW, dim1, dim2)
500 |     """
501 |     def __init__(self, incomings, masks=None, num_row=10, init_noise=0.001, **kwargs):
502 |         self.have_mask = False
503 |         if masks:
504 |             incomings = incomings + masks
505 |             self.have_mask = True
506 |         super(APAttentionBatch, self).__init__(incomings, **kwargs)
507 |         self.num_row = num_row
508 |         self.init_noise = init_noise
509 |         self.num_dim = self.input_shapes[0][2]
510 |         U = (numpy.identity(self.num_dim) + init.Normal(std=self.init_noise).sample(
511 |                  shape=(self.num_row, self.num_dim, self.num_dim))
512 |             ).astype(theano.config.floatX)
513 |         self.U = self.add_param(U, U.shape, name='U')
514 | 
515 |     def get_output_shape_for(self, input_shapes):
516 |         batch_size = input_shapes[0][0]
517 |         num_wordQ = input_shapes[0][1]
518 |         num_wordA = input_shapes[1][1]
519 |         return (batch_size, self.num_row, num_wordQ, num_wordA)
520 | 
521 |     def get_output_for(self, inputs, **kwargs):
522 |         Q = inputs[0]
523 |         A = inputs[1]
524 |         QU = T.tensordot(Q, self.U, axes=[2, 1])  # (BSIZE, dim1, NROW, DIM)
525 |         QUA = T.batched_tensordot(QU, A, axes=[3, 2]).dimshuffle(0, 2, 1, 3)
526 |         G = T.tanh(QUA)  # (BSIZE, NROW, dim1, dim2)
527 | 
528 |         if self.have_mask:
529 |             Qmask = inputs[2]
530 |             Amask = inputs[3]
531 |             Gmask = T.batched_dot(Qmask.dimshuffle(0, 1, 'x'),
532 |                                   Amask.dimshuffle(0, 'x', 1)).dimshuffle(0, 'x', 1, 2)
533 |             G = G * Gmask - (1 - Gmask)  # pad -1 to trailing spaces.
534 |         
535 |         return G
536 | 
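The two tensor contractions in `APAttentionBatch` amount to `G = tanh(Q U A^T)` computed once per row of `U`. A NumPy sketch under that reading, assuming float 0/1 masks; `ap_attention` and the toy shapes are illustrative only.

```python
import numpy as np

def ap_attention(Q, A, U, q_mask=None, a_mask=None):
    # Q: (batch, len_q, dim), A: (batch, len_a, dim), U: (rows, dim, dim)
    # G: (batch, rows, len_q, len_a)
    G = np.tanh(np.einsum('bqd,rde,bae->brqa', Q, U, A))
    if q_mask is not None and a_mask is not None:
        # Outer product of the two masks, broadcast over the rows axis.
        g_mask = (q_mask[:, :, None] * a_mask[:, None, :])[:, None, :, :]
        # Padded cells are set to -1, the lower bound of tanh.
        G = G * g_mask - (1.0 - g_mask)
    return G

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3, 4))
A = rng.normal(size=(2, 5, 4))
U = np.tile(np.eye(4), (2, 1, 1))            # two identity "rows"
q_mask = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
a_mask = np.array([[1.0, 1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0]])
G = ap_attention(Q, A, U, q_mask, a_mask)
```

With identity matrices for `U`, every unmasked entry reduces to `tanh` of a plain dot product between a word of Q and a word of A.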
537 | 
538 | class ComputeEmbeddingPool(MergeLayer):
539 |     """
540 |     Input :
541 |         x: (BSIZE, NROW, DIM)
542 |         y: (BSIZE, NROW, DIM)
543 |     Output :
544 |         (BSIZE, NROW, NROW)
545 |     """
546 |     def __init__(self, incomings, **kwargs):
547 |         super(ComputeEmbeddingPool, self).__init__(incomings, **kwargs)
548 | 
549 |     def get_output_shape_for(self, input_shapes):
550 |         xshape = input_shapes[0]
551 |         yshape = input_shapes[1]
552 |         return (xshape[0], xshape[1], yshape[1])
553 | 
554 |     def get_output_for(self, inputs, **kwargs):
555 |         x = inputs[0]
556 |         y = inputs[1]
557 |         return T.batched_dot(x, y.dimshuffle(0, 2, 1))
558 | 
559 | 
560 | class AttendOnEmbedding(MergeLayer):
561 |     """
562 |     incomings=[x, embeddingpool], masks=[xmask, ymask], direction='col'
563 |     or
564 |               [y, embeddingpool], masks=[xmask, ymask], direction='row'
565 |     
566 |     Output :
567 |               alpha; or beta
568 |     """
569 |     def __init__(self, incomings, masks=None, direction='col', **kwargs):
570 |         self.have_mask = False
571 |         if masks:
572 |             incomings = incomings + masks
573 |             self.have_mask = True
574 |         super(AttendOnEmbedding, self).__init__(incomings, **kwargs)
575 |         self.direction = direction
576 | 
577 |     def get_output_shape_for(self, input_shapes):
578 |         sent_shape = input_shapes[0]
579 |         emat_shape = input_shapes[1]
580 |         if self.direction == 'col':
581 |             # x:    (BSIZE, R_x, DIM)
582 |             # emat: (BSIZE, R_x, R_y)
583 |             # out:  (BSIZE, R_y, DIM)
584 |             return (sent_shape[0], emat_shape[2], sent_shape[2])
585 |         elif self.direction == 'row':
586 |             # y:    (BSIZE, R_y, DIM)
587 |             # emat: (BSIZE, R_x, R_y)
588 |             # out:  (BSIZE, R_x, DIM)
589 |             return (sent_shape[0], emat_shape[1], sent_shape[2])
590 | 
591 |     def get_output_for(self, inputs, **kwargs):
592 |         sentence = inputs[0]
593 |         emat = inputs[1]
594 |         if self.have_mask:
595 |             xmask = inputs[2]
596 |             ymask = inputs[3]
597 |             xymask = T.batched_dot(xmask.dimshuffle(0, 1, 'x'),
598 |                                    ymask.dimshuffle(0, 'x', 1))
599 |             emat = emat * xymask.astype(theano.config.floatX) - \
600 |                    numpy.asarray(1e36).astype(theano.config.floatX) * \
601 |                    (1 - xymask).astype(theano.config.floatX)
602 | 
603 |         if self.direction == 'col':  # softmax on x's dim, and multiply by x
604 |             annotation = T.nnet.softmax(
605 |                 emat.dimshuffle(0, 2, 1).reshape((
606 |                     emat.shape[0] * emat.shape[2], emat.shape[1]))
607 |             ).reshape((
608 |                 emat.shape[0], emat.shape[2], emat.shape[1]
609 |             ))  # (BSIZE, R_y, R_x)
610 |             if self.have_mask:
611 |                 annotation = annotation * ymask.dimshuffle(
612 |                     0, 1, 'x').astype(theano.config.floatX)
613 |         elif self.direction == 'row':  # softmax on y's dim, and multiply by y
614 |             annotation = T.nnet.softmax(
615 |                 emat.reshape((
616 |                     emat.shape[0] * emat.shape[1], emat.shape[2]))
617 |             ).reshape((
618 |                 emat.shape[0], emat.shape[1], emat.shape[2]
619 |             ))  # (BSIZE, R_x, R_y)
620 |             if self.have_mask:
621 |                 annotation = annotation * xmask.dimshuffle(
622 |                     0, 1, 'x').astype(theano.config.floatX)
623 |         return T.batched_dot(annotation, sentence)
624 | 
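The two directions of `AttendOnEmbedding` differ only in which axis of the embedding pool is normalised. The NumPy sketch below leaves out masking for brevity; `softmax` and `attend_on_embedding` are hypothetical helper names, and `emat` is built the way `ComputeEmbeddingPool` does.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend_on_embedding(sentence, emat, direction='col'):
    # emat: (batch, R_x, R_y)
    if direction == 'col':
        # Normalise over x's rows, then mix rows of x: (batch, R_y, dim).
        weights = softmax(emat, axis=1).transpose(0, 2, 1)
    elif direction == 'row':
        # Normalise over y's rows, then mix rows of y: (batch, R_x, dim).
        weights = softmax(emat, axis=2)
    else:
        raise ValueError("direction must be 'col' or 'row'")
    return weights @ sentence

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))
y = rng.normal(size=(2, 5, 4))
emat = x @ y.transpose(0, 2, 1)                  # (batch, R_x, R_y)
alpha = attend_on_embedding(x, emat, 'col')      # (batch, R_y, dim)
beta = attend_on_embedding(y, emat, 'row')       # (batch, R_x, dim)
uniform = attend_on_embedding(x, np.zeros((2, 3, 5)), 'col')
```

With an all-zero pool the weights are uniform, so every output row is the mean of the rows of `x`.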
625 | 
626 | class MeanOverDim(MergeLayer):
627 |     """
628 |     dim can be a number or a tuple of numbers to indicate which dim to compute mean.
629 |     """
630 |     def __init__(self, incoming, mask=None, dim=1, **kwargs):
631 |         incomings = [incoming]
632 |         self.have_mask = False
633 |         if mask:
634 |             incomings.append(mask)
635 |             self.have_mask = True
636 |         super(MeanOverDim, self).__init__(incomings, **kwargs)
637 |         self.dim = dim
638 | 
639 |     def get_output_shape_for(self, input_shapes):
640 |         return tuple(x for i, x in enumerate(input_shapes[0]) if i not in numpy.atleast_1d(self.dim))
641 | 
642 |     def get_output_for(self, inputs, **kwargs):
643 |         if self.have_mask:  # assumes a 3D input reduced over dim 1, with zeros at padded positions
644 |             return T.sum(inputs[0], axis=self.dim) / \
645 |                    inputs[1].sum(axis=1).dimshuffle(0, 'x')
646 |         else:
647 |             return T.mean(inputs[0], axis=self.dim)
648 | 
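The masked branch of `MeanOverDim` divides the sum by the number of unmasked positions. In the NumPy sketch below, the input is additionally multiplied by the mask before summing; that step is an assumption of the sketch and is not in the layer, which sums the raw input. `masked_mean` is an illustrative name.

```python
import numpy as np

def masked_mean(x, mask):
    # x: (batch, time, dim), mask: (batch, time) with 1 for real positions
    m = mask[:, :, None].astype(x.dtype)
    # Zero out padding explicitly, then divide by the true length.
    return (x * m).sum(axis=1) / m.sum(axis=1)

x = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
mean = masked_mean(x, mask)
```

The padded third position is ignored, so the result is the mean of the first two rows only.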
649 | 
650 | class MaxpoolingG(Layer):
651 |     """
652 |     Input : G matrix,
653 |     Input shape: (BSIZE, NROW, dim1, dim2)
654 | 
655 |     Output shape:
656 |         'row': (BSIZE, dim2, NROW)
657 |         'col': (BSIZE, dim1, NROW)
658 |     """
659 |     def __init__(self, incoming, direction='col', **kwargs):
660 |         super(MaxpoolingG, self).__init__(incoming, **kwargs)
661 |         self.direction = direction
662 | 
663 |     def get_output_shape_for(self, input_shape):
664 |         if self.direction == 'row':
665 |             return (input_shape[0], input_shape[3], input_shape[1])
666 |         elif self.direction == 'col':
667 |             return (input_shape[0], input_shape[2], input_shape[1])
668 | 
669 |     def get_output_for(self, input, **kwargs):
670 |         G = input
671 |         if self.direction == 'row':
672 |             return T.max(G, axis=2).dimshuffle(0, 2, 1)
673 |         elif self.direction == 'col':
674 |             return T.max(G, axis=3).dimshuffle(0, 2, 1)
675 | 
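The axis bookkeeping in `MaxpoolingG` can be checked with a small NumPy sketch. `maxpool_g` is a hypothetical stand-in that mirrors the layer's two branches on a random `G`.

```python
import numpy as np

def maxpool_g(G, direction='col'):
    # G: (batch, rows, dim1, dim2)
    if direction == 'row':
        # Max over dim1, result (batch, dim2, rows).
        return G.max(axis=2).transpose(0, 2, 1)
    if direction == 'col':
        # Max over dim2, result (batch, dim1, rows).
        return G.max(axis=3).transpose(0, 2, 1)
    raise ValueError("direction must be 'row' or 'col'")

rng = np.random.default_rng(0)
G = rng.normal(size=(2, 3, 4, 5))
col = maxpool_g(G, 'col')
row = maxpool_g(G, 'row')
```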
676 | 
677 | class Maxpooling(Layer):
678 |     """
679 |     Input : N-D tensor.
680 |     Takes the max over the dimension given by `axis` (default 1).
681 | 
682 |     Output shape: the input shape with dimension `axis` removed.
683 |     """
684 |     def __init__(self, incoming, axis=1, **kwargs):
685 |         super(Maxpooling, self).__init__(incoming, **kwargs)
686 |         self.axis = axis
687 | 
688 |     def get_output_shape_for(self, input_shape):
689 |         return input_shape[:self.axis] + input_shape[(self.axis+1):]
690 | 
691 |     def get_output_for(self, input, **kwargs):
692 |         return T.max(input, axis=self.axis)
693 | 


--------------------------------------------------------------------------------