├── .gitignore ├── README.md ├── oov_vec.py ├── oov_vec_nlc.py ├── LICENCE ├── semlp_rate_l2_dpout.py ├── lstmmlp_rate_l2_dpout.py ├── dataset.py ├── segae_gaereg.py ├── segae_l2_dpout.py └── util_layers.py /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.swo 3 | *.swp 4 | 5 | *.pyc 6 | *.ipynb 7 | 8 | *.npy 9 | *.pkl 10 | *.html 11 | *.log 12 | 13 | SMART_DISPATCH_LOGS/* 14 | html/* 15 | params/* 16 | screenlog.0 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Self Attentive Sentence Embedding 2 | This is the implementation for the paper **A Structured Self-Attentive Sentence Embedding**, published at ICLR 2017: https://arxiv.org/abs/1703.03130 . We provide reproductions of the results on the Yelp, Age and SNLI datasets, as well as their baselines. 3 | 4 | Thanks to the community, there were various reimplementations of this work 5 | by researchers from different groups before we released 6 | this version of the code. Some of them even achieved higher performance than the 7 | results we reported in the paper. We would really like to thank them here, and we list 8 | those third-party implementations at the end of this readme. They provide 9 | our model in different frameworks (TensorFlow, PyTorch) as well. 10 | 11 | 12 | ## Requirements: 13 | [Theano](http://deeplearning.net/software/theano/) 14 | [Lasagne](http://lasagne.readthedocs.io/en/latest/) 15 | [scikit-learn](http://scikit-learn.org/stable/) 16 | [NLTK](http://www.nltk.org/) 17 | 18 | 19 | ## Datasets and Preprocessing 20 | The SNLI dataset can be downloaded from https://nlp.stanford.edu/projects/snli/ . 21 | The file ``oov_vec.py`` preprocesses this dataset; it needs no additional command line arguments.
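The core idea of the preprocessing in ``oov_vec.py`` can be sketched as follows: a token missing from the pretrained embeddings receives the mean of the known vectors that share a sentence with it, and a token with no known neighbour falls back to small Gaussian noise. This is a simplified illustration written for Python 3 and NumPy on toy data, not the script itself; the function name `augment_oov` and its parameters are hypothetical, and the tokenization, lowercasing and punctuation stripping done by the real script are left out.

```python
import numpy as np


def augment_oov(sentences, known, dim, seed=123, std=0.01):
    """Return a word -> vector dict covering every token in `sentences`.

    `known` maps in-vocabulary words to vectors. Illustrative helper,
    not part of the repository.
    """
    rng = np.random.RandomState(seed)
    neighbours = {}
    for sent in sentences:
        # vectors of the in-vocabulary words of this sentence
        in_vocab = [known[w] for w in sent if w in known]
        for word in sent:
            if word not in known:
                neighbours.setdefault(word, []).extend(in_vocab)

    augmented = dict(known)
    for word, vecs in neighbours.items():
        if vecs:
            # mean of all surrounding known words
            augmented[word] = np.mean(vecs, axis=0).astype('float32')
        else:
            # no known neighbour at all: small random vector around zero
            augmented[word] = (rng.randn(dim) * std).astype('float32')
    return augmented


known = {"a": np.array([1.0, 0.0], dtype="float32"),
         "dog": np.array([0.0, 1.0], dtype="float32")}
vectors = augment_oov([["a", "dog", "barkz"], ["qwzx"]], known, dim=2)
print(vectors["barkz"])  # mean of the vectors for "a" and "dog"
```

Averaging over neighbours keeps an unknown word close to the context it appears in, while the random fallback only applies to tokens that never co-occur with a known word.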
22 | 23 | The [Yelp](https://www.yelp.com/dataset_challenge) and [Age](http://pan.webis.de/clef16/pan16-web/author-profiling.html) data are preprocessed by a single script, ``oov_vec_nlc.py``, which takes the embedding type and then the dataset name as command line args: 24 | ``` 25 | python oov_vec_nlc.py glove age2 26 | python oov_vec_nlc.py glove yelp 27 | ``` 28 | The first argument can be either `word2vec` or `glove`. 29 | 30 | 31 | ## Word Embeddings 32 | Our experiments are mainly based on GloVe embeddings (https://nlp.stanford.edu/projects/glove/), but we have also tested `word2vec` (https://code.google.com/archive/p/word2vec/) on the Age and Yelp datasets. 33 | 34 | 35 | ## Training Baselines 36 | After running the preprocessing scripts, the baseline results on the Age and Yelp datasets can be reproduced with the following configurations: 37 | 38 | ``` 39 | python lstmmlp_rate_l2_dpout.py 300 3000 0.06 0.0001 0.5 word2vec 100 16 0.5 300 0.1 1 age2 40 | python lstmmlp_rate_l2_dpout.py 300 3000 0.06 0.0001 0.5 word2vec 100 32 0.5 300 0.1 1 yelp 41 | ``` 42 | 43 | ## Training the Proposed Model 44 | 45 | To reproduce the results in our paper on Age and Yelp, please run: 46 | ``` 47 | python semlp_rate_l2_dpout.py 300 350 2000 30 0.001 0.3 0.0001 1. glove 300 50 0.5 100 0.1 1 age2 48 | python semlp_rate_l2_dpout.py 300 350 3000 30 0.001 0.3 0.0001 1. 
glove 300 50 0.5 100 0.1 1 yelp 49 | ``` 50 | 51 | And on SNLI dataset: 52 | ``` 53 | python segae_gaereg.py 300 150 4000 30 0.01 0.1 0.5 300 50 100 12 0.1 54 | ``` 55 | 56 | ## Third Party Implementations 57 | * PyTorch implementation by Haoyue Shi (@ExplorerFreda): https://github.com/ExplorerFreda/Structured-Self-Attentive-Sentence-Embedding 58 | 59 | * PyTorch implementation by Yufeng Ma (@yufengm): https://github.com/yufengm/SelfAttentive 60 | 61 | * TensorFlow implementation by Diego Antognini (@Diego999): https://github.com/Diego999/SelfSent 62 | -------------------------------------------------------------------------------- /oov_vec.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import string 3 | import numpy 4 | import cPickle 5 | import numpy as np 6 | import nltk 7 | 8 | import pdb 9 | 10 | print "loading GloVe..." 11 | w1 = {} 12 | vec = open('/Users/johanlin/Datasets/wordembeddings/glove.840B.300d.txt', 'r') 13 | for line in vec.readlines(): 14 | line=line.split(' ') 15 | w1[line[0]] = np.asarray([float(x) for x in line[1:]]).astype('float32') 16 | vec.close() 17 | 18 | classname = {'entailment':0, 'neutral': 1, 'contradiction': 2, '-': 3} 19 | f1 = open('/Users/johanlin/Datasets/snli_1.0/snli_1.0_train.txt', 'r') 20 | f2 = open('/Users/johanlin/Datasets/snli_1.0/snli_1.0_dev.txt', 'r') 21 | f3 = open('/Users/johanlin/Datasets/snli_1.0/snli_1.0_test.txt', 'r') 22 | f = [f1, f2, f3] 23 | 24 | 25 | print "processing dataset: 3 dots to punch: ", 26 | sys.stdout.flush() 27 | w2 = {} 28 | w_referred = {0: 0} # reserve 0 for future padding 29 | vocab_count = 1 # 0 is reserved for future padding 30 | train_valid_test = [] 31 | for file in f: 32 | print ".", 33 | sys.stdout.flush() 34 | pairs = [] 35 | filehead = file.readline() # strip the file head 36 | for line in file.readlines(): 37 | line=line.split('\t') 38 | s1 = nltk.word_tokenize(line[5]) 39 | s1[0]=s1[0].lower() 40 | s2 = 
nltk.word_tokenize(line[6]) 41 | s2[0]=s2[0].lower() 42 | 43 | truth = classname[line[0]] 44 | 45 | if truth != 3: # exclude those '-' tags 46 | s1_words = [] 47 | for word in s1: 48 | # strip some possible weird punctuations 49 | word = word.strip(string.punctuation) 50 | if not w_referred.has_key(word): 51 | w_referred[word] = vocab_count 52 | vocab_count += 1 53 | s1_words.append(w_referred[word]) 54 | if not w1.has_key(word): 55 | if not w2.has_key(word): 56 | w2[word]=[] 57 | # find the WE for its surounding words 58 | for neighbor in s1: 59 | if w1.has_key(neighbor): 60 | w2[word].append(w1[neighbor]) 61 | 62 | s2_words = [] 63 | for word in s2: 64 | word = word.strip(string.punctuation) 65 | if not w_referred.has_key(word): 66 | w_referred[word] = vocab_count 67 | vocab_count += 1 68 | s2_words.append(w_referred[word]) 69 | if not w1.has_key(word): 70 | if not w2.has_key(word): 71 | w2[word]=[] 72 | for neighbor in s2: 73 | if w1.has_key(neighbor): 74 | w2[word].append(w1[neighbor]) 75 | 76 | pairs.append((numpy.asarray(s1_words).astype('int32'), 77 | numpy.asarray(s2_words).astype('int32'), 78 | numpy.asarray(truth).astype('int32'))) 79 | 80 | train_valid_test.append(pairs) 81 | file.close() 82 | 83 | 84 | print "\naugmenting word embedding vocabulary..." 85 | # this block is causing memory error in a 8G computer. Using alternatives. 86 | # all_sentences = [w2[x] for x in w2.iterkeys()] 87 | # all_words = [item for sublist in all_sentences for item in sublist] 88 | # mean_words = np.mean(all_words) 89 | # mean_words_std = np.std(all_words) 90 | mean_words = np.zeros((300,)) 91 | mean_words_std = 1e-1 92 | 93 | npy_rng = np.random.RandomState(123) 94 | for k in w2.iterkeys(): 95 | if len(w2[k]) != 0: 96 | w2[k] = sum(w2[k]) / len(w2[k]) # mean of all surounding words 97 | else: 98 | # len(w2[k]) == 0 cases: ['cantunderstans', 'motocyckes', 'arefun'] 99 | # I hate those silly guys... 
100 | w2[k] = mean_words + npy_rng.randn(mean_words.shape[0]) * \ 101 | mean_words_std * 0.1 102 | 103 | w2.update(w1) 104 | 105 | print "generating weight values..." 106 | # reverse w_referred's key-value; 107 | inv_w_referred = {v: k for k, v in w_referred.items()} 108 | 109 | # number --inv_w_referred--> word --w2--> embedding 110 | ordered_word_embedding = [numpy.zeros((1, 300), dtype='float32'), ] + \ 111 | [w2[inv_w_referred[n]].reshape(1, -1) for n in range(1, len(inv_w_referred))] 112 | 113 | # to get the matrix 114 | weight = numpy.concatenate(ordered_word_embedding, axis=0) 115 | 116 | 117 | print "dumping converted datasets..." 118 | save_file = open('/Users/johanlin/Datasets/snli_1.0/SNLI_GloVe_converted', 'wb') 119 | cPickle.dump("dict: truth values and their corresponding class name\n" 120 | "the whole dataset, in list of list of tuples: list of train/valid/test set -> " 121 | "list of sentence pairs -> tuple with structure:" 122 | "(hypothesis, premise, truth class), all entries in numbers\n" 123 | "numpy.ndarray: a matrix with all referred words' embedding in its rows," 124 | "embeddings are ordered by their corresponding word numbers.\n" 125 | "dict: the augmented GloVe word embedding. contains all possible tokens in SNLI." 
126 | "All initial GloVe entries are included.\n" 127 | "dict w_referred: word to their corresponding number\n" 128 | "inverse of w_referred, number to words\n", 129 | save_file) 130 | cPickle.dump(classname, save_file) 131 | cPickle.dump(train_valid_test, save_file) 132 | cPickle.dump(weight, save_file) 133 | cPickle.dump(w2, save_file) 134 | cPickle.dump(w_referred, save_file) 135 | cPickle.dump(inv_w_referred, save_file) 136 | save_file.close() 137 | 138 | 139 | # check: 140 | def reconstruct_sentence(sent_nums): 141 | sent_words = [inv_w_referred[n] for n in sent_nums] 142 | return sent_words 143 | 144 | def check_word_embed(sent_nums): 145 | sent_words = reconstruct_sentence(sent_nums) 146 | 147 | word_embeds_from_nums = [weight[n] for n in sent_nums] 148 | word_embeds_from_words = [w2[n] for n in sent_words] 149 | 150 | error = 0. 151 | for i, j in zip(word_embeds_from_nums, word_embeds_from_words): 152 | error += numpy.sum(i-j) 153 | 154 | if error == 0.: 155 | return True 156 | else: 157 | return False 158 | -------------------------------------------------------------------------------- /oov_vec_nlc.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import string 4 | import numpy 5 | import cPickle 6 | import numpy as np 7 | import nltk 8 | 9 | import pdb 10 | 11 | 12 | ################################################### 13 | # Overall stats 14 | ################################################### 15 | # # entries # dims 16 | # adapted word2vec 530158 100 17 | # glove 2196016 300 18 | ################################################### 19 | # age1 age2 yelp 20 | # train data - 68485 500000 21 | # dev/test data - 4000 2000 22 | # word2vec known token 126862 233293 23 | # UNK token 93124 184427 24 | # glove known token 126862 233293 25 | # UNK token 49268 104717 26 | ################################################### 27 | # 120739 28 | # 43256 88984 29 | 30 | wdembed = sys.argv[1] # 
word2vec, glove 31 | data_choice = sys.argv[2] # age1, age2, yelp 32 | 33 | 34 | # load word embedding: 35 | if wdembed == 'word2vec': 36 | print "loading adapted word2vec..." 37 | fname = '/home/hantek/datasets/NLC_data/embedding' 38 | w1 = {} 39 | vec = open(fname, 'r') 40 | for line in vec.readlines(): 41 | line=line.split() 42 | w1[line[0]] = np.asarray([float(x) for x in line[1:]]).astype('float32') 43 | vec.close() 44 | elif wdembed == 'glove': 45 | print "loading GloVe..." 46 | fname = '/home/hantek/datasets/glove/glove.840B.300d.dict.pkl' 47 | if os.path.isfile(fname): 48 | w1 = cPickle.load(open(fname, 'rb')) 49 | else: 50 | w1 = {} 51 | vec = open('/home/hantek/datasets/glove/glove.840B.300d.txt', 'r') 52 | for line in vec.readlines(): 53 | line=line.split(' ') 54 | w1[line[0]] = np.asarray([float(x) for x in line[1:]]).astype('float32') 55 | vec.close() 56 | save_file = open(fname, 'wb') 57 | cPickle.dump(w1, save_file) 58 | save_file.close() 59 | else: 60 | raise ValueError("cmd args 1 has to be either 'word2vec' or 'glove'.") 61 | 62 | 63 | # load data: 64 | if data_choice == 'age1': 65 | f1 = open('/home/hantek/datasets/NLC_data/age1/age1_train', 'r') 66 | f2 = open('/home/hantek/datasets/NLC_data/age1/age1_valid', 'r') 67 | f3 = open('/home/hantek/datasets/NLC_data/age1/age1_test', 'r') 68 | classname = {} 69 | elif data_choice == 'age2': 70 | f1 = open('/home/hantek/datasets/NLC_data/age2/age2_train', 'r') 71 | f2 = open('/home/hantek/datasets/NLC_data/age2/age2_valid', 'r') 72 | f3 = open('/home/hantek/datasets/NLC_data/age2/age2_test', 'r') 73 | # note that class No. = rating -1 74 | classname = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4} 75 | elif data_choice == 'yelp': 76 | f1 = open('/home/hantek/datasets/NLC_data/yelp/yelp_train_500k', 'r') 77 | f2 = open('/home/hantek/datasets/NLC_data/yelp/yelp_valid_2000', 'r') 78 | f3 = open('/home/hantek/datasets/NLC_data/yelp/yelp_test_2000', 'r') 79 | # note that class No. 
= rating -1 80 | classname = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4} 81 | else: 82 | raise ValueError("cmd args 2 has to be either 'age1', 'age2' or 'yelp'.") 83 | f = [f1, f2, f3] 84 | 85 | 86 | print "processing dataset, 3 dots to punch: ", 87 | sys.stdout.flush() 88 | w2 = {} 89 | w_referred = {0: 0} # reserve 0 for future padding 90 | vocab_count = 1 # 0 is reserved for future padding 91 | train_dev_test = [] 92 | for file in f: 93 | print ".", 94 | sys.stdout.flush() 95 | pairs = [] 96 | for line in file.readlines(): 97 | line=line.decode('utf-8').split() 98 | s1 = line[1:] 99 | s1[0]=s1[0].lower() 100 | 101 | rate_score = classname[line[0]] 102 | # rate_score = line[0] 103 | 104 | s1_words = [] 105 | for word in s1: 106 | if not w_referred.has_key(word): 107 | w_referred[word] = vocab_count 108 | vocab_count += 1 109 | s1_words.append(w_referred[word]) 110 | if not w1.has_key(word): 111 | if not w2.has_key(word): 112 | w2[word]=[] 113 | # collect the word embeddings of its surrounding words 114 | for neighbor in s1: 115 | if w1.has_key(neighbor): 116 | w2[word].append(w1[neighbor]) 117 | 118 | pairs.append((numpy.asarray(s1_words).astype('int32'), 119 | rate_score)) 120 | # numpy.asarray(rate_score).astype('int32'))) 121 | 122 | train_dev_test.append(pairs) 123 | file.close() 124 | 125 | # pdb.set_trace() # debugging breakpoint, disabled so the script runs unattended 126 | 127 | print "\naugmenting word embedding vocabulary..." 128 | # this block causes a memory error on a computer with 8 GB of RAM. Using alternatives. 
129 | # all_sentences = [w2[x] for x in w2.iterkeys()] 130 | # all_words = [item for sublist in all_sentences for item in sublist] 131 | # mean_words = np.mean(all_words) 132 | # mean_words_std = np.std(all_words) 133 | mean_words = np.zeros((len(w1['the']),)) 134 | mean_words_std = 1e-1 135 | 136 | npy_rng = np.random.RandomState(123) 137 | for k in w2.iterkeys(): 138 | if len(w2[k]) != 0: 139 | w2[k] = sum(w2[k]) / len(w2[k]) # mean of all surounding words 140 | else: 141 | # len(w2[k]) == 0 cases: ['cantunderstans', 'motocyckes', 'arefun'] 142 | # I hate those silly guys... 143 | w2[k] = mean_words + npy_rng.randn(mean_words.shape[0]) * \ 144 | mean_words_std * 0.1 145 | 146 | w2.update(w1) 147 | 148 | print "generating weight values..." 149 | # reverse w_referred's key-value; 150 | inv_w_referred = {v: k for k, v in w_referred.items()} 151 | 152 | # number --inv_w_referred--> word --w2--> embedding 153 | ordered_word_embedding = [numpy.zeros((1, len(w1['the'])), dtype='float32'), ] + \ 154 | [w2[inv_w_referred[n]].reshape(1, -1) for n in range(1, len(inv_w_referred))] 155 | 156 | # to get the matrix 157 | weight = numpy.concatenate(ordered_word_embedding, axis=0) 158 | 159 | 160 | print "dumping converted datasets..." 
161 | if data_choice == 'age1': 162 | save_file = open('/home/hantek/datasets/NLC_data/age1/' + wdembed + '_age1.pkl', 'wb') 163 | elif data_choice == 'age2': 164 | save_file = open('/home/hantek/datasets/NLC_data/age2/' + wdembed + '_age2.pkl', 'wb') 165 | elif data_choice == 'yelp': 166 | save_file = open('/home/hantek/datasets/NLC_data/yelp/' + wdembed + '_yelp.pkl', 'wb') 167 | 168 | cPickle.dump("dict: truth values and their corresponding class name\n" 169 | "the whole dataset, in list of list of tuples: list of train/valid/test set -> " 170 | "list of sentence pairs -> tuple with structure:" 171 | "(review, truth rate), all entries in numbers\n" 172 | "numpy.ndarray: a matrix with all referred words' embedding in its rows," 173 | "embeddings are ordered by their corresponding word numbers.\n" 174 | "dict: the augmented GloVe word embedding. contains all possible tokens in SNLI." 175 | "All initial GloVe entries are included.\n" 176 | "dict w_referred: word to their corresponding number\n" 177 | "inverse of w_referred, number to words\n", 178 | save_file) 179 | cPickle.dump(classname, save_file) 180 | cPickle.dump(train_dev_test, save_file) 181 | cPickle.dump(weight, save_file) 182 | fake_w2 = None; cPickle.dump(fake_w2, save_file) 183 | # cPickle.dump(w2, save_file) # this is a huge dictionary, delete it if you don't need. 184 | cPickle.dump(w_referred, save_file) 185 | cPickle.dump(inv_w_referred, save_file) 186 | save_file.close() 187 | print "Done." 188 | 189 | 190 | # check: 191 | def reconstruct_sentence(sent_nums): 192 | sent_words = [inv_w_referred[n] for n in sent_nums] 193 | return sent_words 194 | 195 | def check_word_embed(sent_nums): 196 | sent_words = reconstruct_sentence(sent_nums) 197 | 198 | word_embeds_from_nums = [weight[n] for n in sent_nums] 199 | word_embeds_from_words = [w2[n] for n in sent_words] 200 | 201 | error = 0. 
202 | for i, j in zip(word_embeds_from_nums, word_embeds_from_words): 203 | error += numpy.sum(i-j) 204 | 205 | if error == 0.: 206 | return True 207 | else: 208 | return False 209 | -------------------------------------------------------------------------------- /LICENCE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | 203 | -------------------------------------------------------------------------------- /semlp_rate_l2_dpout.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from __future__ import print_function 5 | 6 | import time 7 | import os 8 | import sys 9 | import numpy 10 | import cPickle 11 | import theano 12 | import theano.tensor as T 13 | import lasagne 14 | from lasagne.layers.recurrent import Gate 15 | from lasagne.layers import (DropoutLayer, LSTMLayer, EmbeddingLayer, 16 | ConcatLayer, DenseLayer) 17 | from lasagne import init, nonlinearities 18 | 19 | from util_layers import (DenseLayer3DInput, Softmax3D, ApplyAttention, 20 | GatedEncoder3D) 21 | from dataset import YELP, AGE2 22 | 23 | import pdb 24 | theano.config.compute_test_value = 'warn' # 'off' # Use 'warn' to activate 25 | 26 | 27 | LSTMHID = int(sys.argv[1]) # 300 Hidden unit numbers in LSTM 28 | ATTHID = int(sys.argv[2]) # 350 Hidden unit numbers in attention MLP 29 | OUTHID = int(sys.argv[3]) # 3000 Hidden unit numbers in output MLP 30 | NROW = int(sys.argv[4]) # 30 
Number of rows in matrix representation 31 | LR = float(sys.argv[5]) # 0.001 Learning rate 32 | DPOUT = float(sys.argv[6]) # 0.3 dropout rate 33 | L2REG = float(sys.argv[7]) # 0.0001 L2 regularization 34 | ATTPNLT = float(sys.argv[8]) # 0. Coefficient of the attention penalty 35 | WE = str(sys.argv[9]) # either `word2vec` or `glove` 36 | WEDIM = int(sys.argv[10]) # either 100 or 300 Dim 37 | BSIZE = int(sys.argv[11]) # 50 Minibatch size 38 | GCLIP = float(sys.argv[12]) # 0.5 All gradients above will be clipped 39 | NEPOCH = int(sys.argv[13]) # 100 Number of epochs to train the net 40 | STD = float(sys.argv[14]) # 0.1 Standard deviation of weights in init 41 | UPDATEWE = bool(int(sys.argv[15])) # 0 for False and 1 for True. Update WE 42 | DSET = str(sys.argv[16]) # dataset, either `yelp` or `age2` 43 | 44 | filename = __file__.split('.')[0] + \ 45 | '_LSTMHID' + str(LSTMHID) + \ 46 | '_ATTHID' + str(ATTHID) + \ 47 | '_OUTHID' + str(OUTHID) + \ 48 | '_NROW' + str(NROW) + \ 49 | '_LR' + str(LR) + \ 50 | '_DPOUT' + str(DPOUT) + \ 51 | '_L2REG' + str(L2REG) + \ 52 | '_ATTPNLT' + str(ATTPNLT) + \ 53 | '_WE' + str(WE) + \ 54 | '_WEDIM' + str(WEDIM) + \ 55 | '_BSIZE' + str(BSIZE) + \ 56 | '_GCLIP' + str(GCLIP) + \ 57 | '_NEPOCH' + str(NEPOCH) + \ 58 | '_STD' + str(STD) + \ 59 | '_UPDATEWE' + str(UPDATEWE) + \ 60 | '_DSET' + DSET 61 | 62 | def main(num_epochs=NEPOCH): 63 | if DSET == 'yelp': 64 | print("Loading yelp dataset ...") 65 | loaded_dataset = YELP( 66 | batch_size=BSIZE, 67 | datapath="/home/hantek/datasets/NLC_data/yelp/" + WE + "_yelp.pkl") 68 | elif DSET == 'age2': 69 | print("Loading age2 dataset ...") 70 | loaded_dataset = AGE2( 71 | batch_size=BSIZE, 72 | datapath="/home/hantek/datasets/NLC_data/age2/" + WE + "_age2.pkl") 73 | else: 74 | raise ValueError("DSET was set incorrectly. 
Check your cmd args.") 75 | # yelp age2 76 | # train data 500000 68450 77 | # dev/test data 2000 4000 78 | # vocab ~1.2e5 79 | # 80 | 81 | train_batches = list(loaded_dataset.train_minibatch_generator()) 82 | dev_batches = list(loaded_dataset.dev_minibatch_generator()) 83 | test_batches = list(loaded_dataset.test_minibatch_generator()) 84 | W_word_embedding = loaded_dataset.weight # W shape: (# vocab size, WE_DIM) 85 | del loaded_dataset 86 | 87 | print("Building network ...") 88 | ########### sentence embedding encoder ########### 89 | # sentence vector, with each number standing for a word number 90 | input_var = T.TensorType('int32', [False, False])('sentence_vector') 91 | input_var.tag.test_value = numpy.hstack(( 92 | numpy.random.randint(1, 10000, (BSIZE, 20), 'int32'), 93 | numpy.zeros((BSIZE, 5)).astype('int32'))) 94 | input_var.tag.test_value[1, 20:22] = (413, 45) 95 | l_in = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var) 96 | 97 | input_mask = T.TensorType('int32', [False, False])('sentence_mask') 98 | input_mask.tag.test_value = numpy.hstack(( 99 | numpy.ones((BSIZE, 20), dtype='int32'), 100 | numpy.zeros((BSIZE, 5), dtype='int32'))) 101 | input_mask.tag.test_value[1, 20:22] = 1 102 | l_mask = lasagne.layers.InputLayer(shape=(BSIZE, None), 103 | input_var=input_mask) 104 | 105 | # output shape (BSIZE, None, WEDIM) 106 | l_word_embed = lasagne.layers.EmbeddingLayer( 107 | l_in, 108 | input_size=W_word_embedding.shape[0], 109 | output_size=W_word_embedding.shape[1], 110 | W=W_word_embedding) 111 | 112 | # bidirectional LSTM 113 | l_forward = lasagne.layers.LSTMLayer( 114 | l_word_embed, mask_input=l_mask, num_units=LSTMHID, 115 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 116 | W_cell=init.Normal(STD)), 117 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 118 | W_cell=init.Normal(STD)), 119 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 120 | W_cell=None, nonlinearity=nonlinearities.tanh), 121 
| outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 122 | W_cell=init.Normal(STD)), 123 | nonlinearity=lasagne.nonlinearities.tanh, 124 | peepholes = False, 125 | grad_clipping=GCLIP) 126 | 127 | l_backward = lasagne.layers.LSTMLayer( 128 | l_word_embed, mask_input=l_mask, num_units=LSTMHID, 129 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 130 | W_cell=init.Normal(STD)), 131 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 132 | W_cell=init.Normal(STD)), 133 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 134 | W_cell=None, nonlinearity=nonlinearities.tanh), 135 | outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 136 | W_cell=init.Normal(STD)), 137 | nonlinearity=lasagne.nonlinearities.tanh, 138 | peepholes = False, 139 | grad_clipping=GCLIP, backwards=True) 140 | 141 | # output dim: (BSIZE, None, 2*LSTMHID) 142 | l_concat = lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2) 143 | l_concat_dpout = lasagne.layers.DropoutLayer(l_concat, p=DPOUT, rescale=True) 144 | 145 | # Attention mechanism to get sentence embedding 146 | # output dim: (BSIZE, None, ATTHID) 147 | l_ws1 = DenseLayer3DInput(l_concat_dpout, num_units=ATTHID) 148 | l_ws1_dpout = lasagne.layers.DropoutLayer(l_ws1, p=DPOUT, rescale=True) 149 | # output dim: (BSIZE, None, NROW) 150 | l_ws2 = DenseLayer3DInput(l_ws1_dpout, num_units=NROW, nonlinearity=None) 151 | l_annotations = Softmax3D(l_ws2, mask=l_mask) 152 | # output dim: (BSIZE, 2*LSTMHID, NROW) 153 | l_sentence_embedding = ApplyAttention([l_annotations, l_concat]) 154 | l_sentence_embedding_dpout = lasagne.layers.DropoutLayer( 155 | l_sentence_embedding, p=DPOUT, rescale=True) 156 | 157 | l_outhid = lasagne.layers.DenseLayer( 158 | l_sentence_embedding_dpout, num_units=OUTHID, 159 | nonlinearity=lasagne.nonlinearities.rectify) 160 | l_outhid_dpout = lasagne.layers.DropoutLayer(l_outhid, p=DPOUT, rescale=True) 161 | 162 | l_output = lasagne.layers.DenseLayer( 163 | 
l_outhid_dpout, num_units=5, nonlinearity=lasagne.nonlinearities.softmax) 164 | 165 | 166 | ########### target, cost, validation, etc. ########## 167 | target_values = T.ivector('target_output') 168 | target_values.tag.test_value = numpy.asarray([1,] * BSIZE, dtype='int32') 169 | 170 | network_output, annotation = lasagne.layers.get_output( 171 | [l_output, l_annotations]) 172 | network_prediction = T.argmax(network_output, axis=1) 173 | accuracy = T.mean(T.eq(network_prediction, target_values)) 174 | 175 | network_output_clean, annotation_clean = lasagne.layers.get_output( 176 | [l_output, l_annotations], deterministic=True) 177 | network_prediction_clean = T.argmax(network_output_clean, axis=1) 178 | accuracy_clean = T.mean(T.eq(network_prediction_clean, target_values)) 179 | 180 | L2_attentionmlp = (l_ws1.W ** 2).sum() + (l_ws2.W ** 2).sum() 181 | L2_outputhid = (l_outhid.W ** 2).sum() 182 | L2_softmax = (l_output.W ** 2).sum() 183 | L2 = L2_attentionmlp + L2_outputhid + L2_softmax 184 | 185 | # penalty term and cost 186 | attention_penalty = T.mean((T.batched_dot( 187 | annotation, annotation.dimshuffle(0, 2, 1) 188 | ) - T.eye(annotation.shape[1]).dimshuffle('x', 0, 1) 189 | )**2, axis=(0, 1, 2)) 190 | 191 | cost = T.mean(T.nnet.categorical_crossentropy(network_output, 192 | target_values)) + \ 193 | ATTPNLT * attention_penalty + L2REG * L2 194 | cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean, 195 | target_values)) + \ 196 | ATTPNLT * attention_penalty + L2REG * L2 197 | 198 | # Retrieve all parameters from the network 199 | all_params = lasagne.layers.get_all_params(l_output) 200 | if not UPDATEWE: 201 | all_params.remove(l_word_embed.W) 202 | 203 | numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]]) 204 | print("Number of params: {}\nName\t\t\tShape\t\t\tSize".format(numparams)) 205 | print("-----------------------------------------------------------------") 206 | for item in all_params: 207 | 
print("{0:24}{1:24}{2}".format(item, item.shape.eval(), numpy.prod(item.shape.eval()))) 208 | 209 | # if exist param file then load params 210 | look_for = 'params' + os.sep + 'params_' + filename + '.pkl' 211 | if os.path.isfile(look_for): 212 | print("Resuming from file: " + look_for) 213 | all_param_values = cPickle.load(open(look_for, 'rb')) 214 | for p, v in zip(all_params, all_param_values): 215 | p.set_value(v) 216 | 217 | # Compute SGD updates for training 218 | print("Computing updates ...") 219 | updates = lasagne.updates.sgd(cost, all_params, LR) 220 | 221 | # Theano functions for training and computing cost 222 | print("Compiling functions ...") 223 | train = theano.function( 224 | [l_in.input_var, l_mask.input_var, target_values], 225 | [cost, accuracy], updates=updates) 226 | compute_cost = theano.function( 227 | [l_in.input_var, l_mask.input_var, target_values], 228 | [cost_clean, accuracy_clean]) 229 | 230 | def evaluate(mode): 231 | if mode == 'dev': 232 | data = dev_batches 233 | if mode == 'test': 234 | data = test_batches 235 | 236 | set_cost = 0. 237 | set_accuracy = 0. 238 | for batches_seen, (hypo, hm, truth) in enumerate(data, 1): 239 | _cost, _accuracy = compute_cost(hypo, hm, truth) 240 | set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \ 241 | 1.0 / batches_seen * _cost 242 | set_accuracy = (1.0 - 1.0 / batches_seen) * set_accuracy + \ 243 | 1.0 / batches_seen * _accuracy 244 | 245 | return set_cost, set_accuracy 246 | 247 | print("Done. Evaluating scratch model ...") 248 | test_set_cost, test_set_accuracy = evaluate('test') 249 | print("BEFORE TRAINING: test cost %f, accuracy %f" % ( 250 | test_set_cost, test_set_accuracy)) 251 | print("Training ...") 252 | try: 253 | for epoch in range(num_epochs): 254 | train_set_cost = 0. 255 | train_set_accuracy = 0. 
256 | start = time.time() 257 | 258 | for batches_seen, (hypo, hm, truth) in enumerate(train_batches, 1): 259 | _cost, _accuracy = train(hypo, hm, truth) 260 | train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \ 261 | 1.0 / batches_seen * _cost 262 | train_set_accuracy = \ 263 | (1.0 - 1.0 / batches_seen) * train_set_accuracy + \ 264 | 1.0 / batches_seen * _accuracy 265 | if batches_seen % 100 == 0: 266 | end = time.time() 267 | print("Sample %d %.2fs, lr %.4f, train cost %f, accuracy %f" % ( 268 | batches_seen * BSIZE, 269 | end - start, 270 | LR, 271 | train_set_cost, 272 | train_set_accuracy)) 273 | start = end 274 | 275 | if batches_seen % 2000 == 0: 276 | dev_set_cost, dev_set_accuracy = evaluate('dev') 277 | test_set_cost, test_set_accuracy = evaluate('test') 278 | print("RECORD: cost: train %f dev %f test %f\n" 279 | " accu: train %f dev %f test %f" % ( 280 | train_set_cost, dev_set_cost, test_set_cost, 281 | train_set_accuracy, dev_set_accuracy, test_set_accuracy)) 282 | 283 | # save parameters 284 | all_param_values = [p.get_value() for p in all_params] 285 | cPickle.dump(all_param_values, 286 | open('params' + os.sep + 'params_' + filename + '.pkl', 'wb')) 287 | 288 | dev_set_cost, dev_set_accuracy = evaluate('dev') 289 | test_set_cost, test_set_accuracy = evaluate('test') 290 | print("RECORD:epoch %d, cost: train %f dev %f test %f\n" 291 | " accu: train %f dev %f test %f" % ( 292 | epoch, 293 | train_set_cost, dev_set_cost, test_set_cost, 294 | train_set_accuracy, dev_set_accuracy, test_set_accuracy)) 295 | except KeyboardInterrupt: 296 | pdb.set_trace() 297 | pass 298 | 299 | if __name__ == '__main__': 300 | main() 301 | 302 | -------------------------------------------------------------------------------- /lstmmlp_rate_l2_dpout.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from __future__ import print_function 5 | 6 | import time 7 | import 
os 8 | import sys 9 | import numpy 10 | import cPickle 11 | import theano 12 | import theano.tensor as T 13 | from sklearn.metrics import confusion_matrix 14 | import lasagne 15 | from lasagne.layers.recurrent import Gate 16 | from lasagne import init, nonlinearities 17 | 18 | from util_layers import (DenseLayer3DInput, Softmax3D, ApplyAttention, 19 | GatedEncoder3D, Maxpooling) 20 | from dataset import YELP, AGE2 21 | 22 | import pdb 23 | theano.config.compute_test_value = 'off' # 'off' # Use 'warn' to activate 24 | 25 | """ 26 | BEST test set result: 27 | yelp 77.575% L2REG=0.0001, DPOUT=0.3 28 | age2 63.65% L2REG=0.00001, DPOUT=0.2 29 | """ 30 | LSTMHID = int(sys.argv[1]) # 500 Hidden unit numbers in LSTM 31 | OUTHID = int(sys.argv[2]) # 1000 Hidden unit numbers in output MLP 32 | LR = float(sys.argv[3]) # 0.01 Smaller than 0.04. 33 | L2REG = float(sys.argv[4]) # 0.0001 L2 regularization 34 | DPOUT = float(sys.argv[5]) # 0.3 dropout rate 35 | WE = str(sys.argv[6]) # either `word2vec` or `glove` 36 | WEDIM = int(sys.argv[7]) # either 100 or 300 Dim 37 | BSIZE = int(sys.argv[8]) # 16 Minibatch size 38 | GCLIP = float(sys.argv[9]) # 0.5 All gradients above will be clipped 39 | NEPOCH = int(sys.argv[10]) # 300 Number of epochs to train the net 40 | STD = float(sys.argv[11]) # 0.1 Standard deviation of weights in init 41 | # very slightly better than 0.01 42 | UPDATEWE = bool(int(sys.argv[12])) # 1 0 for False and 1 for True. 
Update WE 43 | DSET = str(sys.argv[13]) # dataset, either `yelp` or `age2` 44 | 45 | filename = __file__.split('.')[0] + \ 46 | '_LSTMHID' + str(LSTMHID) + \ 47 | '_OUTHID' + str(OUTHID) + \ 48 | '_LR' + str(LR) + \ 49 | '_L2REG' + str(L2REG) + \ 50 | '_DPOUT' + str(DPOUT) + \ 51 | '_WE' + str(WE) + \ 52 | '_WEDIM' + str(WEDIM) + \ 53 | '_BSIZE' + str(BSIZE) + \ 54 | '_GCLIP' + str(GCLIP) + \ 55 | '_NEPOCH' + str(NEPOCH) + \ 56 | '_STD' + str(STD) + \ 57 | '_UPDATEWE' + str(UPDATEWE) + \ 58 | '_DSET' + DSET 59 | 60 | def main(num_epochs=NEPOCH): 61 | if DSET == 'yelp': 62 | print("Loading yelp dataset ...") 63 | loaded_dataset = YELP( 64 | batch_size=BSIZE, 65 | datapath="/home/hantek/datasets/NLC_data/yelp/word2vec_yelp.pkl") 66 | elif DSET == 'age2': 67 | print("Loading age2 dataset ...") 68 | loaded_dataset = AGE2( 69 | batch_size=BSIZE, 70 | datapath="/home/hantek/datasets/NLC_data/age2/word2vec_age2.pkl") 71 | else: 72 | raise ValueError("DSET was set incorrectly. Check your cmd args.") 73 | # yelp age2 74 | # train data 500000 68450 75 | # dev/test data 2000 4000 76 | # vocab ~1.2e5 77 | # 78 | 79 | train_batches = list(loaded_dataset.train_minibatch_generator()) 80 | dev_batches = list(loaded_dataset.dev_minibatch_generator()) 81 | test_batches = list(loaded_dataset.test_minibatch_generator()) 82 | W_word_embedding = loaded_dataset.weight # W shape: (# vocab size, WE_DIM) 83 | del loaded_dataset 84 | 85 | print("Building network ...") 86 | ########### sentence embedding encoder ########### 87 | # sentence vector, with each number standing for a word number 88 | input_var = T.TensorType('int32', [False, False])('sentence_vector') 89 | input_var.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 20), 'int32'), 90 | numpy.zeros((BSIZE, 5)).astype('int32'))) 91 | input_var.tag.test_value[1, 20:22] = (413, 45) 92 | l_in = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var) 93 | 94 | input_mask = T.TensorType('int32', [False, 
False])('sentence_mask') 95 | input_mask.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 20), dtype='int32'), 96 | numpy.zeros((BSIZE, 5), dtype='int32'))) 97 | input_mask.tag.test_value[1, 20:22] = 1 98 | l_mask = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask) 99 | 100 | # output shape (BSIZE, None, WEDIM) 101 | l_word_embed = lasagne.layers.EmbeddingLayer( 102 | l_in, 103 | input_size=W_word_embedding.shape[0], 104 | output_size=W_word_embedding.shape[1], 105 | W=W_word_embedding) 106 | 107 | # bidirectional LSTM 108 | l_forward = lasagne.layers.LSTMLayer( 109 | l_word_embed, mask_input=l_mask, num_units=LSTMHID, 110 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 111 | W_cell=init.Normal(STD)), 112 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 113 | W_cell=init.Normal(STD)), 114 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 115 | W_cell=None, nonlinearity=nonlinearities.tanh), 116 | outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 117 | W_cell=init.Normal(STD)), 118 | nonlinearity=lasagne.nonlinearities.tanh, 119 | peepholes = False, 120 | only_return_final=False, 121 | grad_clipping=GCLIP) 122 | 123 | l_backward = lasagne.layers.LSTMLayer( 124 | l_word_embed, mask_input=l_mask, num_units=LSTMHID, 125 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 126 | W_cell=init.Normal(STD)), 127 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 128 | W_cell=init.Normal(STD)), 129 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 130 | W_cell=None, nonlinearity=nonlinearities.tanh), 131 | outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 132 | W_cell=init.Normal(STD)), 133 | nonlinearity=lasagne.nonlinearities.tanh, 134 | peepholes = False, 135 | only_return_final=False, 136 | grad_clipping=GCLIP, backwards=True) 137 | 138 | # output dim: (BSIZE, None, 2*LSTMHID) 139 | l_concat = lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2) 140 | 141 | 
# output dim: (BSIZE, 2*LSTMHID) 142 | l_maxpool = Maxpooling(l_concat, axis=1) 143 | l_maxpool_dpout = lasagne.layers.DropoutLayer(l_maxpool, p=DPOUT, rescale=True) 144 | 145 | l_outhid = lasagne.layers.DenseLayer( 146 | l_maxpool_dpout, num_units=OUTHID, 147 | nonlinearity=lasagne.nonlinearities.rectify) 148 | l_outhid_dpout = lasagne.layers.DropoutLayer(l_outhid, p=DPOUT, rescale=True) 149 | 150 | l_output = lasagne.layers.DenseLayer( 151 | l_outhid_dpout, num_units=5, nonlinearity=lasagne.nonlinearities.softmax) 152 | 153 | 154 | ########### target, cost, validation, etc. ########## 155 | target_values = T.ivector('target_output') 156 | target_values.tag.test_value = numpy.asarray([1,] * BSIZE, dtype='int32') 157 | 158 | network_output = lasagne.layers.get_output(l_output) 159 | network_prediction = T.argmax(network_output, axis=1) 160 | accuracy = T.mean(T.eq(network_prediction, target_values)) 161 | 162 | network_output_clean = lasagne.layers.get_output(l_output, deterministic=True) 163 | network_prediction_clean = T.argmax(network_output_clean, axis=1) 164 | accuracy_clean = T.mean(T.eq(network_prediction_clean, target_values)) 165 | 166 | L2_lstm = ((l_forward.W_in_to_ingate ** 2).sum() + \ 167 | (l_forward.W_hid_to_ingate ** 2).sum() + \ 168 | (l_forward.W_in_to_forgetgate ** 2).sum() + \ 169 | (l_forward.W_hid_to_forgetgate ** 2).sum() + \ 170 | (l_forward.W_in_to_cell ** 2).sum() + \ 171 | (l_forward.W_hid_to_cell ** 2).sum() + \ 172 | (l_forward.W_in_to_outgate ** 2).sum() + \ 173 | (l_forward.W_hid_to_outgate ** 2).sum() + \ 174 | (l_backward.W_in_to_ingate ** 2).sum() + \ 175 | (l_backward.W_hid_to_ingate ** 2).sum() + \ 176 | (l_backward.W_in_to_forgetgate ** 2).sum() + \ 177 | (l_backward.W_hid_to_forgetgate ** 2).sum() + \ 178 | (l_backward.W_in_to_cell ** 2).sum() + \ 179 | (l_backward.W_hid_to_cell ** 2).sum() + \ 180 | (l_backward.W_in_to_outgate ** 2).sum() + \ 181 | (l_backward.W_hid_to_outgate ** 2).sum()) 182 | L2_outputhid = (l_outhid.W ** 
2).sum() 183 | L2_softmax = (l_output.W ** 2).sum() 184 | L2 = L2_lstm + L2_outputhid + L2_softmax 185 | 186 | cost = T.mean(T.nnet.categorical_crossentropy(network_output, 187 | target_values)) + \ 188 | L2REG * L2 189 | cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean, 190 | target_values)) + \ 191 | L2REG * L2 192 | 193 | # Retrieve all parameters from the network 194 | all_params = lasagne.layers.get_all_params(l_output) 195 | if not UPDATEWE: 196 | all_params.remove(l_word_embed.W) 197 | 198 | numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]]) 199 | print("Number of params: {}\nName\t\t\tShape\t\t\tSize".format(numparams)) 200 | print("-----------------------------------------------------------------") 201 | for item in all_params: 202 | print("{0:24}{1:24}{2}".format(item, item.shape.eval(), numpy.prod(item.shape.eval()))) 203 | 204 | # If a parameter file already exists, resume from its saved values 205 | look_for = 'params' + os.sep + 'params_' + filename + '.pkl' 206 | if os.path.isfile(look_for): 207 | print("Resuming from file: " + look_for) 208 | all_param_values = cPickle.load(open(look_for, 'rb')) 209 | for p, v in zip(all_params, all_param_values): 210 | p.set_value(v) 211 | 212 | # Compute Adagrad updates for training 213 | print("Computing updates ...") 214 | updates = lasagne.updates.adagrad(cost, all_params, LR) 215 | 216 | # Theano functions for training and computing cost 217 | print("Compiling functions ...") 218 | train = theano.function( 219 | [l_in.input_var, l_mask.input_var, target_values], 220 | [cost, accuracy], updates=updates) 221 | compute_cost = theano.function( 222 | [l_in.input_var, l_mask.input_var, target_values], 223 | [cost_clean, accuracy_clean]) 224 | predict = theano.function( 225 | [l_in.input_var, l_mask.input_var], 226 | network_prediction_clean) 227 | 228 | def evaluate(mode, verbose=False): 229 | if mode == 'dev': 230 | data = dev_batches 231 | if mode == 'test': 232 | data = test_batches 233 | 234
| set_cost = 0. 235 | set_accuracy = 0. 236 | for batches_seen, (hypo, hm, truth) in enumerate(data, 1): 237 | _cost, _accuracy = compute_cost(hypo, hm, truth) 238 | set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \ 239 | 1.0 / batches_seen * _cost 240 | set_accuracy = (1.0 - 1.0 / batches_seen) * set_accuracy + \ 241 | 1.0 / batches_seen * _accuracy 242 | if verbose == True: 243 | predicted = [] 244 | truth = [] 245 | for batches_seen, (sent, mask, th) in enumerate(data, 1): 246 | predicted.append(predict(sent, mask)) 247 | truth.append(th) 248 | truth = numpy.concatenate(truth) 249 | predicted = numpy.concatenate(predicted) 250 | cm = confusion_matrix(truth, predicted) 251 | pr_a = cm.trace()*1.0 / truth.size 252 | pr_e = ((cm.sum(axis=0)*1.0/truth.size) * \ 253 | (cm.sum(axis=1)*1.0/truth.size)).sum() 254 | k = (pr_a - pr_e) / (1 - pr_e) 255 | print(mode + " set statistics:") 256 | print("kappa index of agreement: %f" % k) 257 | print("confusion matrix:") 258 | print(cm) 259 | 260 | return set_cost, set_accuracy 261 | 262 | 263 | print("Done. Evaluating scratch model ...") 264 | test_set_cost, test_set_accuracy = evaluate('test', verbose=True) 265 | print("BEFORE TRAINING: test cost %f, accuracy %f" % ( 266 | test_set_cost, test_set_accuracy)) 267 | print("Training ...") 268 | try: 269 | for epoch in range(num_epochs): 270 | train_set_cost = 0. 271 | train_set_accuracy = 0. 
272 | start = time.time() 273 | 274 | for batches_seen, (hypo, hm, truth) in enumerate(train_batches, 1): 275 | _cost, _accuracy = train(hypo, hm, truth) 276 | train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \ 277 | 1.0 / batches_seen * _cost 278 | train_set_accuracy = (1.0 - 1.0 / batches_seen) * train_set_accuracy + \ 279 | 1.0 / batches_seen * _accuracy 280 | if batches_seen % 100 == 0: 281 | end = time.time() 282 | print("Sample %d %.2fs, lr %.4f, train cost %f, accuracy %f" % ( 283 | batches_seen * BSIZE, 284 | end - start, 285 | LR, 286 | train_set_cost, 287 | train_set_accuracy)) 288 | start = end 289 | 290 | if batches_seen % 2000 == 0: 291 | dev_set_cost, dev_set_accuracy = evaluate('dev') 292 | test_set_cost, test_set_accuracy = evaluate('test') 293 | print("RECORD: cost: train %f dev %f test %f\n" 294 | " accu: train %f dev %f test %f" % ( 295 | train_set_cost, dev_set_cost, test_set_cost, 296 | train_set_accuracy, dev_set_accuracy, test_set_accuracy)) 297 | 298 | # save parameters 299 | all_param_values = [p.get_value() for p in all_params] 300 | cPickle.dump(all_param_values, 301 | open('params' + os.sep + 'params_' + filename + '.pkl', 'wb')) 302 | 303 | dev_set_cost, dev_set_accuracy = evaluate('dev') 304 | test_set_cost, test_set_accuracy = evaluate('test', verbose=True) 305 | print("RECORD:epoch %d, cost: train %f dev %f test %f\n" 306 | " accu: train %f dev %f test %f" % ( 307 | epoch, 308 | train_set_cost, dev_set_cost, test_set_cost, 309 | train_set_accuracy, dev_set_accuracy, test_set_accuracy)) 310 | except KeyboardInterrupt: 311 | pdb.set_trace() 312 | pass 313 | 314 | if __name__ == '__main__': 315 | main() 316 | 317 | -------------------------------------------------------------------------------- /dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import cPickle 3 | import theano 4 | import numpy 5 | import warnings 6 | 7 | import pdb 8 | 9 | 10 | class 
SNLI(object): 11 | def __init__(self, batch_size=50, loadall=False, 12 | datapath="/home/hantek/datasets/SNLI_GloVe_converted"): 13 | self.batch_size = batch_size 14 | self.datapath = datapath 15 | 16 | data_file = open(self.datapath, 'rb') 17 | cPickle.load(data_file) 18 | cPickle.load(data_file) 19 | self.train_set, self.dev_set, self.test_set = cPickle.load(data_file) 20 | self.weight = cPickle.load(data_file).astype(theano.config.floatX) 21 | if loadall: 22 | self.word2embed = cPickle.load(data_file) # key: word, value: embedding 23 | self.word2num = cPickle.load(data_file) # key: word, value: number 24 | self.num2word = cPickle.load(data_file) # key: number, value: word 25 | data_file.close() 26 | 27 | self.train_size = len(self.train_set) 28 | self.dev_size = len(self.dev_set) 29 | self.test_size = len(self.test_set) 30 | self.train_ptr = 0 31 | self.dev_ptr = 0 32 | self.test_ptr = 0 33 | 34 | def train_minibatch_generator(self): 35 | while self.train_ptr <= self.train_size - self.batch_size: 36 | self.train_ptr += self.batch_size 37 | minibatch = self.train_set[self.train_ptr - self.batch_size : self.train_ptr] 38 | if len (minibatch) < self.batch_size: 39 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 40 | 41 | longest_hypo, longest_premise = \ 42 | numpy.max(map(lambda x: (len(x[0]), len(x[1])), minibatch), axis=0) 43 | 44 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 45 | premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32') 46 | truth = numpy.zeros((self.batch_size,), dtype='int32') 47 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 48 | mask_premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32') 49 | for i, (h, p, t) in enumerate(minibatch): 50 | hypos[i, :len(h)] = h 51 | mask_hypos[i, :len(h)] = (1,) * len(h) 52 | premises[i, :len(p)] = p 53 | mask_premises[i, :len(p)] = (1,) * len(p) 54 | truth[i] = t 55 | 56 | yield hypos, 
mask_hypos, premises, mask_premises, truth 57 | 58 | else: 59 | self.train_ptr = 0 60 | raise StopIteration 61 | 62 | def dev_minibatch_generator(self, ): 63 | while self.dev_ptr <= self.dev_size - self.batch_size: 64 | self.dev_ptr += self.batch_size 65 | minibatch = self.dev_set[self.dev_ptr - self.batch_size : self.dev_ptr] 66 | if len (minibatch) < self.batch_size: 67 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 68 | 69 | longest_hypo, longest_premise = \ 70 | numpy.max(map(lambda x: (len(x[0]), len(x[1])), minibatch), axis=0) 71 | 72 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 73 | premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32') 74 | truth = numpy.zeros((self.batch_size,), dtype='int32') 75 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 76 | mask_premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32') 77 | for i, (h, p, t) in enumerate(minibatch): 78 | hypos[i, :len(h)] = h 79 | mask_hypos[i, :len(h)] = (1,) * len(h) 80 | premises[i, :len(p)] = p 81 | mask_premises[i, :len(p)] = (1,) * len(p) 82 | truth[i] = t 83 | 84 | yield hypos, mask_hypos, premises, mask_premises, truth 85 | 86 | else: 87 | self.dev_ptr = 0 88 | raise StopIteration 89 | 90 | def test_minibatch_generator(self, ): 91 | while self.test_ptr <= self.test_size - self.batch_size: 92 | self.test_ptr += self.batch_size 93 | minibatch = self.test_set[self.test_ptr - self.batch_size : self.test_ptr] 94 | if len (minibatch) < self.batch_size: 95 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 96 | 97 | longest_hypo, longest_premise = \ 98 | numpy.max(map(lambda x: (len(x[0]), len(x[1])), minibatch), axis=0) 99 | 100 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 101 | premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32') 102 | truth = numpy.zeros((self.batch_size,), dtype='int32') 103 | mask_hypos = 
numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 104 | mask_premises = numpy.zeros((self.batch_size, longest_premise), dtype='int32') 105 | for i, (h, p, t) in enumerate(minibatch): 106 | hypos[i, :len(h)] = h 107 | mask_hypos[i, :len(h)] = (1,) * len(h) 108 | premises[i, :len(p)] = p 109 | mask_premises[i, :len(p)] = (1,) * len(p) 110 | truth[i] = t 111 | 112 | yield hypos, mask_hypos, premises, mask_premises, truth 113 | 114 | else: 115 | self.test_ptr = 0 116 | raise StopIteration 117 | 118 | 119 | class SICK(SNLI): 120 | def __init__(self, batch_size=50, loadall=False, augment=False, 121 | datapath="/Users/johanlin/Datasets/SICK/"): 122 | self.batch_size = batch_size 123 | if augment: 124 | self.datapath = datapath + os.sep + 'SICK_augmented.pkl' 125 | else: 126 | self.datapath = datapath + os.sep + 'SICK.pkl' 127 | super(SICK, self).__init__(batch_size, loadall, self.datapath) 128 | 129 | 130 | class YELP(object): 131 | def __init__(self, batch_size=50, loadall=False, 132 | datapath="/home/hantek/datasets/NLC_data/yelp/yelp.pkl"): 133 | self.batch_size = batch_size 134 | self.datapath = datapath 135 | 136 | data_file = open(self.datapath, 'rb') 137 | cPickle.load(data_file) 138 | cPickle.load(data_file) 139 | self.train_set, self.dev_set, self.test_set = cPickle.load(data_file) 140 | self.weight = cPickle.load(data_file).astype(theano.config.floatX) 141 | if loadall: 142 | self.word2embed = cPickle.load(data_file) # key: word, value: embedding 143 | self.word2num = cPickle.load(data_file) # key: word, value: number 144 | self.num2word = cPickle.load(data_file) # key: number, value: word 145 | data_file.close() 146 | 147 | self.train_size = len(self.train_set) 148 | self.dev_size = len(self.dev_set) 149 | self.test_size = len(self.test_set) 150 | self.train_ptr = 0 151 | self.dev_ptr = 0 152 | self.test_ptr = 0 153 | 154 | def train_minibatch_generator(self): 155 | while self.train_ptr <= self.train_size - self.batch_size: 156 | self.train_ptr += 
self.batch_size 157 | minibatch = self.train_set[self.train_ptr - self.batch_size : self.train_ptr] 158 | if len (minibatch) < self.batch_size: 159 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 160 | 161 | longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0) 162 | 163 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 164 | truth = numpy.zeros((self.batch_size,), dtype='int32') 165 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 166 | for i, (h, t) in enumerate(minibatch): 167 | hypos[i, :len(h)] = h 168 | mask_hypos[i, :len(h)] = (1,) * len(h) 169 | truth[i] = t 170 | 171 | yield hypos, mask_hypos, truth 172 | 173 | else: 174 | self.train_ptr = 0 175 | raise StopIteration 176 | 177 | def dev_minibatch_generator(self, ): 178 | while self.dev_ptr <= self.dev_size - self.batch_size: 179 | self.dev_ptr += self.batch_size 180 | minibatch = self.dev_set[self.dev_ptr - self.batch_size : self.dev_ptr] 181 | if len (minibatch) < self.batch_size: 182 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 183 | 184 | longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0) 185 | 186 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 187 | truth = numpy.zeros((self.batch_size,), dtype='int32') 188 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 189 | for i, (h, t) in enumerate(minibatch): 190 | hypos[i, :len(h)] = h 191 | mask_hypos[i, :len(h)] = (1,) * len(h) 192 | truth[i] = t 193 | 194 | yield hypos, mask_hypos, truth 195 | 196 | else: 197 | self.dev_ptr = 0 198 | raise StopIteration 199 | 200 | def test_minibatch_generator(self, ): 201 | while self.test_ptr <= self.test_size - self.batch_size: 202 | self.test_ptr += self.batch_size 203 | minibatch = self.test_set[self.test_ptr - self.batch_size : self.test_ptr] 204 | if len (minibatch) < self.batch_size: 205 | warnings.warn("There will be empty slots 
in minibatch data.", UserWarning) 206 | 207 | longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0) 208 | 209 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 210 | truth = numpy.zeros((self.batch_size,), dtype='int32') 211 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 212 | for i, (h, t) in enumerate(minibatch): 213 | hypos[i, :len(h)] = h 214 | mask_hypos[i, :len(h)] = (1,) * len(h) 215 | truth[i] = t 216 | 217 | yield hypos, mask_hypos, truth 218 | 219 | else: 220 | self.test_ptr = 0 221 | raise StopIteration 222 | 223 | 224 | class AGE2(YELP): 225 | def __init__(self, batch_size=50, loadall=False, 226 | datapath="/home/hantek/datasets/NLC_data/age2/age2.pkl"): 227 | super(AGE2, self).__init__(batch_size, loadall, datapath) 228 | 229 | 230 | class STANFORDSENTIMENTTREEBANK(object): 231 | def __init__(self, batch_size=50, loadext=False, loadhelper=False, wordembed='word2vec', 232 | datapath="/home/hantek/datasets/StanfordSentimentTreebank"): 233 | self.batch_size = batch_size 234 | self.datapath = datapath 235 | 236 | save_file = open(self.datapath + os.sep + 'sst_' + wordembed + '.pkl', 'rb') 237 | cPickle.load(save_file) 238 | self.train_set, self.dev_set, self.test_set = cPickle.load(save_file) 239 | self.weight = cPickle.load(save_file).astype(theano.config.floatX) 240 | save_file.close() 241 | 242 | if loadext == True: 243 | save_file_ext = open(self.datapath + os.sep + 'sst_' + wordembed + '_ext.pkl', 'rb') 244 | train_set, dev_set, test_set = cPickle.load(save_file_ext) 245 | self.train_set += train_set 246 | self.dev_set += dev_set 247 | self.test_set += test_set 248 | save_file_ext.close() 249 | 250 | if loadhelper == True: 251 | helper = open(self.datapath + os.sep + 'sst_' + wordembed + '_helper.pkl', 'rb') 252 | self.word2embed = cPickle.load(helper) # key: word, value: embedding 253 | self.word2num = cPickle.load(helper) # key: word, value: number 254 | self.num2word = 
cPickle.load(helper) # key: number, value: word 255 | helper.close() 256 | 257 | self.train_size = len(self.train_set) 258 | self.dev_size = len(self.dev_set) 259 | self.test_size = len(self.test_set) 260 | self.train_ptr = 0 261 | self.dev_ptr = 0 262 | self.test_ptr = 0 263 | 264 | def train_minibatch_generator(self): 265 | while self.train_ptr <= self.train_size - self.batch_size: 266 | self.train_ptr += self.batch_size 267 | minibatch = self.train_set[self.train_ptr - self.batch_size : self.train_ptr] 268 | if len (minibatch) < self.batch_size: 269 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 270 | 271 | longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0) 272 | 273 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 274 | truth = numpy.zeros((self.batch_size,), dtype='int32') 275 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 276 | for i, (h, t) in enumerate(minibatch): 277 | hypos[i, :len(h)] = h 278 | mask_hypos[i, :len(h)] = (1,) * len(h) 279 | truth[i] = t 280 | 281 | yield hypos, mask_hypos, truth 282 | 283 | else: 284 | self.train_ptr = 0 285 | raise StopIteration 286 | 287 | def dev_minibatch_generator(self, ): 288 | while self.dev_ptr <= self.dev_size - self.batch_size: 289 | self.dev_ptr += self.batch_size 290 | minibatch = self.dev_set[self.dev_ptr - self.batch_size : self.dev_ptr] 291 | if len (minibatch) < self.batch_size: 292 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 293 | 294 | longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0) 295 | 296 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 297 | truth = numpy.zeros((self.batch_size,), dtype='int32') 298 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 299 | for i, (h, t) in enumerate(minibatch): 300 | hypos[i, :len(h)] = h 301 | mask_hypos[i, :len(h)] = (1,) * len(h) 302 | truth[i] = t 303 | 304 | yield 
hypos, mask_hypos, truth 305 | 306 | else: 307 | self.dev_ptr = 0 308 | raise StopIteration 309 | 310 | def test_minibatch_generator(self, ): 311 | while self.test_ptr <= self.test_size - self.batch_size: 312 | self.test_ptr += self.batch_size 313 | minibatch = self.test_set[self.test_ptr - self.batch_size : self.test_ptr] 314 | if len (minibatch) < self.batch_size: 315 | warnings.warn("There will be empty slots in minibatch data.", UserWarning) 316 | 317 | longest_hypo = numpy.max(map(lambda x: len(x[0]), minibatch), axis=0) 318 | 319 | hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 320 | truth = numpy.zeros((self.batch_size,), dtype='int32') 321 | mask_hypos = numpy.zeros((self.batch_size, longest_hypo), dtype='int32') 322 | for i, (h, t) in enumerate(minibatch): 323 | hypos[i, :len(h)] = h 324 | mask_hypos[i, :len(h)] = (1,) * len(h) 325 | truth[i] = t 326 | 327 | yield hypos, mask_hypos, truth 328 | 329 | else: 330 | self.test_ptr = 0 331 | raise StopIteration 332 | -------------------------------------------------------------------------------- /segae_gaereg.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from __future__ import print_function 5 | 6 | import time 7 | import os 8 | import sys 9 | import numpy 10 | import cPickle 11 | import theano 12 | import theano.tensor as T 13 | import lasagne 14 | from lasagne.layers.recurrent import Gate 15 | from lasagne import init, nonlinearities 16 | 17 | from util_layers import DenseLayer3DInput, Softmax3D, ApplyAttention, GatedEncoder3D 18 | from dataset import SNLI 19 | 20 | import pdb 21 | theano.config.compute_test_value = 'warn' # 'off' # Use 'warn' to activate this feature 22 | 23 | 24 | LSTM_HIDDEN = int(sys.argv[1]) # 150 Hidden unit numbers in LSTM 25 | ATTENTION_HIDDEN = int(sys.argv[2]) # 350 Hidden unit numbers in attention MLP 26 | OUT_HIDDEN = int(sys.argv[3]) # 3000 Hidden unit numbers 
in output MLP 27 | N_ROWS = int(sys.argv[4]) # 10 Number of rows in matrix representation 28 | LEARNING_RATE = float(sys.argv[5]) # 0.01 29 | ATTENTION_PENALTY = float(sys.argv[6]) # 1. 30 | GAEREG = float(sys.argv[7]) # 0.5 Dropout in GAE 31 | WE_DIM = int(sys.argv[8]) # 300 Dim of word embedding 32 | BATCH_SIZE = int(sys.argv[9]) # 50 Minibatch size 33 | GRAD_CLIP = int(sys.argv[10]) # 100 All gradients above this will be clipped 34 | NUM_EPOCHS = int(sys.argv[11]) # 12 Number of epochs to train the net 35 | STD = float(sys.argv[12]) # 0.1 Standard deviation of weights in initialization 36 | filename = __file__.split('.')[0] + \ 37 | '_LSTMHIDDEN' + str(LSTM_HIDDEN) + \ 38 | '_ATTENTIONHIDDEN' + str(ATTENTION_HIDDEN) + \ 39 | '_OUTHIDDEN' + str(OUT_HIDDEN) + \ 40 | '_NROWS' + str(N_ROWS) + \ 41 | '_LEARNINGRATE' + str(LEARNING_RATE) + \ 42 | '_ATTENTIONPENALTY' + str(ATTENTION_PENALTY) + \ 43 | '_GAEREG' + str(GAEREG) + \ 44 | '_WEDIM' + str(WE_DIM) + \ 45 | '_BATCHSIZE' + str(BATCH_SIZE) + \ 46 | '_GRADCLIP' + str(GRAD_CLIP) + \ 47 | '_NUMEPOCHS' + str(NUM_EPOCHS) + \ 48 | '_STD' + str(STD) 49 | 50 | 51 | def main(num_epochs=NUM_EPOCHS): 52 | print("Loading data ...") 53 | snli = SNLI(batch_size=BATCH_SIZE) 54 | train_batches = list(snli.train_minibatch_generator()) 55 | dev_batches = list(snli.dev_minibatch_generator()) 56 | test_batches = list(snli.test_minibatch_generator()) 57 | W_word_embedding = snli.weight # W shape: (# vocab size, WE_DIM) 58 | del snli 59 | 60 | print("Building network ...") 61 | ########### sentence embedding encoder ########### 62 | # sentence vector, with each number standing for a word number 63 | input_var = T.TensorType('int32', [False, False])('sentence_vector') 64 | input_var.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (50, 20), 'int32'), 65 | numpy.zeros((50, 5)).astype('int32'))) 66 | input_var.tag.test_value[1, 20:22] = (413, 45) 67 | l_in = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), 
input_var=input_var) 68 | 69 | input_mask = T.TensorType('int32', [False, False])('sentence_mask') 70 | input_mask.tag.test_value = numpy.hstack((numpy.ones((50, 20), dtype='int32'), 71 | numpy.zeros((50, 5), dtype='int32'))) 72 | input_mask.tag.test_value[1, 20:22] = 1 73 | l_mask = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_mask) 74 | 75 | # output shape (BATCH_SIZE, None, WE_DIM) 76 | l_word_embed = lasagne.layers.EmbeddingLayer( 77 | l_in, 78 | input_size=W_word_embedding.shape[0], 79 | output_size=W_word_embedding.shape[1], 80 | W=W_word_embedding) # how to set it to be non-trainable? 81 | 82 | 83 | # bidirectional LSTM 84 | l_forward = lasagne.layers.LSTMLayer( 85 | l_word_embed, mask_input=l_mask, num_units=LSTM_HIDDEN, 86 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 87 | W_cell=init.Normal(STD)), 88 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 89 | W_cell=init.Normal(STD)), 90 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 91 | W_cell=None, nonlinearity=nonlinearities.tanh), 92 | outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 93 | W_cell=init.Normal(STD)), 94 | nonlinearity=lasagne.nonlinearities.tanh, 95 | peepholes = False, 96 | grad_clipping=GRAD_CLIP) 97 | 98 | l_backward = lasagne.layers.LSTMLayer( 99 | l_word_embed, mask_input=l_mask, num_units=LSTM_HIDDEN, 100 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 101 | W_cell=init.Normal(STD)), 102 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 103 | W_cell=init.Normal(STD)), 104 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 105 | W_cell=None, nonlinearity=nonlinearities.tanh), 106 | outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 107 | W_cell=init.Normal(STD)), 108 | nonlinearity=lasagne.nonlinearities.tanh, 109 | peepholes = False, 110 | grad_clipping=GRAD_CLIP, backwards=True) 111 | 112 | # output dim: (BATCH_SIZE, None, 2*LSTM_HIDDEN) 113 | l_concat = 
lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2) 114 | 115 | # Attention mechanism to get sentence embedding 116 | # output dim: (BATCH_SIZE, None, ATTENTION_HIDDEN) 117 | l_ws1 = DenseLayer3DInput(l_concat, num_units=ATTENTION_HIDDEN) 118 | # output dim: (BATCH_SIZE, None, N_ROWS) 119 | l_ws2 = DenseLayer3DInput(l_ws1, num_units=N_ROWS, nonlinearity=None) 120 | l_annotations = Softmax3D(l_ws2, mask=l_mask) 121 | # output dim: (BATCH_SIZE, 2*LSTM_HIDDEN, N_ROWS) 122 | l_sentence_embedding = ApplyAttention([l_annotations, l_concat]) 123 | 124 | # beam search? Bi lstm in the sentence embedding layer? etc. 125 | 126 | 127 | ########### get embeddings for hypothesis and premise ########### 128 | # hypothesis 129 | input_var_h = T.TensorType('int32', [False, False])('hypothesis_vector') 130 | input_var_h.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (50, 18), 'int32'), 131 | numpy.zeros((50, 6)).astype('int32'))) 132 | l_in_h = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_var_h) 133 | 134 | input_mask_h = T.TensorType('int32', [False, False])('hypo_mask') 135 | input_mask_h.tag.test_value = numpy.hstack((numpy.ones((50, 18), dtype='int32'), 136 | numpy.zeros((50, 6), dtype='int32'))) 137 | input_mask_h.tag.test_value[1, 18:22] = 1 138 | l_mask_h = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_mask_h) 139 | 140 | # premise 141 | input_var_p = T.TensorType('int32', [False, False])('premise_vector') 142 | input_var_p.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (50, 16), 'int32'), 143 | numpy.zeros((50, 3)).astype('int32'))) 144 | l_in_p = lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_var_p) 145 | 146 | input_mask_p = T.TensorType('int32', [False, False])('premise_mask') 147 | input_mask_p.tag.test_value = numpy.hstack((numpy.ones((50, 16), dtype='int32'), 148 | numpy.zeros((50, 3), dtype='int32'))) 149 | input_mask_p.tag.test_value[1, 16:18] = 1 150 | l_mask_p = 
lasagne.layers.InputLayer(shape=(BATCH_SIZE, None), input_var=input_mask_p) 151 | 152 | 153 | hypothesis_embedding, hypothesis_annotation = lasagne.layers.get_output( 154 | [l_sentence_embedding, l_annotations], 155 | {l_in: l_in_h.input_var, l_mask: l_mask_h.input_var}) 156 | premise_embedding, premise_annotation = lasagne.layers.get_output( 157 | [l_sentence_embedding, l_annotations], 158 | {l_in: l_in_p.input_var, l_mask: l_mask_p.input_var}) 159 | 160 | 161 | ########### gated encoder and output MLP ########## 162 | l_hypo_embed = lasagne.layers.InputLayer( 163 | shape=(BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN), input_var=hypothesis_embedding) 164 | l_pre_embed = lasagne.layers.InputLayer( 165 | shape=(BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN), input_var=premise_embedding) 166 | 167 | # output dim: (BATCH_SIZE, 2*LSTM_HIDDEN, N_ROWS) 168 | l_factors = GatedEncoder3D([l_hypo_embed, l_pre_embed], num_hfactors=2*LSTM_HIDDEN) 169 | 170 | # Dropout: 171 | l_factors_noise = lasagne.layers.DropoutLayer(l_factors, p=GAEREG, rescale=True) 172 | 173 | # l_hids = DenseLayer3DWeight() 174 | 175 | l_outhid = lasagne.layers.DenseLayer( 176 | l_factors_noise, num_units=OUT_HIDDEN, nonlinearity=lasagne.nonlinearities.rectify) 177 | 178 | # Dropout: 179 | l_outhid_noise = lasagne.layers.DropoutLayer(l_outhid, p=GAEREG, rescale=True) 180 | 181 | l_output = lasagne.layers.DenseLayer( 182 | l_outhid_noise, num_units=3, nonlinearity=lasagne.nonlinearities.softmax) 183 | 184 | 185 | ########### target, cost, validation, etc. 
########## 186 | target_values = T.ivector('target_output') 187 | target_values.tag.test_value = numpy.asarray([1,] * 50, dtype='int32') 188 | 189 | network_output = lasagne.layers.get_output(l_output) 190 | network_output_clean = lasagne.layers.get_output(l_output, deterministic=True) 191 | 192 | # penalty term and cost 193 | attention_penalty = T.mean((T.batched_dot( 194 | hypothesis_annotation, 195 | # pay attention to this line: 196 | # T.extra_ops.cpu_contiguous(hypothesis_annotation.dimshuffle(0, 2, 1)) 197 | hypothesis_annotation.dimshuffle(0, 2, 1) 198 | ) - T.eye(hypothesis_annotation.shape[1]).dimshuffle('x', 0, 1) 199 | )**2, axis=(0, 1, 2)) + T.mean((T.batched_dot( 200 | premise_annotation, 201 | # T.extra_ops.cpu_contiguous(premise_annotation.dimshuffle(0, 2, 1)) # ditto. 202 | premise_annotation.dimshuffle(0, 2, 1) # ditto. 203 | ) - T.eye(premise_annotation.shape[1]).dimshuffle('x', 0, 1))**2, axis=(0, 1, 2)) 204 | 205 | cost = T.mean(T.nnet.categorical_crossentropy(network_output, target_values) + \ 206 | ATTENTION_PENALTY * attention_penalty) 207 | cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean, target_values) + \ 208 | ATTENTION_PENALTY * attention_penalty) 209 | 210 | # Retrieve all parameters from the network 211 | all_params = lasagne.layers.get_all_params(l_output) + \ 212 | lasagne.layers.get_all_params(l_sentence_embedding) 213 | numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]]) 214 | print("Number of params: {}".format(numparams)) 215 | 216 | # if exist param file then load params 217 | look_for = 'params' + os.sep + 'params_' + filename + '.pkl' 218 | if os.path.isfile(look_for): 219 | print("Resuming from file: " + look_for) 220 | all_param_values = cPickle.load(open(look_for, 'rb')) 221 | for p, v in zip(all_params, all_param_values): 222 | p.set_value(v) 223 | 224 | # withoutwe_params = all_params + [l_word_embed.W] 225 | 226 | # Compute updates for training 227 | print("Computing 
updates ...") 228 | updates = lasagne.updates.adagrad(cost, all_params, LEARNING_RATE) 229 | 230 | # Theano functions for training and computing cost 231 | print("Compiling functions ...") 232 | network_prediction = T.argmax(network_output, axis=1) 233 | error_rate = T.mean(T.neq(network_prediction, target_values)) 234 | network_prediction_clean = T.argmax(network_output_clean, axis=1) 235 | error_rate_clean = T.mean(T.neq(network_prediction_clean, target_values)) 236 | 237 | train = theano.function( 238 | [l_in_h.input_var, l_mask_h.input_var, 239 | l_in_p.input_var, l_mask_p.input_var, target_values], 240 | [cost, error_rate], updates=updates) 241 | compute_cost = theano.function( 242 | [l_in_h.input_var, l_mask_h.input_var, 243 | l_in_p.input_var, l_mask_p.input_var, target_values], 244 | [cost_clean, error_rate_clean]) 245 | 246 | def evaluate(mode): 247 | if mode == 'dev': 248 | data = dev_batches 249 | if mode == 'test': 250 | data = test_batches 251 | 252 | set_cost = 0. 253 | set_error_rate = 0. 254 | for batches_seen, (hypo, hm, premise, pm, truth) in enumerate(data, 1): 255 | _cost, _error = compute_cost(hypo, hm, premise, pm, truth) 256 | set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \ 257 | 1.0 / batches_seen * _cost 258 | set_error_rate = (1.0 - 1.0 / batches_seen) * set_error_rate + \ 259 | 1.0 / batches_seen * _error 260 | 261 | return set_cost, set_error_rate 262 | 263 | dev_set_cost, dev_set_error = evaluate('dev') 264 | print("BEFORE TRAINING: dev cost %f, error %f" % (dev_set_cost, dev_set_error)) 265 | print("Training ...") 266 | try: 267 | for epoch in range(num_epochs): 268 | train_set_cost = 0. 269 | train_set_error = 0. 
270 | start = time.time() 271 | 272 | for batches_seen, (hypo, hm, premise, pm, truth) in enumerate( 273 | train_batches, 1): 274 | _cost, _error = train(hypo, hm, premise, pm, truth) 275 | train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \ 276 | 1.0 / batches_seen * _cost 277 | train_set_error = (1.0 - 1.0 / batches_seen) * train_set_error + \ 278 | 1.0 / batches_seen * _error 279 | if batches_seen % 100 == 0: 280 | end = time.time() 281 | print("Sample %d %.2fs, lr %.4f, train cost %f, error %f" % ( 282 | batches_seen * BATCH_SIZE, 283 | end - start, 284 | LEARNING_RATE, 285 | train_set_cost, 286 | train_set_error)) 287 | start = end 288 | 289 | if batches_seen % 2000 == 0: 290 | dev_set_cost, dev_set_error = evaluate('dev') 291 | test_set_cost, test_set_error = evaluate('test') 292 | print("***dev cost %f, error %f" % (dev_set_cost, dev_set_error)) 293 | print("***test cost %f, error %f" % (test_set_cost, test_set_error)) 294 | 295 | # save parameters 296 | all_param_values = [p.get_value() for p in all_params] 297 | cPickle.dump(all_param_values, 298 | open('params' + os.sep + 'params_' + filename + '.pkl', 'wb')) 299 | 300 | # load params 301 | # all_param_values = cPickle.load(open('params' + os.sep + 'params_' + filename, 'rb')) 302 | # for p, v in zip(all_params, all_param_values): 303 | # p.set_value(v) 304 | 305 | dev_set_cost, dev_set_error = evaluate('dev') 306 | test_set_cost, test_set_error = evaluate('test') 307 | 308 | print("epoch %d, cost: train %f dev %f test %f;\n" 309 | " error train %f dev %f test %f" % ( 310 | epoch, 311 | train_set_cost, dev_set_cost, test_set_cost, 312 | train_set_error, dev_set_error, test_set_error)) 313 | except KeyboardInterrupt: 314 | pdb.set_trace() 315 | pass 316 | 317 | if __name__ == '__main__': 318 | main() 319 | -------------------------------------------------------------------------------- /segae_l2_dpout.py: -------------------------------------------------------------------------------- 1 | 
#!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from __future__ import print_function 5 | 6 | import time 7 | import os 8 | import sys 9 | import numpy 10 | import cPickle 11 | import theano 12 | import theano.tensor as T 13 | from sklearn.metrics import confusion_matrix 14 | import lasagne 15 | from lasagne.layers.recurrent import Gate 16 | from lasagne import init, nonlinearities 17 | 18 | from util_layers import DenseLayer3DInput, Softmax3D, ApplyAttention, GatedEncoder3D 19 | from dataset import SNLI 20 | 21 | import pdb 22 | theano.config.compute_test_value = 'warn' # 'off' # Use 'warn' to activate this feature 23 | 24 | 25 | LSTMHID = int(sys.argv[1]) # 150 Hidden unit numbers in LSTM 26 | ATTHID = int(sys.argv[2]) # 350 Hidden unit numbers in attention MLP 27 | OUTHID = int(sys.argv[3]) # 3000 Hidden unit numbers in output MLP 28 | NROW = int(sys.argv[4]) # 10 Number of rows in matrix representation 29 | LR = float(sys.argv[5]) # 0.01 30 | L2REG = float(sys.argv[6]) # 0.0001 L2 regularization 31 | DPOUT = float(sys.argv[7]) # 0.3 dropout rate 32 | ATTPENALTY = float(sys.argv[8]) # 1. 33 | WEDIM = int(sys.argv[9]) # 300 Dim of word embedding 34 | BSIZE = int(sys.argv[10]) # 50 Minibatch size 35 | GCLIP = float(sys.argv[11]) # 0.5 All gradients above this will be clipped 36 | NEPOCH = int(sys.argv[12]) # 12 Number of epochs to train the net 37 | STD = float(sys.argv[13]) # 0.1 Standard deviation of weights in initialization 38 | UPDATEWE = bool(int(sys.argv[14])) # 0 for False and 1 for True. 
Update word embedding in training 39 | filename = __file__.split('.')[0] + \ 40 | '_LSTMHID' + str(LSTMHID) + \ 41 | '_ATTHID' + str(ATTHID) + \ 42 | '_OUTHID' + str(OUTHID) + \ 43 | '_NROWS' + str(NROW) + \ 44 | '_LR' + str(LR) + \ 45 | '_L2REG' + str(L2REG) + \ 46 | '_DPOUT' + str(DPOUT) + \ 47 | '_ATTPENALTY' + str(ATTPENALTY) + \ 48 | '_WEDIM' + str(WEDIM) + \ 49 | '_BSIZE' + str(BSIZE) + \ 50 | '_GCLIP' + str(GCLIP) + \ 51 | '_NEPOCH' + str(NEPOCH) + \ 52 | '_STD' + str(STD) + \ 53 | '_UPDATEWE' + str(UPDATEWE) 54 | 55 | def main(num_epochs=NEPOCH): 56 | print("Loading data ...") 57 | snli = SNLI(batch_size=BSIZE) 58 | train_batches = list(snli.train_minibatch_generator()) 59 | dev_batches = list(snli.dev_minibatch_generator()) 60 | test_batches = list(snli.test_minibatch_generator()) 61 | W_word_embedding = snli.weight # W shape: (# vocab size, WE_DIM) 62 | del snli 63 | 64 | print("Building network ...") 65 | ########### sentence embedding encoder ########### 66 | # sentence vector, with each number standing for a word number 67 | input_var = T.TensorType('int32', [False, False])('sentence_vector') 68 | input_var.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 20), 'int32'), 69 | numpy.zeros((BSIZE, 5)).astype('int32'))) 70 | input_var.tag.test_value[1, 20:22] = (413, 45) 71 | l_in = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var) 72 | 73 | input_mask = T.TensorType('int32', [False, False])('sentence_mask') 74 | input_mask.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 20), dtype='int32'), 75 | numpy.zeros((BSIZE, 5), dtype='int32'))) 76 | input_mask.tag.test_value[1, 20:22] = 1 77 | l_mask = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask) 78 | 79 | # output shape (BSIZE, None, WEDIM) 80 | l_word_embed = lasagne.layers.EmbeddingLayer( 81 | l_in, 82 | input_size=W_word_embedding.shape[0], 83 | output_size=W_word_embedding.shape[1], 84 | W=W_word_embedding) 85 | 86 | # bidirectional LSTM 87 | 
l_forward = lasagne.layers.LSTMLayer( 88 | l_word_embed, mask_input=l_mask, num_units=LSTMHID, 89 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 90 | W_cell=init.Normal(STD)), 91 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 92 | W_cell=init.Normal(STD)), 93 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 94 | W_cell=None, nonlinearity=nonlinearities.tanh), 95 | outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 96 | W_cell=init.Normal(STD)), 97 | nonlinearity=lasagne.nonlinearities.tanh, 98 | peepholes = False, 99 | grad_clipping=GCLIP) 100 | 101 | l_backward = lasagne.layers.LSTMLayer( 102 | l_word_embed, mask_input=l_mask, num_units=LSTMHID, 103 | ingate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 104 | W_cell=init.Normal(STD)), 105 | forgetgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 106 | W_cell=init.Normal(STD)), 107 | cell=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 108 | W_cell=None, nonlinearity=nonlinearities.tanh), 109 | outgate=Gate(W_in=init.Normal(STD), W_hid=init.Normal(STD), 110 | W_cell=init.Normal(STD)), 111 | nonlinearity=lasagne.nonlinearities.tanh, 112 | peepholes = False, 113 | grad_clipping=GCLIP, backwards=True) 114 | 115 | # output dim: (BSIZE, None, 2*LSTMHID) 116 | l_concat = lasagne.layers.ConcatLayer([l_forward, l_backward], axis=2) 117 | l_concat_dpout = lasagne.layers.DropoutLayer(l_concat, p=DPOUT, rescale=True) # might not need this line 118 | 119 | # Attention mechanism to get sentence embedding 120 | # output dim: (BSIZE, None, ATTHID) 121 | l_ws1 = DenseLayer3DInput(l_concat_dpout, num_units=ATTHID) 122 | l_ws1_dpout = lasagne.layers.DropoutLayer(l_ws1, p=DPOUT, rescale=True) 123 | 124 | # output dim: (BSIZE, None, NROW) 125 | l_ws2 = DenseLayer3DInput(l_ws1_dpout, num_units=NROW, nonlinearity=None) 126 | l_annotations = Softmax3D(l_ws2, mask=l_mask) 127 | # output dim: (BSIZE, 2*LSTMHID, NROW) 128 | l_sentence_embedding = 
ApplyAttention([l_annotations, l_concat]) 129 | 130 | # beam search? Bi lstm in the sentence embedding layer? etc. 131 | 132 | 133 | ########### get embeddings for hypothesis and premise ########### 134 | # hypothesis 135 | input_var_h = T.TensorType('int32', [False, False])('hypothesis_vector') 136 | input_var_h.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 18), 'int32'), 137 | numpy.zeros((BSIZE, 6)).astype('int32'))) 138 | l_in_h = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var_h) 139 | 140 | input_mask_h = T.TensorType('int32', [False, False])('hypo_mask') 141 | input_mask_h.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 18), dtype='int32'), 142 | numpy.zeros((BSIZE, 6), dtype='int32'))) 143 | input_mask_h.tag.test_value[1, 18:22] = 1 144 | l_mask_h = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask_h) 145 | 146 | # premise 147 | input_var_p = T.TensorType('int32', [False, False])('premise_vector') 148 | input_var_p.tag.test_value = numpy.hstack((numpy.random.randint(1, 10000, (BSIZE, 16), 'int32'), 149 | numpy.zeros((BSIZE, 3)).astype('int32'))) 150 | l_in_p = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_var_p) 151 | 152 | input_mask_p = T.TensorType('int32', [False, False])('premise_mask') 153 | input_mask_p.tag.test_value = numpy.hstack((numpy.ones((BSIZE, 16), dtype='int32'), 154 | numpy.zeros((BSIZE, 3), dtype='int32'))) 155 | input_mask_p.tag.test_value[1, 16:18] = 1 156 | l_mask_p = lasagne.layers.InputLayer(shape=(BSIZE, None), input_var=input_mask_p) 157 | 158 | 159 | hypothesis_embedding, hypothesis_annotation = lasagne.layers.get_output( 160 | [l_sentence_embedding, l_annotations], 161 | {l_in: l_in_h.input_var, l_mask: l_mask_h.input_var}) 162 | premise_embedding, premise_annotation = lasagne.layers.get_output( 163 | [l_sentence_embedding, l_annotations], 164 | {l_in: l_in_p.input_var, l_mask: l_mask_p.input_var}) 165 | 166 | hypothesis_embedding_clean, 
hypothesis_annotation_clean = lasagne.layers.get_output( 167 | [l_sentence_embedding, l_annotations], 168 | {l_in: l_in_h.input_var, l_mask: l_mask_h.input_var}, deterministic=True) 169 | premise_embedding_clean, premise_annotation_clean = lasagne.layers.get_output( 170 | [l_sentence_embedding, l_annotations], 171 | {l_in: l_in_p.input_var, l_mask: l_mask_p.input_var}, deterministic=True) 172 | 173 | ########### gated encoder and output MLP ########## 174 | l_hypo_embed = lasagne.layers.InputLayer( 175 | shape=(BSIZE, NROW, 2*LSTMHID), input_var=hypothesis_embedding) 176 | l_hypo_embed_dpout = lasagne.layers.DropoutLayer(l_hypo_embed, p=DPOUT, rescale=True) 177 | l_pre_embed = lasagne.layers.InputLayer( 178 | shape=(BSIZE, NROW, 2*LSTMHID), input_var=premise_embedding) 179 | l_pre_embed_dpout = lasagne.layers.DropoutLayer(l_pre_embed, p=DPOUT, rescale=True) 180 | 181 | # output dim: (BSIZE, NROW, 2*LSTMHID) 182 | l_factors = GatedEncoder3D([l_hypo_embed_dpout, l_pre_embed_dpout], num_hfactors=2*LSTMHID) 183 | l_factors_dpout = lasagne.layers.DropoutLayer(l_factors, p=DPOUT, rescale=True) 184 | 185 | # l_hids = DenseLayer3DWeight() 186 | 187 | l_outhid = lasagne.layers.DenseLayer( 188 | l_factors_dpout, num_units=OUTHID, nonlinearity=lasagne.nonlinearities.rectify) 189 | l_outhid_dpout = lasagne.layers.DropoutLayer(l_outhid, p=DPOUT, rescale=True) 190 | 191 | l_output = lasagne.layers.DenseLayer( 192 | l_outhid_dpout, num_units=3, nonlinearity=lasagne.nonlinearities.softmax) 193 | 194 | 195 | ########### target, cost, validation, etc. 
########## 196 | target_values = T.ivector('target_output') 197 | target_values.tag.test_value = numpy.asarray([1,] * BSIZE, dtype='int32') 198 | 199 | network_output = lasagne.layers.get_output(l_output) 200 | network_prediction = T.argmax(network_output, axis=1) 201 | accuracy = T.mean(T.eq(network_prediction, target_values)) 202 | 203 | network_output_clean = lasagne.layers.get_output( 204 | l_output, 205 | {l_hypo_embed: hypothesis_embedding_clean, 206 | l_pre_embed: premise_embedding_clean}, 207 | deterministic=True) 208 | network_prediction_clean = T.argmax(network_output_clean, axis=1) 209 | accuracy_clean = T.mean(T.eq(network_prediction_clean, target_values)) 210 | 211 | # penalty term and cost 212 | attention_penalty = T.mean((T.batched_dot( 213 | hypothesis_annotation, 214 | hypothesis_annotation.dimshuffle(0, 2, 1) 215 | ) - T.eye(hypothesis_annotation.shape[1]).dimshuffle('x', 0, 1) 216 | )**2, axis=(0, 1, 2)) + T.mean((T.batched_dot( 217 | premise_annotation, 218 | premise_annotation.dimshuffle(0, 2, 1) 219 | ) - T.eye(premise_annotation.shape[1]).dimshuffle('x', 0, 1))**2, axis=(0, 1, 2)) 220 | 221 | L2_lstm = ((l_forward.W_in_to_ingate ** 2).sum() + \ 222 | (l_forward.W_hid_to_ingate ** 2).sum() + \ 223 | (l_forward.W_in_to_forgetgate ** 2).sum() + \ 224 | (l_forward.W_hid_to_forgetgate ** 2).sum() + \ 225 | (l_forward.W_in_to_cell ** 2).sum() + \ 226 | (l_forward.W_hid_to_cell ** 2).sum() + \ 227 | (l_forward.W_in_to_outgate ** 2).sum() + \ 228 | (l_forward.W_hid_to_outgate ** 2).sum() + \ 229 | (l_backward.W_in_to_ingate ** 2).sum() + \ 230 | (l_backward.W_hid_to_ingate ** 2).sum() + \ 231 | (l_backward.W_in_to_forgetgate ** 2).sum() + \ 232 | (l_backward.W_hid_to_forgetgate ** 2).sum() + \ 233 | (l_backward.W_in_to_cell ** 2).sum() + \ 234 | (l_backward.W_hid_to_cell ** 2).sum() + \ 235 | (l_backward.W_in_to_outgate ** 2).sum() + \ 236 | (l_backward.W_hid_to_outgate ** 2).sum()) 237 | L2_attention = (l_ws1.W ** 2).sum() + (l_ws2.W ** 2).sum() 238 
| L2_gae = (l_factors.Wxf ** 2).sum() + (l_factors.Wyf ** 2).sum() 239 | L2_outputhid = (l_outhid.W ** 2).sum() 240 | L2_softmax = (l_output.W ** 2).sum() 241 | L2 = L2_lstm + L2_attention + L2_gae + L2_outputhid + L2_softmax 242 | 243 | cost = T.mean(T.nnet.categorical_crossentropy(network_output, target_values)) + \ 244 | L2REG * L2 245 | cost_clean = T.mean(T.nnet.categorical_crossentropy(network_output_clean, target_values)) + \ 246 | L2REG * L2 247 | if ATTPENALTY != 0.: 248 | cost = cost + ATTPENALTY * attention_penalty 249 | cost_clean = cost_clean + ATTPENALTY * attention_penalty 250 | 251 | # Retrieve all parameters from the network 252 | all_params = lasagne.layers.get_all_params(l_output) + \ 253 | lasagne.layers.get_all_params(l_sentence_embedding) 254 | if not UPDATEWE: 255 | all_params.remove(l_word_embed.W) 256 | 257 | numparams = sum([numpy.prod(i) for i in [i.shape.eval() for i in all_params]]) 258 | print("Number of params: {}\nName\t\t\tShape\t\t\tSize".format(numparams)) 259 | print("-----------------------------------------------------------------") 260 | for item in all_params: 261 | print("{0:24}{1:24}{2}".format(item, item.shape.eval(), numpy.prod(item.shape.eval()))) 262 | 263 | # if exist param file then load params 264 | look_for = 'params' + os.sep + 'params_' + filename + '.pkl' 265 | if os.path.isfile(look_for): 266 | print("Resuming from file: " + look_for) 267 | all_param_values = cPickle.load(open(look_for, 'rb')) 268 | for p, v in zip(all_params, all_param_values): 269 | p.set_value(v) 270 | 271 | # Compute SGD updates for training 272 | print("Computing updates ...") 273 | updates = lasagne.updates.adagrad(cost, all_params, LR) 274 | 275 | # Theano functions for training and computing cost 276 | print("Compiling functions ...") 277 | train = theano.function( 278 | [l_in_h.input_var, l_mask_h.input_var, 279 | l_in_p.input_var, l_mask_p.input_var, target_values], 280 | [cost, accuracy], updates=updates) 281 | compute_cost = 
theano.function( 282 | [l_in_h.input_var, l_mask_h.input_var, 283 | l_in_p.input_var, l_mask_p.input_var, target_values], 284 | [cost_clean, accuracy_clean]) 285 | predict = theano.function( 286 | [l_in_h.input_var, l_mask_h.input_var, 287 | l_in_p.input_var, l_mask_p.input_var], 288 | network_prediction_clean) 289 | 290 | def evaluate(mode, verbose=False): 291 | if mode == 'dev': 292 | data = dev_batches 293 | if mode == 'test': 294 | data = test_batches 295 | 296 | set_cost = 0. 297 | set_accuracy = 0. 298 | for batches_seen, (hypo, hm, premise, pm, truth) in enumerate(data, 1): 299 | _cost, _accuracy = compute_cost(hypo, hm, premise, pm, truth) 300 | set_cost = (1.0 - 1.0 / batches_seen) * set_cost + \ 301 | 1.0 / batches_seen * _cost 302 | set_accuracy = (1.0 - 1.0 / batches_seen) * set_accuracy + \ 303 | 1.0 / batches_seen * _accuracy 304 | 305 | if verbose == True: 306 | predicted = [] 307 | truth = [] 308 | for batches_seen, (hypo, hm, premise, pm, th) in enumerate(data, 1): 309 | predicted.append(predict(hypo, hm, premise, pm)) 310 | truth.append(th) 311 | truth = numpy.concatenate(truth) 312 | predicted = numpy.concatenate(predicted) 313 | cm = confusion_matrix(truth, predicted) 314 | pr_a = cm.trace()*1.0 / truth.size 315 | pr_e = ((cm.sum(axis=0)*1.0/truth.size) * \ 316 | (cm.sum(axis=1)*1.0/truth.size)).sum() 317 | k = (pr_a - pr_e) / (1 - pr_e) 318 | print(mode + " set statistics:") 319 | print("kappa index of agreement: %f" % k) 320 | print("confusion matrix:") 321 | print(cm) 322 | 323 | return set_cost, set_accuracy 324 | 325 | print("Done. Evaluating scratch model ...") 326 | test_set_cost, test_set_accuracy = evaluate('test', verbose=True) 327 | print("BEFORE TRAINING: test cost %f, accuracy %f" % (test_set_cost, test_set_accuracy)) 328 | print("Training ...") 329 | try: 330 | for epoch in range(num_epochs): 331 | train_set_cost = 0. 332 | train_set_accuracy = 0. 
333 | start = time.time() 334 | 335 | for batches_seen, (hypo, hm, premise, pm, truth) in enumerate( 336 | train_batches, 1): 337 | _cost, _accuracy = train(hypo, hm, premise, pm, truth) 338 | train_set_cost = (1.0 - 1.0 / batches_seen) * train_set_cost + \ 339 | 1.0 / batches_seen * _cost 340 | train_set_accuracy = (1.0 - 1.0 / batches_seen) * train_set_accuracy + \ 341 | 1.0 / batches_seen * _accuracy 342 | if batches_seen % 100 == 0: 343 | end = time.time() 344 | print("Sample %d %.2fs, lr %.4f, train cost %f, accuracy %f" % ( 345 | batches_seen * BSIZE, 346 | end - start, 347 | LR, 348 | train_set_cost, 349 | train_set_accuracy)) 350 | start = end 351 | 352 | if batches_seen % 2000 == 0: 353 | dev_set_cost, dev_set_accuracy = evaluate('dev') 354 | print("***dev cost %f, accuracy %f" % (dev_set_cost, dev_set_accuracy)) 355 | 356 | # save parameters 357 | all_param_values = [p.get_value() for p in all_params] 358 | cPickle.dump(all_param_values, 359 | open('params' + os.sep + 'params_' + filename + '.pkl', 'wb')) 360 | 361 | dev_set_cost, dev_set_accuracy = evaluate('dev') 362 | test_set_cost, test_set_accuracy = evaluate('test', verbose=True) 363 | 364 | print("epoch %d, cost: train %f dev %f test %f;\n" 365 | " accu: train %f dev %f test %f" % ( 366 | epoch, 367 | train_set_cost, dev_set_cost, test_set_cost, 368 | train_set_accuracy, dev_set_accuracy, test_set_accuracy)) 369 | except KeyboardInterrupt: 370 | pdb.set_trace() 371 | pass 372 | 373 | if __name__ == '__main__': 374 | main() 375 | 376 | -------------------------------------------------------------------------------- /util_layers.py: -------------------------------------------------------------------------------- 1 | import numpy 2 | import theano 3 | import theano.tensor as T 4 | 5 | from lasagne import nonlinearities, init 6 | from lasagne.layers.base import Layer, MergeLayer 7 | 8 | import pdb 9 | 10 | 11 | class FlatConcat(MergeLayer): 12 | """ 13 | ConCatLayer but Flattened to 2 dims before 
concatenation. 14 | Accepts more than 2 inputs, but all inputs must have the same size in 15 | the first dimension. This layer flattens every input to a 2-D matrix and 16 | concatenates them along the second dimension. 17 | 18 | """ 19 | def get_output_shape_for(self, input_shapes): 20 | output_shapes = [] 21 | for shape in input_shapes: 22 | output_shapes.append((shape[0], numpy.prod(shape[1:]))) 23 | return (output_shapes[0][0], sum([i[-1] for i in output_shapes])) 24 | 25 | def get_output_for(self, inputs, **kwargs): 26 | inputs = [i.flatten(2) for i in inputs] 27 | return T.concatenate(inputs, axis=1) 28 | 29 | 30 | class DenseLayerTensorDot(Layer): 31 | """ 32 | Multiply N 3D matrices along two dimensions of a 3D matrix, and produce a 33 | 3D output. In the batch training case, this setting corresponds to: 34 | 35 | Input shape: (dim1, dim2, dim3, dim4) # (BATCH_SIZE, num_inputslices, N_ROWS, num_inputfeatures) 36 | weight shape: There are two types of weight dims: 37 | 'col': (num_slices, num_features, dim2, dim4) 38 | 'row': (num_slices, num_features, dim2, dim3) 39 | Output shape: There are two types of output shapes: 40 | 'col': (dim1, num_slices, dim3, num_features) 41 | # (BSIZE, num_slices, N_ROWS, num_features) 42 | 'row': (dim1, num_slices, num_features, num_inputfeatures) 43 | # (BSIZE, num_slices, num_features, num_inputfeatures) 44 | 45 | direction: 'row': you are modifying along the row direction, thus num_inputfeatures stays intact. 
46 | or 'col': you are modifying along the col direction (the number of features), 47 | thus the N_ROWS will keep constant 48 | """ 49 | def __init__(self, incoming, num_slices, num_features, direction='col', 50 | W=init.GlorotUniform(gain='relu'), nonlinearity=nonlinearities.rectify, 51 | **kwargs): 52 | super(DenseLayerTensorDot, self).__init__(incoming, **kwargs) 53 | self.nonlinearity = (nonlinearities.identity if nonlinearity is None 54 | else nonlinearity) 55 | self.num_inputslices = self.input_shape[1] 56 | self.num_slices = num_slices 57 | self.num_inputfeatures = self.input_shape[3] 58 | self.num_features = num_features 59 | self.batch_size = self.input_shape[0] 60 | self.num_rows = self.input_shape[2] 61 | 62 | self.direction = direction 63 | if direction == 'col': 64 | self.W = self.add_param( 65 | W, 66 | (num_slices, num_features, self.num_inputslices, self.num_inputfeatures), 67 | name="W4D_TensorDot_col") 68 | self.axes = [[1, 3], [2, 3]] 69 | elif direction == 'row': 70 | self.W = self.add_param( 71 | W, 72 | (num_slices, num_features, self.num_inputslices, self.num_rows), 73 | name="W4D_TensorDot_row") 74 | self.axes = [[1, 2], [2, 3]] 75 | else: 76 | raise ValueError("`direction` has to be either `row` or `col`.") 77 | 78 | def get_output_shape_for(self, input_shape): 79 | num_inputfeatures = input_shape[3] 80 | batch_size = input_shape[0] 81 | num_rows = input_shape[2] 82 | 83 | # this may change according to the dims you choose to multiply 84 | if self.direction == 'col': 85 | return (batch_size, self.num_slices, num_rows, self.num_features) 86 | elif self.direction == 'row': 87 | return (batch_size, self.num_slices, self.num_features, num_inputfeatures) 88 | 89 | def get_output_for(self, input, **kwargs): 90 | x = input 91 | if self.direction == 'col': 92 | preactivation = T.tensordot(x, self.W, axes=self.axes).dimshuffle(0, 2, 1, 3) 93 | elif self.direction == 'row': 94 | preactivation = T.tensordot(x, self.W, axes=self.axes).dimshuffle(0, 2, 
3, 1) 95 | return self.nonlinearity(preactivation) 96 | 97 | 98 | class DenseLayerTensorBatcheddot(Layer): 99 | """ 100 | """ 101 | def __init__(self): 102 | pass 103 | def get_output_shape_for(self): 104 | pass 105 | def get_output_for(self): 106 | pass 107 | 108 | 109 | class DenseLayer3DWeight(Layer): 110 | """ 111 | Apply a 3D matrix to a 3D input, basically it is just batched dot. 112 | 113 | Input: (BATCH_SIZE, inputs_per_row, N_ROWS) 114 | 115 | Weight: 116 | Depending on whether the weight is multiplied from left side of input, 117 | there are two shapes: 118 | right multiply case: (N_ROWS, inputs_per_row, units_per_row) 119 | left multiply case: (inputs_per_row, N_ROWS, units_per_row) 120 | 121 | Output: 122 | right multiply case: (BATCH_SIZE, units_per_row, N_ROWS) 123 | left multiply case: (BATCH_SIZE, inputs_per_row, units_per_row) 124 | 125 | Params: 126 | incoming, 127 | units_per_row, 128 | W 129 | b 130 | leftmul : True if the weight is left multiplied to the input. 131 | nonlinearity 132 | **kwargs 133 | """ 134 | def __init__(self, incoming, units_per_row, W=init.GlorotUniform(), 135 | b=init.Constant(0.), leftmul=False, nonlinearity=nonlinearities.tanh, 136 | **kwargs): 137 | super(DenseLayer3DWeight, self).__init__(incoming, **kwargs) 138 | self.nonlinearity = (nonlinearities.identity if nonlinearity is None 139 | else nonlinearity) 140 | 141 | self.units_per_row = units_per_row 142 | self.inputs_per_row = self.input_shape[1] 143 | self.num_rows = self.input_shape[2] 144 | self.leftmul = leftmul 145 | 146 | if leftmul: 147 | self.W = self.add_param( 148 | W, (self.inputs_per_row, self.num_rows, self.units_per_row), name='W3D') 149 | else: 150 | self.W = self.add_param( 151 | W, (self.num_rows, self.inputs_per_row, self.units_per_row), name='W3D') 152 | 153 | if b is None: 154 | self.b = None 155 | else: 156 | if self.leftmul: 157 | b = theano.shared( 158 | numpy.zeros((1, self.inputs_per_row, self.units_per_row), 159 | 
dtype=theano.config.floatX), 160 | broadcastable=(True, False, False), 161 | name="b3D") 162 | self.b = self.add_param(spec=b, 163 | shape=(1, self.inputs_per_row, self.units_per_row), 164 | regularizable=False) 165 | else: 166 | b = theano.shared( 167 | numpy.zeros((1, self.units_per_row, self.num_rows), 168 | dtype=theano.config.floatX), 169 | broadcastable=(True, False, False), 170 | name="b3D") 171 | self.b = self.add_param(spec=b, 172 | shape=(1, self.units_per_row, self.num_rows), 173 | regularizable=False) 174 | 175 | def get_output_shape_for(self, input_shape): 176 | if self.leftmul: 177 | return (input_shape[0], input_shape[1], self.units_per_row) 178 | else: 179 | return (input_shape[0], self.units_per_row, input_shape[2]) 180 | 181 | def get_output_for(self, input, **kwargs): 182 | if self.leftmul: 183 | preact = T.batched_dot(T.extra_ops.cpu_contiguous(input.dimshuffle(1, 0, 2)), 184 | self.W).dimshuffle(1, 0, 2) 185 | else: 186 | preact = T.batched_dot(T.extra_ops.cpu_contiguous(input.dimshuffle(2, 0, 1)), 187 | self.W).dimshuffle(1, 2, 0) 188 | if self.b is not None: 189 | preact = preact + self.b 190 | return self.nonlinearity(preact) 191 | 192 | 193 | class DenseLayer3DInput(Layer): 194 | """ 195 | Apply a 2D matrix to a 3D input, so it's a batched dot with shared slices. 196 | 197 | Input: (BATCH_SIZE, inputdim1, inputdim2) 198 | 199 | Weight: 200 | The weight is always multiplied from the right side of the input, 201 | so there is only one shape: 202 | (inputdim2, num_units) 203 | 204 | Output: 205 | (BATCH_SIZE, inputdim1, num_units) 206 | 207 | Params: 208 | incoming, 209 | num_units, 210 | W 211 | b 
212 | nonlinearity 213 | **kwargs 214 | """ 215 | def __init__(self, incoming, num_units, W=init.GlorotUniform(), 216 | b=init.Constant(0.), nonlinearity=nonlinearities.tanh, 217 | **kwargs): 218 | super(DenseLayer3DInput, self).__init__(incoming, **kwargs) 219 | self.nonlinearity = (nonlinearities.identity if nonlinearity is None 220 | else nonlinearity) 221 | 222 | self.num_units = num_units 223 | 224 | num_inputs = self.input_shape[2] 225 | 226 | self.W = self.add_param(W, (num_inputs, num_units), name="W2D") 227 | if b is None: 228 | self.b = None 229 | else: 230 | self.b = self.add_param(b, (num_units,), name="b2D", 231 | regularizable=False) 232 | 233 | def get_output_shape_for(self, input_shape): 234 | return (input_shape[0], input_shape[1], self.num_units) 235 | 236 | def get_output_for(self, input, **kwargs): 237 | 238 | # pdb.set_trace() 239 | 240 | activation = T.dot(input, self.W) 241 | if self.b is not None: 242 | activation = activation + self.b.dimshuffle('x', 'x', 0) 243 | return self.nonlinearity(activation) 244 | 245 | 246 | class Softmax3D(MergeLayer): 247 | """Softmax is conducted on the middle dimension of a 3D tensor.""" 248 | def __init__(self, incoming, mask=None, **kwargs): 249 | """ 250 | mask: a lasagne layer. 
251 | """ 252 | incomings = [incoming] 253 | self.have_mask = False 254 | if mask: 255 | incomings.append(mask) 256 | self.have_mask = True 257 | super(Softmax3D, self).__init__(incomings, **kwargs) 258 | 259 | def get_output_shape_for(self, input_shapes): 260 | return input_shapes[0] 261 | 262 | def get_output_for(self, inputs, **kwargs): 263 | preactivations = inputs[0] 264 | if self.have_mask: 265 | mask = inputs[1] 266 | preactivations = \ 267 | preactivations * mask.dimshuffle(0, 1, 'x').astype(theano.config.floatX) - \ 268 | numpy.asarray(1e36).astype(theano.config.floatX) * \ 269 | (1 - mask).dimshuffle(0, 1, 'x').astype(theano.config.floatX) 270 | 271 | annotation = T.nnet.softmax( 272 | preactivations.dimshuffle(0, 2, 1).reshape(( 273 | preactivations.shape[0] * preactivations.shape[2], 274 | preactivations.shape[1])) 275 | ).reshape(( 276 | preactivations.shape[0], 277 | preactivations.shape[2], 278 | preactivations.shape[1] 279 | )).dimshuffle(0, 2, 1) 280 | return annotation 281 | 282 | 283 | class ApplyAttention(MergeLayer): 284 | def get_output_shape_for(self, input_shapes): 285 | return (input_shapes[0][0], input_shapes[0][2], input_shapes[1][2]) 286 | 287 | def get_output_for(self, inputs, **kwargs): 288 | annotation, sentence = inputs[0], inputs[1] 289 | return T.batched_dot(sentence.dimshuffle(0, 2, 1), annotation).dimshuffle(0, 2, 1) 290 | 291 | 292 | class AugmentFeature(MergeLayer): 293 | """ 294 | Input: 295 | x: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN) 296 | y: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN) 297 | 298 | Output: (BATCH_SIZE, N_ROWS, 8*LSTM_HIDDEN) 299 | """ 300 | def get_output_shape_for(self, input_shapes): 301 | assert input_shapes[0] == input_shapes[1], ( 302 | "The two input to AugmentFeature layer should have the same shape.") 303 | batch_size = input_shapes[0][0] 304 | num_rows = input_shapes[0][1] 305 | num_dim = input_shapes[0][2] 306 | return (batch_size, num_rows, 4 * num_dim) 307 | 308 | def get_output_for(self, inputs, **kwargs): 
309 | x, y = inputs[0], inputs[1] 310 | return T.concatenate([x, y, x - y, x * y], axis=2) 311 | 312 | 313 | class GatedEncoder3D(MergeLayer): 314 | """ 315 | An implementation of the encoder part of a 3D Gated Autoencoder. It has 316 | the encoder only. 317 | 318 | It just returns the factor of H, not H. To get the real H, add 319 | another dense layer on top of the output. 320 | 321 | See __paper__ for more info. 322 | 323 | Input: 324 | x: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN) 325 | y: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN) 326 | 327 | Output: 328 | hfactors = (BATCH_SIZE, N_ROWS, num_hfactors) 329 | 330 | """ 331 | def __init__(self, incomings, num_hfactors, 332 | Wxf=init.GlorotUniform(), 333 | Wyf=init.GlorotUniform(), 334 | **kwargs): 335 | super(GatedEncoder3D, self).__init__(incomings, **kwargs) 336 | self.num_xfactors = self.input_shapes[0][2] 337 | self.num_yfactors = self.input_shapes[1][2] 338 | self.num_rows = self.input_shapes[0][1] 339 | self.num_hfactors = num_hfactors 340 | self.Wxf = self.add_param( 341 | Wxf, (self.num_rows, self.num_xfactors, self.num_hfactors), name='Wxf') 342 | self.Wyf = self.add_param( 343 | Wyf, (self.num_rows, self.num_yfactors, self.num_hfactors), name='Wyf') 344 | 345 | def get_output_shape_for(self, input_shapes): 346 | batch_size = input_shapes[0][0] 347 | return (batch_size, self.num_rows, self.num_hfactors) 348 | 349 | def get_output_for(self, inputs, **kwargs): 350 | x, y = inputs[0], inputs[1] 351 | # xfactor = T.batched_dot(x.dimshuffle(2, 0, 1), self.Wxf).dimshuffle(1, 2, 0) 352 | # yfactor = T.batched_dot(y.dimshuffle(2, 0, 1), self.Wyf).dimshuffle(1, 2, 0) 353 | xfactor = T.batched_dot( 354 | T.extra_ops.cpu_contiguous(x.dimshuffle(1, 0, 2)), self.Wxf).dimshuffle(1, 0, 2) 355 | yfactor = T.batched_dot( 356 | T.extra_ops.cpu_contiguous(y.dimshuffle(1, 0, 2)), self.Wyf).dimshuffle(1, 0, 2) 357 | return xfactor * yfactor 358 | 359 | 360 | class StackedGatedEncoder3D(MergeLayer): 361 | """ 362 | An implementation of 
the encoder part of a 3D Gated Autoencoder. It has 363 | the encoder only. 364 | 365 | It just returns the factor of H, not H. To get the real H, add 366 | another dense layer on top of the output. 367 | 368 | See __paper__ for more info. 369 | 370 | Input: 371 | x: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN) 372 | y: (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN) 373 | 374 | Output: 375 | hfactors = (BATCH_SIZE, N_ROWS, num_hfactors) 376 | 377 | """ 378 | def __init__(self, incomings, 379 | Wxf1=init.GlorotUniform(), 380 | Wyf1=init.GlorotUniform(), 381 | Wxf2=init.GlorotUniform(), 382 | Wyf2=init.GlorotUniform(), 383 | **kwargs): 384 | super(StackedGatedEncoder3D, self).__init__(incomings, **kwargs) 385 | self.num_xfactors = self.input_shapes[0][2] 386 | self.num_yfactors = self.input_shapes[1][2] 387 | assert self.num_xfactors == self.num_yfactors 388 | self.num_rows = self.input_shapes[0][1] 389 | self.Wxf1 = self.add_param( 390 | Wxf1, (self.num_rows, self.num_xfactors, self.num_xfactors), name='Wxf1') 391 | self.Wyf1 = self.add_param( 392 | Wyf1, (self.num_rows, self.num_yfactors, self.num_yfactors), name='Wyf1') 393 | self.Wxf2 = self.add_param( 394 | Wxf2, (self.num_rows, self.num_xfactors, self.num_xfactors), name='Wxf2') 395 | self.Wyf2 = self.add_param( 396 | Wyf2, (self.num_rows, self.num_yfactors, self.num_yfactors), name='Wyf2') 397 | 398 | def get_output_shape_for(self, input_shapes): 399 | batch_size = input_shapes[0][0] 400 | return (batch_size, self.num_rows, self.num_xfactors) 401 | 402 | def get_output_for(self, inputs, **kwargs): 403 | x, y = inputs[0], inputs[1] 404 | # xfactor = T.batched_dot(x.dimshuffle(2, 0, 1), self.Wxf).dimshuffle(1, 2, 0) 405 | # yfactor = T.batched_dot(y.dimshuffle(2, 0, 1), self.Wyf).dimshuffle(1, 2, 0) 406 | xfactor1 = T.tanh(T.batched_dot( 407 | T.extra_ops.cpu_contiguous(x.dimshuffle(1, 0, 2)), self.Wxf1).dimshuffle(1, 0, 2)) 408 | yfactor1 = T.tanh(T.batched_dot( 409 | T.extra_ops.cpu_contiguous(y.dimshuffle(1, 0, 2)), 
self.Wyf1).dimshuffle(1, 0, 2)) 410 | xfactor2 = T.batched_dot( 411 | T.extra_ops.cpu_contiguous(xfactor1.dimshuffle(1, 0, 2)), self.Wxf2).dimshuffle(1, 0, 2) 412 | yfactor2 = T.batched_dot( 413 | T.extra_ops.cpu_contiguous(yfactor1.dimshuffle(1, 0, 2)), self.Wyf2).dimshuffle(1, 0, 2) 414 | return xfactor2 * yfactor2 415 | 416 | 417 | class GatedEncoder3DSharedW(MergeLayer): 418 | """ 419 | An implementation of the encoder part of a 3D Gated Autoencoder. 420 | 421 | It has the encoder only. 422 | 423 | It just returns the factor of H, not H. To get the real H, add 424 | another dense layer on top of the output. 425 | 426 | See __paper__ for more info. 427 | 428 | the two inputs, x and y, have to have the same shape. 429 | 430 | """ 431 | def __init__(self, incomings, num_hfactors, 432 | Wf=init.GlorotUniform(), 433 | **kwargs): 434 | super(GatedEncoder3DSharedW, self).__init__(incomings, **kwargs) 435 | self.num_factors = self.input_shapes[0][1] 436 | self.num_rows = self.input_shapes[0][2] 437 | self.num_hfactors = num_hfactors 438 | self.Wf = self.add_param( 439 | Wf, (self.num_rows, self.num_factors, self.num_hfactors), name='Wf') 440 | 441 | def get_output_shape_for(self, input_shapes): 442 | batch_size = input_shapes[0][0] 443 | return (batch_size, self.num_hfactors, self.num_rows) 444 | 445 | def get_output_for(self, inputs, **kwargs): 446 | x, y = inputs[0], inputs[1] 447 | # xfactor = T.batched_dot(x.dimshuffle(2, 0, 1), self.Wxf).dimshuffle(1, 2, 0) 448 | # yfactor = T.batched_dot(y.dimshuffle(2, 0, 1), self.Wyf).dimshuffle(1, 2, 0) 449 | xfactor = T.batched_dot(T.extra_ops.cpu_contiguous(x.dimshuffle(2, 0, 1)), self.Wf).dimshuffle(1, 2, 0) 450 | yfactor = T.batched_dot(T.extra_ops.cpu_contiguous(y.dimshuffle(2, 0, 1)), self.Wf).dimshuffle(1, 2, 0) 451 | return xfactor * yfactor 452 | 453 | 454 | class GatedEncoder4D(MergeLayer): 455 | """ 456 | An implementation of the encoder part of a 4D Gated Autoencoder. 457 | 458 | It has the encoder only. 
459 | 460 | It just returns the factor of H, not H. To get the real H, add 461 | another dense layer on top of the output. 462 | 463 | the two inputs, x and y, have to have the same shape. 464 | 465 | Input shape: (dim1, dim2, num_factors) # (BATCH_SIZE, N_ROWS, 2*LSTM_HIDDEN) 466 | weight shape: (num_slices, num_factors, num_hfactors) # (N_SLICES, 2*LSTM_HIDDEN, num_hfactors) 467 | Output shape: (dim1, num_slices, dim2, num_hfactors) # (BATCH_SIZE, N_SLICES, N_ROWS, num_hfactors) 468 | 469 | """ 470 | def __init__(self, incomings, num_slices, num_hfactors, 471 | Wf=init.GlorotUniform(), 472 | **kwargs): 473 | super(GatedEncoder4D, self).__init__(incomings, **kwargs) 474 | self.num_slices = num_slices 475 | self.num_factors = self.input_shapes[0][2] 476 | self.num_rows = self.input_shapes[0][1] 477 | self.num_hfactors = num_hfactors 478 | self.Wf = self.add_param( 479 | Wf, (self.num_slices, self.num_factors, self.num_hfactors), name='Wf') 480 | 481 | def get_output_shape_for(self, input_shapes): 482 | batch_size = input_shapes[0][0] 483 | return (batch_size, self.num_slices, self.num_rows, self.num_hfactors) 484 | 485 | def get_output_for(self, inputs, **kwargs): 486 | x, y = inputs[0], inputs[1] 487 | xfactor = T.tensordot(x, self.Wf, axes=(2, 1)).dimshuffle(0, 2, 1, 3) 488 | yfactor = T.tensordot(y, self.Wf, axes=(2, 1)).dimshuffle(0, 2, 1, 3) 489 | return xfactor * yfactor 490 | 491 | 492 | class APAttentionBatch(MergeLayer): 493 | """ 494 | Attention Pooling mechanism. Compute a normalized weight over input sentences Q and A. 
495 | 496 | input: Q & A: (BSIZE, dim1(dim2), DIM) 497 | Q & A mask (BSIZE, dim1(dim2)) 498 | U: (NROW, DIM, DIM) 499 | output: G: (BSIZE, NROW, dim1, dim2) 500 | """ 501 | def __init__(self, incomings, masks=None, num_row=10, init_noise=0.001, **kwargs): 502 | self.have_mask = False 503 | if masks: 504 | incomings = incomings + masks 505 | self.have_mask = True 506 | super(APAttentionBatch, self).__init__(incomings, **kwargs) 507 | self.num_row = num_row 508 | self.init_noise = init_noise 509 | self.num_dim = self.input_shapes[0][2] 510 | U = (numpy.identity(self.num_dim) + init.Normal(std=self.init_noise).sample( 511 | shape=(self.num_row, self.num_dim, self.num_dim)) 512 | ).astype(theano.config.floatX) 513 | self.U = self.add_param(U, U.shape, name='U') 514 | 515 | def get_output_shape_for(self, input_shapes): 516 | batch_size = input_shapes[0][0] 517 | num_wordQ = input_shapes[0][1] 518 | num_wordA = input_shapes[1][1] 519 | return (batch_size, self.num_row, num_wordQ, num_wordA) 520 | 521 | def get_output_for(self, inputs, **kwargs): 522 | Q = inputs[0] 523 | A = inputs[1] 524 | QU = T.tensordot(Q, self.U, axes=[2, 1]) # (BSIZE, dim1, NROW, DIM) 525 | QUA = T.batched_tensordot(QU, A, axes=[3, 2]).dimshuffle(0, 2, 1, 3) 526 | G = T.tanh(QUA) # (BSIZE, NROW, dim1, dim2) 527 | 528 | if self.have_mask: 529 | Qmask = inputs[2] 530 | Amask = inputs[3] 531 | Gmask = T.batched_dot(Qmask.dimshuffle(0, 1, 'x'), 532 | Amask.dimshuffle(0, 'x', 1)).dimshuffle(0, 'x', 1, 2) 533 | G = G * Gmask - (1 - Gmask) # pad -1 to trailing spaces. 
534 | 535 | return G 536 | 537 | 538 | class ComputeEmbeddingPool(MergeLayer): 539 | """ 540 | Input : 541 | x: (BSIZE, NROW, DIM) 542 | y: (BSIZE, NROW, DIM) 543 | Output : 544 | (BSIZE, NROW, NROW) 545 | """ 546 | def __init__(self, incomings, **kwargs): 547 | super(ComputeEmbeddingPool, self).__init__(incomings, **kwargs) 548 | 549 | def get_output_shape_for(self, input_shapes): 550 | xshape = input_shapes[0] 551 | yshape = input_shapes[1] 552 | return (xshape[0], xshape[1], yshape[1]) 553 | 554 | def get_output_for(self, inputs, **kwargs): 555 | x = inputs[0] 556 | y = inputs[1] 557 | return T.batched_dot(x, y.dimshuffle(0, 2, 1)) 558 | 559 | 560 | class AttendOnEmbedding(MergeLayer): 561 | """ 562 | incomings=[x, embeddingpool], masks=[xmask, ymask], direction='col' 563 | or 564 | [y, embeddingpool], masks=[xmask, ymask], direction='row' 565 | 566 | Output : 567 | alpha; or beta 568 | """ 569 | def __init__(self, incomings, masks=None, direction='col', **kwargs): 570 | self.have_mask = False 571 | if masks: 572 | incomings = incomings + masks 573 | self.have_mask = True 574 | super(AttendOnEmbedding, self).__init__(incomings, **kwargs) 575 | self.direction = direction 576 | 577 | def get_output_shape_for(self, input_shapes): 578 | sent_shape = input_shapes[0] 579 | emat_shape = input_shapes[1] 580 | if self.direction == 'col': 581 | # x: (BSIZE, R_x, DIM) 582 | # emat: (BSIZE. R_x, R_y) 583 | # out: (BSIZE, R_y, DIM) 584 | return (sent_shape[0], emat_shape[2], sent_shape[2]) 585 | elif self.direction == 'row': 586 | # y: (BSIZE, R_y, DIM) 587 | # emat: (BSIZE. 
R_x, R_y) 588 | # out: (BSIZE, R_x, DIM) 589 | return (sent_shape[0], emat_shape[1], sent_shape[2]) 590 | 591 | def get_output_for(self, inputs, **kwargs): 592 | sentence = inputs[0] 593 | emat = inputs[1] 594 | if self.have_mask: 595 | xmask = inputs[2] 596 | ymask = inputs[3] 597 | xymask = T.batched_dot(xmask.dimshuffle(0, 1, 'x'), 598 | ymask.dimshuffle(0, 'x', 1)) 599 | emat = emat * xymask.astype(theano.config.floatX) - \ 600 | numpy.asarray(1e36).astype(theano.config.floatX) * \ 601 | (1 - xymask).astype(theano.config.floatX) 602 | 603 | if self.direction == 'col': # softmax on x's dim, and multiply by x 604 | annotation = T.nnet.softmax( 605 | emat.dimshuffle(0, 2, 1).reshape(( 606 | emat.shape[0] * emat.shape[2], emat.shape[1])) 607 | ).reshape(( 608 | emat.shape[0], emat.shape[2], emat.shape[1] 609 | )) # (BSIZE, R_y, R_x) 610 | if self.have_mask: 611 | annotation = annotation * ymask.dimshuffle( 612 | 0, 1, 'x').astype(theano.config.floatX) 613 | elif self.direction == 'row': # softmax on y's dim, and multiply by y 614 | annotation = T.nnet.softmax( 615 | emat.reshape(( 616 | emat.shape[0] * emat.shape[1], emat.shape[2])) 617 | ).reshape(( 618 | emat.shape[0], emat.shape[1], emat.shape[2] 619 | )) # (BSIZE, R_x, R_y) 620 | if self.have_mask: 621 | annotation = annotation * xmask.dimshuffle( 622 | 0, 1, 'x').astype(theano.config.floatX) 623 | return T.batched_dot(annotation, sentence) 624 | 625 | 626 | class MeanOverDim(MergeLayer): 627 | """ 628 | dim can be a number or a tuple of numbers to indicate which dim to compute mean. 
629 | """ 630 | def __init__(self, incoming, mask=None, dim=1, **kwargs): 631 | incomings = [incoming] 632 | self.have_mask = False 633 | if mask: 634 | incomings.append(mask) 635 | self.have_mask = True 636 | super(MeanOverDim, self).__init__(incomings, **kwargs) 637 | self.dim = dim 638 | 639 | def get_output_shape_for(self, input_shapes): 640 | return tuple(x for i, x in enumerate(input_shapes[0]) if i != self.dim) 641 | 642 | def get_output_for(self, inputs, **kwargs): 643 | if self.have_mask: 644 | return T.sum(inputs[0], axis=self.dim) / \ 645 | inputs[1].sum(axis=1).dimshuffle(0, 'x') 646 | else: 647 | return T.mean(inputs[0], axis=self.dim) 648 | 649 | 650 | class MaxpoolingG(Layer): 651 | """ 652 | Input : G matrix, 653 | Input shape: (BSIZE, NROW, dim1, dim2) 654 | 655 | Output shape: 656 | 'row': (BSIZE, dim2, NROW) 657 | 'col': (BSIZE, dim1, NROW) 658 | """ 659 | def __init__(self, incoming, direction='col', **kwargs): 660 | super(MaxpoolingG, self).__init__(incoming, **kwargs) 661 | self.direction = direction 662 | 663 | def get_output_shape_for(self, input_shape): 664 | if self.direction == 'row': 665 | return (input_shape[0], input_shape[3], input_shape[1]) 666 | elif self.direction == 'col': 667 | return (input_shape[0], input_shape[2], input_shape[1]) 668 | 669 | def get_output_for(self, input, **kwargs): 670 | G = input 671 | if self.direction == 'row': 672 | return T.max(G, axis=2).dimshuffle(0, 2, 1) 673 | elif self.direction == 'col': 674 | return T.max(G, axis=3).dimshuffle(0, 2, 1) 675 | 676 | 677 | class Maxpooling(Layer): 678 | """ 679 | Input : N-D matrix, 680 | Input shape: (BSIZE, NROW, dim1, dim2) 681 | 682 | Output shape: 683 | """ 684 | def __init__(self, incoming, axis=1, **kwargs): 685 | super(Maxpooling, self).__init__(incoming, **kwargs) 686 | self.axis = axis 687 | 688 | def get_output_shape_for(self, input_shape): 689 | return input_shape[:self.axis] + input_shape[(self.axis+1):] 690 | 691 | def get_output_for(self, input, 
**kwargs): 692 | return T.max(input, axis=self.axis) 693 | --------------------------------------------------------------------------------
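The `Softmax3D` and `ApplyAttention` layers in `util_layers.py` together turn per-token attention scores into a fixed-size sentence embedding. The following is a minimal, framework-free NumPy sketch of that computation. It is not part of the repository: the function names `masked_softmax_over_time` and `apply_attention` are hypothetical, and the shape convention `(batch, time, rows)` for the scores is an assumption read off the layers' `dimshuffle` calls.

```python
import numpy as np


def masked_softmax_over_time(scores, mask):
    """Softmax over the middle axis, ignoring padded positions.

    scores: (batch, time, rows) attention pre-activations.
    mask:   (batch, time), 1 for real tokens and 0 for padding.
    """
    m = mask[:, :, None].astype(scores.dtype)
    # Same trick as Softmax3D: push padded positions to a huge negative value.
    masked = scores * m - 1e36 * (1.0 - m)
    # Subtract the maximum along the time axis for numerical stability.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


def apply_attention(annotation, sentence):
    """Weighted sum of token states, one per attention row.

    annotation: (batch, time, rows); sentence: (batch, time, dim).
    Returns an array of shape (batch, rows, dim), as in ApplyAttention.
    """
    return np.einsum('btr,btd->brd', annotation, sentence)


scores = np.zeros((1, 3, 2))                 # uniform scores, 2 attention rows
mask = np.array([[1, 1, 0]])                 # the last token is padding
sentence = np.array([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]])

annotation = masked_softmax_over_time(scores, mask)
embedding = apply_attention(annotation, sentence)
print(annotation[0, :, 0])                   # attention weights over time
print(embedding.shape)
```

With uniform scores, the two real tokens each receive weight 0.5 and the padded token receives 0, so the padded state does not leak into the embedding.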