├── LICENSE
├── README.md
├── data
│   └── stimuli_words.npy
├── evaluate_brain_predictions.py
├── extract_nlp_features.py
├── predict_brain_from_nlp.py
└── utils
    ├── bert_utils.py
    ├── elmo_utils.py
    ├── ridge_tools.py
    ├── use_utils.py
    ├── utils.py
    └── xl_utils.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Mariya Toneva

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)

This repository contains code for the paper [Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)](https://arxiv.org/pdf/1905.11833.pdf).

Bibtex:
```
@inproceedings{toneva2019interpreting,
  title={Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)},
  author={Toneva, Mariya and Wehbe, Leila},
  booktitle={Advances in Neural Information Processing Systems},
  pages={14954--14964},
  year={2019}
}
```
## fMRI Recordings of 8 Subjects Reading Harry Potter
You can download the already [preprocessed data here](https://drive.google.com/drive/folders/1Q6zVCAJtKuLOh-zWpkS3lH8LBvHcEOE8?usp=sharing). This data contains fMRI recordings for 8 subjects reading one chapter of Harry Potter. The data has been detrended, smoothed, and trimmed to remove the first 20 TRs and the last 15 TRs. For more information about the data, refer to the paper. We have also provided the precomputed voxel neighborhoods that we used to compute the searchlight classification accuracies.

The following code expects these directories to be positioned under the data folder in this repository (e.g. `./data/fMRI/` and `./data/voxel_neighborhoods/`).


## Measuring Alignment Between Brain Recordings and NLP Representations

Our approach consists of three main steps:
1. Derive representations of text from an NLP model
2. Build an encoding model that takes the derived NLP representations as input and predicts brain recordings of people reading the same text
3. Evaluate the predictions of the encoding model using a classification task

In our paper, we present alignment results from 4 different NLP models - ELMo, BERT, Transformer-XL, and USE. Below we provide an overview of how to run all three steps.


### Deriving representations of text from an NLP model

Dependencies needed for each model:
- USE: Tensorflow < 1.8, `pip install tensorflow_hub`
- ELMo: `pip install allennlp`
- BERT/Transformer-XL: `pip install pytorch_pretrained_bert`


The following command can be used to derive the NLP features that we used to obtain the results in Figures 2 and 3:
```
python extract_nlp_features.py
    --nlp_model [bert/transformer_xl/elmo/use]
    --sequence_length s
    --output_dir nlp_features
```
where `s` ranges from 1 to 40. This command derives the representation for all sequences of `s` consecutive words in the stimulus text in `/data/stimuli_words.npy` from the model specified in `--nlp_model` and saves one file for each layer of the model in the specified `--output_dir`. The names of the saved files contain the argument values that were used to generate them. The output files are numpy arrays of size `n_words x n_dimensions`, where `n_words` is the number of words in the stimulus text and `n_dimensions` is the number of dimensions in the embeddings of the model specified in `--nlp_model`. Each row of the output file contains the representation of the most recent `s` consecutive words in the stimulus text (i.e. row `i` of the output file is derived by passing words `i-s+1` to `i` through the pretrained NLP model).
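As a quick sanity check, the saved feature files can be loaded directly with numpy. The file name pattern follows `save_layer_representations` in `extract_nlp_features.py`; the specific model, sequence length, and layer below are placeholder example values:
```
import numpy as np

# e.g., layer 1 of BERT representations extracted with --sequence_length 4
feats = np.load('nlp_features/bert_length_4_layer_1.npy')

# one row per word in ./data/stimuli_words.npy
print(feats.shape)  # (n_words, n_dimensions)
```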

### Building encoding model to predict fMRI recordings

Note: This code has been tested using Python 3.7.

```
python predict_brain_from_nlp.py
    --subject [F,H,I,J,K,L,M,N]
    --nlp_feat_type [bert/elmo/transformer_xl/use]
    --nlp_feat_dir INPUT_FEAT_DIR
    --layer l
    --sequence_length s
    --output_dir OUTPUT_DIR
```

This call builds encoding models to predict the fMRI recordings using the representations of the text stimuli derived from NLP models in step 1 above (`INPUT_FEAT_DIR` is set to the same directory where the NLP features from step 1 were saved, and `l` and `s` are the layer and sequence length used to load the extracted NLP representations). The encoding model is trained using ridge regression and 4-fold cross-validation. The predictions of the encoding model for the held-out data in every fold are saved in an output file in the specified directory `OUTPUT_DIR`. The output filename has the following format: `predict_{}_with_{}_layer_{}_len_{}.npy`, where the first field is specified by `--subject`, the second by `--nlp_feat_type`, and the rest by `--layer` and `--sequence_length`.
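The saved `.npy` file holds a pickled dictionary (see `predict_brain_from_nlp.py`). A minimal sketch of how to inspect it, assuming placeholder subject/layer/sequence-length values:
```
import numpy as np

# predict_brain_from_nlp.py concatenates --output_dir and the file name directly,
# so pass --output_dir with a trailing slash (e.g. OUTPUT_DIR/)
out = np.load('OUTPUT_DIR/predict_F_with_bert_layer_1_len_4.npy', allow_pickle=True).item()

corrs_t = out['corrs_t']  # per-fold correlations between held-out predictions and data (n_folds x n_voxels)
preds_t = out['preds_t']  # held-out predictions stacked across the 4 CV folds (TRs x n_voxels)
test_t = out['test_t']    # corresponding held-out (z-scored) fMRI data, same shape as preds_t

print(corrs_t.shape, preds_t.shape, test_t.shape)
```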

### Evaluating the predictions of the encoding model using classification accuracy

Note: This code has been tested using Python 3.7.

```
python evaluate_brain_predictions.py
    --input_path INPUT_PATH
    --output_path OUTPUT_PATH
    --subject [F,H,I,J,K,L,M,N]
```

This call computes the mean 20v20 classification accuracy (over 1000 samplings of 20 words) for each encoding model (from each of the 4 CV folds). The output is a `pickle` file that contains an array with 4 rows -- one for each CV fold -- where each row contains the accuracies for all voxels. `INPUT_PATH` is the full path (including the file name) to the predictions saved in step 2 above. `OUTPUT_PATH` is the full path (including the file name prefix) where the accuracies should be saved; the script appends `_accs.pkl` to it.

The following extracts the average accuracy across CV folds for a particular subject:
```
import pickle as pk
import numpy as np
loaded = pk.load(open('{}_accs.pkl'.format(OUTPUT_PATH), 'rb'))  # array of shape (n_folds, n_voxels)
mean_subj_acc_across_folds = loaded.mean(0)
```
--------------------------------------------------------------------------------
/data/stimuli_words.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mtoneva/brain_language_nlp/0f519a01e103e86554208ffae3167f4eba9eb788/data/stimuli_words.npy
--------------------------------------------------------------------------------
/evaluate_brain_predictions.py:
--------------------------------------------------------------------------------
import argparse
import numpy as np
import pickle as pk
import time as tm

from utils.utils import binary_classify_neighborhoods, CV_ind


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_path", required=True)
    parser.add_argument("--output_path", required=True)
    parser.add_argument("--subject", default='')
    args = parser.parse_args()
    print(args)

    start_time = tm.time()

    loaded = np.load(args.input_path, allow_pickle=True)
    preds_t_per_feat = loaded.item()['preds_t']
    test_t_per_feat = loaded.item()['test_t']
    print(test_t_per_feat.shape)

    n_class = 20   # how many predictions to classify at the same time
    n_folds = 4

    neighborhoods = np.load('./data/voxel_neighborhoods/' + args.subject + '_ars_auto2.npy')
    n_words, n_voxels = test_t_per_feat.shape
    ind = CV_ind(n_words, n_folds=n_folds)

    accs = np.zeros([n_folds,n_voxels])
    acc_std = np.zeros([n_folds,n_voxels])

    for ind_num in range(n_folds):
        test_ind = ind==ind_num
        accs[ind_num,:],_,_,_ = binary_classify_neighborhoods(preds_t_per_feat[test_ind,:], test_t_per_feat[test_ind,:], n_class=20, nSample = 1000,pair_samples = [],neighborhoods=neighborhoods)


    fname = args.output_path
    if n_class < 20:
        fname = fname + '_{}v{}_'.format(n_class,n_class)

    with open(fname + '_accs.pkl','wb') as fout:
        pk.dump(accs,fout)

    print('saved: {}'.format(fname + '_accs.pkl'))

--------------------------------------------------------------------------------
/extract_nlp_features.py:
--------------------------------------------------------------------------------
from utils.bert_utils import get_bert_layer_representations
from utils.xl_utils import get_xl_layer_representations
from utils.elmo_utils import get_elmo_layer_representations
from utils.use_utils import get_use_layer_representations

import time as tm
import numpy as np
import torch
import os
import argparse


def save_layer_representations(model_layer_dict, model_name, seq_len, save_dir):
    for layer in model_layer_dict.keys():
        np.save('{}/{}_length_{}_layer_{}.npy'.format(save_dir,model_name,seq_len,layer+1),np.vstack(model_layer_dict[layer]))
    print('Saved extracted features to {}'.format(save_dir))
    return 1


model_options = ['bert','transformer_xl','elmo','use']

if __name__ ==
'__main__': 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument("--nlp_model", default='bert', choices=model_options) 25 | parser.add_argument("--sequence_length", type=int, default=1, help='length of context to provide to NLP model (default: 1)') 26 | parser.add_argument("--output_dir", required=True, help='directory to save extracted representations to') 27 | 28 | args = parser.parse_args() 29 | print(args) 30 | 31 | text_array = np.load(os.getcwd() + '/data/stimuli_words.npy') 32 | remove_chars = [",","\"","@"] 33 | 34 | 35 | if args.nlp_model == 'bert': 36 | # the index of the word for which to extract the representations (in the input "[CLS] word_1 ... word_n [SEP]") 37 | # for CLS, set to 0; for SEP set to -1; for last word set to -2 38 | word_ind_to_extract = -2 39 | nlp_features = get_bert_layer_representations(args.sequence_length, text_array, remove_chars, word_ind_to_extract) 40 | elif args.nlp_model == 'transformer_xl': 41 | word_ind_to_extract = -1 42 | nlp_features = get_xl_layer_representations(args.sequence_length, text_array, remove_chars, word_ind_to_extract) 43 | elif args.nlp_model == 'elmo': 44 | word_ind_to_extract = -1 45 | nlp_features = get_elmo_layer_representations(args.sequence_length, text_array, remove_chars, word_ind_to_extract) 46 | elif args.nlp_model == 'use': 47 | nlp_features = get_use_layer_representations(args.sequence_length, text_array, remove_chars) 48 | else: 49 | print('Unrecognized model name {}'.format(args.nlp_model)) 50 | 51 | 52 | if not os.path.exists(args.output_dir): 53 | os.makedirs(args.output_dir) 54 | 55 | save_layer_representations(nlp_features, args.nlp_model, args.sequence_length, args.output_dir) 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | -------------------------------------------------------------------------------- /predict_brain_from_nlp.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | 4 | from utils.utils import run_class_time_CV_fmri_crossval_ridge 5 | 6 | if __name__ == '__main__': 7 | parser = argparse.ArgumentParser() 8 | parser.add_argument("--subject", required=True) 9 | parser.add_argument("--nlp_feat_type", required=True) 10 | parser.add_argument("--nlp_feat_dir", required=True) 11 | parser.add_argument("--layer", type=int, required=False) 12 | parser.add_argument("--sequence_length", type=int, required=False) 13 | parser.add_argument("--output_dir", required=True) 14 | 15 | args = parser.parse_args() 16 | print(args) 17 | 18 | predict_feat_dict = {'nlp_feat_type':args.nlp_feat_type, 19 | 'nlp_feat_dir':args.nlp_feat_dir, 20 | 'layer':args.layer, 21 | 'seq_len':args.sequence_length} 22 | 23 | 24 | # loading fMRI data 25 | 26 | data = np.load('./data/fMRI/data_subject_{}.npy'.format(args.subject)) 27 | corrs_t, _, _, preds_t, test_t = run_class_time_CV_fmri_crossval_ridge(data, 28 | predict_feat_dict) 29 | 30 | fname = 'predict_{}_with_{}_layer_{}_len_{}'.format(args.subject, args.nlp_feat_type, args.layer, args.sequence_length) 31 | print('saving: {}'.format(args.output_dir + fname)) 32 | 33 | np.save(args.output_dir + fname + '.npy', {'corrs_t':corrs_t,'preds_t':preds_t,'test_t':test_t}) 34 | 35 | 36 | -------------------------------------------------------------------------------- /utils/bert_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from pytorch_pretrained_bert import BertTokenizer, BertModel 4 | import time as tm 5 | 6 | 
def get_bert_layer_representations(seq_len, text_array, remove_chars, word_ind_to_extract): 7 | 8 | model = BertModel.from_pretrained('bert-base-uncased') 9 | tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 10 | model.eval() 11 | 12 | # get the token embeddings 13 | token_embeddings = [] 14 | for word in text_array: 15 | current_token_embedding = get_bert_token_embeddings([word], tokenizer, model, remove_chars) 16 | token_embeddings.append(np.mean(current_token_embedding.detach().numpy(), 1)) 17 | 18 | # where to store layer-wise bert embeddings of particular length 19 | BERT = {} 20 | for layer in range(12): 21 | BERT[layer] = [] 22 | BERT[-1] = token_embeddings 23 | 24 | if word_ind_to_extract < 0: # the index is specified from the end of the array, so invert the index 25 | from_start_word_ind_to_extract = seq_len + 2 + word_ind_to_extract # add 2 for CLS + SEP tokens 26 | else: 27 | from_start_word_ind_to_extract = word_ind_to_extract 28 | 29 | start_time = tm.time() 30 | 31 | # before we've seen enough words to make up the sequence length, add the representation for the last word 'seq_len' times 32 | word_seq = text_array[:seq_len] 33 | for _ in range(seq_len): 34 | BERT = add_avrg_token_embedding_for_specific_word(word_seq, 35 | tokenizer, 36 | model, 37 | remove_chars, 38 | from_start_word_ind_to_extract, 39 | BERT) 40 | 41 | # then add the embedding of the last word in a sequence as the embedding for the sequence 42 | for end_curr_seq in range(seq_len, len(text_array)): 43 | word_seq = text_array[end_curr_seq-seq_len+1:end_curr_seq+1] 44 | BERT = add_avrg_token_embedding_for_specific_word(word_seq, 45 | tokenizer, 46 | model, 47 | remove_chars, 48 | from_start_word_ind_to_extract, 49 | BERT) 50 | 51 | if end_curr_seq % 100 == 0: 52 | print('Completed {} out of {}: {}'.format(end_curr_seq, len(text_array), tm.time()-start_time)) 53 | start_time = tm.time() 54 | 55 | print('Done extracting sequences of length {}'.format(seq_len)) 56 | 57 | return BERT 58 | 59 | # extracts layer representations for all words in words_in_array 60 | # encoded_layers: list of tensors, length num layers. each tensor of dims num tokens by num dimensions in representation 61 | # word_ind_to_token_ind: dict that maps from index in words_in_array to index in array of tokens when words_in_array is tokenized, 62 | # with keys: index of word, and values: array of indices of corresponding tokens when word is tokenized 63 | def predict_bert_embeddings(words_in_array, tokenizer, model, remove_chars): 64 | 65 | for word in words_in_array: 66 | if word in remove_chars: 67 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. 
Proceed with caution.') 68 | return -1 69 | 70 | n_seq_tokens = 0 71 | seq_tokens = [] 72 | 73 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 74 | 75 | for i,word in enumerate(words_in_array): 76 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 77 | 78 | if word in ['[CLS]', '[SEP]']: # [CLS] and [SEP] are already tokenized 79 | word_tokens = [word] 80 | else: 81 | word_tokens = tokenizer.tokenize(word) 82 | 83 | for token in word_tokens: 84 | if token not in remove_chars: # don't add any tokens that are in remove_chars 85 | seq_tokens.append(token) 86 | word_ind_to_token_ind[i].append(n_seq_tokens) 87 | n_seq_tokens = n_seq_tokens + 1 88 | 89 | # convert token to vocabulary indices 90 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 91 | tokens_tensor = torch.tensor([indexed_tokens]) 92 | 93 | encoded_layers, _ = model(tokens_tensor) 94 | pooled_output = np.squeeze(model.pooler(encoded_layers[-1]).detach().numpy()) 95 | 96 | return encoded_layers, word_ind_to_token_ind, pooled_output 97 | 98 | # add the embeddings for a specific word in the sequence 99 | # token_inds_to_avrg: indices of tokens in embeddings output to avrg 100 | def add_word_bert_embedding(bert_dict, embeddings_to_add, token_inds_to_avrg, specific_layer=-1): 101 | if specific_layer >= 0: # only add embeddings for one specified layer 102 | layer_embedding = embeddings_to_add[specific_layer] 103 | full_sequence_embedding = layer_embedding.detach().numpy() 104 | bert_dict[specific_layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) 105 | else: 106 | for layer, layer_embedding in enumerate(embeddings_to_add): 107 | full_sequence_embedding = layer_embedding.detach().numpy() 108 | bert_dict[layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) # avrg over all tokens for specified word 109 | return bert_dict 110 | 111 | # predicts representations for specific word in input word sequence, and adds to existing layer-wise dictionary 112 | # 113 | # word_seq: numpy array of words in input sequence 114 | # tokenizer: BERT tokenizer 115 | # model: BERT model 116 | # remove_chars: characters that should not be included in the represention when word_seq is tokenized 117 | # from_start_word_ind_to_extract: the index of the word whose features to extract, INDEXED FROM START OF WORD_SEQ 118 | # bert_dict: where to save the extracted embeddings 119 | def add_avrg_token_embedding_for_specific_word(word_seq,tokenizer,model,remove_chars,from_start_word_ind_to_extract,bert_dict): 120 | 121 | word_seq = ['[CLS]'] + list(word_seq) + ['[SEP]'] 122 | all_sequence_embeddings, word_ind_to_token_ind, _ = predict_bert_embeddings(word_seq, tokenizer, model, remove_chars) 123 | token_inds_to_avrg = word_ind_to_token_ind[from_start_word_ind_to_extract] 124 | bert_dict = add_word_bert_embedding(bert_dict, all_sequence_embeddings,token_inds_to_avrg) 125 | 126 | return bert_dict 127 | 128 | 129 | # get the BERT token embeddings 130 | def get_bert_token_embeddings(words_in_array, tokenizer, model, remove_chars): 131 | for word in words_in_array: 132 | if word in remove_chars: 133 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. 
Proceed with caution.') 134 | return -1 135 | 136 | n_seq_tokens = 0 137 | seq_tokens = [] 138 | 139 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 140 | 141 | for i,word in enumerate(words_in_array): 142 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 143 | 144 | if word in ['[CLS]', '[SEP]']: # [CLS] and [SEP] are already tokenized 145 | word_tokens = [word] 146 | else: 147 | word_tokens = tokenizer.tokenize(word) 148 | 149 | for token in word_tokens: 150 | if token not in remove_chars: # don't add any tokens that are in remove_chars 151 | seq_tokens.append(token) 152 | word_ind_to_token_ind[i].append(n_seq_tokens) 153 | n_seq_tokens = n_seq_tokens + 1 154 | 155 | # convert token to vocabulary indices 156 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 157 | tokens_tensor = torch.tensor([indexed_tokens]) 158 | 159 | token_embeddings = model.embeddings.forward(tokens_tensor) 160 | 161 | return token_embeddings 162 | 163 | 164 | # add the embeddings for all individual words 165 | # specific_layer specifies only one layer to add embeddings from 166 | def add_all_bert_embeddings(bert_dict, embeddings_to_add, specific_layer=-1): 167 | if specific_layer >= 0: 168 | layer_embedding = embeddings_to_add[specific_layer] 169 | seq_len = layer_embedding.shape[1] 170 | full_sequence_embedding = layer_embedding.detach().numpy() 171 | 172 | for word in range(seq_len): 173 | bert_dict[specific_layer].append(full_sequence_embedding[0,word,:]) 174 | else: 175 | for layer, layer_embedding in enumerate(embeddings_to_add): 176 | seq_len = layer_embedding.shape[1] 177 | full_sequence_embedding = layer_embedding.detach().numpy() 178 | 179 | for word in range(seq_len): 180 | bert_dict[layer].append(full_sequence_embedding[0,word,:]) 181 | return bert_dict 182 | 183 | 184 | # add the embeddings for only the last word in the sequence that is not [SEP] token 185 | def add_last_nonsep_bert_embedding(bert_dict, embeddings_to_add, specific_layer=-1): 186 | if specific_layer >= 0: 187 | layer_embedding = embeddings_to_add[specific_layer] 188 | full_sequence_embedding = layer_embedding.detach().numpy() 189 | 190 | bert_dict[specific_layer].append(full_sequence_embedding[0,-2,:]) 191 | else: 192 | for layer, layer_embedding in enumerate(embeddings_to_add): 193 | full_sequence_embedding = layer_embedding.detach().numpy() 194 | 195 | bert_dict[layer].append(full_sequence_embedding[0,-2,:]) 196 | return bert_dict 197 | 198 | # add the CLS token embeddings ([CLS] is the first token in each string) 199 | def add_cls_bert_embedding(bert_dict, embeddings_to_add, specific_layer=-1): 200 | if specific_layer >= 0: 201 | layer_embedding = embeddings_to_add[specific_layer] 202 | full_sequence_embedding = layer_embedding.detach().numpy() 203 | 204 | bert_dict[specific_layer].append(full_sequence_embedding[0,0,:]) 205 | else: 206 | for layer, layer_embedding in enumerate(embeddings_to_add): 207 | full_sequence_embedding = layer_embedding.detach().numpy() 208 | 209 | bert_dict[layer].append(full_sequence_embedding[0,0,:]) 210 | return bert_dict 211 | -------------------------------------------------------------------------------- /utils/elmo_utils.py: -------------------------------------------------------------------------------- 1 | from allennlp.commands.elmo import ElmoEmbedder 2 | from allennlp.data.tokenizers.word_tokenizer import WordTokenizer 3 | 4 | import numpy as np 5 | import torch 6 | import time as tm 7 | 8 | 
def get_elmo_layer_representations(seq_len, text_array, remove_chars, word_ind_to_extract): 9 | 10 | model = ElmoEmbedder() 11 | tokenizer = WordTokenizer() 12 | 13 | # where to store layer-wise elmo embeddings of particular length 14 | elmo = {} 15 | for layer in range(-1,2): 16 | elmo[layer] = [] 17 | 18 | if word_ind_to_extract < 0: # the index is specified from the end of the array, so invert the index 19 | from_start_word_ind_to_extract = seq_len + word_ind_to_extract 20 | else: 21 | from_start_word_ind_to_extract = word_ind_to_extract 22 | 23 | start_time = tm.time() 24 | 25 | # before we've seen enough words to make up the sequence length, add the representation for the last word 'seq_len' times 26 | word_seq = text_array[:seq_len] 27 | for _ in range(seq_len): 28 | elmo = add_avrg_token_embedding_for_specific_word(word_seq, 29 | tokenizer, 30 | model, 31 | remove_chars, 32 | from_start_word_ind_to_extract, 33 | elmo) 34 | 35 | # then add the embedding of the last word in a sequence as the embedding for the sequence 36 | for end_curr_seq in range(seq_len, len(text_array)): 37 | word_seq = text_array[end_curr_seq-seq_len+1:end_curr_seq+1] 38 | elmo = add_avrg_token_embedding_for_specific_word(word_seq, 39 | tokenizer, 40 | model, 41 | remove_chars, 42 | from_start_word_ind_to_extract, 43 | elmo) 44 | 45 | if end_curr_seq % 100 == 0: 46 | print('Completed {} out of {}: {}'.format(end_curr_seq, len(text_array), tm.time()-start_time)) 47 | start_time = tm.time() 48 | 49 | print('Done extracting sequences of length {}'.format(seq_len)) 50 | 51 | return elmo 52 | 53 | 54 | def predict_elmo_embeddings(words_in_array, tokenizer, model, remove_chars): 55 | 56 | n_seq_tokens = 0 57 | seq_tokens = [] 58 | 59 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 60 | 61 | for i,word in enumerate(words_in_array): 62 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 63 | 64 | word_tokens = tokenizer.tokenize(str(word)) 65 | 66 | for token in word_tokens: 67 | if token not in remove_chars: # don't add any tokens that are in remove_chars 68 | seq_tokens.append(str(token)) 69 | word_ind_to_token_ind[i].append(n_seq_tokens) 70 | n_seq_tokens = n_seq_tokens + 1 71 | 72 | 73 | encoded_layers = model.embed_sentence(seq_tokens) 74 | 75 | return encoded_layers, word_ind_to_token_ind 76 | 77 | 78 | # predicts representations for specific word in input word sequence, and adds to existing layer-wise dictionary 79 | # 80 | # word_seq: numpy array of words in input sequence 81 | # remove_chars: characters that should not be included in the represention when word_seq is tokenized 82 | # from_start_word_ind_to_extract: the index of the word whose features to extract, INDEXED FROM START OF WORD_SEQ 83 | # model_dict: where to save the extracted embeddings 84 | def add_avrg_token_embedding_for_specific_word(word_seq,tokenizer,model,remove_chars,from_start_word_ind_to_extract,model_dict): 85 | 86 | word_seq = list(word_seq) 87 | all_sequence_embeddings, word_ind_to_token_ind = predict_elmo_embeddings(word_seq, tokenizer, model, remove_chars) 88 | token_inds_to_avrg = word_ind_to_token_ind[from_start_word_ind_to_extract] 89 | model_dict = add_word_elmo_embedding(model_dict, all_sequence_embeddings,token_inds_to_avrg) 90 | 91 | return model_dict 92 | 93 | # add the embeddings for a specific word in the sequence 94 | # token_inds_to_avrg: indices of tokens in embeddings output to avrg 95 | def add_word_elmo_embedding(elmo_dict, 
embeddings_to_add, token_inds_to_avrg): 96 | for layer in elmo_dict.keys(): 97 | elmo_dict[layer].append(np.mean(embeddings_to_add[layer+1,token_inds_to_avrg,:],0)) # avrg over all tokens for specified word 98 | return elmo_dict -------------------------------------------------------------------------------- /utils/ridge_tools.py: -------------------------------------------------------------------------------- 1 | from numpy.linalg import inv, svd 2 | import numpy as np 3 | from sklearn.model_selection import KFold 4 | from sklearn.linear_model import Ridge, RidgeCV 5 | import time 6 | from scipy.stats import zscore 7 | 8 | def corr(X,Y): 9 | return np.mean(zscore(X)*zscore(Y),0) 10 | 11 | def R2(Pred,Real): 12 | SSres = np.mean((Real-Pred)**2,0) 13 | SStot = np.var(Real,0) 14 | return np.nan_to_num(1-SSres/SStot) 15 | 16 | def R2r(Pred,Real): 17 | R2rs = R2(Pred,Real) 18 | ind_neg = R2rs<0 19 | R2rs = np.abs(R2rs) 20 | R2rs = np.sqrt(R2rs) 21 | R2rs[ind_neg] *= - 1 22 | return R2rs 23 | 24 | def ridge(X,Y,lmbda): 25 | return np.dot(inv(X.T.dot(X)+lmbda*np.eye(X.shape[1])),X.T.dot(Y)) 26 | 27 | def ridge_by_lambda(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 28 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 29 | for idx,lmbda in enumerate(lambdas): 30 | weights = ridge(X,Y,lmbda) 31 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 32 | return error 33 | 34 | def ridge_sk(X,Y,lmbda): 35 | rd = Ridge(alpha = lmbda) 36 | rd.fit(X,Y) 37 | return rd.coef_.T 38 | 39 | def ridgeCV_sk(X,Y,lmbdas): 40 | rd = RidgeCV(alphas = lmbdas) 41 | rd.fit(X,Y) 42 | return rd.coef_.T 43 | 44 | def ridge_by_lambda_sk(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 45 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 46 | for idx,lmbda in enumerate(lambdas): 47 | weights = ridge_sk(X,Y,lmbda) 48 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 49 | return error 50 | 51 | def ridge_svd(X,Y,lmbda): 52 | U, s, Vt = svd(X, full_matrices=False) 53 | d = s / (s** 2 + lmbda) 54 | return np.dot(Vt,np.diag(d).dot(U.T.dot(Y))) 55 | 56 | def ridge_by_lambda_svd(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 57 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 58 | U, s, Vt = svd(X, full_matrices=False) 59 | for idx,lmbda in enumerate(lambdas): 60 | d = s / (s** 2 + lmbda) 61 | weights = np.dot(Vt,np.diag(d).dot(U.T.dot(Y))) 62 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 63 | return error 64 | 65 | def kernel_ridge(X,Y,lmbda): 66 | return np.dot(X.T.dot(inv(X.dot(X.T)+lmbda*np.eye(X.shape[0]))),Y) 67 | 68 | def kernel_ridge_by_lambda(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 69 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 70 | for idx,lmbda in enumerate(lambdas): 71 | weights = kernel_ridge(X,Y,lmbda) 72 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 73 | return error 74 | 75 | def kernel_ridge_svd(X,Y,lmbda): 76 | U, s, Vt = svd(X.T, full_matrices=False) 77 | d = s / (s** 2 + lmbda) 78 | return np.dot(np.dot(U,np.diag(d).dot(Vt)),Y) 79 | 80 | def kernel_ridge_by_lambda_svd(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 81 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 82 | U, s, Vt = svd(X.T, full_matrices=False) 83 | for idx,lmbda in enumerate(lambdas): 84 | d = s / (s** 2 + lmbda) 85 | weights = np.dot(np.dot(U,np.diag(d).dot(Vt)),Y) 86 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 87 | return error 88 | 89 | def cross_val_ridge(train_features,train_data, n_splits = 10, 90 | lambdas = np.array([10**i for i in range(-6,10)]), 91 | method = 
'plain', 92 | do_plot = False): 93 | 94 | ridge_1 = dict(plain = ridge_by_lambda, 95 | svd = ridge_by_lambda_svd, 96 | kernel_ridge = kernel_ridge_by_lambda, 97 | kernel_ridge_svd = kernel_ridge_by_lambda_svd, 98 | ridge_sk = ridge_by_lambda_sk)[method] 99 | ridge_2 = dict(plain = ridge, 100 | svd = ridge_svd, 101 | kernel_ridge = kernel_ridge, 102 | kernel_ridge_svd = kernel_ridge_svd, 103 | ridge_sk = ridge_sk)[method] 104 | 105 | n_voxels = train_data.shape[1] 106 | nL = lambdas.shape[0] 107 | r_cv = np.zeros((nL, train_data.shape[1])) 108 | 109 | kf = KFold(n_splits=n_splits) 110 | start_t = time.time() 111 | for icv, (trn, val) in enumerate(kf.split(train_data)): 112 | #print('ntrain = {}'.format(train_features[trn].shape[0])) 113 | cost = ridge_1(train_features[trn],train_data[trn], 114 | train_features[val],train_data[val], 115 | lambdas=lambdas) 116 | if do_plot: 117 | import matplotlib.pyplot as plt 118 | plt.figure() 119 | plt.imshow(cost,aspect = 'auto') 120 | r_cv += cost 121 | #if icv%3 ==0: 122 | # print(icv) 123 | #print('average iteration length {}'.format((time.time()-start_t)/(icv+1))) 124 | if do_plot: 125 | plt.figure() 126 | plt.imshow(r_cv,aspect='auto',cmap = 'RdBu_r'); 127 | 128 | argmin_lambda = np.argmin(r_cv,axis = 0) 129 | weights = np.zeros((train_features.shape[1],train_data.shape[1])) 130 | for idx_lambda in range(lambdas.shape[0]): # this is much faster than iterating over voxels! 131 | idx_vox = argmin_lambda == idx_lambda 132 | weights[:,idx_vox] = ridge_2(train_features, train_data[:,idx_vox],lambdas[idx_lambda]) 133 | if do_plot: 134 | plt.figure() 135 | plt.imshow(weights,aspect='auto',cmap = 'RdBu_r',vmin = -0.5,vmax = 0.5); 136 | 137 | return weights, np.array([lambdas[i] for i in argmin_lambda]) 138 | -------------------------------------------------------------------------------- /utils/use_utils.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow_hub as hub 3 | import numpy as np 4 | 5 | 6 | def clean_word(word, remove_chars): 7 | word2 = word[:] 8 | while len(word2)>0 and word2[0] in remove_chars: 9 | word2 = word2[1:] 10 | while len(word2)>0 and word2[-1] in remove_chars: 11 | word2 = word2[:-1] 12 | return word2 13 | 14 | def get_use_layer_representations(seq_len, text_array, remove_chars): 15 | 16 | module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"] 17 | 18 | # Import the Universal Sentence Encoder's TF Hub module 19 | embed = hub.Module(module_url) 20 | 21 | # Reduce logging output. 
22 | tf.logging.set_verbosity(tf.logging.ERROR) 23 | 24 | clean_text_array = [clean_word(w,remove_chars) for w in text_array] 25 | n_labels = len(clean_text_array) 26 | 27 | seq_strings = [" ".join(clean_text_array[i-seq_len:i]) for i in range(20,n_labels)] 28 | 29 | with tf.Session() as session: 30 | session.run([tf.global_variables_initializer(), tf.tables_initializer()]) 31 | 32 | embeddings = session.run(embed(seq_strings)) 33 | sequence = np.array(embeddings) 34 | 35 | USE = {} 36 | USE[-1] = [np.zeros((20,sequence.shape[1])),sequence] 37 | 38 | return USE 39 | -------------------------------------------------------------------------------- /utils/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.decomposition import PCA 3 | from scipy.stats import zscore 4 | import time 5 | import csv 6 | import os 7 | import nibabel 8 | from sklearn.metrics.pairwise import euclidean_distances 9 | from scipy.ndimage.filters import gaussian_filter 10 | 11 | from utils.ridge_tools import cross_val_ridge, corr 12 | import time as tm 13 | 14 | 15 | def load_transpose_zscore(file): 16 | dat = nibabel.load(file).get_data() 17 | dat = dat.T 18 | return zscore(dat,axis = 0) 19 | 20 | def smooth_run_not_masked(data,smooth_factor): 21 | smoothed_data = np.zeros_like(data) 22 | for i,d in enumerate(data): 23 | smoothed_data[i] = gaussian_filter(data[i], sigma=smooth_factor, order=0, output=None, 24 | mode='reflect', cval=0.0, truncate=4.0) 25 | return smoothed_data 26 | 27 | def delay_one(mat, d): 28 | # delays a matrix by a delay d. Positive d ==> row t has row t-d 29 | new_mat = np.zeros_like(mat) 30 | if d>0: 31 | new_mat[d:] = mat[:-d] 32 | elif d<0: 33 | new_mat[:d] = mat[-d:] 34 | else: 35 | new_mat = mat 36 | return new_mat 37 | 38 | def delay_mat(mat, delays): 39 | # delays a matrix by a set of delays d. 40 | # a row t in the returned matrix has the concatenated: 41 | # row(t-delays[0],t-delays[1]...t-delays[last] ) 42 | new_mat = np.concatenate([delay_one(mat, d) for d in delays],axis = -1) 43 | return new_mat 44 | 45 | # train/test is the full NLP feature 46 | # train/test_pca is the NLP feature reduced to 10 dimensions via PCA that has been fit on the training data 47 | # feat_dir is the directory where the NLP features are stored 48 | # train_indicator is an array of 0s and 1s indicating whether the word at this index is in the training set 49 | def get_nlp_features_fixed_length(layer, seq_len, feat_type, feat_dir, train_indicator, SKIP_WORDS=20, END_WORDS=5176): 50 | 51 | loaded = np.load(feat_dir + feat_type + '_length_'+str(seq_len)+ '_layer_' + str(layer) + '.npy') 52 | if feat_type == 'elmo': 53 | train = loaded[SKIP_WORDS:END_WORDS,:][:,:512][train_indicator] # only forward LSTM 54 | test = loaded[SKIP_WORDS:END_WORDS,:][:,:512][~train_indicator] # only forward LSTM 55 | elif feat_type == 'bert' or feat_type == 'transformer_xl' or feat_type == 'use': 56 | train = loaded[SKIP_WORDS:END_WORDS,:][train_indicator] 57 | test = loaded[SKIP_WORDS:END_WORDS,:][~train_indicator] 58 | else: 59 | print('Unrecognized NLP feature type {}. 
Available options elmo, bert, transformer_xl, use'.format(feat_type)) 60 | 61 | pca = PCA(n_components=10, svd_solver='full') 62 | pca.fit(train) 63 | train_pca = pca.transform(train) 64 | test_pca = pca.transform(test) 65 | 66 | return train, test, train_pca, test_pca 67 | 68 | def CV_ind(n, n_folds): 69 | ind = np.zeros((n)) 70 | n_items = int(np.floor(n/n_folds)) 71 | for i in range(0,n_folds -1): 72 | ind[i*n_items:(i+1)*n_items] = i 73 | ind[(n_folds-1)*n_items:] = (n_folds-1) 74 | return ind 75 | 76 | def TR_to_word_CV_ind(TR_train_indicator,SKIP_WORDS=20,END_WORDS=5176): 77 | time = np.load('./data/fMRI/time_fmri.npy') 78 | runs = np.load('./data/fMRI/runs_fmri.npy') 79 | time_words = np.load('./data/fMRI/time_words_fmri.npy') 80 | time_words = time_words[SKIP_WORDS:END_WORDS] 81 | 82 | word_train_indicator = np.zeros([len(time_words)], dtype=bool) 83 | words_id = np.zeros([len(time_words)],dtype=int) 84 | # w=find what TR each word belongs to 85 | for i in range(len(time_words)): 86 | words_id[i] = np.where(time_words[i]> time)[0][-1] 87 | 88 | if words_id[i] <= len(runs) - 15: 89 | offset = runs[int(words_id[i])]*20 + (runs[int(words_id[i])]-1)*15 90 | if TR_train_indicator[int(words_id[i])-offset-1] == 1: 91 | word_train_indicator[i] = True 92 | return word_train_indicator 93 | 94 | 95 | def prepare_fmri_features(train_features, test_features, word_train_indicator, TR_train_indicator, SKIP_WORDS=20, END_WORDS=5176): 96 | 97 | time = np.load('./data/fMRI/time_fmri.npy') 98 | runs = np.load('./data/fMRI/runs_fmri.npy') 99 | time_words = np.load('./data/fMRI/time_words_fmri.npy') 100 | time_words = time_words[SKIP_WORDS:END_WORDS] 101 | 102 | words_id = np.zeros([len(time_words)]) 103 | # w=find what TR each word belongs to 104 | for i in range(len(time_words)): 105 | words_id[i] = np.where(time_words[i]> time)[0][-1] 106 | 107 | all_features = np.zeros([time_words.shape[0], train_features.shape[1]]) 108 | all_features[word_train_indicator] = train_features 109 | all_features[~word_train_indicator] = test_features 110 | 111 | p = all_features.shape[1] 112 | tmp = np.zeros([time.shape[0], p]) 113 | for i in range(time.shape[0]): 114 | tmp[i] = np.mean(all_features[(words_id<=i)*(words_id>i-1)],0) 115 | tmp = delay_mat(tmp, np.arange(1,5)) 116 | 117 | # remove the edges of each run 118 | tmp = np.vstack([zscore(tmp[runs==i][20:-15]) for i in range(1,5)]) 119 | tmp = np.nan_to_num(tmp) 120 | 121 | return tmp[TR_train_indicator], tmp[~TR_train_indicator] 122 | 123 | 124 | 125 | def run_class_time_CV_fmri_crossval_ridge(data, predict_feat_dict, 126 | regress_feat_names_list = [],method = 'kernel_ridge', 127 | lambdas = np.array([0.1,1,10,100,1000]), 128 | detrend = False, n_folds = 4, skip=5): 129 | 130 | nlp_feat_type = predict_feat_dict['nlp_feat_type'] 131 | feat_dir = predict_feat_dict['nlp_feat_dir'] 132 | layer = predict_feat_dict['layer'] 133 | seq_len = predict_feat_dict['seq_len'] 134 | 135 | 136 | n_words = data.shape[0] 137 | n_voxels = data.shape[1] 138 | 139 | ind = CV_ind(n_words, n_folds=n_folds) 140 | 141 | corrs = np.zeros((n_folds, n_voxels)) 142 | acc = np.zeros((n_folds, n_voxels)) 143 | acc_std = np.zeros((n_folds, n_voxels)) 144 | 145 | all_test_data = [] 146 | all_preds = [] 147 | 148 | 149 | for ind_num in range(n_folds): 150 | train_ind = ind!=ind_num 151 | test_ind = ind==ind_num 152 | 153 | word_CV_ind = TR_to_word_CV_ind(train_ind) 154 | 155 | _,_,tmp_train_features,tmp_test_features = get_nlp_features_fixed_length(layer, seq_len, nlp_feat_type, feat_dir, 
word_CV_ind) 156 | train_features,test_features = prepare_fmri_features(tmp_train_features, tmp_test_features, word_CV_ind, train_ind) 157 | 158 | # split data 159 | train_data = data[train_ind] 160 | test_data = data[test_ind] 161 | 162 | # skip TRs between train and test data 163 | if ind_num == 0: # just remove from front end 164 | train_data = train_data[skip:,:] 165 | train_features = train_features[skip:,:] 166 | elif ind_num == n_folds-1: # just remove from back end 167 | train_data = train_data[:-skip,:] 168 | train_features = train_features[:-skip,:] 169 | else: 170 | test_data = test_data[skip:-skip,:] 171 | test_features = test_features[skip:-skip,:] 172 | 173 | # normalize data 174 | train_data = np.nan_to_num(zscore(np.nan_to_num(train_data))) 175 | test_data = np.nan_to_num(zscore(np.nan_to_num(test_data))) 176 | all_test_data.append(test_data) 177 | 178 | train_features = np.nan_to_num(zscore(train_features)) 179 | test_features = np.nan_to_num(zscore(test_features)) 180 | 181 | start_time = tm.time() 182 | weights, chosen_lambdas = cross_val_ridge(train_features,train_data, n_splits = 10, lambdas = np.array([10**i for i in range(-6,10)]), method = 'plain',do_plot = False) 183 | 184 | preds = np.dot(test_features, weights) 185 | corrs[ind_num,:] = corr(preds,test_data) 186 | all_preds.append(preds) 187 | 188 | print('fold {} completed, took {} seconds'.format(ind_num, tm.time()-start_time)) 189 | del weights 190 | 191 | return corrs, acc, acc_std, np.vstack(all_preds), np.vstack(all_test_data) 192 | 193 | def binary_classify_neighborhoods(Ypred, Y, n_class=20, nSample = 1000,pair_samples = [],neighborhoods=[]): 194 | # n_class = how many words to classify at once 195 | # nSample = how many words to classify 196 | 197 | voxels = Y.shape[-1] 198 | neighborhoods = np.asarray(neighborhoods, dtype=int) 199 | 200 | import time as tm 201 | 202 | acc = np.full([nSample, Y.shape[-1]], np.nan) 203 | acc2 = np.full([nSample, Y.shape[-1]], np.nan) 204 | test_word_inds = [] 205 | 206 | if len(pair_samples)>0: 207 | Ypred2 = Ypred[pair_samples>=0] 208 | Y2 = Y[pair_samples>=0] 209 | pair_samples2 = pair_samples[pair_samples>=0] 210 | else: 211 | Ypred2 = Ypred 212 | Y2 = Y 213 | pair_samples2 = pair_samples 214 | n = Y2.shape[0] 215 | start_time = tm.time() 216 | for idx in range(nSample): 217 | 218 | idx_real = np.random.choice(n, n_class) 219 | 220 | sample_real = Y2[idx_real] 221 | sample_pred_correct = Ypred2[idx_real] 222 | 223 | if len(pair_samples2) == 0: 224 | idx_wrong = np.random.choice(n, n_class) 225 | else: 226 | idx_wrong = sample_same_but_different(idx_real,pair_samples2) 227 | sample_pred_incorrect = Ypred2[idx_wrong] 228 | 229 | #print(sample_pred_incorrect.shape) 230 | 231 | # compute distances within neighborhood 232 | dist_correct = np.sum((sample_real - sample_pred_correct)**2,0) 233 | dist_incorrect = np.sum((sample_real - sample_pred_incorrect)**2,0) 234 | 235 | neighborhood_dist_correct = np.array([np.sum(dist_correct[neighborhoods[v,neighborhoods[v,:]>-1]]) for v in range(voxels)]) 236 | neighborhood_dist_incorrect = np.array([np.sum(dist_incorrect[neighborhoods[v,neighborhoods[v,:]>-1]]) for v in range(voxels)]) 237 | 238 | 239 | acc[idx,:] = (neighborhood_dist_correct < neighborhood_dist_incorrect)*1.0 + (neighborhood_dist_correct == neighborhood_dist_incorrect)*0.5 240 | 241 | test_word_inds.append(idx_real) 242 | print('Classification for fold done. 
Took {} seconds'.format(tm.time()-start_time)) 243 | return np.nanmean(acc,0), np.nanstd(acc,0), acc, np.array(test_word_inds) 244 | -------------------------------------------------------------------------------- /utils/xl_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import time as tm 4 | 5 | from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel 6 | 7 | 8 | def get_xl_layer_representations(seq_len, text_array, remove_chars, word_ind_to_extract): 9 | 10 | model = TransfoXLModel.from_pretrained('transfo-xl-wt103') 11 | tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103') 12 | model.eval() 13 | 14 | 15 | # get the token embeddings 16 | token_embeddings = [] 17 | for word in text_array: 18 | current_token_embedding = get_xl_token_embeddings([word], tokenizer, model, remove_chars) 19 | token_embeddings.append(np.mean(current_token_embedding.detach().numpy(), 1)) 20 | 21 | # where to store layer-wise xl embeddings of particular length 22 | XL = {} 23 | for layer in range(19): 24 | XL[layer] = [] 25 | XL[-1] = token_embeddings 26 | 27 | if word_ind_to_extract < 0: # the index is specified from the end of the array, so invert the index 28 | from_start_word_ind_to_extract = seq_len + word_ind_to_extract 29 | else: 30 | from_start_word_ind_to_extract = word_ind_to_extract 31 | 32 | start_time = tm.time() 33 | 34 | # before we've seen enough words to make up the sequence length, add the representation for the last word 'seq_len' times 35 | word_seq = text_array[:seq_len] 36 | for _ in range(seq_len): 37 | XL = add_avrg_token_embedding_for_specific_word(word_seq, 38 | tokenizer, 39 | model, 40 | remove_chars, 41 | from_start_word_ind_to_extract, 42 | XL) 43 | 44 | # then add the embedding of the last word in a sequence as the embedding for the sequence 45 | for end_curr_seq in range(seq_len, len(text_array)): 46 | word_seq = text_array[end_curr_seq-seq_len+1:end_curr_seq+1] 47 | XL = add_avrg_token_embedding_for_specific_word(word_seq, 48 | tokenizer, 49 | model, 50 | remove_chars, 51 | from_start_word_ind_to_extract, 52 | XL) 53 | 54 | if end_curr_seq % 100 == 0: 55 | print('Completed {} out of {}: {}'.format(end_curr_seq, len(text_array), tm.time()-start_time)) 56 | start_time = tm.time() 57 | 58 | print('Done extracting sequences of length {}'.format(seq_len)) 59 | 60 | return XL 61 | 62 | def predict_xl_embeddings(words_in_array, tokenizer, model, remove_chars): 63 | for word in words_in_array: 64 | if word in remove_chars: 65 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. 
Proceed with caution.') 66 | return -1 67 | 68 | n_seq_tokens = 0 69 | seq_tokens = [] 70 | 71 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 72 | 73 | for i,word in enumerate(words_in_array): 74 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 75 | word_tokens = tokenizer.tokenize(word) 76 | for token in word_tokens: 77 | if token not in remove_chars: # don't add any tokens that are in remove_chars 78 | seq_tokens.append(token) 79 | word_ind_to_token_ind[i].append(n_seq_tokens) 80 | n_seq_tokens = n_seq_tokens + 1 81 | 82 | # convert token to vocabulary indices 83 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 84 | tokens_tensor = torch.tensor([indexed_tokens]) 85 | 86 | hidden_states, mems = model(tokens_tensor) 87 | seq_length = hidden_states.size(1) 88 | lower_hidden_states = list(t[-seq_length:, ...].transpose(0, 1) for t in mems) 89 | all_hidden_states = lower_hidden_states + [hidden_states] 90 | 91 | return all_hidden_states, word_ind_to_token_ind 92 | 93 | # get the XL token embeddings 94 | def get_xl_token_embeddings(words_in_array, tokenizer, model, remove_chars): 95 | for word in words_in_array: 96 | if word in remove_chars: 97 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. Proceed with caution.') 98 | return -1 99 | 100 | n_seq_tokens = 0 101 | seq_tokens = [] 102 | 103 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 104 | 105 | for i,word in enumerate(words_in_array): 106 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 107 | word_tokens = tokenizer.tokenize(word) 108 | for token in word_tokens: 109 | if token not in remove_chars: # don't add any tokens that are in remove_chars 110 | seq_tokens.append(token) 111 | word_ind_to_token_ind[i].append(n_seq_tokens) 112 | n_seq_tokens = n_seq_tokens + 1 113 | 114 | # Convert token to vocabulary indices 115 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 116 | 117 | # Convert inputs to PyTorch tensors 118 | tokens_tensor = torch.tensor([indexed_tokens]) 119 | 120 | token_embeddings = model.word_emb.forward(tokens_tensor) 121 | 122 | return token_embeddings 123 | 124 | # predicts representations for specific word in input word sequence, and adds to existing layer-wise dictionary 125 | # 126 | # word_seq: numpy array of words in input sequence 127 | # remove_chars: characters that should not be included in the represention when word_seq is tokenized 128 | # from_start_word_ind_to_extract: the index of the word whose features to extract, INDEXED FROM START OF WORD_SEQ 129 | # model_dict: where to save the extracted embeddings 130 | def add_avrg_token_embedding_for_specific_word(word_seq,tokenizer,model,remove_chars,from_start_word_ind_to_extract,model_dict): 131 | 132 | word_seq = list(word_seq) 133 | all_sequence_embeddings, word_ind_to_token_ind = predict_xl_embeddings(word_seq, tokenizer, model, remove_chars) 134 | token_inds_to_avrg = word_ind_to_token_ind[from_start_word_ind_to_extract] 135 | model_dict = add_word_xl_embedding(model_dict, all_sequence_embeddings,token_inds_to_avrg) 136 | 137 | return model_dict 138 | 139 | # add the embeddings for a specific word in the sequence 140 | # token_inds_to_avrg: indices of tokens in embeddings output to avrg 141 | def add_word_xl_embedding(model_dict, embeddings_to_add, token_inds_to_avrg, specific_layer=-1): 142 
| if specific_layer >= 0: # only add embeddings for one specified layer 143 | layer_embedding = embeddings_to_add[specific_layer] 144 | full_sequence_embedding = layer_embedding.detach().numpy() 145 | model_dict[specific_layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) 146 | else: 147 | for layer, layer_embedding in enumerate(embeddings_to_add): 148 | full_sequence_embedding = layer_embedding.detach().numpy() 149 | model_dict[layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) # avrg over all tokens for specified word 150 | return model_dict 151 | --------------------------------------------------------------------------------