├── LICENSE
├── README.md
├── data
│   └── stimuli_words.npy
├── evaluate_brain_predictions.py
├── extract_nlp_features.py
├── predict_brain_from_nlp.py
└── utils
    ├── bert_utils.py
    ├── elmo_utils.py
    ├── ridge_tools.py
    ├── use_utils.py
    ├── utils.py
    └── xl_utils.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Mariya Toneva

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)

This repository contains code for the paper [Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)](https://arxiv.org/pdf/1905.11833.pdf).

Bibtex:
```
@inproceedings{toneva2019interpreting,
  title={Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)},
  author={Toneva, Mariya and Wehbe, Leila},
  booktitle={Advances in Neural Information Processing Systems},
  pages={14954--14964},
  year={2019}
}
```
## fMRI Recordings of 8 Subjects Reading Harry Potter
You can download the already [preprocessed data here](https://drive.google.com/drive/folders/1Q6zVCAJtKuLOh-zWpkS3lH8LBvHcEOE8?usp=sharing). This data contains fMRI recordings for 8 subjects reading one chapter of Harry Potter. The data has been detrended, smoothed, and trimmed to remove the first 20 TRs and the last 15 TRs. For more information about the data, refer to the paper. We have also provided the precomputed voxel neighborhoods that we used to compute the searchlight classification accuracies.

The following code expects these directories to be positioned under the data folder in this repository (e.g. `./data/fMRI/` and `./data/voxel_neighborhoods/`).


## Measuring Alignment Between Brain Recordings and NLP Representations

Our approach consists of three main steps:
1. Derive representations of text from an NLP model
2. Build an encoding model that takes the derived NLP representations as input and predicts brain recordings of people reading the same text
3. Evaluate the predictions of the encoding model using a classification task

In our paper, we present alignment results from 4 different NLP models - ELMo, BERT, Transformer-XL, and USE. Below we provide an overview of how to run all three steps.


### Deriving representations of text from an NLP model

Dependencies needed for each model:
- USE: Tensorflow < 1.8, `pip install tensorflow_hub`
- ELMo: `pip install allennlp`
- BERT/Transformer-XL: `pip install pytorch_pretrained_bert`


The following command can be used to derive the NLP features that we used to obtain the results in Figures 2 and 3:
```
python extract_nlp_features.py
    --nlp_model [bert/transformer_xl/elmo/use]
    --sequence_length s
    --output_dir nlp_features
```
where `s` ranges from 1 to 40. This command derives the representation for all sequences of `s` consecutive words in the stimulus text in `/data/stimuli_words.npy` from the model specified in `--nlp_model` and saves one file for each layer of the model in the specified `--output_dir`. The names of the saved files contain the argument values that were used to generate them. The output files are numpy arrays of size `n_words x n_dimensions`, where `n_words` is the number of words in the stimulus text and `n_dimensions` is the number of dimensions in the embeddings of the model specified in `--nlp_model`. Each row of the output file contains the representation of the most recent `s` consecutive words in the stimulus text (i.e. row `i` of the output file is derived by passing words `i-s+1` to `i` through the pretrained NLP model).
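As a quick sanity check, the saved feature files can be loaded directly with numpy. The file name pattern follows `save_layer_representations` in `extract_nlp_features.py`; the specific model, sequence length, and layer below are placeholder example values:
```
import numpy as np

# e.g., layer 1 of BERT representations extracted with --sequence_length 4
feats = np.load('nlp_features/bert_length_4_layer_1.npy')

# one row per word in ./data/stimuli_words.npy
print(feats.shape)  # (n_words, n_dimensions)
```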

### Building encoding model to predict fMRI recordings

Note: This code has been tested using Python 3.7.

```
python predict_brain_from_nlp.py
    --subject [F,H,I,J,K,L,M,N]
    --nlp_feat_type [bert/elmo/transformer_xl/use]
    --nlp_feat_dir INPUT_FEAT_DIR
    --layer l
    --sequence_length s
    --output_dir OUTPUT_DIR
```

This call builds encoding models to predict the fMRI recordings using the representations of the text stimuli derived from NLP models in step 1 above (`INPUT_FEAT_DIR` is set to the same directory where the NLP features from step 1 were saved, and `l` and `s` are the layer and sequence length used to load the extracted NLP representations). The encoding model is trained using ridge regression and 4-fold cross-validation. The predictions of the encoding model for the held-out data in every fold are saved in an output file in the specified directory `OUTPUT_DIR`. The output filename has the following format: `predict_{}_with_{}_layer_{}_len_{}.npy`, where the first field is specified by `--subject`, the second by `--nlp_feat_type`, and the rest by `--layer` and `--sequence_length`.
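The saved `.npy` file holds a pickled dictionary (see `predict_brain_from_nlp.py`). A minimal sketch of how to inspect it, assuming placeholder subject/layer/sequence-length values:
```
import numpy as np

# predict_brain_from_nlp.py concatenates --output_dir and the file name directly,
# so pass --output_dir with a trailing slash (e.g. OUTPUT_DIR/)
out = np.load('OUTPUT_DIR/predict_F_with_bert_layer_1_len_4.npy', allow_pickle=True).item()

corrs_t = out['corrs_t']  # per-fold correlations between held-out predictions and data (n_folds x n_voxels)
preds_t = out['preds_t']  # held-out predictions stacked across the 4 CV folds (TRs x n_voxels)
test_t = out['test_t']    # corresponding held-out (z-scored) fMRI data, same shape as preds_t

print(corrs_t.shape, preds_t.shape, test_t.shape)
```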

### Evaluating the predictions of the encoding model using classification accuracy

Note: This code has been tested using Python 3.7.

```
python evaluate_brain_predictions.py
    --input_path INPUT_PATH
    --output_path OUTPUT_PATH
    --subject [F,H,I,J,K,L,M,N]
```

This call computes the mean 20v20 classification accuracy (over 1000 samplings of 20 words) for each encoding model (from each of the 4 CV folds). The output is a `pickle` file that contains an array with 4 rows -- one for each CV fold -- where each row contains the accuracies for all voxels. `INPUT_PATH` is the full path (including the file name) to the predictions saved in step 2 above. `OUTPUT_PATH` is the full path (including the file name prefix) where the accuracies should be saved; the script appends `_accs.pkl` to it.

The following extracts the average accuracy across CV folds for a particular subject:
```
import pickle as pk
import numpy as np
loaded = pk.load(open('{}_accs.pkl'.format(OUTPUT_PATH), 'rb'))  # array of shape (n_folds, n_voxels)
mean_subj_acc_across_folds = loaded.mean(0)
```
--------------------------------------------------------------------------------
/data/stimuli_words.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mtoneva/brain_language_nlp/0f519a01e103e86554208ffae3167f4eba9eb788/data/stimuli_words.npy
--------------------------------------------------------------------------------
/evaluate_brain_predictions.py:
--------------------------------------------------------------------------------
import argparse
import numpy as np
import pickle as pk
import time as tm

from utils.utils import binary_classify_neighborhoods, CV_ind


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_path", required=True)
    parser.add_argument("--output_path", required=True)
    parser.add_argument("--subject", default='')
    args = parser.parse_args()
    print(args)

    start_time = tm.time()

    loaded = np.load(args.input_path, allow_pickle=True)
    preds_t_per_feat = loaded.item()['preds_t']
    test_t_per_feat = loaded.item()['test_t']
    print(test_t_per_feat.shape)

    n_class = 20   # how many predictions to classify at the same time
    n_folds = 4

    neighborhoods = np.load('./data/voxel_neighborhoods/' + args.subject + '_ars_auto2.npy')
    n_words, n_voxels = test_t_per_feat.shape
    ind = CV_ind(n_words, n_folds=n_folds)

    accs = np.zeros([n_folds,n_voxels])
    acc_std = np.zeros([n_folds,n_voxels])

    for ind_num in range(n_folds):
        test_ind = ind==ind_num
        accs[ind_num,:],_,_,_ = binary_classify_neighborhoods(preds_t_per_feat[test_ind,:], test_t_per_feat[test_ind,:], n_class=20, nSample = 1000,pair_samples = [],neighborhoods=neighborhoods)


    fname = args.output_path
    if n_class < 20:
        fname = fname + '_{}v{}_'.format(n_class,n_class)

    with open(fname + '_accs.pkl','wb') as fout:
        pk.dump(accs,fout)

    print('saved: {}'.format(fname + '_accs.pkl'))

--------------------------------------------------------------------------------
/extract_nlp_features.py:
--------------------------------------------------------------------------------
from utils.bert_utils import get_bert_layer_representations
from utils.xl_utils import get_xl_layer_representations
from utils.elmo_utils import get_elmo_layer_representations
from utils.use_utils import get_use_layer_representations

import time as tm
import numpy as np
import torch
import os
import argparse


def save_layer_representations(model_layer_dict, model_name, seq_len, save_dir):
    for layer in model_layer_dict.keys():
        np.save('{}/{}_length_{}_layer_{}.npy'.format(save_dir,model_name,seq_len,layer+1),np.vstack(model_layer_dict[layer]))
    print('Saved extracted features to {}'.format(save_dir))
    return 1


model_options = ['bert','transformer_xl','elmo','use']

if __name__ ==
'__main__': 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument("--nlp_model", default='bert', choices=model_options) 25 | parser.add_argument("--sequence_length", type=int, default=1, help='length of context to provide to NLP model (default: 1)') 26 | parser.add_argument("--output_dir", required=True, help='directory to save extracted representations to') 27 | 28 | args = parser.parse_args() 29 | print(args) 30 | 31 | text_array = np.load(os.getcwd() + '/data/stimuli_words.npy') 32 | remove_chars = [",","\"","@"] 33 | 34 | 35 | if args.nlp_model == 'bert': 36 | # the index of the word for which to extract the representations (in the input "[CLS] word_1 ... word_n [SEP]") 37 | # for CLS, set to 0; for SEP set to -1; for last word set to -2 38 | word_ind_to_extract = -2 39 | nlp_features = get_bert_layer_representations(args.sequence_length, text_array, remove_chars, word_ind_to_extract) 40 | elif args.nlp_model == 'transformer_xl': 41 | word_ind_to_extract = -1 42 | nlp_features = get_xl_layer_representations(args.sequence_length, text_array, remove_chars, word_ind_to_extract) 43 | elif args.nlp_model == 'elmo': 44 | word_ind_to_extract = -1 45 | nlp_features = get_elmo_layer_representations(args.sequence_length, text_array, remove_chars, word_ind_to_extract) 46 | elif args.nlp_model == 'use': 47 | nlp_features = get_use_layer_representations(args.sequence_length, text_array, remove_chars) 48 | else: 49 | print('Unrecognized model name {}'.format(args.nlp_model)) 50 | 51 | 52 | if not os.path.exists(args.output_dir): 53 | os.makedirs(args.output_dir) 54 | 55 | save_layer_representations(nlp_features, args.nlp_model, args.sequence_length, args.output_dir) 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | -------------------------------------------------------------------------------- /predict_brain_from_nlp.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | 4 | from utils.utils import run_class_time_CV_fmri_crossval_ridge 5 | 6 | if __name__ == '__main__': 7 | parser = argparse.ArgumentParser() 8 | parser.add_argument("--subject", required=True) 9 | parser.add_argument("--nlp_feat_type", required=True) 10 | parser.add_argument("--nlp_feat_dir", required=True) 11 | parser.add_argument("--layer", type=int, required=False) 12 | parser.add_argument("--sequence_length", type=int, required=False) 13 | parser.add_argument("--output_dir", required=True) 14 | 15 | args = parser.parse_args() 16 | print(args) 17 | 18 | predict_feat_dict = {'nlp_feat_type':args.nlp_feat_type, 19 | 'nlp_feat_dir':args.nlp_feat_dir, 20 | 'layer':args.layer, 21 | 'seq_len':args.sequence_length} 22 | 23 | 24 | # loading fMRI data 25 | 26 | data = np.load('./data/fMRI/data_subject_{}.npy'.format(args.subject)) 27 | corrs_t, _, _, preds_t, test_t = run_class_time_CV_fmri_crossval_ridge(data, 28 | predict_feat_dict) 29 | 30 | fname = 'predict_{}_with_{}_layer_{}_len_{}'.format(args.subject, args.nlp_feat_type, args.layer, args.sequence_length) 31 | print('saving: {}'.format(args.output_dir + fname)) 32 | 33 | np.save(args.output_dir + fname + '.npy', {'corrs_t':corrs_t,'preds_t':preds_t,'test_t':test_t}) 34 | 35 | 36 | -------------------------------------------------------------------------------- /utils/bert_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from pytorch_pretrained_bert import BertTokenizer, BertModel 4 | import time as tm 5 | 6 | 
def get_bert_layer_representations(seq_len, text_array, remove_chars, word_ind_to_extract): 7 | 8 | model = BertModel.from_pretrained('bert-base-uncased') 9 | tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 10 | model.eval() 11 | 12 | # get the token embeddings 13 | token_embeddings = [] 14 | for word in text_array: 15 | current_token_embedding = get_bert_token_embeddings([word], tokenizer, model, remove_chars) 16 | token_embeddings.append(np.mean(current_token_embedding.detach().numpy(), 1)) 17 | 18 | # where to store layer-wise bert embeddings of particular length 19 | BERT = {} 20 | for layer in range(12): 21 | BERT[layer] = [] 22 | BERT[-1] = token_embeddings 23 | 24 | if word_ind_to_extract < 0: # the index is specified from the end of the array, so invert the index 25 | from_start_word_ind_to_extract = seq_len + 2 + word_ind_to_extract # add 2 for CLS + SEP tokens 26 | else: 27 | from_start_word_ind_to_extract = word_ind_to_extract 28 | 29 | start_time = tm.time() 30 | 31 | # before we've seen enough words to make up the sequence length, add the representation for the last word 'seq_len' times 32 | word_seq = text_array[:seq_len] 33 | for _ in range(seq_len): 34 | BERT = add_avrg_token_embedding_for_specific_word(word_seq, 35 | tokenizer, 36 | model, 37 | remove_chars, 38 | from_start_word_ind_to_extract, 39 | BERT) 40 | 41 | # then add the embedding of the last word in a sequence as the embedding for the sequence 42 | for end_curr_seq in range(seq_len, len(text_array)): 43 | word_seq = text_array[end_curr_seq-seq_len+1:end_curr_seq+1] 44 | BERT = add_avrg_token_embedding_for_specific_word(word_seq, 45 | tokenizer, 46 | model, 47 | remove_chars, 48 | from_start_word_ind_to_extract, 49 | BERT) 50 | 51 | if end_curr_seq % 100 == 0: 52 | print('Completed {} out of {}: {}'.format(end_curr_seq, len(text_array), tm.time()-start_time)) 53 | start_time = tm.time() 54 | 55 | print('Done extracting sequences of length {}'.format(seq_len)) 56 | 57 | return BERT 58 | 59 | # extracts layer representations for all words in words_in_array 60 | # encoded_layers: list of tensors, length num layers. each tensor of dims num tokens by num dimensions in representation 61 | # word_ind_to_token_ind: dict that maps from index in words_in_array to index in array of tokens when words_in_array is tokenized, 62 | # with keys: index of word, and values: array of indices of corresponding tokens when word is tokenized 63 | def predict_bert_embeddings(words_in_array, tokenizer, model, remove_chars): 64 | 65 | for word in words_in_array: 66 | if word in remove_chars: 67 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. 
Proceed with caution.') 68 | return -1 69 | 70 | n_seq_tokens = 0 71 | seq_tokens = [] 72 | 73 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 74 | 75 | for i,word in enumerate(words_in_array): 76 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 77 | 78 | if word in ['[CLS]', '[SEP]']: # [CLS] and [SEP] are already tokenized 79 | word_tokens = [word] 80 | else: 81 | word_tokens = tokenizer.tokenize(word) 82 | 83 | for token in word_tokens: 84 | if token not in remove_chars: # don't add any tokens that are in remove_chars 85 | seq_tokens.append(token) 86 | word_ind_to_token_ind[i].append(n_seq_tokens) 87 | n_seq_tokens = n_seq_tokens + 1 88 | 89 | # convert token to vocabulary indices 90 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 91 | tokens_tensor = torch.tensor([indexed_tokens]) 92 | 93 | encoded_layers, _ = model(tokens_tensor) 94 | pooled_output = np.squeeze(model.pooler(encoded_layers[-1]).detach().numpy()) 95 | 96 | return encoded_layers, word_ind_to_token_ind, pooled_output 97 | 98 | # add the embeddings for a specific word in the sequence 99 | # token_inds_to_avrg: indices of tokens in embeddings output to avrg 100 | def add_word_bert_embedding(bert_dict, embeddings_to_add, token_inds_to_avrg, specific_layer=-1): 101 | if specific_layer >= 0: # only add embeddings for one specified layer 102 | layer_embedding = embeddings_to_add[specific_layer] 103 | full_sequence_embedding = layer_embedding.detach().numpy() 104 | bert_dict[specific_layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) 105 | else: 106 | for layer, layer_embedding in enumerate(embeddings_to_add): 107 | full_sequence_embedding = layer_embedding.detach().numpy() 108 | bert_dict[layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) # avrg over all tokens for specified word 109 | return bert_dict 110 | 111 | # predicts representations for specific word in input word sequence, and adds to existing layer-wise dictionary 112 | # 113 | # word_seq: numpy array of words in input sequence 114 | # tokenizer: BERT tokenizer 115 | # model: BERT model 116 | # remove_chars: characters that should not be included in the represention when word_seq is tokenized 117 | # from_start_word_ind_to_extract: the index of the word whose features to extract, INDEXED FROM START OF WORD_SEQ 118 | # bert_dict: where to save the extracted embeddings 119 | def add_avrg_token_embedding_for_specific_word(word_seq,tokenizer,model,remove_chars,from_start_word_ind_to_extract,bert_dict): 120 | 121 | word_seq = ['[CLS]'] + list(word_seq) + ['[SEP]'] 122 | all_sequence_embeddings, word_ind_to_token_ind, _ = predict_bert_embeddings(word_seq, tokenizer, model, remove_chars) 123 | token_inds_to_avrg = word_ind_to_token_ind[from_start_word_ind_to_extract] 124 | bert_dict = add_word_bert_embedding(bert_dict, all_sequence_embeddings,token_inds_to_avrg) 125 | 126 | return bert_dict 127 | 128 | 129 | # get the BERT token embeddings 130 | def get_bert_token_embeddings(words_in_array, tokenizer, model, remove_chars): 131 | for word in words_in_array: 132 | if word in remove_chars: 133 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. 
Proceed with caution.') 134 | return -1 135 | 136 | n_seq_tokens = 0 137 | seq_tokens = [] 138 | 139 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 140 | 141 | for i,word in enumerate(words_in_array): 142 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 143 | 144 | if word in ['[CLS]', '[SEP]']: # [CLS] and [SEP] are already tokenized 145 | word_tokens = [word] 146 | else: 147 | word_tokens = tokenizer.tokenize(word) 148 | 149 | for token in word_tokens: 150 | if token not in remove_chars: # don't add any tokens that are in remove_chars 151 | seq_tokens.append(token) 152 | word_ind_to_token_ind[i].append(n_seq_tokens) 153 | n_seq_tokens = n_seq_tokens + 1 154 | 155 | # convert token to vocabulary indices 156 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 157 | tokens_tensor = torch.tensor([indexed_tokens]) 158 | 159 | token_embeddings = model.embeddings.forward(tokens_tensor) 160 | 161 | return token_embeddings 162 | 163 | 164 | # add the embeddings for all individual words 165 | # specific_layer specifies only one layer to add embeddings from 166 | def add_all_bert_embeddings(bert_dict, embeddings_to_add, specific_layer=-1): 167 | if specific_layer >= 0: 168 | layer_embedding = embeddings_to_add[specific_layer] 169 | seq_len = layer_embedding.shape[1] 170 | full_sequence_embedding = layer_embedding.detach().numpy() 171 | 172 | for word in range(seq_len): 173 | bert_dict[specific_layer].append(full_sequence_embedding[0,word,:]) 174 | else: 175 | for layer, layer_embedding in enumerate(embeddings_to_add): 176 | seq_len = layer_embedding.shape[1] 177 | full_sequence_embedding = layer_embedding.detach().numpy() 178 | 179 | for word in range(seq_len): 180 | bert_dict[layer].append(full_sequence_embedding[0,word,:]) 181 | return bert_dict 182 | 183 | 184 | # add the embeddings for only the last word in the sequence that is not [SEP] token 185 | def add_last_nonsep_bert_embedding(bert_dict, embeddings_to_add, specific_layer=-1): 186 | if specific_layer >= 0: 187 | layer_embedding = embeddings_to_add[specific_layer] 188 | full_sequence_embedding = layer_embedding.detach().numpy() 189 | 190 | bert_dict[specific_layer].append(full_sequence_embedding[0,-2,:]) 191 | else: 192 | for layer, layer_embedding in enumerate(embeddings_to_add): 193 | full_sequence_embedding = layer_embedding.detach().numpy() 194 | 195 | bert_dict[layer].append(full_sequence_embedding[0,-2,:]) 196 | return bert_dict 197 | 198 | # add the CLS token embeddings ([CLS] is the first token in each string) 199 | def add_cls_bert_embedding(bert_dict, embeddings_to_add, specific_layer=-1): 200 | if specific_layer >= 0: 201 | layer_embedding = embeddings_to_add[specific_layer] 202 | full_sequence_embedding = layer_embedding.detach().numpy() 203 | 204 | bert_dict[specific_layer].append(full_sequence_embedding[0,0,:]) 205 | else: 206 | for layer, layer_embedding in enumerate(embeddings_to_add): 207 | full_sequence_embedding = layer_embedding.detach().numpy() 208 | 209 | bert_dict[layer].append(full_sequence_embedding[0,0,:]) 210 | return bert_dict 211 | -------------------------------------------------------------------------------- /utils/elmo_utils.py: -------------------------------------------------------------------------------- 1 | from allennlp.commands.elmo import ElmoEmbedder 2 | from allennlp.data.tokenizers.word_tokenizer import WordTokenizer 3 | 4 | import numpy as np 5 | import torch 6 | import time as tm 7 | 8 | 
def get_elmo_layer_representations(seq_len, text_array, remove_chars, word_ind_to_extract): 9 | 10 | model = ElmoEmbedder() 11 | tokenizer = WordTokenizer() 12 | 13 | # where to store layer-wise elmo embeddings of particular length 14 | elmo = {} 15 | for layer in range(-1,2): 16 | elmo[layer] = [] 17 | 18 | if word_ind_to_extract < 0: # the index is specified from the end of the array, so invert the index 19 | from_start_word_ind_to_extract = seq_len + word_ind_to_extract 20 | else: 21 | from_start_word_ind_to_extract = word_ind_to_extract 22 | 23 | start_time = tm.time() 24 | 25 | # before we've seen enough words to make up the sequence length, add the representation for the last word 'seq_len' times 26 | word_seq = text_array[:seq_len] 27 | for _ in range(seq_len): 28 | elmo = add_avrg_token_embedding_for_specific_word(word_seq, 29 | tokenizer, 30 | model, 31 | remove_chars, 32 | from_start_word_ind_to_extract, 33 | elmo) 34 | 35 | # then add the embedding of the last word in a sequence as the embedding for the sequence 36 | for end_curr_seq in range(seq_len, len(text_array)): 37 | word_seq = text_array[end_curr_seq-seq_len+1:end_curr_seq+1] 38 | elmo = add_avrg_token_embedding_for_specific_word(word_seq, 39 | tokenizer, 40 | model, 41 | remove_chars, 42 | from_start_word_ind_to_extract, 43 | elmo) 44 | 45 | if end_curr_seq % 100 == 0: 46 | print('Completed {} out of {}: {}'.format(end_curr_seq, len(text_array), tm.time()-start_time)) 47 | start_time = tm.time() 48 | 49 | print('Done extracting sequences of length {}'.format(seq_len)) 50 | 51 | return elmo 52 | 53 | 54 | def predict_elmo_embeddings(words_in_array, tokenizer, model, remove_chars): 55 | 56 | n_seq_tokens = 0 57 | seq_tokens = [] 58 | 59 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 60 | 61 | for i,word in enumerate(words_in_array): 62 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 63 | 64 | word_tokens = tokenizer.tokenize(str(word)) 65 | 66 | for token in word_tokens: 67 | if token not in remove_chars: # don't add any tokens that are in remove_chars 68 | seq_tokens.append(str(token)) 69 | word_ind_to_token_ind[i].append(n_seq_tokens) 70 | n_seq_tokens = n_seq_tokens + 1 71 | 72 | 73 | encoded_layers = model.embed_sentence(seq_tokens) 74 | 75 | return encoded_layers, word_ind_to_token_ind 76 | 77 | 78 | # predicts representations for specific word in input word sequence, and adds to existing layer-wise dictionary 79 | # 80 | # word_seq: numpy array of words in input sequence 81 | # remove_chars: characters that should not be included in the represention when word_seq is tokenized 82 | # from_start_word_ind_to_extract: the index of the word whose features to extract, INDEXED FROM START OF WORD_SEQ 83 | # model_dict: where to save the extracted embeddings 84 | def add_avrg_token_embedding_for_specific_word(word_seq,tokenizer,model,remove_chars,from_start_word_ind_to_extract,model_dict): 85 | 86 | word_seq = list(word_seq) 87 | all_sequence_embeddings, word_ind_to_token_ind = predict_elmo_embeddings(word_seq, tokenizer, model, remove_chars) 88 | token_inds_to_avrg = word_ind_to_token_ind[from_start_word_ind_to_extract] 89 | model_dict = add_word_elmo_embedding(model_dict, all_sequence_embeddings,token_inds_to_avrg) 90 | 91 | return model_dict 92 | 93 | # add the embeddings for a specific word in the sequence 94 | # token_inds_to_avrg: indices of tokens in embeddings output to avrg 95 | def add_word_elmo_embedding(elmo_dict, 
embeddings_to_add, token_inds_to_avrg): 96 | for layer in elmo_dict.keys(): 97 | elmo_dict[layer].append(np.mean(embeddings_to_add[layer+1,token_inds_to_avrg,:],0)) # avrg over all tokens for specified word 98 | return elmo_dict -------------------------------------------------------------------------------- /utils/ridge_tools.py: -------------------------------------------------------------------------------- 1 | from numpy.linalg import inv, svd 2 | import numpy as np 3 | from sklearn.model_selection import KFold 4 | from sklearn.linear_model import Ridge, RidgeCV 5 | import time 6 | from scipy.stats import zscore 7 | 8 | def corr(X,Y): 9 | return np.mean(zscore(X)*zscore(Y),0) 10 | 11 | def R2(Pred,Real): 12 | SSres = np.mean((Real-Pred)**2,0) 13 | SStot = np.var(Real,0) 14 | return np.nan_to_num(1-SSres/SStot) 15 | 16 | def R2r(Pred,Real): 17 | R2rs = R2(Pred,Real) 18 | ind_neg = R2rs<0 19 | R2rs = np.abs(R2rs) 20 | R2rs = np.sqrt(R2rs) 21 | R2rs[ind_neg] *= - 1 22 | return R2rs 23 | 24 | def ridge(X,Y,lmbda): 25 | return np.dot(inv(X.T.dot(X)+lmbda*np.eye(X.shape[1])),X.T.dot(Y)) 26 | 27 | def ridge_by_lambda(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 28 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 29 | for idx,lmbda in enumerate(lambdas): 30 | weights = ridge(X,Y,lmbda) 31 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 32 | return error 33 | 34 | def ridge_sk(X,Y,lmbda): 35 | rd = Ridge(alpha = lmbda) 36 | rd.fit(X,Y) 37 | return rd.coef_.T 38 | 39 | def ridgeCV_sk(X,Y,lmbdas): 40 | rd = RidgeCV(alphas = lmbdas) 41 | rd.fit(X,Y) 42 | return rd.coef_.T 43 | 44 | def ridge_by_lambda_sk(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 45 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 46 | for idx,lmbda in enumerate(lambdas): 47 | weights = ridge_sk(X,Y,lmbda) 48 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 49 | return error 50 | 51 | def ridge_svd(X,Y,lmbda): 52 | U, s, Vt = svd(X, full_matrices=False) 53 | d = s / (s** 2 + lmbda) 54 | return np.dot(Vt,np.diag(d).dot(U.T.dot(Y))) 55 | 56 | def ridge_by_lambda_svd(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 57 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 58 | U, s, Vt = svd(X, full_matrices=False) 59 | for idx,lmbda in enumerate(lambdas): 60 | d = s / (s** 2 + lmbda) 61 | weights = np.dot(Vt,np.diag(d).dot(U.T.dot(Y))) 62 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 63 | return error 64 | 65 | def kernel_ridge(X,Y,lmbda): 66 | return np.dot(X.T.dot(inv(X.dot(X.T)+lmbda*np.eye(X.shape[0]))),Y) 67 | 68 | def kernel_ridge_by_lambda(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 69 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 70 | for idx,lmbda in enumerate(lambdas): 71 | weights = kernel_ridge(X,Y,lmbda) 72 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 73 | return error 74 | 75 | def kernel_ridge_svd(X,Y,lmbda): 76 | U, s, Vt = svd(X.T, full_matrices=False) 77 | d = s / (s** 2 + lmbda) 78 | return np.dot(np.dot(U,np.diag(d).dot(Vt)),Y) 79 | 80 | def kernel_ridge_by_lambda_svd(X, Y, Xval, Yval, lambdas=np.array([0.1,1,10,100,1000])): 81 | error = np.zeros((lambdas.shape[0],Y.shape[1])) 82 | U, s, Vt = svd(X.T, full_matrices=False) 83 | for idx,lmbda in enumerate(lambdas): 84 | d = s / (s** 2 + lmbda) 85 | weights = np.dot(np.dot(U,np.diag(d).dot(Vt)),Y) 86 | error[idx] = 1 - R2(np.dot(Xval,weights),Yval) 87 | return error 88 | 89 | def cross_val_ridge(train_features,train_data, n_splits = 10, 90 | lambdas = np.array([10**i for i in range(-6,10)]), 91 | method = 
'plain', 92 | do_plot = False): 93 | 94 | ridge_1 = dict(plain = ridge_by_lambda, 95 | svd = ridge_by_lambda_svd, 96 | kernel_ridge = kernel_ridge_by_lambda, 97 | kernel_ridge_svd = kernel_ridge_by_lambda_svd, 98 | ridge_sk = ridge_by_lambda_sk)[method] 99 | ridge_2 = dict(plain = ridge, 100 | svd = ridge_svd, 101 | kernel_ridge = kernel_ridge, 102 | kernel_ridge_svd = kernel_ridge_svd, 103 | ridge_sk = ridge_sk)[method] 104 | 105 | n_voxels = train_data.shape[1] 106 | nL = lambdas.shape[0] 107 | r_cv = np.zeros((nL, train_data.shape[1])) 108 | 109 | kf = KFold(n_splits=n_splits) 110 | start_t = time.time() 111 | for icv, (trn, val) in enumerate(kf.split(train_data)): 112 | #print('ntrain = {}'.format(train_features[trn].shape[0])) 113 | cost = ridge_1(train_features[trn],train_data[trn], 114 | train_features[val],train_data[val], 115 | lambdas=lambdas) 116 | if do_plot: 117 | import matplotlib.pyplot as plt 118 | plt.figure() 119 | plt.imshow(cost,aspect = 'auto') 120 | r_cv += cost 121 | #if icv%3 ==0: 122 | # print(icv) 123 | #print('average iteration length {}'.format((time.time()-start_t)/(icv+1))) 124 | if do_plot: 125 | plt.figure() 126 | plt.imshow(r_cv,aspect='auto',cmap = 'RdBu_r'); 127 | 128 | argmin_lambda = np.argmin(r_cv,axis = 0) 129 | weights = np.zeros((train_features.shape[1],train_data.shape[1])) 130 | for idx_lambda in range(lambdas.shape[0]): # this is much faster than iterating over voxels! 131 | idx_vox = argmin_lambda == idx_lambda 132 | weights[:,idx_vox] = ridge_2(train_features, train_data[:,idx_vox],lambdas[idx_lambda]) 133 | if do_plot: 134 | plt.figure() 135 | plt.imshow(weights,aspect='auto',cmap = 'RdBu_r',vmin = -0.5,vmax = 0.5); 136 | 137 | return weights, np.array([lambdas[i] for i in argmin_lambda]) 138 | -------------------------------------------------------------------------------- /utils/use_utils.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow_hub as hub 3 | import numpy as np 4 | 5 | 6 | def clean_word(word, remove_chars): 7 | word2 = word[:] 8 | while len(word2)>0 and word2[0] in remove_chars: 9 | word2 = word2[1:] 10 | while len(word2)>0 and word2[-1] in remove_chars: 11 | word2 = word2[:-1] 12 | return word2 13 | 14 | def get_use_layer_representations(seq_len, text_array, remove_chars): 15 | 16 | module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"] 17 | 18 | # Import the Universal Sentence Encoder's TF Hub module 19 | embed = hub.Module(module_url) 20 | 21 | # Reduce logging output. 
22 | tf.logging.set_verbosity(tf.logging.ERROR) 23 | 24 | clean_text_array = [clean_word(w,remove_chars) for w in text_array] 25 | n_labels = len(clean_text_array) 26 | 27 | seq_strings = [" ".join(clean_text_array[i-seq_len:i]) for i in range(20,n_labels)] 28 | 29 | with tf.Session() as session: 30 | session.run([tf.global_variables_initializer(), tf.tables_initializer()]) 31 | 32 | embeddings = session.run(embed(seq_strings)) 33 | sequence = np.array(embeddings) 34 | 35 | USE = {} 36 | USE[-1] = [np.zeros((20,sequence.shape[1])),sequence] 37 | 38 | return USE 39 | -------------------------------------------------------------------------------- /utils/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.decomposition import PCA 3 | from scipy.stats import zscore 4 | import time 5 | import csv 6 | import os 7 | import nibabel 8 | from sklearn.metrics.pairwise import euclidean_distances 9 | from scipy.ndimage.filters import gaussian_filter 10 | 11 | from utils.ridge_tools import cross_val_ridge, corr 12 | import time as tm 13 | 14 | 15 | def load_transpose_zscore(file): 16 | dat = nibabel.load(file).get_data() 17 | dat = dat.T 18 | return zscore(dat,axis = 0) 19 | 20 | def smooth_run_not_masked(data,smooth_factor): 21 | smoothed_data = np.zeros_like(data) 22 | for i,d in enumerate(data): 23 | smoothed_data[i] = gaussian_filter(data[i], sigma=smooth_factor, order=0, output=None, 24 | mode='reflect', cval=0.0, truncate=4.0) 25 | return smoothed_data 26 | 27 | def delay_one(mat, d): 28 | # delays a matrix by a delay d. Positive d ==> row t has row t-d 29 | new_mat = np.zeros_like(mat) 30 | if d>0: 31 | new_mat[d:] = mat[:-d] 32 | elif d<0: 33 | new_mat[:d] = mat[-d:] 34 | else: 35 | new_mat = mat 36 | return new_mat 37 | 38 | def delay_mat(mat, delays): 39 | # delays a matrix by a set of delays d. 40 | # a row t in the returned matrix has the concatenated: 41 | # row(t-delays[0],t-delays[1]...t-delays[last] ) 42 | new_mat = np.concatenate([delay_one(mat, d) for d in delays],axis = -1) 43 | return new_mat 44 | 45 | # train/test is the full NLP feature 46 | # train/test_pca is the NLP feature reduced to 10 dimensions via PCA that has been fit on the training data 47 | # feat_dir is the directory where the NLP features are stored 48 | # train_indicator is an array of 0s and 1s indicating whether the word at this index is in the training set 49 | def get_nlp_features_fixed_length(layer, seq_len, feat_type, feat_dir, train_indicator, SKIP_WORDS=20, END_WORDS=5176): 50 | 51 | loaded = np.load(feat_dir + feat_type + '_length_'+str(seq_len)+ '_layer_' + str(layer) + '.npy') 52 | if feat_type == 'elmo': 53 | train = loaded[SKIP_WORDS:END_WORDS,:][:,:512][train_indicator] # only forward LSTM 54 | test = loaded[SKIP_WORDS:END_WORDS,:][:,:512][~train_indicator] # only forward LSTM 55 | elif feat_type == 'bert' or feat_type == 'transformer_xl' or feat_type == 'use': 56 | train = loaded[SKIP_WORDS:END_WORDS,:][train_indicator] 57 | test = loaded[SKIP_WORDS:END_WORDS,:][~train_indicator] 58 | else: 59 | print('Unrecognized NLP feature type {}. 
Available options elmo, bert, transformer_xl, use'.format(feat_type)) 60 | 61 | pca = PCA(n_components=10, svd_solver='full') 62 | pca.fit(train) 63 | train_pca = pca.transform(train) 64 | test_pca = pca.transform(test) 65 | 66 | return train, test, train_pca, test_pca 67 | 68 | def CV_ind(n, n_folds): 69 | ind = np.zeros((n)) 70 | n_items = int(np.floor(n/n_folds)) 71 | for i in range(0,n_folds -1): 72 | ind[i*n_items:(i+1)*n_items] = i 73 | ind[(n_folds-1)*n_items:] = (n_folds-1) 74 | return ind 75 | 76 | def TR_to_word_CV_ind(TR_train_indicator,SKIP_WORDS=20,END_WORDS=5176): 77 | time = np.load('./data/fMRI/time_fmri.npy') 78 | runs = np.load('./data/fMRI/runs_fmri.npy') 79 | time_words = np.load('./data/fMRI/time_words_fmri.npy') 80 | time_words = time_words[SKIP_WORDS:END_WORDS] 81 | 82 | word_train_indicator = np.zeros([len(time_words)], dtype=bool) 83 | words_id = np.zeros([len(time_words)],dtype=int) 84 | # w=find what TR each word belongs to 85 | for i in range(len(time_words)): 86 | words_id[i] = np.where(time_words[i]> time)[0][-1] 87 | 88 | if words_id[i] <= len(runs) - 15: 89 | offset = runs[int(words_id[i])]*20 + (runs[int(words_id[i])]-1)*15 90 | if TR_train_indicator[int(words_id[i])-offset-1] == 1: 91 | word_train_indicator[i] = True 92 | return word_train_indicator 93 | 94 | 95 | def prepare_fmri_features(train_features, test_features, word_train_indicator, TR_train_indicator, SKIP_WORDS=20, END_WORDS=5176): 96 | 97 | time = np.load('./data/fMRI/time_fmri.npy') 98 | runs = np.load('./data/fMRI/runs_fmri.npy') 99 | time_words = np.load('./data/fMRI/time_words_fmri.npy') 100 | time_words = time_words[SKIP_WORDS:END_WORDS] 101 | 102 | words_id = np.zeros([len(time_words)]) 103 | # w=find what TR each word belongs to 104 | for i in range(len(time_words)): 105 | words_id[i] = np.where(time_words[i]> time)[0][-1] 106 | 107 | all_features = np.zeros([time_words.shape[0], train_features.shape[1]]) 108 | all_features[word_train_indicator] = train_features 109 | all_features[~word_train_indicator] = test_features 110 | 111 | p = all_features.shape[1] 112 | tmp = np.zeros([time.shape[0], p]) 113 | for i in range(time.shape[0]): 114 | tmp[i] = np.mean(all_features[(words_id<=i)*(words_id>i-1)],0) 115 | tmp = delay_mat(tmp, np.arange(1,5)) 116 | 117 | # remove the edges of each run 118 | tmp = np.vstack([zscore(tmp[runs==i][20:-15]) for i in range(1,5)]) 119 | tmp = np.nan_to_num(tmp) 120 | 121 | return tmp[TR_train_indicator], tmp[~TR_train_indicator] 122 | 123 | 124 | 125 | def run_class_time_CV_fmri_crossval_ridge(data, predict_feat_dict, 126 | regress_feat_names_list = [],method = 'kernel_ridge', 127 | lambdas = np.array([0.1,1,10,100,1000]), 128 | detrend = False, n_folds = 4, skip=5): 129 | 130 | nlp_feat_type = predict_feat_dict['nlp_feat_type'] 131 | feat_dir = predict_feat_dict['nlp_feat_dir'] 132 | layer = predict_feat_dict['layer'] 133 | seq_len = predict_feat_dict['seq_len'] 134 | 135 | 136 | n_words = data.shape[0] 137 | n_voxels = data.shape[1] 138 | 139 | ind = CV_ind(n_words, n_folds=n_folds) 140 | 141 | corrs = np.zeros((n_folds, n_voxels)) 142 | acc = np.zeros((n_folds, n_voxels)) 143 | acc_std = np.zeros((n_folds, n_voxels)) 144 | 145 | all_test_data = [] 146 | all_preds = [] 147 | 148 | 149 | for ind_num in range(n_folds): 150 | train_ind = ind!=ind_num 151 | test_ind = ind==ind_num 152 | 153 | word_CV_ind = TR_to_word_CV_ind(train_ind) 154 | 155 | _,_,tmp_train_features,tmp_test_features = get_nlp_features_fixed_length(layer, seq_len, nlp_feat_type, feat_dir, 
word_CV_ind) 156 | train_features,test_features = prepare_fmri_features(tmp_train_features, tmp_test_features, word_CV_ind, train_ind) 157 | 158 | # split data 159 | train_data = data[train_ind] 160 | test_data = data[test_ind] 161 | 162 | # skip TRs between train and test data 163 | if ind_num == 0: # just remove from front end 164 | train_data = train_data[skip:,:] 165 | train_features = train_features[skip:,:] 166 | elif ind_num == n_folds-1: # just remove from back end 167 | train_data = train_data[:-skip,:] 168 | train_features = train_features[:-skip,:] 169 | else: 170 | test_data = test_data[skip:-skip,:] 171 | test_features = test_features[skip:-skip,:] 172 | 173 | # normalize data 174 | train_data = np.nan_to_num(zscore(np.nan_to_num(train_data))) 175 | test_data = np.nan_to_num(zscore(np.nan_to_num(test_data))) 176 | all_test_data.append(test_data) 177 | 178 | train_features = np.nan_to_num(zscore(train_features)) 179 | test_features = np.nan_to_num(zscore(test_features)) 180 | 181 | start_time = tm.time() 182 | weights, chosen_lambdas = cross_val_ridge(train_features,train_data, n_splits = 10, lambdas = np.array([10**i for i in range(-6,10)]), method = 'plain',do_plot = False) 183 | 184 | preds = np.dot(test_features, weights) 185 | corrs[ind_num,:] = corr(preds,test_data) 186 | all_preds.append(preds) 187 | 188 | print('fold {} completed, took {} seconds'.format(ind_num, tm.time()-start_time)) 189 | del weights 190 | 191 | return corrs, acc, acc_std, np.vstack(all_preds), np.vstack(all_test_data) 192 | 193 | def binary_classify_neighborhoods(Ypred, Y, n_class=20, nSample = 1000,pair_samples = [],neighborhoods=[]): 194 | # n_class = how many words to classify at once 195 | # nSample = how many words to classify 196 | 197 | voxels = Y.shape[-1] 198 | neighborhoods = np.asarray(neighborhoods, dtype=int) 199 | 200 | import time as tm 201 | 202 | acc = np.full([nSample, Y.shape[-1]], np.nan) 203 | acc2 = np.full([nSample, Y.shape[-1]], np.nan) 204 | test_word_inds = [] 205 | 206 | if len(pair_samples)>0: 207 | Ypred2 = Ypred[pair_samples>=0] 208 | Y2 = Y[pair_samples>=0] 209 | pair_samples2 = pair_samples[pair_samples>=0] 210 | else: 211 | Ypred2 = Ypred 212 | Y2 = Y 213 | pair_samples2 = pair_samples 214 | n = Y2.shape[0] 215 | start_time = tm.time() 216 | for idx in range(nSample): 217 | 218 | idx_real = np.random.choice(n, n_class) 219 | 220 | sample_real = Y2[idx_real] 221 | sample_pred_correct = Ypred2[idx_real] 222 | 223 | if len(pair_samples2) == 0: 224 | idx_wrong = np.random.choice(n, n_class) 225 | else: 226 | idx_wrong = sample_same_but_different(idx_real,pair_samples2) 227 | sample_pred_incorrect = Ypred2[idx_wrong] 228 | 229 | #print(sample_pred_incorrect.shape) 230 | 231 | # compute distances within neighborhood 232 | dist_correct = np.sum((sample_real - sample_pred_correct)**2,0) 233 | dist_incorrect = np.sum((sample_real - sample_pred_incorrect)**2,0) 234 | 235 | neighborhood_dist_correct = np.array([np.sum(dist_correct[neighborhoods[v,neighborhoods[v,:]>-1]]) for v in range(voxels)]) 236 | neighborhood_dist_incorrect = np.array([np.sum(dist_incorrect[neighborhoods[v,neighborhoods[v,:]>-1]]) for v in range(voxels)]) 237 | 238 | 239 | acc[idx,:] = (neighborhood_dist_correct < neighborhood_dist_incorrect)*1.0 + (neighborhood_dist_correct == neighborhood_dist_incorrect)*0.5 240 | 241 | test_word_inds.append(idx_real) 242 | print('Classification for fold done. 
Took {} seconds'.format(tm.time()-start_time)) 243 | return np.nanmean(acc,0), np.nanstd(acc,0), acc, np.array(test_word_inds) 244 | -------------------------------------------------------------------------------- /utils/xl_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import time as tm 4 | 5 | from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel 6 | 7 | 8 | def get_xl_layer_representations(seq_len, text_array, remove_chars, word_ind_to_extract): 9 | 10 | model = TransfoXLModel.from_pretrained('transfo-xl-wt103') 11 | tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103') 12 | model.eval() 13 | 14 | 15 | # get the token embeddings 16 | token_embeddings = [] 17 | for word in text_array: 18 | current_token_embedding = get_xl_token_embeddings([word], tokenizer, model, remove_chars) 19 | token_embeddings.append(np.mean(current_token_embedding.detach().numpy(), 1)) 20 | 21 | # where to store layer-wise xl embeddings of particular length 22 | XL = {} 23 | for layer in range(19): 24 | XL[layer] = [] 25 | XL[-1] = token_embeddings 26 | 27 | if word_ind_to_extract < 0: # the index is specified from the end of the array, so invert the index 28 | from_start_word_ind_to_extract = seq_len + word_ind_to_extract 29 | else: 30 | from_start_word_ind_to_extract = word_ind_to_extract 31 | 32 | start_time = tm.time() 33 | 34 | # before we've seen enough words to make up the sequence length, add the representation for the last word 'seq_len' times 35 | word_seq = text_array[:seq_len] 36 | for _ in range(seq_len): 37 | XL = add_avrg_token_embedding_for_specific_word(word_seq, 38 | tokenizer, 39 | model, 40 | remove_chars, 41 | from_start_word_ind_to_extract, 42 | XL) 43 | 44 | # then add the embedding of the last word in a sequence as the embedding for the sequence 45 | for end_curr_seq in range(seq_len, len(text_array)): 46 | word_seq = text_array[end_curr_seq-seq_len+1:end_curr_seq+1] 47 | XL = add_avrg_token_embedding_for_specific_word(word_seq, 48 | tokenizer, 49 | model, 50 | remove_chars, 51 | from_start_word_ind_to_extract, 52 | XL) 53 | 54 | if end_curr_seq % 100 == 0: 55 | print('Completed {} out of {}: {}'.format(end_curr_seq, len(text_array), tm.time()-start_time)) 56 | start_time = tm.time() 57 | 58 | print('Done extracting sequences of length {}'.format(seq_len)) 59 | 60 | return XL 61 | 62 | def predict_xl_embeddings(words_in_array, tokenizer, model, remove_chars): 63 | for word in words_in_array: 64 | if word in remove_chars: 65 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. 
Proceed with caution.') 66 | return -1 67 | 68 | n_seq_tokens = 0 69 | seq_tokens = [] 70 | 71 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 72 | 73 | for i,word in enumerate(words_in_array): 74 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 75 | word_tokens = tokenizer.tokenize(word) 76 | for token in word_tokens: 77 | if token not in remove_chars: # don't add any tokens that are in remove_chars 78 | seq_tokens.append(token) 79 | word_ind_to_token_ind[i].append(n_seq_tokens) 80 | n_seq_tokens = n_seq_tokens + 1 81 | 82 | # convert token to vocabulary indices 83 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 84 | tokens_tensor = torch.tensor([indexed_tokens]) 85 | 86 | hidden_states, mems = model(tokens_tensor) 87 | seq_length = hidden_states.size(1) 88 | lower_hidden_states = list(t[-seq_length:, ...].transpose(0, 1) for t in mems) 89 | all_hidden_states = lower_hidden_states + [hidden_states] 90 | 91 | return all_hidden_states, word_ind_to_token_ind 92 | 93 | # get the XL token embeddings 94 | def get_xl_token_embeddings(words_in_array, tokenizer, model, remove_chars): 95 | for word in words_in_array: 96 | if word in remove_chars: 97 | print('An input word is also in remove_chars. This word will be removed and may lead to misalignment. Proceed with caution.') 98 | return -1 99 | 100 | n_seq_tokens = 0 101 | seq_tokens = [] 102 | 103 | word_ind_to_token_ind = {} # dict that maps index of word in words_in_array to index of tokens in seq_tokens 104 | 105 | for i,word in enumerate(words_in_array): 106 | word_ind_to_token_ind[i] = [] # initialize token indices array for current word 107 | word_tokens = tokenizer.tokenize(word) 108 | for token in word_tokens: 109 | if token not in remove_chars: # don't add any tokens that are in remove_chars 110 | seq_tokens.append(token) 111 | word_ind_to_token_ind[i].append(n_seq_tokens) 112 | n_seq_tokens = n_seq_tokens + 1 113 | 114 | # Convert token to vocabulary indices 115 | indexed_tokens = tokenizer.convert_tokens_to_ids(seq_tokens) 116 | 117 | # Convert inputs to PyTorch tensors 118 | tokens_tensor = torch.tensor([indexed_tokens]) 119 | 120 | token_embeddings = model.word_emb.forward(tokens_tensor) 121 | 122 | return token_embeddings 123 | 124 | # predicts representations for specific word in input word sequence, and adds to existing layer-wise dictionary 125 | # 126 | # word_seq: numpy array of words in input sequence 127 | # remove_chars: characters that should not be included in the represention when word_seq is tokenized 128 | # from_start_word_ind_to_extract: the index of the word whose features to extract, INDEXED FROM START OF WORD_SEQ 129 | # model_dict: where to save the extracted embeddings 130 | def add_avrg_token_embedding_for_specific_word(word_seq,tokenizer,model,remove_chars,from_start_word_ind_to_extract,model_dict): 131 | 132 | word_seq = list(word_seq) 133 | all_sequence_embeddings, word_ind_to_token_ind = predict_xl_embeddings(word_seq, tokenizer, model, remove_chars) 134 | token_inds_to_avrg = word_ind_to_token_ind[from_start_word_ind_to_extract] 135 | model_dict = add_word_xl_embedding(model_dict, all_sequence_embeddings,token_inds_to_avrg) 136 | 137 | return model_dict 138 | 139 | # add the embeddings for a specific word in the sequence 140 | # token_inds_to_avrg: indices of tokens in embeddings output to avrg 141 | def add_word_xl_embedding(model_dict, embeddings_to_add, token_inds_to_avrg, specific_layer=-1): 142 
| if specific_layer >= 0: # only add embeddings for one specified layer 143 | layer_embedding = embeddings_to_add[specific_layer] 144 | full_sequence_embedding = layer_embedding.detach().numpy() 145 | model_dict[specific_layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) 146 | else: 147 | for layer, layer_embedding in enumerate(embeddings_to_add): 148 | full_sequence_embedding = layer_embedding.detach().numpy() 149 | model_dict[layer].append(np.mean(full_sequence_embedding[0,token_inds_to_avrg,:],0)) # avrg over all tokens for specified word 150 | return model_dict 151 | --------------------------------------------------------------------------------