├── .gitignore ├── LICENSE ├── README.md ├── data ├── gigaword │ ├── README.md │ ├── input.txt │ └── task1_ref0.txt └── sentence_compression │ ├── README.md │ ├── eval_src_1000_unk.txt │ └── eval_tgt_1000_unk.txt ├── figure1.png ├── lm_lstm ├── dataload.py ├── main.py ├── model.py ├── testppl.py ├── train.py └── utils.py ├── results_elmo_giga ├── README.md ├── smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b1.0_single.txt └── smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_best_f1.txt ├── results_elmo_sc ├── README.md ├── smry_eval_src_1000_unk_Ks10_clust1_ELcat_eosavg0_n1_a0.1_b1.0_single.txt └── smry_eval_src_1000_unk_Ks10_clust1_ELcat_eosavg0_n1_a0.1_best_f1.txt └── uss ├── beam_search.py ├── elmo_lstm_forward.py ├── elmo_sequential_embedder.py ├── gpt2_sequential_embedder.py ├── lm_subvocab.py ├── pre_closetables.py ├── pre_word_list.py ├── sim_embed_score.py ├── sim_token_match.py ├── summary_search_elmo.py ├── summary_search_gpt2.py ├── summary_select_eval.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | *.idea 3 | 4 | lm_lstm_models/* 5 | 6 | voctbls/* 7 | results_gpt2/* 8 | results_elmo_giga/* 9 | results_elmo_sc/* 10 | *.sh 11 | *.tar.gz 12 | 13 | !data/gigaword/README.md 14 | !data/gigaword/input.txt 15 | !data/gigaword/input_unk.txt 16 | !data/gigaword/task1_ref0.txt 17 | !data/sentence_compression/eval_src_1000_unk.txt 18 | !data/sentence_compression/eval_tgt_1000_unk.txt 19 | 20 | !results_elmo_giga/README.md 21 | !results_elmo_giga/smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b1.0_single.txt 22 | !results_elmo_sc/README.md 23 | !results_elmo_sc/smry_eval_src_1000_unk_Ks10_clust1_ELcat_eosavg0_n1_a0.1_b1.0_single.txt 24 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 jzhou316 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Unsupervised Sentence Summarization 2 | 3 | [![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE) 4 | 5 | Unsupervised sentence summarization by contextual matching. 
6 | 7 | This is the code for the paper: \ 8 | [Simple Unsupervised Summarization by Contextual Matching](https://arxiv.org/pdf/1907.13337.pdf) (ACL 2019) \ 9 | Jiawei Zhou, Alexander Rush 10 | 11 | 12 | 13 | 14 | 15 | ## Overview 16 | 17 | Using contextual word embeddings (e.g. a pre-trained ELMo model) along with a language model trained on summary-style sentences, we are able to produce sentence-level summaries in an unsupervised way, without being exposed to any paired data. 18 | 19 | Summaries are generated by beam search to maximize a product-of-experts score that combines a contextual matching model (relying on pre-trained left-contextual embeddings) and a language fluency model (based on the summary-domain-specific language model). This works for both abstractive and extractive sentence-level summarization. 20 | 21 | **Note:** since our generation process is left-to-right, we used only the forward model in ELMo. We also tested the small version of GPT-2 after its release, but found it did not perform as well as ELMo in our setup. 22 | 23 | 24 | ## Dependencies 25 | 26 | The code was developed and tested with the following libraries: 27 | - python 3.6 28 | - PyTorch 0.4.1 29 | - allennlp 0.5.1 30 | 31 | For Rouge evaluation, we used [files2rouge](https://github.com/pltrdy/files2rouge). 32 | 33 | 34 | ## Datasets & Summary Results & Pre-trained Language Models 35 | 36 | | Data & Task | Test Set &
Unsupervised Model Output | Summary LM & Vocabulary | Full Dataset | 37 | |:---:|:---:|:---:|:---:| 38 | | English Gigaword
(abstractive summarization) | [test data](./data/gigaword)
[model output](./results_elmo_giga) | [language model](https://drive.google.com/file/d/1iF0tLvoo74-o22-1jUjMTrLwK948sMKp/view?usp=sharing) | [full data](https://github.com/harvardnlp/sent-summary) | 39 | | Google Sentence Compression
(extractive summarization) | [test data](./data/sentence_compression)
[model output](./results_elmo_sc)| [language model](https://drive.google.com/file/d/1KVh7J6Mpj6W5YFV0DPAb81OwJSo26C7g/view?usp=sharing) | [full data](https://github.com/google-research-datasets/sentence-compression) | 40 | 41 | ## Unsupervised Summary Generation 42 | 43 | To generate summaries for a given corpus of source sentences, make sure the following two components are prepared: 44 | - The ELMo model contained in the [allennlp](https://github.com/allenai/allennlp) library package 45 | - A pre-trained LSTM-based language model trained on summary-style short sentences (we have included our language modeling and training scripts in [lm_lstm](./lm_lstm), as well as our pre-trained models [above](#Datasets-&-Summary-Results-&-Pre-trained-Language-Models)) 46 | 47 | --- 48 | 49 | Suppose the file structure is as follows: 50 | ``` 51 | ├── ./ 52 | ├── data/ 53 | ├── gigaword/ 54 | ├── sentence_compression/ 55 | ├── lm_lstm/ 56 | ├── lm_lstm_models/ 57 | ├── gigaword/ 58 | ├── sentence_compression/ 59 | ├── uss/ 60 | ├── ... 61 | ├── ... 62 | ``` 63 | 64 | where we use two datasets as examples, English Gigaword for abstractive sentence summarization and the Google sentence compression dataset for extractive sentence summarization, as used in the paper. Suppose these data are stored in the `./data/` directory. 65 | 66 | For the following commands we take the [English Gigaword dataset](https://github.com/harvardnlp/sent-summary) as an example. 67 | 68 | --- 69 | 70 | **To train a summary-domain-specific language model:** 71 | 72 | ``` 73 | python lm_lstm/main.py --data_src user --userdata_path ./data/gigaword --userdata_train train.title.txt --userdata_val valid.title.filter.txt --userdata_test task1_ref0_unk.txt --bptt 32 --bsz 256 --embedsz 1024 --hiddensz 1024 --tieweights 0 --optim SGD --lr 0.1 --gradclip 15 --epochs 50 --vocabsave ./lm_lstm_models/gigaword/vocabTle.pkl --save ./lm_lstm_models/gigaword/Tle_LSTM_untied.pth --devid 0 74 | ``` 75 | Remember to check and adjust the data file paths and names, and to save the vocabulary and model to proper places. For a full list of hyperparameters and their meanings use `python lm_lstm/main.py --help` or check the Python script. 76 | 77 | **A minor detail**: in the processed [English Gigaword dataset](https://github.com/harvardnlp/sent-summary) we used, the training and validation sets represent unknown words as `<unk>`, whereas in the test set they are represented as `UNK`. When training the language model we replace the `UNK` tokens in the test summary set by `<unk>`, and after generating the test summaries, to compare with the original reference, we map `<unk>` back to `UNK` so that the Rouge evaluation is consistent with the literature (a minimal sketch of this token mapping is given below). 78 | 79 | In our experiments, the above command should produce a language model that achieves training perplexity ~62, validation perplexity ~72, and test perplexity ~201 (the test set is quite different from the training and validation sets). Training takes about a day on a DGX V100 GPU.
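
A minimal sketch of the `UNK` <-> `<unk>` token mapping described above (this helper is not part of the repository, and the file names are placeholders):

```python
def map_tokens(in_path, out_path, src_tok='UNK', tgt_tok='<unk>'):
    """Rewrite a whitespace-tokenized text file, replacing `src_tok` with `tgt_tok`."""
    with open(in_path, encoding='utf-8') as fin, open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            tokens = [tgt_tok if tok == src_tok else tok for tok in line.split()]
            fout.write(' '.join(tokens) + '\n')

# Before LM training / generation: UNK -> <unk>
# map_tokens('task1_ref0.txt', 'task1_ref0_unk.txt', src_tok='UNK', tgt_tok='<unk>')
# Before Rouge evaluation: <unk> -> UNK
# map_tokens('selected_summaries.txt', 'selected_summaries_UNK.txt', src_tok='<unk>', tgt_tok='UNK')
```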
80 | 81 | --- 82 | 83 | After obtaining the summary-domain-specific language model, we can run **summary generation with the following command:** 84 | 85 | ``` 86 | python uss/summary_search_elmo.py --src ./data/gigaword/input_unk.txt --modelclass ./lm_lstm --model ./lm_lstm_models/gigaword/Tle_LSTM_untied.pth --vocab ./lm_lstm_models/gigaword/vocabTle.pkl --n 6 --ns 10 --nf 300 --elmo_layer cat --alpha 0.1 --beta 0 --beam_width 10 --devid 0 --save_dir ./results_elmo_giga/ 87 | ``` 88 | 89 | where: 90 | - `--src` specifies the sentence corpus to be summarized 91 | - `--modelclass` is the directory in which the language model source script is saved 92 | - `--model` is the path of the language model to be used 93 | - `--vocab` is the vocabulary file associated with the language model 94 | 95 | The results will be saved in the `--save_dir` directory, with a file name generated automatically from the user-specified hyperparameters. For the full list of hyperparameters for the summary generation process and their meanings (although most of them were used for experimental purposes and need not be changed), use `python uss/summary_search_elmo.py --help` or check the Python script. 96 | 97 | With the above command, a file named "smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b0.0_all.txt" containing all the generated summaries will be saved in the directory "./results_elmo_giga/". In this file, for each source sentence, all of the finished hypotheses from beam search are saved as candidate summary sentences, along with their alignments to the original source sentence, as well as the combined scores, contextual matching scores, and language modeling scores. Note that this search process is relatively slow, as we need to calculate the contextual embeddings for every sentence prefix and every candidate next word, even with our optimizations of caching and batching. 98 | 99 | 100 | ## Evaluation 101 | 102 | To be consistent with the literature, we need to select one summary sentence from the candidate list to compare with the reference summary and compute metrics such as Rouge scores. Since our generation is unsupervised, it can be difficult to select the best summary from a list of candidates, and there is often a better candidate than the one we select. Nevertheless, we use a simple length-penalized beam search score as our selection criterion. 103 | 104 | **For summary selection and evaluation, run the following command:** 105 | 106 | ``` 107 | python uss/summary_select_eval.py --src ./data/gigaword/input_unk.txt --ref ./data/gigaword/task1_ref0.txt --gen ./results_elmo_giga/smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b0.0_all.txt --save_dir ./results_elmo_giga --lp 0.1 108 | ``` 109 | 110 | where: 111 | - `--src`: source sentences file path 112 | - `--ref`: reference sentences file path 113 | - `--gen`: generated summary sentences file path 114 | - `--save_dir`: directory to save selected summaries 115 | - `--lp`: additive length penalty (usually between -0.1 and 0.1) 116 | 117 | This will generate a file named "smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b1.0_single.txt" in "./results_elmo_giga/", containing a single selected summary for each of the source sentences. Rouge scores will then be computed and printed, along with other statistics including copy rate, compression rate, and average summary length (a rough sketch of the selection and these statistics is given below).
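
The following is a rough, self-contained sketch of what the length-penalized selection and the reported statistics could look like; it is not the implementation in `uss/summary_select_eval.py`, and the exact scoring formula used by that script may differ:

```python
def select_best(candidates, lp=0.1):
    """candidates: list of (summary_tokens, avg_beam_score) for one source sentence.
    Assumes an additive length penalty: pick the candidate maximizing avg_score + lp * length."""
    return max(candidates, key=lambda c: c[1] + lp * len(c[0]))[0]


def summary_stats(sources, summaries):
    """Copy rate (fraction of summary tokens appearing in the source), average
    compression rate (summary length / source length), and average summary length."""
    copied, total, comp, lengths = 0, 0, [], []
    for src, smry in zip(sources, summaries):
        src_set = set(src)
        copied += sum(tok in src_set for tok in smry)
        total += len(smry)
        comp.append(len(smry) / max(len(src), 1))
        lengths.append(len(smry))
    return {
        'copy_rate': copied / max(total, 1),
        'compression_rate': sum(comp) / len(comp),
        'avg_summary_length': sum(lengths) / len(lengths),
    }
```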
118 | 119 | Note that the Rouge evaluation is based on [files2rouge](https://github.com/pltrdy/files2rouge). 120 | 121 | ## Data and Sample Output 122 | 123 | We have included the test sets of English Gigaword dataset and Google sentence compression evaluation set in the "./data" folder. 124 | 125 | We also include the summary outputs from our unsupervised method for these two test sets in "./results_elmo_giga" and "./results_elmo_sc" respectively. 126 | 127 | ## Citing 128 | 129 | If you find the resources in this repository useful, please consider citing: 130 | 131 | ``` 132 | @inproceedings{zhou2019simple, 133 | title={Simple Unsupervised Summarization by Contextual Matching}, 134 | author={Zhou, Jiawei and Rush, Alexander M}, 135 | booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, 136 | pages={5101--5106}, 137 | year={2019} 138 | } 139 | ``` 140 | -------------------------------------------------------------------------------- /data/gigaword/README.md: -------------------------------------------------------------------------------- 1 | ## English Gigaword Summarization Test Set 2 | 3 | - **input.txt**: source sentences 4 | - **input_unk.txt**: source sentences after replacing `UNK` with `` 5 | - **task1_ref0.txt**: reference summary sentences 6 | -------------------------------------------------------------------------------- /data/sentence_compression/README.md: -------------------------------------------------------------------------------- 1 | ## Google Sentence Compression Test Set 2 | 3 | For extractive sentence summarization. The data is cleaned and pre-processed from the original dataset in a way similar to that of Gigaword dataset. 4 | -------------------------------------------------------------------------------- /figure1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jzhou316/Unsupervised-Sentence-Summarization/29f3e23d608143b8d09a98fac3968b6f4f97302e/figure1.png -------------------------------------------------------------------------------- /lm_lstm/dataload.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed Jun 20 23:25:26 2018 4 | 5 | @author: zjw 6 | """ 7 | 8 | import torchtext 9 | 10 | 11 | def loadPTB(root='E:/NLP/LM/data', batch_size=64, bptt_len=32, device=None, **kwargs): 12 | """ 13 | Load the Penn Treebank dataset. Download if not existing. 14 | """ 15 | TEXT = torchtext.data.Field(lower=True) 16 | train, val, test = torchtext.datasets.PennTreebank.splits(root=root, text_field=TEXT) 17 | TEXT.build_vocab(train, **kwargs) # could include: max_size, min_freq, vectors 18 | train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits((train, val, test), 19 | batch_size=batch_size, 20 | bptt_len=bptt_len, 21 | device=device, 22 | repeat=False) 23 | 24 | return TEXT, train_iter, val_iter, test_iter 25 | 26 | 27 | def loadWiki2(root='E:/NLP/LM/data', batch_size=64, bptt_len=32, device=None, **kwargs): 28 | """ 29 | Load the WikiText2 dataset. Download if not existing. 
30 | """ 31 | TEXT = torchtext.data.Field(lower=True) 32 | train, val, test = torchtext.datasets.WikiText2.splits(root=root, text_field=TEXT) 33 | TEXT.build_vocab(train, **kwargs) 34 | train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits((train, val, test), 35 | batch_size=batch_size, 36 | bptt_len=bptt_len, 37 | device=device, 38 | repeat=False) 39 | 40 | return TEXT, train_iter, val_iter, test_iter 41 | 42 | 43 | def loadLMdata(path='E:/NLP/LM/data/penn-tree-bank-small', 44 | train='ptb.train.5k.txt', 45 | val='ptb.valid.txt', 46 | test='ptb.test.txt', 47 | batch_size=64, 48 | bptt_len=32, 49 | device=None, **kwargs): 50 | """ 51 | Load a dataset for LM training. The dataset should exist already. 52 | """ 53 | TEXT = torchtext.data.Field(lower=True) 54 | train, val, test = torchtext.datasets.LanguageModelingDataset.splits(path=path, 55 | train=train, 56 | validation=val, 57 | test=test, 58 | text_field=TEXT) 59 | TEXT.build_vocab(train, val, test, **kwargs) 60 | train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits((train, val, test), 61 | batch_size=batch_size, 62 | bptt_len=bptt_len, 63 | device=device, 64 | repeat=False) 65 | 66 | return TEXT, train_iter, val_iter, test_iter -------------------------------------------------------------------------------- /lm_lstm/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Fri Oct 26 2018 4 | 5 | @author: zjw 6 | """ 7 | import torch 8 | import torch.optim as optim 9 | from dataload import loadPTB, loadWiki2, loadLMdata 10 | from model import RNNModel 11 | from train import training, validating 12 | # from train_sharding import training, validating 13 | from utils import logging 14 | 15 | import os 16 | import sys 17 | # import random 18 | import argparse 19 | import time 20 | import importlib 21 | import pickle 22 | 23 | 24 | ########## set up parameters 25 | # data 26 | data_src = 'ptb' 27 | # on MicroSoft Azure 28 | # data_root = '/media/work/LM/data' 29 | # userdata_path = '/media/work/LM/data/Giga-sum' # .../penn-treebank-small 30 | # on Harvard Odyssey Cluster 31 | data_root = '/n/rush_lab/users/jzhou/LM/data' 32 | userdata_path = '/n/rush_lab/users/jzhou/LM/data/Giga-sum' # .../penn-treebank-small 33 | userdata_train = 'train.title.txt' 34 | userdata_val = 'valid.title.filter.txt' 35 | userdata_test = 'task1_ref0_unk.txt' 36 | batch_size = 128 37 | bptt_len = 32 38 | # model 39 | embed_size = 512 # 1024 40 | hidden_size = 512 # 1024 41 | num_layers = 2 42 | dropout = 0.5 43 | tieweights = 0 # 0 for False, 1 for True 44 | # optimization 45 | learning_rate = 0.01 46 | momentum = 0.9 47 | weight_decay = 1e-4 48 | grad_max_norm = 120 # 1024, 0.01 ---> 120 49 | shard_size = 64 50 | ##subvocab_size = 0 51 | #learning_rate = 0.001 52 | #grad_max_norm = None 53 | num_epochs = 50 54 | 55 | vocabsavepath = './models/vocabTle.pkl' 56 | savepath = './models/Tle_LSTM.pth' 57 | #savepath = '/media/work/LM/LMModel.pth' 58 | #savepath = '/n/rush_lab/users/jzhou/LM/LMModel.pth' 59 | 60 | 61 | def parse_args(): 62 | parser = argparse.ArgumentParser(description='Training an LSTM language model.') 63 | group = parser.add_mutually_exclusive_group() 64 | group.add_argument('--devid', type=int, default=-1, help='single device id; -1 for CPU') 65 | group.add_argument('--devids', type=str, default='off', help='multiple device ids for data parallel; use comma to separate, e.g. 
0, 1, 2') 66 | # parser.add_argument('--devid', type=int, default=-1, help='device id; -1 for CPU') 67 | ## parser.add_argument('--modelfile', type=str, default='model', help='file name of the model, without .py') 68 | parser.add_argument('--seed', type=int, default=0, help='random seed') 69 | parser.add_argument('--logmode', type=str, default='w', help='logging file mode') 70 | # data loading 71 | parser.add_argument('--data_src', type=str, default=data_src, choices=['ptb', 'wiki2', 'user'], help='data source') 72 | parser.add_argument('--data_root', type=str, default=data_root, help='root path for PTB/Wiki2 dataset path') 73 | parser.add_argument('--userdata_path', type=str, default=userdata_path, help='user data path') 74 | parser.add_argument('--userdata_train', type=str, default=userdata_train, help='user data training set file name') 75 | parser.add_argument('--userdata_val', type=str, default=userdata_val, help='user data validating set file name') 76 | parser.add_argument('--userdata_test', type=str, default=userdata_test, help='user data testing set file name') 77 | parser.add_argument('--bptt', type=int, default=bptt_len, help='bptt length') 78 | parser.add_argument('--bsz', type=int, default=batch_size, help='batch size') 79 | parser.add_argument('--vocabsave', type=str, default=vocabsavepath, help='file path to save the vocabulary object') 80 | # model 81 | parser.add_argument('--embedsz', type=int, default=embed_size, help='word embedding size') 82 | parser.add_argument('--hiddensz', type=int, default=hidden_size, help='hidden state size') 83 | parser.add_argument('--numlayers', type=int, default=num_layers, help='number of layers') 84 | parser.add_argument('--dropout', type=float, default=dropout, help='dropout probability') 85 | # parser.add_argument('--tieweights', help='whether to tie input and output embedding weights', action='store_true') 86 | parser.add_argument('--tieweights', type=int, default=tieweights, help='whether to tie input and output embedding weights') 87 | parser.add_argument('--start_model', type=str, default='off', help='a trained model to start with') 88 | # optimization 89 | parser.add_argument('--optim', type=str, default='SGD', choices=['SGD', 'Adam'], help='optimization algorithm') 90 | parser.add_argument('--lr', type=float, default=learning_rate, help='learning rate') 91 | parser.add_argument('--momentum', type=float, default=momentum, help='momentum for SGD') 92 | parser.add_argument('--wd', type=float, default=weight_decay, help='weight decay (L2 penalty)') 93 | parser.add_argument('--gradclip', type=float, default=grad_max_norm, help='gradient norm clip') 94 | ## parser.add_argument('--shardsz', type=int, default=shard_size, help='shard size for mixture of softmax output layer') 95 | ## parser.add_argument('--subvocabsz', type=int, default=subvocab_size, help='sub-vocabulary size for training on large corpus') 96 | parser.add_argument('--epochs', type=int, default=num_epochs, help='number of training epochs') 97 | parser.add_argument('--save', type=str, default=savepath, help='file path to save the best model') 98 | args = parser.parse_args() 99 | return args 100 | 101 | args = parse_args() 102 | 103 | 104 | ## RNNModel = importlib.import_module(args.modelfile).RNNModel 105 | 106 | cuda_device = 'cpu' if args.devid == -1 else f'cuda:{args.devid}' 107 | if args.devids is not 'off': 108 | device_ids = list(map(int, args.devids.split(','))) 109 | output_device = device_ids[0] 110 | cuda_device = f'cuda:{output_device}' 111 | 112 | # if os.name == 
'nt': 113 | # # run on my personal windows computer with cpu 114 | # cuda_device = None 115 | # root = 'E:/NLP/LM/data' 116 | # path = 'E:/NLP/LM/data/penn-treebank-small' 117 | # elif os.name == 'posix': 118 | # # run on Harvard Odyssey cluster 119 | # cuda_device = 'cuda:0' 120 | # # root = '/n/rush_lab/users/jzhou/LM/data' 121 | # # path = '/n/rush_lab/users/jzhou/LM/data/penn-treebank-small' 122 | # root = '/media/work/LM/data' 123 | # path = '/media/work/LM/data/penn-treebank' 124 | # # path = '/media/work/LM/data/Giga-sum' 125 | # # train = 'train.title.txt' 126 | # # val = 'valid.title.filter.txt' 127 | # # test = 'task1_ref0.txt' 128 | 129 | log_file = os.path.splitext(args.save)[0] + '.log' 130 | f_log = open(log_file, args.logmode) 131 | 132 | logging('python ' + ' '.join(sys.argv), f_log=f_log) 133 | 134 | logging('-' * 30, f_log=f_log) 135 | logging(time.ctime(), f_log=f_log) 136 | 137 | 138 | # random.seed(args.seed) # this has no impact on the current model training 139 | torch.manual_seed(args.seed) 140 | # torch.backends.cudnn.deterministic = True 141 | # torch.backends.cudnn.benchmark = False 142 | # torch.backends.cudnn.enabled = False 143 | 144 | # print('-' * 30) 145 | # print(time.ctime()) 146 | 147 | ########## load the dataset 148 | logging('-' * 30, f_log=f_log) 149 | logging('Loading data ...', f_log=f_log) 150 | 151 | # print('-' * 30) 152 | # print('Loading data ...') 153 | 154 | if args.data_src == 'ptb': 155 | TEXT, train_iter, val_iter, test_iter = loadPTB(root=args.data_root, 156 | batch_size=args.bsz, 157 | bptt_len=args.bptt, 158 | device=cuda_device) 159 | elif args.data_src == 'wiki2': 160 | TEXT, train_iter, val_iter, test_iter = loadWiki2(root=args.data_root, 161 | batch_size=args.bsz, 162 | bptt_len=args.bptt, 163 | device=cuda_device) 164 | elif args.data_src == 'user': 165 | TEXT, train_iter, val_iter, test_iter = loadLMdata(path=args.userdata_path, 166 | train=args.userdata_train, 167 | val=args.userdata_val, 168 | test=args.userdata_test, 169 | batch_size=args.bsz, 170 | bptt_len=args.bptt, 171 | device=cuda_device, 172 | min_freq=5) 173 | 174 | padid = TEXT.vocab.stoi[''] 175 | vocab_size = len(TEXT.vocab) 176 | 177 | logging(f'Vocab size: {vocab_size}', f_log=f_log) 178 | if not os.path.exists(args.vocabsave): 179 | pickle.dump(TEXT.vocab, open(args.vocabsave, 'wb')) 180 | logging(f'Vocabulary object saved to: {args.vocabsave}', f_log=f_log) 181 | else: 182 | logging(f'Vocabulary object at: {args.vocabsave}', f_log=f_log) 183 | logging('Complete!', f_log=f_log) 184 | logging('-' * 30, f_log=f_log) 185 | 186 | ##if args.subvocabsz >= vocab_size or args.subvocabsz == 0: 187 | ## args.subvocabsz = None 188 | 189 | # print('Complete!') 190 | # print('-' * 30) 191 | 192 | ########## define the model and optimizer 193 | 194 | if args.start_model is 'off': 195 | LMModel = RNNModel(vocab_size=vocab_size, 196 | embed_size=args.embedsz, 197 | hidden_size=args.hiddensz, 198 | num_layers=args.numlayers, 199 | dropout=args.dropout, 200 | padid=padid, 201 | tieweights=args.tieweights) 202 | else: 203 | LMModel_start = torch.load(args.start_model).cpu() 204 | # Note: watch out if the model class has different methods from the loaded one to start with !!! 
205 | LMModel = RNNModel(vocab_size=vocab_size, 206 | embed_size=args.embedsz, 207 | hidden_size=args.hiddensz, 208 | num_layers=args.numlayers, 209 | dropout=args.dropout, 210 | padid=padid, 211 | tieweights=args.tieweights) 212 | LMModel.load_state_dict(LMModel_start.state_dict()) 213 | 214 | 215 | # LMModel = torch.load(args.save).cpu() 216 | 217 | model_size = sum(p.nelement() for p in LMModel.parameters()) 218 | logging('-' * 30, f_log=f_log) 219 | logging(f'Model tatal parameters: {model_size}', f_log=f_log) 220 | logging('-' * 30, f_log=f_log) 221 | 222 | # print('-' * 30) 223 | # print(f'Model tatal parameters: {model_size}') 224 | # print('-' * 30) 225 | 226 | if torch.cuda.is_available() and cuda_device is not 'cpu': 227 | LMModel = LMModel.cuda(cuda_device) 228 | 229 | LMModel_parallel = None 230 | if torch.cuda.is_available() and args.devids is not 'off': 231 | LMModel_parallel = torch.nn.DataParallel(LMModel, device_ids=device_ids, output_device=output_device, dim=1) 232 | # .cuda() is necessary if LMModel was not on any GPU device 233 | # LMModel_parallel._modules['module'].lstm.flatten_parameters() 234 | 235 | if args.optim == 'SGD': 236 | optimizer = optim.SGD(LMModel.parameters(), lr=args.lr, momentum=args.momentum, weight_decay=args.wd) 237 | elif args.optim == 'Adam': 238 | optimizer = optim.Adam(LMModel.parameters(), lr=args.lr, weight_decay=args.wd) 239 | 240 | scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.25, patience=1) 241 | 242 | if args.start_model is not 'off': 243 | start_model_optstate_path = os.path.splitext(args.start_model)[0] + '_optstate.pth' 244 | start_model_schstate_path = os.path.splitext(args.start_model)[0] + '_schstate.pth' 245 | if os.path.exists(start_model_optstate_path): 246 | optimizer.load_state_dict(torch.load(start_model_optstate_path)) 247 | logging('-' * 30, f_log=f_log) 248 | logging('Loading saved optimizer states.', f_log=f_log) 249 | logging('-' * 30, f_log=f_log) 250 | 251 | if os.path.exists(start_model_schstate_path): 252 | scheduler.load_state_dict(torch.load(start_model_schstate_path)) 253 | logging('-' * 30, f_log=f_log) 254 | logging('Loading saved scheduler states.', f_log=f_log) 255 | logging('-' * 30, f_log=f_log) 256 | 257 | # print('-' * 30) 258 | # print('Loading saved optimizer states.') 259 | # print('-' * 30) 260 | 261 | ########## traing the model 262 | if args.start_model is not 'off': 263 | start_model_rngstate_path = os.path.splitext(args.start_model)[0] + '_rngstate.pth' 264 | if os.path.exists(start_model_rngstate_path): 265 | torch.set_rng_state(torch.load(start_model_rngstate_path)['torch_rng_state']) 266 | torch.cuda.set_rng_state_all(torch.load(start_model_rngstate_path)['cuda_rng_state']) 267 | logging('-' * 30, f_log=f_log) 268 | logging('Loading saved rng states.', f_log=f_log) 269 | logging('-' * 30, f_log=f_log) 270 | 271 | train_ppl, val_ppl = training(train_iter, val_iter, args.epochs, 272 | LMModel, 273 | optimizer, 274 | scheduler, 275 | args.gradclip, 276 | args.save, 277 | ## shard_size=args.shardsz, 278 | LMModel_parallel=LMModel_parallel, 279 | f_log=f_log) 280 | ## subvocab_size=args.subvocabsz) 281 | 282 | ######### test the trained model 283 | ##test_ppl = validating(test_iter, LMModel, shard_size=args.shardsz, LMModel_parallel=LMModel_parallel, f_log=f_log) 284 | test_ppl = validating(test_iter, LMModel, LMModel_parallel=LMModel_parallel, f_log=f_log) 285 | logging('-' * 30, f_log=f_log) 286 | logging('Test ppl: %f' % test_ppl, f_log=f_log) 287 | 
logging('-' * 30, f_log=f_log) 288 | 289 | f_log.close() 290 | 291 | # print('-' * 30) 292 | # print('Test ppl: %f' % test_ppl) 293 | # print('-' * 30) 294 | -------------------------------------------------------------------------------- /lm_lstm/model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Nov 29 2018 4 | 5 | @author: zjw 6 | """ 7 | import torch 8 | import torch.nn as nn 9 | 10 | 11 | class RNNModel(nn.Module): 12 | def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout, padid=1, max_norm=None, tieweights=False): 13 | super(RNNModel, self).__init__() 14 | 15 | self.vocab_size = vocab_size 16 | self.embed_size = embed_size 17 | self.hidden_size = hidden_size 18 | self.num_layers = num_layers 19 | self.padid = padid 20 | 21 | self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=self.padid, max_norm=None) 22 | self.lstm = nn.LSTM(input_size=embed_size, hidden_size=hidden_size, 23 | num_layers=num_layers, dropout=dropout) 24 | self.drop = nn.Dropout(p=dropout) 25 | self.proj = nn.Linear(hidden_size, vocab_size, bias=True) 26 | 27 | self.init_weight(0.5) 28 | 29 | # tie weights 30 | if tieweights: 31 | self.proj.weight = self.embedding.weight 32 | 33 | def init_weight(self, initrange=0.1): 34 | nn.init.uniform_(self.embedding.weight, -initrange, initrange) 35 | # nn.init.uniform_(self.proj.weight, -initrange, initrange) 36 | nn.init.orthogonal_(self.proj.weight) 37 | nn.init.constant_(self.proj.bias, 0) 38 | 39 | def forward(self, batch_text, hn, subvocab=None, return_prob=False): 40 | embed = self.embedding(batch_text) # size: (seq_len, batch_size, embed_size) 41 | output, hn = self.lstm(embed, hn) # output size: (seq_len, batch_size, hidden_size) 42 | output = self.drop(output) # hn = (hn, cn), each with size: (num_layers, batch, hidden_size) 43 | if isinstance(subvocab, list): 44 | subvocab = torch.LongTensor(subvocab, device=output.device) 45 | output = self.proj(output) if subvocab is None else nn.functional.linear(output, self.proj.weight[subvocab, :], self.proj.bias[subvocab]) 46 | if return_prob: 47 | output = nn.functional.softmax(output, dim=-1) 48 | # detach last hidden and cell states to truncate the computational graph for BPTT. 49 | hn = tuple(map(lambda x: x.detach(), hn)) 50 | return output, hn 51 | 52 | def score_textseq(self, text, vocab, hn=None, size_average=True): 53 | """ 54 | Output the log-likelihood of a text sequence. 55 | """ 56 | if isinstance(text, str): 57 | text = text.split() 58 | textid = next(self.parameters()).new_tensor([vocab.stoi[w] for w in text], dtype=torch.long) 59 | with torch.no_grad(): 60 | self.eval() 61 | model_output, hn = self(textid.unsqueeze(1), hn) 62 | self.train() 63 | ll = nn.functional.cross_entropy(model_output[:-1, 0, :], textid[1:], ignore_index=self.padid, 64 | reduction='elementwise_mean' if size_average else 'sum') 65 | ll = -ll.item() 66 | return ll 67 | 68 | def score_nexttoken(self, text, vocab, hn=None): 69 | """ 70 | Output the predictive probabilities of the next token given a text sequence. 
71 | """ 72 | if isinstance(text, str): 73 | text = text.split() 74 | textid = next(self.parameters()).new_tensor([vocab.stoi[w] for w in text], dtype=torch.long) 75 | with torch.no_grad(): 76 | self.eval() 77 | model_output, hn = self(textid.unsqueeze(1), hn, return_prob=True) 78 | self.train() 79 | 80 | return model_output[-1, 0, :] 81 | 82 | -------------------------------------------------------------------------------- /lm_lstm/testppl.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Jan 31 2019 4 | 5 | @author: zjw 6 | """ 7 | import torch 8 | from dataload import loadPTB, loadWiki2, loadLMdata 9 | from model import RNNModel 10 | from train import validating 11 | # from train_sharding import training, validating 12 | 13 | import os 14 | import sys 15 | # import random 16 | import argparse 17 | import time 18 | import importlib 19 | import pickle 20 | 21 | 22 | ########## set up parameters 23 | # data 24 | data_src = 'ptb' 25 | # on MicroSoft Azure 26 | # data_root = '/media/work/LM/data' 27 | # userdata_path = '/media/work/LM/data/Giga-sum' # .../penn-treebank-small 28 | # on Harvard Odyssey Cluster 29 | data_root = '/n/rush_lab/users/jzhou/LM/data' 30 | userdata_path = '/n/rush_lab/users/jzhou/LM/data/Giga-sum' # .../penn-treebank-small 31 | userdata_train = 'train.title.txt' 32 | userdata_val = 'valid.title.filter.txt' 33 | userdata_test = 'task1_ref0_unk.txt' 34 | batch_size = 128 35 | bptt_len = 32 36 | 37 | vocabsavepath = './models/vocabTle.pkl' 38 | 39 | def parse_args(): 40 | parser = argparse.ArgumentParser(description='Training an LSTM language model.') 41 | group = parser.add_mutually_exclusive_group() 42 | group.add_argument('--devid', type=int, default=-1, help='single device id; -1 for CPU') 43 | group.add_argument('--devids', type=str, default='off', help='multiple device ids for data parallel; use comma to separate, e.g. 
0, 1, 2') 44 | # parser.add_argument('--devid', type=int, default=-1, help='device id; -1 for CPU') 45 | ## parser.add_argument('--modelfile', type=str, default='model', help='file name of the model, without .py') 46 | parser.add_argument('--seed', type=int, default=0, help='random seed') 47 | # data loading 48 | parser.add_argument('--data_src', type=str, default=data_src, choices=['ptb', 'wiki2', 'user'], help='data source') 49 | parser.add_argument('--data_root', type=str, default=data_root, help='root path for PTB/Wiki2 dataset path') 50 | parser.add_argument('--userdata_path', type=str, default=userdata_path, help='user data path') 51 | parser.add_argument('--userdata_train', type=str, default=userdata_train, help='user data training set file name') 52 | parser.add_argument('--userdata_val', type=str, default=userdata_val, help='user data validating set file name') 53 | parser.add_argument('--userdata_test', type=str, default=userdata_test, help='user data testing set file name') 54 | parser.add_argument('--bptt', type=int, default=bptt_len, help='bptt length') 55 | parser.add_argument('--bsz', type=int, default=batch_size, help='batch size') 56 | parser.add_argument('--vocabsave', type=str, default=vocabsavepath, help='file path to save the vocabulary object') 57 | # model 58 | parser.add_argument('--model', type=str, default='off', help='a trained model to start with') 59 | args = parser.parse_args() 60 | return args 61 | 62 | args = parse_args() 63 | 64 | 65 | ## RNNModel = importlib.import_module(args.modelfile).RNNModel 66 | 67 | cuda_device = 'cpu' if args.devid == -1 else f'cuda:{args.devid}' 68 | if args.devids is not 'off': 69 | device_ids = list(map(int, args.devids.split(','))) 70 | output_device = device_ids[0] 71 | cuda_device = f'cuda:{output_device}' 72 | 73 | # if os.name == 'nt': 74 | # # run on my personal windows computer with cpu 75 | # cuda_device = None 76 | # root = 'E:/NLP/LM/data' 77 | # path = 'E:/NLP/LM/data/penn-treebank-small' 78 | # elif os.name == 'posix': 79 | # # run on Harvard Odyssey cluster 80 | # cuda_device = 'cuda:0' 81 | # # root = '/n/rush_lab/users/jzhou/LM/data' 82 | # # path = '/n/rush_lab/users/jzhou/LM/data/penn-treebank-small' 83 | # root = '/media/work/LM/data' 84 | # path = '/media/work/LM/data/penn-treebank' 85 | # # path = '/media/work/LM/data/Giga-sum' 86 | # # train = 'train.title.txt' 87 | # # val = 'valid.title.filter.txt' 88 | # # test = 'task1_ref0.txt' 89 | 90 | 91 | # random.seed(args.seed) # this has no impact on the current model training 92 | torch.manual_seed(args.seed) 93 | # torch.backends.cudnn.deterministic = True 94 | # torch.backends.cudnn.benchmark = False 95 | # torch.backends.cudnn.enabled = False 96 | 97 | # print('-' * 30) 98 | # print(time.ctime()) 99 | 100 | ########## load the dataset 101 | print('-' * 30) 102 | print('Loading data ...') 103 | 104 | if args.data_src == 'ptb': 105 | TEXT, train_iter, val_iter, test_iter = loadPTB(root=args.data_root, 106 | batch_size=args.bsz, 107 | bptt_len=args.bptt, 108 | device=cuda_device) 109 | elif args.data_src == 'wiki2': 110 | TEXT, train_iter, val_iter, test_iter = loadWiki2(root=args.data_root, 111 | batch_size=args.bsz, 112 | bptt_len=args.bptt, 113 | device=cuda_device) 114 | elif args.data_src == 'user': 115 | TEXT, train_iter, val_iter, test_iter = loadLMdata(path=args.userdata_path, 116 | train=args.userdata_train, 117 | val=args.userdata_val, 118 | test=args.userdata_test, 119 | batch_size=args.bsz, 120 | bptt_len=args.bptt, 121 | device=cuda_device, 122 | 
min_freq=5) 123 | print(f'Vocabulary size: {len(TEXT.vocab)}') 124 | print('Complete!') 125 | print('-' * 30) 126 | 127 | ########## define the model 128 | LMModel = torch.load(args.model).cpu() 129 | ''' 130 | LMModel_start = torch.load(args.start_model).cpu() 131 | # Note: watch out if the model class has different methods from the loaded one to start with !!! 132 | LMModel = RNNModel(vocab_size=vocab_size, 133 | embed_size=LMModel_start.embedsz, 134 | hidden_size=LMModel_start.hiddensz, 135 | num_layers=LMModel_start.numlayers, 136 | dropout=LMModel_start.dropout, 137 | padid=LMModel_start.padid, 138 | tieweights=LMModel_start.tieweights) 139 | LMModel.load_state_dict(LMModel_start.state_dict()) 140 | ''' 141 | 142 | # LMModel = torch.load(args.save).cpu() 143 | 144 | model_size = sum(p.nelement() for p in LMModel.parameters()) 145 | 146 | print('-' * 30) 147 | print(f'Model tatal parameters: {model_size}') 148 | print('-' * 30) 149 | 150 | if torch.cuda.is_available() and cuda_device is not 'cpu': 151 | LMModel = LMModel.cuda(cuda_device) 152 | 153 | LMModel_parallel = None 154 | if torch.cuda.is_available() and args.devids is not 'off': 155 | LMModel_parallel = torch.nn.DataParallel(LMModel, device_ids=device_ids, output_device=output_device, dim=1) 156 | # .cuda() is necessary if LMModel was not on any GPU device 157 | # LMModel_parallel._modules['module'].lstm.flatten_parameters() 158 | ''' 159 | if args.start_model is not 'off': 160 | start_model_optstate_path = os.path.splitext(args.start_model)[0] + '_optstate.pth' 161 | start_model_schstate_path = os.path.splitext(args.start_model)[0] + '_schstate.pth' 162 | if os.path.exists(start_model_optstate_path): 163 | optimizer.load_state_dict(torch.load(start_model_optstate_path)) 164 | logging('-' * 30, f_log=f_log) 165 | logging('Loading saved optimizer states.', f_log=f_log) 166 | logging('-' * 30, f_log=f_log) 167 | 168 | if os.path.exists(start_model_schstate_path): 169 | scheduler.load_state_dict(torch.load(start_model_schstate_path)) 170 | logging('-' * 30, f_log=f_log) 171 | logging('Loading saved scheduler states.', f_log=f_log) 172 | logging('-' * 30, f_log=f_log) 173 | ''' 174 | # print('-' * 30) 175 | # print('Loading saved optimizer states.') 176 | # print('-' * 30) 177 | 178 | ######### test the trained model 179 | test_ppl = validating(test_iter, LMModel, LMModel_parallel=LMModel_parallel) 180 | 181 | print('-' * 30) 182 | print('Test ppl: %f' % test_ppl) 183 | print('-' * 30) 184 | 185 | -------------------------------------------------------------------------------- /lm_lstm/train.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Oct 25 2018 4 | 5 | @author: zjw 6 | """ 7 | import torch 8 | import torch.nn as nn 9 | import math 10 | import time 11 | import os 12 | from utils import logging, timeSince, rand_subvocab 13 | 14 | 15 | def training(train_iter, val_iter, num_epoch, LMModel, optimizer, scheduler, grad_max_norm=None, savepath='./LMModel.pth', LMModel_parallel=None, f_log=None, subvocab_size=None): 16 | criterion = nn.CrossEntropyLoss(ignore_index=LMModel.padid, reduction='sum') 17 | best_val_ppl = None 18 | last_epoch = scheduler.last_epoch 19 | LMModel.train() 20 | start = time.time() 21 | for epoch in range(last_epoch + 1, last_epoch + 1 + num_epoch): 22 | train_iter.init_epoch() 23 | loss_total = 0 24 | num_token_passed = 0 25 | hn = None 26 | for batch in train_iter: 27 | # calculate sub-vocabulary 28 | ## subvocab = 
rand_subvocab(batch, LMModel.vocab_size, subvocab_size) 29 | subvocab = None 30 | # update parameters 31 | optimizer.zero_grad() 32 | if LMModel_parallel is None: 33 | output, hn = LMModel(batch.text, hn if hn is not None else None, subvocab=subvocab) 34 | else: 35 | output, hn = LMModel_parallel(batch.text, hn if hn is not None else None, subvocab=subvocab.numpy().tolist() if subvocab is not None else None) 36 | if subvocab is not None: 37 | target_subids = batch.target.new_tensor([(subvocab == x).nonzero().item() for x in batch.target.view(-1).cpu()], dtype=torch.long) 38 | loss = criterion(output.view(-1, output.size(2)), batch.target.view(-1)) if subvocab is None else \ 39 | criterion(output.view(-1, output.size(2)), target_subids) 40 | loss.backward() 41 | 42 | if grad_max_norm: 43 | nn.utils.clip_grad_norm_(LMModel.parameters(), grad_max_norm) 44 | 45 | # calculate perplexity 46 | loss_total += float(loss) # do not accumulate history accross training loop 47 | num_token_passed += (torch.numel(batch.target) - 48 | torch.sum(batch.target == LMModel.padid)).item() 49 | # do not count the '', which could only exist 50 | # at the end of the last batch 51 | loss_avg = loss_total / num_token_passed 52 | ppl = math.exp(loss_avg) 53 | 54 | optimizer.step() 55 | 56 | # print information 57 | if train_iter.iterations % 50 == 0 or train_iter.iterations == len(train_iter): 58 | logging('Epoch %d / %d, iteration %d / %d, ppl: %f (time elasped %s)' 59 | %(epoch + 1, last_epoch + 1 + num_epoch, train_iter.iterations, len(train_iter), ppl, timeSince(start)), f_log=f_log) 60 | 61 | # calculation ppl on validation set 62 | val_ppl = validating(val_iter, LMModel, LMModel_parallel=LMModel_parallel, f_log=f_log) 63 | LMModel.train() 64 | logging('-' * 30, f_log=f_log) 65 | logging('Validating ppl: %f' % val_ppl, f_log=f_log) 66 | logging('-' * 30, f_log=f_log) 67 | 68 | scheduler.step(val_ppl) 69 | 70 | # save the model if the validation ppl is the best so far 71 | if not best_val_ppl or val_ppl < best_val_ppl: 72 | best_val_ppl = val_ppl 73 | torch.save(LMModel, savepath) 74 | torch.save(optimizer.state_dict(), os.path.splitext(savepath)[0] + '_optstate.pth') 75 | torch.save(scheduler.state_dict(), os.path.splitext(savepath)[0] + '_schstate.pth') 76 | torch.save({'torch_rng_state': torch.get_rng_state(), 'cuda_rng_state': torch.cuda.get_rng_state_all()}, os.path.splitext(savepath)[0] + '_rngstate.pth') 77 | logging(f'Current model (after epoch {epoch+1}) saved to {savepath} (along with optimizer state dictionary & scheduler state dictionary & rng states)', f_log=f_log) 78 | logging('-' * 30, f_log=f_log) 79 | 80 | return ppl, val_ppl 81 | 82 | 83 | def validating(val_iter, LMModel, LMModel_parallel=None, f_log=None): 84 | criterion = nn.CrossEntropyLoss(ignore_index=LMModel.padid, reduction='sum') 85 | LMModel.eval() 86 | with torch.no_grad(): 87 | val_iter.init_epoch() 88 | loss_total = 0 89 | num_token_passed = 0 90 | hn = None 91 | for batch in val_iter: 92 | if LMModel_parallel is None: 93 | output, hn = LMModel(batch.text, hn if hn is not None else None) 94 | else: 95 | output, hn = LMModel_parallel(batch.text, hn if hn is not None else None) 96 | loss = criterion(output.view(-1, output.size(2)), batch.target.view(-1)) 97 | loss_total += float(loss) 98 | num_token_passed += torch.sum(batch.target.ne(LMModel.padid)).item() 99 | 100 | ppl = math.exp(loss_total / num_token_passed) 101 | return ppl 102 | -------------------------------------------------------------------------------- /lm_lstm/utils.py: 
-------------------------------------------------------------------------------- 1 | import time 2 | import math 3 | import torch 4 | 5 | 6 | def logging(s, f_log=None, print_=True, log_=True): 7 | if print_: 8 | print(s) 9 | if log_ and f_log is not None: 10 | f_log.write(s + '\n') 11 | 12 | 13 | def timeSince(start): 14 | now = time.time() 15 | s = now - start 16 | m = math.floor(s / 60) 17 | s -= m * 60 18 | h = math.floor(m / 60) 19 | m -= h * 60 20 | if h == 0: 21 | return '%dm %ds' % (m, s) 22 | else: 23 | return '%dh %dm %ds' % (h, m, s) 24 | 25 | def rand_subvocab(batch, vocab_size, subvocab_size=None): 26 | if subvocab_size is None or subvocab_size >= vocab_size: 27 | return None 28 | batch_ids = torch.cat([batch.text.view(-1), batch.target.view(-1)]).cpu().unique() 29 | subvocab = torch.cat([torch.randperm(vocab_size)[:subvocab_size], batch_ids]).unique(sorted=True) 30 | return subvocab 31 | 32 | -------------------------------------------------------------------------------- /results_elmo_giga/README.md: -------------------------------------------------------------------------------- 1 | **Summaries generated from our unsupervised method for Gigaword test set** 2 | 3 | Including: 4 | - Summaries selected from finished beams with a consistent length penalty 5 | - Oracle results (select the best summary from the finished beams compared with the reference) 6 | -------------------------------------------------------------------------------- /results_elmo_sc/README.md: -------------------------------------------------------------------------------- 1 | **Summaries generated from our unsupervised method for Google sentence compression test set** 2 | 3 | Including: 4 | - Summaries selected from finished beams with a consistent length penalty 5 | - Oracle results (select the best summary from the finished beams compared with the reference) 6 | -------------------------------------------------------------------------------- /uss/beam_search.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | from sim_embed_score import simScoreNext, simScoreNext_GPT2 5 | # from lmsubvocab import prob_next 6 | from lm_subvocab import prob_next_1step 7 | import nltk 8 | 9 | lemma = nltk.wordnet.WordNetLemmatizer() 10 | 11 | 12 | class BeamUnit: 13 | def __init__(self, word_id, pre_loc, cur_loc, score, seq_len, vocab, **kwargs): 14 | self.score = score 15 | self.word_id = word_id 16 | self.pre_loc = pre_loc 17 | self.cur_loc = cur_loc 18 | self.seq_len = seq_len 19 | self.vocab = vocab 20 | for k, v in kwargs.items(): 21 | setattr(self, k, v) 22 | 23 | 24 | class Beam: 25 | def __init__(self, init_K, vocab, init_ids, device=None, **kwargs): 26 | assert 1 <= init_K <= len(vocab), 'Initial beam size should be in [1, len(vocab)]!' 27 | assert init_K == len(init_ids), 'Initial beam size should equal to the length of initial ids!' 
28 | self.K = [init_K] # dynamic beam size 29 | self.vocab = vocab 30 | self.step = 0 31 | self.device = device 32 | self.endbus = [] # ending BeamUnits 33 | self.endall = False # if all beams reach the termination 34 | if init_ids == [None]: # A special initial id 35 | seq_len = 0 36 | else: 37 | seq_len = 1 38 | self.beamseq = [[BeamUnit(word_id, pre_loc=None, cur_loc=i, score=0, seq_len=seq_len, vocab=vocab, **kwargs) 39 | for (i, word_id) in enumerate(init_ids)]] 40 | # Note: for the reason of unifying different language models, all the beams start from one single unit: 41 | # for similarity LM, init_ids should always be [None]; 42 | # for normal LM, init_ids should be of your own pick (since the current LM was not trained with 43 | # a special BOS token, it must start with some given token). 44 | 45 | def beamstep(self, K, score_funcK, **kwargs): 46 | """ 47 | K: beam size next step 48 | score_func: a function that takes in a list of BeamUnit and returns the next top K BeamUnit based on some scores 49 | """ 50 | if self.endall: 51 | raise ValueError('Beam.endall flag is already raised. No need to do beamstep.') 52 | 53 | nexttopKK, endbus = score_funcK(self.beamseq[-1], K, **kwargs) 54 | self.endbus += endbus 55 | 56 | if nexttopKK == []: 57 | print('All beams reach EOS. Beamstep stops.') 58 | self.endall = True 59 | else: 60 | self.beamseq.append(nexttopKK) # TO DO: add termination condition 61 | self.K.append(len(nexttopKK)) 62 | self.step += 1 63 | 64 | def beamcut(self, K, score_func=None, **kwargs): 65 | """ 66 | Cut the current beam width down to K (top K). 67 | """ 68 | assert K > 0, 'Beam width K should be positive!' 69 | if K >= self.K[-1]: 70 | print('No need to cut.') 71 | else: 72 | if score_func is None: 73 | self.beamseq[-1] = self.beamseq[-1][0:K] 74 | self.K[-1] = K 75 | else: 76 | ll = [score_func(text=self.retrieve(k + 1)[0], **kwargs) for k in range(self.K[-1])] 77 | ll_sorted = sorted(list(zip(range(len(ll)), ll)), key=lambda x: x[1], reverse=True)[0:K] 78 | ll_idx, _ = zip(*ll_sorted) 79 | self.beamseq[-1] = [self.beamseq[-1][i] for i in range(self.K[-1]) if i in ll_idx] 80 | assert len(self.beamseq[-1]) == K 81 | for k, bu in enumerate(self.beamseq[-1]): 82 | bu.cur_loc = k 83 | self.K[-1] = K 84 | return ll, ll_sorted 85 | 86 | def beamselect(self, indices=[0]): 87 | """ 88 | Select the beams (at last step) according to indices. 89 | Default: select the first beam, which is equivalent to self.beamcut(1). 90 | """ 91 | indices = sorted(list(set(indices))) # indices: no repeated numbers, and should be sorted 92 | assert indices[-1] < self.K[-1], 'Index out of range (beamwidth).' 93 | self.beamseq[-1] = [self.beamseq[-1][i] for i in indices] 94 | for k, bu in enumerate(self.beamseq[-1]): 95 | bu.cur_loc = k 96 | self.K[-1] = len(self.beamseq[-1]) 97 | 98 | def beamback(self, seq_len): 99 | """ 100 | Trace back the beam at seq_len. 101 | """ 102 | assert seq_len <= len(self.beamseq), 'seq_len larger than maximum.' 103 | if self.beamseq[0][0].word_id is None: 104 | self.beamseq = self.beamseq[0:(seq_len + 1)] 105 | self.K = self.K[0:(seq_len + 1)] 106 | self.step = seq_len 107 | else: 108 | self.beamseq = self.beamseq[0:seq_len] 109 | self.K = self.K[0:seq_len] 110 | self.step = seq_len - 1 111 | self.endall = False 112 | 113 | def retrieve(self, k, seq_len=-1): 114 | """ 115 | Retrieve the k-th ranked generated sentence. 
116 | """ 117 | 118 | if self.beamseq[0][0].word_id is not None and seq_len > 0: 119 | # for a normal LM 120 | seq_len -= 1 121 | 122 | assert 1 <= k <= self.K[seq_len], 'k must be in [1, the total number of beams at seq_len]!' 123 | 124 | rebeam = [self.beamseq[seq_len][k - 1]] 125 | n = seq_len 126 | while rebeam[0].pre_loc is not None: 127 | n -= 1 128 | rebeam = [self.beamseq[n][rebeam[0].pre_loc]] + rebeam 129 | sent = [self.vocab.itos[bu.word_id] for bu in rebeam if bu.word_id is not None] 130 | 131 | return sent, rebeam 132 | 133 | def retrieve_align(self, rebeam): 134 | """ 135 | Should be run after calling Beam.retrieve(...). 136 | """ 137 | align_locs = [bu.align_loc.item() for bu in rebeam if bu.word_id is not None and bu.align_loc is not None] 138 | return align_locs 139 | 140 | def retrieve_endbus(self): 141 | """ 142 | Retrieve the complete sentences acquired by beam steps. 143 | """ 144 | sents = [] 145 | aligns = [] 146 | score_avgs = [] 147 | for ks in self.endbus: 148 | sent, rebeam = self.retrieve(ks[0] + 1, ks[1]) 149 | score_avg = ks[2] / ks[1] 150 | 151 | sents.append(sent) 152 | aligns.append(self.retrieve_align(rebeam)) 153 | score_avgs.append(score_avg) 154 | 155 | return sents, aligns, score_avgs 156 | 157 | def simscore(self, bu, K, template_vec, ee, word_list=None, mono=False, 158 | batch_size=1024, normalized=True, elmo_layer='avg'): 159 | """ 160 | Score function based on sentence similarities. 161 | """ 162 | if word_list is None: 163 | word_list = self.vocab.itos 164 | scores, indices, states = simScoreNext(template_vec, word_list, ee, 165 | prevs_state=bu.elmo_state, batch_size=batch_size, 166 | prevs_align=bu.align_loc if mono else None, 167 | normalized=normalized, elmo_layer=elmo_layer) 168 | scores_prob = torch.nn.functional.log_softmax(scores, dim=0) 169 | 170 | sorted_scores, sorting_indices = torch.sort(scores) 171 | 172 | nexttopK = [BeamUnit(self.vocab.stoi[word_list[i]], bu.cur_loc, None, scores_prob[i].item() + bu.score, 173 | bu.seq_len + 1, self.vocab, elmo_state=states[i], align_loc=indices[i].item()) 174 | for i in sorting_indices[0:(K + 5)] 175 | # do not allow repeated words consecutively 176 | if lemma.lemmatize(self.vocab.itos[bu.word_id]) != lemma.lemmatize(word_list[i])] 177 | nexttopK = nexttopK[0:K] 178 | 179 | return nexttopK 180 | 181 | def lmscore(self, bulist, K, LMModel, word_list=None, subvocab=None, clustermask=None, renorm=False, temperature=1): 182 | """ 183 | Score function based on a pretrained RNN language model. 184 | """ 185 | # note that LMModel should have the same vocab as that in Beam() 186 | 187 | ## when no candidate word list is provided, use the full vocabulary 188 | if word_list is None: 189 | word_list = self.vocab.itos 190 | subvocab = None 191 | clustermask = None 192 | 193 | if self.device is not None: 194 | LMModel = LMModel.cuda(device=self.device) 195 | LMModel.eval() 196 | with torch.no_grad(): 197 | onbeam_ids = list(range(len(bulist))) 198 | batch_text = next(LMModel.parameters()).new_tensor([bulist[i].word_id for i in onbeam_ids], 199 | dtype=torch.long).unsqueeze(0) 200 | if bulist[onbeam_ids[0]].lm_state is None: 201 | # 'lm_state' for the current beam is either all 'None' or all not 'None'. 
202 | batch_hn = None 203 | else: 204 | batch_hn = (torch.cat([bulist[i].lm_state[0] for i in onbeam_ids], dim=1), 205 | torch.cat([bulist[i].lm_state[1] for i in onbeam_ids], dim=1)) 206 | subprobs, probs, hn = prob_next_1step(LMModel, batch_text, hn=batch_hn, 207 | subvocab=subvocab, clustermask=clustermask, onscore=False, 208 | renorm=renorm, 209 | temperature=temperature) 210 | # convert the hidden state tuple into a list of tuples, corresponding to each beam sequence 211 | hn = list( 212 | zip(torch.chunk(hn[0], chunks=len(onbeam_ids), dim=1), torch.chunk(hn[1], chunks=len(onbeam_ids), dim=1))) 213 | lm_cum_logprob = subprobs.new_tensor([bulist[i].lm_score for i in onbeam_ids]).unsqueeze(1) + torch.log( 214 | subprobs) 215 | lm_cum_logprob = lm_cum_logprob.view(-1) # this is the cumulative log probabilities 216 | 217 | ## rank and update 218 | if K > len(lm_cum_logprob): 219 | scores_sorted, ids_sorted = lm_cum_logprob.sort(descending=True) 220 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]], 221 | bulist[onbeam_ids[i // len(word_list)]].cur_loc, 222 | m, 223 | scores_sorted[m].item(), 224 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1, 225 | self.vocab, 226 | lm_score=lm_cum_logprob[i].item(), 227 | lm_state=hn[i // len(word_list)]) 228 | for (m, i) in enumerate(ids_sorted)] 229 | else: 230 | scores_topK, ids_topK = lm_cum_logprob.topk(K) 231 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]], 232 | bulist[onbeam_ids[i // len(word_list)]].cur_loc, 233 | m, 234 | scores_topK[m].item(), 235 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1, 236 | self.vocab, 237 | lm_score=lm_cum_logprob[i].item(), 238 | lm_state=hn[i // len(word_list)]) 239 | for (m, i) in enumerate(ids_topK)] 240 | 241 | endbus = [] 242 | 243 | return nexttopKK, endbus 244 | 245 | def combscoreK(self, bulist, K, template_vec, ee, LMModel, 246 | word_list=None, subvocab=None, clustermask=None, 247 | mono=True, batch_size=1024, normalized=True, renorm=False, temperature=1, 248 | elmo_layer='avg', alpha=0.01, stopbyLMeos=False, ifadditive=False): 249 | """ 250 | Given a list of 'BeamUnit', score the next tokens from the candidate word list based on the combination of 251 | sentence similarities and a pretrained language model. Output the top K scored new 'BeamUnit', in a list. 252 | 253 | Input: 254 | stopbyLMeos: whether to use the LM '' to solely decide end of sentence, i.e. when '' gets the 255 | highest probability from the LM, remove the generated sentence out of beam. Default: False. 256 | 257 | Note: 258 | 'word_list', 'subvocab', and 'clustermask' should be coupled, sorted based on the full vocabulary. 
259 | """ 260 | 261 | ## when no candidate word list is provided, use the full vocabulary 262 | if word_list is None: 263 | word_list = self.vocab.itos 264 | subvocab = None 265 | clustermask = None 266 | 267 | ## calculate the similarity scores 268 | endbus = [] # finished sequences 269 | onbeam_ids = list(range( 270 | len(bulist))) # keep track of sequences on beam that have not aligned to the end of the source sequence 271 | sim_cum_allbeam = None 272 | indices_allbeam = None 273 | states_allbeam = [] 274 | for (i, bu) in enumerate(bulist): 275 | try: 276 | scores, indices, states = simScoreNext(template_vec, word_list, ee, 277 | prevs_state=bu.elmo_state, batch_size=batch_size, 278 | prevs_align=bu.align_loc if mono else None, 279 | normalized=normalized, elmo_layer=elmo_layer) 280 | scores_logprob = F.log_softmax(scores, dim=0) 281 | 282 | sim_cum_logprob = scores_logprob + torch.tensor(bu.sim_score, dtype=torch.float, device=self.device) 283 | 284 | sim_cum_allbeam = sim_cum_logprob if sim_cum_allbeam is None else torch.cat( 285 | [sim_cum_allbeam, sim_cum_logprob]) 286 | indices_allbeam = indices if indices_allbeam is None else torch.cat([indices_allbeam, indices]) 287 | states_allbeam = states_allbeam + states 288 | 289 | # current sequence already aligned to the end: move out of beam 290 | except AssertionError as e: 291 | print('AssertionError:', e) 292 | endbus.append((i, bu.seq_len, bu.score, bu.sim_score, bu.lm_score)) 293 | onbeam_ids.remove(i) 294 | 295 | ## calculate the RNN LM scores 296 | ## note that LMModel should have the same vocab as that in Beam() 297 | if len(bulist) == 1 and bulist[0].word_id is None: 298 | # first beam step after initialization, only relying on similarity scores and no LM calculation is needed 299 | scores_comb = sim_cum_allbeam 300 | lm_cum_logprob = torch.zeros_like(sim_cum_allbeam) 301 | hn = [None] * len(onbeam_ids) # at the initial step, 'onbeam_ids' wouldn't be empty anyway 302 | else: 303 | ## all sequences have aligned to the end of source sentence 304 | if onbeam_ids == []: 305 | return [], endbus 306 | ## do the RNN LM forward calculation 307 | if bulist[onbeam_ids[0]].lm_state is None: 308 | # 'lm_state' for the current beam is either all 'None' or all not 'None'. 309 | batch_hn = None 310 | else: 311 | batch_hn = (torch.cat([bulist[i].lm_state[0] for i in onbeam_ids], dim=1), 312 | torch.cat([bulist[i].lm_state[1] for i in onbeam_ids], dim=1)) 313 | batch_text = next(LMModel.parameters()).new_tensor([bulist[i].word_id for i in onbeam_ids], 314 | dtype=torch.long).unsqueeze(0) 315 | subprobs, probs, hn = prob_next_1step(LMModel, batch_text, hn=batch_hn, 316 | subvocab=subvocab, clustermask=clustermask, onscore=False, 317 | renorm=renorm, 318 | temperature=temperature) 319 | 320 | ### LM predictes '' with the highest probability: move out of beam 321 | if stopbyLMeos: 322 | subprobs_max, subprobs_maxids = torch.max(subprobs, dim=1) 323 | eospos = (subprobs_maxids == word_list.index('')).nonzero() 324 | if eospos.size(0) > 0: # number of ended sentences 325 | # Note: have to delete backwards! Otherwise the indices will change. 
326 | oob_ids = [onbeam_ids.pop(ep.item()) for ep in eospos.squeeze(1).sort(descending=True)[0]] 327 | oob_ids = sorted(oob_ids) 328 | print('-' * 5 + ' predicted most likely by LM at location:', *oob_ids) 329 | for i in oob_ids: 330 | endbus.append((i, bulist[i].seq_len, bulist[i].score, bulist[i].sim_score, bulist[i].lm_score)) 331 | # all sequences have been predicted with '' having highest probabilities 332 | if onbeam_ids == []: 333 | return [], endbus 334 | else: 335 | remainpos = [i for i in range(len(subprobs)) if i not in eospos] 336 | subprobs = subprobs[remainpos, :] 337 | probs = probs[remainpos, :] 338 | hn = (hn[0][:, remainpos, :], hn[1][:, remainpos, :]) 339 | remainpos_simallbeam = [] 340 | for rp in remainpos: 341 | remainpos_simallbeam += list(range(len(word_list) * rp, len(word_list) * (rp + 1))) 342 | sim_cum_allbeam = sim_cum_allbeam[remainpos_simallbeam] 343 | indices_allbeam = indices_allbeam[remainpos_simallbeam] 344 | states_allbeam = [s for (i, s) in enumerate(states_allbeam) if i in remainpos_simallbeam] 345 | 346 | # convert the hidden state tuple into a list of tuples, corresponding to each beam sequence 347 | hn = list(zip(torch.chunk(hn[0], chunks=len(onbeam_ids), dim=1), 348 | torch.chunk(hn[1], chunks=len(onbeam_ids), dim=1))) 349 | lm_cum_logprob = subprobs.new_tensor([bulist[i].lm_score for i in onbeam_ids]).unsqueeze(1) + torch.log( 350 | subprobs) 351 | lm_cum_logprob = lm_cum_logprob.view(-1) # this is the cumulative log probabilities 352 | 353 | if ifadditive: 354 | scores_comb = torch.log((1 - alpha) * torch.exp(sim_cum_allbeam) + alpha * torch.exp(lm_cum_logprob)) 355 | else: 356 | scores_comb = (1 - alpha) * sim_cum_allbeam + alpha * lm_cum_logprob 357 | 358 | ## rank and update 359 | if K > len(scores_comb): 360 | scores_comb_sorted, ids_sorted = scores_comb.sort(descending=True) 361 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]], 362 | bulist[onbeam_ids[i // len(word_list)]].cur_loc, 363 | m, 364 | scores_comb_sorted[m].item(), 365 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1, 366 | self.vocab, 367 | sim_score=sim_cum_allbeam[i].item(), 368 | lm_score=lm_cum_logprob[i].item(), 369 | lm_state=hn[i // len(word_list)], 370 | elmo_state=states_allbeam[i], 371 | align_loc=indices_allbeam[i]) 372 | for (m, i) in enumerate(ids_sorted)] 373 | else: 374 | scores_comb_topK, ids_topK = scores_comb.topk(K) 375 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]], 376 | bulist[onbeam_ids[i // len(word_list)]].cur_loc, 377 | m, 378 | scores_comb_topK[m].item(), 379 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1, 380 | self.vocab, 381 | sim_score=sim_cum_allbeam[i].item(), 382 | lm_score=lm_cum_logprob[i].item(), 383 | lm_state=hn[i // len(word_list)], 384 | elmo_state=states_allbeam[i], 385 | align_loc=indices_allbeam[i]) 386 | for (m, i) in enumerate(ids_topK)] 387 | 388 | return nexttopKK, endbus 389 | 390 | def combscoreK_GPT2(self, bulist, K, template_vec, ge, LMModel, 391 | word_list=None, subvocab=None, clustermask=None, 392 | mono=True, normalized=True, renorm=False, temperature=1, 393 | bpe2word='last', alpha=0.01, stopbyLMeos=False, ifadditive=False): 394 | """ 395 | Given a list of 'BeamUnit', score the next tokens from the candidate word list based on the combination of 396 | sentence similarities and a pretrained language model. Output the top K scored new 'BeamUnit', in a list. 397 | 398 | Input: 399 | stopbyLMeos: whether to use the LM '' to solely decide end of sentence, i.e. 
when '' gets the 400 | highest probability from the LM, remove the generated sentence out of beam. Default: False. 401 | 402 | Note: 403 | 'word_list', 'subvocab', and 'clustermask' should be coupled, sorted based on the full vocabulary. 404 | """ 405 | 406 | ## when no candidate word list is provided, use the full vocabulary 407 | if word_list is None: 408 | word_list = self.vocab.itos 409 | subvocab = None 410 | clustermask = None 411 | 412 | ## calculate the similarity scores 413 | endbus = [] # finished sequences 414 | onbeam_ids = list(range( 415 | len(bulist))) # keep track of sequences on beam that have not aligned to the end of the source sequence 416 | sim_cum_allbeam = None 417 | indices_allbeam = None 418 | states_allbeam = [] 419 | for (i, bu) in enumerate(bulist): 420 | try: 421 | scores, indices, states = simScoreNext_GPT2(template_vec, word_list, ge, 422 | prevs_state=bu.gpt2_state, 423 | prevs_align=bu.align_loc if mono else None, 424 | normalized=normalized, bpe2word=bpe2word) 425 | scores_logprob = F.log_softmax(scores, dim=0) 426 | 427 | sim_cum_logprob = scores_logprob + torch.tensor(bu.sim_score, dtype=torch.float, device=self.device) 428 | 429 | sim_cum_allbeam = sim_cum_logprob if sim_cum_allbeam is None else torch.cat( 430 | [sim_cum_allbeam, sim_cum_logprob]) 431 | indices_allbeam = indices if indices_allbeam is None else torch.cat([indices_allbeam, indices]) 432 | states_allbeam = states_allbeam + states 433 | 434 | # current sequence already aligned to the end: move out of beam 435 | except AssertionError as e: 436 | print('AssertionError:', e) 437 | endbus.append((i, bu.seq_len, bu.score, bu.sim_score, bu.lm_score)) 438 | onbeam_ids.remove(i) 439 | 440 | ## calculate the RNN LM scores 441 | ## note that LMModel should have the same vocab as that in Beam() 442 | if len(bulist) == 1 and bulist[0].word_id is None: 443 | # first beam step after initialization, only relying on similarity scores and no LM calculation is needed 444 | scores_comb = sim_cum_allbeam 445 | lm_cum_logprob = torch.zeros_like(sim_cum_allbeam) 446 | hn = [None] * len(onbeam_ids) # at the initial step, 'onbeam_ids' wouldn't be empty anyway 447 | else: 448 | ## all sequences have aligned to the end of source sentence 449 | if onbeam_ids == []: 450 | return [], endbus 451 | ## do the RNN LM forward calculation 452 | if bulist[onbeam_ids[0]].lm_state is None: 453 | # 'lm_state' for the current beam is either all 'None' or all not 'None'. 454 | batch_hn = None 455 | else: 456 | batch_hn = (torch.cat([bulist[i].lm_state[0] for i in onbeam_ids], dim=1), 457 | torch.cat([bulist[i].lm_state[1] for i in onbeam_ids], dim=1)) 458 | batch_text = next(LMModel.parameters()).new_tensor([bulist[i].word_id for i in onbeam_ids], 459 | dtype=torch.long).unsqueeze(0) 460 | subprobs, probs, hn = prob_next_1step(LMModel, batch_text, hn=batch_hn, 461 | subvocab=subvocab, clustermask=clustermask, onscore=False, 462 | renorm=renorm, 463 | temperature=temperature) 464 | 465 | ### LM predictes '' with the highest probability: move out of beam 466 | if stopbyLMeos: 467 | subprobs_max, subprobs_maxids = torch.max(subprobs, dim=1) 468 | eospos = (subprobs_maxids == word_list.index('')).nonzero() 469 | if eospos.size(0) > 0: # number of ended sentences 470 | # Note: have to delete backwards! Otherwise the indices will change. 
471 | oob_ids = [onbeam_ids.pop(ep.item()) for ep in eospos.squeeze(1).sort(descending=True)[0]] 472 | oob_ids = sorted(oob_ids) 473 | print('-' * 5 + ' predicted most likely by LM at location:', *oob_ids) 474 | for i in oob_ids: 475 | endbus.append((i, bulist[i].seq_len, bulist[i].score, bulist[i].sim_score, bulist[i].lm_score)) 476 | # all sequences have been predicted with '' having highest probabilities 477 | if onbeam_ids == []: 478 | return [], endbus 479 | else: 480 | remainpos = [i for i in range(len(subprobs)) if i not in eospos] 481 | subprobs = subprobs[remainpos, :] 482 | probs = probs[remainpos, :] 483 | hn = (hn[0][:, remainpos, :], hn[1][:, remainpos, :]) 484 | remainpos_simallbeam = [] 485 | for rp in remainpos: 486 | remainpos_simallbeam += list(range(len(word_list) * rp, len(word_list) * (rp + 1))) 487 | sim_cum_allbeam = sim_cum_allbeam[remainpos_simallbeam] 488 | indices_allbeam = indices_allbeam[remainpos_simallbeam] 489 | states_allbeam = [s for (i, s) in enumerate(states_allbeam) if i in remainpos_simallbeam] 490 | 491 | # convert the hidden state tuple into a list of tuples, corresponding to each beam sequence 492 | hn = list(zip(torch.chunk(hn[0], chunks=len(onbeam_ids), dim=1), 493 | torch.chunk(hn[1], chunks=len(onbeam_ids), dim=1))) 494 | lm_cum_logprob = subprobs.new_tensor([bulist[i].lm_score for i in onbeam_ids]).unsqueeze(1) + torch.log( 495 | subprobs) 496 | lm_cum_logprob = lm_cum_logprob.view(-1) # this is the cumulative log probabilities 497 | 498 | if ifadditive: 499 | scores_comb = torch.log((1 - alpha) * torch.exp(sim_cum_allbeam) + alpha * torch.exp(lm_cum_logprob)) 500 | else: 501 | scores_comb = (1 - alpha) * sim_cum_allbeam + alpha * lm_cum_logprob 502 | 503 | ## rank and update 504 | if K > len(scores_comb): 505 | scores_comb_sorted, ids_sorted = scores_comb.sort(descending=True) 506 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]], 507 | bulist[onbeam_ids[i // len(word_list)]].cur_loc, 508 | m, 509 | scores_comb_sorted[m].item(), 510 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1, 511 | self.vocab, 512 | sim_score=sim_cum_allbeam[i].item(), 513 | lm_score=lm_cum_logprob[i].item(), 514 | lm_state=hn[i // len(word_list)], 515 | gpt2_state=states_allbeam[i], 516 | align_loc=indices_allbeam[i]) 517 | for (m, i) in enumerate(ids_sorted)] 518 | else: 519 | scores_comb_topK, ids_topK = scores_comb.topk(K) 520 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]], 521 | bulist[onbeam_ids[i // len(word_list)]].cur_loc, 522 | m, 523 | scores_comb_topK[m].item(), 524 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1, 525 | self.vocab, 526 | sim_score=sim_cum_allbeam[i].item(), 527 | lm_score=lm_cum_logprob[i].item(), 528 | lm_state=hn[i // len(word_list)], 529 | gpt2_state=states_allbeam[i], 530 | align_loc=indices_allbeam[i]) 531 | for (m, i) in enumerate(ids_topK)] 532 | 533 | return nexttopKK, endbus 534 | -------------------------------------------------------------------------------- /uss/elmo_lstm_forward.py: -------------------------------------------------------------------------------- 1 | """ 2 | A stacked forward only LSTM with skip connections between layers. 3 | 4 | Modified from allennlp/modules/elmo_lstm.py. 
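Usage sketch (the hyper-parameters follow the standard 2x4096_512 ELMo options file, and the
local weight file path is an assumption used only for illustration):

    lstm = ElmoLstmForward(input_size=512, hidden_size=512, cell_size=4096, num_layers=2)
    lstm.load_weights('elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5')
    # inputs: (batch, time, 512) token embeddings; mask: (batch, time) binary LongTensor
    outputs, (h, c) = lstm(inputs, mask)   # outputs: (num_layers, batch, time, 512)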
5 | """ 6 | from typing import Optional, Tuple, List, Union 7 | import warnings 8 | 9 | import torch 10 | from torch.nn.utils.rnn import PackedSequence, pad_packed_sequence 11 | with warnings.catch_warnings(): 12 | warnings.filterwarnings("ignore", category=FutureWarning) 13 | import h5py 14 | import numpy 15 | 16 | from allennlp.modules.lstm_cell_with_projection import LstmCellWithProjection 17 | from allennlp.common.checks import ConfigurationError 18 | from allennlp.modules.encoder_base import _EncoderBase 19 | from allennlp.common.file_utils import cached_path 20 | 21 | RnnState = Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]] # pylint: disable=invalid-name 22 | 23 | class ElmoLstmForward(_EncoderBase): 24 | """ 25 | A stacked, forward only LSTM which uses 26 | :class:`~allennlp.modules.lstm_cell_with_projection.LstmCellWithProjection`'s 27 | with highway layers between the inputs to layers. 28 | The inputs to the forward and backward directions are independent - forward and backward 29 | states are not concatenated between layers. 30 | Additionally, this LSTM maintains its `own` state, which is updated every time 31 | ``forward`` is called. It is dynamically resized for different batch sizes and is 32 | designed for use with non-continuous inputs (i.e inputs which aren't formatted as a stream, 33 | such as text used for a language modelling task, which is how stateful RNNs are typically used). 34 | This is non-standard, but can be thought of as having an "end of sentence" state, which is 35 | carried across different sentences. 36 | Parameters 37 | ---------- 38 | input_size : ``int``, required 39 | The dimension of the inputs to the LSTM. 40 | hidden_size : ``int``, required 41 | The dimension of the outputs of the LSTM. 42 | cell_size : ``int``, required. 43 | The dimension of the memory cell of the 44 | :class:`~allennlp.modules.lstm_cell_with_projection.LstmCellWithProjection`. 45 | num_layers : ``int``, required 46 | The number of bidirectional LSTMs to use. 47 | requires_grad: ``bool``, optional 48 | If True, compute gradient of ELMo parameters for fine tuning. 49 | recurrent_dropout_probability: ``float``, optional (default = 0.0) 50 | The dropout probability to be used in a dropout scheme as stated in 51 | `A Theoretically Grounded Application of Dropout in Recurrent Neural Networks 52 | `_ . 53 | state_projection_clip_value: ``float``, optional, (default = None) 54 | The magnitude with which to clip the hidden_state after projecting it. 55 | memory_cell_clip_value: ``float``, optional, (default = None) 56 | The magnitude with which to clip the memory cell. 57 | """ 58 | def __init__(self, 59 | input_size: int, 60 | hidden_size: int, 61 | cell_size: int, 62 | num_layers: int, 63 | requires_grad: bool = False, 64 | recurrent_dropout_probability: float = 0.0, 65 | memory_cell_clip_value: Optional[float] = None, 66 | state_projection_clip_value: Optional[float] = None) -> None: 67 | super(ElmoLstmForward, self).__init__(stateful=False) # change 'stateful' flag to be False 68 | # so that hidden_state can be externally provided 69 | 70 | # Required to be wrapped with a :class:`PytorchSeq2SeqWrapper`. 
71 | self.input_size = input_size 72 | self.hidden_size = hidden_size 73 | self.num_layers = num_layers 74 | self.cell_size = cell_size 75 | self.requires_grad = requires_grad 76 | 77 | forward_layers = [] 78 | 79 | lstm_input_size = input_size 80 | go_forward = True 81 | for layer_index in range(num_layers): 82 | forward_layer = LstmCellWithProjection(lstm_input_size, 83 | hidden_size, 84 | cell_size, 85 | go_forward, 86 | recurrent_dropout_probability, 87 | memory_cell_clip_value, 88 | state_projection_clip_value) 89 | 90 | lstm_input_size = hidden_size 91 | 92 | self.add_module('forward_layer_{}'.format(layer_index), forward_layer) 93 | 94 | forward_layers.append(forward_layer) 95 | 96 | self.forward_layers = forward_layers 97 | 98 | 99 | def forward(self, # pylint: disable=arguments-differ 100 | inputs: torch.Tensor, 101 | mask: torch.LongTensor, 102 | hidden_state: Optional[RnnState] = None) -> torch.Tensor: 103 | """ 104 | Parameters 105 | ---------- 106 | inputs : ``torch.Tensor``, required. 107 | A Tensor of shape ``(batch_size, sequence_length, embedding_size)``. 108 | mask : ``torch.LongTensor``, required. 109 | A binary mask of shape ``(batch_size, sequence_length)`` representing the 110 | non-padded elements in each sequence in the batch. 111 | hidden_state : ``Optional[RnnState]``, (default = None). 112 | A single tensor of shape (num_layers, batch_size, hidden_size) representing the 113 | state of an RNN with or a tuple of 114 | tensors of shapes (num_layers, batch_size, hidden_size) and 115 | (num_layers, batch_size, memory_size), representing the hidden state and memory 116 | state of an LSTM-like RNN. 117 | Returns 118 | ------- 119 | A ``torch.Tensor`` of shape (num_layers, batch_size, sequence_length, hidden_size), 120 | where the num_layers dimension represents the LSTM output from that layer. 121 | A tuple of 122 | tensors of shapes (num_layers, batch_size, hidden_size) and 123 | (num_layers, batch_size, memory_size), representing the final hidden state and memory 124 | state of an LSTM-like RNN. 125 | """ 126 | batch_size, total_sequence_length = mask.size() 127 | stacked_sequence_output, final_states, restoration_indices = \ 128 | self.sort_and_run_forward(self._lstm_forward, inputs, mask, hidden_state) # add 'hidden_state' here 129 | 130 | num_layers, num_valid, returned_timesteps, encoder_dim = stacked_sequence_output.size() 131 | # Add back invalid rows which were removed in the call to sort_and_run_forward. 132 | if num_valid < batch_size: 133 | zeros = stacked_sequence_output.new_zeros(num_layers, 134 | batch_size - num_valid, 135 | returned_timesteps, 136 | encoder_dim) 137 | stacked_sequence_output = torch.cat([stacked_sequence_output, zeros], 1) 138 | 139 | # The states also need to have invalid rows added back. 140 | new_states = [] 141 | for state in final_states: 142 | state_dim = state.size(-1) 143 | zeros = state.new_zeros(num_layers, batch_size - num_valid, state_dim) 144 | new_states.append(torch.cat([state, zeros], 1)) 145 | final_states = new_states 146 | 147 | # It's possible to need to pass sequences which are padded to longer than the 148 | # max length of the sequence to a Seq2StackEncoder. However, packing and unpacking 149 | # the sequences mean that the returned tensor won't include these dimensions, because 150 | # the RNN did not need to process them. We add them back on in the form of zeros here. 
151 | sequence_length_difference = total_sequence_length - returned_timesteps 152 | if sequence_length_difference > 0: 153 | zeros = stacked_sequence_output.new_zeros(num_layers, 154 | batch_size, 155 | sequence_length_difference, 156 | stacked_sequence_output[0].size(-1)) 157 | stacked_sequence_output = torch.cat([stacked_sequence_output, zeros], 2) 158 | 159 | # self._update_states(final_states, restoration_indices) 160 | 161 | # Restore the original indices and return the sequence. 162 | # Has shape (num_layers, batch_size, sequence_length, hidden_size) 163 | return stacked_sequence_output.index_select(1, restoration_indices), \ 164 | tuple([state.index_select(1, restoration_indices).detach() for state in final_states]) # detach final states 165 | 166 | def _lstm_forward(self, 167 | inputs: PackedSequence, 168 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> \ 169 | Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]: 170 | """ 171 | Parameters 172 | ---------- 173 | inputs : ``PackedSequence``, required. 174 | A batch first ``PackedSequence`` to run the stacked LSTM over. 175 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None) 176 | A tuple (state, memory) representing the initial hidden state and memory 177 | of the LSTM, with shape (num_layers, batch_size, 1 * hidden_size) and 178 | (num_layers, batch_size, 1 * cell_size) respectively. 179 | Returns 180 | ------- 181 | output_sequence : ``torch.FloatTensor`` 182 | The encoded sequence of shape (num_layers, batch_size, sequence_length, hidden_size) 183 | final_states: ``Tuple[torch.FloatTensor, torch.FloatTensor]`` 184 | The per-layer final (state, memory) states of the LSTM, with shape 185 | (num_layers, batch_size, 1 * hidden_size) and (num_layers, batch_size, 1 * cell_size) 186 | respectively. The last dimension is NOT duplicated because it contains the state/memory 187 | for ONLY the forward layers. 188 | """ 189 | if initial_state is None: 190 | hidden_states: List[Optional[Tuple[torch.Tensor, 191 | torch.Tensor]]] = [None] * len(self.forward_layers) 192 | elif initial_state[0].size()[0] != len(self.forward_layers): 193 | raise ConfigurationError("Initial states were passed to forward() but the number of " 194 | "initial states does not match the number of layers.") 195 | else: 196 | hidden_states = list(zip(initial_state[0].split(1, 0), initial_state[1].split(1, 0))) 197 | # list of tuples, each one is a (hidden, memory) tuple for that layer 198 | 199 | inputs, batch_lengths = pad_packed_sequence(inputs, batch_first=True) 200 | forward_output_sequence = inputs 201 | 202 | final_states = [] 203 | sequence_outputs = [] 204 | for layer_index, state in enumerate(hidden_states): 205 | forward_layer = getattr(self, 'forward_layer_{}'.format(layer_index)) 206 | 207 | forward_cache = forward_output_sequence 208 | 209 | if state is not None: 210 | forward_hidden_state = state[0] 211 | forward_memory_state = state[1] 212 | forward_state = (forward_hidden_state, forward_memory_state) 213 | else: 214 | forward_state = None 215 | 216 | forward_output_sequence, forward_state = forward_layer(forward_output_sequence, 217 | batch_lengths, 218 | forward_state) 219 | 220 | # Skip connections, just adding the input to the output. 221 | if layer_index != 0: 222 | forward_output_sequence += forward_cache 223 | 224 | sequence_outputs.append(forward_output_sequence) 225 | # Append the state tuples in a list, so that we can return 226 | # the final states for all the layers. 
227 | final_states.append((forward_state[0], 228 | forward_state[1])) 229 | 230 | stacked_sequence_outputs: torch.FloatTensor = torch.stack(sequence_outputs) 231 | # Stack the hidden state and memory for each layer into 2 tensors of shape 232 | # (num_layers, batch_size, hidden_size) and (num_layers, batch_size, cell_size) 233 | # respectively. 234 | final_hidden_states, final_memory_states = zip(*final_states) 235 | final_state_tuple: Tuple[torch.FloatTensor, 236 | torch.FloatTensor] = (torch.cat(final_hidden_states, 0), 237 | torch.cat(final_memory_states, 0)) 238 | return stacked_sequence_outputs, final_state_tuple 239 | 240 | def load_weights(self, weight_file: str) -> None: 241 | """ 242 | Load the pre-trained weights from the file. 243 | """ 244 | requires_grad = self.requires_grad 245 | 246 | with h5py.File(cached_path(weight_file), 'r') as fin: 247 | for i_layer, lstms in enumerate( 248 | zip(self.forward_layers) 249 | ): 250 | for j_direction, lstm in enumerate(lstms): 251 | # lstm is an instance of LSTMCellWithProjection 252 | cell_size = lstm.cell_size 253 | 254 | dataset = fin['RNN_%s' % j_direction]['RNN']['MultiRNNCell']['Cell%s' % i_layer 255 | ]['LSTMCell'] 256 | 257 | # tensorflow packs together both W and U matrices into one matrix, 258 | # but pytorch maintains individual matrices. In addition, tensorflow 259 | # packs the gates as input, memory, forget, output but pytorch 260 | # uses input, forget, memory, output. So we need to modify the weights. 261 | tf_weights = numpy.transpose(dataset['W_0'][...]) 262 | torch_weights = tf_weights.copy() 263 | 264 | # split the W from U matrices 265 | input_size = lstm.input_size 266 | input_weights = torch_weights[:, :input_size] 267 | recurrent_weights = torch_weights[:, input_size:] 268 | tf_input_weights = tf_weights[:, :input_size] 269 | tf_recurrent_weights = tf_weights[:, input_size:] 270 | 271 | # handle the different gate order convention 272 | for torch_w, tf_w in [[input_weights, tf_input_weights], 273 | [recurrent_weights, tf_recurrent_weights]]: 274 | torch_w[(1 * cell_size):(2 * cell_size), :] = tf_w[(2 * cell_size):(3 * cell_size), :] 275 | torch_w[(2 * cell_size):(3 * cell_size), :] = tf_w[(1 * cell_size):(2 * cell_size), :] 276 | 277 | lstm.input_linearity.weight.data.copy_(torch.FloatTensor(input_weights)) 278 | lstm.state_linearity.weight.data.copy_(torch.FloatTensor(recurrent_weights)) 279 | lstm.input_linearity.weight.requires_grad = requires_grad 280 | lstm.state_linearity.weight.requires_grad = requires_grad 281 | 282 | # the bias weights 283 | tf_bias = dataset['B'][...] 284 | # tensorflow adds 1.0 to forget gate bias instead of modifying the 285 | # parameters... 
286 | tf_bias[(2 * cell_size):(3 * cell_size)] += 1 287 | torch_bias = tf_bias.copy() 288 | torch_bias[(1 * cell_size):(2 * cell_size) 289 | ] = tf_bias[(2 * cell_size):(3 * cell_size)] 290 | torch_bias[(2 * cell_size):(3 * cell_size) 291 | ] = tf_bias[(1 * cell_size):(2 * cell_size)] 292 | lstm.state_linearity.bias.data.copy_(torch.FloatTensor(torch_bias)) 293 | lstm.state_linearity.bias.requires_grad = requires_grad 294 | 295 | # the projection weights 296 | proj_weights = numpy.transpose(dataset['W_P_0'][...]) 297 | lstm.state_projection.weight.data.copy_(torch.FloatTensor(proj_weights)) 298 | lstm.state_projection.weight.requires_grad = requires_grad 299 | -------------------------------------------------------------------------------- /uss/elmo_sequential_embedder.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sequentially embed tokens into ELMo vectors, using only forward computation, with externally updated hidden states. 3 | 4 | Based on allennlp.commands.elmo.ElmoEmbedder and allennlp.modules.elmo._ElmoBiLm. 5 | """ 6 | 7 | import json 8 | import logging 9 | from typing import List, Iterable, Tuple, Any, Optional, Dict 10 | import warnings 11 | 12 | # with warnings.catch_warnings(): 13 | # warnings.filterwarnings("ignore", category=FutureWarning) 14 | # import h5py 15 | warnings.filterwarnings('ignore', message='numpy.dtype size changed') 16 | warnings.filterwarnings('ignore', message='numpy.ufunc size changed') 17 | 18 | import numpy 19 | import torch 20 | from overrides import overrides 21 | 22 | from allennlp.common.file_utils import cached_path 23 | from allennlp.common.tqdm import Tqdm 24 | from allennlp.common.util import lazy_groups_of 25 | from allennlp.common.checks import ConfigurationError 26 | from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper 27 | from allennlp.modules.elmo import batch_to_ids, _ElmoCharacterEncoder 28 | 29 | from elmo_lstm_forward import ElmoLstmForward 30 | 31 | logger = logging.getLogger(__name__) # pylint: disable=invalid-name 32 | 33 | DEFAULT_OPTIONS_FILE = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json" # pylint: disable=line-too-long 34 | DEFAULT_WEIGHT_FILE = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5" # pylint: disable=line-too-long 35 | DEFAULT_BATCH_SIZE = 64 36 | 37 | 38 | class ElmoEmbedderForward(torch.nn.Module): 39 | def __init__(self, 40 | options_file: str = DEFAULT_OPTIONS_FILE, 41 | weight_file: str = DEFAULT_WEIGHT_FILE, 42 | requires_grad: bool = False, 43 | vocab_to_cache: List[str] = None, 44 | cuda_device: int = -1) -> None: 45 | super(ElmoEmbedderForward, self).__init__() 46 | 47 | self._token_embedder = _ElmoCharacterEncoder2(options_file, weight_file, requires_grad=requires_grad) 48 | 49 | self._requires_grad = requires_grad 50 | if requires_grad and vocab_to_cache: 51 | logging.warning("You are fine tuning ELMo and caching char CNN word vectors. " 52 | "This behaviour is not guaranteed to be well defined, particularly. " 53 | "if not all of your inputs will occur in the vocabulary cache.") 54 | # This is an embedding, used to look up cached 55 | # word vectors built from character level cnn embeddings. 
56 | self._word_embedding = None 57 | self._bos_embedding: torch.Tensor = None 58 | self._eos_embedding: torch.Tensor = None 59 | if vocab_to_cache: 60 | logging.info("Caching character cnn layers for words in vocabulary.") 61 | # This sets 3 attributes, _word_embedding, _bos_embedding and _eos_embedding. 62 | # They are set in the method so they can be accessed from outside the 63 | # constructor. 64 | self.create_cached_cnn_embeddings(vocab_to_cache) 65 | self.vocab = vocab_to_cache # the first token should be the padding token, with id = 0 66 | 67 | with open(cached_path(options_file), 'r') as fin: 68 | options = json.load(fin) 69 | if not options['lstm'].get('use_skip_connections'): 70 | raise ConfigurationError('We only support pretrained biLMs with residual connections') 71 | 72 | logger.info("Initializing ELMo Forward.") 73 | self._elmo_lstm_forward = ElmoLstmForward(input_size=options['lstm']['projection_dim'], 74 | hidden_size=options['lstm']['projection_dim'], 75 | cell_size=options['lstm']['dim'], 76 | num_layers=options['lstm']['n_layers'], 77 | memory_cell_clip_value=options['lstm']['cell_clip'], 78 | state_projection_clip_value=options['lstm']['proj_clip'], 79 | requires_grad=requires_grad) 80 | self._elmo_lstm_forward.load_weights(weight_file) 81 | if cuda_device >= 0: 82 | self._elmo_lstm_forward = self._elmo_lstm_forward.cuda(device=cuda_device) 83 | self._token_embedder = self._token_embedder.cuda(device=cuda_device) 84 | # self.cuda(device=cuda_device) # this happens in-place 85 | self.cuda_device = cuda_device if cuda_device >= 0 else 'cpu' 86 | # Number of representation layers including context independent layer 87 | self.num_layers = options['lstm']['n_layers'] + 1 88 | 89 | def batch_to_embeddings(self, 90 | batch: List[List[str]], 91 | add_bos: bool = False, 92 | add_eos: bool = False) -> Tuple[torch.Tensor, torch.Tensor]: 93 | """ 94 | Compute sentence insensitive token representations for a batch of tokenized sentences, 95 | using pretrained character level CNN. This is the first layer of ELMo representation. 96 | 97 | Parameters 98 | ---------- 99 | batch : ``List[List[str]]``, required 100 | A list of tokenized sentences. 101 | add_bos: ``bool`` 102 | Whether to add begin of sentence token. 103 | add_eos: ``bool`` 104 | Whether to add end of sentence token. 105 | Returns 106 | ------- 107 | type_representation: ``torch.Tensor`` 108 | Shape ``(batch_size, sequence_length + 0/1/2, embedding_dim)`` tensor with context 109 | insensitive token representations. 110 | mask: ``torch.Tensor`` 111 | Shape ``(batch_size, sequence_length + 0/1/2)`` long tensor with sequence mask. 112 | """ 113 | 114 | if self._word_embedding is not None: # vocab_to_cache was passed in the constructor of this class 115 | try: 116 | word_inputs = [[self.vocab.index(w) for w in b] for b in batch] 117 | max_timesteps = max([len(b) for b in word_inputs]) 118 | word_inputs = [b + [0] * (max_timesteps - len(b)) if len(b) < max_timesteps else b 119 | for b in word_inputs] # 0 is the padding id 120 | word_inputs = torch.tensor(word_inputs, dtype=torch.long, device=self.cuda_device) 121 | # word ids in the cached vocabulary 122 | # LongTensor of shape (batch_size, max_timesteps) 123 | 124 | mask_without_bos_eos = (word_inputs > 0).long() 125 | # The character cnn part is cached - just look it up. 
126 | embedded_inputs = self._word_embedding(word_inputs) # type: ignore 127 | # shape (batch_size, timesteps + 0/1/2, embedding_dim) 128 | type_representation, mask = add_sentence_boundaries( 129 | embedded_inputs, 130 | mask_without_bos_eos, 131 | self._bos_embedding, 132 | self._eos_embedding, 133 | add_bos, 134 | add_eos 135 | ) 136 | except RuntimeError: 137 | character_ids = batch_to_ids(batch) # size (batch_size, max_timesteps, 50) 138 | if self.cuda_device >= 0: 139 | character_ids = character_ids.cuda(device=self.cuda_device) 140 | # Back off to running the character convolutions, 141 | # as we might not have the words in the cache. 142 | token_embedding = self._token_embedder(character_ids, add_bos, add_eos) 143 | mask = token_embedding['mask'] 144 | type_representation = token_embedding['token_embedding'] 145 | else: 146 | character_ids = batch_to_ids(batch) # size (batch_size, max_timesteps, 50) 147 | if self.cuda_device >= 0: 148 | character_ids = character_ids.cuda(device=self.cuda_device) 149 | token_embedding = self._token_embedder(character_ids, add_bos, add_eos) 150 | mask = token_embedding['mask'] 151 | type_representation = token_embedding['token_embedding'] 152 | 153 | return type_representation, mask 154 | 155 | def forward(self, 156 | batch: List[List[str]], 157 | add_bos: bool = False, 158 | add_eos: bool = False, 159 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> \ 160 | Tuple[List[numpy.ndarray], Tuple[torch.Tensor, torch.Tensor]]: 161 | """ 162 | Parameters 163 | ---------- 164 | batch : ``List[List[str]]``, required 165 | A list of tokenized sentences. 166 | add_bos: ``bool`` 167 | Whether to add begin of sentence token. 168 | add_eos: ``bool`` 169 | Whether to add end of sentence token. 170 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None) 171 | A tuple (state, memory) representing the initial hidden state and memory 172 | of the LSTM, with shape (num_layers, batch_size, 1 * hidden_size) and 173 | (num_layers, batch_size, 1 * cell_size) respectively. 174 | 175 | Or, with shape (num_layers, 1 * hidden_size) and 176 | (num_layers, 1 * cell_size) respectively, if all the batch share the same initial_state. 177 | 178 | Returns 179 | ------- 180 | lstm_outputs : ``torch.FloatTensor`` 181 | The encoded sequence of shape (num_layers, batch_size, sequence_length, hidden_size) 182 | final_states : ``Tuple[torch.FloatTensor, torch.FloatTensor]`` 183 | The per-layer final (state, memory) states of the LSTM, with shape 184 | (num_layers, batch_size, 1 * hidden_size) and (num_layers, batch_size, 1 * cell_size) 185 | respectively. The last dimension is NOT duplicated because it contains the state/memory 186 | for ONLY the forward layers. 187 | 188 | elmo_embeddings: ``list[numpy.ndarray]`` 189 | A list of tensors, each representing the ELMo vectors for the input sentence at the same index. 
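        Example (a minimal sketch; the sentences are illustrative and the shapes assume the
        default 2-layer ELMo weights with a 512-dimensional projection):

            >>> embedder = ElmoEmbedderForward(cuda_device=-1)
            >>> embeddings, final_states = embedder([['the', 'cat', 'sat'], ['hello', 'world']])
            >>> embeddings[0].shape    # (num_layers + 1, 3, 512): char-CNN layer plus two LSTM layers
            >>> final_states[0].shape  # (num_layers, batch_size, 512): final hidden states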
190 | """ 191 | batch_size = len(batch) 192 | if initial_state is not None: # TO DO: need to deal with changing batch size 193 | initial_state_shape = list(initial_state[0].size()) 194 | if len(initial_state_shape) == 2: 195 | initial_state = (initial_state[0].expand(batch_size, -1, -1).transpose(0, 1), 196 | initial_state[1].expand(batch_size, -1, -1).transpose(0, 1)) 197 | elif len(initial_state_shape) == 3: 198 | pass 199 | else: 200 | raise ValueError("initial_state only accepts tuple of 2D or 3D input") 201 | 202 | token_embedding, mask = self.batch_to_embeddings(batch, add_bos, add_eos) 203 | lstm_outputs, final_states = self._elmo_lstm_forward(token_embedding, mask, initial_state) 204 | 205 | # Prepare the output. The first layer is duplicated. 206 | # Because of minor differences in how masking is applied depending 207 | # on whether the char cnn layers are cached, we'll be defensive and 208 | # multiply by the mask here. It's not strictly necessary, as the 209 | # mask passed on is correct, but the values in the padded areas 210 | # of the char cnn representations can change. 211 | 212 | output_tensors = [token_embedding * mask.float().unsqueeze(-1)] 213 | for layer_activations in torch.chunk(lstm_outputs, lstm_outputs.size(0), dim=0): 214 | output_tensors.append(layer_activations.squeeze(0)) 215 | 216 | # without_bos_eos is a 3 element list of tuples of (batch_size, num_timesteps, dim) and 217 | # (batch_size, num_timesteps) tensors, each element representing a layer. 218 | without_bos_eos = [remove_sentence_boundaries(layer, mask, add_bos, add_eos) 219 | for layer in output_tensors] 220 | # Split the list of tuples into two tuples, each of length 3 221 | activations_without_bos_eos, mask_without_bos_eos = zip(*without_bos_eos) 222 | 223 | # Convert the activations_without_bos_eos into a single batch first tensor, 224 | # of size (batch_size, num_layers, num_timesteps, dim) 225 | activations = torch.cat([ele.unsqueeze(1) for ele in activations_without_bos_eos], dim=1) 226 | # The mask is the same for each ELMo layer, so just take the first. 227 | mask_without_bos_eos = mask_without_bos_eos[0] 228 | 229 | # organize the Elmo embeddings into a list corresponding to the batch of sentences 230 | elmo_embeddings = [] 231 | for i in range(batch_size): 232 | length = int(mask_without_bos_eos[i, :].sum()) 233 | if length == 0: 234 | raise ConfigurationError('There exists totally masked out sequence in the batch.') 235 | else: 236 | # elmo_embeddings.append(activations[i, :, :length, :].detach().cpu().numpy()) 237 | elmo_embeddings.append(activations[i, :, :length, :].detach()) 238 | 239 | return elmo_embeddings, final_states 240 | 241 | def embed_sentence(self, 242 | sentence: List[str], 243 | add_bos: bool = False, 244 | add_eos: bool = False, 245 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> \ 246 | Tuple[numpy.ndarray, Tuple[torch.Tensor, torch.Tensor]]: 247 | """ 248 | Computes the forward only ELMo embeddings for a single tokenized sentence. 249 | See the comment under the class definition. 250 | Parameters 251 | ---------- 252 | sentence : ``List[str]``, required 253 | A tokenized sentence. 254 | add_bos: ``bool`` 255 | Whether to add begin of sentence token. 256 | add_eos: ``bool`` 257 | Whether to add end of sentence token. 
258 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None) 259 | A tuple (state, memory) representing the initial hidden state and memory 260 | of the LSTM, with shape (num_layers, 1, 1 * hidden_size) and 261 | (num_layers, 1, 1 * cell_size) respectively. 262 | 263 | Or, with shape (num_layers, 1 * hidden_size) and 264 | (num_layers, 1 * cell_size) respectively. 265 | Returns 266 | ------- 267 | A tensor containing the ELMo vectors, and 268 | final states, tuple of size (num_layers, hidden_size) and (num_layers, memory_size). 269 | """ 270 | elmo_embeddings, final_states = self.forward([sentence], add_bos, add_eos, initial_state) 271 | 272 | return elmo_embeddings[0], tuple([ele.squeeze(1) for ele in final_states]) 273 | 274 | def embed_sentences(self, 275 | sentences: Iterable[List[str]], 276 | add_bos: bool = False, 277 | add_eos: bool = False, 278 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, 279 | batch_size: int = DEFAULT_BATCH_SIZE) -> \ 280 | List[Tuple[numpy.ndarray, Tuple[torch.Tensor, torch.Tensor]]]: 281 | """ 282 | Computes the forward only ELMo embeddings for a iterable of sentences. 283 | See the comment under the class definition. 284 | Parameters 285 | ---------- 286 | sentences : ``Iterable[List[str]]``, required 287 | An iterable of tokenized sentences. 288 | add_bos: ``bool`` 289 | Whether to add begin of sentence token. 290 | add_eos: ``bool`` 291 | Whether to add end of sentence token. 292 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None) 293 | A tuple (state, memory) representing the initial hidden state and memory 294 | of the LSTM, with shape (num_layers, batch_size, 1 * hidden_size) and 295 | (num_layers, batch_size, 1 * cell_size) respectively. 296 | 297 | Or, with shape (num_layers, 1 * hidden_size) and 298 | (num_layers, 1 * cell_size) respectively, if all the batch share the same initial_state. 299 | batch_size : ``int``, required 300 | The number of sentences ELMo should process at once. 301 | Returns 302 | ------- 303 | A list of tuple of (numpy.ndarray/torch.Tensor, (torch.Tensor, torch.Tensor)), 304 | each representing the ELMo vectors for the input sentence 305 | at the same index, and the final states after running that sentence, with shape (num_layers, hidden_size) and 306 | (num_layers, cell_size) respectively. 307 | (The return type could also be a generator. Can convert to a list using list().) 
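        Example (a minimal sketch; the input file name is illustrative):

            >>> ee = ElmoEmbedderForward(cuda_device=-1)
            >>> sents = [line.split() for line in open('input.txt')]
            >>> for emb, (h, c) in ee.embed_sentences(sents, batch_size=32):
            ...     pass  # emb: ELMo vectors for one sentence; (h, c): its final LSTM states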
308 | """ 309 | embeddings_and_states = [] 310 | print('Embedding sentences into forward ELMo vectors ---') 311 | # for batch in Tqdm.tqdm(lazy_groups_of(iter(sentences), batch_size)): 312 | for batch in lazy_groups_of(iter(sentences), batch_size): 313 | elmo_embeddings, final_states = self.forward(batch, add_bos, add_eos, initial_state) 314 | # Remember: final_states is a tuple of tensors 315 | final_states_chunked = [] 316 | for i in range(2): 317 | final_states_chunked.append(list(map(lambda x: torch.squeeze(x, dim=1), 318 | final_states[i].chunk(final_states[i].size(1), dim=1)))) 319 | final_states_chunked = list(zip(*final_states_chunked)) 320 | assert len(elmo_embeddings) == len(final_states_chunked), 'length of embeddings and final states mismatch' 321 | # yield from zip(elmo_embeddings, final_states_chunked) 322 | embeddings_and_states += list(zip(elmo_embeddings, final_states_chunked)) 323 | return embeddings_and_states 324 | 325 | def create_cached_cnn_embeddings(self, tokens: List[str]) -> None: 326 | """ 327 | Given a list of tokens, this method precomputes word representations 328 | by running just the character convolutions and highway layers of elmo, 329 | essentially creating uncontextual word vectors. On subsequent forward passes, 330 | the word ids are looked up from an embedding, rather than being computed on 331 | the fly via the CNN encoder. 332 | This function sets 3 attributes: 333 | _word_embedding : ``torch.Tensor`` 334 | The word embedding for each word in the tokens passed to this method. 335 | _bos_embedding : ``torch.Tensor`` 336 | The embedding for the BOS token. 337 | _eos_embedding : ``torch.Tensor`` 338 | The embedding for the EOS token. 339 | Parameters 340 | ---------- 341 | tokens : ``List[str]``, required. 342 | A list of tokens to precompute character convolutions for. 343 | """ 344 | tokens = [ELMoCharacterMapper.bos_token, ELMoCharacterMapper.eos_token] + tokens 345 | timesteps = 32 346 | batch_size = 32 347 | chunked_tokens = lazy_groups_of(iter(tokens), timesteps) 348 | 349 | all_embeddings = [] 350 | device = get_device_of(next(self.parameters())) 351 | for batch in lazy_groups_of(chunked_tokens, batch_size): 352 | # Shape (batch_size, timesteps, 50) 353 | batched_tensor = batch_to_ids(batch) 354 | # NOTE: This device check is for when a user calls this method having 355 | # already placed the model on a device. If this is called in the 356 | # constructor, it will probably happen on the CPU. This isn't too bad, 357 | # because it's only a few convolutions and will likely be very fast. 358 | if device >= 0: 359 | batched_tensor = batched_tensor.cuda(device) 360 | output = self._token_embedder(batched_tensor, add_bos=False, add_eos=False) 361 | token_embedding = output["token_embedding"] 362 | mask = output["mask"] 363 | token_embedding, _ = remove_sentence_boundaries(token_embedding, mask, rmv_bos=False, rmv_eos=False) 364 | all_embeddings.append(token_embedding.view(-1, token_embedding.size(-1))) 365 | full_embedding = torch.cat(all_embeddings, 0) 366 | 367 | # We might have some trailing embeddings from padding in the batch, so 368 | # we clip the embedding and lookup to the right size. 
369 | full_embedding = full_embedding[:len(tokens), :] 370 | embedding = full_embedding[2:len(tokens), :] 371 | vocab_size, embedding_dim = list(embedding.size()) 372 | 373 | from allennlp.modules.token_embedders import Embedding # type: ignore 374 | self._bos_embedding = full_embedding[0, :] 375 | self._eos_embedding = full_embedding[1, :] 376 | self._word_embedding = Embedding(vocab_size, # type: ignore 377 | embedding_dim, 378 | weight=embedding.data, 379 | trainable=self._requires_grad, 380 | padding_index=0) 381 | 382 | 383 | class _ElmoCharacterEncoder2(_ElmoCharacterEncoder): 384 | @overrides 385 | def forward(self, 386 | inputs: torch.Tensor, 387 | add_bos: bool = False, 388 | add_eos: bool = False) -> Dict[str, torch.Tensor]: # pylint: disable=arguments-differ 389 | """ 390 | Compute context insensitive token embeddings for ELMo representations. 391 | Parameters 392 | ---------- 393 | inputs: ``torch.Tensor`` 394 | Shape ``(batch_size, sequence_length, 50)`` of character ids representing the 395 | current batch. 396 | add_bos: ``bool`` 397 | Whether to add begin of sentence symbol 398 | add_eos: ``bool`` 399 | Whether to add end of sentence symbol 400 | Returns 401 | ------- 402 | Dict with keys: 403 | ``'token_embedding'``: ``torch.Tensor`` 404 | Shape ``(batch_size, sequence_length + 0/1/2, embedding_dim)`` tensor with context 405 | insensitive token representations. 406 | ``'mask'``: ``torch.Tensor`` 407 | Shape ``(batch_size, sequence_length + 0/1/2)`` long tensor with sequence mask. 408 | """ 409 | # Add BOS/EOS (this is the only difference from the original _ElmoCharacterEncoder class) 410 | mask = ((inputs > 0).long().sum(dim=-1) > 0).long() 411 | character_ids_with_bos_eos, mask_with_bos_eos = add_sentence_boundaries( 412 | inputs, 413 | mask, 414 | self._beginning_of_sentence_characters, 415 | self._end_of_sentence_characters, 416 | add_bos, 417 | add_eos 418 | ) 419 | 420 | # the character id embedding 421 | max_chars_per_token = self._options['char_cnn']['max_characters_per_token'] 422 | # (batch_size * sequence_length, max_chars_per_token, embed_dim) 423 | character_embedding = torch.nn.functional.embedding( 424 | character_ids_with_bos_eos.view(-1, max_chars_per_token), 425 | self._char_embedding_weights 426 | ) 427 | 428 | # run convolutions 429 | cnn_options = self._options['char_cnn'] 430 | if cnn_options['activation'] == 'tanh': 431 | activation = torch.nn.functional.tanh 432 | elif cnn_options['activation'] == 'relu': 433 | activation = torch.nn.functional.relu 434 | else: 435 | raise ConfigurationError("Unknown activation") 436 | 437 | # (batch_size * sequence_length, embed_dim, max_chars_per_token) 438 | character_embedding = torch.transpose(character_embedding, 1, 2) 439 | convs = [] 440 | for i in range(len(self._convolutions)): 441 | conv = getattr(self, 'char_conv_{}'.format(i)) 442 | convolved = conv(character_embedding) 443 | # (batch_size * sequence_length, n_filters for this width) 444 | convolved, _ = torch.max(convolved, dim=-1) 445 | convolved = activation(convolved) 446 | convs.append(convolved) 447 | 448 | # (batch_size * sequence_length, n_filters) 449 | token_embedding = torch.cat(convs, dim=-1) 450 | 451 | # apply the highway layers (batch_size * sequence_length, n_filters) 452 | token_embedding = self._highways(token_embedding) 453 | 454 | # final projection (batch_size * sequence_length, embedding_dim) 455 | token_embedding = self._projection(token_embedding) 456 | 457 | # reshape to (batch_size, sequence_length, embedding_dim) 458 | 
batch_size, sequence_length, _ = character_ids_with_bos_eos.size() 459 | 460 | return { 461 | 'mask': mask_with_bos_eos, 462 | 'token_embedding': token_embedding.view(batch_size, sequence_length, -1) 463 | } 464 | 465 | 466 | def add_sentence_boundaries(tensor: torch.Tensor, 467 | mask: torch.Tensor, 468 | sentence_begin_token: Any, 469 | sentence_end_token: Any, 470 | add_bos: bool = False, 471 | add_eos: bool = False) -> Tuple[torch.Tensor, torch.Tensor]: 472 | """ 473 | Add begin/end of sentence tokens to the batch of sentences. 474 | Given a batch of sentences with size ``(batch_size, timesteps)`` or 475 | ``(batch_size, timesteps, dim)`` this returns a tensor of shape 476 | ``(batch_size, timesteps + 0/1/2)`` or ``(batch_size, timesteps + 0/1/2, dim)`` respectively. 477 | Returns both the new tensor and updated mask. 478 | Parameters 479 | ---------- 480 | tensor : ``torch.Tensor`` 481 | A tensor of shape ``(batch_size, timesteps)`` or ``(batch_size, timesteps, dim)`` 482 | mask : ``torch.Tensor`` 483 | A tensor of shape ``(batch_size, timesteps)`` (assuming padding id is always 0) 484 | sentence_begin_token: Any (anything that can be broadcast in torch for assignment) 485 | For 2D input, a scalar with the id. For 3D input, a tensor with length dim. 486 | sentence_end_token: Any (anything that can be broadcast in torch for assignment) 487 | For 2D input, a scalar with the id. For 3D input, a tensor with length dim. 488 | add_bos: bool 489 | Whether to add begin of sentence token. 490 | add_eos: bool 491 | Whether to add end of sentence token. 492 | Returns 493 | ------- 494 | tensor_with_boundary_tokens : ``torch.Tensor`` 495 | The tensor with the appended and prepended boundary tokens. If the input was 2D, 496 | it has shape (batch_size, timesteps + 0/1/2) and if the input was 3D, it has shape 497 | (batch_size, timesteps + 0/1/2, dim). 498 | new_mask : ``torch.Tensor`` 499 | The new mask for the tensor, taking into account the appended tokens 500 | marking the beginning and end of the sentence. 
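    Example (illustrative ids only; 1 and 2 stand in for the BOS/EOS ids and 0 is padding):

        >>> t = torch.tensor([[7, 8, 0]])
        >>> m = (t > 0).long()
        >>> add_sentence_boundaries(t, m, 1, 2, add_bos=True, add_eos=True)
        (tensor([[1, 7, 8, 2, 0]]), tensor([[1, 1, 1, 1, 0]]))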
501 | """ 502 | # TODO: matthewp, profile this transfer 503 | sequence_lengths = mask.sum(dim=1).detach().cpu().numpy() 504 | tensor_shape = list(tensor.data.shape) 505 | new_shape = list(tensor_shape) 506 | if add_bos: 507 | new_shape[1] = new_shape[1] + 1 508 | if add_eos: 509 | new_shape[1] = new_shape[1] + 1 510 | tensor_with_boundary_tokens = tensor.new_zeros(*new_shape) 511 | if len(tensor_shape) == 2: 512 | if add_bos: 513 | tensor_with_boundary_tokens[:, 1:(1 + tensor_shape[1])] = tensor 514 | tensor_with_boundary_tokens[:, 0] = sentence_begin_token 515 | else: 516 | tensor_with_boundary_tokens[:, 0:tensor_shape[1]] = tensor 517 | if add_eos: 518 | for i, j in enumerate(sequence_lengths): 519 | tensor_with_boundary_tokens[i, j + 1 if add_bos else j] = sentence_end_token 520 | new_mask = (tensor_with_boundary_tokens != 0).long() 521 | elif len(tensor_shape) == 3: 522 | if add_bos: 523 | tensor_with_boundary_tokens[:, 1:(1 + tensor_shape[1]), :] = tensor 524 | else: 525 | tensor_with_boundary_tokens[:, 0:tensor_shape[1], :] = tensor 526 | for i, j in enumerate(sequence_lengths): 527 | if add_bos: 528 | tensor_with_boundary_tokens[i, 0, :] = sentence_begin_token 529 | if add_eos: 530 | tensor_with_boundary_tokens[i, j + 1 if add_bos else j, :] = sentence_end_token 531 | new_mask = ((tensor_with_boundary_tokens > 0).long().sum(dim=-1) > 0).long() 532 | else: 533 | raise ValueError("add_sentence_boundary_token_ids only accepts 2D and 3D input") 534 | 535 | return tensor_with_boundary_tokens, new_mask 536 | 537 | def remove_sentence_boundaries(tensor: torch.Tensor, 538 | mask: torch.Tensor, 539 | rmv_bos: bool = False, 540 | rmv_eos: bool = False) -> Tuple[torch.Tensor, torch.Tensor]: 541 | """ 542 | Remove begin/end of sentence embeddings from the batch of sentences. 543 | Given a batch of sentences with size ``(batch_size, timesteps)`` or 544 | ``(batch_size, timesteps, dim)`` this returns a tensor of shape ``(batch_size, timesteps - 0/1/2)`` or 545 | ``(batch_size, timesteps - 0/1/2, dim)`` after removing 546 | the beginning and end sentence markers. The sentences are assumed to be padded on the right, 547 | with the beginning of each sentence assumed to occur at index 0 (i.e., ``mask[:, 0]`` is assumed 548 | to be 1). 549 | Returns both the new tensor and updated mask. 550 | This function is the inverse of ``add_sentence_boundaries``. 551 | Parameters 552 | ---------- 553 | tensor : ``torch.Tensor`` 554 | A tensor of shape ``(batch_size, timesteps)`` or ``(batch_size, timesteps, dim)`` 555 | mask : ``torch.Tensor`` 556 | A tensor of shape ``(batch_size, timesteps)`` 557 | rmv_bos: bool 558 | Whether to remove begin of sentence token 559 | rmv_eos: bool 560 | Whether to remove end of sentence token 561 | Returns 562 | ------- 563 | tensor_without_boundary_tokens : ``torch.Tensor`` 564 | The tensor after removing the boundary tokens of shape ``(batch_size, timesteps - 0/1/2)`` 565 | or ``(batch_size, timesteps - 0/1/2, dim)`` 566 | new_mask : ``torch.Tensor`` 567 | The new mask for the tensor of shape ``(batch_size, timesteps - 0/1/2)``. 
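    Example (the inverse of the ``add_sentence_boundaries`` example above; ids are illustrative):

        >>> t = torch.tensor([[1, 7, 8, 2, 0]])
        >>> m = torch.tensor([[1, 1, 1, 1, 0]])
        >>> remove_sentence_boundaries(t, m, rmv_bos=True, rmv_eos=True)
        (tensor([[7, 8, 0]]), tensor([[1, 1, 0]]))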
568 | """ 569 | # TODO: matthewp, profile this transfer 570 | if not rmv_bos and not rmv_eos: 571 | return tensor, mask 572 | 573 | sequence_lengths = mask.sum(dim=1).detach().cpu().numpy() 574 | tensor_shape = list(tensor.data.shape) 575 | new_shape = list(tensor_shape) 576 | if rmv_bos: 577 | new_shape[1] = new_shape[1] - 1 578 | if rmv_eos: 579 | new_shape[1] = new_shape[1] - 1 580 | tensor_without_boundary_tokens = tensor.new_zeros(*new_shape) 581 | new_mask = tensor.new_zeros((new_shape[0], new_shape[1]), dtype=torch.long) 582 | for i, j in enumerate(sequence_lengths): 583 | if rmv_bos and rmv_eos and j > 2: 584 | if len(tensor_shape) == 3: 585 | tensor_without_boundary_tokens[i, :(j - 2), :] = tensor[i, 1:(j - 1), :] 586 | elif len(tensor_shape) == 2: 587 | tensor_without_boundary_tokens[i, :(j - 2)] = tensor[i, 1:(j - 1)] 588 | else: 589 | raise ValueError("remove_sentence_boundaries only accepts 2D and 3D input") 590 | new_mask[i, :(j - 2)] = 1 591 | if rmv_bos and not rmv_eos and j > 1: 592 | if len(tensor_shape) == 3: 593 | tensor_without_boundary_tokens[i, :(j - 1), :] = tensor[i, 1:j, :] 594 | elif len(tensor_shape) == 2: 595 | tensor_without_boundary_tokens[i, :(j - 1)] = tensor[i, 1:j] 596 | else: 597 | raise ValueError("remove_sentence_boundaries only accepts 2D and 3D input") 598 | new_mask[i, :(j - 1)] = 1 599 | if not rmv_bos and rmv_eos and j > 1: 600 | if len(tensor_shape) == 3: 601 | tensor_without_boundary_tokens[i, :(j - 1), :] = tensor[i, :(j - 1), :] 602 | elif len(tensor_shape) == 2: 603 | tensor_without_boundary_tokens[i, :(j - 1)] = tensor[i, :(j - 1)] 604 | else: 605 | raise ValueError("remove_sentence_boundaries only accepts 2D and 3D input") 606 | new_mask[i, :(j - 1)] = 1 607 | 608 | return tensor_without_boundary_tokens, new_mask 609 | 610 | -------------------------------------------------------------------------------- /uss/gpt2_sequential_embedder.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Sequentially embed word tokens into GPT-2 (last-layer) hidden state vectors, with model internal states from the past saved. 3 | Note that GPT-2 uses BPE encodings for its vocabulary, so each word type will have multiple BPE units of variable length. 4 | 5 | Based on the library pytorch_pretrained_bert. 
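Usage sketch (the tokens are illustrative; the 768-dimensional hidden size corresponds to the
small GPT-2 model loaded below):

    ge = GPT2Embedder(cuda_device=-1)
    vecs, past = ge.embed_sentence(['police', 'arrested', 'the', 'suspect'], add_bos=True)   # vecs: (4, 768)
    cand_vecs, cand_states = ge.embed_words(['police', 'officers'], initial_state=past)      # cand_vecs: (2, 768)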
6 | ''' 7 | 8 | import torch 9 | import torch.nn as nn 10 | from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model 11 | 12 | import logging 13 | logging.basicConfig(level=logging.INFO) 14 | 15 | 16 | class GPT2Embedder(nn.Module): 17 | def __init__(self, cuda_device=-1): 18 | super(GPT2Embedder, self).__init__() 19 | 20 | self.cuda_device = 'cpu' if cuda_device == -1 else f'cuda:{cuda_device}' 21 | 22 | # Load pre-trained model tokenizer (vocabulary) 23 | self.enc = GPT2Tokenizer.from_pretrained('gpt2') 24 | # Load pre-trained model (weights) 25 | self.model = GPT2Model.from_pretrained('gpt2') 26 | 27 | self.model.to(self.cuda_device) 28 | self.model.eval() # we only use the evaluation mode of the pretrained model 29 | 30 | self._bos_id = self.enc.encoder['<|endoftext|>'] 31 | self._bos_past = None 32 | 33 | @property 34 | def bos_past(self): 35 | if self._bos_past is not None: 36 | return self._bos_past 37 | else: 38 | with torch.no_grad(): 39 | _, self._bos_past = self.model(torch.tensor([[self._bos_id]], device=self.cuda_device), past=None) 40 | return self._bos_past 41 | 42 | def embed_sentence(self, sentence, add_bos=False, add_eos=False, bpe2word='last', initial_state=None): 43 | ''' 44 | Compute the GPT-2 embeddings for a single tokenized sentence. 45 | 46 | Input: 47 | sentence (List[str]): tokenized sentence 48 | add_bos (bool): whether to add begin of sentence token '<|endoftext|>' 49 | add_eos (bool): whetehr to add end of sentenc token '<|endoftext|>' (currently not used) 50 | bpe2word (str): how to turn the BPE vectors into word vectors; 51 | 'last': last hidden state; 'avg': average hidden state. 52 | initial_state (List[torch.Tensor]): GPT-2 internal states for the past 53 | 54 | Output: 55 | embeddings (torch.Tensor): GPT-2 vectors for the sentence, size (len(sentence), 768) 56 | states (List[torch.Tensor]): GPT-2 internal states for the past, a list of length 12 (for 12 layers) 57 | ''' 58 | assert isinstance(sentence, list), 'input "sentence" should be a list of word types.' 59 | assert bpe2word in ['last', 'avg'] 60 | 61 | if add_bos: 62 | # initial_state is not used when 'add_bos' is True 63 | past = self.bos_past 64 | else: 65 | past = initial_state 66 | 67 | if past is None: 68 | bos_sp = '' # begin of sentence: whether there is a space or not 69 | else: 70 | bos_sp = ' ' 71 | 72 | for i, w in enumerate(sentence): 73 | if i == 0: 74 | bpe_units = torch.tensor([self.enc.encode(bos_sp + w)], device=self.cuda_device) 75 | with torch.no_grad(): 76 | vec, past = self.model(bpe_units, past=past) 77 | else: 78 | bpe_units = torch.tensor([self.enc.encode(' ' + w)], device=self.cuda_device) 79 | with torch.no_grad(): 80 | vec, past = self.model(bpe_units, past=past) 81 | 82 | if bpe2word == 'last': 83 | vec = vec[:, -1, :] 84 | elif bpe2word == 'avg': 85 | vec = vec.mean(dim=1) 86 | else: 87 | raise ValueError 88 | 89 | embeddings = vec if i == 0 else torch.cat([embeddings, vec], dim=0) 90 | 91 | return embeddings, past 92 | 93 | def embed_words(self, words, add_bos=False, add_eos=False, bpe2word='last', initial_state=None): 94 | ''' 95 | Compute the GPT-2 embeddings for a list of words. 96 | The challenge is that these words might have BPE encodings of different lengths, so we need to pad for a batch and then 97 | correctly index out the embeddings and internal states at right positions. 
98 | 99 | Input: 100 | words (List[str]): a list of words 101 | add_bos (bool): whether to add begin of sentence token '<|endoftext|>' 102 | add_eos (bool): whetehr to add end of sentenc token '<|endoftext|>' (currently not used) 103 | bpe2word (str): how to turn the BPE vectors into word vectors; 104 | 'last': last hidden state; 'avg': average hidden state. 105 | initial_state (List[torch.Tensor]): GPT-2 internal states for the past 106 | 107 | Output: 108 | embeddings (torch.Tensor): GPT-2 vectors for the words, size (len(words), 768) 109 | states (List[List[torch.Tensor]]): GPT-2 internal states for the past, a list of length len(words) 110 | ''' 111 | assert isinstance(words, list), 'input "words" should be a list of candidate word types for the next step.' 112 | assert bpe2word in ['last', 'avg'] 113 | 114 | if add_bos: 115 | # initial_state is not used when 'add_bos' is True 116 | past = self.bos_past 117 | else: 118 | past = initial_state 119 | 120 | if past is None: 121 | bos_sp = '' # begin of sentence: whether there is a space or not 122 | else: 123 | bos_sp = ' ' 124 | 125 | n = len(words) 126 | bpe_list = [self.enc.encode(bos_sp + w) for w in words] 127 | bpe_lens = [len(b) for b in bpe_list] 128 | 129 | ## padding to for a batch 130 | padding = 0 131 | max_seqlen = max(bpe_lens) 132 | bpe_list = [b + [padding] * (max_seqlen - l) for b, l in zip(bpe_list, bpe_lens)] 133 | bpe_padded = torch.tensor(bpe_list, device=self.cuda_device) # size (n, max_seqlen) 134 | if past is not None: 135 | past_seqlen = past[0].size(3) 136 | past = [p.expand(-1, n, -1, -1, -1) for p in past] # same past internal states for every word in the batch 137 | else: 138 | past_seqlen = 0 139 | 140 | ## run GPT-2 model 141 | with torch.no_grad(): 142 | hid, mid = self.model(bpe_padded, past=past) 143 | 144 | ## extract the hidden states of words through indexing 145 | if bpe2word == 'last': 146 | # method 1: torch.gather 147 | index = torch.tensor(bpe_lens, device=self.cuda_device).reshape(n, 1, 1).expand(-1, -1, hid.size(2)) - 1 148 | embeddings = torch.gather(hid, 1, index).squeeze(1) 149 | # method 2: for loop 150 | # embeddings = hid.new_zeros(n, hid.size(2)) 151 | # for i in range(n): 152 | # embeddings[i] = hid[i, bpe_lens[i] - 1] 153 | elif bpe2word == 'avg': 154 | a = torch.arange(max_seqlen, device=self.cuda_device).view(1, -1).expand(n, -1) 155 | b = torch.tensor(bpe_lens, device=self.cuda_device).view(-1, 1) 156 | mask = a >= b # size (n, max_seqlen) 157 | hid[mask] = 0 # mask out the padded position embeddings 158 | embeddings = hid.sum(dim=1) / b.float() 159 | else: 160 | raise ValueError 161 | 162 | ## index out the internal states 163 | states = torch.cat(mid, dim=0) # size (2 * 12, n, 12, past_seqlen + max_seqlen, 64) 164 | states = torch.split(states, 1, dim=1) # list of length n 165 | states = [torch.chunk(s.index_select(3, torch.arange(past_seqlen + l, device=self.cuda_device)), 12, dim=0) 166 | for s, l in zip(states, bpe_lens)] 167 | 168 | return embeddings, states 169 | 170 | -------------------------------------------------------------------------------- /uss/lm_subvocab.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | def prob_next(LMModel, vocab, text, hn=None, subvocab=None, clustermask=None, onscore=False, renorm=False): 5 | """ 6 | Output the probability distribution for the next word based on a pretrained LM, given the previous text. 
7 | If 'subvocab' is not None, the distribution is restricted on the specified sub-vocabulary. 8 | 9 | Input: 10 | LMModel: pretrained RNN LM model. 11 | vocab: full vocabulary. 'torchtext.vocab.Vocab'. 12 | text: previous words in the sentence. 13 | hn: initial hidden states to the LM. 14 | subvocab: sub-vocabulary. 'torch.LongTensor'. 15 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). 16 | onscore: whether to cluster on the raw scores before softmax layer, rather than cluster on the probabilities. 17 | renorm: whether to renormalize the probabilities over the sub-vocabulary. This parameter only works if 'onscore' is False. 18 | 19 | Output: 20 | (if subvocab is not None) subprobs: probability distribution over the sub-vocabulary. 21 | probs: probability distribution over the full vocabulary. 22 | hn: hidden states. 23 | """ 24 | 25 | if clustermask is not None: 26 | assert subvocab is not None, 'clustermask provided but No subvocab provided.' 27 | 28 | if isinstance(text, str): 29 | text = text.split() 30 | 31 | textid = next(LMModel.parameters()).new_tensor([vocab.stoi[w] for w in text], 32 | dtype=torch.long) 33 | with torch.no_grad(): 34 | LMModel.eval() 35 | batch_text = textid.unsqueeze(1) 36 | embed = LMModel.embedding(batch_text) 37 | output, hn = LMModel.lstm(embed, hn) 38 | output = LMModel.proj(output) # size: (seq_len, batch_size=1, vocab_size) 39 | 40 | probs = torch.nn.functional.softmax(output[-1].squeeze(), dim=0) 41 | 42 | if subvocab is None: 43 | # if no subvocab is provided, return the full probability distribution and hidden states 44 | return probs, hn 45 | 46 | ## cluster on the raw scores (rather than the probabilities) before passing to the softmax layer 47 | if onscore: 48 | scores = output[-1].squeeze() 49 | subscores = scores[subvocab] 50 | if clustermask is None: 51 | subprobs = torch.nn.functional.softmax(subscores, dim=0) 52 | return subprobs, probs, hn 53 | for i in range(len(subvocab)): 54 | subscores[i] = scores[clustermask[i]].sum() 55 | subprobs = torch.nn.functional.softmax(subscores, dim=0) 56 | return subprobs, probs, hn 57 | 58 | ## cluster on the probabilities 59 | subprobs = probs[subvocab] 60 | if clustermask is None: 61 | if renorm: 62 | subprobs = subprobs / subprobs.sum() 63 | # subprobs = torch.nn.functional.softmax(subprobs, dim=0) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller 64 | return subprobs, probs, hn 65 | 66 | for i in range(len(subvocab)): 67 | subprobs[i] = probs[clustermask[i]].sum() 68 | if renorm: 69 | subprobs = subprobs / subprobs.sum() 70 | # subprobs = torch.nn.functional.softmax(subprobs, dim=0) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller 71 | return subprobs, probs, hn 72 | 73 | 74 | def prob_next_1step(LMModel, batch_text, hn=None, subvocab=None, clustermask=None, onscore=False, renorm=False, temperature=1): 75 | """ 76 | Output the probability distribution for the next word based on a pretrained LM, carried in only one step of the forward pass. 77 | If 'subvocab' is not None, the distribution is restricted on the specified sub-vocabulary. 78 | This function is specifically used in the beam search. 79 | 80 | Input: 81 | LMModel: pretrained RNN LM model. 82 | batch_text: text id input to the language model, of size (seq_len=1, batch_size=onbeam_size). 83 | hn: hidden states to the LM, a tuple and each of size (num_layers * num_directions, batch_size=onbeam_size, hidden_size). 
84 | subvocab: sub-vocabulary. 'torch.LongTensor'. 85 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). 86 | onscore: whether to cluster on the raw scores before softmax layer, rather than cluster on the probabilities. 87 | renorm: whether to renormalize the probabilities over the sub-vocabulary. This parameter only works if 'onscore' is False. 88 | 89 | Output: 90 | subprobs: probability distribution over the sub-vocabulary. Size: (batch_size=onbeam_size, subvocab_size) 91 | probs: probability distribution over the full vocabulary. Size: (batch_size=onbeam_size, vocab_size) 92 | hn: hidden states. Tuple, each of size (num_layers * num_directions, batch_size=onbeam_size, hidden_size). 93 | """ 94 | 95 | if clustermask is not None: 96 | assert subvocab is not None, 'clustermask provided but No subvocab provided.' 97 | 98 | with torch.no_grad(): 99 | LMModel.eval() 100 | embed = LMModel.embedding(batch_text) 101 | output, hn = LMModel.lstm(embed, hn) 102 | output = LMModel.proj(output) # size: (seq_len=1, batch_size=onbeam_size, vocab_size) 103 | 104 | output = output / temperature 105 | probs = torch.nn.functional.softmax(output.squeeze(0), dim=1) # size: (batch_size=onbeam_size, vocab_size) 106 | 107 | if subvocab is None: 108 | # if no subvocab is provided, return the full probability distribution and hidden states 109 | return probs, probs, hn 110 | 111 | # ## cluster on the raw scores (rather than the probabilities) before passing to the softmax layer 112 | # if onscore: 113 | 114 | # return 115 | 116 | ## cluster on the probabilities 117 | subprobs = probs[:, subvocab] # size: (batch_size=onbeam_size, subvocab_size) 118 | if clustermask is None: 119 | if renorm: 120 | subprobs = subprobs / torch.sum(subprobs, dim=1, keepdim=True) 121 | # subprobs = torch.nn.functional.softmax(subprobs, dim=1) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller 122 | return subprobs, probs, hn 123 | 124 | for i in range(len(subvocab)): 125 | subprobs[:, i] = probs[:, clustermask[i]].sum(dim=1) 126 | 127 | if renorm: 128 | subprobs = subprobs / torch.sum(subprobs, dim=1, keepdim=True) 129 | # subprobs = torch.nn.functional.softmax(subprobs, dim=1) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller 130 | return subprobs, probs, hn 131 | 132 | 133 | def prob_sent(LMModel, vocab, text, hn=None, subvocab=None, clustermask=None, onscore=False, renorm=False, size_average=False): 134 | """ 135 | Output the log-likelihood of a sentence based on a pretrained LM. 136 | If 'subvocab' is not None, the distribution is restricted on the specified sub-vocabulary. 137 | 138 | Input: 139 | LMModel: pretrained RNN LM model. 140 | vocab: full vocabulary. 'torchtext.vocab.Vocab'. 141 | text: previous words in the sentence. 142 | hn: initial hidden states to the LM. 143 | subvocab: sub-vocabulary. 'torch.LongTensor'. 144 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). 145 | onscore: whether to cluster on the raw scores before softmax layer, rather than cluster on the probabilities. 146 | renorm: whether to renormalize the probabilities over the sub-vocabulary. This parameter only works if 'onscore' is False. 147 | size_average: whether to average the log-likelihood according to the sequence length. 148 | 149 | Output: 150 | ll: log-likelihood of the given sentence evaluated by the pretrained LM. 151 | hn: hidden states. 
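    Example (a sketch; 'lm' and 'vocab' stand for a pre-trained summary-style LM and its matching vocabulary loaded elsewhere):

        >>> ll, hn = prob_sent(lm, vocab, 'police arrest suspect in shooting')
        >>> ll    # summed log-probability of the sentence under the LM (a negative float)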
152 | """ 153 | 154 | if clustermask is not None: 155 | assert subvocab is not None, 'clustermask provided but No subvocab provided.' 156 | 157 | if isinstance(text, str): 158 | text = text.split() 159 | 160 | ## no subvocab is provided, operating on the full vocabulary 161 | textid = next(LMModel.parameters()).new_tensor([vocab.stoi[w] for w in text], 162 | dtype=torch.long) 163 | if subvocab is None: 164 | with torch.no_grad(): 165 | LMModel.eval() 166 | batch_text = textid.unsqueeze(1) 167 | embed = LMModel.embedding(batch_text) 168 | output, hn = LMModel.lstm(embed, hn) 169 | output = LMModel.proj(output) # size: (seq_len, batch_size=1, vocab_size) 170 | ll = torch.nn.functional.cross_entropy(output.squeeze()[:-1, :], textid[1:], size_average=size_average, ignore_index=LMModel.padid) 171 | ll = -ll.item() 172 | return ll, hn 173 | 174 | ## subvocab is provided 175 | textid_sub = next(LMModel.parameters()).new_tensor([subvocab.numpy().tolist().index(vocab.stoi[w]) for w in text], 176 | dtype=torch.long) 177 | subprobs_sent = torch.zeros(len(text) - 1, len(subvocab), device=next(LMModel.parameters()).device) 178 | for i in range(len(text) - 1): 179 | subprobs, probs, hn = prob_next(LMModel, vocab, text[i], hn, subvocab, clustermask, onscore, renorm) 180 | subprobs_sent[i] = subprobs 181 | ll = torch.nn.functional.nll_loss(torch.log(subprobs_sent), textid_sub[1:], size_average=size_average, ignore_index=LMModel.padid) 182 | ll = -ll.item() 183 | return ll, hn 184 | 185 | 186 | def clmk_nn(embedmatrix, subvocab, normalized=True): 187 | """ 188 | Generate 'clustermask', based on nearest neighbors, i.e. each word outside of the sub-vocabulary is assigned 189 | to the group of its closest one in the sub-vocabulary. 190 | 191 | Input: 192 | embedmatrix: word embedding matrix. Default should be the output embedding from the RNN language model. 193 | subvocab: sub-vocabulary. 'torch.LongTensor'. 194 | normalized: whether to use the normalized dot product as the distance measure, i.e. cosine similarity. 195 | 196 | Output: 197 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). 198 | """ 199 | 200 | submatrix = embedmatrix[subvocab] 201 | sim_table = torch.mm(submatrix, embedmatrix.t()) 202 | if normalized: 203 | sim_table = sim_table / torch.ger(submatrix.norm(2, 1), embedmatrix.norm(2, 1)) 204 | maxsim, maxsim_ind = torch.max(sim_table, dim=0) 205 | 206 | groups = [] 207 | vocab_ind = torch.arange(len(embedmatrix), device=embedmatrix.device) 208 | clustermask = torch.zeros_like(sim_table, dtype=torch.uint8, device='cpu') 209 | for i in range(len(subvocab)): 210 | groups.append(vocab_ind[maxsim_ind == i].long()) 211 | clustermask[i][groups[i]] = 1 212 | 213 | return clustermask 214 | 215 | 216 | def clmk_cn(embedmatrix, subvocab, simthre=0.6, normalized=True): 217 | """ 218 | Generate 'clustermask', based on the cone method, i.e. each word in the sub-vocabulary is joined by the closest words in a cone, 219 | specified by a cosine similarity threshold. 220 | 221 | Input: 222 | embedmatrix: word embedding matrix. Default should be the output embedding from the RNN language model. 223 | subvocab: sub-vocabulary. 'torch.LongTensor'. 224 | simthre: cosine similarity threshold. 225 | normalized: whether to use the normalized dot product as the distance measure, i.e. cosine similarity. 226 | 227 | Output: 228 | clustermask: a binary mask for each of the sub-vocabulary word. 
'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). 229 | """ 230 | 231 | submatrix = embedmatrix[subvocab] 232 | sim_table = torch.mm(submatrix, embedmatrix.t()) 233 | if normalized: 234 | sim_table = sim_table / torch.ger(submatrix.norm(2, 1), embedmatrix.norm(2, 1)) 235 | 236 | clustermask = (sim_table > simthre).to('cpu') 237 | ## remove the indices that are already in the sub-vocabulary 238 | subvocabmask = torch.zeros_like(clustermask, dtype=torch.uint8) 239 | subvocabmask[:, subvocab] = 1 240 | clustermask = (clustermask ^ subvocabmask) & clustermask # set difference 241 | for i in range(len(subvocab)): 242 | clustermask[i][subvocab[i]] = 1 # add back the current word in the sub-vocabulary 243 | 244 | return clustermask 245 | -------------------------------------------------------------------------------- /uss/pre_closetables.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import pickle 3 | import math 4 | from tqdm import tqdm 5 | 6 | import sys 7 | 8 | from elmo_sequential_embedder import ElmoEmbedderForward 9 | from sim_embed_score import pickElmoForwardLayer 10 | 11 | 12 | def ELMoBotEmbedding(itos, device=-1): 13 | """ 14 | itos: List[str]. A list of words consisting of the vocabulary. 15 | device: int. -1 for cpu. 16 | """ 17 | ee = ElmoEmbedderForward(cuda_device=device) 18 | vocab_vecs, _ = zip(*ee.embed_sentences([[w] for w in itos], add_bos=True, batch_size=1024)) 19 | vocab_vecs = [pickElmoForwardLayer(vec, 'bot') for vec in vocab_vecs] 20 | embedmatrix = torch.cat(vocab_vecs, dim=0) # size: (vocab_size, embed_size) 21 | 22 | return embedmatrix 23 | 24 | 25 | def findclosewords_vocab(vocab, embedmatrix, numwords=500, normalized=True, device='cpu'): 26 | """ 27 | Find closest words for every word in the vocabulary. 
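    Both outputs have size (len(vocab), numwords): row i of 'indices' holds the vocabulary ids of the words closest to word i (by cosine similarity when 'normalized' is True), and row i of 'values' holds the corresponding scores. A usage sketch ('vocab' and 'embedmatrix' are assumed to be loaded elsewhere, e.g. from the pre-trained LM):

        >>> values, indices = findclosewords_vocab(vocab, embedmatrix, numwords=500)
        >>> [vocab.itos[j] for j in indices[vocab.stoi['police']][:5]]    # the 5 closest words to 'police'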
28 | """ 29 | v = len(vocab) 30 | assert v == len(embedmatrix) 31 | 32 | embedmatrix = embedmatrix.to(device) 33 | 34 | chunk_size = 1000 # to solve the problem of out of memory 35 | 36 | if v > chunk_size: 37 | n = math.ceil(v / chunk_size) 38 | else: 39 | n = 1 40 | values = None 41 | indices = None 42 | start = 0 43 | for i in tqdm(range(n)): 44 | embedmatrix_chunk = embedmatrix[start:(start + chunk_size), :] 45 | start = start + chunk_size 46 | 47 | sim_table = torch.mm(embedmatrix_chunk, embedmatrix.t()) 48 | if normalized: 49 | sim_table = sim_table / torch.ger(embedmatrix_chunk.norm(2, 1), embedmatrix.norm(2, 1)) 50 | 51 | values_chunk, indices_chunk = sim_table.topk(numwords, dim=1) 52 | values = values_chunk if values is None else torch.cat([values, values_chunk], dim=0) 53 | indices = indices_chunk if indices is None else torch.cat([indices, indices_chunk], dim=0) 54 | 55 | return values.to('cpu'), indices.to('cpu') # values and indices have size (vocab_len, numwords) 56 | 57 | 58 | if __name__ == '__main__': 59 | 60 | vocab_path = '../4.0_cluster/vocabTle.pkl' # vocabulary for the pretrained language model 61 | closewordsim_path = '../4.0_cluster/vocabTleCloseWordSims.pkl' 62 | closewordind_path = '../4.0_cluster/vocabTleCloseWordIndices.pkl' # character level word embeddings 63 | closewordsim_outembed_path = 'vocabTleCloseWordSims_outembed_MoS.pkl' 64 | closewordind_outembed_path = 'vocabTleCloseWordIndices_outembed_MoS.pkl' 65 | modelclass_path = '../LSTM_MoS' 66 | model_path = '../LSTM_MoS/models/LMModelMoSTle2.pth' 67 | 68 | # vocab_path = '../LSTM_LUC/vocabTle50k.pkl' # vocabulary for the pretrained language model 69 | # closewordsim_path = 'vocabTle50kCloseWordSims.pkl' 70 | # closewordind_path = 'vocabTle50kCloseWordIndices.pkl' # character level word embeddings 71 | # closewordsim_outembed_path = 'vocabTle50kCloseWordSims_outembed_wtI.pkl' 72 | # closewordind_outembed_path = 'vocabTle50kCloseWordIndices_outembed_wtI.pkl' 73 | # modelclass_path = '../LSTM_LUC' 74 | # model_path = '../LSTM_LUC/models/TleLUC_wtI_0-0.0001-1Penalty.pth' 75 | 76 | # vocabulary 77 | vocab = pickle.load(open(vocab_path, 'rb')) 78 | 79 | # # character embeddings of the vocabulary 80 | # embedmatrix_cnn = ELMoBotEmbedding(vocab.itos, device=0) 81 | # values_cnn, indices_cnn = findclosewords_vocab(vocab, embedmatrix_cnn, numwords=500) 82 | 83 | # # save results 84 | # pickle.dump(values_cnn, open(closewordsim_path, 'wb')) 85 | # pickle.dump(indices_cnn, open(closewordind_path, 'wb')) 86 | 87 | # output embeddings of the vocabulary 88 | modelclass_path = modelclass_path 89 | if modelclass_path not in sys.path: 90 | sys.path.insert(1, modelclass_path) # this is for torch.load to load the entire model; the model class file must be included in the search path 91 | LMModel = torch.load(model_path, map_location=torch.device('cpu')) 92 | embedmatrix = LMModel.proj_vocab.weight 93 | values, indices = findclosewords_vocab(vocab, embedmatrix, numwords=500) 94 | 95 | # save results 96 | pickle.dump(values, open(closewordsim_outembed_path, 'wb')) 97 | pickle.dump(indices, open(closewordind_outembed_path, 'wb')) 98 | 99 | -------------------------------------------------------------------------------- /uss/pre_word_list.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | def findwordlist(template, closewordind, vocab, numwords=10, addeos=False): 5 | """ 6 | Based on a template sentence, find the candidate word list. 
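    A usage sketch ('closewordind' and 'vocab' are the precomputed nearest-neighbor table and LM vocabulary loaded elsewhere); the union of the closest words of every source token forms the candidate sub-vocabulary used in beam search:

        >>> word_list, subvocab = findwordlist('police arrest suspect', closewordind, vocab, numwords=10)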
7 | 8 | Input: 9 | template: source sentence. 10 | closewordind: precalculated 100 closest word indices (using character embeddings). torch.LongTensor. 11 | vocab: full vocabulary. 12 | numwords: number of closest words per word in the template. 13 | addeos: whether to include '' in the candidate word list. 14 | """ 15 | if isinstance(template, str): 16 | template = template.split() 17 | templateind = closewordind.new_tensor([vocab.stoi[w] for w in template]) 18 | # subvocab = closewordind[templateind, :numwords].flatten().cpu() # torch.flatten() only exists from PyTorch 0.4.1 19 | subvocab = closewordind[templateind, :numwords].view(-1).cpu() 20 | if addeos: 21 | subvocab = torch.cat([subvocab, torch.LongTensor([vocab.stoi['']])]) 22 | subvocab = subvocab.unique(sorted=True) 23 | word_list = [vocab.itos[i] for i in subvocab] 24 | 25 | return word_list, subvocab 26 | 27 | 28 | def findwordlist_screened(template, closewordind, closewordind_outembed, vocab, numwords=10, addeos=False): 29 | """ 30 | Based on a template sentence, find the candidate word list, according to the character level RNN embeddings but 31 | screened by the output embeddings. 32 | 33 | Input: 34 | template: source sentence. 35 | closewordind: precalculated 100 closest word indices (using character embeddings). torch.LongTensor. 36 | closewordind_embed: same as 'closewordind', but using output embeddings. 37 | vocab: full vocabulary. 38 | numwords: number of closest words per word in the template. 39 | addeos: whether to include '' in the candidate word list. 40 | """ 41 | if isinstance(template, str): 42 | template = template.split() 43 | templateind = closewordind.new_tensor([vocab.stoi[w] for w in template]) 44 | 45 | subvocab = closewordind[templateind, :numwords].view(-1).cpu() 46 | subvocab_embed = closewordind_outembed[templateind, 1:numwords].view(-1).cpu() 47 | subvocab_intemplate = closewordind[templateind, 0].view(-1).cpu() 48 | 49 | subvocab_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device) 50 | subvocab_mask[subvocab] = 1 51 | subvocab_embed_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device) 52 | subvocab_embed_mask[subvocab_embed] = 1 53 | 54 | subvocab_screened_mask = (subvocab_mask ^ subvocab_embed_mask) & subvocab_mask 55 | subvocab_screened_mask[subvocab_intemplate] = 1 # add back the words in the template sentence 56 | if addeos: 57 | subvocab_screened_mask[vocab.stoi['']] = 1 58 | 59 | subvocab_screened = torch.arange(len(vocab), dtype=torch.long, device=subvocab.device) 60 | subvocab_screened = subvocab_screened[subvocab_screened_mask] 61 | 62 | word_list = [vocab.itos[i] for i in subvocab_screened] 63 | 64 | return word_list, subvocab_screened 65 | 66 | 67 | def findwordlist_screened2(template, closewordind, closewordind_outembed, vocab, numwords=10, 68 | numwords_outembed=None, numwords_freq=500, addeos=False): 69 | """ 70 | Based on a template sentence, find the candidate word list, according to the character level RNN embeddings but 71 | screened by the output embeddings, and keep the words that are in the top 'numwords_freq' list in the vocabulary. 72 | 73 | Input: 74 | template: source sentence. 75 | closewordind: precalculated 100 closest word indices (using character embeddings). torch.LongTensor. 76 | closewordind_embed: same as 'closewordind', but using output embeddings. 77 | vocab: full vocabulary. 78 | numwords: number of closest words per word in the template. 
79 | numwords_outembed: number of closest words per word in the output embedding to be screened out. 80 | numwords_freq: number of the most frequent words in the vocabulary to remain. 81 | addeos: whether to include '' in the candidate word list. 82 | """ 83 | if numwords_outembed is None: 84 | numwords_outembed = numwords 85 | 86 | if numwords_outembed <= 1: 87 | return findwordlist(template, closewordind, vocab, numwords=numwords, addeos=addeos) 88 | 89 | if isinstance(template, str): 90 | template = template.split() 91 | templateind = closewordind.new_tensor([vocab.stoi[w] for w in template]) 92 | 93 | subvocab = closewordind[templateind, :numwords].view(-1).cpu() 94 | subvocab_embed = closewordind_outembed[templateind, 1:numwords_outembed].view(-1).cpu() 95 | subvocab_intemplate = closewordind[templateind, 0].view(-1).cpu() 96 | 97 | subvocab_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device) 98 | subvocab_mask[subvocab] = 1 99 | subvocab_embed_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device) 100 | subvocab_embed_mask[subvocab_embed[subvocab_embed >= numwords_freq]] = 1 # never remove the most frequent words 101 | 102 | subvocab_screened_mask = (subvocab_mask ^ subvocab_embed_mask) & subvocab_mask 103 | subvocab_screened_mask[subvocab_intemplate] = 1 # add back the words in the template sentence 104 | if addeos: 105 | subvocab_screened_mask[vocab.stoi['']] = 1 106 | 107 | subvocab_screened = torch.arange(len(vocab), dtype=torch.long, device=subvocab.device) 108 | subvocab_screened = subvocab_screened[subvocab_screened_mask] 109 | 110 | word_list = [vocab.itos[i] for i in subvocab_screened] 111 | 112 | return word_list, subvocab_screened 113 | -------------------------------------------------------------------------------- /uss/sim_embed_score.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import time 4 | from utils import timeSince 5 | # from tqdm import tqdm 6 | 7 | from sim_token_match import OneTokenMatch 8 | 9 | 10 | def pickElmoForwardLayer(embedding, elmo_layer='avg'): 11 | """ 12 | Given a forward only ELMo embedding vector of size (3, #words, 512), pick up the layer 13 | """ 14 | assert elmo_layer in ['top', 'mid', 'bot', 'avg', 'cat'] 15 | 16 | if elmo_layer == 'top': 17 | embedding = embedding[2] 18 | elif elmo_layer == 'mid': 19 | embedding = embedding[1] 20 | elif elmo_layer == 'bot': 21 | embedding = embedding[0] 22 | elif elmo_layer == 'avg': 23 | if isinstance(embedding, np.ndarray): 24 | embedding = np.average(embedding, axis=0) 25 | elif isinstance(embedding, torch.Tensor): 26 | embedding = torch.mean(embedding, dim=0) 27 | elif elmo_layer == 'cat': 28 | if isinstance(embedding, np.ndarray): 29 | embedding = np.reshape(embedding.transpose(1, 0, 2), 30 | (-1, embedding.shape[0] * embedding.shape[2])) # concat 3 layers, bottom first 31 | elif isinstance(embedding, torch.Tensor): 32 | embedding = embedding.transpose(0, 1).reshape(-1, embedding.size(0) * embedding.size(2)) 33 | 34 | return embedding 35 | 36 | 37 | def simScoreNext(template_vec, 38 | word_list, 39 | ee, 40 | batch_size=1024, 41 | prevs_state=None, 42 | prevs_align=None, 43 | normalized=True, 44 | elmo_layer='avg'): 45 | """ 46 | Score the next tokens based on sentence level similarity, with previous alignment fixed. 47 | 48 | Input: 49 | template_vec: template sentence ELMo vectors. 50 | word_list: a list of next candidate words. 51 | ee: a ``ElmoEmbedderForward`` class. 
52 | batch_size: for ee to use. 53 | prevs_state: previous hidden states. 54 | prevs_align: aligning location for the last word in the sequence. 55 | If provided, monotonicity is required. 56 | normalized: whether to use normalized dot product (cosine similarity) for token similarity calculation. 57 | elmo_layer: ELMo layer to use. 58 | Output: 59 | scores: unsorted one-token similarity scores, torch.Tensor. 60 | indices: matched indices in template_vec for each token, torch.LongTensor. 61 | states: corresponding ELMo forward lstm hidden states, List. 62 | """ 63 | sentences = [[w] for w in word_list] 64 | src_vec = pickElmoForwardLayer(template_vec, elmo_layer) 65 | if prevs_state is None: 66 | assert prevs_align is None, 'Nothing should be passed in when no history.' 67 | # beginning of sentence, the first token 68 | embeddings_and_states = ee.embed_sentences(sentences, add_bos=True, batch_size=batch_size) 69 | else: 70 | # in the middle of sentence, sequential update 71 | # start = time.time() 72 | embeddings_and_states = ee.embed_sentences(sentences, initial_state=prevs_state, batch_size=batch_size) 73 | # print('ELMo embedding: ' + timeSince(start)) 74 | 75 | embeddings, states = zip(*embeddings_and_states) # this returns two tuples 76 | 77 | scores = [] 78 | indices = [] 79 | print('Calculating similarities ---') 80 | # start = time.time() 81 | embeddings = [pickElmoForwardLayer(vec, elmo_layer) for vec in embeddings] 82 | scores, indices = OneTokenMatch(src_vec, embeddings, normalized=normalized, starting_loc=prevs_align) 83 | # print('Similarities: ' + timeSince(start)) 84 | 85 | return scores, indices, list(states) 86 | 87 | 88 | def simScoreNext_GPT2(template_vec, 89 | word_list, 90 | ge, 91 | bpe2word='last', 92 | prevs_state=None, 93 | prevs_align=None, 94 | normalized=True): 95 | """ 96 | Score the next tokens based on sentence level similarity, with previous alignment fixed. 97 | In particular, this function uses GPT-2 to embed the sentences/candidate words: 98 | - Calculate the embeddings for each candidate word using pre-trained GPT-2 model, given the previous hidden states 99 | - Calculate best alignment positions and similarity scores for each word 100 | 101 | Note: 102 | - GPT-2 uses BPE tokenizer, so each word may be split into several different units 103 | 104 | Input: 105 | template_vec (torch.Tensor): template sentence GPT-2 embedding vectors 106 | word_list (list): a list of next candidate words 107 | ge (:class:`GPT2Embedder`): a `GPT2Embedder` object for embedding words using GPT-2 108 | bpe2word (str): how to turn the BPE vectors into word vectors. 109 | 'last': last hidden state; 'avg': average hidden state. 110 | prevs_state (list[torch.Tensor]): previous hidden states for the GPT-2 model 111 | prevs_align (int): aligning location for the last word in the sequence. 112 | If provided, monotonicity is required. 113 | normalized (bool): whether to use normalized dot product (cosine similarity) for token similarity calculation 114 | 115 | Output: 116 | scores (torch.Tensor): unsorted one-token similarity scores 117 | indices (torch.LongTensor): matched indices in template_vec for each token 118 | states (list): corresponding GPT-2 past internal hidden states 119 | """ 120 | assert bpe2word in ['last', 'avg'] 121 | 122 | if prevs_state is None: 123 | # beginning of sentence, the first token 124 | assert prevs_align is None, 'Nothing should be passed in when no history.' 
125 | add_bos = True 126 | else: 127 | # in the middle of a sentence, sequential update 128 | add_bos = False 129 | 130 | embeddings, states = ge.embed_words(word_list, add_bos=add_bos, bpe2word=bpe2word, initial_state=prevs_state) 131 | 132 | scores = [] 133 | indices = [] 134 | print('Calculating similarities ---') 135 | # start = time.time() 136 | scores, indices = OneTokenMatch(template_vec, embeddings, normalized=normalized, starting_loc=prevs_align) 137 | # print('Similarities: ' + timeSince(start)) 138 | 139 | return scores, indices, states 140 | 141 | 142 | """ 143 | def simScoreNext_GPT2(template_vec, 144 | bpe_encoding_grouped, 145 | model, 146 | bpe2word='last', 147 | prevs_state=None, prevs_align=None, normalized=True): 148 | ''' 149 | Score the next tokens based on sentence level similarity, with previous alignment fixed. 150 | In particular, this function uses GPT-2 to embed the sentences/candidate words: 151 | - Calculate the embeddings for each candidate word using pretrained GPT-2 model, given the previous hidden states 152 | - Calculate best alignment positions and similarity scores for each word 153 | 154 | Note: 155 | - GPT-2 uses BPE tokenizer, so each word may be splitted into several different units 156 | 157 | Input: 158 | template_vec (torch.Tensor): template sentence GPT-2 embedding vectors 159 | word_list (list): a list of next candidate words 160 | prevs_state (list[torch.Tensor]): previous hidden states for the GPT-2 model 161 | tokenizer (pytorch_pretrained_bert.tokenization_gpt2.GPT2Tokenizer): GPT-2 tokenizer 162 | model (pytorch_pretrained_bert.modeling_gpt2.GPT2Model): GPT-2 Model 163 | bpe2word (str): how to turn the BPE vectors into word vectors. 164 | 'last': last hidden state; 'avg': average hidden state. 165 | prevs_align (int): aligning location for the last word in the sequence. 166 | If provided, monotonicity is required. 167 | normalized (bool): whether to use normalized dot product (cosine similarity) for token similarity calculation 168 | 169 | Output: 170 | scores (torch.Tensor): unsorted one-token similarity scores 171 | indices (torch.LongTensor): matched indices in template_vec for each token 172 | states (list): corresponding GPT-2 hidden states 173 | ''' 174 | assert bpe2word in ['last', 'avg'] 175 | 176 | device = next(model.parameters()).device 177 | model.eval() 178 | 179 | if prevs_state is None: 180 | # beginning of sentence, the first token 181 | assert prevs_align is None, 'Nothing should be passed in when no history.' 182 | else: 183 | # in the middle of a sentence, sequential update 184 | assert prevs_state is not None, 'There should be history.' 
185 | 186 | embeddings = [] # word embeddings 187 | states = [] # hidden states saved for sequential calculations 188 | with torch.no_grad(): 189 | for bpe_encoding in bpe_encoding_grouped: 190 | # bpe_encoding is a tensor of bpe unit ids 191 | vec, past = model(bpe_encoding, past=prevs_state) 192 | # vec: size (n, len(bpe_encoding), 768) 193 | # past: a list of length 12, each of size (2, n, 12, len(bpe_encoding), 64) 194 | # which records keys, values for 12 heads in each of the 12 layers 195 | # where n is the number of words of the same len(bpe_encoding) in the word list 196 | 197 | if bpe2word == 'last': 198 | embeddings.append(vec[:, -1, :]) # size (n, 768) 199 | elif bpe2word == 'avg': 200 | embeddings.append(vec.mean(dim=1)) # size (n, 768) 201 | else: # impossible 202 | raise ValueError 203 | 204 | past = torch.cat(past, dim=0) # size (2 * 12, n, 12, len(bpe_encoding), 64) 205 | past = torch.split(past, 1, dim=1) # list of length n, each of size (2 * 12, 1, 12, len(bpe_encoding), 64) 206 | states += past 207 | 208 | embeddings = torch.cat(embeddings, dim=0) # size (#word_list, 768) 209 | states = [torch.chunk(s, 12, dim=0) for s in states] 210 | 211 | scores = [] 212 | indices = [] 213 | print('Calculating similarities ---') 214 | # start = time.time() 215 | scores, indices = OneTokenMatch(template_vec, embeddings, normalized=normalized, starting_loc=prevs_align) 216 | # print('Similarities: ' + timeSince(start)) 217 | 218 | return scores, indices, states 219 | """ 220 | -------------------------------------------------------------------------------- /uss/sim_token_match.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sequential calculation of word similarities in an embedding space. 3 | Previous alignments are fixed once found. 4 | 5 | Each time, do the best word vector matching between a template sentence and a list of single tokens, based on 6 | cosine similarities or dot products. 7 | In the simplest case, do not require monotonicity. 8 | """ 9 | import torch 10 | 11 | 12 | def OneTokenMatch(src, token_list, normalized=False, starting_loc=None): 13 | """ 14 | Input: 15 | src: source sequence, such as a long sentence vector to be summarized. 16 | token_list: a list of word vectors to be matched with 'src'. 17 | starting_loc: aligning location for the last word in the sequence. 18 | If provided, monotonicity is required. 19 | Output: 20 | similarities: the best similarity scores for each token in 'token_list'. 21 | indices: the matched indices in 'src' for the best scores for each token. 22 | """ 23 | if isinstance(token_list, list): 24 | assert isinstance(token_list[0], torch.Tensor) and isinstance(src, torch.Tensor), \ 25 | 'source/template sequence must be torch.Tensor.' 26 | assert len(token_list[0].size()) == len(src.size()) == 2, 'input sequences must be 2D series.' 27 | elif isinstance(token_list, torch.Tensor): 28 | assert isinstance(src, torch.Tensor), 'source/template sequence must be torch.Tensor.' 29 | assert len(token_list.size()) == len(src.size()) == 2, 'input sequences must be 2D series.' 30 | else: 31 | raise TypeError 32 | 33 | if starting_loc is not None: 34 | # require monotonicity, by only looking at 'src' from or after 'starting_loc' 35 | # strict monotonicity 36 | assert starting_loc < len(src) - 1, 'last word already matched to the last token in template, ' \ 37 | 'when requiring strict monotonicity.' 
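        # Illustration (hypothetical values): if the previously generated word aligned to src position 3,
        # only src[4:] is searched below, i.e. the next word must align strictly to the right of position 3.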
38 | src = src[(starting_loc + 1):] 39 | # weak monotonicity 40 | # assert starting_loc < len(src) 41 | # src = src[starting_loc:] 42 | 43 | if isinstance(token_list, list): 44 | token_matrix = torch.cat(token_list, dim=0) 45 | elif isinstance(token_list, torch.Tensor): 46 | token_matrix = token_list 47 | else: 48 | raise TypeError 49 | sim_table = torch.mm(src, token_matrix.t()) # size: (src_len, token_list_len) or (truncated_src_len, token_list_len) 50 | 51 | if normalized: 52 | sim_table = sim_table / torch.ger(src.norm(2, 1), token_matrix.norm(2, 1)) 53 | 54 | similarities, indices = torch.max(sim_table, dim=0) 55 | 56 | if starting_loc is not None: 57 | indices += starting_loc + 1 # strict monotonicity 58 | # indices += starting_loc # weak monotonicity 59 | 60 | return similarities, indices 61 | 62 | 63 | def TokenMatch(src, tgt, mono=True, weakmono=False, normalized=True): 64 | """ 65 | Calculate the similarity between two sentences by word embedding match and single token alignment. 66 | 67 | Input: 68 | src: source sequence word embeddings. 'torch.Tensor' of size (src_seq_len, embed_dim). 69 | tgt: short target sequence word embeddings to be matched to 'src'. 'torch.Tensor' of size 70 | (tgt_seq_len, embed_dim). 71 | mono: whether to constrain the alignments to be monotonic. Default: True. 72 | weakmono: whether to relax the alignment monotonicity to be weak (non-strict). Only effective when 'mono' 73 | is True. Default: False. 74 | normalized: whether to normalize the dot product in calculating word similarities, i.e. whether to use 75 | cosine similarity or just dot product. Default: True. 76 | 77 | Output: 78 | similarity: sequence similarity, by summing the max similarities of the best alignment. 79 | indices: locations in the 'src' sequence that each 'tgt' token is aligned to. 80 | """ 81 | 82 | assert isinstance(src, torch.Tensor) and isinstance(tgt, torch.Tensor), 'input sequences must be torch.Tensor.' 83 | assert len(src.size()) == len(tgt.size()) == 2, 'input sequences must be 2D series.' 
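    # A toy illustration of the intended call (hypothetical tensors; a tiny embed_dim just for brevity):
    #   src = torch.randn(7, 4)    # a 7-word source sentence
    #   tgt = torch.randn(3, 4)    # a 3-word candidate summary
    #   similarity, indices = TokenMatch(src, tgt, mono=True)
    # 'indices' then holds, for each tgt token, a non-decreasing position in src, and 'similarity'
    # is the sum of the matched cosine similarities (dot products if normalized=False).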
84 | 85 | sim_table = torch.mm(src, tgt.t()) 86 | if normalized: 87 | sim_table = sim_table / torch.ger(src.norm(2, 1), tgt.norm(2, 1)) 88 | 89 | if mono: 90 | src_len, tgt_len = sim_table.size() 91 | max_sim = [] 92 | if weakmono: 93 | indices = [0] 94 | for i in range(1, tgt_len + 1): 95 | mi, ii = torch.max(sim_table[indices[i - 1]:, i - 1].unsqueeze(1), dim=0) 96 | max_sim.append(mi) 97 | indices.append(ii + indices[i - 1]) 98 | else: 99 | indices = [-1] 100 | for i in range(1, tgt_len + 1): 101 | if indices[i - 1] == src_len - 1: 102 | max_sim.append(sim_table[-1, i - 1].unsqueeze(0)) 103 | indices.append(indices[i - 1]) 104 | else: 105 | mi, ii = torch.max(sim_table[(indices[i - 1] + 1):, i - 1].unsqueeze(1), dim=0) 106 | max_sim.append(mi) 107 | indices.append(ii + indices[i - 1] + 1) 108 | max_sim = torch.cat(max_sim) 109 | indices = torch.cat(indices[1:]) 110 | else: 111 | max_sim, indices = torch.max(sim_table, dim=0) 112 | 113 | similarity = torch.sum(max_sim) 114 | 115 | return similarity, indices 116 | -------------------------------------------------------------------------------- /uss/summary_search_elmo.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import time 3 | import sys 4 | import os 5 | import argparse 6 | 7 | import torch 8 | from tqdm import tqdm 9 | 10 | from elmo_sequential_embedder import ElmoEmbedderForward 11 | from pre_closetables import ELMoBotEmbedding, findclosewords_vocab 12 | # from pre_word_list import findwordlist, findwordlist_screened 13 | from pre_word_list import findwordlist_screened2 14 | from lm_subvocab import clmk_nn 15 | from beam_search import Beam 16 | from utils import timeSince 17 | 18 | 19 | def gensummary_elmo(template_vec, 20 | ee, 21 | vocab, 22 | LMModel, 23 | word_list, 24 | subvocab, 25 | clustermask=None, 26 | mono=True, 27 | renorm=True, 28 | temperature=1, 29 | elmo_layer='avg', 30 | max_step=20, 31 | beam_width=10, 32 | beam_width_start=10, 33 | alpha=0.1, 34 | alpha_start=0.1, 35 | begineos=True, 36 | stopbyLMeos=False, 37 | devid=0, 38 | **kwargs): 39 | """ 40 | Unsupervised sentence summary generation using beam search, by contextual matching and a summary style language model. 41 | The contextual matching here is on top of pretrained ELMo embeddings. 42 | 43 | Input: 44 | - template_vec (torch.Tensor): forward only ELMo embeddings of the source sentence. 45 | 'torch.Tensor' of size (3, seq_len, 512). 46 | - ee (elmo_sequential_embedder.ElmoEmbedderForward): 'elmo_sequential_embedder.ElmoEmbedderForward' object. 47 | - vocab (torchtext.vocab.Vocab): 'torchtext.vocab.Vocab' object. Should be the same as is used for the 48 | pretrained language model. 49 | - LMModel (user defined torch.nn.Module): a pretrained language model on the summary sentences. 50 | - word_list (list): a list of words in the vocabulary to work with. 'List'. 51 | - subvocab (torch.LongTensor): 'torch.LongTensor' consisting of the indices of the words corresponding 52 | to `word_list`. 53 | - clustermask (torch.ByteTensor): a binary mask for each of the sub-vocabulary word. 54 | 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). Default:None. 55 | - mono (bool): whether to keep monotonicity contraint. Default: True. 56 | - renorm (bool): whether to renormalize the probabilities over the sub-vocabulary. Default: True. 57 | - temperature (float): temperature applied to the softmax in the language model. Default: 1. 
58 | - elmo_layer (str): which ELMo layer to use as the word type representation. 59 | Choose from ['avg', 'cat', 'bot', 'mid', 'top']. Default: 'avg'. 60 | - max_step (int): maximum number of beam steps. 61 | - beam_width (int): beam width. 62 | - beam_width_start (int): beam width of the first step. 63 | - alpha (float): the amount of language model part used for scoring. The score is: 64 | (1 - \alpha) * similarity_logscore + \alpha * LM_logscore. 65 | - alpha_start (float): the amount of language model part used for scoring, only for the first step. 66 | - begineos (bool): whether to begin with the special '' token as is trained in the language model. 67 | Note that ELMo has its own special beginning token. Default: True. 68 | - stopbyLMeos (bool): whether to stop a sentence solely by the language model predicting '' as the 69 | top possibility. Default: False. 70 | - devid (int): device id to run the algorithm and LSTM language models. 'int', default: 0. -1 for cpu. 71 | **kwargs: other arguments input to function . 72 | E.g. - normalized (bool): whether to normalize the dot product when calculating the similarity, 73 | which makes it cosine similarity. Default: True. 74 | - ifadditive (bool): whether to use an additive model on mixing the probability scores. Default: False. 75 | 76 | Output: 77 | - beam (beam_search.Beam): 'Beam' object, recording all the generated sequences. 78 | 79 | """ 80 | device = 'cpu' if devid == -1 else f'cuda:{devid}' 81 | 82 | # Beam Search: initialization 83 | if begineos: 84 | beam = Beam(1, vocab, init_ids=[vocab.stoi['']], device=device, 85 | sim_score=0, lm_score=0, lm_state=None, elmo_state=None, align_loc=None) 86 | else: 87 | beam = Beam(1, vocab, init_ids=[None], device=device, 88 | sim_score=0, lm_score=0, lm_state=None, elmo_state=None, align_loc=None) 89 | 90 | # first step: start with 'beam_width_start' best matched words 91 | beam.beamstep(beam_width_start, 92 | beam.combscoreK, 93 | template_vec=template_vec, 94 | ee=ee, 95 | LMModel=LMModel, 96 | word_list=word_list, 97 | subvocab=subvocab, 98 | clustermask=clustermask, 99 | alpha=alpha_start, 100 | renorm=renorm, 101 | temperature=temperature, 102 | elmo_layer=elmo_layer, 103 | # normalized=True, 104 | # ifadditive=False, 105 | **kwargs) 106 | 107 | # run beam search, until all sentences hit or max_step reached 108 | for s in range(max_step): 109 | print(f'beam step {s + 1} ' + '-' * 50 + '\n') 110 | beam.beamstep(beam_width, 111 | beam.combscoreK, 112 | template_vec=template_vec, 113 | ee=ee, 114 | LMModel=LMModel, 115 | word_list=word_list, 116 | subvocab=subvocab, 117 | clustermask=clustermask, 118 | mono=mono, 119 | alpha=alpha, 120 | renorm=renorm, 121 | temperature=temperature, 122 | stopbyLMeos=stopbyLMeos, 123 | elmo_layer=elmo_layer, 124 | # normalized=True, 125 | # ifadditive=False, 126 | **kwargs) 127 | # all beams reach termination 128 | if beam.endall: 129 | break 130 | 131 | return beam 132 | 133 | 134 | def sortsummary(beam, beta=0): 135 | """ 136 | Sort the generated summaries by beam search, with length penalty considered. 137 | 138 | Input: 139 | - beam (beam_search.Beam): 'Beam' object finished with beam search. 140 | - beta (float): length penalty when sorting. Default: 0 (no length penalty). 141 | 142 | Output: 143 | - ssa (list[tuple]): 'List[Tuple]' of (score_avg, sentence, alignment, sim_score, lm_score). 
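    A re-ranking sketch (on a finished 'beam'; 'beta' is the length-penalty exponent, so beta=0 keeps the raw accumulated scores):

        >>> ssa = sortsummary(beam, beta=0.1)
        >>> ' '.join(ssa[0][1][1:])    # the top-scoring summary, skipping the initial start token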
144 |     """
145 |     sents = []
146 |     aligns = []
147 |     score_avgs = []
148 |     sim_scores = []
149 |     lm_scores = []
150 | 
151 |     for ks in beam.endbus:
152 |         sent, rebeam = beam.retrieve(ks[0] + 1, ks[1])
153 |         score_avg = ks[2] / (ks[1] ** beta)
154 | 
155 |         sents.append(sent)
156 |         aligns.append(beam.retrieve_align(rebeam))
157 |         score_avgs.append(score_avg)
158 |         sim_scores.append(ks[3])
159 |         lm_scores.append(ks[4])
160 | 
161 |     ssa = sorted([(score_avgs[i], sents[i], aligns[i], sim_scores[i], lm_scores[i]) for i in range(len(sents))],
162 |                  reverse=True)
163 | 
164 |     return ssa
165 | 
166 | 
167 | def fixlensummary(beam, length=-1):
168 |     """
169 |     Pull out fixed-length summaries from the beam search.
170 | 
171 |     Input:
172 |         - beam (beam_search.Beam): 'Beam' object finished with beam search.
173 |         - length (int): desired length of the summary.
174 | 
175 |     Output:
176 |         - ssa (list[tuple]): 'List[Tuple]' of sorted (score, sentence, alignments, sim_score, lm_score).
177 |     """
178 |     assert length >= 1 and length <= beam.step, 'invalid sentence length.'
179 | 
180 |     ssa = []
181 |     for i in range(beam.K[length]):
182 |         sent, rebeam = beam.retrieve(i + 1, length)
183 |         ssa.append((beam.beamseq[length][i].score,
184 |                     sent,
185 |                     beam.retrieve_align(rebeam),
186 |                     beam.beamseq[length][i].sim_score,
187 |                     beam.beamseq[length][i].lm_score))
188 | 
189 |     return ssa
190 | 
191 | 
192 | ###############################################################################
193 | ########## some default parameters ##########
194 | ###############################################################################
195 | devid = 0
196 | 
197 | ##### for English giga words
198 | arttxtpath = './data/Giga-sum/input_unk_250.txt'
199 | # arttxtpath = './data/Giga-sum/input_unk_251-500.txt'
200 | # arttxtpath = './data/Giga-sum/input_unk_501-750.txt'
201 | # arttxtpath = './data/Giga-sum/input_unk_751-1000.txt'
202 | # arttxtpath = './data/Giga-sum/input_unk_1001-1250.txt'
203 | # arttxtpath = './data/Giga-sum/input_unk_1251-1500.txt'
204 | # arttxtpath = './data/Giga-sum/input_unk_1501-1750.txt'
205 | # arttxtpath = './data/Giga-sum/input_unk_1751-1951.txt'
206 | 
207 | # arttxtpath = './data/Giga-sum/input_unk.txt'
208 | 
209 | '''
210 | vocab_path = './lm_lstm_models/gigaword/vocabTle.pkl'
211 | modelclass_path = './lm_lstm'
212 | model_path = './lm_lstm_models/gigaword/Tle_LSTM_untied.pth'
213 | closeword = './voctbls/vocabTleCloseWord'
214 | closeword_lmemb = './voctbls/vocabTleCloseWord'
215 | savedir = './results_elmo_giga/'
216 | '''
217 | 
218 | ##### for Google sentence compression dataset
219 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
220 | 
221 | vocab_path = './lm_lstm_models/sentence_compression/vocabsctgt.pkl'
222 | modelclass_path = './lm_lstm'
223 | model_path = './lm_lstm_models/sentence_compression/sctgt_LSTM_1024_untied.pth'
224 | closeword = './voctbls/vocabsctgtCloseWord'
225 | closeword_lmemb = './voctbls/vocabsctgtCloseWord'
226 | savedir = './results_elmo_sc/'
227 | 
228 | '''
229 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
230 | 
231 | vocab_path = './lm_lstm_models/sentence_compression/vocabsctgt.pkl'
232 | modelclass_path = './lm_lstm'
233 | model_path = './lm_lstm_models/sentence_compression/sctgt_LSTM_untied.pth'
234 | closeword = './voctbls/vocabsctgtCloseWord'
235 | closeword_lmemb = './voctbls/vocabsctgtCloseWord'
236 | savedir = './results_elmo_sc_512/'
237 | '''
238 | 
239 | ##### beam search parameters
240 | 
begineos = True 241 | appendsenteos = True 242 | eosavgemb = False 243 | max_step = 20 244 | beam_width = 10 245 | beam_width_start = 10 246 | # mono = True 247 | renorm = False 248 | cluster = True 249 | temperature = 1 250 | elmo_layer = 'avg' 251 | alpha = 0.1 252 | alpha_start = alpha 253 | stopbyLMeos = False 254 | # ifadditive = False 255 | beta = 0.0 256 | 257 | # find word list 258 | numwords = 6 259 | numwords_outembed = -1 260 | numwords_freq = 500 261 | 262 | # if fix generation length 263 | fixedlen = False 264 | genlen = '9' # '9, 10, 11' for example for multiple lengths; including the starting '' token, and can include 265 | # the ending '' token as well (if not 'stobbyLMeos') 266 | 267 | ############################################################################### 268 | 269 | 270 | def parse_args(): 271 | parser = argparse.ArgumentParser(description='Unsupervised generation of summaries from source file.') 272 | # source file 273 | parser.add_argument('--src', type=str, default=arttxtpath, help='source sentences file') 274 | parser.add_argument('--devid', type=int, default=devid, help='device id; -1 for cpu') 275 | # preparations 276 | parser.add_argument('--vocab', type=str, default=vocab_path, help='vocabulary file') 277 | parser.add_argument('--modelclass', type=str, default=modelclass_path, 278 | help='location of the model class definition file') 279 | parser.add_argument('--model', type=str, default=model_path, help='pre-trained language model') 280 | parser.add_argument('--closeword', type=str, default=closeword, help='character embedding close word tables') 281 | parser.add_argument('--closeword_lmemb', type=str, default=closeword_lmemb, 282 | help='LM output embedding close word tables') 283 | parser.add_argument('--savedir', type=str, default=savedir, help='directory to save results') 284 | # beam search parameters 285 | parser.add_argument('--begineos', type=int, default=int(begineos), help='whether to start with ') 286 | parser.add_argument('--appendsenteos', type=int, default=int(appendsenteos), 287 | help='whether to append at the end of source sentence') 288 | parser.add_argument('--eosavgemb', type=int, default=int(eosavgemb), 289 | help='whether to encode using average hidden states') 290 | parser.add_argument('--max_step', type=int, default=max_step, help='maximum beam step') 291 | parser.add_argument('--beam_width', type=int, default=beam_width, help='beam width') 292 | parser.add_argument('--beam_width_start', type=int, default=beam_width_start, help='beam width at first step') 293 | parser.add_argument('--renorm', type=int, default=int(renorm), 294 | help='whether to renormalize the probabilities over the sub-vocabulary') 295 | parser.add_argument('--cluster', type=int, default=int(cluster), 296 | help='whether to do clustering for the sub-vocabulary probabilities') 297 | parser.add_argument('--temp', type=float, default=temperature, 298 | help='temperature used to smooth the output of the softmax layer') 299 | parser.add_argument('--elmo_layer', type=str, default=elmo_layer, choices=['bot', 'mid', 'top', 'avg', 'cat'], 300 | help='elmo layer to use') 301 | parser.add_argument('--alpha', type=float, default=alpha, help='mixture coefficient for LM') 302 | parser.add_argument('--alpha_start', type=float, default=alpha_start, 303 | help='mixture coefficient for LM for the first step') 304 | parser.add_argument('--stopbyLMeos', type=int, default=int(stopbyLMeos), 305 | help='whether to stop the sentence solely by LM prediction') 306 | 
parser.add_argument('--beta', type=float, default=beta, help='length penalty')
307 |     parser.add_argument('--n', type=int, default=numwords,
308 |                         help='number of closest words for each token to form the candidate list')
309 |     parser.add_argument('--ns', type=int, default=numwords_outembed,
310 |                         help='number of closest words per token in the output embedding, '
311 |                              'used to screen the candidate list')
312 |     parser.add_argument('--nf', type=int, default=numwords_freq,
313 |                         help='number of the most frequent words in the vocabulary to keep in the candidate list')
314 |     parser.add_argument('--fixedlen', type=int, default=int(fixedlen),
315 |                         help='whether to generate fixed length summaries')
316 |     parser.add_argument('--genlen', type=str, default=genlen,
317 |                         help='lengths of summaries to be generated; should be comma separated')
318 | 
319 |     args = parser.parse_args()
320 |     return args
321 | 
322 | 
323 | if __name__ == '__main__':
324 |     args = parse_args()
325 | 
326 |     ##### input arguments
327 |     arttxtpath = args.src
328 | 
329 |     devid = args.devid
330 | 
331 |     vocab_path = args.vocab  # vocabulary for the pre-trained language model
332 |     modelclass_path = args.modelclass
333 |     model_path = args.model
334 | 
335 |     closewordsim_path = args.closeword + 'Sims.pkl'
336 |     closewordind_path = args.closeword + 'Indices.pkl'  # character level word embeddings
337 |     closewordsim_outembed_path = args.closeword_lmemb + 'Sims_outembed_' + \
338 |                                  os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
339 |     closewordind_outembed_path = args.closeword_lmemb + 'Indices_outembed_' + \
340 |                                  os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
341 | 
342 |     device = 'cpu' if devid == -1 else f'cuda:{devid}'
343 | 
344 |     ##### beam search parameters
345 |     begineos = args.begineos
346 |     appendsenteos = args.appendsenteos
347 |     eosavgemb = args.eosavgemb if appendsenteos else False
348 |     max_step = args.max_step
349 |     beam_width = args.beam_width
350 |     beam_width_start = args.beam_width_start
351 |     mono = True
352 |     renorm = args.renorm
353 |     cluster = args.cluster
354 |     temp = args.temp
355 |     elmo_layer = args.elmo_layer
356 |     alpha = args.alpha
357 |     alpha_start = args.alpha_start
358 |     stopbyLMeos = args.stopbyLMeos
359 |     ifadditive = False
360 |     beta = args.beta
361 |     numwords = args.n
362 |     numwords_outembed = args.ns if args.ns != -1 else numwords
363 |     numwords_freq = args.nf
364 |     fixedlen = args.fixedlen
365 |     genlen = list(map(int, args.genlen.split(',')))  # including the starting '' token
366 |     # and can include the ending '' token as well (if not 'stopbyLMeos')
367 | 
368 |     ##### read in the article/source sentences to be summarized
369 |     g = open(arttxtpath, 'r')
370 |     sents = [line.strip() for line in g if line.strip()]
371 |     g.close()
372 |     nsents = len(sents)
373 | 
374 |     ##### load the ELMo forward embedder class
375 |     ee = ElmoEmbedderForward(cuda_device=devid)
376 | 
377 |     ##### load vocabulary and the pre-trained language model
378 |     vocab = pickle.load(open(vocab_path, 'rb'))
379 | 
380 |     if modelclass_path not in sys.path:
381 |         sys.path.insert(1, modelclass_path)  # this is for torch.load to load the entire model
382 |         # the model class file must be included in the search path
383 |     LMModel = torch.load(model_path, map_location=torch.device(device))
384 |     embedmatrix = LMModel.proj.weight
385 | 
386 |     ##### check if the close_tables exist already; if not, generate
387 |     if not os.path.exists(closewordind_path):
388 |         # character embeddings of the vocabulary
389 |         embedmatrix_cnn
= ELMoBotEmbedding(vocab.itos, device=devid) 390 | values_cnn, indices_cnn = findclosewords_vocab(vocab, embedmatrix_cnn, numwords=500) 391 | # save results 392 | os.makedirs(os.path.dirname(closewordind_path), exist_ok=True) 393 | pickle.dump(values_cnn, open(closewordsim_path, 'wb')) 394 | pickle.dump(indices_cnn, open(closewordind_path, 'wb')) 395 | 396 | if not os.path.exists(closewordind_outembed_path): 397 | values, indices = findclosewords_vocab(vocab, embedmatrix, numwords=500) 398 | # save results 399 | os.makedirs(os.path.dirname(closewordind_outembed_path), exist_ok=True) 400 | pickle.dump(values, open(closewordsim_outembed_path, 'wb')) 401 | pickle.dump(indices, open(closewordind_outembed_path, 'wb')) 402 | 403 | closewordind = pickle.load(open(closewordind_path, 'rb')) 404 | closewordind_outembed = pickle.load(open(closewordind_outembed_path, 'rb')) 405 | 406 | ##### generate save file name 407 | basename = os.path.basename(arttxtpath) 408 | basename = os.path.splitext(basename)[0] 409 | 410 | savedir = args.savedir 411 | 412 | smrypath = os.path.join(savedir, 'smry_') + basename + f'_Ks{beam_width_start}' + f'_clust{int(cluster)}' 413 | 414 | if renorm: 415 | smrypath += f'_renorm{int(renorm)}' 416 | if temp != 1: 417 | smrypath += f'_temper{temp}' 418 | if elmo_layer != 'avg': 419 | smrypath += f'_EL{elmo_layer}' 420 | 421 | smrypath += f'_eosavg{int(eosavgemb)}' + f'_n{numwords}' 422 | 423 | if numwords_outembed != numwords: 424 | smrypath += f'_ns{numwords_outembed}' 425 | if numwords_freq != 500: 426 | smrypath += f'_nf{numwords_freq}' 427 | if beam_width != 10: 428 | smrypath += f'_K{beam_width}' 429 | if stopbyLMeos: 430 | smrypath += f'_soleLMeos' 431 | 432 | if alpha_start != alpha: 433 | smrypath += f'_as{alpha_start}' 434 | if fixedlen: 435 | genlen = sorted(genlen) 436 | smrypath_list = [smrypath + f'_length{l - 1}' + f'_a{alpha}' + '_all.txt' for l in genlen] 437 | else: 438 | smrypath += f'_a{alpha}' + f'_b{beta}' + '_all.txt' 439 | 440 | ##### run summary generation and write to file 441 | if fixedlen: 442 | os.makedirs(os.path.dirname(smrypath), exist_ok=True) 443 | g_list = [open(fname, 'w') for fname in smrypath_list] 444 | else: 445 | os.makedirs(os.path.dirname(smrypath), exist_ok=True) 446 | g = open(smrypath, 'w') 447 | 448 | start = time.time() 449 | for ind in tqdm(range(nsents)): 450 | template = sents[ind].strip('.').strip() # remove '.' 
at the end
451 | if appendsenteos:
452 | template += ' <eos>'
453 |
454 | ### Find the close words to those in the template sentence
455 | # word_list, subvocab = findwordlist(template, closewordind, vocab, numwords=1, addeos=True)
456 | # word_list, subvocab = findwordlist_screened(template, closewordind, closewordind_outembed,
457 | # vocab, numwords=6, addeos=True)
458 | word_list, subvocab = findwordlist_screened2(template, closewordind, closewordind_outembed, vocab,
459 | numwords=numwords, numwords_outembed=numwords_outembed,
460 | numwords_freq=numwords_freq, addeos=True)
461 | if cluster:
462 | clustermask = clmk_nn(embedmatrix, subvocab)
463 |
464 | ### ELMo embedding of the template sentence
465 | if eosavgemb is False:
466 | template_vec, _ = ee.embed_sentence(template.split(), add_bos=True)
467 | else:
468 | tt = template.split()[:-1]
469 | hiddens = []
470 | template_vec = None
471 | current_hidden = None
472 | for i in range(len(tt)):
473 | current_embed, current_hidden = ee.embed_sentence([tt[i]], add_bos=True if i == 0 else False,
474 | initial_state=current_hidden)
475 | hiddens.append(current_hidden)
476 | template_vec = current_embed if template_vec is None else torch.cat([template_vec, current_embed],
477 | dim=1)
478 | hiddens_h, hiddens_c = zip(*hiddens)
479 | hiddens_avg = (sum(hiddens_h) / len(hiddens_h), sum(hiddens_c) / len(hiddens_c))
480 | eosavg, _ = ee.embed_sentence(['<eos>'], initial_state=hiddens_avg)
481 | template_vec = torch.cat([template_vec, eosavg], dim=1)
482 |
483 | ### beam search
484 | max_step_temp = min([len(template.split()), max_step])
485 | beam = gensummary_elmo(template_vec,
486 | ee,
487 | vocab,
488 | LMModel,
489 | word_list,
490 | subvocab,
491 | clustermask=clustermask if cluster else None,
492 | renorm=renorm,
493 | temperature=temp,
494 | elmo_layer=elmo_layer,
495 | max_step=max_step_temp,
496 | beam_width=beam_width,
497 | beam_width_start=beam_width_start,
498 | mono=mono,
499 | alpha=alpha,
500 | alpha_start=alpha_start,
501 | begineos=begineos,
502 | stopbyLMeos=stopbyLMeos,
503 | ifadditive=ifadditive,
504 | devid=devid)
505 |
506 | ### sort and write to file
507 | if fixedlen:
508 | for j in range(len(genlen) - 1, -1, -1):
509 | g_list[j].write('-' * 5 + f'<{ind + 1}>' + '-' * 5 + '\n')
510 | g_list[j].write('\n')
511 | if genlen[j] <= beam.step:
512 | ssa = fixlensummary(beam, length=genlen[j])
513 | if ssa == []:
514 | g_list[j].write('\n')
515 | else:
516 | for m in range(len(ssa)):
517 | g_list[j].write(' '.join(ssa[m][1][1:]) + '\n')
518 | g_list[j].write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3])
519 | + ' ' + '{:.3f}'.format(ssa[m][4]) + '\n')
520 | g_list[j].writelines(['%d, ' % loc for loc in ssa[m][2]])
521 | g_list[j].write('\n')
522 | g_list[j].write('\n')
523 | else:
524 | g_list[j].write('\n')
525 |
526 | if (ind + 1) % 10 == 0:
527 | g_list[j].flush()
528 | os.fsync(g_list[j].fileno())
529 | else:
530 | ssa = sortsummary(beam, beta=beta)
531 | g.write('-' * 5 + f'<{ind + 1}>' + '-' * 5 + '\n')
532 | g.write('\n')
533 | if ssa == []:
534 | g.write('\n')
535 | else:
536 | for m in range(len(ssa)):
537 | g.write(' '.join(ssa[m][1][1:]) + '\n')
538 | g.write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3]) + ' ' + '{:.3f}'.format(
539 | ssa[m][4]) + '\n')
540 | g.writelines(['%d, ' % loc for loc in ssa[m][2]])
541 | g.write('\n')
542 | g.write('\n')
543 |
544 | if (ind + 1) % 10 == 0:
545 | g.flush()
546 | os.fsync(g.fileno())
547 |
548 | print('time elapsed %s' % timeSince(start))
549 | if
fixedlen:
550 | for gg in g_list:
551 | gg.close()
552 | print('results saved to: %s' % (("\n" + " " * 18).join(smrypath_list)))
553 | else:
554 | g.close()
555 | print(f'results saved to: {smrypath}')
556 |
--------------------------------------------------------------------------------
/uss/summary_search_gpt2.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import pickle
3 | import time
4 | import math
5 | import sys
6 | import os
7 | from tqdm import tqdm
8 | import argparse
9 |
10 | from pre_closetables import ELMoBotEmbedding, findclosewords_vocab
11 |
12 | from gpt2_sequential_embedder import GPT2Embedder
13 | # from pre_word_list import findwordlist, findwordlist_screened
14 | from pre_word_list import findwordlist_screened2
15 | from lm_subvocab import clmk_nn
16 | from beam_search import Beam
17 | from utils import timeSince
18 |
19 |
20 | def gensummary_gpt2(template_vec,
21 | ge,
22 | vocab,
23 | LMModel,
24 | word_list,
25 | subvocab,
26 | clustermask=None,
27 | mono=True,
28 | renorm=True,
29 | temperature=1,
30 | bpe2word='last',
31 | max_step = 20,
32 | beam_width = 10,
33 | beam_width_start = 10,
34 | alpha=0.1,
35 | alpha_start=0.1,
36 | begineos=True,
37 | stopbyLMeos=False,
38 | devid=0,
39 | **kwargs):
40 | """
41 | Unsupervised sentence summary generation using beam search, by contextual matching and a summary style language model.
42 | The contextual matching here is on top of pre-trained GPT-2 embeddings.
43 |
44 | Input:
45 | template_vec: left-to-right (forward only) GPT-2 embeddings of the source sentence. 'torch.Tensor'.
46 | ge: 'gpt2_sequential_embedder.GPT2Embedder' object.
47 | vocab: 'torchtext.vocab.Vocab' object. Should be the same as is used for the pre-trained language model.
48 | LMModel: a pre-trained language model on the summary sentences.
49 | word_list: a list of words in the vocabulary to work with. 'List'.
50 | subvocab: 'torch.LongTensor' consisting of the indices of the words corresponding to 'word_list'.
51 | clustermask: a binary mask for each of the sub-vocabulary words. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). Default: None.
52 | mono: whether to keep the monotonicity constraint. Default: True.
53 | renorm: whether to renormalize the probabilities over the sub-vocabulary. Default: True.
54 | temperature: temperature applied to the softmax in the language model. Default: 1.
55 | bpe2word: how to turn the BPE vectors into word vectors. Choose from ['last', 'avg']. Default: 'last'.
56 | max_step: maximum number of beam steps.
57 | beam_width: beam width.
58 | beam_width_start: beam width of the first step.
59 | alpha: the weight of the language model part in the score. The score is: (1 - \alpha) * similarity_logscore + \alpha * LM_logscore.
60 | begineos: whether to begin with the special '<eos>' token as is trained in the language model. Note that the GPT-2 embedder adds its own special beginning token. Default: True.
61 | stopbyLMeos: whether to stop a sentence solely by the language model predicting '<eos>' as the top possibility. Default: False.
62 | devid: device id to run the algorithm and LSTM language models. 'int', default: 0. -1 for cpu.
63 | **kwargs: other arguments passed to the scoring function (e.g. 'beam.combscoreK_GPT2').
64 | E.g. normalized: whether to normalize the dot product when calculating the similarity, which makes it cosine similarity. Default: True.
65 | ifadditive: whether to use an additive model on mixing the probability scores. Default: False.
66 |
67 | Output:
68 | beam: 'Beam' object, recording all the generated sequences.
69 |
70 | """
71 | device = 'cpu' if devid == -1 else f'cuda:{devid}'
72 |
73 | # Beam Search: initialization
74 | if begineos:
75 | beam = Beam(1, vocab, init_ids=[vocab.stoi['<eos>']], device=device,
76 | sim_score=0, lm_score=0, lm_state=None, gpt2_state=None, align_loc=None)
77 | else:
78 | beam = Beam(1, vocab, init_ids=[None], device=device,
79 | sim_score=0, lm_score=0, lm_state=None, gpt2_state=None, align_loc=None)
80 |
81 | # first step: start with 'beam_width_start' best matched words
82 | beam.beamstep(beam_width_start,
83 | beam.combscoreK_GPT2,
84 | template_vec=template_vec,
85 | ge=ge,
86 | LMModel=LMModel,
87 | word_list=word_list,
88 | subvocab=subvocab,
89 | clustermask=clustermask,
90 | alpha=alpha_start,
91 | renorm=renorm,
92 | temperature=temperature,
93 | bpe2word=bpe2word,
94 | normalized=True,
95 | ifadditive=False,
96 | **kwargs)
97 |
98 | # run beam search, until all sentences hit '<eos>' or max_step is reached
99 | for s in range(max_step):
100 | print(f'beam step {s+1} ' + '-' * 50 + '\n')
101 | beam.beamstep(beam_width,
102 | beam.combscoreK_GPT2,
103 | template_vec=template_vec,
104 | ge=ge,
105 | LMModel=LMModel,
106 | word_list=word_list,
107 | subvocab=subvocab,
108 | clustermask=clustermask,
109 | mono=mono,
110 | alpha=alpha,
111 | renorm=renorm,
112 | temperature=temperature,
113 | stopbyLMeos=stopbyLMeos,
114 | bpe2word=bpe2word,
115 | normalized=True,
116 | ifadditive=False,
117 | **kwargs)
118 | # all beams reach termination
119 | if beam.endall:
120 | break
121 |
122 | return beam
123 |
124 |
125 | def sortsummary(beam, beta=0):
126 | """
127 | Sort the generated summaries by beam search, with length penalty considered.
128 |
129 | Input:
130 | beam: 'Beam' object finished with beam search.
131 | beta: length penalty when sorting. Default: 0 (no length penalty).
132 | Output:
133 | ssa: 'List[Tuple]' of (score_avg, sentence, alignment, sim_score, lm_score).
134 | """
135 | sents = []
136 | aligns = []
137 | score_avgs = []
138 | sim_scores = []
139 | lm_scores = []
140 |
141 | for ks in beam.endbus:
142 | sent, rebeam = beam.retrieve(ks[0] + 1, ks[1])
143 | score_avg = ks[2] / (ks[1] ** beta)
144 |
145 | sents.append(sent)
146 | aligns.append(beam.retrieve_align(rebeam))
147 | score_avgs.append(score_avg)
148 | sim_scores.append(ks[3])
149 | lm_scores.append(ks[4])
150 |
151 | ssa = sorted([(score_avgs[i], sents[i], aligns[i], sim_scores[i], lm_scores[i]) for i in range(len(sents))], reverse=True)
152 |
153 | return ssa
154 |
155 |
156 | def fixlensummary(beam, length=-1):
157 | """
158 | Pull out fixed length summaries from the beam search.
159 |
160 | Input:
161 | beam: 'Beam' object finished with beam search.
162 | length: desired length of the summary.
163 | Output:
164 | ssa: 'List[Tuple]' of sorted (score, sentence, alignments, sim_score, lm_score).
165 | """
166 | assert length >= 1 and length <= beam.step, 'invalid sentence length.'
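# note: 'length' is measured in beam steps, so it also counts the initial '<eos>' position
# (cf. the 'genlen' comment below and the '_length{l - 1}' suffix used for the save-file names in __main__)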
167 |
168 | ssa = []
169 | for i in range(beam.K[length]):
170 | sent, rebeam = beam.retrieve(i + 1, length)
171 | ssa.append((beam.beamseq[length][i].score, sent, beam.retrieve_align(rebeam), beam.beamseq[length][i].sim_score, beam.beamseq[length][i].lm_score))
172 |
173 | return ssa
174 |
175 |
176 | ##### input arguments
177 | #arttxtpath = '../LM/data/Giga-sum/input_unk_250.txt'
178 | #arttxtpath = '../LM/data/Giga-sum/input_unk_251-500.txt'
179 | #arttxtpath = '../LM/data/Giga-sum/input_unk_501-750.txt'
180 | #arttxtpath = '../LM/data/Giga-sum/input_unk_751-1000.txt'
181 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1001-1250.txt'
182 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1251-1500.txt'
183 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1501-1750.txt'
184 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1751-1951.txt'
185 |
186 | arttxtpath = '../LM/data/Giga-sum/input_unk.txt'
187 |
188 | devid = 0
189 |
190 | '''
191 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
192 |
193 | vocab_path = '../LM/LSTM/models_sc/vocabsctgt.pkl'
194 | modelclass_path = '../LM/LSTM'
195 | model_path = '../LM/LSTM/models_sc/sctgt_LSTM_1024_untied.pth'
196 | closeword = './voctbls/vocabsctgtCloseWord'
197 | closeword_lmemb = './voctbls/vocabsctgtCloseWord_1024_untied_'
198 | savedir = './results_sc_1024_untied_gpt2/'
199 | '''
200 |
201 | '''
202 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
203 |
204 | vocab_path = '../LM/LSTM/models_sc/vocabsctgt.pkl'
205 | modelclass_path = '../LM/LSTM'
206 | model_path = '../LM/LSTM/models_sc/sctgt_LSTM_untied.pth'
207 | closeword = './voctbls/vocabsctgtCloseWord'
208 | closeword_lmemb = './voctbls/vocabsctgtCloseWord_untied_'
209 | savedir = './results_sc_untied/'
210 | '''
211 |
212 |
213 | vocab_path = '../LM/LSTM/models/vocabTle.pkl'
214 | modelclass_path = '../LM/LSTM'
215 | model_path = '../LM/LSTM/models/Tle_LSTM_untied.pth'
216 | closeword = './voctbls/vocabTleCloseWord'
217 | closeword_lmemb = './voctbls/vocabTleCloseWord_untied_'
218 | savedir = './results_gpt2/'
219 |
220 |
221 | # vocab_path = '../LM/LSTM/models/vocabTle.pkl'
222 | # modelclass_path = '../LM/LSTM'
223 | # model_path = '../LM/LSTM/models/Tle_LSTM.pth'
224 | # closeword = 'vocabTleCloseWord'
225 | # closeword_lmemb = 'vocabTleCloseWord'
226 | # savedir = './results/'
227 |
228 |
229 | # vocab_path = '../LM/LSTM_LUC/models/vocabTle50k.pkl'
230 | # modelclass_path = '../LM/LSTM_LUC'
231 | # model_path = '../LM/LSTM_LUC/models/TleLUC_wtI_noB_0-0.0001-1Penalty.pth'
232 | # closeword = 'vocabTle50kCloseWord'
233 | # closeword_lmemb = 'vocabTle50kCloseWord'
234 |
235 | # vocab_path = '../4.0_cluster/vocabTle.pkl' # vocabulary for the pretrained language model
236 | # modelclass_path = '../LM/LSTM_MoS'
237 | # model_path = '../LM/LSTM_MoS/models/LMModelMoSTle2.pth'
238 | # closeword = '../4.0_cluster/vocabTleCloseWord' # character level word embeddings
239 | # closeword_lmemb = 'vocabTleCloseWord'
240 |
241 |
242 | ##### beam search parameters
243 | begineos = True
244 | appendsenteos = True
245 | eosavgemb = False
246 | max_step = 20
247 | beam_width = 10
248 | beam_width_start = 10
249 | # mono = True
250 | renorm = False
251 | cluster = True
252 | temperature = 1
253 | bpe2word = 'last'
254 | alpha = 0.1
255 | alpha_start = alpha
256 | stopbyLMeos = False
257 | # ifadditive = False
258 | beta = 0.0
259 |
260 | # find word list
261 | numwords = 6
262 | numwords_outembed = -1
263 | numwords_freq = 500
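# As described by the --n/--ns/--nf options below, the candidate word list for beam search is built
# (in findwordlist_screened2) from the 'numwords' closest words to each source token under the
# character-level embedding, screened by the 'numwords_outembed' closest words under the LM output
# embedding, plus the 'numwords_freq' most frequent vocabulary words.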
264 |
265 | # whether to fix the generation length
266 | fixedlen = False
267 | genlen = '9'
268 | # genlen = [9] # including the starting '<eos>' token, and can include the ending '<eos>' token as well (if not 'stopbyLMeos')
269 |
270 |
271 | def parse_args():
272 | parser = argparse.ArgumentParser(description='Generate summaries from a source file in an unsupervised way.')
273 | # source file
274 | parser.add_argument('--src', type=str, default=arttxtpath, help='source sentences file')
275 | parser.add_argument('--devid', type=int, default=devid, help='device id; -1 for cpu')
276 | # preparations
277 | parser.add_argument('--vocab', type=str, default=vocab_path, help='vocabulary file')
278 | parser.add_argument('--modelclass', type=str, default=modelclass_path, help='location of the model class definition file')
279 | parser.add_argument('--model', type=str, default=model_path, help='pre-trained language model')
280 | parser.add_argument('--closeword', type=str, default=closeword, help='character embedding close word tables')
281 | parser.add_argument('--closeword_lmemb', type=str, default=closeword_lmemb, help='LM output embedding close word tables')
282 | parser.add_argument('--savedir', type=str, default=savedir, help='directory to save results')
283 | # beam search parameters
284 | parser.add_argument('--begineos', type=int, default=int(begineos), help='whether to start with <eos>')
285 | parser.add_argument('--appendsenteos', type=int, default=int(appendsenteos), help='whether to append <eos> at the end of the source sentence')
286 | parser.add_argument('--eosavgemb', type=int, default=int(eosavgemb), help='whether to encode <eos> using average hidden states (deprecated)')
287 | parser.add_argument('--max_step', type=int, default=max_step, help='maximum beam step')
288 | parser.add_argument('--beam_width', type=int, default=beam_width, help='beam width')
289 | parser.add_argument('--beam_width_start', type=int, default=beam_width_start, help='beam width at first step')
290 | parser.add_argument('--renorm', type=int, default=int(renorm), help='whether to renormalize the probabilities over the sub-vocabulary')
291 | parser.add_argument('--cluster', type=int, default=int(cluster), help='whether to do clustering for the sub-vocabulary probabilities')
292 | parser.add_argument('--temp', type=float, default=temperature, help='temperature used to smooth the output of the softmax layer')
293 | parser.add_argument('--bpe2word', type=str, default=bpe2word, choices=['last', 'avg'], help='how to use BPE hidden states to represent a word')
294 | parser.add_argument('--alpha', type=float, default=alpha, help='mixture coefficient for LM')
295 | parser.add_argument('--alpha_start', type=float, default=alpha_start, help='mixture coefficient for LM for the first step')
296 | parser.add_argument('--stopbyLMeos', type=int, default=int(stopbyLMeos), help='whether to stop the sentence solely by LM prediction')
297 | parser.add_argument('--beta', type=float, default=beta, help='length penalty')
298 | parser.add_argument('--n', type=int, default=numwords, help='number of closest words for each token to form the candidate list')
299 | parser.add_argument('--ns', type=int, default=numwords_outembed, help='number of closest words for each token in the output embedding space to screen the candidate list')
300 | parser.add_argument('--nf', type=int, default=numwords_freq, help='number of the most frequent words in the vocabulary to keep in the candidate list')
301 | parser.add_argument('--fixedlen', type=int, default=int(fixedlen), help='whether to generate
fixed length summaries')
302 | parser.add_argument('--genlen', type=str, default=genlen, help='lengths of summaries to be generated; should be comma separated')
303 |
304 | args = parser.parse_args()
305 | return args
306 |
307 |
308 | if __name__ == '__main__':
309 | args = parse_args()
310 |
311 | ##### input arguments
312 | arttxtpath = args.src
313 |
314 | devid = args.devid
315 |
316 | vocab_path = args.vocab # vocabulary for the pre-trained language model
317 | modelclass_path = args.modelclass
318 | model_path = args.model
319 |
320 | closewordsim_path = args.closeword + 'Sims.pkl'
321 | closewordind_path = args.closeword + 'Indices.pkl' # character level word embeddings
322 | closewordsim_outembed_path = args.closeword_lmemb + 'Sims_outembed_' + os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
323 | closewordind_outembed_path = args.closeword_lmemb + 'Indices_outembed_' + os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
324 |
325 | device = 'cpu' if devid == -1 else f'cuda:{devid}'
326 |
327 | ##### beam search parameters
328 | begineos = args.begineos
329 | appendsenteos = args.appendsenteos
330 | eosavgemb = args.eosavgemb if appendsenteos else False
331 | max_step = args.max_step
332 | beam_width = args.beam_width
333 | beam_width_start = args.beam_width_start
334 | mono = True
335 | renorm = args.renorm
336 | cluster = args.cluster
337 | temp = args.temp
338 | bpe2word = args.bpe2word
339 | alpha = args.alpha
340 | alpha_start = args.alpha_start
341 | stopbyLMeos = args.stopbyLMeos
342 | ifadditive = False
343 | beta = args.beta
344 | numwords = args.n
345 | numwords_outembed = args.ns if args.ns != -1 else numwords
346 | numwords_freq = args.nf
347 | fixedlen = args.fixedlen
348 | genlen = list(map(int, args.genlen.split(','))) # including the starting '<eos>' token
349 | # and can include the ending '<eos>' token as well (if not 'stopbyLMeos')
350 |
351 | ##### read in the article/source sentences to be summarized
352 | g = open(arttxtpath, 'r')
353 | sents = [line.strip() for line in g if line.strip()]
354 | g.close()
355 | nsents = len(sents)
356 |
357 | ##### load the GPT-2 embedder class
358 | ge = GPT2Embedder(cuda_device=devid)
359 |
360 | ##### load vocabulary and the pre-trained language model
361 | vocab = pickle.load(open(vocab_path, 'rb'))
362 |
363 | if modelclass_path not in sys.path:
364 | sys.path.insert(1, modelclass_path) # this is for torch.load to load the entire model; the model class file must be included in the search path
365 | LMModel = torch.load(model_path, map_location=torch.device(device))
366 | embedmatrix = LMModel.proj.weight
367 |
368 | ##### check if the close_tables exist already; if not, generate
369 | if not os.path.exists(closewordind_path):
370 | # character embeddings of the vocabulary
371 | embedmatrix_cnn = ELMoBotEmbedding(vocab.itos, device=devid)
372 | values_cnn, indices_cnn = findclosewords_vocab(vocab, embedmatrix_cnn, numwords=500)
373 | # save results
374 | pickle.dump(values_cnn, open(closewordsim_path, 'wb'))
375 | pickle.dump(indices_cnn, open(closewordind_path, 'wb'))
376 |
377 | if not os.path.exists(closewordind_outembed_path):
378 | values, indices = findclosewords_vocab(vocab, embedmatrix, numwords=500)
379 | # save results
380 | pickle.dump(values, open(closewordsim_outembed_path, 'wb'))
381 | pickle.dump(indices, open(closewordind_outembed_path, 'wb'))
382 |
383 | closewordind = pickle.load(open(closewordind_path, 'rb'))
384 | closewordind_outembed = pickle.load(open(closewordind_outembed_path, 'rb'))
385 |
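# Both tables are precomputed above if missing: closewordind lists, for each vocabulary word, the
# indices of its (up to 500) nearest words under the character-level embedding, and
# closewordind_outembed the analogous indices under the LM output embedding (LMModel.proj.weight).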
386 | ##### generate save file name
387 | basename = os.path.basename(arttxtpath)
388 | basename = os.path.splitext(basename)[0]
389 |
390 | savedir = args.savedir
391 | # savedir = './results/'
392 |
393 | smrypath = os.path.join(savedir, 'smry_') + basename + f'_Ks{beam_width_start}' + f'_clust{int(cluster)}'
394 |
395 | if renorm:
396 | smrypath += f'_renorm{int(renorm)}'
397 | if temp != 1:
398 | smrypath += f'_temper{temp}'
399 | if bpe2word != 'last':
400 | smrypath += f'_BPE{bpe2word}'
401 |
402 | # smrypath += f'_eosavg{int(eosavgemb)}' + f'_n{numwords}'
403 | smrypath += f'_n{numwords}'
404 |
405 | if numwords_outembed != numwords:
406 | smrypath += f'_ns{numwords_outembed}'
407 | if numwords_freq != 500:
408 | smrypath += f'_nf{numwords_freq}'
409 | if beam_width != 10:
410 | smrypath += f'_K{beam_width}'
411 | if stopbyLMeos:
412 | smrypath += f'_soleLMeos'
413 | ############################
414 | # smrypath += '_close1'
415 | ############################
416 | if alpha_start != alpha:
417 | smrypath += f'_as{alpha_start}'
418 | if fixedlen:
419 | genlen = sorted(genlen)
420 | smrypath_list = [smrypath + f'_length{l - 1}' + f'_a{alpha}' + '_all.txt' for l in genlen]
421 | else:
422 | smrypath += f'_a{alpha}' + f'_b{beta}' + '_all.txt'
423 |
424 | ##### run summary generation and write to file
425 | if fixedlen:
426 | g_list = [open(fname, 'w') for fname in smrypath_list]
427 | else:
428 | g = open(smrypath, 'w')
429 |
430 | start = time.time()
431 | for ind in tqdm(range(nsents)):
432 | template = sents[ind].strip('.').strip() # remove '.' at the end
433 | if appendsenteos:
434 | template += ' <eos>'
435 |
436 | ### Find the close words to those in the template sentence
437 | # word_list, subvocab = findwordlist(template, closewordind, vocab, numwords=1, addeos=True)
438 | # word_list, subvocab = findwordlist_screened(template, closewordind, closewordind_outembed, vocab, numwords=6, addeos=True)
439 | word_list, subvocab = findwordlist_screened2(template, closewordind, closewordind_outembed, vocab, numwords=numwords, numwords_outembed=numwords_outembed, numwords_freq=numwords_freq, addeos=True)
440 | if cluster:
441 | clustermask = clmk_nn(embedmatrix, subvocab)
442 |
443 | ### GPT-2 embedding of the template sentence
444 | if not eosavgemb:
445 | template_vec, _ = ge.embed_sentence(template.split(), add_bos=True, bpe2word=bpe2word)
446 | else:
447 | raise ValueError('eosavgemb is not supported with the GPT-2 embedder')
448 |
449 | ### beam search
450 | max_step_temp = min([len(template.split()), max_step])
451 | beam = gensummary_gpt2(template_vec,
452 | ge,
453 | vocab,
454 | LMModel,
455 | word_list,
456 | subvocab,
457 | clustermask=clustermask if cluster else None,
458 | renorm=renorm,
459 | temperature=temp,
460 | bpe2word=bpe2word,
461 | max_step=max_step_temp,
462 | beam_width=beam_width,
463 | beam_width_start=beam_width_start,
464 | mono=True,
465 | alpha=alpha,
466 | alpha_start=alpha_start,
467 | begineos=begineos,
468 | stopbyLMeos=stopbyLMeos,
469 | devid=devid)
470 |
471 | ### sort and write to file
472 | if fixedlen:
473 | for j in range(len(genlen) - 1, -1, -1):
474 | g_list[j].write('-' * 5 + f'<{ind+1}>' + '-' * 5 + '\n')
475 | g_list[j].write('\n')
476 | if genlen[j] <= beam.step:
477 | ssa = fixlensummary(beam, length=genlen[j])
478 | if ssa == []:
479 | g_list[j].write('\n')
480 | else:
481 | for m in range(len(ssa)):
482 | g_list[j].write(' '.join(ssa[m][1][1:]) + '\n')
483 | g_list[j].write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3])
484 | + ' ' + '{:.3f}'.format(ssa[m][4]) + '\n')
485 |
g_list[j].writelines(['%d, ' % loc for loc in ssa[m][2]])
486 | g_list[j].write('\n')
487 | g_list[j].write('\n')
488 | else:
489 | g_list[j].write('\n')
490 |
491 | if (ind + 1) % 10 == 0:
492 | g_list[j].flush()
493 | os.fsync(g_list[j].fileno())
494 | else:
495 | ssa = sortsummary(beam, beta=beta)
496 | g.write('-' * 5 + f'<{ind+1}>' + '-' * 5 + '\n')
497 | g.write('\n')
498 | if ssa == []:
499 | g.write('\n')
500 | else:
501 | for m in range(len(ssa)):
502 | g.write(' '.join(ssa[m][1][1:]) + '\n')
503 | g.write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3]) + ' ' + '{:.3f}'.format(ssa[m][4]) + '\n')
504 | g.writelines(['%d, ' % loc for loc in ssa[m][2]])
505 | g.write('\n')
506 | g.write('\n')
507 |
508 | if (ind + 1) % 10 == 0:
509 | g.flush()
510 | os.fsync(g.fileno())
511 |
512 | print('time elapsed %s' % timeSince(start))
513 | if fixedlen:
514 | for gg in g_list:
515 | gg.close()
516 | print('results saved to: %s' % (("\n" + " " * 18).join(smrypath_list)))
517 | else:
518 | g.close()
519 | print(f'results saved to: {smrypath}')
520 |
521 |
--------------------------------------------------------------------------------
/uss/summary_select_eval.py:
--------------------------------------------------------------------------------
1 | """
2 | Post-processing of the generated summary sentences:
3 | 1. From all the candidate summaries produced by beam search, select one using a length penalty
4 | 2. Evaluation of Rouge scores, copy rate, compression rate, etc.
5 | """
6 | import os
7 | import argparse
8 |
9 |
10 | def copy_rate(sent1, sent2):
11 | """
12 | Copy rate between two sentences.
13 | In particular, the proportion of sentence 1 that is copied from sentence 2.
14 |
15 | Input:
16 | sent1, sent2: two sentence strings (generated summary, source).
17 | Output:
18 | score: copy rate on unigrams.
19 | """ 20 | sent1_split = set(sent1.split()) 21 | sent2_split = set(sent2.split()) 22 | intersection = sent1_split.intersection(sent2_split) 23 | # recall = len(intersection) / len(sent2_split) 24 | precision = len(intersection) / len(sent1_split) 25 | # union = sent1_split.union(sent2_split) 26 | # jacd = 1 - len(intersection) / len(union) # jacquard distance 27 | # score = stats.hmean([recall, precision]) # F1 score (need to import scipy.stats.hmean) 28 | # score = 2 * recall * precision / (recall + precision) if recall != 0 and precision != 0 else 0 # F1 score 29 | 30 | return precision 31 | 32 | 33 | # =============== some default path arguments ================================== 34 | src = '/n/rush_lab/users/jzhou/LM/data/Giga-sum/input_unk.txt' 35 | ref = '/n/rush_lab/users/jzhou/LM/data/Giga-sum/task1_ref0.txt' 36 | 37 | ''' 38 | gens = '/n/rush_lab/users/jzhou/5.0_cluster/results_untied/smry_input_unk_Ks10_clust0_temper10.0_ELcat_eosavg0_n6_ns10_nf300_a0.1_b0.0_all.txt' 39 | save_dir = './results_untied/' 40 | ''' 41 | gen = './results_gpt2/smry_input_unk_Ks10_clust1_n6_ns10_nf300_a0.1_b0.0_all.txt' 42 | save_dir = './results_gpt2/' 43 | 44 | lp = 0.1 45 | # =============================================================================== 46 | 47 | 48 | def parse_args(): 49 | parser = argparse.ArgumentParser(description='Post-processing and evaluation of the generated summary sentences') 50 | parser.add_argument('--src', type=str, default=src, help='source sentence path') 51 | parser.add_argument('--ref', type=str, default=ref, help='reference summary path') 52 | parser.add_argument('--gen', type=str, default=gen, help='generated summary path') 53 | parser.add_argument('--save_dir', type=str, default=save_dir, help='directory to save the result') 54 | parser.add_argument('--lp', type=float, default=lp, help='length penalty (additive onto length)') 55 | args = parser.parse_args() 56 | return args 57 | 58 | 59 | if __name__ == '__main__': 60 | args = parse_args() 61 | 62 | # read in the source, reference, and generated summaries (a list of summaries for each source sentence) 63 | g = open(args.src, 'r') 64 | arts = [line.strip().strip(' .') for line in g if line.strip()] 65 | g.close() 66 | 67 | g = open(args.ref, 'r') 68 | refs = [line.strip() for line in g if line.strip()] 69 | g.close() 70 | 71 | g = open(args.gen, 'r') 72 | lines = [line.strip() for line in g if line.strip()] 73 | g.close() 74 | 75 | # length penalty for selecting the finished hypothesis from beam search 76 | # takes the form (length + lp) ^ b 77 | lp = args.lp 78 | b = 1.0 79 | 80 | # generate the new path to save the results 81 | basename = os.path.basename(args.gen) 82 | basename = os.path.splitext(basename)[0] 83 | 84 | gen_selected_path_new = os.path.join(args.save_dir, basename.replace('b0.0_all', f'b{b}_single') + '.txt') 85 | 86 | # select a single summary sentence for each source sentence with length penalty 87 | os.makedirs(args.save_dir, exist_ok=True) 88 | g = open(gen_selected_path_new, 'w') 89 | 90 | i = 0 91 | j = 1 92 | count = 0 93 | cp_rate = [] 94 | lens = [] 95 | comp_rate = [] 96 | while j <= len(lines): 97 | if j == len(lines) or lines[j].startswith('-----') and not lines[j].startswith('----- '): 98 | count += 1 99 | # from i to j-1 100 | curl = lines[(i + 1):j] 101 | ssa = [(curl[k], curl[k + 1], curl[k + 2]) for k in range(len(curl)) if k % 3 == 0] 102 | ssa = sorted(ssa, key=lambda x: float(x[1].split()[0]) / (len(x[0].split()) - lp) ** b, reverse=True) 103 | # float(x[1].split()[0]) for 
the combined score
104 | # float(x[1].split()[1]) for the contextual matching score
105 | # float(x[1].split()[2]) for the language model score
106 |
107 | # if arts[count - 1] == '':
108 | # g.write('\n')
109 | # else:
110 | # if len(ssa[0][0].split()) <= 1:
111 | # g.write(arts[count - 1])
112 | # g.write('\n')
113 | # cp_rate.append(1)
114 | # else:
115 | # g.write(' '.join(ssa[0][0].split()[:-1]))
116 | # g.write('\n')
117 | # cp_rate.append(copy_rate(' '.join(ssa[0][0].split()[:-1]), arts[count - 1]))
118 |
119 | if len(ssa[0][0].split()) <= 1:
120 | # blank line: directly copy the source for summary
121 | g.write(arts[count - 1])
122 | g.write('\n')
123 | cp_rate.append(1)
124 | comp_rate.append(1)
125 | lens.append(len(arts[count - 1].split()))
126 | else:
127 | g.write(' '.join(ssa[0][0].split()[:-1])) # do not include the last token, which is to match the <eos>
128 | g.write('\n')
129 | cp_rate.append(copy_rate(' '.join(ssa[0][0].split()[:-1]), arts[count - 1]))
130 | comp_rate.append(len(ssa[0][0].split()[:-1]) / len(arts[count - 1].split()))
131 | lens.append(len(ssa[0][0].split()[:-1]))
132 | i = j
133 | j += 1
134 |
135 | g.close()
136 |
137 | # print out the results and calculate the Rouge scores
138 | os.system('sed -i "s/<unk>/UNK/g" ' + gen_selected_path_new)
139 | print('copy rate: %f' % (sum(cp_rate) / len(cp_rate)))
140 | print('compression rate: %f' % (sum(comp_rate) / len(comp_rate)))
141 | print('average summary length: %f' % (sum(lens) / len(lens)))
142 | os.system('files2rouge ' + gen_selected_path_new + ' ' + args.ref)
143 |
--------------------------------------------------------------------------------
/uss/utils.py:
--------------------------------------------------------------------------------
1 | import time
2 | import math
3 |
4 |
5 | def timeSince(start):
6 | now = time.time()
7 | s = now - start
8 | m = math.floor(s / 60)
9 | s -= m * 60
10 | h = math.floor(m / 60)
11 | m -= h * 60
12 | if h == 0:
13 | return '%dm %.3fs' % (m, s)
14 | else:
15 | return '%dh %dm %.3fs' % (h, m, s)
16 |
--------------------------------------------------------------------------------