├── .gitignore
├── LICENSE
├── README.md
├── data
│   ├── gigaword
│   │   ├── README.md
│   │   ├── input.txt
│   │   └── task1_ref0.txt
│   └── sentence_compression
│       ├── README.md
│       ├── eval_src_1000_unk.txt
│       └── eval_tgt_1000_unk.txt
├── figure1.png
├── lm_lstm
│   ├── dataload.py
│   ├── main.py
│   ├── model.py
│   ├── testppl.py
│   ├── train.py
│   └── utils.py
├── results_elmo_giga
│   ├── README.md
│   ├── smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b1.0_single.txt
│   └── smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_best_f1.txt
├── results_elmo_sc
│   ├── README.md
│   ├── smry_eval_src_1000_unk_Ks10_clust1_ELcat_eosavg0_n1_a0.1_b1.0_single.txt
│   └── smry_eval_src_1000_unk_Ks10_clust1_ELcat_eosavg0_n1_a0.1_best_f1.txt
└── uss
    ├── beam_search.py
    ├── elmo_lstm_forward.py
    ├── elmo_sequential_embedder.py
    ├── gpt2_sequential_embedder.py
    ├── lm_subvocab.py
    ├── pre_closetables.py
    ├── pre_word_list.py
    ├── sim_embed_score.py
    ├── sim_token_match.py
    ├── summary_search_elmo.py
    ├── summary_search_gpt2.py
    ├── summary_select_eval.py
    └── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | *.idea
3 |
4 | lm_lstm_models/*
5 |
6 | voctbls/*
7 | results_gpt2/*
8 | results_elmo_giga/*
9 | results_elmo_sc/*
10 | *.sh
11 | *.tar.gz
12 |
13 | !data/gigaword/README.md
14 | !data/gigaword/input.txt
15 | !data/gigaword/input_unk.txt
16 | !data/gigaword/task1_ref0.txt
17 | !data/sentence_compression/eval_src_1000_unk.txt
18 | !data/sentence_compression/eval_tgt_1000_unk.txt
19 |
20 | !results_elmo_giga/README.md
21 | !results_elmo_giga/smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b1.0_single.txt
22 | !results_elmo_sc/README.md
23 | !results_elmo_sc/smry_eval_src_1000_unk_Ks10_clust1_ELcat_eosavg0_n1_a0.1_b1.0_single.txt
24 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 jzhou316
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Unsupervised Sentence Summarization
2 |
3 | [MIT License](LICENSE)
4 |
5 | Unsupervised sentence summarization by contextual matching.
6 |
7 | This is the code for the paper: \
8 | [Simple Unsupervised Summarization by Contextual Matching](https://arxiv.org/pdf/1907.13337.pdf) (ACL 2019) \
9 | Jiawei Zhou, Alexander Rush
10 |
11 |
12 |
13 |
14 |
15 | ## Overview
16 |
17 | Using contextual word embeddings (e.g., from a pre-trained ELMo model) along with a language model trained on summary-style sentences, we are able to produce sentence-level summaries in an unsupervised way, without being exposed to any paired data.
18 |
19 | Summaries are generated via beam search to maximize a product-of-experts score that combines a contextual matching model (relying on pre-trained left-contextual embeddings) with a language fluency model (based on a summary-domain-specific language model). This works for both abstractive and extractive sentence-level summarization.
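For intuition, here is a minimal sketch of a single step of this scoring, assuming (as a simplification) that the product of experts amounts to a weighted sum of the two log-probabilities, with the weight playing the role of the `--alpha` flag used below; the exact combination is implemented in `uss/beam_search.py`:

```python
def combined_logprob(log_p_match: float, log_p_lm: float, alpha: float = 0.1) -> float:
    """Sketch of a product-of-experts step score: a product of the contextual
    matching probability and the fluency (LM) probability in probability space
    becomes a sum in log space. Here `alpha` is assumed to weight the fluency
    expert; the exact formula in the paper/code may differ."""
    return log_p_match + alpha * log_p_lm
```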
20 |
21 | **Note:** as our generation process is left-to-right, we use only the forward model in ELMo. We also tested the small version of GPT-2 after its release, but found it did not perform as well as ELMo in our setup.
22 |
23 |
24 | ## Dependencies
25 |
26 | The code was developed and tested with the following libraries:
27 | - python 3.6
28 | - PyTorch 0.4.1
29 | - allennlp 0.5.1
30 |
31 | For Rouge evaluation, we used [files2rouge](https://github.com/pltrdy/files2rouge).
32 |
33 |
34 | ## Datasets & Summary Results & Pre-trained Language Models
35 |
36 | | Data & Task | Test Set & Unsupervised Model Output | Summary LM & Vocabulary | Full Dataset |
37 | |:---:|:---:|:---:|:---:|
38 | | English Gigaword<br>(abstractive summarization) | [test data](./data/gigaword)<br>[model output](./results_elmo_giga) | [language model](https://drive.google.com/file/d/1iF0tLvoo74-o22-1jUjMTrLwK948sMKp/view?usp=sharing) | [full data](https://github.com/harvardnlp/sent-summary) |
39 | | Google Sentence Compression<br>(extractive summarization) | [test data](./data/sentence_compression)<br>[model output](./results_elmo_sc) | [language model](https://drive.google.com/file/d/1KVh7J6Mpj6W5YFV0DPAb81OwJSo26C7g/view?usp=sharing) | [full data](https://github.com/google-research-datasets/sentence-compression) |
40 |
41 | ## Unsupervised Summary Generation
42 |
43 | To generate summaries for a given corpus of source sentences, make sure the following two components are prepared:
44 | - The ELMo model contained in the [allennlp](https://github.com/allenai/allennlp) library package
45 | - A pre-trained LSTM based language model on the summary style short sentences (we have included our language modeling and training scripts in [lm_lstm](./lm_lstm), as well as our pre-trained models [above](#Datasets-&-Summary-Results-&-Pre-trained-Language-Models))
46 |
47 | ---
48 |
49 | Suppose the file structure is as follows:
50 | ```
51 | ├── ./
52 | ├── data/
53 | ├── gigaword/
54 | ├── sentence_compression/
55 | ├── lm_lstm/
56 | ├── lm_lstm_models/
57 | ├── gigaword/
58 | ├── sentence_compression/
59 | ├── uss/
60 | ├── ...
61 | ├── ...
62 | ```
63 |
64 | where we use two datasets as examples, the English Gigaword dataset for abstractive sentence summarization and the Google sentence compression dataset for extractive sentence summarization, as used in the paper. Suppose these data are stored in the `./data/` directory.
65 |
66 | For the following commands we take the [English Gigaword dataset](https://github.com/harvardnlp/sent-summary) as an example.
67 |
68 | ---
69 |
70 | **To train a summary domain specific language model:**
71 |
72 | ```
73 | python lm_lstm/main.py --data_src user --userdata_path ./data/gigaword --userdata_train train.title.txt --userdata_val valid.title.filter.txt --userdata_test task1_ref0_unk.txt --bptt 32 --bsz 256 --embedsz 1024 --hiddensz 1024 --tieweights 0 --optim SGD --lr 0.1 --gradclip 15 --epochs 50 --vocabsave ./lm_lstm_models/gigaword/vocabTle.pkl --save ./lm_lstm_models/gigaword/Tle_LSTM_untied.pth --devid 0
74 | ```
75 | Remember to check and adjust the data file paths and names, and to save the vocabulary and model to appropriate locations. For a full list of hyperparameters and their meanings, use `python lm_lstm/main.py --help` or check the python script.
76 |
77 | **A minor detail**: in the processed [English Gigaword dataset](https://github.com/harvardnlp/sent-summary) we used, the training and validation sets represent unknown words as `<unk>`, whereas in the test set they are represented as `UNK`. When training the language model we replace the `UNK` tokens in the test summary set with `<unk>`. After generating the test summaries, to compare with the original reference, we map `<unk>` back to `UNK` so that the Rouge evaluation is consistent with the literature.
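A minimal sketch of this token mapping (the helper and the file names for the generated summaries are illustrative, not part of the repository):

```python
def replace_token(in_path, out_path, src_tok, tgt_tok):
    # rewrite a whitespace-tokenized file, swapping one token for another
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            fout.write(' '.join(tgt_tok if tok == src_tok else tok
                                for tok in line.split()) + '\n')

# before LM training/testing: UNK -> <unk> in the test summaries
replace_token('./data/gigaword/task1_ref0.txt', './data/gigaword/task1_ref0_unk.txt', 'UNK', '<unk>')
# before Rouge evaluation: <unk> -> UNK in the generated summaries (placeholder file names)
replace_token('generated_summaries.txt', 'generated_summaries_UNK.txt', '<unk>', 'UNK')
```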
78 |
79 | In our experiments, the above command should produce a language model that achieves a training perplexity of ~62, a validation perplexity of ~72, and a test perplexity of ~201 (the test set is quite different from the training and validation sets). Training takes about a day on a DGX V100 GPU.
80 |
81 | ---
82 |
83 | After obtaining the summary domain specific language model, we can do the **summary generation using the following command:**
84 |
85 | ```
86 | python uss/summary_search_elmo.py --src ./data/gigaword/input_unk.txt --modelclass ./lm_lstm --model ./lm_lstm_models/gigaword/Tle_LSTM_untied.pth --vocab ./lm_lstm_models/gigaword/vocabTle.pkl --n 6 --ns 10 --nf 300 --elmo_layer cat --alpha 0.1 --beta 0 --beam_width 10 --devid 0 --save_dir ./results_elmo_giga/
87 | ```
88 |
89 | where:
90 | - `--src` specifies the sentence corpus to be summarized
91 | - `--modelclass` is the directory in which the language model source script is saved
92 | - `--model` is the path of the language model to be used
93 | - `--vocab` is the vocabulary file associated with the language model
94 |
95 | and finally the results will be saved in the directory `--save_dir`, with a file name generated automatically from the user-specified hyperparameters. For the full list of hyperparameters for the summary generation process and their meanings (although most of them were used for experimental purposes and need not be changed), use `python uss/summary_search_elmo.py --help` or check the python script.
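For reference, the files passed via `--model` and `--vocab` are a whole-module `torch.save` checkpoint and a pickled torchtext vocabulary produced by `lm_lstm/main.py`; a minimal loading sketch (mirroring `lm_lstm/testppl.py`, with `lm_lstm` on the Python path so the model class can be resolved, cf. `--modelclass`):

```python
import pickle
import torch

# pickled torchtext vocabulary object saved during LM training
vocab = pickle.load(open('./lm_lstm_models/gigaword/vocabTle.pkl', 'rb'))
# the LM was saved as a whole module with torch.save(model, path),
# so lm_lstm/model.py must be importable when loading
LMModel = torch.load('./lm_lstm_models/gigaword/Tle_LSTM_untied.pth', map_location='cpu')
LMModel.eval()
print(f'Vocab size: {len(vocab)}, parameters: {sum(p.nelement() for p in LMModel.parameters())}')
```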
96 |
97 | With the above command, a file named "smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b0.0_all.txt" containing all the generated summaries will be saved in the directory "./results_elmo_giga/". In this file, for each source sentence, all of the finished hypotheses from beam search are saved as candidate summaries, along with their alignments to the original source sentence, their combined scores, contextual matching scores, and language modeling scores. Note that this search process is relatively slow, as we need to calculate the contextual embeddings for every sentence prefix and every candidate next word, even with our caching and batching optimizations.
98 |
99 |
100 | ## Evaluation
101 |
102 | To be consistent with the literature, we need to select one summary from the candidate list to compare with the reference summary and compute metrics such as Rouge scores. Since our generation is unsupervised, it can be difficult to select the best summary from a list of candidates, and there is often a better candidate than the one we select. Nevertheless, we use a simple length-penalized beam search score as our selection criterion.
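A hedged sketch of this selection rule (the actual logic lives in `uss/summary_select_eval.py`; here we assume the `--lp` value is simply added per generated token to each hypothesis's average beam score):

```python
def select_summary(candidates, lp=0.1):
    """candidates: list of (tokens, avg_beam_score) pairs for one source sentence."""
    # keep the hypothesis with the best additively length-penalized score
    return max(candidates, key=lambda c: c[1] + lp * len(c[0]))
```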
103 |
104 | **For summary selection and evaluation, run the following command:**
105 |
106 | ```
107 | python uss/summary_select_eval.py --src ./data/gigaword/input_unk.txt --ref ./data/gigaword/task1_ref0.txt --gen ./results_elmo_giga/smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b0.0_all.txt --save_dir ./results_elmo_giga --lp 0.1
108 | ```
109 |
110 | where:
111 | - `--src`: source sentences file path
112 | - `--ref`: reference sentences file path
113 | - `--gen`: generated summary sentences file path
114 | - `--save_dir`: directory to save selected summaries
115 | - `--lp`: additive length penalty (usually between -0.1 and 0.1)
116 |
117 | This will generate a file named "smry_input_unk_Ks10_clust1_ELcat_eosavg0_n6_ns10_nf300_a0.1_b1.0_single.txt" in "./results_elmo_giga/", containing a single selected summary for each source sentence. Rouge scores will be computed and printed, along with other statistics including the copy rate, compression rate, and average summary length.
118 |
119 | Note that the Rouge evaluation is based on [files2rouge](https://github.com/pltrdy/files2rouge).
120 |
121 | ## Data and Sample Output
122 |
123 | We have included the test set of the English Gigaword dataset and the Google sentence compression evaluation set in the "./data" folder.
124 |
125 | We also include the summary outputs from our unsupervised method for these two test sets in "./results_elmo_giga" and "./results_elmo_sc" respectively.
126 |
127 | ## Citing
128 |
129 | If you find the resources in this repository useful, please consider citing:
130 |
131 | ```
132 | @inproceedings{zhou2019simple,
133 | title={Simple Unsupervised Summarization by Contextual Matching},
134 | author={Zhou, Jiawei and Rush, Alexander M},
135 | booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
136 | pages={5101--5106},
137 | year={2019}
138 | }
139 | ```
140 |
--------------------------------------------------------------------------------
/data/gigaword/README.md:
--------------------------------------------------------------------------------
1 | ## English Gigaword Summarization Test Set
2 |
3 | - **input.txt**: source sentences
4 | - **input_unk.txt**: source sentences after replacing `UNK` with `<unk>`
5 | - **task1_ref0.txt**: reference summary sentences
6 |
--------------------------------------------------------------------------------
/data/sentence_compression/README.md:
--------------------------------------------------------------------------------
1 | ## Google Sentence Compression Test Set
2 |
3 | For extractive sentence summarization. The data is cleaned and pre-processed from the original dataset in a way similar to the Gigaword dataset.
4 |
--------------------------------------------------------------------------------
/figure1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jzhou316/Unsupervised-Sentence-Summarization/29f3e23d608143b8d09a98fac3968b6f4f97302e/figure1.png
--------------------------------------------------------------------------------
/lm_lstm/dataload.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Wed Jun 20 23:25:26 2018
4 |
5 | @author: zjw
6 | """
7 |
8 | import torchtext
9 |
10 |
11 | def loadPTB(root='E:/NLP/LM/data', batch_size=64, bptt_len=32, device=None, **kwargs):
12 | """
13 | Load the Penn Treebank dataset. Download if not existing.
14 | """
15 | TEXT = torchtext.data.Field(lower=True)
16 | train, val, test = torchtext.datasets.PennTreebank.splits(root=root, text_field=TEXT)
17 | TEXT.build_vocab(train, **kwargs) # could include: max_size, min_freq, vectors
18 | train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits((train, val, test),
19 | batch_size=batch_size,
20 | bptt_len=bptt_len,
21 | device=device,
22 | repeat=False)
23 |
24 | return TEXT, train_iter, val_iter, test_iter
25 |
26 |
27 | def loadWiki2(root='E:/NLP/LM/data', batch_size=64, bptt_len=32, device=None, **kwargs):
28 | """
29 | Load the WikiText2 dataset. Download if not existing.
30 | """
31 | TEXT = torchtext.data.Field(lower=True)
32 | train, val, test = torchtext.datasets.WikiText2.splits(root=root, text_field=TEXT)
33 | TEXT.build_vocab(train, **kwargs)
34 | train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits((train, val, test),
35 | batch_size=batch_size,
36 | bptt_len=bptt_len,
37 | device=device,
38 | repeat=False)
39 |
40 | return TEXT, train_iter, val_iter, test_iter
41 |
42 |
43 | def loadLMdata(path='E:/NLP/LM/data/penn-tree-bank-small',
44 | train='ptb.train.5k.txt',
45 | val='ptb.valid.txt',
46 | test='ptb.test.txt',
47 | batch_size=64,
48 | bptt_len=32,
49 | device=None, **kwargs):
50 | """
51 | Load a dataset for LM training. The dataset should exist already.
52 | """
53 | TEXT = torchtext.data.Field(lower=True)
54 | train, val, test = torchtext.datasets.LanguageModelingDataset.splits(path=path,
55 | train=train,
56 | validation=val,
57 | test=test,
58 | text_field=TEXT)
59 | TEXT.build_vocab(train, val, test, **kwargs)
60 | train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits((train, val, test),
61 | batch_size=batch_size,
62 | bptt_len=bptt_len,
63 | device=device,
64 | repeat=False)
65 |
66 | return TEXT, train_iter, val_iter, test_iter
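

# Illustrative usage (not part of the original file): load a user-provided
# summary corpus for LM training, mirroring how lm_lstm/main.py calls this
# function; the paths below follow the README example and are placeholders.
if __name__ == '__main__':
    TEXT, train_iter, val_iter, test_iter = loadLMdata(
        path='./data/gigaword',
        train='train.title.txt',
        val='valid.title.filter.txt',
        test='task1_ref0_unk.txt',
        batch_size=256,
        bptt_len=32,
        device='cuda:0',
        min_freq=5)  # vocabulary built with a minimum frequency cut-off
    print(f'Vocabulary size: {len(TEXT.vocab)}')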
--------------------------------------------------------------------------------
/lm_lstm/main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Oct 26 2018
4 |
5 | @author: zjw
6 | """
7 | import torch
8 | import torch.optim as optim
9 | from dataload import loadPTB, loadWiki2, loadLMdata
10 | from model import RNNModel
11 | from train import training, validating
12 | # from train_sharding import training, validating
13 | from utils import logging
14 |
15 | import os
16 | import sys
17 | # import random
18 | import argparse
19 | import time
20 | import importlib
21 | import pickle
22 |
23 |
24 | ########## set up parameters
25 | # data
26 | data_src = 'ptb'
27 | # on MicroSoft Azure
28 | # data_root = '/media/work/LM/data'
29 | # userdata_path = '/media/work/LM/data/Giga-sum' # .../penn-treebank-small
30 | # on Harvard Odyssey Cluster
31 | data_root = '/n/rush_lab/users/jzhou/LM/data'
32 | userdata_path = '/n/rush_lab/users/jzhou/LM/data/Giga-sum' # .../penn-treebank-small
33 | userdata_train = 'train.title.txt'
34 | userdata_val = 'valid.title.filter.txt'
35 | userdata_test = 'task1_ref0_unk.txt'
36 | batch_size = 128
37 | bptt_len = 32
38 | # model
39 | embed_size = 512 # 1024
40 | hidden_size = 512 # 1024
41 | num_layers = 2
42 | dropout = 0.5
43 | tieweights = 0 # 0 for False, 1 for True
44 | # optimization
45 | learning_rate = 0.01
46 | momentum = 0.9
47 | weight_decay = 1e-4
48 | grad_max_norm = 120 # 1024, 0.01 ---> 120
49 | shard_size = 64
50 | ##subvocab_size = 0
51 | #learning_rate = 0.001
52 | #grad_max_norm = None
53 | num_epochs = 50
54 |
55 | vocabsavepath = './models/vocabTle.pkl'
56 | savepath = './models/Tle_LSTM.pth'
57 | #savepath = '/media/work/LM/LMModel.pth'
58 | #savepath = '/n/rush_lab/users/jzhou/LM/LMModel.pth'
59 |
60 |
61 | def parse_args():
62 | parser = argparse.ArgumentParser(description='Training an LSTM language model.')
63 | group = parser.add_mutually_exclusive_group()
64 | group.add_argument('--devid', type=int, default=-1, help='single device id; -1 for CPU')
65 | group.add_argument('--devids', type=str, default='off', help='multiple device ids for data parallel; use comma to separate, e.g. 0, 1, 2')
66 | # parser.add_argument('--devid', type=int, default=-1, help='device id; -1 for CPU')
67 | ## parser.add_argument('--modelfile', type=str, default='model', help='file name of the model, without .py')
68 | parser.add_argument('--seed', type=int, default=0, help='random seed')
69 | parser.add_argument('--logmode', type=str, default='w', help='logging file mode')
70 | # data loading
71 | parser.add_argument('--data_src', type=str, default=data_src, choices=['ptb', 'wiki2', 'user'], help='data source')
72 | parser.add_argument('--data_root', type=str, default=data_root, help='root path for PTB/Wiki2 dataset path')
73 | parser.add_argument('--userdata_path', type=str, default=userdata_path, help='user data path')
74 | parser.add_argument('--userdata_train', type=str, default=userdata_train, help='user data training set file name')
75 | parser.add_argument('--userdata_val', type=str, default=userdata_val, help='user data validating set file name')
76 | parser.add_argument('--userdata_test', type=str, default=userdata_test, help='user data testing set file name')
77 | parser.add_argument('--bptt', type=int, default=bptt_len, help='bptt length')
78 | parser.add_argument('--bsz', type=int, default=batch_size, help='batch size')
79 | parser.add_argument('--vocabsave', type=str, default=vocabsavepath, help='file path to save the vocabulary object')
80 | # model
81 | parser.add_argument('--embedsz', type=int, default=embed_size, help='word embedding size')
82 | parser.add_argument('--hiddensz', type=int, default=hidden_size, help='hidden state size')
83 | parser.add_argument('--numlayers', type=int, default=num_layers, help='number of layers')
84 | parser.add_argument('--dropout', type=float, default=dropout, help='dropout probability')
85 | # parser.add_argument('--tieweights', help='whether to tie input and output embedding weights', action='store_true')
86 | parser.add_argument('--tieweights', type=int, default=tieweights, help='whether to tie input and output embedding weights')
87 | parser.add_argument('--start_model', type=str, default='off', help='a trained model to start with')
88 | # optimization
89 | parser.add_argument('--optim', type=str, default='SGD', choices=['SGD', 'Adam'], help='optimization algorithm')
90 | parser.add_argument('--lr', type=float, default=learning_rate, help='learning rate')
91 | parser.add_argument('--momentum', type=float, default=momentum, help='momentum for SGD')
92 | parser.add_argument('--wd', type=float, default=weight_decay, help='weight decay (L2 penalty)')
93 | parser.add_argument('--gradclip', type=float, default=grad_max_norm, help='gradient norm clip')
94 | ## parser.add_argument('--shardsz', type=int, default=shard_size, help='shard size for mixture of softmax output layer')
95 | ## parser.add_argument('--subvocabsz', type=int, default=subvocab_size, help='sub-vocabulary size for training on large corpus')
96 | parser.add_argument('--epochs', type=int, default=num_epochs, help='number of training epochs')
97 | parser.add_argument('--save', type=str, default=savepath, help='file path to save the best model')
98 | args = parser.parse_args()
99 | return args
100 |
101 | args = parse_args()
102 |
103 |
104 | ## RNNModel = importlib.import_module(args.modelfile).RNNModel
105 |
106 | cuda_device = 'cpu' if args.devid == -1 else f'cuda:{args.devid}'
107 | if args.devids != 'off':
108 | device_ids = list(map(int, args.devids.split(',')))
109 | output_device = device_ids[0]
110 | cuda_device = f'cuda:{output_device}'
111 |
112 | # if os.name == 'nt':
113 | # # run on my personal windows computer with cpu
114 | # cuda_device = None
115 | # root = 'E:/NLP/LM/data'
116 | # path = 'E:/NLP/LM/data/penn-treebank-small'
117 | # elif os.name == 'posix':
118 | # # run on Harvard Odyssey cluster
119 | # cuda_device = 'cuda:0'
120 | # # root = '/n/rush_lab/users/jzhou/LM/data'
121 | # # path = '/n/rush_lab/users/jzhou/LM/data/penn-treebank-small'
122 | # root = '/media/work/LM/data'
123 | # path = '/media/work/LM/data/penn-treebank'
124 | # # path = '/media/work/LM/data/Giga-sum'
125 | # # train = 'train.title.txt'
126 | # # val = 'valid.title.filter.txt'
127 | # # test = 'task1_ref0.txt'
128 |
129 | log_file = os.path.splitext(args.save)[0] + '.log'
130 | f_log = open(log_file, args.logmode)
131 |
132 | logging('python ' + ' '.join(sys.argv), f_log=f_log)
133 |
134 | logging('-' * 30, f_log=f_log)
135 | logging(time.ctime(), f_log=f_log)
136 |
137 |
138 | # random.seed(args.seed) # this has no impact on the current model training
139 | torch.manual_seed(args.seed)
140 | # torch.backends.cudnn.deterministic = True
141 | # torch.backends.cudnn.benchmark = False
142 | # torch.backends.cudnn.enabled = False
143 |
144 | # print('-' * 30)
145 | # print(time.ctime())
146 |
147 | ########## load the dataset
148 | logging('-' * 30, f_log=f_log)
149 | logging('Loading data ...', f_log=f_log)
150 |
151 | # print('-' * 30)
152 | # print('Loading data ...')
153 |
154 | if args.data_src == 'ptb':
155 | TEXT, train_iter, val_iter, test_iter = loadPTB(root=args.data_root,
156 | batch_size=args.bsz,
157 | bptt_len=args.bptt,
158 | device=cuda_device)
159 | elif args.data_src == 'wiki2':
160 | TEXT, train_iter, val_iter, test_iter = loadWiki2(root=args.data_root,
161 | batch_size=args.bsz,
162 | bptt_len=args.bptt,
163 | device=cuda_device)
164 | elif args.data_src == 'user':
165 | TEXT, train_iter, val_iter, test_iter = loadLMdata(path=args.userdata_path,
166 | train=args.userdata_train,
167 | val=args.userdata_val,
168 | test=args.userdata_test,
169 | batch_size=args.bsz,
170 | bptt_len=args.bptt,
171 | device=cuda_device,
172 | min_freq=5)
173 |
174 | padid = TEXT.vocab.stoi['<pad>']
175 | vocab_size = len(TEXT.vocab)
176 |
177 | logging(f'Vocab size: {vocab_size}', f_log=f_log)
178 | if not os.path.exists(args.vocabsave):
179 | pickle.dump(TEXT.vocab, open(args.vocabsave, 'wb'))
180 | logging(f'Vocabulary object saved to: {args.vocabsave}', f_log=f_log)
181 | else:
182 | logging(f'Vocabulary object at: {args.vocabsave}', f_log=f_log)
183 | logging('Complete!', f_log=f_log)
184 | logging('-' * 30, f_log=f_log)
185 |
186 | ##if args.subvocabsz >= vocab_size or args.subvocabsz == 0:
187 | ## args.subvocabsz = None
188 |
189 | # print('Complete!')
190 | # print('-' * 30)
191 |
192 | ########## define the model and optimizer
193 |
194 | if args.start_model == 'off':
195 | LMModel = RNNModel(vocab_size=vocab_size,
196 | embed_size=args.embedsz,
197 | hidden_size=args.hiddensz,
198 | num_layers=args.numlayers,
199 | dropout=args.dropout,
200 | padid=padid,
201 | tieweights=args.tieweights)
202 | else:
203 | LMModel_start = torch.load(args.start_model).cpu()
204 | # Note: watch out if the model class has different methods from the loaded one to start with !!!
205 | LMModel = RNNModel(vocab_size=vocab_size,
206 | embed_size=args.embedsz,
207 | hidden_size=args.hiddensz,
208 | num_layers=args.numlayers,
209 | dropout=args.dropout,
210 | padid=padid,
211 | tieweights=args.tieweights)
212 | LMModel.load_state_dict(LMModel_start.state_dict())
213 |
214 |
215 | # LMModel = torch.load(args.save).cpu()
216 |
217 | model_size = sum(p.nelement() for p in LMModel.parameters())
218 | logging('-' * 30, f_log=f_log)
219 | logging(f'Model total parameters: {model_size}', f_log=f_log)
220 | logging('-' * 30, f_log=f_log)
221 |
222 | # print('-' * 30)
223 | # print(f'Model tatal parameters: {model_size}')
224 | # print('-' * 30)
225 |
226 | if torch.cuda.is_available() and cuda_device != 'cpu':
227 | LMModel = LMModel.cuda(cuda_device)
228 |
229 | LMModel_parallel = None
230 | if torch.cuda.is_available() and args.devids != 'off':
231 | LMModel_parallel = torch.nn.DataParallel(LMModel, device_ids=device_ids, output_device=output_device, dim=1)
232 | # .cuda() is necessary if LMModel was not on any GPU device
233 | # LMModel_parallel._modules['module'].lstm.flatten_parameters()
234 |
235 | if args.optim == 'SGD':
236 | optimizer = optim.SGD(LMModel.parameters(), lr=args.lr, momentum=args.momentum, weight_decay=args.wd)
237 | elif args.optim == 'Adam':
238 | optimizer = optim.Adam(LMModel.parameters(), lr=args.lr, weight_decay=args.wd)
239 |
240 | scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.25, patience=1)
241 |
242 | if args.start_model != 'off':
243 | start_model_optstate_path = os.path.splitext(args.start_model)[0] + '_optstate.pth'
244 | start_model_schstate_path = os.path.splitext(args.start_model)[0] + '_schstate.pth'
245 | if os.path.exists(start_model_optstate_path):
246 | optimizer.load_state_dict(torch.load(start_model_optstate_path))
247 | logging('-' * 30, f_log=f_log)
248 | logging('Loading saved optimizer states.', f_log=f_log)
249 | logging('-' * 30, f_log=f_log)
250 |
251 | if os.path.exists(start_model_schstate_path):
252 | scheduler.load_state_dict(torch.load(start_model_schstate_path))
253 | logging('-' * 30, f_log=f_log)
254 | logging('Loading saved scheduler states.', f_log=f_log)
255 | logging('-' * 30, f_log=f_log)
256 |
257 | # print('-' * 30)
258 | # print('Loading saved optimizer states.')
259 | # print('-' * 30)
260 |
261 | ########## train the model
262 | if args.start_model != 'off':
263 | start_model_rngstate_path = os.path.splitext(args.start_model)[0] + '_rngstate.pth'
264 | if os.path.exists(start_model_rngstate_path):
265 | torch.set_rng_state(torch.load(start_model_rngstate_path)['torch_rng_state'])
266 | torch.cuda.set_rng_state_all(torch.load(start_model_rngstate_path)['cuda_rng_state'])
267 | logging('-' * 30, f_log=f_log)
268 | logging('Loading saved rng states.', f_log=f_log)
269 | logging('-' * 30, f_log=f_log)
270 |
271 | train_ppl, val_ppl = training(train_iter, val_iter, args.epochs,
272 | LMModel,
273 | optimizer,
274 | scheduler,
275 | args.gradclip,
276 | args.save,
277 | ## shard_size=args.shardsz,
278 | LMModel_parallel=LMModel_parallel,
279 | f_log=f_log)
280 | ## subvocab_size=args.subvocabsz)
281 |
282 | ######### test the trained model
283 | ##test_ppl = validating(test_iter, LMModel, shard_size=args.shardsz, LMModel_parallel=LMModel_parallel, f_log=f_log)
284 | test_ppl = validating(test_iter, LMModel, LMModel_parallel=LMModel_parallel, f_log=f_log)
285 | logging('-' * 30, f_log=f_log)
286 | logging('Test ppl: %f' % test_ppl, f_log=f_log)
287 | logging('-' * 30, f_log=f_log)
288 |
289 | f_log.close()
290 |
291 | # print('-' * 30)
292 | # print('Test ppl: %f' % test_ppl)
293 | # print('-' * 30)
294 |
--------------------------------------------------------------------------------
/lm_lstm/model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Nov 29 2018
4 |
5 | @author: zjw
6 | """
7 | import torch
8 | import torch.nn as nn
9 |
10 |
11 | class RNNModel(nn.Module):
12 | def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout, padid=1, max_norm=None, tieweights=False):
13 | super(RNNModel, self).__init__()
14 |
15 | self.vocab_size = vocab_size
16 | self.embed_size = embed_size
17 | self.hidden_size = hidden_size
18 | self.num_layers = num_layers
19 | self.padid = padid
20 |
21 | self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=self.padid, max_norm=None)
22 | self.lstm = nn.LSTM(input_size=embed_size, hidden_size=hidden_size,
23 | num_layers=num_layers, dropout=dropout)
24 | self.drop = nn.Dropout(p=dropout)
25 | self.proj = nn.Linear(hidden_size, vocab_size, bias=True)
26 |
27 | self.init_weight(0.5)
28 |
29 | # tie weights
30 | if tieweights:
31 | self.proj.weight = self.embedding.weight
32 |
33 | def init_weight(self, initrange=0.1):
34 | nn.init.uniform_(self.embedding.weight, -initrange, initrange)
35 | # nn.init.uniform_(self.proj.weight, -initrange, initrange)
36 | nn.init.orthogonal_(self.proj.weight)
37 | nn.init.constant_(self.proj.bias, 0)
38 |
39 | def forward(self, batch_text, hn, subvocab=None, return_prob=False):
40 | embed = self.embedding(batch_text) # size: (seq_len, batch_size, embed_size)
41 | output, hn = self.lstm(embed, hn) # output size: (seq_len, batch_size, hidden_size)
42 | output = self.drop(output) # hn = (hn, cn), each with size: (num_layers, batch, hidden_size)
43 | if isinstance(subvocab, list):
44 | subvocab = torch.LongTensor(subvocab, device=output.device)
45 | output = self.proj(output) if subvocab is None else nn.functional.linear(output, self.proj.weight[subvocab, :], self.proj.bias[subvocab])
46 | if return_prob:
47 | output = nn.functional.softmax(output, dim=-1)
48 | # detach last hidden and cell states to truncate the computational graph for BPTT.
49 | hn = tuple(map(lambda x: x.detach(), hn))
50 | return output, hn
51 |
52 | def score_textseq(self, text, vocab, hn=None, size_average=True):
53 | """
54 | Output the log-likelihood of a text sequence.
55 | """
56 | if isinstance(text, str):
57 | text = text.split()
58 | textid = next(self.parameters()).new_tensor([vocab.stoi[w] for w in text], dtype=torch.long)
59 | with torch.no_grad():
60 | self.eval()
61 | model_output, hn = self(textid.unsqueeze(1), hn)
62 | self.train()
63 | ll = nn.functional.cross_entropy(model_output[:-1, 0, :], textid[1:], ignore_index=self.padid,
64 | reduction='elementwise_mean' if size_average else 'sum')
65 | ll = -ll.item()
66 | return ll
67 |
68 | def score_nexttoken(self, text, vocab, hn=None):
69 | """
70 | Output the predictive probabilities of the next token given a text sequence.
71 | """
72 | if isinstance(text, str):
73 | text = text.split()
74 | textid = next(self.parameters()).new_tensor([vocab.stoi[w] for w in text], dtype=torch.long)
75 | with torch.no_grad():
76 | self.eval()
77 | model_output, hn = self(textid.unsqueeze(1), hn, return_prob=True)
78 | self.train()
79 |
80 | return model_output[-1, 0, :]
81 |
82 |
--------------------------------------------------------------------------------
/lm_lstm/testppl.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Jan 31 2019
4 |
5 | @author: zjw
6 | """
7 | import torch
8 | from dataload import loadPTB, loadWiki2, loadLMdata
9 | from model import RNNModel
10 | from train import validating
11 | # from train_sharding import training, validating
12 |
13 | import os
14 | import sys
15 | # import random
16 | import argparse
17 | import time
18 | import importlib
19 | import pickle
20 |
21 |
22 | ########## set up parameters
23 | # data
24 | data_src = 'ptb'
25 | # on MicroSoft Azure
26 | # data_root = '/media/work/LM/data'
27 | # userdata_path = '/media/work/LM/data/Giga-sum' # .../penn-treebank-small
28 | # on Harvard Odyssey Cluster
29 | data_root = '/n/rush_lab/users/jzhou/LM/data'
30 | userdata_path = '/n/rush_lab/users/jzhou/LM/data/Giga-sum' # .../penn-treebank-small
31 | userdata_train = 'train.title.txt'
32 | userdata_val = 'valid.title.filter.txt'
33 | userdata_test = 'task1_ref0_unk.txt'
34 | batch_size = 128
35 | bptt_len = 32
36 |
37 | vocabsavepath = './models/vocabTle.pkl'
38 |
39 | def parse_args():
40 | parser = argparse.ArgumentParser(description='Evaluating test perplexity of a trained LSTM language model.')
41 | group = parser.add_mutually_exclusive_group()
42 | group.add_argument('--devid', type=int, default=-1, help='single device id; -1 for CPU')
43 | group.add_argument('--devids', type=str, default='off', help='multiple device ids for data parallel; use comma to separate, e.g. 0, 1, 2')
44 | # parser.add_argument('--devid', type=int, default=-1, help='device id; -1 for CPU')
45 | ## parser.add_argument('--modelfile', type=str, default='model', help='file name of the model, without .py')
46 | parser.add_argument('--seed', type=int, default=0, help='random seed')
47 | # data loading
48 | parser.add_argument('--data_src', type=str, default=data_src, choices=['ptb', 'wiki2', 'user'], help='data source')
49 | parser.add_argument('--data_root', type=str, default=data_root, help='root path for PTB/Wiki2 dataset path')
50 | parser.add_argument('--userdata_path', type=str, default=userdata_path, help='user data path')
51 | parser.add_argument('--userdata_train', type=str, default=userdata_train, help='user data training set file name')
52 | parser.add_argument('--userdata_val', type=str, default=userdata_val, help='user data validating set file name')
53 | parser.add_argument('--userdata_test', type=str, default=userdata_test, help='user data testing set file name')
54 | parser.add_argument('--bptt', type=int, default=bptt_len, help='bptt length')
55 | parser.add_argument('--bsz', type=int, default=batch_size, help='batch size')
56 | parser.add_argument('--vocabsave', type=str, default=vocabsavepath, help='file path to save the vocabulary object')
57 | # model
58 | parser.add_argument('--model', type=str, default='off', help='a trained model to start with')
59 | args = parser.parse_args()
60 | return args
61 |
62 | args = parse_args()
63 |
64 |
65 | ## RNNModel = importlib.import_module(args.modelfile).RNNModel
66 |
67 | cuda_device = 'cpu' if args.devid == -1 else f'cuda:{args.devid}'
68 | if args.devids != 'off':
69 | device_ids = list(map(int, args.devids.split(',')))
70 | output_device = device_ids[0]
71 | cuda_device = f'cuda:{output_device}'
72 |
73 | # if os.name == 'nt':
74 | # # run on my personal windows computer with cpu
75 | # cuda_device = None
76 | # root = 'E:/NLP/LM/data'
77 | # path = 'E:/NLP/LM/data/penn-treebank-small'
78 | # elif os.name == 'posix':
79 | # # run on Harvard Odyssey cluster
80 | # cuda_device = 'cuda:0'
81 | # # root = '/n/rush_lab/users/jzhou/LM/data'
82 | # # path = '/n/rush_lab/users/jzhou/LM/data/penn-treebank-small'
83 | # root = '/media/work/LM/data'
84 | # path = '/media/work/LM/data/penn-treebank'
85 | # # path = '/media/work/LM/data/Giga-sum'
86 | # # train = 'train.title.txt'
87 | # # val = 'valid.title.filter.txt'
88 | # # test = 'task1_ref0.txt'
89 |
90 |
91 | # random.seed(args.seed) # this has no impact on the current model training
92 | torch.manual_seed(args.seed)
93 | # torch.backends.cudnn.deterministic = True
94 | # torch.backends.cudnn.benchmark = False
95 | # torch.backends.cudnn.enabled = False
96 |
97 | # print('-' * 30)
98 | # print(time.ctime())
99 |
100 | ########## load the dataset
101 | print('-' * 30)
102 | print('Loading data ...')
103 |
104 | if args.data_src == 'ptb':
105 | TEXT, train_iter, val_iter, test_iter = loadPTB(root=args.data_root,
106 | batch_size=args.bsz,
107 | bptt_len=args.bptt,
108 | device=cuda_device)
109 | elif args.data_src == 'wiki2':
110 | TEXT, train_iter, val_iter, test_iter = loadWiki2(root=args.data_root,
111 | batch_size=args.bsz,
112 | bptt_len=args.bptt,
113 | device=cuda_device)
114 | elif args.data_src == 'user':
115 | TEXT, train_iter, val_iter, test_iter = loadLMdata(path=args.userdata_path,
116 | train=args.userdata_train,
117 | val=args.userdata_val,
118 | test=args.userdata_test,
119 | batch_size=args.bsz,
120 | bptt_len=args.bptt,
121 | device=cuda_device,
122 | min_freq=5)
123 | print(f'Vocabulary size: {len(TEXT.vocab)}')
124 | print('Complete!')
125 | print('-' * 30)
126 |
127 | ########## define the model
128 | LMModel = torch.load(args.model).cpu()
129 | '''
130 | LMModel_start = torch.load(args.start_model).cpu()
131 | # Note: watch out if the model class has different methods from the loaded one to start with !!!
132 | LMModel = RNNModel(vocab_size=vocab_size,
133 | embed_size=LMModel_start.embedsz,
134 | hidden_size=LMModel_start.hiddensz,
135 | num_layers=LMModel_start.numlayers,
136 | dropout=LMModel_start.dropout,
137 | padid=LMModel_start.padid,
138 | tieweights=LMModel_start.tieweights)
139 | LMModel.load_state_dict(LMModel_start.state_dict())
140 | '''
141 |
142 | # LMModel = torch.load(args.save).cpu()
143 |
144 | model_size = sum(p.nelement() for p in LMModel.parameters())
145 |
146 | print('-' * 30)
147 | print(f'Model total parameters: {model_size}')
148 | print('-' * 30)
149 |
150 | if torch.cuda.is_available() and cuda_device != 'cpu':
151 | LMModel = LMModel.cuda(cuda_device)
152 |
153 | LMModel_parallel = None
154 | if torch.cuda.is_available() and args.devids != 'off':
155 | LMModel_parallel = torch.nn.DataParallel(LMModel, device_ids=device_ids, output_device=output_device, dim=1)
156 | # .cuda() is necessary if LMModel was not on any GPU device
157 | # LMModel_parallel._modules['module'].lstm.flatten_parameters()
158 | '''
159 | if args.start_model is not 'off':
160 | start_model_optstate_path = os.path.splitext(args.start_model)[0] + '_optstate.pth'
161 | start_model_schstate_path = os.path.splitext(args.start_model)[0] + '_schstate.pth'
162 | if os.path.exists(start_model_optstate_path):
163 | optimizer.load_state_dict(torch.load(start_model_optstate_path))
164 | logging('-' * 30, f_log=f_log)
165 | logging('Loading saved optimizer states.', f_log=f_log)
166 | logging('-' * 30, f_log=f_log)
167 |
168 | if os.path.exists(start_model_schstate_path):
169 | scheduler.load_state_dict(torch.load(start_model_schstate_path))
170 | logging('-' * 30, f_log=f_log)
171 | logging('Loading saved scheduler states.', f_log=f_log)
172 | logging('-' * 30, f_log=f_log)
173 | '''
174 | # print('-' * 30)
175 | # print('Loading saved optimizer states.')
176 | # print('-' * 30)
177 |
178 | ######### test the trained model
179 | test_ppl = validating(test_iter, LMModel, LMModel_parallel=LMModel_parallel)
180 |
181 | print('-' * 30)
182 | print('Test ppl: %f' % test_ppl)
183 | print('-' * 30)
184 |
185 |
--------------------------------------------------------------------------------
/lm_lstm/train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Oct 25 2018
4 |
5 | @author: zjw
6 | """
7 | import torch
8 | import torch.nn as nn
9 | import math
10 | import time
11 | import os
12 | from utils import logging, timeSince, rand_subvocab
13 |
14 |
15 | def training(train_iter, val_iter, num_epoch, LMModel, optimizer, scheduler, grad_max_norm=None, savepath='./LMModel.pth', LMModel_parallel=None, f_log=None, subvocab_size=None):
16 | criterion = nn.CrossEntropyLoss(ignore_index=LMModel.padid, reduction='sum')
17 | best_val_ppl = None
18 | last_epoch = scheduler.last_epoch
19 | LMModel.train()
20 | start = time.time()
21 | for epoch in range(last_epoch + 1, last_epoch + 1 + num_epoch):
22 | train_iter.init_epoch()
23 | loss_total = 0
24 | num_token_passed = 0
25 | hn = None
26 | for batch in train_iter:
27 | # calculate sub-vocabulary
28 | ## subvocab = rand_subvocab(batch, LMModel.vocab_size, subvocab_size)
29 | subvocab = None
30 | # update parameters
31 | optimizer.zero_grad()
32 | if LMModel_parallel is None:
33 | output, hn = LMModel(batch.text, hn if hn is not None else None, subvocab=subvocab)
34 | else:
35 | output, hn = LMModel_parallel(batch.text, hn if hn is not None else None, subvocab=subvocab.numpy().tolist() if subvocab is not None else None)
36 | if subvocab is not None:
37 | target_subids = batch.target.new_tensor([(subvocab == x).nonzero().item() for x in batch.target.view(-1).cpu()], dtype=torch.long)
38 | loss = criterion(output.view(-1, output.size(2)), batch.target.view(-1)) if subvocab is None else \
39 | criterion(output.view(-1, output.size(2)), target_subids)
40 | loss.backward()
41 |
42 | if grad_max_norm:
43 | nn.utils.clip_grad_norm_(LMModel.parameters(), grad_max_norm)
44 |
45 | # calculate perplexity
46 | loss_total += float(loss) # do not accumulate history across training loop
47 | num_token_passed += (torch.numel(batch.target) -
48 | torch.sum(batch.target == LMModel.padid)).item()
49 | # do not count the '<pad>', which could only exist
50 | # at the end of the last batch
51 | loss_avg = loss_total / num_token_passed
52 | ppl = math.exp(loss_avg)
53 |
54 | optimizer.step()
55 |
56 | # print information
57 | if train_iter.iterations % 50 == 0 or train_iter.iterations == len(train_iter):
58 | logging('Epoch %d / %d, iteration %d / %d, ppl: %f (time elapsed %s)'
59 | %(epoch + 1, last_epoch + 1 + num_epoch, train_iter.iterations, len(train_iter), ppl, timeSince(start)), f_log=f_log)
60 |
61 | # calculate ppl on the validation set
62 | val_ppl = validating(val_iter, LMModel, LMModel_parallel=LMModel_parallel, f_log=f_log)
63 | LMModel.train()
64 | logging('-' * 30, f_log=f_log)
65 | logging('Validating ppl: %f' % val_ppl, f_log=f_log)
66 | logging('-' * 30, f_log=f_log)
67 |
68 | scheduler.step(val_ppl)
69 |
70 | # save the model if the validation ppl is the best so far
71 | if not best_val_ppl or val_ppl < best_val_ppl:
72 | best_val_ppl = val_ppl
73 | torch.save(LMModel, savepath)
74 | torch.save(optimizer.state_dict(), os.path.splitext(savepath)[0] + '_optstate.pth')
75 | torch.save(scheduler.state_dict(), os.path.splitext(savepath)[0] + '_schstate.pth')
76 | torch.save({'torch_rng_state': torch.get_rng_state(), 'cuda_rng_state': torch.cuda.get_rng_state_all()}, os.path.splitext(savepath)[0] + '_rngstate.pth')
77 | logging(f'Current model (after epoch {epoch+1}) saved to {savepath} (along with optimizer state dictionary & scheduler state dictionary & rng states)', f_log=f_log)
78 | logging('-' * 30, f_log=f_log)
79 |
80 | return ppl, val_ppl
81 |
82 |
83 | def validating(val_iter, LMModel, LMModel_parallel=None, f_log=None):
84 | criterion = nn.CrossEntropyLoss(ignore_index=LMModel.padid, reduction='sum')
85 | LMModel.eval()
86 | with torch.no_grad():
87 | val_iter.init_epoch()
88 | loss_total = 0
89 | num_token_passed = 0
90 | hn = None
91 | for batch in val_iter:
92 | if LMModel_parallel is None:
93 | output, hn = LMModel(batch.text, hn if hn is not None else None)
94 | else:
95 | output, hn = LMModel_parallel(batch.text, hn if hn is not None else None)
96 | loss = criterion(output.view(-1, output.size(2)), batch.target.view(-1))
97 | loss_total += float(loss)
98 | num_token_passed += torch.sum(batch.target.ne(LMModel.padid)).item()
99 |
100 | ppl = math.exp(loss_total / num_token_passed)
101 | return ppl
102 |
--------------------------------------------------------------------------------
/lm_lstm/utils.py:
--------------------------------------------------------------------------------
1 | import time
2 | import math
3 | import torch
4 |
5 |
6 | def logging(s, f_log=None, print_=True, log_=True):
7 | if print_:
8 | print(s)
9 | if log_ and f_log is not None:
10 | f_log.write(s + '\n')
11 |
12 |
13 | def timeSince(start):
14 | now = time.time()
15 | s = now - start
16 | m = math.floor(s / 60)
17 | s -= m * 60
18 | h = math.floor(m / 60)
19 | m -= h * 60
20 | if h == 0:
21 | return '%dm %ds' % (m, s)
22 | else:
23 | return '%dh %dm %ds' % (h, m, s)
24 |
25 | def rand_subvocab(batch, vocab_size, subvocab_size=None):
26 | if subvocab_size is None or subvocab_size >= vocab_size:
27 | return None
28 | batch_ids = torch.cat([batch.text.view(-1), batch.target.view(-1)]).cpu().unique()
29 | subvocab = torch.cat([torch.randperm(vocab_size)[:subvocab_size], batch_ids]).unique(sorted=True)
30 | return subvocab
31 |
32 |
--------------------------------------------------------------------------------
/results_elmo_giga/README.md:
--------------------------------------------------------------------------------
1 | **Summaries generated from our unsupervised method for Gigaword test set**
2 |
3 | Including:
4 | - Summaries selected from finished beams with a consistent length penalty
5 | - Oracle results (select the best summary from the finished beams compared with the reference)
6 |
--------------------------------------------------------------------------------
/results_elmo_sc/README.md:
--------------------------------------------------------------------------------
1 | **Summaries generated from our unsupervised method for Google sentence compression test set**
2 |
3 | Including:
4 | - Summaries selected from finished beams with a consistent length penalty
5 | - Oracle results (select the best summary from the finished beams compared with the reference)
6 |
--------------------------------------------------------------------------------
/uss/beam_search.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn.functional as F
3 |
4 | from sim_embed_score import simScoreNext, simScoreNext_GPT2
5 | # from lmsubvocab import prob_next
6 | from lm_subvocab import prob_next_1step
7 | import nltk
8 |
9 | lemma = nltk.wordnet.WordNetLemmatizer()
10 |
11 |
12 | class BeamUnit:
13 | def __init__(self, word_id, pre_loc, cur_loc, score, seq_len, vocab, **kwargs):
14 | self.score = score
15 | self.word_id = word_id
16 | self.pre_loc = pre_loc
17 | self.cur_loc = cur_loc
18 | self.seq_len = seq_len
19 | self.vocab = vocab
20 | for k, v in kwargs.items():
21 | setattr(self, k, v)
22 |
23 |
24 | class Beam:
25 | def __init__(self, init_K, vocab, init_ids, device=None, **kwargs):
26 | assert 1 <= init_K <= len(vocab), 'Initial beam size should be in [1, len(vocab)]!'
27 | assert init_K == len(init_ids), 'Initial beam size should equal the length of initial ids!'
28 | self.K = [init_K] # dynamic beam size
29 | self.vocab = vocab
30 | self.step = 0
31 | self.device = device
32 | self.endbus = [] # ending BeamUnits
33 | self.endall = False # if all beams reach the termination
34 | if init_ids == [None]: # A special initial id
35 | seq_len = 0
36 | else:
37 | seq_len = 1
38 | self.beamseq = [[BeamUnit(word_id, pre_loc=None, cur_loc=i, score=0, seq_len=seq_len, vocab=vocab, **kwargs)
39 | for (i, word_id) in enumerate(init_ids)]]
40 | # Note: for the reason of unifying different language models, all the beams start from one single unit:
41 | # for similarity LM, init_ids should always be [None];
42 | # for normal LM, init_ids should be of your own pick (since the current LM was not trained with
43 | # a special BOS token, it must start with some given token).
44 |
45 | def beamstep(self, K, score_funcK, **kwargs):
46 | """
47 | K: beam size next step
48 | score_func: a function that takes in a list of BeamUnit and returns the next top K BeamUnit based on some scores
49 | """
50 | if self.endall:
51 | raise ValueError('Beam.endall flag is already raised. No need to do beamstep.')
52 |
53 | nexttopKK, endbus = score_funcK(self.beamseq[-1], K, **kwargs)
54 | self.endbus += endbus
55 |
56 | if nexttopKK == []:
57 | print('All beams reach EOS. Beamstep stops.')
58 | self.endall = True
59 | else:
60 | self.beamseq.append(nexttopKK) # TO DO: add termination condition
61 | self.K.append(len(nexttopKK))
62 | self.step += 1
63 |
64 | def beamcut(self, K, score_func=None, **kwargs):
65 | """
66 | Cut the current beam width down to K (top K).
67 | """
68 | assert K > 0, 'Beam width K should be positive!'
69 | if K >= self.K[-1]:
70 | print('No need to cut.')
71 | else:
72 | if score_func is None:
73 | self.beamseq[-1] = self.beamseq[-1][0:K]
74 | self.K[-1] = K
75 | else:
76 | ll = [score_func(text=self.retrieve(k + 1)[0], **kwargs) for k in range(self.K[-1])]
77 | ll_sorted = sorted(list(zip(range(len(ll)), ll)), key=lambda x: x[1], reverse=True)[0:K]
78 | ll_idx, _ = zip(*ll_sorted)
79 | self.beamseq[-1] = [self.beamseq[-1][i] for i in range(self.K[-1]) if i in ll_idx]
80 | assert len(self.beamseq[-1]) == K
81 | for k, bu in enumerate(self.beamseq[-1]):
82 | bu.cur_loc = k
83 | self.K[-1] = K
84 | return ll, ll_sorted
85 |
86 | def beamselect(self, indices=[0]):
87 | """
88 | Select the beams (at last step) according to indices.
89 | Default: select the first beam, which is equivalent to self.beamcut(1).
90 | """
91 | indices = sorted(list(set(indices))) # indices: no repeated numbers, and should be sorted
92 | assert indices[-1] < self.K[-1], 'Index out of range (beamwidth).'
93 | self.beamseq[-1] = [self.beamseq[-1][i] for i in indices]
94 | for k, bu in enumerate(self.beamseq[-1]):
95 | bu.cur_loc = k
96 | self.K[-1] = len(self.beamseq[-1])
97 |
98 | def beamback(self, seq_len):
99 | """
100 | Trace back the beam at seq_len.
101 | """
102 | assert seq_len <= len(self.beamseq), 'seq_len larger than maximum.'
103 | if self.beamseq[0][0].word_id is None:
104 | self.beamseq = self.beamseq[0:(seq_len + 1)]
105 | self.K = self.K[0:(seq_len + 1)]
106 | self.step = seq_len
107 | else:
108 | self.beamseq = self.beamseq[0:seq_len]
109 | self.K = self.K[0:seq_len]
110 | self.step = seq_len - 1
111 | self.endall = False
112 |
113 | def retrieve(self, k, seq_len=-1):
114 | """
115 | Retrieve the k-th ranked generated sentence.
116 | """
117 |
118 | if self.beamseq[0][0].word_id is not None and seq_len > 0:
119 | # for a normal LM
120 | seq_len -= 1
121 |
122 | assert 1 <= k <= self.K[seq_len], 'k must be in [1, the total number of beams at seq_len]!'
123 |
124 | rebeam = [self.beamseq[seq_len][k - 1]]
125 | n = seq_len
126 | while rebeam[0].pre_loc is not None:
127 | n -= 1
128 | rebeam = [self.beamseq[n][rebeam[0].pre_loc]] + rebeam
129 | sent = [self.vocab.itos[bu.word_id] for bu in rebeam if bu.word_id is not None]
130 |
131 | return sent, rebeam
132 |
133 | def retrieve_align(self, rebeam):
134 | """
135 | Should be run after calling Beam.retrieve(...).
136 | """
137 | align_locs = [bu.align_loc.item() for bu in rebeam if bu.word_id is not None and bu.align_loc is not None]
138 | return align_locs
139 |
140 | def retrieve_endbus(self):
141 | """
142 | Retrieve the complete sentences acquired by beam steps.
143 | """
144 | sents = []
145 | aligns = []
146 | score_avgs = []
147 | for ks in self.endbus:
148 | sent, rebeam = self.retrieve(ks[0] + 1, ks[1])
149 | score_avg = ks[2] / ks[1]
150 |
151 | sents.append(sent)
152 | aligns.append(self.retrieve_align(rebeam))
153 | score_avgs.append(score_avg)
154 |
155 | return sents, aligns, score_avgs
156 |
157 | def simscore(self, bu, K, template_vec, ee, word_list=None, mono=False,
158 | batch_size=1024, normalized=True, elmo_layer='avg'):
159 | """
160 | Score function based on sentence similarities.
161 | """
162 | if word_list is None:
163 | word_list = self.vocab.itos
164 | scores, indices, states = simScoreNext(template_vec, word_list, ee,
165 | prevs_state=bu.elmo_state, batch_size=batch_size,
166 | prevs_align=bu.align_loc if mono else None,
167 | normalized=normalized, elmo_layer=elmo_layer)
168 | scores_prob = torch.nn.functional.log_softmax(scores, dim=0)
169 |
170 | sorted_scores, sorting_indices = torch.sort(scores)
171 |
172 | nexttopK = [BeamUnit(self.vocab.stoi[word_list[i]], bu.cur_loc, None, scores_prob[i].item() + bu.score,
173 | bu.seq_len + 1, self.vocab, elmo_state=states[i], align_loc=indices[i].item())
174 | for i in sorting_indices[0:(K + 5)]
175 | # do not allow repeated words consecutively
176 | if lemma.lemmatize(self.vocab.itos[bu.word_id]) != lemma.lemmatize(word_list[i])]
177 | nexttopK = nexttopK[0:K]
178 |
179 | return nexttopK
180 |
181 | def lmscore(self, bulist, K, LMModel, word_list=None, subvocab=None, clustermask=None, renorm=False, temperature=1):
182 | """
183 | Score function based on a pretrained RNN language model.
184 | """
185 | # note that LMModel should have the same vocab as that in Beam()
186 |
187 | ## when no candidate word list is provided, use the full vocabulary
188 | if word_list is None:
189 | word_list = self.vocab.itos
190 | subvocab = None
191 | clustermask = None
192 |
193 | if self.device is not None:
194 | LMModel = LMModel.cuda(device=self.device)
195 | LMModel.eval()
196 | with torch.no_grad():
197 | onbeam_ids = list(range(len(bulist)))
198 | batch_text = next(LMModel.parameters()).new_tensor([bulist[i].word_id for i in onbeam_ids],
199 | dtype=torch.long).unsqueeze(0)
200 | if bulist[onbeam_ids[0]].lm_state is None:
201 | # 'lm_state' for the current beam is either all 'None' or all not 'None'.
202 | batch_hn = None
203 | else:
204 | batch_hn = (torch.cat([bulist[i].lm_state[0] for i in onbeam_ids], dim=1),
205 | torch.cat([bulist[i].lm_state[1] for i in onbeam_ids], dim=1))
206 | subprobs, probs, hn = prob_next_1step(LMModel, batch_text, hn=batch_hn,
207 | subvocab=subvocab, clustermask=clustermask, onscore=False,
208 | renorm=renorm,
209 | temperature=temperature)
210 | # convert the hidden state tuple into a list of tuples, corresponding to each beam sequence
211 | hn = list(
212 | zip(torch.chunk(hn[0], chunks=len(onbeam_ids), dim=1), torch.chunk(hn[1], chunks=len(onbeam_ids), dim=1)))
213 | lm_cum_logprob = subprobs.new_tensor([bulist[i].lm_score for i in onbeam_ids]).unsqueeze(1) + torch.log(
214 | subprobs)
215 | lm_cum_logprob = lm_cum_logprob.view(-1) # this is the cumulative log probabilities
216 |
217 | ## rank and update
218 | if K > len(lm_cum_logprob):
219 | scores_sorted, ids_sorted = lm_cum_logprob.sort(descending=True)
220 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]],
221 | bulist[onbeam_ids[i // len(word_list)]].cur_loc,
222 | m,
223 | scores_sorted[m].item(),
224 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1,
225 | self.vocab,
226 | lm_score=lm_cum_logprob[i].item(),
227 | lm_state=hn[i // len(word_list)])
228 | for (m, i) in enumerate(ids_sorted)]
229 | else:
230 | scores_topK, ids_topK = lm_cum_logprob.topk(K)
231 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]],
232 | bulist[onbeam_ids[i // len(word_list)]].cur_loc,
233 | m,
234 | scores_topK[m].item(),
235 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1,
236 | self.vocab,
237 | lm_score=lm_cum_logprob[i].item(),
238 | lm_state=hn[i // len(word_list)])
239 | for (m, i) in enumerate(ids_topK)]
240 |
241 | endbus = []
242 |
243 | return nexttopKK, endbus
244 |
245 | def combscoreK(self, bulist, K, template_vec, ee, LMModel,
246 | word_list=None, subvocab=None, clustermask=None,
247 | mono=True, batch_size=1024, normalized=True, renorm=False, temperature=1,
248 | elmo_layer='avg', alpha=0.01, stopbyLMeos=False, ifadditive=False):
249 | """
250 | Given a list of 'BeamUnit', score the next tokens from the candidate word list based on the combination of
251 | sentence similarities and a pretrained language model. Output the top K scored new 'BeamUnit', in a list.
252 |
253 | Input:
254 | stopbyLMeos: whether to use the LM '<eos>' to solely decide the end of sentence, i.e. when '<eos>' gets the
255 | highest probability from the LM, remove the generated sentence out of beam. Default: False.
256 |
257 | Note:
258 | 'word_list', 'subvocab', and 'clustermask' should be coupled, sorted based on the full vocabulary.
259 | """
260 |
261 | ## when no candidate word list is provided, use the full vocabulary
262 | if word_list is None:
263 | word_list = self.vocab.itos
264 | subvocab = None
265 | clustermask = None
266 |
267 | ## calculate the similarity scores
268 | endbus = [] # finished sequences
269 | onbeam_ids = list(range(
270 | len(bulist))) # keep track of sequences on beam that have not aligned to the end of the source sequence
271 | sim_cum_allbeam = None
272 | indices_allbeam = None
273 | states_allbeam = []
274 | for (i, bu) in enumerate(bulist):
275 | try:
276 | scores, indices, states = simScoreNext(template_vec, word_list, ee,
277 | prevs_state=bu.elmo_state, batch_size=batch_size,
278 | prevs_align=bu.align_loc if mono else None,
279 | normalized=normalized, elmo_layer=elmo_layer)
280 | scores_logprob = F.log_softmax(scores, dim=0)
281 |
282 | sim_cum_logprob = scores_logprob + torch.tensor(bu.sim_score, dtype=torch.float, device=self.device)
283 |
284 | sim_cum_allbeam = sim_cum_logprob if sim_cum_allbeam is None else torch.cat(
285 | [sim_cum_allbeam, sim_cum_logprob])
286 | indices_allbeam = indices if indices_allbeam is None else torch.cat([indices_allbeam, indices])
287 | states_allbeam = states_allbeam + states
288 |
289 | # current sequence already aligned to the end: move out of beam
290 | except AssertionError as e:
291 | print('AssertionError:', e)
292 | endbus.append((i, bu.seq_len, bu.score, bu.sim_score, bu.lm_score))
293 | onbeam_ids.remove(i)
294 |
295 | ## calculate the RNN LM scores
296 | ## note that LMModel should have the same vocab as that in Beam()
297 | if len(bulist) == 1 and bulist[0].word_id is None:
298 | # first beam step after initialization, only relying on similarity scores and no LM calculation is needed
299 | scores_comb = sim_cum_allbeam
300 | lm_cum_logprob = torch.zeros_like(sim_cum_allbeam)
301 | hn = [None] * len(onbeam_ids) # at the initial step, 'onbeam_ids' wouldn't be empty anyway
302 | else:
303 | ## all sequences have aligned to the end of source sentence
304 | if onbeam_ids == []:
305 | return [], endbus
306 | ## do the RNN LM forward calculation
307 | if bulist[onbeam_ids[0]].lm_state is None:
308 | # 'lm_state' for the current beam is either all 'None' or all not 'None'.
309 | batch_hn = None
310 | else:
311 | batch_hn = (torch.cat([bulist[i].lm_state[0] for i in onbeam_ids], dim=1),
312 | torch.cat([bulist[i].lm_state[1] for i in onbeam_ids], dim=1))
313 | batch_text = next(LMModel.parameters()).new_tensor([bulist[i].word_id for i in onbeam_ids],
314 | dtype=torch.long).unsqueeze(0)
315 | subprobs, probs, hn = prob_next_1step(LMModel, batch_text, hn=batch_hn,
316 | subvocab=subvocab, clustermask=clustermask, onscore=False,
317 | renorm=renorm,
318 | temperature=temperature)
319 |
320 |             ### LM predicts '<eos>' with the highest probability: move out of beam
321 | if stopbyLMeos:
322 | subprobs_max, subprobs_maxids = torch.max(subprobs, dim=1)
323 |                 eospos = (subprobs_maxids == word_list.index('<eos>')).nonzero()
324 | if eospos.size(0) > 0: # number of ended sentences
325 | # Note: have to delete backwards! Otherwise the indices will change.
326 | oob_ids = [onbeam_ids.pop(ep.item()) for ep in eospos.squeeze(1).sort(descending=True)[0]]
327 | oob_ids = sorted(oob_ids)
328 |                     print('-' * 5 + ' <eos> predicted most likely by LM at location:', *oob_ids)
329 | for i in oob_ids:
330 | endbus.append((i, bulist[i].seq_len, bulist[i].score, bulist[i].sim_score, bulist[i].lm_score))
331 |                     # all sequences have been predicted with '<eos>' having the highest probability
332 | if onbeam_ids == []:
333 | return [], endbus
334 | else:
335 | remainpos = [i for i in range(len(subprobs)) if i not in eospos]
336 | subprobs = subprobs[remainpos, :]
337 | probs = probs[remainpos, :]
338 | hn = (hn[0][:, remainpos, :], hn[1][:, remainpos, :])
339 | remainpos_simallbeam = []
340 | for rp in remainpos:
341 | remainpos_simallbeam += list(range(len(word_list) * rp, len(word_list) * (rp + 1)))
342 | sim_cum_allbeam = sim_cum_allbeam[remainpos_simallbeam]
343 | indices_allbeam = indices_allbeam[remainpos_simallbeam]
344 | states_allbeam = [s for (i, s) in enumerate(states_allbeam) if i in remainpos_simallbeam]
345 |
346 | # convert the hidden state tuple into a list of tuples, corresponding to each beam sequence
347 | hn = list(zip(torch.chunk(hn[0], chunks=len(onbeam_ids), dim=1),
348 | torch.chunk(hn[1], chunks=len(onbeam_ids), dim=1)))
349 | lm_cum_logprob = subprobs.new_tensor([bulist[i].lm_score for i in onbeam_ids]).unsqueeze(1) + torch.log(
350 | subprobs)
351 | lm_cum_logprob = lm_cum_logprob.view(-1) # this is the cumulative log probabilities
352 |
353 | if ifadditive:
354 | scores_comb = torch.log((1 - alpha) * torch.exp(sim_cum_allbeam) + alpha * torch.exp(lm_cum_logprob))
355 | else:
356 | scores_comb = (1 - alpha) * sim_cum_allbeam + alpha * lm_cum_logprob
357 |
358 | ## rank and update
359 | if K > len(scores_comb):
360 | scores_comb_sorted, ids_sorted = scores_comb.sort(descending=True)
361 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]],
362 | bulist[onbeam_ids[i // len(word_list)]].cur_loc,
363 | m,
364 | scores_comb_sorted[m].item(),
365 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1,
366 | self.vocab,
367 | sim_score=sim_cum_allbeam[i].item(),
368 | lm_score=lm_cum_logprob[i].item(),
369 | lm_state=hn[i // len(word_list)],
370 | elmo_state=states_allbeam[i],
371 | align_loc=indices_allbeam[i])
372 | for (m, i) in enumerate(ids_sorted)]
373 | else:
374 | scores_comb_topK, ids_topK = scores_comb.topk(K)
375 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]],
376 | bulist[onbeam_ids[i // len(word_list)]].cur_loc,
377 | m,
378 | scores_comb_topK[m].item(),
379 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1,
380 | self.vocab,
381 | sim_score=sim_cum_allbeam[i].item(),
382 | lm_score=lm_cum_logprob[i].item(),
383 | lm_state=hn[i // len(word_list)],
384 | elmo_state=states_allbeam[i],
385 | align_loc=indices_allbeam[i])
386 | for (m, i) in enumerate(ids_topK)]
387 |
388 | return nexttopKK, endbus
389 |
390 | def combscoreK_GPT2(self, bulist, K, template_vec, ge, LMModel,
391 | word_list=None, subvocab=None, clustermask=None,
392 | mono=True, normalized=True, renorm=False, temperature=1,
393 | bpe2word='last', alpha=0.01, stopbyLMeos=False, ifadditive=False):
394 | """
395 | Given a list of 'BeamUnit', score the next tokens from the candidate word list based on the combination of
396 | sentence similarities and a pretrained language model. Output the top K scored new 'BeamUnit', in a list.
397 |
398 | Input:
399 |         stopbyLMeos: whether to use the LM '<eos>' to solely decide the end of sentence, i.e. when '<eos>' gets the
400 |             highest probability from the LM, remove the generated sentence from the beam. Default: False.
401 |
402 | Note:
403 | 'word_list', 'subvocab', and 'clustermask' should be coupled, sorted based on the full vocabulary.
404 | """
405 |
406 | ## when no candidate word list is provided, use the full vocabulary
407 | if word_list is None:
408 | word_list = self.vocab.itos
409 | subvocab = None
410 | clustermask = None
411 |
412 | ## calculate the similarity scores
413 | endbus = [] # finished sequences
414 | onbeam_ids = list(range(
415 | len(bulist))) # keep track of sequences on beam that have not aligned to the end of the source sequence
416 | sim_cum_allbeam = None
417 | indices_allbeam = None
418 | states_allbeam = []
419 | for (i, bu) in enumerate(bulist):
420 | try:
421 | scores, indices, states = simScoreNext_GPT2(template_vec, word_list, ge,
422 | prevs_state=bu.gpt2_state,
423 | prevs_align=bu.align_loc if mono else None,
424 | normalized=normalized, bpe2word=bpe2word)
425 | scores_logprob = F.log_softmax(scores, dim=0)
426 |
427 | sim_cum_logprob = scores_logprob + torch.tensor(bu.sim_score, dtype=torch.float, device=self.device)
428 |
429 | sim_cum_allbeam = sim_cum_logprob if sim_cum_allbeam is None else torch.cat(
430 | [sim_cum_allbeam, sim_cum_logprob])
431 | indices_allbeam = indices if indices_allbeam is None else torch.cat([indices_allbeam, indices])
432 | states_allbeam = states_allbeam + states
433 |
434 | # current sequence already aligned to the end: move out of beam
435 | except AssertionError as e:
436 | print('AssertionError:', e)
437 | endbus.append((i, bu.seq_len, bu.score, bu.sim_score, bu.lm_score))
438 | onbeam_ids.remove(i)
439 |
440 | ## calculate the RNN LM scores
441 | ## note that LMModel should have the same vocab as that in Beam()
442 | if len(bulist) == 1 and bulist[0].word_id is None:
443 | # first beam step after initialization, only relying on similarity scores and no LM calculation is needed
444 | scores_comb = sim_cum_allbeam
445 | lm_cum_logprob = torch.zeros_like(sim_cum_allbeam)
446 | hn = [None] * len(onbeam_ids) # at the initial step, 'onbeam_ids' wouldn't be empty anyway
447 | else:
448 | ## all sequences have aligned to the end of source sentence
449 | if onbeam_ids == []:
450 | return [], endbus
451 | ## do the RNN LM forward calculation
452 | if bulist[onbeam_ids[0]].lm_state is None:
453 | # 'lm_state' for the current beam is either all 'None' or all not 'None'.
454 | batch_hn = None
455 | else:
456 | batch_hn = (torch.cat([bulist[i].lm_state[0] for i in onbeam_ids], dim=1),
457 | torch.cat([bulist[i].lm_state[1] for i in onbeam_ids], dim=1))
458 | batch_text = next(LMModel.parameters()).new_tensor([bulist[i].word_id for i in onbeam_ids],
459 | dtype=torch.long).unsqueeze(0)
460 | subprobs, probs, hn = prob_next_1step(LMModel, batch_text, hn=batch_hn,
461 | subvocab=subvocab, clustermask=clustermask, onscore=False,
462 | renorm=renorm,
463 | temperature=temperature)
464 |
465 |             ### LM predicts '<eos>' with the highest probability: move out of beam
466 | if stopbyLMeos:
467 | subprobs_max, subprobs_maxids = torch.max(subprobs, dim=1)
468 |                 eospos = (subprobs_maxids == word_list.index('<eos>')).nonzero()
469 | if eospos.size(0) > 0: # number of ended sentences
470 | # Note: have to delete backwards! Otherwise the indices will change.
471 | oob_ids = [onbeam_ids.pop(ep.item()) for ep in eospos.squeeze(1).sort(descending=True)[0]]
472 | oob_ids = sorted(oob_ids)
473 |                     print('-' * 5 + ' <eos> predicted most likely by LM at location:', *oob_ids)
474 | for i in oob_ids:
475 | endbus.append((i, bulist[i].seq_len, bulist[i].score, bulist[i].sim_score, bulist[i].lm_score))
476 |                     # all sequences have been predicted with '<eos>' having the highest probability
477 | if onbeam_ids == []:
478 | return [], endbus
479 | else:
480 | remainpos = [i for i in range(len(subprobs)) if i not in eospos]
481 | subprobs = subprobs[remainpos, :]
482 | probs = probs[remainpos, :]
483 | hn = (hn[0][:, remainpos, :], hn[1][:, remainpos, :])
484 | remainpos_simallbeam = []
485 | for rp in remainpos:
486 | remainpos_simallbeam += list(range(len(word_list) * rp, len(word_list) * (rp + 1)))
487 | sim_cum_allbeam = sim_cum_allbeam[remainpos_simallbeam]
488 | indices_allbeam = indices_allbeam[remainpos_simallbeam]
489 | states_allbeam = [s for (i, s) in enumerate(states_allbeam) if i in remainpos_simallbeam]
490 |
491 | # convert the hidden state tuple into a list of tuples, corresponding to each beam sequence
492 | hn = list(zip(torch.chunk(hn[0], chunks=len(onbeam_ids), dim=1),
493 | torch.chunk(hn[1], chunks=len(onbeam_ids), dim=1)))
494 | lm_cum_logprob = subprobs.new_tensor([bulist[i].lm_score for i in onbeam_ids]).unsqueeze(1) + torch.log(
495 | subprobs)
496 | lm_cum_logprob = lm_cum_logprob.view(-1) # this is the cumulative log probabilities
497 |
498 | if ifadditive:
499 | scores_comb = torch.log((1 - alpha) * torch.exp(sim_cum_allbeam) + alpha * torch.exp(lm_cum_logprob))
500 | else:
501 | scores_comb = (1 - alpha) * sim_cum_allbeam + alpha * lm_cum_logprob
502 |
503 | ## rank and update
504 | if K > len(scores_comb):
505 | scores_comb_sorted, ids_sorted = scores_comb.sort(descending=True)
506 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]],
507 | bulist[onbeam_ids[i // len(word_list)]].cur_loc,
508 | m,
509 | scores_comb_sorted[m].item(),
510 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1,
511 | self.vocab,
512 | sim_score=sim_cum_allbeam[i].item(),
513 | lm_score=lm_cum_logprob[i].item(),
514 | lm_state=hn[i // len(word_list)],
515 | gpt2_state=states_allbeam[i],
516 | align_loc=indices_allbeam[i])
517 | for (m, i) in enumerate(ids_sorted)]
518 | else:
519 | scores_comb_topK, ids_topK = scores_comb.topk(K)
520 | nexttopKK = [BeamUnit(self.vocab.stoi[word_list[i % len(word_list)]],
521 | bulist[onbeam_ids[i // len(word_list)]].cur_loc,
522 | m,
523 | scores_comb_topK[m].item(),
524 | bulist[onbeam_ids[i // len(word_list)]].seq_len + 1,
525 | self.vocab,
526 | sim_score=sim_cum_allbeam[i].item(),
527 | lm_score=lm_cum_logprob[i].item(),
528 | lm_state=hn[i // len(word_list)],
529 | gpt2_state=states_allbeam[i],
530 | align_loc=indices_allbeam[i])
531 | for (m, i) in enumerate(ids_topK)]
532 |
533 | return nexttopKK, endbus
534 |
--------------------------------------------------------------------------------
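A minimal, self-contained sketch (not part of the repository) of how combscoreK / combscoreK_GPT2 above rank candidates: the cumulative similarity and LM log-probabilities are interpolated (log-linearly by default, or as an additive mixture when ifadditive=True), and each flat top-K index is mapped back to a (beam, word) pair via integer division and modulo with the candidate word list size. The tensors below are random placeholders standing in for the real cumulative scores.

    import torch

    beam_size, n_words, K, alpha = 2, 5, 3, 0.1
    sim_cum = torch.randn(beam_size * n_words)   # cumulative similarity log-probs, flattened over (beam, word)
    lm_cum = torch.randn(beam_size * n_words)    # cumulative LM log-probs, same layout

    scores = (1 - alpha) * sim_cum + alpha * lm_cum                            # default (log-linear) combination
    # scores = torch.log((1 - alpha) * sim_cum.exp() + alpha * lm_cum.exp())   # the ifadditive=True variant

    top_scores, top_ids = scores.topk(K)
    for rank, i in enumerate(top_ids.tolist()):
        beam_idx, word_idx = divmod(i, n_words)
        print(f'rank {rank}: beam {beam_idx}, word {word_idx}, score {top_scores[rank]:.3f}')
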
/uss/elmo_lstm_forward.py:
--------------------------------------------------------------------------------
1 | """
2 | A stacked forward only LSTM with skip connections between layers.
3 |
4 | Modified from allennlp/modules/elmo_lstm.py.
5 | """
6 | from typing import Optional, Tuple, List, Union
7 | import warnings
8 |
9 | import torch
10 | from torch.nn.utils.rnn import PackedSequence, pad_packed_sequence
11 | with warnings.catch_warnings():
12 | warnings.filterwarnings("ignore", category=FutureWarning)
13 | import h5py
14 | import numpy
15 |
16 | from allennlp.modules.lstm_cell_with_projection import LstmCellWithProjection
17 | from allennlp.common.checks import ConfigurationError
18 | from allennlp.modules.encoder_base import _EncoderBase
19 | from allennlp.common.file_utils import cached_path
20 |
21 | RnnState = Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]] # pylint: disable=invalid-name
22 |
23 | class ElmoLstmForward(_EncoderBase):
24 | """
25 |     A stacked, forward-only LSTM which uses
26 |     :class:`~allennlp.modules.lstm_cell_with_projection.LstmCellWithProjection`'s
27 |     with highway layers between the inputs to layers.
28 |     Only the forward direction of the original bidirectional ELMo LSTM is kept, so there are
29 |     no backward layers and no concatenation of directions between layers.
30 |     Unlike the original ELMo LSTM, this module does not maintain its `own` state: it is
31 |     constructed with ``stateful=False``, so the hidden and memory states are passed into
32 |     ``forward`` externally and the final states are returned to the caller.
33 |     This allows the states to be carried across successive calls, e.g. when a sentence is
34 |     embedded token by token, the states from the previous tokens are supplied explicitly
35 |     rather than being cached inside the module.
36 | Parameters
37 | ----------
38 | input_size : ``int``, required
39 | The dimension of the inputs to the LSTM.
40 | hidden_size : ``int``, required
41 | The dimension of the outputs of the LSTM.
42 | cell_size : ``int``, required.
43 | The dimension of the memory cell of the
44 | :class:`~allennlp.modules.lstm_cell_with_projection.LstmCellWithProjection`.
45 | num_layers : ``int``, required
46 |         The number of stacked (forward-only) LSTM layers to use.
47 | requires_grad: ``bool``, optional
48 | If True, compute gradient of ELMo parameters for fine tuning.
49 | recurrent_dropout_probability: ``float``, optional (default = 0.0)
50 | The dropout probability to be used in a dropout scheme as stated in
51 | `A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
52 |         <https://arxiv.org/abs/1512.05287>`_ .
53 | state_projection_clip_value: ``float``, optional, (default = None)
54 | The magnitude with which to clip the hidden_state after projecting it.
55 | memory_cell_clip_value: ``float``, optional, (default = None)
56 | The magnitude with which to clip the memory cell.
57 | """
58 | def __init__(self,
59 | input_size: int,
60 | hidden_size: int,
61 | cell_size: int,
62 | num_layers: int,
63 | requires_grad: bool = False,
64 | recurrent_dropout_probability: float = 0.0,
65 | memory_cell_clip_value: Optional[float] = None,
66 | state_projection_clip_value: Optional[float] = None) -> None:
67 | super(ElmoLstmForward, self).__init__(stateful=False) # change 'stateful' flag to be False
68 | # so that hidden_state can be externally provided
69 |
70 | # Required to be wrapped with a :class:`PytorchSeq2SeqWrapper`.
71 | self.input_size = input_size
72 | self.hidden_size = hidden_size
73 | self.num_layers = num_layers
74 | self.cell_size = cell_size
75 | self.requires_grad = requires_grad
76 |
77 | forward_layers = []
78 |
79 | lstm_input_size = input_size
80 | go_forward = True
81 | for layer_index in range(num_layers):
82 | forward_layer = LstmCellWithProjection(lstm_input_size,
83 | hidden_size,
84 | cell_size,
85 | go_forward,
86 | recurrent_dropout_probability,
87 | memory_cell_clip_value,
88 | state_projection_clip_value)
89 |
90 | lstm_input_size = hidden_size
91 |
92 | self.add_module('forward_layer_{}'.format(layer_index), forward_layer)
93 |
94 | forward_layers.append(forward_layer)
95 |
96 | self.forward_layers = forward_layers
97 |
98 |
99 | def forward(self, # pylint: disable=arguments-differ
100 | inputs: torch.Tensor,
101 | mask: torch.LongTensor,
102 |                 hidden_state: Optional[RnnState] = None) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
103 | """
104 | Parameters
105 | ----------
106 | inputs : ``torch.Tensor``, required.
107 | A Tensor of shape ``(batch_size, sequence_length, embedding_size)``.
108 | mask : ``torch.LongTensor``, required.
109 | A binary mask of shape ``(batch_size, sequence_length)`` representing the
110 | non-padded elements in each sequence in the batch.
111 | hidden_state : ``Optional[RnnState]``, (default = None).
112 | A single tensor of shape (num_layers, batch_size, hidden_size) representing the
113 |             state of an RNN, or a tuple of
114 | tensors of shapes (num_layers, batch_size, hidden_size) and
115 | (num_layers, batch_size, memory_size), representing the hidden state and memory
116 | state of an LSTM-like RNN.
117 | Returns
118 | -------
119 | A ``torch.Tensor`` of shape (num_layers, batch_size, sequence_length, hidden_size),
120 | where the num_layers dimension represents the LSTM output from that layer.
121 | A tuple of
122 | tensors of shapes (num_layers, batch_size, hidden_size) and
123 | (num_layers, batch_size, memory_size), representing the final hidden state and memory
124 | state of an LSTM-like RNN.
125 | """
126 | batch_size, total_sequence_length = mask.size()
127 | stacked_sequence_output, final_states, restoration_indices = \
128 | self.sort_and_run_forward(self._lstm_forward, inputs, mask, hidden_state) # add 'hidden_state' here
129 |
130 | num_layers, num_valid, returned_timesteps, encoder_dim = stacked_sequence_output.size()
131 | # Add back invalid rows which were removed in the call to sort_and_run_forward.
132 | if num_valid < batch_size:
133 | zeros = stacked_sequence_output.new_zeros(num_layers,
134 | batch_size - num_valid,
135 | returned_timesteps,
136 | encoder_dim)
137 | stacked_sequence_output = torch.cat([stacked_sequence_output, zeros], 1)
138 |
139 | # The states also need to have invalid rows added back.
140 | new_states = []
141 | for state in final_states:
142 | state_dim = state.size(-1)
143 | zeros = state.new_zeros(num_layers, batch_size - num_valid, state_dim)
144 | new_states.append(torch.cat([state, zeros], 1))
145 | final_states = new_states
146 |
147 | # It's possible to need to pass sequences which are padded to longer than the
148 | # max length of the sequence to a Seq2StackEncoder. However, packing and unpacking
149 | # the sequences mean that the returned tensor won't include these dimensions, because
150 | # the RNN did not need to process them. We add them back on in the form of zeros here.
151 | sequence_length_difference = total_sequence_length - returned_timesteps
152 | if sequence_length_difference > 0:
153 | zeros = stacked_sequence_output.new_zeros(num_layers,
154 | batch_size,
155 | sequence_length_difference,
156 | stacked_sequence_output[0].size(-1))
157 | stacked_sequence_output = torch.cat([stacked_sequence_output, zeros], 2)
158 |
159 | # self._update_states(final_states, restoration_indices)
160 |
161 | # Restore the original indices and return the sequence.
162 | # Has shape (num_layers, batch_size, sequence_length, hidden_size)
163 | return stacked_sequence_output.index_select(1, restoration_indices), \
164 | tuple([state.index_select(1, restoration_indices).detach() for state in final_states]) # detach final states
165 |
166 | def _lstm_forward(self,
167 | inputs: PackedSequence,
168 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> \
169 | Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
170 | """
171 | Parameters
172 | ----------
173 | inputs : ``PackedSequence``, required.
174 | A batch first ``PackedSequence`` to run the stacked LSTM over.
175 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None)
176 | A tuple (state, memory) representing the initial hidden state and memory
177 | of the LSTM, with shape (num_layers, batch_size, 1 * hidden_size) and
178 | (num_layers, batch_size, 1 * cell_size) respectively.
179 | Returns
180 | -------
181 | output_sequence : ``torch.FloatTensor``
182 | The encoded sequence of shape (num_layers, batch_size, sequence_length, hidden_size)
183 | final_states: ``Tuple[torch.FloatTensor, torch.FloatTensor]``
184 | The per-layer final (state, memory) states of the LSTM, with shape
185 | (num_layers, batch_size, 1 * hidden_size) and (num_layers, batch_size, 1 * cell_size)
186 | respectively. The last dimension is NOT duplicated because it contains the state/memory
187 | for ONLY the forward layers.
188 | """
189 | if initial_state is None:
190 | hidden_states: List[Optional[Tuple[torch.Tensor,
191 | torch.Tensor]]] = [None] * len(self.forward_layers)
192 | elif initial_state[0].size()[0] != len(self.forward_layers):
193 | raise ConfigurationError("Initial states were passed to forward() but the number of "
194 | "initial states does not match the number of layers.")
195 | else:
196 | hidden_states = list(zip(initial_state[0].split(1, 0), initial_state[1].split(1, 0)))
197 | # list of tuples, each one is a (hidden, memory) tuple for that layer
198 |
199 | inputs, batch_lengths = pad_packed_sequence(inputs, batch_first=True)
200 | forward_output_sequence = inputs
201 |
202 | final_states = []
203 | sequence_outputs = []
204 | for layer_index, state in enumerate(hidden_states):
205 | forward_layer = getattr(self, 'forward_layer_{}'.format(layer_index))
206 |
207 | forward_cache = forward_output_sequence
208 |
209 | if state is not None:
210 | forward_hidden_state = state[0]
211 | forward_memory_state = state[1]
212 | forward_state = (forward_hidden_state, forward_memory_state)
213 | else:
214 | forward_state = None
215 |
216 | forward_output_sequence, forward_state = forward_layer(forward_output_sequence,
217 | batch_lengths,
218 | forward_state)
219 |
220 | # Skip connections, just adding the input to the output.
221 | if layer_index != 0:
222 | forward_output_sequence += forward_cache
223 |
224 | sequence_outputs.append(forward_output_sequence)
225 | # Append the state tuples in a list, so that we can return
226 | # the final states for all the layers.
227 | final_states.append((forward_state[0],
228 | forward_state[1]))
229 |
230 | stacked_sequence_outputs: torch.FloatTensor = torch.stack(sequence_outputs)
231 | # Stack the hidden state and memory for each layer into 2 tensors of shape
232 | # (num_layers, batch_size, hidden_size) and (num_layers, batch_size, cell_size)
233 | # respectively.
234 | final_hidden_states, final_memory_states = zip(*final_states)
235 | final_state_tuple: Tuple[torch.FloatTensor,
236 | torch.FloatTensor] = (torch.cat(final_hidden_states, 0),
237 | torch.cat(final_memory_states, 0))
238 | return stacked_sequence_outputs, final_state_tuple
239 |
240 | def load_weights(self, weight_file: str) -> None:
241 | """
242 | Load the pre-trained weights from the file.
243 | """
244 | requires_grad = self.requires_grad
245 |
246 | with h5py.File(cached_path(weight_file), 'r') as fin:
247 | for i_layer, lstms in enumerate(
248 | zip(self.forward_layers)
249 | ):
250 | for j_direction, lstm in enumerate(lstms):
251 | # lstm is an instance of LSTMCellWithProjection
252 | cell_size = lstm.cell_size
253 |
254 | dataset = fin['RNN_%s' % j_direction]['RNN']['MultiRNNCell']['Cell%s' % i_layer
255 | ]['LSTMCell']
256 |
257 | # tensorflow packs together both W and U matrices into one matrix,
258 | # but pytorch maintains individual matrices. In addition, tensorflow
259 | # packs the gates as input, memory, forget, output but pytorch
260 | # uses input, forget, memory, output. So we need to modify the weights.
261 | tf_weights = numpy.transpose(dataset['W_0'][...])
262 | torch_weights = tf_weights.copy()
263 |
264 | # split the W from U matrices
265 | input_size = lstm.input_size
266 | input_weights = torch_weights[:, :input_size]
267 | recurrent_weights = torch_weights[:, input_size:]
268 | tf_input_weights = tf_weights[:, :input_size]
269 | tf_recurrent_weights = tf_weights[:, input_size:]
270 |
271 | # handle the different gate order convention
272 | for torch_w, tf_w in [[input_weights, tf_input_weights],
273 | [recurrent_weights, tf_recurrent_weights]]:
274 | torch_w[(1 * cell_size):(2 * cell_size), :] = tf_w[(2 * cell_size):(3 * cell_size), :]
275 | torch_w[(2 * cell_size):(3 * cell_size), :] = tf_w[(1 * cell_size):(2 * cell_size), :]
276 |
277 | lstm.input_linearity.weight.data.copy_(torch.FloatTensor(input_weights))
278 | lstm.state_linearity.weight.data.copy_(torch.FloatTensor(recurrent_weights))
279 | lstm.input_linearity.weight.requires_grad = requires_grad
280 | lstm.state_linearity.weight.requires_grad = requires_grad
281 |
282 | # the bias weights
283 | tf_bias = dataset['B'][...]
284 | # tensorflow adds 1.0 to forget gate bias instead of modifying the
285 | # parameters...
286 | tf_bias[(2 * cell_size):(3 * cell_size)] += 1
287 | torch_bias = tf_bias.copy()
288 | torch_bias[(1 * cell_size):(2 * cell_size)
289 | ] = tf_bias[(2 * cell_size):(3 * cell_size)]
290 | torch_bias[(2 * cell_size):(3 * cell_size)
291 | ] = tf_bias[(1 * cell_size):(2 * cell_size)]
292 | lstm.state_linearity.bias.data.copy_(torch.FloatTensor(torch_bias))
293 | lstm.state_linearity.bias.requires_grad = requires_grad
294 |
295 | # the projection weights
296 | proj_weights = numpy.transpose(dataset['W_P_0'][...])
297 | lstm.state_projection.weight.data.copy_(torch.FloatTensor(proj_weights))
298 | lstm.state_projection.weight.requires_grad = requires_grad
299 |
--------------------------------------------------------------------------------
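A minimal sketch (not repository code) of the gate reordering performed in ElmoLstmForward.load_weights above: TensorFlow stores the LSTM gates as (input, memory, forget, output) along the row dimension of the weight matrix, while PyTorch expects (input, forget, memory, output), so the second and third blocks of cell_size rows are swapped. The toy matrix below is a placeholder for the real pretrained weights.

    import numpy

    cell_size, in_dim = 2, 3
    tf_w = numpy.arange(4 * cell_size * in_dim, dtype=float).reshape(4 * cell_size, in_dim)  # fake TF-ordered weights

    torch_w = tf_w.copy()
    torch_w[1 * cell_size:2 * cell_size, :] = tf_w[2 * cell_size:3 * cell_size, :]  # forget gate block
    torch_w[2 * cell_size:3 * cell_size, :] = tf_w[1 * cell_size:2 * cell_size, :]  # memory (cell) gate block
    print(torch_w)

The bias vector in load_weights gets the same swap, after TensorFlow's forget-gate offset of 1.0 has been added.
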
/uss/elmo_sequential_embedder.py:
--------------------------------------------------------------------------------
1 | """
2 | Sequentially embed tokens into ELMo vectors, using only forward computation, with externally updated hidden states.
3 |
4 | Based on allennlp.commands.elmo.ElmoEmbedder and allennlp.modules.elmo._ElmoBiLm.
5 | """
6 |
7 | import json
8 | import logging
9 | from typing import List, Iterable, Tuple, Any, Optional, Dict
10 | import warnings
11 |
12 | # with warnings.catch_warnings():
13 | # warnings.filterwarnings("ignore", category=FutureWarning)
14 | # import h5py
15 | warnings.filterwarnings('ignore', message='numpy.dtype size changed')
16 | warnings.filterwarnings('ignore', message='numpy.ufunc size changed')
17 |
18 | import numpy
19 | import torch
20 | from overrides import overrides
21 |
22 | from allennlp.common.file_utils import cached_path
23 | from allennlp.common.tqdm import Tqdm
24 | from allennlp.common.util import lazy_groups_of
25 | from allennlp.common.checks import ConfigurationError
26 | from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper
27 | from allennlp.modules.elmo import batch_to_ids, _ElmoCharacterEncoder
28 | from allennlp.nn.util import get_device_of
29 | from elmo_lstm_forward import ElmoLstmForward
30 |
31 | logger = logging.getLogger(__name__) # pylint: disable=invalid-name
32 |
33 | DEFAULT_OPTIONS_FILE = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json" # pylint: disable=line-too-long
34 | DEFAULT_WEIGHT_FILE = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5" # pylint: disable=line-too-long
35 | DEFAULT_BATCH_SIZE = 64
36 |
37 |
38 | class ElmoEmbedderForward(torch.nn.Module):
39 | def __init__(self,
40 | options_file: str = DEFAULT_OPTIONS_FILE,
41 | weight_file: str = DEFAULT_WEIGHT_FILE,
42 | requires_grad: bool = False,
43 | vocab_to_cache: List[str] = None,
44 | cuda_device: int = -1) -> None:
45 | super(ElmoEmbedderForward, self).__init__()
46 |
47 | self._token_embedder = _ElmoCharacterEncoder2(options_file, weight_file, requires_grad=requires_grad)
48 |
49 | self._requires_grad = requires_grad
50 | if requires_grad and vocab_to_cache:
51 | logging.warning("You are fine tuning ELMo and caching char CNN word vectors. "
52 | "This behaviour is not guaranteed to be well defined, particularly. "
53 | "if not all of your inputs will occur in the vocabulary cache.")
54 | # This is an embedding, used to look up cached
55 | # word vectors built from character level cnn embeddings.
56 | self._word_embedding = None
57 | self._bos_embedding: torch.Tensor = None
58 | self._eos_embedding: torch.Tensor = None
59 | if vocab_to_cache:
60 | logging.info("Caching character cnn layers for words in vocabulary.")
61 | # This sets 3 attributes, _word_embedding, _bos_embedding and _eos_embedding.
62 | # They are set in the method so they can be accessed from outside the
63 | # constructor.
64 | self.create_cached_cnn_embeddings(vocab_to_cache)
65 | self.vocab = vocab_to_cache # the first token should be the padding token, with id = 0
66 |
67 | with open(cached_path(options_file), 'r') as fin:
68 | options = json.load(fin)
69 | if not options['lstm'].get('use_skip_connections'):
70 | raise ConfigurationError('We only support pretrained biLMs with residual connections')
71 |
72 | logger.info("Initializing ELMo Forward.")
73 | self._elmo_lstm_forward = ElmoLstmForward(input_size=options['lstm']['projection_dim'],
74 | hidden_size=options['lstm']['projection_dim'],
75 | cell_size=options['lstm']['dim'],
76 | num_layers=options['lstm']['n_layers'],
77 | memory_cell_clip_value=options['lstm']['cell_clip'],
78 | state_projection_clip_value=options['lstm']['proj_clip'],
79 | requires_grad=requires_grad)
80 | self._elmo_lstm_forward.load_weights(weight_file)
81 | if cuda_device >= 0:
82 | self._elmo_lstm_forward = self._elmo_lstm_forward.cuda(device=cuda_device)
83 | self._token_embedder = self._token_embedder.cuda(device=cuda_device)
84 | # self.cuda(device=cuda_device) # this happens in-place
85 | self.cuda_device = cuda_device if cuda_device >= 0 else 'cpu'
86 | # Number of representation layers including context independent layer
87 | self.num_layers = options['lstm']['n_layers'] + 1
88 |
89 | def batch_to_embeddings(self,
90 | batch: List[List[str]],
91 | add_bos: bool = False,
92 | add_eos: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
93 | """
94 | Compute sentence insensitive token representations for a batch of tokenized sentences,
95 | using pretrained character level CNN. This is the first layer of ELMo representation.
96 |
97 | Parameters
98 | ----------
99 | batch : ``List[List[str]]``, required
100 | A list of tokenized sentences.
101 | add_bos: ``bool``
102 | Whether to add begin of sentence token.
103 | add_eos: ``bool``
104 | Whether to add end of sentence token.
105 | Returns
106 | -------
107 | type_representation: ``torch.Tensor``
108 | Shape ``(batch_size, sequence_length + 0/1/2, embedding_dim)`` tensor with context
109 | insensitive token representations.
110 | mask: ``torch.Tensor``
111 | Shape ``(batch_size, sequence_length + 0/1/2)`` long tensor with sequence mask.
112 | """
113 |
114 | if self._word_embedding is not None: # vocab_to_cache was passed in the constructor of this class
115 | try:
116 | word_inputs = [[self.vocab.index(w) for w in b] for b in batch]
117 | max_timesteps = max([len(b) for b in word_inputs])
118 | word_inputs = [b + [0] * (max_timesteps - len(b)) if len(b) < max_timesteps else b
119 | for b in word_inputs] # 0 is the padding id
120 | word_inputs = torch.tensor(word_inputs, dtype=torch.long, device=self.cuda_device)
121 | # word ids in the cached vocabulary
122 | # LongTensor of shape (batch_size, max_timesteps)
123 |
124 | mask_without_bos_eos = (word_inputs > 0).long()
125 | # The character cnn part is cached - just look it up.
126 | embedded_inputs = self._word_embedding(word_inputs) # type: ignore
127 | # shape (batch_size, timesteps + 0/1/2, embedding_dim)
128 | type_representation, mask = add_sentence_boundaries(
129 | embedded_inputs,
130 | mask_without_bos_eos,
131 | self._bos_embedding,
132 | self._eos_embedding,
133 | add_bos,
134 | add_eos
135 | )
136 | except RuntimeError:
137 | character_ids = batch_to_ids(batch) # size (batch_size, max_timesteps, 50)
138 |                 if self.cuda_device != 'cpu':  # self.cuda_device is either 'cpu' or a GPU index (int)
139 | character_ids = character_ids.cuda(device=self.cuda_device)
140 | # Back off to running the character convolutions,
141 | # as we might not have the words in the cache.
142 | token_embedding = self._token_embedder(character_ids, add_bos, add_eos)
143 | mask = token_embedding['mask']
144 | type_representation = token_embedding['token_embedding']
145 | else:
146 | character_ids = batch_to_ids(batch) # size (batch_size, max_timesteps, 50)
147 |             if self.cuda_device != 'cpu':
148 | character_ids = character_ids.cuda(device=self.cuda_device)
149 | token_embedding = self._token_embedder(character_ids, add_bos, add_eos)
150 | mask = token_embedding['mask']
151 | type_representation = token_embedding['token_embedding']
152 |
153 | return type_representation, mask
154 |
155 | def forward(self,
156 | batch: List[List[str]],
157 | add_bos: bool = False,
158 | add_eos: bool = False,
159 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> \
160 | Tuple[List[numpy.ndarray], Tuple[torch.Tensor, torch.Tensor]]:
161 | """
162 | Parameters
163 | ----------
164 | batch : ``List[List[str]]``, required
165 | A list of tokenized sentences.
166 | add_bos: ``bool``
167 | Whether to add begin of sentence token.
168 | add_eos: ``bool``
169 | Whether to add end of sentence token.
170 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None)
171 | A tuple (state, memory) representing the initial hidden state and memory
172 | of the LSTM, with shape (num_layers, batch_size, 1 * hidden_size) and
173 | (num_layers, batch_size, 1 * cell_size) respectively.
174 |
175 | Or, with shape (num_layers, 1 * hidden_size) and
176 | (num_layers, 1 * cell_size) respectively, if all the batch share the same initial_state.
177 |
178 | Returns
179 | -------
180 | lstm_outputs : ``torch.FloatTensor``
181 | The encoded sequence of shape (num_layers, batch_size, sequence_length, hidden_size)
182 | final_states : ``Tuple[torch.FloatTensor, torch.FloatTensor]``
183 | The per-layer final (state, memory) states of the LSTM, with shape
184 | (num_layers, batch_size, 1 * hidden_size) and (num_layers, batch_size, 1 * cell_size)
185 | respectively. The last dimension is NOT duplicated because it contains the state/memory
186 | for ONLY the forward layers.
187 |
188 | elmo_embeddings: ``list[numpy.ndarray]``
189 | A list of tensors, each representing the ELMo vectors for the input sentence at the same index.
190 | """
191 | batch_size = len(batch)
192 | if initial_state is not None: # TO DO: need to deal with changing batch size
193 | initial_state_shape = list(initial_state[0].size())
194 | if len(initial_state_shape) == 2:
195 | initial_state = (initial_state[0].expand(batch_size, -1, -1).transpose(0, 1),
196 | initial_state[1].expand(batch_size, -1, -1).transpose(0, 1))
197 | elif len(initial_state_shape) == 3:
198 | pass
199 | else:
200 | raise ValueError("initial_state only accepts tuple of 2D or 3D input")
201 |
202 | token_embedding, mask = self.batch_to_embeddings(batch, add_bos, add_eos)
203 | lstm_outputs, final_states = self._elmo_lstm_forward(token_embedding, mask, initial_state)
204 |
205 | # Prepare the output. The first layer is duplicated.
206 | # Because of minor differences in how masking is applied depending
207 | # on whether the char cnn layers are cached, we'll be defensive and
208 | # multiply by the mask here. It's not strictly necessary, as the
209 | # mask passed on is correct, but the values in the padded areas
210 | # of the char cnn representations can change.
211 |
212 | output_tensors = [token_embedding * mask.float().unsqueeze(-1)]
213 | for layer_activations in torch.chunk(lstm_outputs, lstm_outputs.size(0), dim=0):
214 | output_tensors.append(layer_activations.squeeze(0))
215 |
216 | # without_bos_eos is a 3 element list of tuples of (batch_size, num_timesteps, dim) and
217 | # (batch_size, num_timesteps) tensors, each element representing a layer.
218 | without_bos_eos = [remove_sentence_boundaries(layer, mask, add_bos, add_eos)
219 | for layer in output_tensors]
220 | # Split the list of tuples into two tuples, each of length 3
221 | activations_without_bos_eos, mask_without_bos_eos = zip(*without_bos_eos)
222 |
223 | # Convert the activations_without_bos_eos into a single batch first tensor,
224 | # of size (batch_size, num_layers, num_timesteps, dim)
225 | activations = torch.cat([ele.unsqueeze(1) for ele in activations_without_bos_eos], dim=1)
226 | # The mask is the same for each ELMo layer, so just take the first.
227 | mask_without_bos_eos = mask_without_bos_eos[0]
228 |
229 | # organize the Elmo embeddings into a list corresponding to the batch of sentences
230 | elmo_embeddings = []
231 | for i in range(batch_size):
232 | length = int(mask_without_bos_eos[i, :].sum())
233 | if length == 0:
234 |                 raise ConfigurationError('There exists a totally masked-out sequence in the batch.')
235 | else:
236 | # elmo_embeddings.append(activations[i, :, :length, :].detach().cpu().numpy())
237 | elmo_embeddings.append(activations[i, :, :length, :].detach())
238 |
239 | return elmo_embeddings, final_states
240 |
241 | def embed_sentence(self,
242 | sentence: List[str],
243 | add_bos: bool = False,
244 | add_eos: bool = False,
245 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> \
246 | Tuple[numpy.ndarray, Tuple[torch.Tensor, torch.Tensor]]:
247 | """
248 | Computes the forward only ELMo embeddings for a single tokenized sentence.
249 | See the comment under the class definition.
250 | Parameters
251 | ----------
252 | sentence : ``List[str]``, required
253 | A tokenized sentence.
254 | add_bos: ``bool``
255 | Whether to add begin of sentence token.
256 | add_eos: ``bool``
257 | Whether to add end of sentence token.
258 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None)
259 | A tuple (state, memory) representing the initial hidden state and memory
260 | of the LSTM, with shape (num_layers, 1, 1 * hidden_size) and
261 | (num_layers, 1, 1 * cell_size) respectively.
262 |
263 | Or, with shape (num_layers, 1 * hidden_size) and
264 | (num_layers, 1 * cell_size) respectively.
265 | Returns
266 | -------
267 | A tensor containing the ELMo vectors, and
268 | final states, tuple of size (num_layers, hidden_size) and (num_layers, memory_size).
269 | """
270 | elmo_embeddings, final_states = self.forward([sentence], add_bos, add_eos, initial_state)
271 |
272 | return elmo_embeddings[0], tuple([ele.squeeze(1) for ele in final_states])
273 |
274 | def embed_sentences(self,
275 | sentences: Iterable[List[str]],
276 | add_bos: bool = False,
277 | add_eos: bool = False,
278 | initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
279 | batch_size: int = DEFAULT_BATCH_SIZE) -> \
280 | List[Tuple[numpy.ndarray, Tuple[torch.Tensor, torch.Tensor]]]:
281 | """
282 |         Computes the forward only ELMo embeddings for an iterable of sentences.
283 | See the comment under the class definition.
284 | Parameters
285 | ----------
286 | sentences : ``Iterable[List[str]]``, required
287 | An iterable of tokenized sentences.
288 | add_bos: ``bool``
289 | Whether to add begin of sentence token.
290 | add_eos: ``bool``
291 | Whether to add end of sentence token.
292 | initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None)
293 | A tuple (state, memory) representing the initial hidden state and memory
294 | of the LSTM, with shape (num_layers, batch_size, 1 * hidden_size) and
295 | (num_layers, batch_size, 1 * cell_size) respectively.
296 |
297 | Or, with shape (num_layers, 1 * hidden_size) and
298 | (num_layers, 1 * cell_size) respectively, if all the batch share the same initial_state.
299 | batch_size : ``int``, required
300 | The number of sentences ELMo should process at once.
301 | Returns
302 | -------
303 | A list of tuple of (numpy.ndarray/torch.Tensor, (torch.Tensor, torch.Tensor)),
304 | each representing the ELMo vectors for the input sentence
305 | at the same index, and the final states after running that sentence, with shape (num_layers, hidden_size) and
306 | (num_layers, cell_size) respectively.
307 | (The return type could also be a generator. Can convert to a list using list().)
308 | """
309 | embeddings_and_states = []
310 | print('Embedding sentences into forward ELMo vectors ---')
311 | # for batch in Tqdm.tqdm(lazy_groups_of(iter(sentences), batch_size)):
312 | for batch in lazy_groups_of(iter(sentences), batch_size):
313 | elmo_embeddings, final_states = self.forward(batch, add_bos, add_eos, initial_state)
314 | # Remember: final_states is a tuple of tensors
315 | final_states_chunked = []
316 | for i in range(2):
317 | final_states_chunked.append(list(map(lambda x: torch.squeeze(x, dim=1),
318 | final_states[i].chunk(final_states[i].size(1), dim=1))))
319 | final_states_chunked = list(zip(*final_states_chunked))
320 | assert len(elmo_embeddings) == len(final_states_chunked), 'length of embeddings and final states mismatch'
321 | # yield from zip(elmo_embeddings, final_states_chunked)
322 | embeddings_and_states += list(zip(elmo_embeddings, final_states_chunked))
323 | return embeddings_and_states
324 |
325 | def create_cached_cnn_embeddings(self, tokens: List[str]) -> None:
326 | """
327 | Given a list of tokens, this method precomputes word representations
328 | by running just the character convolutions and highway layers of elmo,
329 | essentially creating uncontextual word vectors. On subsequent forward passes,
330 | the word ids are looked up from an embedding, rather than being computed on
331 | the fly via the CNN encoder.
332 | This function sets 3 attributes:
333 | _word_embedding : ``torch.Tensor``
334 | The word embedding for each word in the tokens passed to this method.
335 | _bos_embedding : ``torch.Tensor``
336 | The embedding for the BOS token.
337 | _eos_embedding : ``torch.Tensor``
338 | The embedding for the EOS token.
339 | Parameters
340 | ----------
341 | tokens : ``List[str]``, required.
342 | A list of tokens to precompute character convolutions for.
343 | """
344 | tokens = [ELMoCharacterMapper.bos_token, ELMoCharacterMapper.eos_token] + tokens
345 | timesteps = 32
346 | batch_size = 32
347 | chunked_tokens = lazy_groups_of(iter(tokens), timesteps)
348 |
349 | all_embeddings = []
350 | device = get_device_of(next(self.parameters()))
351 | for batch in lazy_groups_of(chunked_tokens, batch_size):
352 | # Shape (batch_size, timesteps, 50)
353 | batched_tensor = batch_to_ids(batch)
354 | # NOTE: This device check is for when a user calls this method having
355 | # already placed the model on a device. If this is called in the
356 | # constructor, it will probably happen on the CPU. This isn't too bad,
357 | # because it's only a few convolutions and will likely be very fast.
358 | if device >= 0:
359 | batched_tensor = batched_tensor.cuda(device)
360 | output = self._token_embedder(batched_tensor, add_bos=False, add_eos=False)
361 | token_embedding = output["token_embedding"]
362 | mask = output["mask"]
363 | token_embedding, _ = remove_sentence_boundaries(token_embedding, mask, rmv_bos=False, rmv_eos=False)
364 | all_embeddings.append(token_embedding.view(-1, token_embedding.size(-1)))
365 | full_embedding = torch.cat(all_embeddings, 0)
366 |
367 | # We might have some trailing embeddings from padding in the batch, so
368 | # we clip the embedding and lookup to the right size.
369 | full_embedding = full_embedding[:len(tokens), :]
370 | embedding = full_embedding[2:len(tokens), :]
371 | vocab_size, embedding_dim = list(embedding.size())
372 |
373 | from allennlp.modules.token_embedders import Embedding # type: ignore
374 | self._bos_embedding = full_embedding[0, :]
375 | self._eos_embedding = full_embedding[1, :]
376 | self._word_embedding = Embedding(vocab_size, # type: ignore
377 | embedding_dim,
378 | weight=embedding.data,
379 | trainable=self._requires_grad,
380 | padding_index=0)
381 |
382 |
383 | class _ElmoCharacterEncoder2(_ElmoCharacterEncoder):
384 | @overrides
385 | def forward(self,
386 | inputs: torch.Tensor,
387 | add_bos: bool = False,
388 | add_eos: bool = False) -> Dict[str, torch.Tensor]: # pylint: disable=arguments-differ
389 | """
390 | Compute context insensitive token embeddings for ELMo representations.
391 | Parameters
392 | ----------
393 | inputs: ``torch.Tensor``
394 | Shape ``(batch_size, sequence_length, 50)`` of character ids representing the
395 | current batch.
396 | add_bos: ``bool``
397 | Whether to add begin of sentence symbol
398 | add_eos: ``bool``
399 | Whether to add end of sentence symbol
400 | Returns
401 | -------
402 | Dict with keys:
403 | ``'token_embedding'``: ``torch.Tensor``
404 | Shape ``(batch_size, sequence_length + 0/1/2, embedding_dim)`` tensor with context
405 | insensitive token representations.
406 | ``'mask'``: ``torch.Tensor``
407 | Shape ``(batch_size, sequence_length + 0/1/2)`` long tensor with sequence mask.
408 | """
409 | # Add BOS/EOS (this is the only difference from the original _ElmoCharacterEncoder class)
410 | mask = ((inputs > 0).long().sum(dim=-1) > 0).long()
411 | character_ids_with_bos_eos, mask_with_bos_eos = add_sentence_boundaries(
412 | inputs,
413 | mask,
414 | self._beginning_of_sentence_characters,
415 | self._end_of_sentence_characters,
416 | add_bos,
417 | add_eos
418 | )
419 |
420 | # the character id embedding
421 | max_chars_per_token = self._options['char_cnn']['max_characters_per_token']
422 | # (batch_size * sequence_length, max_chars_per_token, embed_dim)
423 | character_embedding = torch.nn.functional.embedding(
424 | character_ids_with_bos_eos.view(-1, max_chars_per_token),
425 | self._char_embedding_weights
426 | )
427 |
428 | # run convolutions
429 | cnn_options = self._options['char_cnn']
430 | if cnn_options['activation'] == 'tanh':
431 | activation = torch.nn.functional.tanh
432 | elif cnn_options['activation'] == 'relu':
433 | activation = torch.nn.functional.relu
434 | else:
435 | raise ConfigurationError("Unknown activation")
436 |
437 | # (batch_size * sequence_length, embed_dim, max_chars_per_token)
438 | character_embedding = torch.transpose(character_embedding, 1, 2)
439 | convs = []
440 | for i in range(len(self._convolutions)):
441 | conv = getattr(self, 'char_conv_{}'.format(i))
442 | convolved = conv(character_embedding)
443 | # (batch_size * sequence_length, n_filters for this width)
444 | convolved, _ = torch.max(convolved, dim=-1)
445 | convolved = activation(convolved)
446 | convs.append(convolved)
447 |
448 | # (batch_size * sequence_length, n_filters)
449 | token_embedding = torch.cat(convs, dim=-1)
450 |
451 | # apply the highway layers (batch_size * sequence_length, n_filters)
452 | token_embedding = self._highways(token_embedding)
453 |
454 | # final projection (batch_size * sequence_length, embedding_dim)
455 | token_embedding = self._projection(token_embedding)
456 |
457 | # reshape to (batch_size, sequence_length, embedding_dim)
458 | batch_size, sequence_length, _ = character_ids_with_bos_eos.size()
459 |
460 | return {
461 | 'mask': mask_with_bos_eos,
462 | 'token_embedding': token_embedding.view(batch_size, sequence_length, -1)
463 | }
464 |
465 |
466 | def add_sentence_boundaries(tensor: torch.Tensor,
467 | mask: torch.Tensor,
468 | sentence_begin_token: Any,
469 | sentence_end_token: Any,
470 | add_bos: bool = False,
471 | add_eos: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
472 | """
473 | Add begin/end of sentence tokens to the batch of sentences.
474 | Given a batch of sentences with size ``(batch_size, timesteps)`` or
475 | ``(batch_size, timesteps, dim)`` this returns a tensor of shape
476 | ``(batch_size, timesteps + 0/1/2)`` or ``(batch_size, timesteps + 0/1/2, dim)`` respectively.
477 | Returns both the new tensor and updated mask.
478 | Parameters
479 | ----------
480 | tensor : ``torch.Tensor``
481 | A tensor of shape ``(batch_size, timesteps)`` or ``(batch_size, timesteps, dim)``
482 | mask : ``torch.Tensor``
483 | A tensor of shape ``(batch_size, timesteps)`` (assuming padding id is always 0)
484 | sentence_begin_token: Any (anything that can be broadcast in torch for assignment)
485 | For 2D input, a scalar with the id. For 3D input, a tensor with length dim.
486 | sentence_end_token: Any (anything that can be broadcast in torch for assignment)
487 | For 2D input, a scalar with the id. For 3D input, a tensor with length dim.
488 | add_bos: bool
489 | Whether to add begin of sentence token.
490 | add_eos: bool
491 | Whether to add end of sentence token.
492 | Returns
493 | -------
494 | tensor_with_boundary_tokens : ``torch.Tensor``
495 | The tensor with the appended and prepended boundary tokens. If the input was 2D,
496 | it has shape (batch_size, timesteps + 0/1/2) and if the input was 3D, it has shape
497 | (batch_size, timesteps + 0/1/2, dim).
498 | new_mask : ``torch.Tensor``
499 | The new mask for the tensor, taking into account the appended tokens
500 | marking the beginning and end of the sentence.
501 | """
502 | # TODO: matthewp, profile this transfer
503 | sequence_lengths = mask.sum(dim=1).detach().cpu().numpy()
504 | tensor_shape = list(tensor.data.shape)
505 | new_shape = list(tensor_shape)
506 | if add_bos:
507 | new_shape[1] = new_shape[1] + 1
508 | if add_eos:
509 | new_shape[1] = new_shape[1] + 1
510 | tensor_with_boundary_tokens = tensor.new_zeros(*new_shape)
511 | if len(tensor_shape) == 2:
512 | if add_bos:
513 | tensor_with_boundary_tokens[:, 1:(1 + tensor_shape[1])] = tensor
514 | tensor_with_boundary_tokens[:, 0] = sentence_begin_token
515 | else:
516 | tensor_with_boundary_tokens[:, 0:tensor_shape[1]] = tensor
517 | if add_eos:
518 | for i, j in enumerate(sequence_lengths):
519 | tensor_with_boundary_tokens[i, j + 1 if add_bos else j] = sentence_end_token
520 | new_mask = (tensor_with_boundary_tokens != 0).long()
521 | elif len(tensor_shape) == 3:
522 | if add_bos:
523 | tensor_with_boundary_tokens[:, 1:(1 + tensor_shape[1]), :] = tensor
524 | else:
525 | tensor_with_boundary_tokens[:, 0:tensor_shape[1], :] = tensor
526 | for i, j in enumerate(sequence_lengths):
527 | if add_bos:
528 | tensor_with_boundary_tokens[i, 0, :] = sentence_begin_token
529 | if add_eos:
530 | tensor_with_boundary_tokens[i, j + 1 if add_bos else j, :] = sentence_end_token
531 | new_mask = ((tensor_with_boundary_tokens > 0).long().sum(dim=-1) > 0).long()
532 | else:
533 | raise ValueError("add_sentence_boundary_token_ids only accepts 2D and 3D input")
534 |
535 | return tensor_with_boundary_tokens, new_mask
536 |
537 | def remove_sentence_boundaries(tensor: torch.Tensor,
538 | mask: torch.Tensor,
539 | rmv_bos: bool = False,
540 | rmv_eos: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
541 | """
542 | Remove begin/end of sentence embeddings from the batch of sentences.
543 | Given a batch of sentences with size ``(batch_size, timesteps)`` or
544 | ``(batch_size, timesteps, dim)`` this returns a tensor of shape ``(batch_size, timesteps - 0/1/2)`` or
545 | ``(batch_size, timesteps - 0/1/2, dim)`` after removing
546 | the beginning and end sentence markers. The sentences are assumed to be padded on the right,
547 | with the beginning of each sentence assumed to occur at index 0 (i.e., ``mask[:, 0]`` is assumed
548 | to be 1).
549 | Returns both the new tensor and updated mask.
550 | This function is the inverse of ``add_sentence_boundaries``.
551 | Parameters
552 | ----------
553 | tensor : ``torch.Tensor``
554 | A tensor of shape ``(batch_size, timesteps)`` or ``(batch_size, timesteps, dim)``
555 | mask : ``torch.Tensor``
556 | A tensor of shape ``(batch_size, timesteps)``
557 | rmv_bos: bool
558 | Whether to remove begin of sentence token
559 | rmv_eos: bool
560 | Whether to remove end of sentence token
561 | Returns
562 | -------
563 | tensor_without_boundary_tokens : ``torch.Tensor``
564 | The tensor after removing the boundary tokens of shape ``(batch_size, timesteps - 0/1/2)``
565 | or ``(batch_size, timesteps - 0/1/2, dim)``
566 | new_mask : ``torch.Tensor``
567 | The new mask for the tensor of shape ``(batch_size, timesteps - 0/1/2)``.
568 | """
569 | # TODO: matthewp, profile this transfer
570 | if not rmv_bos and not rmv_eos:
571 | return tensor, mask
572 |
573 | sequence_lengths = mask.sum(dim=1).detach().cpu().numpy()
574 | tensor_shape = list(tensor.data.shape)
575 | new_shape = list(tensor_shape)
576 | if rmv_bos:
577 | new_shape[1] = new_shape[1] - 1
578 | if rmv_eos:
579 | new_shape[1] = new_shape[1] - 1
580 | tensor_without_boundary_tokens = tensor.new_zeros(*new_shape)
581 | new_mask = tensor.new_zeros((new_shape[0], new_shape[1]), dtype=torch.long)
582 | for i, j in enumerate(sequence_lengths):
583 | if rmv_bos and rmv_eos and j > 2:
584 | if len(tensor_shape) == 3:
585 | tensor_without_boundary_tokens[i, :(j - 2), :] = tensor[i, 1:(j - 1), :]
586 | elif len(tensor_shape) == 2:
587 | tensor_without_boundary_tokens[i, :(j - 2)] = tensor[i, 1:(j - 1)]
588 | else:
589 | raise ValueError("remove_sentence_boundaries only accepts 2D and 3D input")
590 | new_mask[i, :(j - 2)] = 1
591 | if rmv_bos and not rmv_eos and j > 1:
592 | if len(tensor_shape) == 3:
593 | tensor_without_boundary_tokens[i, :(j - 1), :] = tensor[i, 1:j, :]
594 | elif len(tensor_shape) == 2:
595 | tensor_without_boundary_tokens[i, :(j - 1)] = tensor[i, 1:j]
596 | else:
597 | raise ValueError("remove_sentence_boundaries only accepts 2D and 3D input")
598 | new_mask[i, :(j - 1)] = 1
599 | if not rmv_bos and rmv_eos and j > 1:
600 | if len(tensor_shape) == 3:
601 | tensor_without_boundary_tokens[i, :(j - 1), :] = tensor[i, :(j - 1), :]
602 | elif len(tensor_shape) == 2:
603 | tensor_without_boundary_tokens[i, :(j - 1)] = tensor[i, :(j - 1)]
604 | else:
605 | raise ValueError("remove_sentence_boundaries only accepts 2D and 3D input")
606 | new_mask[i, :(j - 1)] = 1
607 |
608 | return tensor_without_boundary_tokens, new_mask
609 |
610 |
--------------------------------------------------------------------------------
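A small, self-contained sketch (not part of the repository) of the boundary helpers defined at the bottom of the file above, which ElmoEmbedderForward uses to add and strip sentence-boundary positions. The begin/end ids 1 and 2 are arbitrary placeholders; the import assumes the module's dependencies (allennlp) are installed.

    import torch
    from elmo_sequential_embedder import add_sentence_boundaries, remove_sentence_boundaries

    ids = torch.tensor([[4, 5, 6], [7, 8, 0]])   # two padded sequences; 0 is the padding id
    mask = (ids > 0).long()
    with_bounds, new_mask = add_sentence_boundaries(ids, mask, 1, 2, add_bos=True, add_eos=True)
    print(with_bounds)   # tensor([[1, 4, 5, 6, 2], [1, 7, 8, 2, 0]])
    restored, _ = remove_sentence_boundaries(with_bounds, new_mask, rmv_bos=True, rmv_eos=True)
    print(restored)      # tensor([[4, 5, 6], [7, 8, 0]]) -- the original ids
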
/uss/gpt2_sequential_embedder.py:
--------------------------------------------------------------------------------
1 | '''
2 | Sequentially embed word tokens into GPT-2 (last-layer) hidden state vectors, with model internal states from the past saved.
3 | Note that GPT-2 uses BPE encodings for its vocabulary, so each word type may be split into a variable number of BPE units.
4 |
5 | Based on the library pytorch_pretrained_bert.
6 | '''
7 |
8 | import torch
9 | import torch.nn as nn
10 | from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model
11 |
12 | import logging
13 | logging.basicConfig(level=logging.INFO)
14 |
15 |
16 | class GPT2Embedder(nn.Module):
17 | def __init__(self, cuda_device=-1):
18 | super(GPT2Embedder, self).__init__()
19 |
20 | self.cuda_device = 'cpu' if cuda_device == -1 else f'cuda:{cuda_device}'
21 |
22 | # Load pre-trained model tokenizer (vocabulary)
23 | self.enc = GPT2Tokenizer.from_pretrained('gpt2')
24 | # Load pre-trained model (weights)
25 | self.model = GPT2Model.from_pretrained('gpt2')
26 |
27 | self.model.to(self.cuda_device)
28 | self.model.eval() # we only use the evaluation mode of the pretrained model
29 |
30 | self._bos_id = self.enc.encoder['<|endoftext|>']
31 | self._bos_past = None
32 |
33 | @property
34 | def bos_past(self):
35 | if self._bos_past is not None:
36 | return self._bos_past
37 | else:
38 | with torch.no_grad():
39 | _, self._bos_past = self.model(torch.tensor([[self._bos_id]], device=self.cuda_device), past=None)
40 | return self._bos_past
41 |
42 | def embed_sentence(self, sentence, add_bos=False, add_eos=False, bpe2word='last', initial_state=None):
43 | '''
44 | Compute the GPT-2 embeddings for a single tokenized sentence.
45 |
46 | Input:
47 | sentence (List[str]): tokenized sentence
48 |             add_bos (bool): whether to add the beginning-of-sentence token '<|endoftext|>'
49 |             add_eos (bool): whether to add the end-of-sentence token '<|endoftext|>' (currently not used)
50 | bpe2word (str): how to turn the BPE vectors into word vectors;
51 | 'last': last hidden state; 'avg': average hidden state.
52 | initial_state (List[torch.Tensor]): GPT-2 internal states for the past
53 |
54 | Output:
55 | embeddings (torch.Tensor): GPT-2 vectors for the sentence, size (len(sentence), 768)
56 | states (List[torch.Tensor]): GPT-2 internal states for the past, a list of length 12 (for 12 layers)
57 | '''
58 | assert isinstance(sentence, list), 'input "sentence" should be a list of word types.'
59 | assert bpe2word in ['last', 'avg']
60 |
61 | if add_bos:
62 | # initial_state is not used when 'add_bos' is True
63 | past = self.bos_past
64 | else:
65 | past = initial_state
66 |
67 | if past is None:
68 | bos_sp = '' # begin of sentence: whether there is a space or not
69 | else:
70 | bos_sp = ' '
71 |
72 | for i, w in enumerate(sentence):
73 | if i == 0:
74 | bpe_units = torch.tensor([self.enc.encode(bos_sp + w)], device=self.cuda_device)
75 | with torch.no_grad():
76 | vec, past = self.model(bpe_units, past=past)
77 | else:
78 | bpe_units = torch.tensor([self.enc.encode(' ' + w)], device=self.cuda_device)
79 | with torch.no_grad():
80 | vec, past = self.model(bpe_units, past=past)
81 |
82 | if bpe2word == 'last':
83 | vec = vec[:, -1, :]
84 | elif bpe2word == 'avg':
85 | vec = vec.mean(dim=1)
86 | else:
87 | raise ValueError
88 |
89 | embeddings = vec if i == 0 else torch.cat([embeddings, vec], dim=0)
90 |
91 | return embeddings, past
92 |
93 | def embed_words(self, words, add_bos=False, add_eos=False, bpe2word='last', initial_state=None):
94 | '''
95 | Compute the GPT-2 embeddings for a list of words.
96 |         The challenge is that these words may have BPE encodings of different lengths, so we need to pad them into a batch and then
97 |         index out the embeddings and internal states at the right positions.
98 |
99 | Input:
100 | words (List[str]): a list of words
101 |             add_bos (bool): whether to add the beginning-of-sentence token '<|endoftext|>'
102 |             add_eos (bool): whether to add the end-of-sentence token '<|endoftext|>' (currently not used)
103 | bpe2word (str): how to turn the BPE vectors into word vectors;
104 | 'last': last hidden state; 'avg': average hidden state.
105 | initial_state (List[torch.Tensor]): GPT-2 internal states for the past
106 |
107 | Output:
108 | embeddings (torch.Tensor): GPT-2 vectors for the words, size (len(words), 768)
109 | states (List[List[torch.Tensor]]): GPT-2 internal states for the past, a list of length len(words)
110 | '''
111 | assert isinstance(words, list), 'input "words" should be a list of candidate word types for the next step.'
112 | assert bpe2word in ['last', 'avg']
113 |
114 | if add_bos:
115 | # initial_state is not used when 'add_bos' is True
116 | past = self.bos_past
117 | else:
118 | past = initial_state
119 |
120 | if past is None:
121 | bos_sp = '' # begin of sentence: whether there is a space or not
122 | else:
123 | bos_sp = ' '
124 |
125 | n = len(words)
126 | bpe_list = [self.enc.encode(bos_sp + w) for w in words]
127 | bpe_lens = [len(b) for b in bpe_list]
128 |
129 |         ## padding to form a batch
130 | padding = 0
131 | max_seqlen = max(bpe_lens)
132 | bpe_list = [b + [padding] * (max_seqlen - l) for b, l in zip(bpe_list, bpe_lens)]
133 | bpe_padded = torch.tensor(bpe_list, device=self.cuda_device) # size (n, max_seqlen)
134 | if past is not None:
135 | past_seqlen = past[0].size(3)
136 | past = [p.expand(-1, n, -1, -1, -1) for p in past] # same past internal states for every word in the batch
137 | else:
138 | past_seqlen = 0
139 |
140 | ## run GPT-2 model
141 | with torch.no_grad():
142 | hid, mid = self.model(bpe_padded, past=past)
143 |
144 | ## extract the hidden states of words through indexing
145 | if bpe2word == 'last':
146 | # method 1: torch.gather
147 | index = torch.tensor(bpe_lens, device=self.cuda_device).reshape(n, 1, 1).expand(-1, -1, hid.size(2)) - 1
148 | embeddings = torch.gather(hid, 1, index).squeeze(1)
149 | # method 2: for loop
150 | # embeddings = hid.new_zeros(n, hid.size(2))
151 | # for i in range(n):
152 | # embeddings[i] = hid[i, bpe_lens[i] - 1]
153 | elif bpe2word == 'avg':
154 | a = torch.arange(max_seqlen, device=self.cuda_device).view(1, -1).expand(n, -1)
155 | b = torch.tensor(bpe_lens, device=self.cuda_device).view(-1, 1)
156 | mask = a >= b # size (n, max_seqlen)
157 | hid[mask] = 0 # mask out the padded position embeddings
158 | embeddings = hid.sum(dim=1) / b.float()
159 | else:
160 | raise ValueError
161 |
162 | ## index out the internal states
163 | states = torch.cat(mid, dim=0) # size (2 * 12, n, 12, past_seqlen + max_seqlen, 64)
164 | states = torch.split(states, 1, dim=1) # list of length n
165 | states = [torch.chunk(s.index_select(3, torch.arange(past_seqlen + l, device=self.cuda_device)), 12, dim=0)
166 | for s, l in zip(states, bpe_lens)]
167 |
168 | return embeddings, states
169 |
170 |
--------------------------------------------------------------------------------
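
A minimal usage sketch of the GPT2Embedder above (assuming the pytorch_pretrained_bert 'gpt2' weights can be downloaded and the script is run from the uss/ directory; the sentence and candidate words are made up): embed a prefix once, then embed candidate next words while reusing the cached internal states instead of re-running the prefix.

from gpt2_sequential_embedder import GPT2Embedder

ge = GPT2Embedder(cuda_device=-1)  # -1 runs on CPU

prefix = ['the', 'cat', 'sat']
prefix_vecs, past = ge.embed_sentence(prefix, add_bos=True, bpe2word='last')
print(prefix_vecs.shape)  # torch.Size([3, 768]), one vector per word

candidates = ['on', 'under', 'beside']
cand_vecs, cand_states = ge.embed_words(candidates, bpe2word='last', initial_state=past)
print(cand_vecs.shape)    # torch.Size([3, 768]); cand_states holds one per-candidate "past" for the next step
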
/uss/lm_subvocab.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 |
4 | def prob_next(LMModel, vocab, text, hn=None, subvocab=None, clustermask=None, onscore=False, renorm=False):
5 | """
6 | Output the probability distribution for the next word based on a pretrained LM, given the previous text.
7 | If 'subvocab' is not None, the distribution is restricted on the specified sub-vocabulary.
8 |
9 | Input:
10 | LMModel: pretrained RNN LM model.
11 | vocab: full vocabulary. 'torchtext.vocab.Vocab'.
12 | text: previous words in the sentence.
13 | hn: initial hidden states to the LM.
14 | subvocab: sub-vocabulary. 'torch.LongTensor'.
15 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)).
16 | onscore: whether to cluster on the raw scores before softmax layer, rather than cluster on the probabilities.
17 | renorm: whether to renormalize the probabilities over the sub-vocabulary. This parameter only works if 'onscore' is False.
18 |
19 | Output:
20 | (if subvocab is not None) subprobs: probability distribution over the sub-vocabulary.
21 | probs: probability distribution over the full vocabulary.
22 | hn: hidden states.
23 | """
24 |
25 | if clustermask is not None:
26 |         assert subvocab is not None, 'clustermask provided but no subvocab provided.'
27 |
28 | if isinstance(text, str):
29 | text = text.split()
30 |
31 | textid = next(LMModel.parameters()).new_tensor([vocab.stoi[w] for w in text],
32 | dtype=torch.long)
33 | with torch.no_grad():
34 | LMModel.eval()
35 | batch_text = textid.unsqueeze(1)
36 | embed = LMModel.embedding(batch_text)
37 | output, hn = LMModel.lstm(embed, hn)
38 | output = LMModel.proj(output) # size: (seq_len, batch_size=1, vocab_size)
39 |
40 | probs = torch.nn.functional.softmax(output[-1].squeeze(), dim=0)
41 |
42 | if subvocab is None:
43 | # if no subvocab is provided, return the full probability distribution and hidden states
44 | return probs, hn
45 |
46 | ## cluster on the raw scores (rather than the probabilities) before passing to the softmax layer
47 | if onscore:
48 | scores = output[-1].squeeze()
49 | subscores = scores[subvocab]
50 | if clustermask is None:
51 | subprobs = torch.nn.functional.softmax(subscores, dim=0)
52 | return subprobs, probs, hn
53 | for i in range(len(subvocab)):
54 | subscores[i] = scores[clustermask[i]].sum()
55 | subprobs = torch.nn.functional.softmax(subscores, dim=0)
56 | return subprobs, probs, hn
57 |
58 | ## cluster on the probabilities
59 | subprobs = probs[subvocab]
60 | if clustermask is None:
61 | if renorm:
62 | subprobs = subprobs / subprobs.sum()
63 | # subprobs = torch.nn.functional.softmax(subprobs, dim=0) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller
64 | return subprobs, probs, hn
65 |
66 | for i in range(len(subvocab)):
67 | subprobs[i] = probs[clustermask[i]].sum()
68 | if renorm:
69 | subprobs = subprobs / subprobs.sum()
70 | # subprobs = torch.nn.functional.softmax(subprobs, dim=0) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller
71 | return subprobs, probs, hn
72 |
73 |
74 | def prob_next_1step(LMModel, batch_text, hn=None, subvocab=None, clustermask=None, onscore=False, renorm=False, temperature=1):
75 | """
76 | Output the probability distribution for the next word based on a pretrained LM, carried in only one step of the forward pass.
77 | If 'subvocab' is not None, the distribution is restricted on the specified sub-vocabulary.
78 | This function is specifically used in the beam search.
79 |
80 | Input:
81 | LMModel: pretrained RNN LM model.
82 | batch_text: text id input to the language model, of size (seq_len=1, batch_size=onbeam_size).
83 | hn: hidden states to the LM, a tuple and each of size (num_layers * num_directions, batch_size=onbeam_size, hidden_size).
84 | subvocab: sub-vocabulary. 'torch.LongTensor'.
85 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)).
86 | onscore: whether to cluster on the raw scores before softmax layer, rather than cluster on the probabilities.
87 | renorm: whether to renormalize the probabilities over the sub-vocabulary. This parameter only works if 'onscore' is False.
88 |
89 | Output:
90 | subprobs: probability distribution over the sub-vocabulary. Size: (batch_size=onbeam_size, subvocab_size)
91 | probs: probability distribution over the full vocabulary. Size: (batch_size=onbeam_size, vocab_size)
92 | hn: hidden states. Tuple, each of size (num_layers * num_directions, batch_size=onbeam_size, hidden_size).
93 | """
94 |
95 | if clustermask is not None:
96 |         assert subvocab is not None, 'clustermask provided but no subvocab provided.'
97 |
98 | with torch.no_grad():
99 | LMModel.eval()
100 | embed = LMModel.embedding(batch_text)
101 | output, hn = LMModel.lstm(embed, hn)
102 | output = LMModel.proj(output) # size: (seq_len=1, batch_size=onbeam_size, vocab_size)
103 |
104 | output = output / temperature
105 | probs = torch.nn.functional.softmax(output.squeeze(0), dim=1) # size: (batch_size=onbeam_size, vocab_size)
106 |
107 | if subvocab is None:
108 | # if no subvocab is provided, return the full probability distribution and hidden states
109 | return probs, probs, hn
110 |
111 | # ## cluster on the raw scores (rather than the probabilities) before passing to the softmax layer
112 | # if onscore:
113 |
114 | # return
115 |
116 | ## cluster on the probabilities
117 | subprobs = probs[:, subvocab] # size: (batch_size=onbeam_size, subvocab_size)
118 | if clustermask is None:
119 | if renorm:
120 | subprobs = subprobs / torch.sum(subprobs, dim=1, keepdim=True)
121 | # subprobs = torch.nn.functional.softmax(subprobs, dim=1) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller
122 | return subprobs, probs, hn
123 |
124 | for i in range(len(subvocab)):
125 | subprobs[:, i] = probs[:, clustermask[i]].sum(dim=1)
126 |
127 | if renorm:
128 | subprobs = subprobs / torch.sum(subprobs, dim=1, keepdim=True)
129 | # subprobs = torch.nn.functional.softmax(subprobs, dim=1) # this makes the ratio p1/p2 between p1 and p2 (p1 > p2) smaller
130 | return subprobs, probs, hn
131 |
132 |
133 | def prob_sent(LMModel, vocab, text, hn=None, subvocab=None, clustermask=None, onscore=False, renorm=False, size_average=False):
134 | """
135 | Output the log-likelihood of a sentence based on a pretrained LM.
136 | If 'subvocab' is not None, the distribution is restricted on the specified sub-vocabulary.
137 |
138 | Input:
139 | LMModel: pretrained RNN LM model.
140 | vocab: full vocabulary. 'torchtext.vocab.Vocab'.
141 | text: previous words in the sentence.
142 | hn: initial hidden states to the LM.
143 | subvocab: sub-vocabulary. 'torch.LongTensor'.
144 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)).
145 | onscore: whether to cluster on the raw scores before softmax layer, rather than cluster on the probabilities.
146 | renorm: whether to renormalize the probabilities over the sub-vocabulary. This parameter only works if 'onscore' is False.
147 | size_average: whether to average the log-likelihood according to the sequence length.
148 |
149 | Output:
150 | ll: log-likelihood of the given sentence evaluated by the pretrained LM.
151 | hn: hidden states.
152 | """
153 |
154 | if clustermask is not None:
155 |         assert subvocab is not None, 'clustermask provided but no subvocab provided.'
156 |
157 | if isinstance(text, str):
158 | text = text.split()
159 |
160 | ## no subvocab is provided, operating on the full vocabulary
161 | textid = next(LMModel.parameters()).new_tensor([vocab.stoi[w] for w in text],
162 | dtype=torch.long)
163 | if subvocab is None:
164 | with torch.no_grad():
165 | LMModel.eval()
166 | batch_text = textid.unsqueeze(1)
167 | embed = LMModel.embedding(batch_text)
168 | output, hn = LMModel.lstm(embed, hn)
169 | output = LMModel.proj(output) # size: (seq_len, batch_size=1, vocab_size)
170 | ll = torch.nn.functional.cross_entropy(output.squeeze()[:-1, :], textid[1:], size_average=size_average, ignore_index=LMModel.padid)
171 | ll = -ll.item()
172 | return ll, hn
173 |
174 | ## subvocab is provided
175 | textid_sub = next(LMModel.parameters()).new_tensor([subvocab.numpy().tolist().index(vocab.stoi[w]) for w in text],
176 | dtype=torch.long)
177 | subprobs_sent = torch.zeros(len(text) - 1, len(subvocab), device=next(LMModel.parameters()).device)
178 | for i in range(len(text) - 1):
179 | subprobs, probs, hn = prob_next(LMModel, vocab, text[i], hn, subvocab, clustermask, onscore, renorm)
180 | subprobs_sent[i] = subprobs
181 | ll = torch.nn.functional.nll_loss(torch.log(subprobs_sent), textid_sub[1:], size_average=size_average, ignore_index=LMModel.padid)
182 | ll = -ll.item()
183 | return ll, hn
184 |
185 |
186 | def clmk_nn(embedmatrix, subvocab, normalized=True):
187 | """
188 | Generate 'clustermask', based on nearest neighbors, i.e. each word outside of the sub-vocabulary is assigned
189 | to the group of its closest one in the sub-vocabulary.
190 |
191 | Input:
192 | embedmatrix: word embedding matrix. Default should be the output embedding from the RNN language model.
193 | subvocab: sub-vocabulary. 'torch.LongTensor'.
194 | normalized: whether to use the normalized dot product as the distance measure, i.e. cosine similarity.
195 |
196 | Output:
197 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)).
198 | """
199 |
200 | submatrix = embedmatrix[subvocab]
201 | sim_table = torch.mm(submatrix, embedmatrix.t())
202 | if normalized:
203 | sim_table = sim_table / torch.ger(submatrix.norm(2, 1), embedmatrix.norm(2, 1))
204 | maxsim, maxsim_ind = torch.max(sim_table, dim=0)
205 |
206 | groups = []
207 | vocab_ind = torch.arange(len(embedmatrix), device=embedmatrix.device)
208 | clustermask = torch.zeros_like(sim_table, dtype=torch.uint8, device='cpu')
209 | for i in range(len(subvocab)):
210 | groups.append(vocab_ind[maxsim_ind == i].long())
211 | clustermask[i][groups[i]] = 1
212 |
213 | return clustermask
214 |
215 |
216 | def clmk_cn(embedmatrix, subvocab, simthre=0.6, normalized=True):
217 | """
218 | Generate 'clustermask', based on the cone method, i.e. each word in the sub-vocabulary is joined by the closest words in a cone,
219 | specified by a cosine similarity threshold.
220 |
221 | Input:
222 | embedmatrix: word embedding matrix. Default should be the output embedding from the RNN language model.
223 | subvocab: sub-vocabulary. 'torch.LongTensor'.
224 | simthre: cosine similarity threshold.
225 | normalized: whether to use the normalized dot product as the distance measure, i.e. cosine similarity.
226 |
227 | Output:
228 | clustermask: a binary mask for each of the sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)).
229 | """
230 |
231 | submatrix = embedmatrix[subvocab]
232 | sim_table = torch.mm(submatrix, embedmatrix.t())
233 | if normalized:
234 | sim_table = sim_table / torch.ger(submatrix.norm(2, 1), embedmatrix.norm(2, 1))
235 |
236 | clustermask = (sim_table > simthre).to('cpu')
237 | ## remove the indices that are already in the sub-vocabulary
238 | subvocabmask = torch.zeros_like(clustermask, dtype=torch.uint8)
239 | subvocabmask[:, subvocab] = 1
240 | clustermask = (clustermask ^ subvocabmask) & clustermask # set difference
241 | for i in range(len(subvocab)):
242 | clustermask[i][subvocab[i]] = 1 # add back the current word in the sub-vocabulary
243 |
244 | return clustermask
245 |
--------------------------------------------------------------------------------
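
A small self-contained sketch of what clmk_nn produces and how a cluster mask collapses a full-vocabulary distribution onto the sub-vocabulary, as prob_next does (the toy embeddings and probabilities below are invented for illustration):

import torch
from lm_subvocab import clmk_nn

# Toy vocabulary of 6 words with 2-d "output embeddings".
embedmatrix = torch.tensor([[1.0, 0.0],   # 0
                            [0.9, 0.1],   # 1, close to word 0
                            [0.0, 1.0],   # 2
                            [0.1, 0.9],   # 3, close to word 2
                            [0.7, 0.7],   # 4
                            [0.6, 0.8]])  # 5, close to word 4
subvocab = torch.tensor([0, 2, 4])

clustermask = clmk_nn(embedmatrix, subvocab)  # size (3, 6): one row (cluster) per sub-vocabulary word
print(clustermask)

# Collapse a full distribution onto the sub-vocabulary by summing each cluster:
probs = torch.tensor([0.3, 0.1, 0.2, 0.1, 0.2, 0.1])
subprobs = torch.stack([probs[clustermask[i].bool()].sum() for i in range(len(subvocab))])
print(subprobs, subprobs.sum())  # tensor([0.4000, 0.3000, 0.3000]); sums to 1, since every word joins exactly one cluster
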
/uss/pre_closetables.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import pickle
3 | import math
4 | from tqdm import tqdm
5 |
6 | import sys
7 |
8 | from elmo_sequential_embedder import ElmoEmbedderForward
9 | from sim_embed_score import pickElmoForwardLayer
10 |
11 |
12 | def ELMoBotEmbedding(itos, device=-1):
13 | """
14 | itos: List[str]. A list of words consisting of the vocabulary.
15 | device: int. -1 for cpu.
16 | """
17 | ee = ElmoEmbedderForward(cuda_device=device)
18 | vocab_vecs, _ = zip(*ee.embed_sentences([[w] for w in itos], add_bos=True, batch_size=1024))
19 | vocab_vecs = [pickElmoForwardLayer(vec, 'bot') for vec in vocab_vecs]
20 | embedmatrix = torch.cat(vocab_vecs, dim=0) # size: (vocab_size, embed_size)
21 |
22 | return embedmatrix
23 |
24 |
25 | def findclosewords_vocab(vocab, embedmatrix, numwords=500, normalized=True, device='cpu'):
26 | """
27 | Find closest words for every word in the vocabulary.
28 | """
29 | v = len(vocab)
30 | assert v == len(embedmatrix)
31 |
32 | embedmatrix = embedmatrix.to(device)
33 |
34 | chunk_size = 1000 # to solve the problem of out of memory
35 |
36 | if v > chunk_size:
37 | n = math.ceil(v / chunk_size)
38 | else:
39 | n = 1
40 | values = None
41 | indices = None
42 | start = 0
43 | for i in tqdm(range(n)):
44 | embedmatrix_chunk = embedmatrix[start:(start + chunk_size), :]
45 | start = start + chunk_size
46 |
47 | sim_table = torch.mm(embedmatrix_chunk, embedmatrix.t())
48 | if normalized:
49 | sim_table = sim_table / torch.ger(embedmatrix_chunk.norm(2, 1), embedmatrix.norm(2, 1))
50 |
51 | values_chunk, indices_chunk = sim_table.topk(numwords, dim=1)
52 | values = values_chunk if values is None else torch.cat([values, values_chunk], dim=0)
53 | indices = indices_chunk if indices is None else torch.cat([indices, indices_chunk], dim=0)
54 |
55 | return values.to('cpu'), indices.to('cpu') # values and indices have size (vocab_len, numwords)
56 |
57 |
58 | if __name__ == '__main__':
59 |
60 | vocab_path = '../4.0_cluster/vocabTle.pkl' # vocabulary for the pretrained language model
61 | closewordsim_path = '../4.0_cluster/vocabTleCloseWordSims.pkl'
62 | closewordind_path = '../4.0_cluster/vocabTleCloseWordIndices.pkl' # character level word embeddings
63 | closewordsim_outembed_path = 'vocabTleCloseWordSims_outembed_MoS.pkl'
64 | closewordind_outembed_path = 'vocabTleCloseWordIndices_outembed_MoS.pkl'
65 | modelclass_path = '../LSTM_MoS'
66 | model_path = '../LSTM_MoS/models/LMModelMoSTle2.pth'
67 |
68 | # vocab_path = '../LSTM_LUC/vocabTle50k.pkl' # vocabulary for the pretrained language model
69 | # closewordsim_path = 'vocabTle50kCloseWordSims.pkl'
70 | # closewordind_path = 'vocabTle50kCloseWordIndices.pkl' # character level word embeddings
71 | # closewordsim_outembed_path = 'vocabTle50kCloseWordSims_outembed_wtI.pkl'
72 | # closewordind_outembed_path = 'vocabTle50kCloseWordIndices_outembed_wtI.pkl'
73 | # modelclass_path = '../LSTM_LUC'
74 | # model_path = '../LSTM_LUC/models/TleLUC_wtI_0-0.0001-1Penalty.pth'
75 |
76 | # vocabulary
77 | vocab = pickle.load(open(vocab_path, 'rb'))
78 |
79 | # # character embeddings of the vocabulary
80 | # embedmatrix_cnn = ELMoBotEmbedding(vocab.itos, device=0)
81 | # values_cnn, indices_cnn = findclosewords_vocab(vocab, embedmatrix_cnn, numwords=500)
82 |
83 | # # save results
84 | # pickle.dump(values_cnn, open(closewordsim_path, 'wb'))
85 | # pickle.dump(indices_cnn, open(closewordind_path, 'wb'))
86 |
87 | # output embeddings of the vocabulary
88 | modelclass_path = modelclass_path
89 | if modelclass_path not in sys.path:
90 | sys.path.insert(1, modelclass_path) # this is for torch.load to load the entire model; the model class file must be included in the search path
91 | LMModel = torch.load(model_path, map_location=torch.device('cpu'))
92 | embedmatrix = LMModel.proj_vocab.weight
93 | values, indices = findclosewords_vocab(vocab, embedmatrix, numwords=500)
94 |
95 | # save results
96 | pickle.dump(values, open(closewordsim_outembed_path, 'wb'))
97 | pickle.dump(indices, open(closewordind_outembed_path, 'wb'))
98 |
99 |
--------------------------------------------------------------------------------
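
A minimal sketch of findclosewords_vocab on a toy vocabulary (only len(vocab) is used inside the function, so a plain list works here; the file's other imports mean the repository's ELMo dependencies must be installed and the script run from uss/). The real script precomputes the 500 nearest neighbours of every word and pickles the tables.

import torch
from pre_closetables import findclosewords_vocab

torch.manual_seed(0)
itos = ['a', 'b', 'c', 'd', 'e']           # toy 5-word vocabulary
embedmatrix = torch.randn(len(itos), 8)    # toy 8-d embeddings

values, indices = findclosewords_vocab(itos, embedmatrix, numwords=3)
print(values.shape, indices.shape)  # torch.Size([5, 3]) each: top-3 similarities and word indices per word
print(indices[0])                   # neighbours of word 0; with cosine similarity, word 0 itself comes first
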
/uss/pre_word_list.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 |
4 | def findwordlist(template, closewordind, vocab, numwords=10, addeos=False):
5 | """
6 | Based on a template sentence, find the candidate word list.
7 |
8 | Input:
9 | template: source sentence.
10 | closewordind: precalculated 100 closest word indices (using character embeddings). torch.LongTensor.
11 | vocab: full vocabulary.
12 | numwords: number of closest words per word in the template.
13 |         addeos: whether to include the '<eos>' token in the candidate word list.
14 | """
15 | if isinstance(template, str):
16 | template = template.split()
17 | templateind = closewordind.new_tensor([vocab.stoi[w] for w in template])
18 | # subvocab = closewordind[templateind, :numwords].flatten().cpu() # torch.flatten() only exists from PyTorch 0.4.1
19 | subvocab = closewordind[templateind, :numwords].view(-1).cpu()
20 | if addeos:
21 |         subvocab = torch.cat([subvocab, torch.LongTensor([vocab.stoi['<eos>']])])
22 | subvocab = subvocab.unique(sorted=True)
23 | word_list = [vocab.itos[i] for i in subvocab]
24 |
25 | return word_list, subvocab
26 |
27 |
28 | def findwordlist_screened(template, closewordind, closewordind_outembed, vocab, numwords=10, addeos=False):
29 | """
30 | Based on a template sentence, find the candidate word list, according to the character level RNN embeddings but
31 | screened by the output embeddings.
32 |
33 | Input:
34 | template: source sentence.
35 | closewordind: precalculated 100 closest word indices (using character embeddings). torch.LongTensor.
36 |         closewordind_outembed: same as 'closewordind', but using output embeddings.
37 | vocab: full vocabulary.
38 | numwords: number of closest words per word in the template.
39 |         addeos: whether to include the '<eos>' token in the candidate word list.
40 | """
41 | if isinstance(template, str):
42 | template = template.split()
43 | templateind = closewordind.new_tensor([vocab.stoi[w] for w in template])
44 |
45 | subvocab = closewordind[templateind, :numwords].view(-1).cpu()
46 | subvocab_embed = closewordind_outembed[templateind, 1:numwords].view(-1).cpu()
47 | subvocab_intemplate = closewordind[templateind, 0].view(-1).cpu()
48 |
49 | subvocab_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device)
50 | subvocab_mask[subvocab] = 1
51 | subvocab_embed_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device)
52 | subvocab_embed_mask[subvocab_embed] = 1
53 |
54 | subvocab_screened_mask = (subvocab_mask ^ subvocab_embed_mask) & subvocab_mask
55 | subvocab_screened_mask[subvocab_intemplate] = 1 # add back the words in the template sentence
56 | if addeos:
57 |         subvocab_screened_mask[vocab.stoi['<eos>']] = 1
58 |
59 | subvocab_screened = torch.arange(len(vocab), dtype=torch.long, device=subvocab.device)
60 | subvocab_screened = subvocab_screened[subvocab_screened_mask]
61 |
62 | word_list = [vocab.itos[i] for i in subvocab_screened]
63 |
64 | return word_list, subvocab_screened
65 |
66 |
67 | def findwordlist_screened2(template, closewordind, closewordind_outembed, vocab, numwords=10,
68 | numwords_outembed=None, numwords_freq=500, addeos=False):
69 | """
70 | Based on a template sentence, find the candidate word list, according to the character level RNN embeddings but
71 | screened by the output embeddings, and keep the words that are in the top 'numwords_freq' list in the vocabulary.
72 |
73 | Input:
74 | template: source sentence.
75 | closewordind: precalculated 100 closest word indices (using character embeddings). torch.LongTensor.
76 |         closewordind_outembed: same as 'closewordind', but using output embeddings.
77 | vocab: full vocabulary.
78 | numwords: number of closest words per word in the template.
79 | numwords_outembed: number of closest words per word in the output embedding to be screened out.
80 | numwords_freq: number of the most frequent words in the vocabulary to remain.
81 |         addeos: whether to include the '<eos>' token in the candidate word list.
82 | """
83 | if numwords_outembed is None:
84 | numwords_outembed = numwords
85 |
86 | if numwords_outembed <= 1:
87 | return findwordlist(template, closewordind, vocab, numwords=numwords, addeos=addeos)
88 |
89 | if isinstance(template, str):
90 | template = template.split()
91 | templateind = closewordind.new_tensor([vocab.stoi[w] for w in template])
92 |
93 | subvocab = closewordind[templateind, :numwords].view(-1).cpu()
94 | subvocab_embed = closewordind_outembed[templateind, 1:numwords_outembed].view(-1).cpu()
95 | subvocab_intemplate = closewordind[templateind, 0].view(-1).cpu()
96 |
97 | subvocab_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device)
98 | subvocab_mask[subvocab] = 1
99 | subvocab_embed_mask = torch.zeros(len(vocab), dtype=torch.uint8, device=subvocab.device)
100 | subvocab_embed_mask[subvocab_embed[subvocab_embed >= numwords_freq]] = 1 # never remove the most frequent words
101 |
102 | subvocab_screened_mask = (subvocab_mask ^ subvocab_embed_mask) & subvocab_mask
103 | subvocab_screened_mask[subvocab_intemplate] = 1 # add back the words in the template sentence
104 | if addeos:
105 |         subvocab_screened_mask[vocab.stoi['<eos>']] = 1
106 |
107 | subvocab_screened = torch.arange(len(vocab), dtype=torch.long, device=subvocab.device)
108 | subvocab_screened = subvocab_screened[subvocab_screened_mask]
109 |
110 | word_list = [vocab.itos[i] for i in subvocab_screened]
111 |
112 | return word_list, subvocab_screened
113 |
--------------------------------------------------------------------------------
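
A minimal sketch of findwordlist on a toy vocabulary (the SimpleNamespace vocabulary and the tiny neighbour table are invented for illustration; the real tables come from pre_closetables.py and have hundreds of columns, and the end-of-sentence token is assumed to be spelled '<eos>'):

import torch
from types import SimpleNamespace
from pre_word_list import findwordlist

itos = ['<unk>', '<eos>', 'the', 'cat', 'dog', 'sat', 'slept']
vocab = SimpleNamespace(itos=itos, stoi={w: i for i, w in enumerate(itos)})
closewordind = torch.tensor([[0, 1, 2],
                             [1, 0, 2],
                             [2, 0, 1],
                             [3, 4, 0],   # 'cat' -> cat, dog, <unk>
                             [4, 3, 0],   # 'dog' -> dog, cat, <unk>
                             [5, 6, 0],   # 'sat' -> sat, slept, <unk>
                             [6, 5, 0]])  # 'slept' -> slept, sat, <unk>

word_list, subvocab = findwordlist('cat sat', closewordind, vocab, numwords=2, addeos=True)
print(word_list)  # ['<eos>', 'cat', 'dog', 'sat', 'slept']
print(subvocab)   # tensor([1, 3, 4, 5, 6])
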
/uss/sim_embed_score.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import time
4 | from utils import timeSince
5 | # from tqdm import tqdm
6 |
7 | from sim_token_match import OneTokenMatch
8 |
9 |
10 | def pickElmoForwardLayer(embedding, elmo_layer='avg'):
11 | """
12 | Given a forward only ELMo embedding vector of size (3, #words, 512), pick up the layer
13 | """
14 | assert elmo_layer in ['top', 'mid', 'bot', 'avg', 'cat']
15 |
16 | if elmo_layer == 'top':
17 | embedding = embedding[2]
18 | elif elmo_layer == 'mid':
19 | embedding = embedding[1]
20 | elif elmo_layer == 'bot':
21 | embedding = embedding[0]
22 | elif elmo_layer == 'avg':
23 | if isinstance(embedding, np.ndarray):
24 | embedding = np.average(embedding, axis=0)
25 | elif isinstance(embedding, torch.Tensor):
26 | embedding = torch.mean(embedding, dim=0)
27 | elif elmo_layer == 'cat':
28 | if isinstance(embedding, np.ndarray):
29 | embedding = np.reshape(embedding.transpose(1, 0, 2),
30 | (-1, embedding.shape[0] * embedding.shape[2])) # concat 3 layers, bottom first
31 | elif isinstance(embedding, torch.Tensor):
32 | embedding = embedding.transpose(0, 1).reshape(-1, embedding.size(0) * embedding.size(2))
33 |
34 | return embedding
35 |
36 |
37 | def simScoreNext(template_vec,
38 | word_list,
39 | ee,
40 | batch_size=1024,
41 | prevs_state=None,
42 | prevs_align=None,
43 | normalized=True,
44 | elmo_layer='avg'):
45 | """
46 | Score the next tokens based on sentence level similarity, with previous alignment fixed.
47 |
48 | Input:
49 | template_vec: template sentence ELMo vectors.
50 | word_list: a list of next candidate words.
51 |         ee: an ``ElmoEmbedderForward`` object.
52 | batch_size: for ee to use.
53 | prevs_state: previous hidden states.
54 | prevs_align: aligning location for the last word in the sequence.
55 | If provided, monotonicity is required.
56 | normalized: whether to use normalized dot product (cosine similarity) for token similarity calculation.
57 | elmo_layer: ELMo layer to use.
58 | Output:
59 | scores: unsorted one-token similarity scores, torch.Tensor.
60 | indices: matched indices in template_vec for each token, torch.LongTensor.
61 | states: corresponding ELMo forward lstm hidden states, List.
62 | """
63 | sentences = [[w] for w in word_list]
64 | src_vec = pickElmoForwardLayer(template_vec, elmo_layer)
65 | if prevs_state is None:
66 | assert prevs_align is None, 'Nothing should be passed in when no history.'
67 | # beginning of sentence, the first token
68 | embeddings_and_states = ee.embed_sentences(sentences, add_bos=True, batch_size=batch_size)
69 | else:
70 | # in the middle of sentence, sequential update
71 | # start = time.time()
72 | embeddings_and_states = ee.embed_sentences(sentences, initial_state=prevs_state, batch_size=batch_size)
73 | # print('ELMo embedding: ' + timeSince(start))
74 |
75 | embeddings, states = zip(*embeddings_and_states) # this returns two tuples
76 |
77 | scores = []
78 | indices = []
79 | print('Calculating similarities ---')
80 | # start = time.time()
81 | embeddings = [pickElmoForwardLayer(vec, elmo_layer) for vec in embeddings]
82 | scores, indices = OneTokenMatch(src_vec, embeddings, normalized=normalized, starting_loc=prevs_align)
83 | # print('Similarities: ' + timeSince(start))
84 |
85 | return scores, indices, list(states)
86 |
87 |
88 | def simScoreNext_GPT2(template_vec,
89 | word_list,
90 | ge,
91 | bpe2word='last',
92 | prevs_state=None,
93 | prevs_align=None,
94 | normalized=True):
95 | """
96 | Score the next tokens based on sentence level similarity, with previous alignment fixed.
97 | In particular, this function uses GPT-2 to embed the sentences/candidate words:
98 | - Calculate the embeddings for each candidate word using pre-trained GPT-2 model, given the previous hidden states
99 | - Calculate best alignment positions and similarity scores for each word
100 |
101 | Note:
102 | - GPT-2 uses BPE tokenizer, so each word may be split into several different units
103 |
104 | Input:
105 | template_vec (torch.Tensor): template sentence GPT-2 embedding vectors
106 | word_list (list): a list of next candidate words
107 | ge (:class:`GPT2Embedder`): a `GPT2Embedder` object for embedding words using GPT-2
108 | bpe2word (str): how to turn the BPE vectors into word vectors.
109 | 'last': last hidden state; 'avg': average hidden state.
110 | prevs_state (list[torch.Tensor]): previous hidden states for the GPT-2 model
111 | prevs_align (int): aligning location for the last word in the sequence.
112 | If provided, monotonicity is required.
113 | normalized (bool): whether to use normalized dot product (cosine similarity) for token similarity calculation
114 |
115 | Output:
116 | scores (torch.Tensor): unsorted one-token similarity scores
117 | indices (torch.LongTensor): matched indices in template_vec for each token
118 | states (list): corresponding GPT-2 past internal hidden states
119 | """
120 | assert bpe2word in ['last', 'avg']
121 |
122 | if prevs_state is None:
123 | # beginning of sentence, the first token
124 | assert prevs_align is None, 'Nothing should be passed in when no history.'
125 | add_bos = True
126 | else:
127 | # in the middle of a sentence, sequential update
128 | add_bos = False
129 |
130 | embeddings, states = ge.embed_words(word_list, add_bos=add_bos, bpe2word=bpe2word, initial_state=prevs_state)
131 |
132 | scores = []
133 | indices = []
134 | print('Calculating similarities ---')
135 | # start = time.time()
136 | scores, indices = OneTokenMatch(template_vec, embeddings, normalized=normalized, starting_loc=prevs_align)
137 | # print('Similarities: ' + timeSince(start))
138 |
139 | return scores, indices, states
140 |
141 |
142 | """
143 | def simScoreNext_GPT2(template_vec,
144 | bpe_encoding_grouped,
145 | model,
146 | bpe2word='last',
147 | prevs_state=None, prevs_align=None, normalized=True):
148 | '''
149 | Score the next tokens based on sentence level similarity, with previous alignment fixed.
150 | In particular, this function uses GPT-2 to embed the sentences/candidate words:
151 | - Calculate the embeddings for each candidate word using pretrained GPT-2 model, given the previous hidden states
152 | - Calculate best alignment positions and similarity scores for each word
153 |
154 | Note:
155 |     - GPT-2 uses BPE tokenizer, so each word may be split into several different units
156 |
157 | Input:
158 | template_vec (torch.Tensor): template sentence GPT-2 embedding vectors
159 | word_list (list): a list of next candidate words
160 | prevs_state (list[torch.Tensor]): previous hidden states for the GPT-2 model
161 | tokenizer (pytorch_pretrained_bert.tokenization_gpt2.GPT2Tokenizer): GPT-2 tokenizer
162 | model (pytorch_pretrained_bert.modeling_gpt2.GPT2Model): GPT-2 Model
163 | bpe2word (str): how to turn the BPE vectors into word vectors.
164 | 'last': last hidden state; 'avg': average hidden state.
165 | prevs_align (int): aligning location for the last word in the sequence.
166 | If provided, monotonicity is required.
167 | normalized (bool): whether to use normalized dot product (cosine similarity) for token similarity calculation
168 |
169 | Output:
170 | scores (torch.Tensor): unsorted one-token similarity scores
171 | indices (torch.LongTensor): matched indices in template_vec for each token
172 | states (list): corresponding GPT-2 hidden states
173 | '''
174 | assert bpe2word in ['last', 'avg']
175 |
176 | device = next(model.parameters()).device
177 | model.eval()
178 |
179 | if prevs_state is None:
180 | # beginning of sentence, the first token
181 | assert prevs_align is None, 'Nothing should be passed in when no history.'
182 | else:
183 | # in the middle of a sentence, sequential update
184 | assert prevs_state is not None, 'There should be history.'
185 |
186 | embeddings = [] # word embeddings
187 | states = [] # hidden states saved for sequential calculations
188 | with torch.no_grad():
189 | for bpe_encoding in bpe_encoding_grouped:
190 | # bpe_encoding is a tensor of bpe unit ids
191 | vec, past = model(bpe_encoding, past=prevs_state)
192 | # vec: size (n, len(bpe_encoding), 768)
193 | # past: a list of length 12, each of size (2, n, 12, len(bpe_encoding), 64)
194 | # which records keys, values for 12 heads in each of the 12 layers
195 | # where n is the number of words of the same len(bpe_encoding) in the word list
196 |
197 | if bpe2word == 'last':
198 | embeddings.append(vec[:, -1, :]) # size (n, 768)
199 | elif bpe2word == 'avg':
200 | embeddings.append(vec.mean(dim=1)) # size (n, 768)
201 | else: # impossible
202 | raise ValueError
203 |
204 | past = torch.cat(past, dim=0) # size (2 * 12, n, 12, len(bpe_encoding), 64)
205 | past = torch.split(past, 1, dim=1) # list of length n, each of size (2 * 12, 1, 12, len(bpe_encoding), 64)
206 | states += past
207 |
208 | embeddings = torch.cat(embeddings, dim=0) # size (#word_list, 768)
209 | states = [torch.chunk(s, 12, dim=0) for s in states]
210 |
211 | scores = []
212 | indices = []
213 | print('Calculating similarities ---')
214 | # start = time.time()
215 | scores, indices = OneTokenMatch(template_vec, embeddings, normalized=normalized, starting_loc=prevs_align)
216 | # print('Similarities: ' + timeSince(start))
217 |
218 | return scores, indices, states
219 | """
220 |
--------------------------------------------------------------------------------
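
A minimal sketch of the layer selection performed by pickElmoForwardLayer (the random tensor stands in for a forward-only ELMo output of a 4-word sentence; the shapes are the only point here, and the script is assumed to run from the uss/ directory so the module's own imports resolve):

import torch
from sim_embed_score import pickElmoForwardLayer

embedding = torch.randn(3, 4, 512)  # 3 layers x 4 words x 512 dims

print(pickElmoForwardLayer(embedding, 'top').shape)  # torch.Size([4, 512]) -- top LSTM layer only
print(pickElmoForwardLayer(embedding, 'avg').shape)  # torch.Size([4, 512]) -- mean of the three layers
print(pickElmoForwardLayer(embedding, 'cat').shape)  # torch.Size([4, 1536]) -- layers concatenated, bottom first
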
/uss/sim_token_match.py:
--------------------------------------------------------------------------------
1 | """
2 | Sequential calculation of word similarities in an embedding space.
3 | Previous alignments are fixed once found.
4 |
5 | Each time, do the best word vector matching between a template sentence and a list of single tokens, based on
6 | cosine similarities or dot products.
7 | In the simplest case, do not require monotonicity.
8 | """
9 | import torch
10 |
11 |
12 | def OneTokenMatch(src, token_list, normalized=False, starting_loc=None):
13 | """
14 | Input:
15 | src: source sequence, such as a long sentence vector to be summarized.
16 | token_list: a list of word vectors to be matched with 'src'.
17 | starting_loc: aligning location for the last word in the sequence.
18 | If provided, monotonicity is required.
19 | Output:
20 | similarities: the best similarity scores for each token in 'token_list'.
21 | indices: the matched indices in 'src' for the best scores for each token.
22 | """
23 | if isinstance(token_list, list):
24 | assert isinstance(token_list[0], torch.Tensor) and isinstance(src, torch.Tensor), \
25 | 'source/template sequence must be torch.Tensor.'
26 | assert len(token_list[0].size()) == len(src.size()) == 2, 'input sequences must be 2D series.'
27 | elif isinstance(token_list, torch.Tensor):
28 | assert isinstance(src, torch.Tensor), 'source/template sequence must be torch.Tensor.'
29 | assert len(token_list.size()) == len(src.size()) == 2, 'input sequences must be 2D series.'
30 | else:
31 | raise TypeError
32 |
33 | if starting_loc is not None:
34 | # require monotonicity, by only looking at 'src' from or after 'starting_loc'
35 | # strict monotonicity
36 | assert starting_loc < len(src) - 1, 'last word already matched to the last token in template, ' \
37 | 'when requiring strict monotonicity.'
38 | src = src[(starting_loc + 1):]
39 | # weak monotonicity
40 | # assert starting_loc < len(src)
41 | # src = src[starting_loc:]
42 |
43 | if isinstance(token_list, list):
44 | token_matrix = torch.cat(token_list, dim=0)
45 | elif isinstance(token_list, torch.Tensor):
46 | token_matrix = token_list
47 | else:
48 | raise TypeError
49 | sim_table = torch.mm(src, token_matrix.t()) # size: (src_len, token_list_len) or (truncated_src_len, token_list_len)
50 |
51 | if normalized:
52 | sim_table = sim_table / torch.ger(src.norm(2, 1), token_matrix.norm(2, 1))
53 |
54 | similarities, indices = torch.max(sim_table, dim=0)
55 |
56 | if starting_loc is not None:
57 | indices += starting_loc + 1 # strict monotonicity
58 | # indices += starting_loc # weak monotonicity
59 |
60 | return similarities, indices
61 |
62 |
63 | def TokenMatch(src, tgt, mono=True, weakmono=False, normalized=True):
64 | """
65 | Calculate the similarity between two sentences by word embedding match and single token alignment.
66 |
67 | Input:
68 | src: source sequence word embeddings. 'torch.Tensor' of size (src_seq_len, embed_dim).
69 | tgt: short target sequence word embeddings to be matched to 'src'. 'torch.Tensor' of size
70 | (tgt_seq_len, embed_dim).
71 | mono: whether to constrain the alignments to be monotonic. Default: True.
72 | weakmono: whether to relax the alignment monotonicity to be weak (non-strict). Only effective when 'mono'
73 | is True. Default: False.
74 | normalized: whether to normalize the dot product in calculating word similarities, i.e. whether to use
75 | cosine similarity or just dot product. Default: True.
76 |
77 | Output:
78 | similarity: sequence similarity, by summing the max similarities of the best alignment.
79 | indices: locations in the 'src' sequence that each 'tgt' token is aligned to.
80 | """
81 |
82 | assert isinstance(src, torch.Tensor) and isinstance(tgt, torch.Tensor), 'input sequences must be torch.Tensor.'
83 | assert len(src.size()) == len(tgt.size()) == 2, 'input sequences must be 2D series.'
84 |
85 | sim_table = torch.mm(src, tgt.t())
86 | if normalized:
87 | sim_table = sim_table / torch.ger(src.norm(2, 1), tgt.norm(2, 1))
88 |
89 | if mono:
90 | src_len, tgt_len = sim_table.size()
91 | max_sim = []
92 | if weakmono:
93 | indices = [0]
94 | for i in range(1, tgt_len + 1):
95 | mi, ii = torch.max(sim_table[indices[i - 1]:, i - 1].unsqueeze(1), dim=0)
96 | max_sim.append(mi)
97 | indices.append(ii + indices[i - 1])
98 | else:
99 | indices = [-1]
100 | for i in range(1, tgt_len + 1):
101 | if indices[i - 1] == src_len - 1:
102 | max_sim.append(sim_table[-1, i - 1].unsqueeze(0))
103 | indices.append(indices[i - 1])
104 | else:
105 | mi, ii = torch.max(sim_table[(indices[i - 1] + 1):, i - 1].unsqueeze(1), dim=0)
106 | max_sim.append(mi)
107 | indices.append(ii + indices[i - 1] + 1)
108 | max_sim = torch.cat(max_sim)
109 | indices = torch.cat(indices[1:])
110 | else:
111 | max_sim, indices = torch.max(sim_table, dim=0)
112 |
113 | similarity = torch.sum(max_sim)
114 |
115 | return similarity, indices
116 |
--------------------------------------------------------------------------------
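
A minimal sketch of OneTokenMatch with and without the strict-monotonicity constraint (the 2-d vectors are invented; real calls use 512-d or larger ELMo/GPT-2 vectors):

import torch
from sim_token_match import OneTokenMatch

src = torch.tensor([[1.0, 0.0],          # template word 0
                    [0.0, 1.0],          # template word 1
                    [1.0, 1.0],          # template word 2
                    [0.4, 0.1]])         # template word 3
candidates = [torch.tensor([[0.9, 0.1]]),
              torch.tensor([[0.1, 0.9]]),
              torch.tensor([[1.0, 0.9]])]

# Unconstrained: each candidate matches its most similar template position.
sims, locs = OneTokenMatch(src, candidates, normalized=True)
print(locs)  # tensor([0, 1, 2])

# Strictly monotone: the previous word was aligned to position 1,
# so candidates may only match positions 2 and onwards.
sims, locs = OneTokenMatch(src, candidates, normalized=True, starting_loc=1)
print(locs)  # tensor([3, 2, 2])
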
/uss/summary_search_elmo.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | import time
3 | import sys
4 | import os
5 | import argparse
6 |
7 | import torch
8 | from tqdm import tqdm
9 |
10 | from elmo_sequential_embedder import ElmoEmbedderForward
11 | from pre_closetables import ELMoBotEmbedding, findclosewords_vocab
12 | # from pre_word_list import findwordlist, findwordlist_screened
13 | from pre_word_list import findwordlist_screened2
14 | from lm_subvocab import clmk_nn
15 | from beam_search import Beam
16 | from utils import timeSince
17 |
18 |
19 | def gensummary_elmo(template_vec,
20 | ee,
21 | vocab,
22 | LMModel,
23 | word_list,
24 | subvocab,
25 | clustermask=None,
26 | mono=True,
27 | renorm=True,
28 | temperature=1,
29 | elmo_layer='avg',
30 | max_step=20,
31 | beam_width=10,
32 | beam_width_start=10,
33 | alpha=0.1,
34 | alpha_start=0.1,
35 | begineos=True,
36 | stopbyLMeos=False,
37 | devid=0,
38 | **kwargs):
39 | """
40 | Unsupervised sentence summary generation using beam search, by contextual matching and a summary style language model.
41 | The contextual matching here is on top of pretrained ELMo embeddings.
42 |
43 | Input:
44 | - template_vec (torch.Tensor): forward only ELMo embeddings of the source sentence.
45 | 'torch.Tensor' of size (3, seq_len, 512).
46 | - ee (elmo_sequential_embedder.ElmoEmbedderForward): 'elmo_sequential_embedder.ElmoEmbedderForward' object.
47 | - vocab (torchtext.vocab.Vocab): 'torchtext.vocab.Vocab' object. Should be the same as is used for the
48 | pretrained language model.
49 | - LMModel (user defined torch.nn.Module): a pretrained language model on the summary sentences.
50 | - word_list (list): a list of words in the vocabulary to work with. 'List'.
51 | - subvocab (torch.LongTensor): 'torch.LongTensor' consisting of the indices of the words corresponding
52 | to `word_list`.
53 | - clustermask (torch.ByteTensor): a binary mask for each of the sub-vocabulary word.
54 |             'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). Default: None.
55 |         - mono (bool): whether to keep the monotonicity constraint. Default: True.
56 | - renorm (bool): whether to renormalize the probabilities over the sub-vocabulary. Default: True.
57 | - temperature (float): temperature applied to the softmax in the language model. Default: 1.
58 | - elmo_layer (str): which ELMo layer to use as the word type representation.
59 | Choose from ['avg', 'cat', 'bot', 'mid', 'top']. Default: 'avg'.
60 | - max_step (int): maximum number of beam steps.
61 | - beam_width (int): beam width.
62 | - beam_width_start (int): beam width of the first step.
63 | - alpha (float): the amount of language model part used for scoring. The score is:
64 | (1 - \alpha) * similarity_logscore + \alpha * LM_logscore.
65 | - alpha_start (float): the amount of language model part used for scoring, only for the first step.
66 |         - begineos (bool): whether to begin with the special '<eos>' token, as used in training the language model.
67 | Note that ELMo has its own special beginning token. Default: True.
68 |         - stopbyLMeos (bool): whether to stop a sentence solely by the language model predicting '<eos>' as the
69 | top possibility. Default: False.
70 | - devid (int): device id to run the algorithm and LSTM language models. 'int', default: 0. -1 for cpu.
71 |         **kwargs: other arguments passed on to the beam scoring function 'Beam.combscoreK'.
72 | E.g. - normalized (bool): whether to normalize the dot product when calculating the similarity,
73 | which makes it cosine similarity. Default: True.
74 | - ifadditive (bool): whether to use an additive model on mixing the probability scores. Default: False.
75 |
76 | Output:
77 | - beam (beam_search.Beam): 'Beam' object, recording all the generated sequences.
78 |
79 | """
80 | device = 'cpu' if devid == -1 else f'cuda:{devid}'
81 |
82 | # Beam Search: initialization
83 | if begineos:
84 |         beam = Beam(1, vocab, init_ids=[vocab.stoi['<eos>']], device=device,
85 | sim_score=0, lm_score=0, lm_state=None, elmo_state=None, align_loc=None)
86 | else:
87 | beam = Beam(1, vocab, init_ids=[None], device=device,
88 | sim_score=0, lm_score=0, lm_state=None, elmo_state=None, align_loc=None)
89 |
90 | # first step: start with 'beam_width_start' best matched words
91 | beam.beamstep(beam_width_start,
92 | beam.combscoreK,
93 | template_vec=template_vec,
94 | ee=ee,
95 | LMModel=LMModel,
96 | word_list=word_list,
97 | subvocab=subvocab,
98 | clustermask=clustermask,
99 | alpha=alpha_start,
100 | renorm=renorm,
101 | temperature=temperature,
102 | elmo_layer=elmo_layer,
103 | # normalized=True,
104 | # ifadditive=False,
105 | **kwargs)
106 |
107 | # run beam search, until all sentences hit or max_step reached
108 | for s in range(max_step):
109 | print(f'beam step {s + 1} ' + '-' * 50 + '\n')
110 | beam.beamstep(beam_width,
111 | beam.combscoreK,
112 | template_vec=template_vec,
113 | ee=ee,
114 | LMModel=LMModel,
115 | word_list=word_list,
116 | subvocab=subvocab,
117 | clustermask=clustermask,
118 | mono=mono,
119 | alpha=alpha,
120 | renorm=renorm,
121 | temperature=temperature,
122 | stopbyLMeos=stopbyLMeos,
123 | elmo_layer=elmo_layer,
124 | # normalized=True,
125 | # ifadditive=False,
126 | **kwargs)
127 | # all beams reach termination
128 | if beam.endall:
129 | break
130 |
131 | return beam
132 |
133 |
134 | def sortsummary(beam, beta=0):
135 | """
136 | Sort the generated summaries by beam search, with length penalty considered.
137 |
138 | Input:
139 | - beam (beam_search.Beam): 'Beam' object finished with beam search.
140 | - beta (float): length penalty when sorting. Default: 0 (no length penalty).
141 |
142 | Output:
143 | - ssa (list[tuple]): 'List[Tuple]' of (score_avg, sentence, alignment, sim_score, lm_score).
144 | """
145 | sents = []
146 | aligns = []
147 | score_avgs = []
148 | sim_scores = []
149 | lm_scores = []
150 |
151 | for ks in beam.endbus:
152 | sent, rebeam = beam.retrieve(ks[0] + 1, ks[1])
153 | score_avg = ks[2] / (ks[1] ** beta)
154 |
155 | sents.append(sent)
156 | aligns.append(beam.retrieve_align(rebeam))
157 | score_avgs.append(score_avg)
158 | sim_scores.append(ks[3])
159 | lm_scores.append(ks[4])
160 |
161 | ssa = sorted([(score_avgs[i], sents[i], aligns[i], sim_scores[i], lm_scores[i]) for i in range(len(sents))],
162 | reverse=True)
163 |
164 | return ssa
165 |
166 |
167 | def fixlensummary(beam, length=-1):
168 | """
169 | Pull out fixed length summaries from the beam search.
170 |
171 | Input:
172 | - beam (beam_search.Beam): 'Beam' object finished with beam search.
173 | - length (int): wanted length of the summary.
174 |
175 | Output:
176 | - ssa (list[tuple]): 'List[Tuple]' of sorted (score, sentence, alignments, sim_score, lm_score).
177 | """
178 | assert length >= 1 and length <= beam.step, 'invalid sentence length.'
179 |
180 | ssa = []
181 | for i in range(beam.K[length]):
182 |         sent, rebeam = beam.retrieve(i + 1, length)
183 | ssa.append((beam.beamseq[length][i].score,
184 | sent,
185 | beam.retrieve_align(rebeam),
186 | beam.beamseq[length][i].sim_score,
187 | beam.beamseq[length][i].lm_score))
188 |
189 | return ssa
190 |
191 |
192 | ###############################################################################
193 | ########## some default parameters ##########
194 | ###############################################################################
195 | devid = 0
196 |
197 | ##### for English giga words
198 | arttxtpath = './data/Giga-sum/input_unk_250.txt'
199 | # arttxtpath = './data/Giga-sum/input_unk_251-500.txt'
200 | # arttxtpath = './data/Giga-sum/input_unk_501-750.txt'
201 | # arttxtpath = './data/Giga-sum/input_unk_751-1000.txt'
202 | # arttxtpath = './data/Giga-sum/input_unk_1001-1250.txt'
203 | # arttxtpath = './data/Giga-sum/input_unk_1251-1500.txt'
204 | # arttxtpath = './data/Giga-sum/input_unk_1501-1750.txt'
205 | # arttxtpath = './data/Giga-sum/input_unk_1751-1951.txt'
206 |
207 | # arttxtpath = './data/Giga-sum/input_unk.txt'
208 |
209 | '''
210 | vocab_path = './lm_lstm_models/gigaword/vocabTle.pkl'
211 | modelclass_path = './lm_lstm'
212 | model_path = './lm_lstm_models/gigaword/Tle_LSTM_untied.pth'
213 | closeword = './voctbls/vocabTleCloseWord'
214 | closeword_lmemb = './voctbls/vocabTleCloseWord'
215 | savedir = './results_elmo_giga/'
216 | '''
217 |
218 | ##### for Google sentence compression dataset
219 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
220 |
221 | vocab_path = './lm_lstm_models/sentence_compression/vocabsctgt.pkl'
222 | modelclass_path = './lm_lstm'
223 | model_path = './lm_lstm_models/sentence_compression/sctgt_LSTM_1024_untied.pth'
224 | closeword = './voctbls/vocabsctgtCloseWord'
225 | closeword_lmemb = './voctbls/vocabsctgtCloseWord'
226 | savedir = './results_elmo_sc/'
227 |
228 | '''
229 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
230 |
231 | vocab_path = './lm_lstm_models/sentence_compression/vocabsctgt.pkl'
232 | modelclass_path = './lm_lstm'
233 | model_path = './lm_lstm_models/sentence_compression/sctgt_LSTM_untied.pth'
234 | closeword = './voctbls/vocabsctgtCloseWord'
235 | closeword_lmemb = './voctbls/vocabsctgtCloseWord'
236 | savedir = './results_elmo_sc_512/'
237 | '''
238 |
239 | ##### beam search parameters
240 | begineos = True
241 | appendsenteos = True
242 | eosavgemb = False
243 | max_step = 20
244 | beam_width = 10
245 | beam_width_start = 10
246 | # mono = True
247 | renorm = False
248 | cluster = True
249 | temperature = 1
250 | elmo_layer = 'avg'
251 | alpha = 0.1
252 | alpha_start = alpha
253 | stopbyLMeos = False
254 | # ifadditive = False
255 | beta = 0.0
256 |
257 | # find word list
258 | numwords = 6
259 | numwords_outembed = -1
260 | numwords_freq = 500
261 |
262 | # if fix generation length
263 | fixedlen = False
264 | genlen = '9'  # '9, 10, 11' for example for multiple lengths; including the starting '<eos>' token, and can include
265 |             # the ending '<eos>' token as well (if not 'stopbyLMeos')
266 |
267 | ###############################################################################
268 |
269 |
270 | def parse_args():
271 | parser = argparse.ArgumentParser(description='Unsupervised generation of summaries from source file.')
272 | # source file
273 | parser.add_argument('--src', type=str, default=arttxtpath, help='source sentences file')
274 | parser.add_argument('--devid', type=int, default=devid, help='device id; -1 for cpu')
275 | # preparations
276 | parser.add_argument('--vocab', type=str, default=vocab_path, help='vocabulary file')
277 | parser.add_argument('--modelclass', type=str, default=modelclass_path,
278 | help='location of the model class definition file')
279 | parser.add_argument('--model', type=str, default=model_path, help='pre-trained language model')
280 | parser.add_argument('--closeword', type=str, default=closeword, help='character embedding close word tables')
281 | parser.add_argument('--closeword_lmemb', type=str, default=closeword_lmemb,
282 | help='LM output embedding close word tables')
283 | parser.add_argument('--savedir', type=str, default=savedir, help='directory to save results')
284 | # beam search parameters
285 |     parser.add_argument('--begineos', type=int, default=int(begineos), help='whether to start with <eos>')
286 | parser.add_argument('--appendsenteos', type=int, default=int(appendsenteos),
287 |                         help='whether to append <eos> at the end of source sentence')
288 | parser.add_argument('--eosavgemb', type=int, default=int(eosavgemb),
289 | help='whether to encode using average hidden states')
290 | parser.add_argument('--max_step', type=int, default=max_step, help='maximum beam step')
291 | parser.add_argument('--beam_width', type=int, default=beam_width, help='beam width')
292 | parser.add_argument('--beam_width_start', type=int, default=beam_width_start, help='beam width at first step')
293 | parser.add_argument('--renorm', type=int, default=int(renorm),
294 | help='whether to renormalize the probabilities over the sub-vocabulary')
295 | parser.add_argument('--cluster', type=int, default=int(cluster),
296 | help='whether to do clustering for the sub-vocabulary probabilities')
297 | parser.add_argument('--temp', type=float, default=temperature,
298 | help='temperature used to smooth the output of the softmax layer')
299 | parser.add_argument('--elmo_layer', type=str, default=elmo_layer, choices=['bot', 'mid', 'top', 'avg', 'cat'],
300 | help='elmo layer to use')
301 | parser.add_argument('--alpha', type=float, default=alpha, help='mixture coefficient for LM')
302 | parser.add_argument('--alpha_start', type=float, default=alpha_start,
303 | help='mixture coefficient for LM for the first step')
304 | parser.add_argument('--stopbyLMeos', type=int, default=int(stopbyLMeos),
305 | help='whether to stop the sentence solely by LM prediction')
306 |     parser.add_argument('--beta', type=float, default=beta, help='length penalty')
307 | parser.add_argument('--n', type=int, default=numwords,
308 | help='number of closest words for each token to form the candidate list')
309 | parser.add_argument('--ns', type=int, default=numwords_outembed,
310 | help='number of closest words for each token in the output embedding for each token '
311 | 'to screen the candidate list')
312 | parser.add_argument('--nf', type=int, default=numwords_freq,
313 | help='number of the most frequent words in the vocabulary to keep in the candidate list')
314 | parser.add_argument('--fixedlen', type=int, default=int(fixedlen),
315 | help='whether to generate fixed length summaries')
316 | parser.add_argument('--genlen', type=str, default=genlen,
317 | help='lengths of summaries to be generated; should be comma separated')
318 |
319 | args = parser.parse_args()
320 | return args
321 |
322 |
323 | if __name__ == '__main__':
324 | args = parse_args()
325 |
326 | ##### input arguments
327 | arttxtpath = args.src
328 |
329 | devid = args.devid
330 |
331 | vocab_path = args.vocab # vocabulary for the pre-trained language model
332 | modelclass_path = args.modelclass
333 | model_path = args.model
334 |
335 | closewordsim_path = args.closeword + 'Sims.pkl'
336 | closewordind_path = args.closeword + 'Indices.pkl' # character level word embeddings
337 | closewordsim_outembed_path = args.closeword_lmemb + 'Sims_outembed_' + \
338 | os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
339 | closewordind_outembed_path = args.closeword_lmemb + 'Indices_outembed_' + \
340 | os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
341 |
342 | device = 'cpu' if devid == -1 else f'cuda:{devid}'
343 |
344 | ##### beam search parameters
345 | begineos = args.begineos
346 | appendsenteos = args.appendsenteos
347 | eosavgemb = args.eosavgemb if appendsenteos else False
348 | max_step = args.max_step
349 | beam_width = args.beam_width
350 | beam_width_start = args.beam_width_start
351 | mono = True
352 | renorm = args.renorm
353 | cluster = args.cluster
354 | temp = args.temp
355 | elmo_layer = args.elmo_layer
356 | alpha = args.alpha
357 | alpha_start = args.alpha_start
358 | stopbyLMeos = args.stopbyLMeos
359 | ifadditive = False
360 | beta = args.beta
361 | numwords = args.n
362 | numwords_outembed = args.ns if args.ns != -1 else numwords
363 | numwords_freq = args.nf
364 | fixedlen = args.fixedlen
365 |     genlen = list(map(int, args.genlen.split(',')))  # including the starting '<eos>' token
366 |     # and can include the ending '<eos>' token as well (if not 'stopbyLMeos')
367 |
368 | ##### read in the article/source sentences to be summarized
369 | g = open(arttxtpath, 'r')
370 | sents = [line.strip() for line in g if line.strip()]
371 | g.close()
372 | nsents = len(sents)
373 |
374 | ##### load the ELMo forward embedder class
375 | ee = ElmoEmbedderForward(cuda_device=devid)
376 |
377 | ##### load vocabulary and the pre-trained language model
378 | vocab = pickle.load(open(vocab_path, 'rb'))
379 |
380 | if modelclass_path not in sys.path:
381 | sys.path.insert(1, modelclass_path) # this is for torch.load to load the entire model
382 | # the model class file must be included in the search path
383 | LMModel = torch.load(model_path, map_location=torch.device(device))
384 | embedmatrix = LMModel.proj.weight
385 |
386 | ##### check if the close_tables exist already; if not, generate
387 | if not os.path.exists(closewordind_path):
388 | # character embeddings of the vocabulary
389 | embedmatrix_cnn = ELMoBotEmbedding(vocab.itos, device=devid)
390 | values_cnn, indices_cnn = findclosewords_vocab(vocab, embedmatrix_cnn, numwords=500)
391 | # save results
392 | os.makedirs(os.path.dirname(closewordind_path), exist_ok=True)
393 | pickle.dump(values_cnn, open(closewordsim_path, 'wb'))
394 | pickle.dump(indices_cnn, open(closewordind_path, 'wb'))
395 |
396 | if not os.path.exists(closewordind_outembed_path):
397 | values, indices = findclosewords_vocab(vocab, embedmatrix, numwords=500)
398 | # save results
399 | os.makedirs(os.path.dirname(closewordind_outembed_path), exist_ok=True)
400 | pickle.dump(values, open(closewordsim_outembed_path, 'wb'))
401 | pickle.dump(indices, open(closewordind_outembed_path, 'wb'))
402 |
403 | closewordind = pickle.load(open(closewordind_path, 'rb'))
404 | closewordind_outembed = pickle.load(open(closewordind_outembed_path, 'rb'))
405 |
406 | ##### generate save file name
407 | basename = os.path.basename(arttxtpath)
408 | basename = os.path.splitext(basename)[0]
409 |
410 | savedir = args.savedir
411 |
412 | smrypath = os.path.join(savedir, 'smry_') + basename + f'_Ks{beam_width_start}' + f'_clust{int(cluster)}'
413 |
414 | if renorm:
415 | smrypath += f'_renorm{int(renorm)}'
416 | if temp != 1:
417 | smrypath += f'_temper{temp}'
418 | if elmo_layer != 'avg':
419 | smrypath += f'_EL{elmo_layer}'
420 |
421 | smrypath += f'_eosavg{int(eosavgemb)}' + f'_n{numwords}'
422 |
423 | if numwords_outembed != numwords:
424 | smrypath += f'_ns{numwords_outembed}'
425 | if numwords_freq != 500:
426 | smrypath += f'_nf{numwords_freq}'
427 | if beam_width != 10:
428 | smrypath += f'_K{beam_width}'
429 | if stopbyLMeos:
430 | smrypath += f'_soleLMeos'
431 |
432 | if alpha_start != alpha:
433 | smrypath += f'_as{alpha_start}'
434 | if fixedlen:
435 | genlen = sorted(genlen)
436 | smrypath_list = [smrypath + f'_length{l - 1}' + f'_a{alpha}' + '_all.txt' for l in genlen]
437 | else:
438 | smrypath += f'_a{alpha}' + f'_b{beta}' + '_all.txt'
439 |
440 | ##### run summary generation and write to file
441 | if fixedlen:
442 | os.makedirs(os.path.dirname(smrypath), exist_ok=True)
443 | g_list = [open(fname, 'w') for fname in smrypath_list]
444 | else:
445 | os.makedirs(os.path.dirname(smrypath), exist_ok=True)
446 | g = open(smrypath, 'w')
447 |
448 | start = time.time()
449 | for ind in tqdm(range(nsents)):
450 | template = sents[ind].strip('.').strip() # remove '.' at the end
451 | if appendsenteos:
452 |             template += ' <eos>'
453 |
454 | ### Find the close words to those in the template sentence
455 | # word_list, subvocab = findwordlist(template, closewordind, vocab, numwords=1, addeos=True)
456 | # word_list, subvocab = findwordlist_screened(template, closewordind, closewordind_outembed,
457 | # vocab, numwords=6, addeos=True)
458 | word_list, subvocab = findwordlist_screened2(template, closewordind, closewordind_outembed, vocab,
459 | numwords=numwords, numwords_outembed=numwords_outembed,
460 | numwords_freq=numwords_freq, addeos=True)
461 | if cluster:
462 | clustermask = clmk_nn(embedmatrix, subvocab)
463 |
464 | ### ELMo embedding of the template sentence
465 |         if not eosavgemb:
466 | template_vec, _ = ee.embed_sentence(template.split(), add_bos=True)
467 | else:
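            # Build the source representation token by token: run ELMo forward one word at a
            # time, average the recurrent hidden states over all prefixes, and embed the final
            # '<eos>' from that averaged state (this branch is only taken when --eosavgemb is set).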
468 | tt = template.split()[:-1]
469 | hiddens = []
470 | template_vec = None
471 | current_hidden = None
472 | for i in range(len(tt)):
473 | current_embed, current_hidden = ee.embed_sentence([tt[i]], add_bos=True if i == 0 else False,
474 | initial_state=current_hidden)
475 | hiddens.append(current_hidden)
476 | template_vec = current_embed if template_vec is None else torch.cat([template_vec, current_embed],
477 | dim=1)
478 | hiddens_h, hiddens_c = zip(*hiddens)
479 | hiddens_avg = (sum(hiddens_h) / len(hiddens_h), sum(hiddens_c) / len(hiddens_c))
480 |             eosavg, _ = ee.embed_sentence(['<eos>'], initial_state=hiddens_avg)
481 | template_vec = torch.cat([template_vec, eosavg], dim=1)
482 |
483 | ### beam search
484 | max_step_temp = min([len(template.split()), max_step])
485 | beam = gensummary_elmo(template_vec,
486 | ee,
487 | vocab,
488 | LMModel,
489 | word_list,
490 | subvocab,
491 | clustermask=clustermask if cluster else None,
492 | renorm=renorm,
493 | temperature=temp,
494 | elmo_layer=elmo_layer,
495 | max_step=max_step_temp,
496 | beam_width=beam_width,
497 | beam_width_start=beam_width_start,
498 | mono=mono,
499 | alpha=alpha,
500 | alpha_start=alpha_start,
501 | begineos=begineos,
502 | stopbyLMeos=stopbyLMeos,
503 | ifadditive=ifadditive,
504 | devid=devid)
505 |
506 | ### sort and write to file
507 | if fixedlen:
508 | for j in range(len(genlen) - 1, -1, -1):
509 | g_list[j].write('-' * 5 + f'<{ind + 1}>' + '-' * 5 + '\n')
510 | g_list[j].write('\n')
511 | if genlen[j] <= beam.step:
512 | ssa = fixlensummary(beam, length=genlen[j])
513 | if ssa == []:
514 | g_list[j].write('\n')
515 | else:
516 | for m in range(len(ssa)):
517 | g_list[j].write(' '.join(ssa[m][1][1:]) + '\n')
518 | g_list[j].write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3])
519 | + ' ' + '{:.3f}'.format(ssa[m][4]) + '\n')
520 | g_list[j].writelines(['%d, ' % loc for loc in ssa[m][2]])
521 | g_list[j].write('\n')
522 | g_list[j].write('\n')
523 | else:
524 | g_list[j].write('\n')
525 |
526 | if (ind + 1) % 10 == 0:
527 | g_list[j].flush()
528 | os.fsync(g_list[j].fileno())
529 | else:
530 | ssa = sortsummary(beam, beta=beta)
531 | g.write('-' * 5 + f'<{ind + 1}>' + '-' * 5 + '\n')
532 | g.write('\n')
533 | if ssa == []:
534 | g.write('\n')
535 | else:
536 | for m in range(len(ssa)):
537 | g.write(' '.join(ssa[m][1][1:]) + '\n')
538 | g.write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3]) + ' ' + '{:.3f}'.format(
539 | ssa[m][4]) + '\n')
540 | g.writelines(['%d, ' % loc for loc in ssa[m][2]])
541 | g.write('\n')
542 | g.write('\n')
543 |
544 | if (ind + 1) % 10 == 0:
545 | g.flush()
546 | os.fsync(g.fileno())
547 |
548 | print('time elapsed %s' % timeSince(start))
549 | if fixedlen:
550 | for gg in g_list:
551 | gg.close()
552 | print('results saved to: %s' % (("\n" + " " * 18).join(smrypath_list)))
553 | else:
554 | g.close()
555 | print(f'results saved to: {smrypath}')
556 |
--------------------------------------------------------------------------------
/uss/summary_search_gpt2.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import pickle
3 | import time
4 | import math
5 | import sys
6 | import os
7 | from tqdm import tqdm
8 | import argparse
9 |
10 | from pre_closetables import ELMoBotEmbedding, findclosewords_vocab
11 |
12 | from gpt2_sequential_embedder import GPT2Embedder
13 | # from pre_word_list import findwordlist, findwordlist_screened
14 | from pre_word_list import findwordlist_screened2
15 | from lm_subvocab import clmk_nn
16 | from beam_search import Beam
17 | from utils import timeSince
18 |
19 |
20 | def gensummary_gpt2(template_vec,
21 | ge,
22 | vocab,
23 | LMModel,
24 | word_list,
25 | subvocab,
26 | clustermask=None,
27 | mono=True,
28 | renorm=True,
29 | temperature=1,
30 | bpe2word='last',
31 | max_step = 20,
32 | beam_width = 10,
33 | beam_width_start = 10,
34 | alpha=0.1,
35 | alpha_start=0.1,
36 | begineos=True,
37 | stopbyLMeos=False,
38 | devid=0,
39 | **kwargs):
40 | """
41 |     Unsupervised sentence summary generation using beam search, by contextual matching and a summary-style language model.
42 |     The contextual matching here is on top of pretrained GPT-2 embeddings.
43 |
44 | Input:
45 |         template_vec: forward GPT-2 contextual embeddings of the source sentence. 'torch.Tensor'.
46 | ge: 'gpt2_sequential_embedder.GPT2Embedder' object.
47 | vocab: 'torchtext.vocab.Vocab' object. Should be the same as is used for the pretrained language model.
48 | LMModel: a pretrained language model on the summary sentences.
49 | word_list: a list of words in the vocabulary to work with. 'List'.
50 | subvocab: 'torch.LongTensor' consisting of the indices of the words corresponding to 'word_list'.
51 |         clustermask: a binary mask for each sub-vocabulary word. 'torch.ByteTensor' of size (len(sub-vocabulary), len(vocabulary)). Default: None.
52 |         mono: whether to enforce the monotonic alignment constraint. Default: True.
53 | renorm: whether to renormalize the probabilities over the sub-vocabulary. Default: True.
54 |         temperature: temperature applied to the softmax of the language model. Default: 1.
55 | bpe2word: how to turn the BPE vectors into word vectors. Choose from ['last', 'avg']. Default: 'last'.
56 | max_step: maximum number of beam steps.
57 | beam_width: beam width.
58 | beam_width_start: beam width of the first step.
59 |         alpha: mixture weight of the language model in scoring. The score is: (1 - \alpha) * similarity_logscore + \alpha * LM_logscore.
60 |         begineos: whether to begin with the special '<eos>' token, as in language model training. Note that the GPT-2 embedder adds its own special beginning token. Default: True.
61 |         stopbyLMeos: whether to stop a sentence solely by the language model predicting '<eos>' as the top possibility. Default: False.
62 |         devid: device id on which to run the algorithm and the LSTM language model. 'int', default: 0; -1 for cpu.
63 |         **kwargs: other arguments passed to the beam-step scoring function.
64 | E.g. normalized: whether to normalize the dot product when calculating the similarity, which makes it cosine similarity. Default: True.
65 | ifadditive: whether to use an additive model on mixing the probability scores. Default: False.
66 |
67 | Output:
68 | beam: 'Beam' object, recording all the generated sequences.
69 |
70 | """
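    # Hedged illustration of the score mixture described above (numbers hypothetical): with
    # alpha = 0.1, a candidate whose contextual-matching log-score is -1.0 and whose LM
    # log-score is -3.0 receives 0.9 * (-1.0) + 0.1 * (-3.0) = -1.2; the actual computation
    # happens inside beam.combscoreK_GPT2 below.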
71 | device = 'cpu' if devid == -1 else f'cuda:{devid}'
72 |
73 | # Beam Search: initialization
74 | if begineos:
75 |         beam = Beam(1, vocab, init_ids=[vocab.stoi['<eos>']], device=device,
76 | sim_score=0, lm_score=0, lm_state=None, gpt2_state=None, align_loc=None)
77 | else:
78 | beam = Beam(1, vocab, init_ids=[None], device=device,
79 | sim_score=0, lm_score=0, lm_state=None, gpt2_state=None, align_loc=None)
80 |
81 | # first step: start with 'beam_width_start' best matched words
82 | beam.beamstep(beam_width_start,
83 | beam.combscoreK_GPT2,
84 | template_vec=template_vec,
85 | ge=ge,
86 | LMModel=LMModel,
87 | word_list=word_list,
88 | subvocab=subvocab,
89 | clustermask=clustermask,
90 | alpha=alpha_start,
91 | renorm=renorm,
92 | temperature=temperature,
93 | bpe2word=bpe2word,
94 | normalized=True,
95 | ifadditive=False,
96 | **kwargs)
97 |
98 |     # run beam search until all hypotheses end with '<eos>' or max_step is reached
99 | for s in range(max_step):
100 | print(f'beam step {s+1} ' + '-' * 50 + '\n')
101 | beam.beamstep(beam_width,
102 | beam.combscoreK_GPT2,
103 | template_vec=template_vec,
104 | ge=ge,
105 | LMModel=LMModel,
106 | word_list=word_list,
107 | subvocab=subvocab,
108 | clustermask=clustermask,
109 | mono=mono,
110 | alpha=alpha,
111 | renorm=renorm,
112 | temperature=temperature,
113 | stopbyLMeos=stopbyLMeos,
114 | bpe2word=bpe2word,
115 | normalized=True,
116 | ifadditive=False,
117 | **kwargs)
118 | # all beams reach termination
119 | if beam.endall:
120 | break
121 |
122 | return beam
123 |
124 |
125 | def sortsummary(beam, beta=0):
126 | """
127 | Sort the generated summaries by beam search, with length penalty considered.
128 |
129 | Input:
130 | beam: 'Beam' object finished with beam search.
131 | beta: length penalty when sorting. Default: 0 (no length penalty).
132 | Output:
133 | ssa: 'List[Tuple]' of (score_avg, sentence, alignment, sim_score, lm_score).
134 | """
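    # Sketch of the length penalty (values hypothetical): assuming each entry of beam.endbus is
    # (beam index, length, cumulative score, sim score, lm score), beta = 1.0 ranks a length-8
    # hypothesis with cumulative score -4.0 by -4.0 / 8 ** 1.0 = -0.5, while beta = 0 (the
    # default) leaves the raw cumulative score unchanged.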
135 | sents = []
136 | aligns = []
137 | score_avgs = []
138 | sim_scores = []
139 | lm_scores = []
140 |
141 | for ks in beam.endbus:
142 | sent, rebeam = beam.retrieve(ks[0] + 1, ks[1])
143 | score_avg = ks[2] / (ks[1] ** beta)
144 |
145 | sents.append(sent)
146 | aligns.append(beam.retrieve_align(rebeam))
147 | score_avgs.append(score_avg)
148 | sim_scores.append(ks[3])
149 | lm_scores.append(ks[4])
150 |
151 | ssa = sorted([(score_avgs[i], sents[i], aligns[i], sim_scores[i], lm_scores[i]) for i in range(len(sents))], reverse=True)
152 |
153 | return ssa
154 |
155 |
156 | def fixlensummary(beam, length=-1):
157 | """
158 | Pull out fixed length summaries from the beam search.
159 |
160 | Input:
161 | beam: 'Beam' object finished with beam search.
162 | length: wanted length of the summary.
163 | Output:
164 | ssa: 'List[Tuple]' of sorted (score, sentence, alignments, sim_score, lm_score).
165 | """
166 |     assert 1 <= length <= beam.step, 'invalid sentence length.'
167 |
168 | ssa = []
169 | for i in range(beam.K[length]):
170 |         sent, rebeam = beam.retrieve(i + 1, length)
171 | ssa.append((beam.beamseq[length][i].score, sent, beam.retrieve_align(rebeam), beam.beamseq[length][i].sim_score, beam.beamseq[length][i].lm_score))
172 |
173 | return ssa
174 |
175 |
176 | ##### input arguments
177 | #arttxtpath = '../LM/data/Giga-sum/input_unk_250.txt'
178 | #arttxtpath = '../LM/data/Giga-sum/input_unk_251-500.txt'
179 | #arttxtpath = '../LM/data/Giga-sum/input_unk_501-750.txt'
180 | #arttxtpath = '../LM/data/Giga-sum/input_unk_751-1000.txt'
181 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1001-1250.txt'
182 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1251-1500.txt'
183 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1501-1750.txt'
184 | #arttxtpath = '../LM/data/Giga-sum/input_unk_1751-1951.txt'
185 |
186 | arttxtpath = '../LM/data/Giga-sum/input_unk.txt'
187 |
188 | devid = 0
189 |
190 | '''
191 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
192 |
193 | vocab_path = '../LM/LSTM/models_sc/vocabsctgt.pkl'
194 | modelclass_path = '../LM/LSTM'
195 | model_path = '../LM/LSTM/models_sc/sctgt_LSTM_1024_untied.pth'
196 | closeword = './voctbls/vocabsctgtCloseWord'
197 | closeword_lmemb = './voctbls/vocabsctgtCloseWord_1024_untied_'
198 | savedir = './results_sc_1024_untied_gpt2/'
199 | '''
200 |
201 | '''
202 | arttxtpath = '/n/rush_lab/users/jzhou/sentence-compression/dataclean/eval_src_1000_unk.txt'
203 |
204 | vocab_path = '../LM/LSTM/models_sc/vocabsctgt.pkl'
205 | modelclass_path = '../LM/LSTM'
206 | model_path = '../LM/LSTM/models_sc/sctgt_LSTM_untied.pth'
207 | closeword = './voctbls/vocabsctgtCloseWord'
208 | closeword_lmemb = './voctbls/vocabsctgtCloseWord_untied_'
209 | savedir = './results_sc_untied/'
210 | '''
211 |
212 |
213 | vocab_path = '../LM/LSTM/models/vocabTle.pkl'
214 | modelclass_path = '../LM/LSTM'
215 | model_path = '../LM/LSTM/models/Tle_LSTM_untied.pth'
216 | closeword = './voctbls/vocabTleCloseWord'
217 | closeword_lmemb = './voctbls/vocabTleCloseWord_untied_'
218 | savedir = './results_gpt2/'
219 |
220 |
221 | # vocab_path = '../LM/LSTM/models/vocabTle.pkl'
222 | # modelclass_path = '../LM/LSTM'
223 | # model_path = '../LM/LSTM/models/Tle_LSTM.pth'
224 | # closeword = 'vocabTleCloseWord'
225 | # closeword_lmemb = 'vocabTleCloseWord'
226 | # savedir = './results/'
227 |
228 |
229 | # vocab_path = '../LM/LSTM_LUC/models/vocabTle50k.pkl'
230 | # modelclass_path = '../LM/LSTM_LUC'
231 | # model_path = '../LM/LSTM_LUC/models/TleLUC_wtI_noB_0-0.0001-1Penalty.pth'
232 | # closeword = 'vocabTle50kCloseWord'
233 | # closeword_lmemb = 'vocabTle50kCloseWord'
234 |
235 | # vocab_path = '../4.0_cluster/vocabTle.pkl' # vocabulary for the pretrained language model
236 | # modelclass_path = '../LM/LSTM_MoS'
237 | # model_path = '../LM/LSTM_MoS/models/LMModelMoSTle2.pth'
238 | # closeword = '../4.0_cluster/vocabTleCloseWord' # character level word embeddings
239 | # closeword_lmemb = 'vocabTleCloseWord'
240 |
241 |
242 | ##### beam search parameters
243 | begineos = True
244 | appendsenteos = True
245 | eosavgemb = False
246 | max_step = 20
247 | beam_width = 10
248 | beam_width_start = 10
249 | # mono = True
250 | renorm = False
251 | cluster = True
252 | temperature = 1
253 | bpe2word = 'last'
254 | alpha = 0.1
255 | alpha_start = alpha
256 | stopbyLMeos = False
257 | # ifadditive = False
258 | beta = 0.0
259 |
260 | # find word list
261 | numwords = 6
262 | numwords_outembed = -1
263 | numwords_freq = 500
264 |
265 | # if fix generation length
266 | fixedlen = False
267 | genlen = '9'
268 | # genlen = [9]  # including the starting '<eos>' token, and can include the ending '<eos>' token as well (if not 'stopbyLMeos')
269 |
270 |
271 | def parse_args():
272 |     parser = argparse.ArgumentParser(description='Unsupervised summary generation from a source file.')
273 | # source file
274 | parser.add_argument('--src', type=str, default=arttxtpath, help='source sentences file')
275 | parser.add_argument('--devid', type=int, default=devid, help='device id; -1 for cpu')
276 | # preparations
277 | parser.add_argument('--vocab', type=str, default=vocab_path, help='vocabulary file')
278 | parser.add_argument('--modelclass', type=str, default=modelclass_path, help='location of the model class definition file')
279 | parser.add_argument('--model', type=str, default=model_path, help='pre-trained language model')
280 | parser.add_argument('--closeword', type=str, default=closeword, help='character embedding close word tables')
281 | parser.add_argument('--closeword_lmemb', type=str, default=closeword_lmemb, help='LM output embedding close word tables')
282 | parser.add_argument('--savedir', type=str, default=savedir, help='directory to save results')
283 | # beam search parameters
284 |     parser.add_argument('--begineos', type=int, default=int(begineos), help='whether to start with <eos>')
285 |     parser.add_argument('--appendsenteos', type=int, default=int(appendsenteos), help='whether to append <eos> at the end of the source sentence')
286 |     parser.add_argument('--eosavgemb', type=int, default=int(eosavgemb), help='whether to encode <eos> using average hidden states (deprecated)')
287 | parser.add_argument('--max_step', type=int, default=max_step, help='maximum beam step')
288 | parser.add_argument('--beam_width', type=int, default=beam_width, help='beam width')
289 | parser.add_argument('--beam_width_start', type=int, default=beam_width_start, help='beam width at first step')
290 | parser.add_argument('--renorm', type=int, default=int(renorm), help='whether to renormalize the probabilities over the sub-vocabulary')
291 | parser.add_argument('--cluster', type=int, default=int(cluster), help='whether to do clustering for the sub-vocabulary probabilities')
292 | parser.add_argument('--temp', type=float, default=temperature, help='temperature used to smooth the output of the softmax layer')
293 | parser.add_argument('--bpe2word', type=str, default=bpe2word, choices=['last', 'avg'], help='how to use BPE hidden states to represent a word')
294 | parser.add_argument('--alpha', type=float, default=alpha, help='mixture coefficient for LM')
295 | parser.add_argument('--alpha_start', type=float, default=alpha_start, help='mixture coefficient for LM for the first step')
296 | parser.add_argument('--stopbyLMeos', type=int, default=int(stopbyLMeos), help='whether to stop the sentence solely by LM prediction')
297 |     parser.add_argument('--beta', type=float, default=beta, help='length penalty')
298 | parser.add_argument('--n', type=int, default=numwords, help='number of closest words for each token to form the candidate list')
299 |     parser.add_argument('--ns', type=int, default=numwords_outembed, help='number of closest words for each token under the LM output embedding, used to screen the candidate list')
300 | parser.add_argument('--nf', type=int, default=numwords_freq, help='number of the most frequent words in the vocabulary to keep in the candidate list')
301 | parser.add_argument('--fixedlen', type=int, default=int(fixedlen), help='whether to generate fixed length summaries')
302 | parser.add_argument('--genlen', type=str, default=genlen, help='lengths of summaries to be generated; should be comma separated')
303 |
304 | args = parser.parse_args()
305 | return args
306 |
307 |
308 | if __name__ == '__main__':
309 | args = parse_args()
310 |
311 | ##### input arguments
312 | arttxtpath = args.src
313 |
314 | devid = args.devid
315 |
316 | vocab_path = args.vocab # vocabulary for the pretrained language model
317 | modelclass_path = args.modelclass
318 | model_path = args.model
319 |
320 | closewordsim_path = args.closeword + 'Sims.pkl'
321 | closewordind_path = args.closeword + 'Indices.pkl' # character level word embeddings
322 | closewordsim_outembed_path = args.closeword_lmemb + 'Sims_outembed_' + os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
323 | closewordind_outembed_path = args.closeword_lmemb + 'Indices_outembed_' + os.path.splitext(os.path.basename(model_path))[0] + '.pkl'
324 |
325 | device = 'cpu' if devid == -1 else f'cuda:{devid}'
326 |
327 | ##### beam search parameters
328 | begineos = args.begineos
329 | appendsenteos = args.appendsenteos
330 | eosavgemb = args.eosavgemb if appendsenteos else False
331 | max_step = args.max_step
332 | beam_width = args.beam_width
333 | beam_width_start = args.beam_width_start
334 | mono = True
335 | renorm = args.renorm
336 | cluster = args.cluster
337 | temp = args.temp
338 | bpe2word = args.bpe2word
339 | alpha = args.alpha
340 | alpha_start = args.alpha_start
341 | stopbyLMeos = args.stopbyLMeos
342 | ifadditive = False
343 | beta = args.beta
344 | numwords = args.n
345 | numwords_outembed = args.ns if args.ns != -1 else numwords
346 | numwords_freq = args.nf
347 | fixedlen = args.fixedlen
348 |     genlen = list(map(int, args.genlen.split(',')))  # including the starting '<eos>' token
349 |     # and can include the ending '<eos>' token as well (if not 'stopbyLMeos')
350 |
351 | ##### read in the article/source sentences to be summarized
352 | g = open(arttxtpath, 'r')
353 | sents = [line.strip() for line in g if line.strip()]
354 | g.close()
355 | nsents = len(sents)
356 |
357 | ##### load the GPT-2 embedder class
358 | ge = GPT2Embedder(cuda_device=devid)
359 |
360 | ##### load vocabulary and the pre-trained language model
361 | vocab = pickle.load(open(vocab_path, 'rb'))
362 |
363 | if modelclass_path not in sys.path:
364 | sys.path.insert(1, modelclass_path) # this is for torch.load to load the entire model; the model class file must be included in the search path
365 | LMModel = torch.load(model_path, map_location=torch.device(device))
366 | embedmatrix = LMModel.proj.weight
367 |
368 | ##### check if the close_tables exist already; if not, generate
369 | if not os.path.exists(closewordind_path):
370 | # character embeddings of the vocabulary
371 | embedmatrix_cnn = ELMoBotEmbedding(vocab.itos, device=devid)
372 | values_cnn, indices_cnn = findclosewords_vocab(vocab, embedmatrix_cnn, numwords=500)
373 | # save results
374 | pickle.dump(values_cnn, open(closewordsim_path, 'wb'))
375 | pickle.dump(indices_cnn, open(closewordind_path, 'wb'))
376 |
377 | if not os.path.exists(closewordind_outembed_path):
378 | values, indices = findclosewords_vocab(vocab, embedmatrix, numwords=500)
379 | # save results
380 | pickle.dump(values, open(closewordsim_outembed_path, 'wb'))
381 | pickle.dump(indices, open(closewordind_outembed_path, 'wb'))
382 |
383 | closewordind = pickle.load(open(closewordind_path, 'rb'))
384 | closewordind_outembed = pickle.load(open(closewordind_outembed_path, 'rb'))
385 |
386 | ##### generate save file name
387 | basename = os.path.basename(arttxtpath)
388 | basename = os.path.splitext(basename)[0]
389 |
390 | savedir = args.savedir
391 | # savedir = './results/'
392 |
393 | smrypath = os.path.join(savedir, 'smry_') + basename + f'_Ks{beam_width_start}' + f'_clust{int(cluster)}'
394 |
395 | if renorm:
396 | smrypath += f'_renorm{int(renorm)}'
397 | if temp != 1:
398 | smrypath += f'_temper{temp}'
399 | if bpe2word != 'last':
400 | smrypath += f'_BPE{bpe2word}'
401 |
402 | # smrypath += f'_eosavg{int(eosavgemb)}' + f'_n{numwords}'
403 | smrypath += f'_n{numwords}'
404 |
405 | if numwords_outembed != numwords:
406 | smrypath += f'_ns{numwords_outembed}'
407 | if numwords_freq != 500:
408 | smrypath += f'_nf{numwords_freq}'
409 | if beam_width != 10:
410 | smrypath += f'_K{beam_width}'
411 | if stopbyLMeos:
412 | smrypath += f'_soleLMeos'
413 | ############################
414 | # smrypath += '_close1'
415 | ############################
416 | if alpha_start != alpha:
417 | smrypath += f'_as{alpha_start}'
418 | if fixedlen:
419 | genlen = sorted(genlen)
420 | smrypath_list = [smrypath + f'_length{l - 1}' + f'_a{alpha}' + '_all.txt' for l in genlen]
421 | else:
422 | smrypath += f'_a{alpha}' + f'_b{beta}' + '_all.txt'
423 |
424 | ##### run summary generation and write to file
425 | if fixedlen:
426 | g_list = [open(fname, 'w') for fname in smrypath_list]
427 | else:
428 | g = open(smrypath, 'w')
429 |
430 | start = time.time()
431 | for ind in tqdm(range(nsents)):
432 | template = sents[ind].strip('.').strip() # remove '.' at the end
433 | if appendsenteos:
434 |             template += ' <eos>'
435 |
436 | ### Find the close words to those in the template sentence
437 | # word_list, subvocab = findwordlist(template, closewordind, vocab, numwords=1, addeos=True)
438 | # word_list, subvocab = findwordlist_screened(template, closewordind, closewordind_outembed, vocab, numwords=6, addeos=True)
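        # Candidate vocabulary for this source sentence: per the argparse help, roughly the
        # 'numwords' closest vocabulary words to each source token (character-level embeddings),
        # screened by the 'numwords_outembed' closest words under the LM output embedding, plus
        # the 'numwords_freq' most frequent words and '<eos>' (addeos=True); see
        # pre_word_list.findwordlist_screened2 for the exact construction.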
439 | word_list, subvocab = findwordlist_screened2(template, closewordind, closewordind_outembed, vocab, numwords=numwords, numwords_outembed=numwords_outembed, numwords_freq=numwords_freq, addeos=True)
440 | if cluster:
441 | clustermask = clmk_nn(embedmatrix, subvocab)
442 |
443 | ### GPT-2 embedding of the template sentence
444 | if not eosavgemb:
445 | template_vec, _ = ge.embed_sentence(template.split(), add_bos=True, bpe2word=bpe2word)
446 | else:
447 |             raise ValueError('eosavgemb is not supported with the GPT-2 embedder')
448 |
449 | ### beam search
450 | max_step_temp = min([len(template.split()), max_step])
451 | beam = gensummary_gpt2(template_vec,
452 | ge,
453 | vocab,
454 | LMModel,
455 | word_list,
456 | subvocab,
457 | clustermask=clustermask if cluster else None,
458 | renorm=renorm,
459 | temperature=temp,
460 | bpe2word=bpe2word,
461 | max_step=max_step_temp,
462 | beam_width=beam_width,
463 | beam_width_start=beam_width_start,
464 | mono=True,
465 | alpha=alpha,
466 | alpha_start=alpha_start,
467 | begineos=begineos,
468 | stopbyLMeos=stopbyLMeos,
469 | devid=devid)
470 |
471 | ### sort and write to file
472 | if fixedlen:
473 | for j in range(len(genlen) - 1, -1, -1):
474 | g_list[j].write('-' * 5 + f'<{ind+1}>' + '-' * 5 + '\n')
475 | g_list[j].write('\n')
476 | if genlen[j] <= beam.step:
477 | ssa = fixlensummary(beam, length=genlen[j])
478 | if ssa == []:
479 | g_list[j].write('\n')
480 | else:
481 | for m in range(len(ssa)):
482 | g_list[j].write(' '.join(ssa[m][1][1:]) + '\n')
483 | g_list[j].write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3])
484 | + ' ' + '{:.3f}'.format(ssa[m][4]) + '\n')
485 | g_list[j].writelines(['%d, ' % loc for loc in ssa[m][2]])
486 | g_list[j].write('\n')
487 | g_list[j].write('\n')
488 | else:
489 | g_list[j].write('\n')
490 |
491 | if (ind + 1) % 10 == 0:
492 | g_list[j].flush()
493 | os.fsync(g_list[j].fileno())
494 | else:
495 | ssa = sortsummary(beam, beta=beta)
496 | g.write('-' * 5 + f'<{ind+1}>' + '-' * 5 + '\n')
497 | g.write('\n')
498 | if ssa == []:
499 | g.write('\n')
500 | else:
501 | for m in range(len(ssa)):
502 | g.write(' '.join(ssa[m][1][1:]) + '\n')
503 | g.write('{:.3f}'.format(ssa[m][0]) + ' ' + '{:.3f}'.format(ssa[m][3]) + ' ' + '{:.3f}'.format(ssa[m][4]) + '\n')
504 | g.writelines(['%d, ' % loc for loc in ssa[m][2]])
505 | g.write('\n')
506 | g.write('\n')
507 |
508 | if (ind + 1) % 10 == 0:
509 | g.flush()
510 | os.fsync(g.fileno())
511 |
512 | print('time elapsed %s' % timeSince(start))
513 | if fixedlen:
514 | for gg in g_list:
515 | gg.close()
516 | print('results saved to: %s' % (("\n" + " " * 18).join(smrypath_list)))
517 | else:
518 | g.close()
519 | print(f'results saved to: {smrypath}')
520 |
521 |
--------------------------------------------------------------------------------
/uss/summary_select_eval.py:
--------------------------------------------------------------------------------
1 | """
2 | Post-processing of the generated summary sentences:
3 | 1. From all the candidate summaries produced by beam search, select one per source sentence using a length penalty
4 | 2. Evaluate ROUGE scores, copy rate, compression rate, etc.
5 | """
6 | import os
7 | import argparse
8 |
9 |
10 | def copy_rate(sent1, sent2):
11 | """
12 |     Copy rate between two sentences:
13 |     the proportion of unigrams in sentence 1 that are copied from sentence 2.
14 |
15 | Input:
16 | sent1, sent2: two sentence strings (generated summary, source).
17 | Output:
18 | score: copy rate on unigrams.
19 | """
20 | sent1_split = set(sent1.split())
21 | sent2_split = set(sent2.split())
22 | intersection = sent1_split.intersection(sent2_split)
23 | # recall = len(intersection) / len(sent2_split)
24 | precision = len(intersection) / len(sent1_split)
25 | # union = sent1_split.union(sent2_split)
26 |     # jacd = 1 - len(intersection) / len(union)  # Jaccard distance
27 | # score = stats.hmean([recall, precision]) # F1 score (need to import scipy.stats.hmean)
28 | # score = 2 * recall * precision / (recall + precision) if recall != 0 and precision != 0 else 0 # F1 score
29 |
30 | return precision
31 |
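# Worked example with hypothetical strings: copy_rate('police arrest suspect',
# 'police arrest a suspect in london') intersects the unigram sets; all 3 distinct summary
# tokens appear in the source, so the copy rate is 3 / 3 = 1.0.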
32 |
33 | # =============== some default path arguments ==================================
34 | src = '/n/rush_lab/users/jzhou/LM/data/Giga-sum/input_unk.txt'
35 | ref = '/n/rush_lab/users/jzhou/LM/data/Giga-sum/task1_ref0.txt'
36 |
37 | '''
38 | gens = '/n/rush_lab/users/jzhou/5.0_cluster/results_untied/smry_input_unk_Ks10_clust0_temper10.0_ELcat_eosavg0_n6_ns10_nf300_a0.1_b0.0_all.txt'
39 | save_dir = './results_untied/'
40 | '''
41 | gen = './results_gpt2/smry_input_unk_Ks10_clust1_n6_ns10_nf300_a0.1_b0.0_all.txt'
42 | save_dir = './results_gpt2/'
43 |
44 | lp = 0.1
45 | # ===============================================================================
46 |
47 |
48 | def parse_args():
49 | parser = argparse.ArgumentParser(description='Post-processing and evaluation of the generated summary sentences')
50 | parser.add_argument('--src', type=str, default=src, help='source sentence path')
51 | parser.add_argument('--ref', type=str, default=ref, help='reference summary path')
52 | parser.add_argument('--gen', type=str, default=gen, help='generated summary path')
53 | parser.add_argument('--save_dir', type=str, default=save_dir, help='directory to save the result')
54 | parser.add_argument('--lp', type=float, default=lp, help='length penalty (additive onto length)')
55 | args = parser.parse_args()
56 | return args
57 |
58 |
59 | if __name__ == '__main__':
60 | args = parse_args()
61 |
62 | # read in the source, reference, and generated summaries (a list of summaries for each source sentence)
63 | g = open(args.src, 'r')
64 | arts = [line.strip().strip(' .') for line in g if line.strip()]
65 | g.close()
66 |
67 | g = open(args.ref, 'r')
68 | refs = [line.strip() for line in g if line.strip()]
69 | g.close()
70 |
71 | g = open(args.gen, 'r')
72 | lines = [line.strip() for line in g if line.strip()]
73 | g.close()
74 |
75 |     # length penalty for selecting the finished hypothesis from beam search:
76 |     # each candidate's combined score is divided by (length - lp) ** b before ranking
77 | lp = args.lp
78 | b = 1.0
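    # Illustration with hypothetical numbers: with lp = 0.1 and b = 1.0, a 9-token hypothesis
    # whose combined log-score is -4.5 is ranked by -4.5 / (9 - 0.1) ** 1.0 ≈ -0.506; the
    # hypothesis with the largest (least negative) value is selected below.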
79 |
80 | # generate the new path to save the results
81 | basename = os.path.basename(args.gen)
82 | basename = os.path.splitext(basename)[0]
83 |
84 | gen_selected_path_new = os.path.join(args.save_dir, basename.replace('b0.0_all', f'b{b}_single') + '.txt')
85 |
86 | # select a single summary sentence for each source sentence with length penalty
87 | os.makedirs(args.save_dir, exist_ok=True)
88 | g = open(gen_selected_path_new, 'w')
89 |
90 | i = 0
91 | j = 1
92 | count = 0
93 | cp_rate = []
94 | lens = []
95 | comp_rate = []
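    # The generated file (as written by summary_search_elmo.py / summary_search_gpt2.py) is read
    # as blocks: a header line '-----<i>-----' per source sentence followed, for every kept
    # hypothesis, by three lines: the summary tokens, the three scores (combined, matching, LM),
    # and the comma-separated alignment positions; blank lines were dropped when reading above.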
96 | while j <= len(lines):
97 |         if j == len(lines) or (lines[j].startswith('-----') and not lines[j].startswith('----- ')):
98 | count += 1
99 | # from i to j-1
100 | curl = lines[(i + 1):j]
101 | ssa = [(curl[k], curl[k + 1], curl[k + 2]) for k in range(len(curl)) if k % 3 == 0]
102 | ssa = sorted(ssa, key=lambda x: float(x[1].split()[0]) / (len(x[0].split()) - lp) ** b, reverse=True)
103 | # float(x[1].split()[0]) for the combined score
104 | # float(x[1].split()[1]) for the contextual matching score
105 | # float(x[1].split()[2]) for the language model score
106 |
107 | # if arts[count - 1] == '':
108 | # g.write('\n')
109 | # else:
110 | # if len(ssa[0][0].split()) <= 1:
111 | # g.write(arts[count - 1])
112 | # g.write('\n')
113 | # cp_rate.append(1)
114 | # else:
115 | # g.write(' '.join(ssa[0][0].split()[:-1]))
116 | # g.write('\n')
117 | # cp_rate.append(copy_rate(' '.join(ssa[0][0].split()[:-1]), arts[count - 1]))
118 |
119 | if len(ssa[0][0].split()) <= 1:
120 | # blank line: directly copy the source for summary
121 | g.write(arts[count - 1])
122 | g.write('\n')
123 | cp_rate.append(1)
124 | comp_rate.append(1)
125 | lens.append(len(arts[count - 1].split()))
126 | else:
127 |                 g.write(' '.join(ssa[0][0].split()[:-1]))  # do not include the last token, which matches the appended '<eos>'
128 | g.write('\n')
129 | cp_rate.append(copy_rate(' '.join(ssa[0][0].split()[:-1]), arts[count - 1]))
130 | comp_rate.append(len(ssa[0][0].split()[:-1]) / len(arts[count - 1].split()))
131 | lens.append(len(ssa[0][0].split()[:-1]))
132 | i = j
133 | j += 1
134 |
135 | g.close()
136 |
137 | # print out the results and calculate the Rouge scores
138 |     os.system('sed -i "s/<unk>/UNK/g" ' + gen_selected_path_new)
139 | print('copy rate: %f' % (sum(cp_rate) / len(cp_rate)))
140 | print('compression rate: %f' % (sum(comp_rate) / len(comp_rate)))
141 | print('average summary length: %f' % (sum(lens) / len(lens)))
142 | os.system('files2rouge ' + gen_selected_path_new + ' ' + args.ref)
143 |
--------------------------------------------------------------------------------
/uss/utils.py:
--------------------------------------------------------------------------------
1 | import time
2 | import math
3 |
4 |
5 | def timeSince(start):
6 | now = time.time()
7 | s = now - start
8 | m = math.floor(s / 60)
9 | s -= m * 60
10 | h = math.floor(m / 60)
11 | m -= h * 60
12 | if h == 0:
13 | return '%dm %.3fs' % (m, s)
14 | else:
15 | return '%dh %dm %.3fs' % (h, m, s)
16 |
--------------------------------------------------------------------------------