├── README.md ├── aesop-result ├── .DS_Store ├── tab1-paranmt-h4.txt ├── tab1-qqppos-h4.txt ├── tab2-paranmt-h2.txt └── tab2-qqppos-h2.txt ├── data-processing.py ├── demo-input-data.txt ├── demo.py ├── downstream-dataset ├── .DS_Store ├── adversarial │ ├── rte.csv │ └── sst.csv ├── combined │ ├── rte.csv │ └── sst.csv └── original │ ├── rte.csv │ └── sst.csv ├── evaluation ├── candidate_selection.py ├── eval.py └── eval_utils.py ├── extract_sentence.py ├── finetune_trainer.py ├── helper ├── __init__.py ├── helper.py └── utils.py ├── requirement.txt ├── run_eval.py ├── ted2.py └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # AESOP 2 | 3 | This is the code base for **AESOP: Paraphrase Generation with Adaptive Syntactic Control** by [Jiao Sun](https://sunjiao123sun.github.io/), [Xuezhe Ma](https://xuezhemax.github.io/) and [Nanyun Peng](https://vnpeng.net/); this work was accepted to EMNLP 2021. 4 | 5 | Please consider citing our work if you find either our code or data useful. 6 | 7 | ``` 8 | @inproceedings{sun2021aesop, 9 | title = {AESOP: Paraphrase Generation with Adaptive Syntactic Control}, 10 | author = {Sun, Jiao and Ma, Xuezhe and Peng, Nanyun}, 11 | booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, 12 | year = {2021} 13 | } 14 | ``` 15 | 16 | The code base is actively maintained; please reach out to jiaosun@usc.edu or raise an issue here if you encounter any problems! We would like to thank the authors of [SGCP](https://arxiv.org/pdf/2005.08417.pdf) and the Hugging Face library. Part of the evaluation script is adapted from the [SGCP repository](https://github.com/malllabiisc/SGCP), and AESOP is implemented using [huggingface](https://github.com/huggingface/). 
17 | 18 | ## Dependencies 19 | 20 | ``` 21 | pip install -r requirement.txt 22 | ``` 23 | 24 | Please download all the required tools, data (our preprocessed data) and software (e.g., Stanford CoreNLP) from https://drive.google.com/file/d/1MP9k48BuBCdAPhWXjfuq7Cl9b9ZIdAPB/view?usp=sharing 25 | 26 | Unzipping this zip file should give you 27 | 28 | 1. `evaluation`: contains the scripts, software and dependencies necessary for evaluating model performance. 29 | 2. `pretrained-models`: contains the pretrained models needed to replicate our results; `h` means height: h2 means trimming the parse tree at height 2. Please see Figure 2 in our paper for an example 30 | 3. `raw-data`: contains data for two datasets, QQP-Pos and ParaNMT-50M (small); we use the same split as [SGCP](https://github.com/malllabiisc/SGCP) 31 | 4. `processed-data`: the raw data after pre-processing, ready to feed into the huggingface transformer 32 | 33 | These unzipped files should be placed directly under the AESOP main directory. 34 | 35 | The two settings introduced below differ only at inference time 36 | 37 | 38 | 39 | ## Preprocessing 40 | 41 | AESOP has two modes: 42 | 43 | - when target syntactic parses are available from crowd-sourced exemplar sentences, we extract the syntactic parses from the exemplar sentences and use them as target syntactic parses to guide the generation (set `use_template` to Y) 44 | - when exemplar sentences are not available, we use the retrieval-based selection strategy to adaptively determine a set of target syntactic parses (set `use_template` to N) 45 | 46 | ``` 47 | python data-processing.py --input_dir raw-data/QQPPos --output_dir processed-data/QQPPos-hf-refine --use_template Y 48 | 49 | python data-processing.py --input_dir raw-data/QQPPos --output_dir processed-data/QQPPos-hf-refine --use_template N 50 | ``` 51 | 52 | This will generate all the necessary files for both datasets; please see `processed-data` for what to expect, and move 
different directories to their proper locations if needed 53 | 54 | 55 | 56 | ## Table 1: target syntactic parse from exemplar sentences 57 | 58 | 1. **Load the pretrained model and run inference**: please fill in `[output_file_path_...]` based on your own development environment 59 | 60 | QQPPos 61 | 62 | ```shell 63 | python run_eval.py pretrained-models/qqppos-h4 processed-data/QQPPos-hf-refine/exemplar/level5/test.source [output_file_path_qqppos] 64 | ``` 65 | 66 | ParaNMT 67 | 68 | ```shell 69 | python run_eval.py pretrained-models/paranmt-h4 processed-data/ParaNMT50-hf-refine/exemplar/level5/test.source [output_file_path_paranmt] 70 | ``` 71 | 72 | 📝 this should give you two files whose lines have the form {target syntactic parse} \ {paraphrase} 73 | 74 | 2. We use a simple rule to **extract the generated paraphrases** 75 | 76 | ``` 77 | python extract_sentence.py --input_file [output_file_path_qqppos/paranmt] 78 | ``` 79 | 80 | 📝 this should give you two files containing {paraphrases}, and they should be identical to `aesop-result/tab1-paranmt-h4.txt` and `aesop-result/tab1-qqppos-h4.txt` 81 | 82 | 3. The last step is to **get the evaluation metrics** shown in Table 1 of our paper 83 | 84 | - QQPPos 85 | 86 | ```shell 87 | python -m evaluation.eval -r raw-data/QQPPos/test/ref.txt -t raw-data/QQPPos/test/tgt.txt -i aesop-result/tab1-qqppos-h4.txt 88 | ``` 89 | 90 | - ParaNMT 91 | 92 | ```shell 93 | python -m evaluation.eval -r raw-data/ParaNMT50m/test/ref.txt -t raw-data/ParaNMT50m/test/tgt.txt -i aesop-result/tab1-paranmt-h4.txt 94 | ``` 95 | 96 | 4. 
If you want to train these two models from scratch, please use the following commands 97 | 98 | - QQPPos 99 | 100 | ```shell 101 | python finetune_trainer.py --data_dir processed-data/QQPPos-hf-refine/exemplar/level5 --learning_rate 3e-5 --warmup_steps 500 --num_train_epochs 25 --output_dir [output_dir] --max_source_length 512 --max_target_length 128 --do_train --overwrite_output --model_name_or_path facebook/bart-base --gradient_accumulation_steps 32 --save_total_limit 2 102 | ``` 103 | 104 | - ParaNMT 105 | 106 | ```shell 107 | python finetune_trainer.py --data_dir processed-data/ParaNMT50-hf-refine/exemplar/level5 --learning_rate 3e-5 --warmup_steps 500 --num_train_epochs 25 --output_dir [output_dir] --max_source_length 512 --max_target_length 128 --do_train --overwrite_output --model_name_or_path facebook/bart-base --gradient_accumulation_steps 32 --save_total_limit 2 108 | ``` 109 | 110 | 111 | 112 | ## Table 2: adaptive syntactic parse selection 113 | 114 | 1. Replicate the results when the ground truth is not available -- AESOP generates multiple paraphrases by adaptively selecting target syntactic parses. `processed-data/QQPPos-hf-refine/diverse/level3.source` is the file that AESOP generated at the time we tested. When you run the preprocessing script, it might give you a different file because of the randomness introduced by the sampling strategy. 
Run inference with the pretrained h2 models: 115 | 116 | QQPPos 117 | 118 | ``` 119 | python run_eval.py pretrained-models/qqppos-h2 processed-data/QQPPos-hf-refine/diverse/level3.source diverse-qqppos.txt --fp16 120 | ``` 121 | 122 | ParaNMT 123 | 124 | ``` 125 | python run_eval.py pretrained-models/paranmt-h2 processed-data/ParaNMT50-hf-refine/diverse/level3.source diverse-paranmt.txt --fp16 126 | ``` 127 | 128 | 129 | 130 | In the following, we use QQPPos as an illustration; for ParaNMT, simply replace the QQPPos paths with the corresponding ParaNMT ones. 131 | 132 | 2. As before, extract the paraphrases from the model output 133 | 134 | ``` 135 | python extract_sentence.py --input_file diverse-qqppos.txt 136 | ``` 137 | 138 | 📝 this should give you a file called `diverse-qqppos_extract.txt` 139 | 140 | 3. This file contains 10 candidates per example. We choose one among them using `ROUGE` scores in our work; you can also use other available metrics 141 | 142 | ``` 143 | python candidate_selection.py -gen_dir ./ -scbart_generate diverse-qqppos_extract.txt -target processed-data/QQPPos-hf-refine/diverse/level3.target -output_file diverse-l3-select 144 | ``` 145 | 146 | 4. After the selection, the file should look exactly the same as `aesop-result/tab2-qqppos-h2.txt` 147 | 148 | 5. Then you can get all the metrics reported in Table 2 except TED@2 by running the evaluation script. Please note that TED@2 cannot be acquired from that testing script, because the script compares the selected paraphrases against the ground-truth paraphrases, whereas the selected paraphrases follow our retrieved target syntactic parses, and TED@2 is defined as the average tree edit distance between all target syntactic parses and the generated paraphrases. 
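The ROUGE-based selection in step 3 can be sketched as follows. This is a minimal illustration of the idea only: it uses a simplified unigram-F1 (ROUGE-1-style) score rather than the exact metric implemented in `candidate_selection.py`, and the function names are hypothetical.

```python
# Sketch of the candidate-selection idea: among the generated candidates for
# one example, keep the one that scores highest against the reference.
# Uses a simplified unigram F1 (ROUGE-1-style), not the official ROUGE package.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram F1 between two whitespace-tokenized strings."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def select_best(candidates: list[str], reference: str) -> str:
    """Pick the candidate with the highest score against the reference."""
    return max(candidates, key=lambda c: rouge1_f1(c, reference))
```

For example, `select_best(["the cat sat", "dogs bark loudly"], "a cat sat down")` returns `"the cat sat"`, since it shares two unigrams with the reference while the other candidate shares none.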
149 | 150 | ``` 151 | python ted2.py -i diverse-qqppos_extract.txt -s diverse-l3-select -t processed-data/QQPPos-hf-refine/diverse/level3.source 152 | ``` 153 | 154 | 155 | 156 | ## Use AESOP as a paraphrase tool in your project 157 | 158 | If you are looking for a paraphrasing tool to generate paraphrases with diverse syntactic parses based on your input sentence only, please give AESOP a try! 159 | 160 | ```shell 161 | # first parse the input sentences and generate the necessary file for running the model 162 | python demo.py --output_dir demo_output 163 | # run the generation model 164 | python run_eval.py pretrained-models/paranmt-h2 demo_output/level3_paranmt.source demo_output/level3_result.txt --fp16 165 | # extract sentences from the model output 166 | python extract_sentence.py --input demo_output/level3_result.txt 167 | ``` 168 | 169 | -------------------------------------------------------------------------------- /aesop-result/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PlusLabNLP/AESOP/0f376d1413c1ef605b7a008992e3a562c9020b99/aesop-result/.DS_Store -------------------------------------------------------------------------------- /aesop-result/tab1-paranmt-h4.txt: -------------------------------------------------------------------------------- 1 | yesterday was a dream for me . . . 2 | times share apartment failed . 3 | i 'm sure the white people know you 're not white , which is why they laugh at me . 4 | do n't you and alcohol have to be banned ? 5 | how do you get in the cab ? 6 | a large black dog sits beside him . 7 | we ca n't imagine where he might hide it underground . 8 | what are we gon na charge the boy with ? 9 | do you know about anyone who would want to hurt your husband ? 10 | going through nixon 's downfall is probably equal to breaking into the hotel . 11 | it 's the first time you 've had a studio . 12 | this is how thomas got his big hat ! 
13 | let 's get a helicopter . 14 | you got ready ? 15 | amazingly , they wo n't be given credit by a bank for poor black men in west adams . 16 | applications should be lodged within 24 00 on 22 may 2013 . 17 | i 'm sure the looks on the calf 's eyes scare the butchers . 18 | mr queen spent the last five years alone in the country , without civilisation . 19 | mrs honeychurch will have a wedding with mr vyse in january . 20 | we also believe that today 's explosion may have something to do with the shooting at the apartments . 21 | he was about five ten in height , and a well tailored jacket had failed to hide the slightly protruding shoulders . 22 | there are their extensive consultations in the last year . 23 | it 'll be different from the gold . 24 | can you stop all the jokes from the fifth part ? 25 | he 's a crazy scientist who 's gon na destroy this town . 26 | the fucker stole 50 worth of things today . 27 | but your joss is fantastically good . 28 | the car sells gasoline ? 29 | we did it , just to respond to his plan . 30 | you ever see jagged films ? 31 | it hurts your head , does n't it ? 32 | the added money ca n't be bought . 33 | every player must take care of the unlock of the lock . 34 | came to kiss him . 35 | he must have a range of different moves , attack , counterattack , and defense . 36 | remember , you 're going to miss a lot more than that . 37 | you do n't throw things away . . . you teach . 38 | meetings shall consist of a large number of preparatory activities that candidates must be prepared for . 39 | then win 30 shillings to go . 40 | they kidnapped my baby . 41 | are you enjoying the victory ? 42 | the boy was murdered by a terrorist at the time of his death . 43 | nights here are different than in town . 44 | this is an institution . 45 | he was speaking in hebrew , and the ten minutes passed without a word . 46 | i like those who do n't like fire ! 47 | when i saw him , he looked like an fbi raid on him at the time . 
48 | you 've never hurt my baby . 49 | in the event , it is required by the court to issue such declarations under oath if national law allows it . 50 | he 'll scare her off so she 'll be fine . 51 | he 's sitting down , watching those statements . 52 | looks like she wrote a book on driving . 53 | i got him out of jail yesterday . 54 | i 'm asking for five million . 55 | please choose between films , or learn pottery classes again . 56 | when the women settle down , they 're starting to serenade them . 57 | his paintings do n't have a gram of truth on them . 58 | you worked the prime minister for the first time ? 59 | bad guy is just when we do n't hunt monsters . 60 | since june 2008 , most of the company has been shut down . 61 | did n't i shock him ? 62 | may be affected by the difficulty of food and drink . 63 | i could fire both of us . 64 | you 're looking like a pretty prince in that dress . 65 | file checked 66 | 'is he dying ? 'asked panagyn . 67 | i was there ? 68 | do n't want to be friends with me . 69 | i was n't mad ? 70 | a war declaration can only mean an attack on the nation 's troops and civilians . 71 | grandpa , we 're going to run so fast that we wo n't even meet our quotas . 72 | you 're gon na miss me , ladies . 73 | the men say that they 're all born twice . 74 | and i preferred that captain muller or anyone else who did n't know about it . 75 | now they 're gon na kill my film . 76 | we have to have a contract . 77 | accumulation of these substances affects water quality and fish health . 78 | let 's have a few shots . 79 | sigursdon , there 's no going to sell . 80 | did grady agree to a negotiation ? 81 | budget revenue and spending balance must be equal . 82 | we want them to remember me as a monster . 83 | he was in the auction a few months ago . 84 | do it with the sword at your side , no demon will hurt you ! 85 | the life of the planet is 3 billion years old . 86 | we 'll show you where to sit . 87 | there 's a guy in english ! 
88 | it must be weird for you . 89 | the torture of erzebet 's servants was revealed by the torture . 90 | in my sleep , this has been burned by the blood of the worm . 91 | an elderly woman was accused of sexual abuse by a minor . 92 | about twenty kids were fighting with herman that night . 93 | usually a bacterial infection that causes a narrowing of arteries . 94 | why do you scare me ? 95 | let 's get something to buy ! 96 | we swear we wo n't fight . 97 | you ca n't be guilty at my watch ! 98 | transaction data are therefore generally regarded as information relating to the environment . 99 | a direct shot would have sent you to death right away . 100 | let 's get all the bags and the cattle . 101 | the image of the rotating earth appeared there . 102 | drop the gun . 103 | does he want to play football ? 104 | do you have the stone back ? 105 | i can tell you the whole story ! 106 | it is not so appropriate to examine the third ground of appeal first . 107 | do n't you think negotiation is important ? 108 | why did you follow him ? 109 | they 're risky . 110 | dwera was angry with mudfoot 's behaviour . 111 | i was caught in cardassia when they attacked the klingons . 112 | no one is allowed to take the law into their own hands and commit violence . 113 | it confuses human destinies . 114 | sometimes , their perfume could n't disguise the terrible smell . 115 | you 're going to drink ? 116 | there are so many things left for your life . 117 | how could everyone in town become a schizophrenic overnight ? 118 | he 'll take the bullet to offend the germans . 119 | there 's a border crossing everywhere . 120 | he was abused by me ! 121 | the commander keeps quiet , he 's still in trouble , mr . clark . 122 | these walls wo n't be pushed away . 123 | he 'll be proud of me . 124 | although it was a clever choice . 125 | it 's weird , but your mind can do it . 126 | according to ballistics , he was shot by a friend of betty 's gun . 
127 | he said he stole only bread from the bakery . 128 | thanks a lot for the queen . 129 | and he lives with this poor guy , huh ? 130 | i 'm sure frank wanted him to steal a state secret . 131 | he wanted to torture him and kill him . 132 | i lived in fifteen cities for six years . 133 | we 'll buy something . 134 | is this a home party ? 135 | anyone see anyone ? 136 | have you ever seen her panties ? 137 | see you in the morning . 138 | you can shoot before i shoot , and then we 'll count to ten . 139 | the cover ignited fire the hydrogen was in the center of the fireball . 140 | we ca n't afford an oil price . 141 | my friends are n't busy today . 142 | wild shots have caused dozens of dead to be shot . 143 | we did n't solve a lot of technical glitches . 144 | all you have to do is sit there and look like a mystery . 145 | you say it 's really raining in your movie . 146 | i 'm just gon na sit here and look mysterious . 147 | you 're saying he was n't strong enough . 148 | i have sentenced a traitor to an empire 's law . 149 | his security 's like this . 150 | they dragged the headless bodies into a decaying shed . 151 | did you notice he 's still carrying videos ? 152 | it 's a farm surrounded by woods . 153 | cothy was released from the war , was n't it ? 154 | shall we go back to the house ? ' 155 | give me a taiwanese gang 's territory ! 156 | nobody 's so good he ca n't be a salesman . 157 | nothing even seemed to his moustache . 158 | the girl put her slim hand on my shoulder and caressed me . 159 | it 's a nice couple , really nice boys . 160 | rocks clattered , burning like a hot iron in the palm of his hand . 161 | he 's a little dog who likes schumama . 162 | i 'm not begging anyone , not even you . 163 | big differences show in social issues . 164 | garuda screamed at him and ran . 165 | they did n't stop until everyone was crazy . i would n't come out of the bathroom . 166 | are those apartments real ? '' 167 | it was there ? 
168 | do n't give the last penny to the film . 169 | there 's a powerful healing power that comes out of love . 170 | you got the stones . you got your things . 171 | rejected applications to refund the remuneration received by other customers were rejected . 172 | the long process has been launched to help great lawyers 173 | you 're an idiot ! 174 | one day , our ship will be able to travel to another planet . 175 | there are no damages paid to the defendants . 176 | the disclosure would not be my permission . 177 | this will be back in 18 months . 178 | why do n't we have a fight in court ? 179 | the crime he committed was his attempted murder . 180 | it 's bad if a guy like tom van ends up like that . 181 | he seems to have made an error . 182 | let 's get mr . pontail a loaf of bread . 183 | no official record is available . 184 | let 's calm down . 185 | the food cost to about 2,000 standard . 186 | you called a prosecutor , a defense attorney or a judge ? 187 | i mean , all those wannabe knights are saying this . 188 | he 's never gon na stop being crazy before he gets out of the bathroom . 189 | garuda screamed at him , running to the gate . 190 | he 's got to go through some tests , which was decided . 191 | it would be if i did n't try , but it 's never been tried . 192 | beaten even worse ! 193 | still , you ca n't forgive me for doing this . 194 | we have to huddle together to keep warm . 195 | i hear you did n't say anything about fighting or crying . '' 196 | i wo n't solve anything by screaming . 197 | mountain 's coastline . . . is the shore . 198 | let 's get some ice and cups ready ! 199 | why would n't he be addicted to drugs or alcohol ? 200 | looks like there 's been a mistake ? 201 | are you taking any pills ? 202 | even if control of molecular levels failed , they are able to re crystallize dilithium without external help . 203 | did you call a prosecutor , a defense attorney , a judge ? 
204 | he must not play daily japanese so he can be healthy . 205 | they 're taking the virginity to those who have broken their promise . 206 | i think dolores has done exactly the same thing for years . 207 | we only have to wait 10 weeks . 208 | your visit made me tired . 209 | there 's a cartuche tutankhamen , someone called . 210 | there are elections for president in the conference room . 211 | have you ever taught anyone how to speak ? 212 | can guns shoot under water ? 213 | the damn girl has disappeared . 214 | anyway , you got ta go back to the bull faster ! 215 | which is not true . 216 | i did n't even look in grandma 's barn . 217 | what did we underestimate how difficult it was to make a cover for the book ? 218 | i thought the boy had died in an attack on the terrorists . 219 | and there was a woman working next door who was also a maid . 220 | we ca n't take a few shots . 221 | european commission stepped in to protect businesses from fraudulent marketing 222 | i 'm sure the north shaolin would have regarded his enemies as his patron . 223 | there are no ways to know if anyone was doing it deliberately . 224 | there 's only one view of changing your information . 225 | you ca n't talk about this right now . 226 | and they always smoke a cigarette for hours . 227 | i still said he was n't strong enough for them . 228 | although miranda had no idea at the moment . 229 | they raised me boys to grow up on my own . 230 | financial restructuring of businesses shall not be performed at the same time . 231 | she 's among my favorite girls . 232 | she 's licking her lips and her cheeks with beautiful roubles . 233 | the sport is important for preventing a large number of diseases . 234 | such rules must provide the maximum possible flexibility . 235 | there 's nothing to be afraid of . 236 | you do n't seem to understand me . '' the prisoner interrupted him . 237 | i hear he 's gon na cancel the ban . 
238 | as long as i do n't have a demonic force on your side , i wo n't hurt you . 239 | grandma , let dr . lo be on his own . 240 | his side is a large black wolf with a large dog . 241 | the beautiful , extraordinary loneliness of her eyes makes me more than ever before . 242 | care of the safety of customers is your duty , right ? 243 | i looked him in the eye and felt like he was testing my brain . 244 | at the beginning of the 1970 s the new personalities appeared . 245 | you 're not making me out of your sorry . 246 | throughout the meeting her words were calm and mild . 247 | she wanted the best for him to kiss her first . 248 | well , chances of sex will be reduced by wilson . 249 | she and her husband exchanged surprised glances . 250 | did i hear yet ? 251 | when a girl gets into trouble , she starts to suspect her ex boyfriend . 252 | grady 's got approval negotiation ? 253 | estimates of different stakeholders have differed . 254 | i 'm a tough guy and you 're not gon na love me . 255 | did i have a snack box ? 256 | security system demonstrating there are serious failings . 257 | ever heard him with doogie howser ? 258 | do you keep his stuff for 10 years ? 259 | couple of seconds are on their way . 260 | oh , now he 's stupid ? 261 | it 's quite sad that your girlfriend died . 262 | you 're a monster who likes pain . 263 | you 'll take a bullet for insulting the germans . 264 | the operation is going to be determined by a sniper rifle ! 265 | how can an individual produce nuclear weapons ? 266 | you 've done ? 267 | they hit his helmets and leg . 268 | a further four stand in the entrance to the village . 269 | , as such , a pension scheme similar to defined benefit pensions is offered to an even smaller part of the employer . 270 | therefore , she examined the amount in which the animal test was performed . 271 | make sure someone is looking behind your back . 272 | why do n't you just turn off the lock ? 
273 | there 's nothing simpler than the construction of the centre . 274 | it might work if i do n't make a couple of stupid mistakes . 275 | you 're kidding me , are n't you ? 276 | he smiled and showed a clear pleasure . 277 | arthralgia had a peculiar look in his face . 278 | is there anything difficult about making tea ? 279 | bean wondered if the soldier 's words came out of his mouth . 280 | you should n't wait for half an hour to kill her . kill her right away . 281 | i 'm in two steps to the bike . 282 | throughout their meetings , the sabbats were calm and seemingly undeterred . 283 | the boy and tyler went to lunch together . 284 | we have to pack the bags and the animals . 285 | we had to take risks . 286 | each is designed for a very efficient way of achieving certain objectives . 287 | the world 's most ferocious fish . 288 | people hang their laundry out of windows , kids play football . 289 | he killed an animal before the audience . 290 | did i tell you the other alex 's wife ? 291 | the answer is less obvious if we compare europe 's countries . 292 | we ca n't see any octopus over five feet . 293 | maybe they 'll never get home , tatiana ! 294 | so , i 'm talking about territory of a taiwanese gang . 295 | he calmed a wild animal . 296 | then i had a ride that exhilarated her , but then she fell asleep on the bed . 297 | has any objection raised to the withdrawal of gossanahu ? 298 | now , let 's get these chairs over here ! 299 | did you smell ? 300 | i 'm not gon na beg you ? 301 | 'i 'm going to die , please ? 'asked panagyn . 302 | and now my film has been destroyed . 303 | he stayed there for more than 10 years ? 304 | the memory allocation is missing . 305 | the flowers were appreciated . 306 | he wanted to kill no one . 307 | it has to be in my contract ! 308 | is this a visit or is it going to be long ? 309 | the fear makes me very worried . 310 | i should have said hatred . 311 | my baby 's been kidnapped ! 312 | hey , hey , come ? 
313 | let 's meet in the store tomorrow . 314 | we 're taking pictures . 315 | hey , our hospital ? 316 | let me see where you 're sitting . 317 | i 'm not worried about my son 's little attacks . 318 | consult your veterinarian when deciding how to dispose of non essential medicines . 319 | you were supposed to play 10 years in football . 320 | there 's only three of them , that 's good practice . 321 | unfortunately , it can be . 322 | tell him about his physical and psychological condition . 323 | is that a nose kiss ? 324 | i almost have . 325 | there are few technical glitches . 326 | your hand 's hot . 327 | he described a job like this one for a janitor . 328 | he 's told it 's close to the border with lu . 329 | why not ban alcohol on cigarettes ? 330 | did some weird phone call from my wife ? 331 | charlie karl . you 're going to kill us ! 332 | they killed family member ? 333 | kaelin 's friends turned an ugly hostile look . 334 | do you want to go to dinner ? 335 | imagine how happy and how surprised it was to me . 336 | he 's waving money like sheikh oman . 337 | how strange and funny it is . 338 | will be laughed and entertained . 339 | i 'll see you in 4 or 6 weeks . 340 | what the hell are you saying ? '' 341 | that 's what you 're supposed to be ! 342 | did you see linda bloom on that picnic ? 343 | would n't have an invitation . 344 | did i change there ? 345 | maybe those tests are why they kidnapped her . 346 | boss , i wanted to get out of this mess ! 347 | i forgot my theater is below the street . 348 | that 's really cool ! 349 | you were eating outside ? 350 | it seems like you 're crazy about your ideas . 351 | did n't it just seems weird ? 352 | you mean he 's being helped ? 353 | you did n't get lucky . 354 | i summoned the gryf , put him on his strong back and commanded him with a clear command . 355 | the grass grew on the cobbled wall . 356 | yes , i 'll have a cup of coffee too . 
357 | all commitments of the nature of this chapter should be considered in this revision . 358 | wo n't you give it a little aggressive ? 359 | the marriages were not approached intelligently . 360 | i think they upgraded you . 361 | you burned out because you farted . 362 | when i built it , they took care of the existing water and sewer networks . 363 | there were two of them carrying a gun . 364 | the list of damage related to the removal of the wreckage is also specified . 365 | it 's sad . sounds sad . 366 | i do n't know what good taste he has . 367 | your wife does n't appreciate it . 368 | why did n't the messenger bring you an envelope full of money ? 369 | you 're not interested in going ? 370 | i really do n't like being stuck in bed . 371 | no proper compensation would be there . 372 | i 'd kick that son of a bitch 's ass . 373 | a report on the disappearance of a woman from woodburne asked her husband . 374 | you wo n't be victorious in the retreat . 375 | i do n't think you acted very territorially . 376 | we do n't know how he died . 377 | it 's the break in at the hotel room that killed nixon . 378 | does it look like an autopsy report ? 379 | there 's a danger that nael will die . 380 | a further content of this site is blocked by the web browser for security reasons . 381 | the weight is a little higher than this . 382 | they used intermittent . 383 | they belonged to my mom . 384 | he works at a new york branch . 385 | crying at the same time . 386 | the one on the stable is useless . 387 | do you want to dance ? 388 | you 're seeing that rock ? 389 | it 's a really fun moment ! 390 | i was so disrespectful ! 391 | in the attack , he 's trapped . 392 | if they had attacked me , they would have killed me . 393 | i 'm calling for applications for an administrative post at eurojust . 394 | i swear i 'll always dream of you before me . 395 | she 's a scout who watches her grandma as she kills a bull . 396 | i 'm afraid they 'll catch you . 
397 | i mix him with water and put him in the pump . 398 | is it a difference now ? 399 | all i want is an open space and we . 400 | i do n't really know . 401 | would you please excuse me , sir ? ' 402 | stop your pants . 403 | the oyster in the water is crazy . 404 | when i keep going , the darkness will keep me from going to deeper darkness . 405 | then kalizkan did n't know a single magician . 406 | 'which one is he ? 'asked jack . 407 | there was no air . 408 | the blender are great . 409 | it 's about the box ! 410 | apparently he has no hungry appetite . 411 | we do n't know if he killed kwak . 412 | the party will certainly be enjoyed . 413 | they do n't have a leader . 414 | this is a day . 415 | do n't be an idiot ! 416 | inspector , gangs on the street are n't real . 417 | he looked at the tall fist and winced . 418 | do you ever want to be upset about something like that ? 419 | the eyes twitched slightly . 420 | i do n't swallow pills . 421 | do n't wear your gunpowder in the dress . 422 | breaking the record for mayan is not suicidal . 423 | you know , the thing is , he does n't love me anymore . 424 | you ca n't drink . 425 | he chose them ! ' 426 | did you watch out ? 427 | that 's your fate ! '' 428 | the radios are terrible . 429 | how do you hurt ? 430 | jackie writes for gina . 431 | a switchboard at work stations will be installed in an easily accessible position close to the door . 432 | i 'm not interested in christmas . 433 | i wish i could n't get some tips from you . 434 | even though she had a strange smile sometimes . 435 | you said i should find a wolf where he 's gon na find his prey . 436 | and the pathologists were kang . 437 | sounds like someone 's killing someone . 438 | the installed system was not completed with the internet connection . 439 | we ca n't save anyone in this place . 440 | she could n't take a bite from vanessa . 441 | how badly did you hit that guard ? 
442 | he is facing serious allegations of theft and he is remorseful . 443 | there are books in english as well as chemistry . 444 | that 's a rope at the bottom of the steep slope . 445 | in french , did you invite a foreign soldier ? 446 | in any of the solar systems , there had never been such a huge wild animal . 447 | this is going to kill them all . 448 | it is also possible to broadcast digital media on the network . 449 | i 'm going to see you this afternoon . 450 | anyone want to trace strahm on his mobile phone ? 451 | you did n't think i was going to make breakfast ? 452 | i guess you 're going to go to edu tomorrow . 453 | it makes a sense of guilt and fear that trisha feels guilty about both . 454 | the 5,000 grand will be my 50 . 455 | do n't think about sex at the moment . 456 | your gender does n't play a part in this competition . 457 | have you heard the emperor 's orders ? 458 | he 's going to lick ? 459 | what kind of greed is it ? 460 | we wo n't tell people . 461 | i 've never seen a game before , and nothing like that . 462 | to be defended by who ? 463 | it 's everybody 's scream . 464 | imperialist countries can not only use such a mindset as justification for economic greed . 465 | everyone should be focused on ensuring the safety of the secretary of state . 466 | there 's nothing wrong with him , seeing him as a solution . 467 | nothing laughed here , for us . 468 | want to be a friend ? 469 | moron , will you find the whitefish ? 470 | a dose of thorazine 300 milligrams is simply impossible . 471 | i 'll kill sabata . 472 | everyone here knows about it . 473 | you 're not afraid of death . 474 | there 's a strange feeling here . 475 | they do n't sell . 476 | the son of a bitch came back . 477 | tough girl is hard on you . 478 | you give it to lucky seven . 479 | the decision is burton 's responsibility to wake up the others . 480 | the coffee was creamy . 481 | he was good too . 482 | he 's never seen a better film in my life . 
483 | it 's just that he sent a real killer . 484 | and if you 're afraid , i 'm angry . 485 | i 've seen you treat my deputies with the same courtesy . 486 | but at the center of this loss , i 'm still calm . 487 | baseball league is dominated by me . 488 | is there a forecast for the sale ? 489 | you wo n't starve . 490 | is it also mrs nehru 's involvement , too ? 491 | there 's another judicial error ? 492 | are you saying everything is mine ? 493 | do n't be an asshole . 494 | dad smokes smoke in his pregnancy . 495 | you 're the one who 's lucky . 496 | we 're not stopping here . 497 | to put an innocent man in danger ? 498 | that 's madness . 499 | why not write a motive for him ? 500 | i did n't have to worry about anything else in malacandra except for oyarsa . 501 | well , be sure it 's not too late . 502 | the pains were already born of anger . 503 | there 's barrenger molesting a little girl ! 504 | bring the girl in nine in the evening . 505 | we were at war when i was young . 506 | i was so smart . 507 | by removing the mask , the mask was removed . 508 | of course i would have taken merlin . 509 | did all friends want to sleep with a virgin ? 510 | i 'm going to play a very strong , high energy kind of rock n'roll . 511 | your mom is gon na love this ! 512 | we 're evacuating the entire area . 513 | i 'm not listening ! 514 | i was so weird ! 515 | the machine does n't exist . 516 | i do n't think they listen . 517 | and in fact , i started with my ass that made the cement . 518 | there 's a joke on the internet . do you know a doctor and a bunch of students ? 519 | he leaves his mother alone . 520 | there was an amazing feeling . 521 | there is a technical discussion about how to assess the results of the tests . 522 | your devices were familiar . 523 | warm periods were not in particular . 524 | i do n't think bears have a rap ramp . 525 | everything is yours . 526 | we might as well keep this awkward silence . 527 | you 're gon na have it hard . 
528 | i hate your evening visits . 529 | i 'd rather wait than half the way , and kill her in half an hour . 530 | all the ladies in the world today talk like tramps . 531 | but in these terrible circumstances , everyone is always proud . 532 | i 'm sure there 's some kind of magic explanation . damon does n't drink hydrochloric acid . 533 | i 'll take care of the violence . 534 | he wanted to know the exact number . 535 | why do n't we go to lunch again ? 536 | is he going to play dave tonight ? 537 | luck is on your side . 538 | but i have to admit i had the pleasure to work with you both . 539 | you hoped it would n't drain ? 540 | the increase in the number of deaths should be expected otherwise . 541 | in the bank , it 's too risky . 542 | he 's sending an apology , detective . 543 | and the family will have a lot of problems ? 544 | is there anything lurking this way ? 545 | she created a mysterious figure , who created a secret system for sending people . 546 | yes , if you 're lucky ! 547 | the whole london is surrounded by police and all residents are trapped here . 548 | no one is doing this so fast ! 549 | that the electorate likes brave actresses . 550 | as the commission 's main priority is to integrate environmental issues into policy dialogue . 551 | neither of your students did their homework . 552 | you 're stupid . 553 | i 'm sure i wo n't murder my best friend . 554 | the architect , the real one . 555 | he really looks out for me . 556 | car has no motor . 557 | i 'd hurt his feelings for him . 558 | our share , now 50 . 559 | you have no idea ! 560 | i will be saved by the planet of xenon . 561 | you want to see him today ? 562 | there will be no third world country ! 563 | it 's like you 're dying so fast . 564 | i 'll listen to your scream . 565 | good thing , that 's short enough for two weeks . 566 | did he dance ? 567 | are you kidding me , dad ? 568 | we do n't forget a day like this . 569 | should have taken the plane tonight . 
570 | were you kidding me ? 571 | it changes in his mood . 572 | thunder , i 've got three goals . 573 | could they have done this ? 574 | time to go to sleep ! 575 | there 's my friend . 576 | is everything all right ? 577 | but helena was n't worried about that . 578 | can i see her opening the door and not leaving any prints ? 579 | as you command , majesty ! 580 | she had a predictable complication . 581 | well , i 'll talk about it . 582 | the gym 's here . 583 | as esk bit her lip , she imagined the shame on her face . 584 | i 'm going to cast a perpetual shadow on south park . 585 | we assume we do n't have a chance . 586 | the jasmine is known . it 's only possible to pick her up before dawn . 587 | i 've been thinking about writing my memoirs . 588 | it 's going to be just another vague rumor , like an alien in hangar 18 . 589 | yesterday i stood beside vorbis , regardless of where or when he wanted to do it . 590 | that came as a surprise to stilgar . 591 | that was a really weird day . 592 | carol , if i do n't find mrs santa claus , i ca n't stay here . 593 | what 's going on with scooby doo ? 594 | this sunday was the big saturday mainboard . 595 | the envoy would not need the commander any further . 596 | you 'd rather die . 597 | what kind of monster is this ? 598 | the trapper was dead . 599 | i can find this unit in all processors , 'cause they 're math . 600 | physical injuries heal , but it 's never easy with emotional scars . 601 | our topic will be mental health . 602 | he 'll be on his way back tonight . 603 | i was so beautiful . 604 | many people doubted deeply about this investment . 605 | i 'll forget about that . 606 | there was a helicopter . 607 | all those flavors are complementary . 608 | the adjunct stepped back and wiped the dust from his gloves . 609 | we better not have taken any chances , '' he said . 610 | just relax yourself ! 611 | do n't worry about it . i 'm sure you wo n't find your little dick . 
612 | it wo n't be possible ! 613 | did you bring a dollar bill ? 614 | the property failed for s 615 | does anyone see anything ? 616 | i 've got a lot of work to do tomorrow ! ' 617 | that 's so cute , right ? 618 | i 'm gon na ignore what you 're saying . 619 | there are muslims killed . 620 | he has to be operated , or he 's going to die ! 621 | the turk was caught in an ambush ? 622 | do you have a descriptive description ? 623 | are you kidding me ? 624 | you 're gon na have it hard . 625 | by chance , did you talk to your brother 's death ? 626 | only by throwing money away he could gain strength . 627 | would we build a house in that mountain ? 628 | before you started recording , it was necessary to save the presentation before starting the broadcast . 629 | i just want to talk to my dad ! 630 | the capacity of the office needs to be strengthened . 631 | the wording of the decision does not specifically allow for an extension of the deadline . 632 | very handsome young man . 633 | he spent the last few minutes seething with hatred and anger at all of humanity . 634 | they run from fear of the visions of st clairvoyant . 635 | we 'll luck it does n't hurt anyone . 636 | we 're making a big army effort . 637 | his name is elena waiting for you . 638 | the ambush was ours ! 639 | we 're planning a movie . 640 | i 've been having a similar problem for the last 10 minutes . 641 | i prefer bay 's theory of two or three stages of life . 642 | i thought you were helping us . 643 | although technology has improved . 644 | it 's incredible he 's gon na get married in front of me . 645 | anyone who bites the dog is executed . 646 | our requirements are met . 647 | brock began to move , but he said nothing out of his mouth . 648 | i 'm not gon na tell your wife she 's a whore . 649 | what i do n't know is a ship capable of resisting the impact of its underwater monitor . 650 | if you 're lucky , you 'll prove yourself right . 
651 | i 'm making noises from terrible shrieks and terrible outbursts . 652 | our reporter woke up , did n't he ? 653 | maybe i 'll find a solution in negotiation . 654 | you 're annoying . 655 | that was not a joke . 656 | there are still no cd in the stolen cars . 657 | how awful it was for me . 658 | did i see the jaw movie ? 659 | he stood up and raised his sword . 660 | i 'll tell you tomorrow at the banquet . 661 | there are wonderful things about watching this film . 662 | it did n't surprise teresa to think of something like this . 663 | the coming is a great pleasure to me . 664 | i still have 57 seconds , sir . 665 | none of them had said anything when his friend was staring at him and he certainly has n't said anything . 666 | he 's searching for his secret federal scientists constantly . 667 | that 's what we have alcohol for . 668 | i can take care of your son , sofia . 669 | let 's go get some coffee . 670 | but the fun fool who makes me rich . 671 | we 're not right about anything . 672 | that 's so exciting ! 673 | i 'm not worried about that . 674 | the ceremony promises to be beautiful . 675 | several heads flew from the door and around the corner . 676 | the printer had to be calibrated . 677 | we 're just like family . 678 | i think it 's anyone . 679 | your crutch and your servants are pleasing me . 680 | i wrote dylan , what they lynched after the circus . 681 | sir , it 's not private . 682 | do you want to have a towel on your butt ? 683 | we 'd like to travel . 684 | there 's a member of s . r . p . who 's here tomorrow . 685 | i 'm cleaning up your car for 100 . 686 | without missing , she lacked the perfection 687 | secured both sides of the door with heavy locks . 688 | if you 're with us , i 'll give you the best chance to find them before they leave town . 689 | they 're loading the trucks with radioactive material . 690 | forster led the development of military non lethal weapons such as mclennen forster . 
691 | it 's not exactly 692 | this guy does n't die ! 693 | there will be a greater emphasis on geographical indications and differences in manufacturing procedures than in the past . 694 | i was so sweet . 695 | are you gon na hurt him ? 696 | my father wants me to take care of him . 697 | you two ca n't talk ! 698 | i feel like an idiot blonde . 699 | you had a dress ! 700 | there were probably ten against the whole unit . 701 | it 's hydarn 's fault . 702 | the great moroccan civilization building a permanent monument to its achievements . 703 | there was a dream ! 704 | i do n't think anyone 's gon na catch the doorman 's nap . 705 | with my father , it should n't have been surprising . 706 | we waited 20 minutes and waited for the sun to come out when we got there . 707 | do n't give him any more medication . 708 | i 'll dig a hole for him . 709 | it 's sometimes a bad thing to do . 710 | his name is tom howard . 711 | they ca n't let me do this . 712 | you do n't make a mistake like garber . 713 | you 're giving yourself a unique job . 714 | fifty years ago , they tried to destroy our moon . 715 | the thing that 's gon na make you feel good is your moisturiser , right ? 716 | i 'm sure all these injuries are related to weightlifting . 717 | come and wait for another few minutes . 718 | there 's nothing to be afraid of . 719 | you 've spoken a little like god . 720 | he ca n't be six months old for me . 721 | you 're really going to change your life by reading this book . 722 | do you want to talk to mr eagles ? 723 | it 's four marshmallows to eat ! 724 | it 's like riding a bike . 725 | what a shame we 're wasting it on these unscrupulous bastards . 726 | well , you did n't see her ? 727 | they 're pretty scared of her , are n't they , doctor ? 728 | unlike some of europe 's steeds , he must rest and do n't cuddle . 729 | the rocket has been developed . 730 | we just wanted to find out if there 's enough room . 
731 | bondi is watched by the fbl 732 | let 's forget how ordinary things are done . 733 | did you get your bandolier ? 734 | you wo n't stay overnight ? 735 | we do n't know what 's nearby ? 736 | when is the mother coming for us ? 737 | it 's european wine , and it has colors similar to blood . 738 | when the mayor fails to implement this policy , they will give him responsibility . 739 | let 's welcome to the biggest imagination . 740 | we take action in the event of a marketing fraud by the european commission 741 | do n't you have to dress ? 742 | when i get him , it will be my fault . 743 | vodka would n't be a good match for whiskey . 744 | in addition , simplification proposals are currently under way 745 | who 's neeble ? 746 | in this context , three different situations are distinguished 747 | what procedures were followed for accelerated access to grid infrastructure ? 748 | it 's like someone else did your dirty job , that loser did n't make the right choice . 749 | he came to him to feed the thirst of his brain to drink out of self respect and alcohol . 750 | hey , do n't waste your time . 751 | there were a few majestic trees , and white winged birds circled around them . 752 | the heart is broken too . 753 | buddhists reportedly believe to reincarnate their souls on 49th day . 754 | bills are as fast and anonymous as possible . 755 | i just found two women in the toilet when they found her . 756 | we 'll go to the hotel tomorrow . 757 | dude , there must be some weird stuff going on . 758 | will he lend me out ? 759 | why would he give her a grave , even though i would n't want to marry her for a nobleman ? 760 | you ca n't marry william whele ! 761 | to launch positions . 762 | there 's no one hiding him , no one feeding him . 763 | that 's the point . . . i 'm just playing . 764 | would you like to tell roger ? 765 | they 've told me a classic friend who would want me to be . 766 | i kind of stole our car . 767 | i 'll take the next elevator . 
768 | i was talking right away . 769 | there was a mistake . 770 | an average reward for a slow and quick career also does not match the reward for an average career . 771 | yet the anger flared again , as if the bubble had burst . 772 | a self deprecating cockroach should avoid such a place . 773 | are you thinking about an unexpected wedding ? 774 | he said if humanity did n't have a future , he would die . 775 | you have a contract with dola girosa . 776 | that 's really funny . 777 | her sister then had a bad idea that she would n't have a nice coat . 778 | did it seem like hard ? 779 | what 's the devil 's taste like ? 780 | you know it 's a crime . 781 | it looks like his moustache is drooping . 782 | i 'm just gon na pull the trigger on my butt . 783 | that 's so beautiful . 784 | this improvement will lead to a reduction in uncertainty about the possibility of assistance in the event of a disaster . 785 | i can not be moved to this workstation . 786 | they wo n't abuse . 787 | they 'll probably arrest you in west hollywood if you do n't cut your dog . 788 | the visa has been flagged as a asian . 789 | for a moment of horror , there was a hole in the corner and watching . 790 | do you know you 're dead ? 791 | you do n't plan on staying long , do you ? 792 | what kind of technology will it be ? 793 | we 're trapped . 794 | korea reacted quickly and decisively . 795 | a powerful energy called healing comes out of love . 796 | the vampire one could kill himself . 797 | pregnancies before the scheduled pregnancy should be replaced by suitable alternative treatments . 798 | she went to the river you were talking about , jumped , and died ! 799 | the fingers may have been broken by several blows . 800 | so , you 're interested in selling the remaining companies ? 
801 | -------------------------------------------------------------------------------- /aesop-result/tab2-paranmt-h2.txt: -------------------------------------------------------------------------------- 1 | did i dream about you last night ? '' 2 | 'there are no apartments in time ' 3 | do white people know you 're not white ? '' 4 | alcohol and cigarettes are banned . 5 | he got the cab , how 'd it go ? 6 | crouched a large , black , wolf like dog . '' 7 | it 's hard to imagine what he 's hiding in the underground . 8 | have we found something to charge the boy ? 9 | do you know anyone who would want to hurt your husband ? '' 10 | that 's what nixon went through , like breaking into the hotel . 11 | the studio 's your first time ! 12 | i 'm wondering how thomas got that big hat . 13 | we need a helicopter . 14 | you 're not gon na dress ? 15 | the odd thing is that banks do n't lend loans to poor black people in west adams . 16 | 'should be lodged by 22 may 2013 at midnight 17 | at some butchers , the sight of the calf is frightening . 18 | mr queen has spent the last 5 years alone and out of civilization . 19 | mrs honeychurch and mr vyse will have a wedding in january '' 20 | police suspect there was a connection between the explosion and today 's shooting at the apartments . 21 | about five and a half feet tall , his well tailored suit only revealed slightly rounded shoulders . 22 | extensive consultations held over the past year '' 23 | with the gold everything changes . 24 | you 're going to stop all the jokes that are going on in section 5 . 25 | he 's going to play a crazy scientist who 's gon na destroy the whole city 26 | that son of a bitch stole 50 bucks . 27 | but a really great joss . 28 | like selling gasoline in a toy shop ? 29 | we 're going to do the opposite of his plan . 30 | have you seen jaws '' ? 31 | is n't that a headache ? 32 | there 's more than money . '' 33 | whoever finds the key must unlock the lock . 34 | did you want a kiss ? 
'' 35 | swordsman 's skills require a range of moves , a variation in attack , a counterattack or a defense . 36 | but remember , you 're going to be five feet taller . 37 | i 'm teaching instead of throwing things away . 38 | in general , meetings include a large number of preparatory activities that candidates need to take into account . 39 | he 'll win 30 shillings . 40 | my baby was kidnapped . 41 | you like the win ? 42 | i think his son died in an assassination attempt . 43 | on a different night than in town ! 44 | see '' is that a military hospital ? 45 | ten minutes he spoke in hebrew . 46 | do you hate the fire ? '' 47 | but he seemed to be raided by the fbl when i saw him . 48 | i did n't hurt your baby 49 | if national law allows , the court is required to issue such declarations under oath . 50 | then he 'll scare her off and leave her alone . 51 | and i 'm sitting on my ass and looking at the statement . 52 | as if she were writing a book about driving . 53 | yesterday they released him from jail . 54 | we 're asking for five million . 55 | you can choose between films and art courses . 56 | once the females are settled , they start singing . 57 | there was n't even one gram of truth in the painting . 58 | for the first time , you work for the prime minister ? 59 | we 're just going to chase monsters . 60 | closed most of its business in early june 2008 . 61 | the shock ? 62 | these effects may be difficult to eat and drink . 63 | 'we could both be fired ' 64 | dressed like that means you 're more beautiful than the prince . '' 65 | file failed to verify 66 | they die ? '' panagyn asked . 67 | you 're not an astronaut ? 68 | did i want to be your friend ? '' 69 | you 're not mad at me ? 70 | declaring war and attacking the nation 's troops is not necessarily the same thing . '' 71 | i 'm too fast for our quotas ! 72 | you ca n't miss me , ladies . 73 | they 're all born twice . 
74 | 'and captain muller or anyone who does n't know about it is preferred ' 75 | and now they 're killing my film . 76 | it 's in the contract . 77 | in the event of accumulation , water quality and fish health will be adversely affected . 78 | we both have shots ! 79 | in mr . sigursdon , sales are n't allowed . 80 | grady and the negotiation ? 81 | in maintaining a balance between income from the budget and expenditure . 82 | i wanted them to remember me as not just a monster . '' 83 | that 's what she bought in an auction a few months ago 84 | as long as you have the sword on your hip , no one 's gon na hurt you . '' 85 | 3 billion years of human life . 86 | i 'll show you in the seats 87 | there 's someone speaking english . '' 88 | is n't that something weird ? 89 | the torture was torture , which made the servants of erzebet confess . 90 | worm blood is still burning in my sleep . 91 | there was a woman who had been charged with sexually abusing a minor . 92 | i had 20 kids fighting herman that night after the game . 93 | common cause of bacterial infections narrowing arteries '' 94 | for me , the scare ! 95 | we have to buy something ! 96 | 'we wo n't fight , i swear . ' 97 | i ca n't sentence you to death on my watch . 98 | so people in principle consider data relating to transactions to be environmental information . 99 | a little closer and you 'd be dead . 100 | i have to get all the bags and the animals . 101 | then came the image of the rotating earth . 102 | 'put the gun down ! ' 103 | you want to play football ? 104 | have you returned the stones ? 105 | did you say everything ? '' 106 | the third ground of appeal must therefore be addressed first . 107 | it means negotiation . 108 | and you pursued him ? 109 | for him the risk is too big . 110 | what mudfoot was doing irritated dwera . 111 | trapped in cardassia when klingon attacked . . . 112 | taking the law into your own hands and committing acts of violence . . . 
113 | people , their destinies are subordinated to him . 114 | at some point in the night , their perfume was nearly enough to disguise the terrible smell 115 | want something to drink ? 116 | there 's a lot to live for 117 | all the people in town are going to be schizophrenic overnight 118 | if he 's offended by the germans , he 'll shoot . 119 | but the line is everywhere . 120 | he '' has n't abused me ! 121 | despite all the difficulties , commander clark remains calm . 122 | do n't let me get you out of the walls . '' 123 | i 'll be proud of him 124 | but a smart decision ! 125 | he 's weird , but he can do it with his mind . 126 | our friend 's gun was ballistics , according to a bullet that killed betty . 127 | as far as i know , he only stole one loaf of bread . . . 128 | 'congratulations on the queen 's behalf ! ' 129 | and now he lives with that poor guy ? 130 | frank wanted him to steal a state secret , for one thing . 131 | torture and murder . 132 | we lived in 15 different cities for six years . 133 | for something to buy . 134 | not home ? '' 135 | there 's someone here ? 136 | in her pants . 137 | in the morning . 138 | before i fire , i 'll count ten '' 139 | with a burst of flame , the hydrogen inside exploded in the fireball . 140 | i ca n't give them an oil price . 141 | my friend 's still busy . 142 | wild shots and bombs killed dozens of members of the gang . 143 | too many technical glitches we did n't solve . '' 144 | just sitting here looking mysterious . 145 | i heard it rains in your movie . 146 | sit down and look mysterious . 147 | not strong enough for me ! '' 148 | traitor , for whom the empire 's laws will be punished . 149 | full alert , so does the security . 150 | lucian dragged the dead headless body into a decaying shack . 151 | he 's still carrying videos . . . '' 152 | his farm and forest near your house are in the same neighborhood . 153 | has n't the war released cothia ? '' 154 | we 'll go home , okay ? 
155 | give me a piece of taiwanese gangland . '' 156 | no one can be too good for a salesman 157 | for a moment he seemed to have a hanging moustache . 158 | she put her slim hand on my shoulder and caressed me '' 159 | sounds like they 're really nice guys . 160 | the stones struck them hard and burned as a hot iron . 161 | i have a little dog who likes schumama 162 | no one is forcing me to beg . 163 | social issues show big differences . 164 | he ran to the gate and garuda screamed again . 165 | if they do n't stop being crazy , i 'm not going to the bathroom ! 166 | you know who owns those apartments ? 167 | does n't he cry ? 168 | not for the last penny of the film . 169 | powerful healing energy comes from love . 170 | take the rocks back and take your things back 171 | they rejected the applications submitted by various customers for a refund of turnover tax . 172 | jen started a lengthy judicial procedure , for helping brilliant lawyers . 173 | you son of a bitch . 174 | one day our ship will be able to fly on another planet . 175 | we wo n't compensate the accused for their damage . 176 | without my permission to disclose the military . 177 | six months later they will recover . '' 178 | let 's argue in court . '' 179 | guilty of attempted second degree murder '' 180 | it 's bad enough if a guy like tom van ends up like this . 181 | agent spikings seems to have made some mistakes . 182 | for mr . pontail 's bread ! 183 | he 's cleared of all official records ! 184 | calm down . '' 185 | i think the food costs about two thousand standards . '' 186 | a prosecutor , a defense attorney or a judge ? 187 | i only know what every knight says , 188 | before they all go crazy , i 'm not going to the bathroom . 189 | garuda screamed again and ran to the door . 190 | there must be tests that have been decided . '' 191 | not because i did n't try . 192 | you know , he hit me twice . 193 | it is difficult for me to forgive myself for this . 
194 | let 's get our bodies warmed up . '' 195 | you said there was no screaming or crying . 196 | screaming wo n't solve anything . 197 | 'there 's a coastline behind that mountain ! ' 198 | we have to put ice in the cups ! 199 | did i fail drugs or alcohol ? 200 | like someone made a mistake ? 201 | you took your medication . 202 | even if the molecular level controls failed , dilithium can be re crystallized without external help . 203 | a prosecutor , a defense attorney , a judge ? 204 | beating the japanese daily must be healthy for him . '' 205 | take the virginity and take the ones who broke your promise . 206 | i think dolores did exactly what she had done for years . . . 207 | 'only ten weeks and we 'll be together ' 208 | i 've had enough of your late visits ! 209 | someone named cartuche tutankhamen . 210 | holding a presidential election in a conference room 211 | i did n't teach you how to speak 212 | weapons that fire underwater . 213 | hell , there 's no trace of the damn girl ! 214 | you have to go back to the bull . 215 | and that 's not true ? 216 | did you check grandma 's stables ? 217 | we underestimated how hard it was to pose for the cover of the book . 218 | the boy is believed to have died as a result of the terrorist attacks . 219 | and the woman who worked at the house next door also killed herself . . . 220 | i 'm gon na grab some shots . 221 | i support the protection of businesses by the european commission in the fight against fraud 222 | i think the north shaolin sees its enemies as its patron . '' 223 | in any case , the way to know if anyone was doing this deliberately ? 224 | all you see is how your information looks . '' 225 | ca n't you even talk about this ? '' 226 | then i always smoke for an hour . . . 227 | according to sharon , he was n't strong enough . 228 | miranda had no idea at the moment . 229 | i 've grown up on my own . 
230 | structuring financial restructuring and restructuring of businesses is usually carried out simultaneously . 231 | girls in a ponytail '' like me . 232 | that beauty is drawing you on your lips and on your cheeks ! 233 | 'more importantly , sport can prevent many diseases . 234 | take the maximum possible flexibility . '' 235 | be afraid of something . '' 236 | i do n't understand what you 're talking about . '' 237 | according to him , prohibition will be lifted . 238 | when you have the sword at your side , no demonic force will hurt you . 239 | grandma , let dr . 240 | next to him , a large , wolfish dog sat . 241 | from her beauty , a strange loneliness in her eyes that no one had ever seen before . 242 | so you 're responsible for customer safety , right ? 243 | he looked me in the eye , as if he were examining my brain . 244 | there were new artists in the early 1970 s . 245 | your apology 's not helping me . 246 | francesca sabatini kept quiet and calm . '' 247 | she wanted him to kiss her for the first time . 248 | well , wilson will significantly reduce my chances of having sex . . . 249 | she looked at langdon in surprise . 250 | you ever heard of him ? 251 | an ex boyfriend of mine everyone suspects when the girl is in trouble . 252 | grady agreed to a negotiation ? 253 | estimates of different stakeholders differ considerably . 254 | maybe you 're not gon na like me because i 'm tough . 255 | so they took the snacks ? 256 | the evidence shows , there are serious shortcomings in the security system . 257 | who 's doogie howser , you know ? 258 | he 's been holding his stuff for 10 years ? 259 | there were enemies in a few seconds . 260 | am i such a fool ? 261 | a little sad loss of your friend . 262 | you 're crazy , you like pain ? 263 | 'insulting the germans means taking the bullet ' 264 | operation success depends on a sniper 's gun . 265 | does n't anyone have nuclear weapons ? '' 266 | miracles never end , right ? 
267 | flecks on his helmet and his feet . 268 | next to the entrance to the village there are four more . 269 | thus , something similar to defined benefit pensions is offered only to a small number of employers . 270 | so she examined the amount of doses given to animals in animal testing . 271 | you have to make sure someone is watching you . 272 | is n't that a lock ? 273 | building the center is the easiest thing for me . 274 | i think if i did n't make a mistake , it might work . 275 | did you just joke ? '' 276 | he smiled , obviously thrilled . 277 | with stiff arches and stiff cheekbones he seems to have a strange face . 278 | how hard can it be with tea ? 279 | bean wondered if the soldier 's words came out of his mouth . 280 | instead , instead of waiting for half an hour to kill him . 281 | take two steps to the bike . 282 | at this point , francesca sabatini was calm and seemingly calm . 283 | tyler was going to have a quick lunch . 284 | i 'll get the bags and the animals . 285 | it 's worth the risk . 286 | they are all designed very efficiently to achieve certain objectives . 287 | hunted by the biggest fish on the planet 288 | the laundry hangs out of the window and the kids play football . 289 | before his audience he killed the animals ! 290 | the wife of another alex . 291 | however , if we compare europe 's countries to each other , the answer may be less clear . 292 | i see a couple of octopus that have more than five feet . . . 293 | well , maybe they 're not home . 294 | then , this is the territory of taiwanese gang . 295 | he calmed those wild animals down . 296 | she was happy with the ride , even though she was then in bed and exhausted . 297 | general , objection to the withdrawal of gossanahu ? 298 | you better get these chairs over here . 299 | smells burned . '' 300 | i beg you not to want this . 301 | they 're gon na die ? '' panagyn asked . 302 | kill my film . '' 303 | over ten years have they kept his stuff ? 
'' 304 | memory allocation failed , according to the corresponding image buffer . 305 | we would be grateful . 306 | no one deserves death . . . ! 307 | here 's to my contract ! 308 | just a visit or will it be a long time ? 309 | i 'm really worried about it ! 310 | hatred '' on the other hand . 311 | my son was kidnapped . 312 | is that someone ? 313 | we could meet in the store tomorrow . 314 | come take a picture . 315 | it 's a veterans hospital . 316 | i 'll show you where we 're sitting . 317 | my son 's little attacks are not my concern . 318 | to dispose of unnecessary medicines , you need to consult your veterinarian . 319 | ten years in football '' 320 | good enough for three people . 321 | of course , no way . 322 | if noga is healthy , his mental state . 323 | he really smiles at his nose . 324 | i 'm almost there . 325 | here 's a look at the technical glitches . 326 | i have hot hands . . . 327 | here 's an overview of a job that looks like this to a janitor . 328 | he said he was somewhere near the border with lu . 329 | alcohol and cigarettes are not allowed . 330 | i got an odd phone call . 331 | you 're going to kill us , for charlie ! 332 | your family murdered '' 333 | his friends ' eyes turned to hatred and hostility . 334 | you want to go to dinner ? 335 | 'imagine how pleased i was to be and how surprised i was ! ' 336 | flashy with money like sheik oman . '' 337 | 'this is so strange . . . 338 | funny and entertaining . '' 339 | you can do this in 4 to 6 weeks . 340 | i 'm saying what ? 341 | in that case , you . 342 | when you met linda bloom , right ? 343 | d'hoffyn 's not invited . 344 | has the course changed ? 345 | may be the blame for her kidnapping . '' 346 | boss , we have to escape this catastrophe ! 347 | remember , i have a theater below the street 348 | is n't that great ? 349 | you 're not eating ? 350 | you 're looking like a funny idea . 351 | you think it 's weird ? 352 | and you say someone helped him ? 
353 | not so lucky '' 354 | the moment she summoned gryf , she climbed into his strong back and commanded him with a firm command . 355 | grow the grass on that wall . '' 356 | i have a cup of coffee too . 357 | consider all the commitments adopted pursuant to this chapter . '' 358 | kind of aggressive , right ? 359 | i have an intelligent way of getting married . 360 | did you tell me you upgraded ? 361 | by lighting up the fart , you burned yourself . 362 | he was built on an existing water and sewer network . 363 | both had guns . 364 | there is also an explicit list of damage to the removal of the debris . 365 | screaming like that sounds sad to me . 366 | his taste . 367 | your wife would appreciate it . 368 | the messenger just sent you an envelope full of money . 369 | you 're not going ? 370 | i do n't like being locked up in bed . '' 371 | you do n't want any compensation ? 372 | son of a bitch would kick his ass . 373 | there 's a missing person 's report from woodburne , a woman who 's writing about her husband 374 | in the end , retreats . 375 | too territorial . 376 | i do n't know how he died . 377 | breaking into the hotel room was the same thing that happened to nixon . 378 | this is an autopsy report ? 379 | nael threatens to die . 380 | to protect the security , other pages of this site have also been blocked . 381 | above a little higher weight . 382 | in short bursts , they use . 383 | my mom 's on it '' ! 384 | employed in new york 's office '' 385 | nicole was still crying . 386 | without a stable . 387 | do you like to dance ? 388 | you see that rock over there ? 389 | funny , then . 390 | you 're being disrespectful . 391 | trapped in cardassia when klingon attacked . '' 392 | i would have died if they had attacked me . '' 393 | candidates for eurojust 's administrative functions are invited . 394 | you promised me you 'd put your dreams in front of mine . 395 | the scout is looking at her grandmother as she kills a bull . 
396 | the only thing you 're afraid of is the catch . 397 | i mixed him up with water , and he put him in the pump . 398 | because of the differences ? 399 | we both have to find an open space and light the fire ! 400 | it 's not clear . 401 | excuse me , mr . harriman ? 402 | stop , asshole . '' 403 | it really seems silly to put an oyster on a clam . 404 | it 's true that if you go on , you 'll leave the darkness and go to the next darkness '' . 405 | i 'm the only magician he knows . . . 406 | then who is he ? '' jack asked . 407 | air 's gone . '' 408 | is that a good blender ? '' 409 | think in a box ! 410 | i do n't think he 's hungry . '' 411 | according to the evidence , he did n't kill kwak . 412 | it 's really fun . 413 | behind him was no leader ! 414 | on this day , it was terrible . 415 | stop talking like a fool ! 416 | inspector . . . there 's no fraternity in the street 417 | the tall fist saw him and winced . 418 | never get mad ? '' 419 | then his eyes twitched a little bit . 420 | no , swallowing pills . . . 421 | do n't wear a gunpowder in your dress . 422 | it 's just suicide when we try to break mayan 's record . 423 | i mean , she does n't love me anymore . 424 | i 'd like something to drink . 425 | has he chosen himself ? 426 | you were looking out ? 427 | not like your fate . 428 | listen to the radio . '' 429 | you hurt what ? 430 | report to gina from jackie 's account . 431 | install light switches at work stations readily available at the door . '' 432 | i 'm not interested in christmas . 433 | you got any tips for me ? 434 | i can also see her smile , which can sometimes be very mysterious . 435 | i 'm looking for a wolf where he can find his prey . 436 | and professor kang , who was a pathologist , 437 | sounds like someone 's gon na kill somebody . 438 | installing windows internet connection failed . '' 439 | i ca n't save anyone in this place . '' 440 | may vanessa also be a demon victim . 
'' 441 | after the attack on the guard , you hurt yourself ? 442 | serious allegations that she had stolen from her company . 443 | in english , science and chemistry . 444 | i 'll find an endless rope somewhere in the steepest slope 445 | did you invite foreign troops to france 's sovereign territory ? '' 446 | there was no such huge , wild animal anywhere in the solar system . 447 | 'they all want to kill themselves ' 448 | enables the broadcast of digital media . '' 449 | me and you this afternoon . 450 | can you find strahm on his mobile phone ? '' 451 | shall i prepare breakfast ? '' 452 | i guess you 're going to edu tomorrow . 453 | the thought that trisha was scared gave her a sense of guilt and fear . 454 | of 5,000 , it 's 50 . 455 | there 's no time for sex 456 | sex does n't compete . 457 | 'have you heard his majesty 's order ? ' 458 | if you want , and lick him ? 459 | there 's so much greed ? 460 | we do n't tell people . 461 | seeing the game and the match and everything like that is a completely new experience for me . 462 | against whom he should be defended ? 463 | so they all shout . 464 | an imperialist country using such thinking is justified in economic greed . 465 | for a visit to the secretary , i want everyone involved . 466 | did you see it as a problem , like a chance to solve it ? '' 467 | 'we 're not the only ones who think it 's funny ' 468 | 'will you be friends ? ' 469 | stupid , where are the whitefish ? '' 470 | ca n't you respond to 300 milligrams of thorazine ? '' 471 | kill the bandits sabata . '' 472 | let 's all know ! '' 473 | do you no longer fear death ? '' 474 | something 's going to be a little weird . 475 | never sold the papers '' 476 | son of a bitch came back . 477 | not like a tough girl . 478 | 'all of you on happy seven ? 479 | burton depended on him to wake up as soon as possible . 480 | there 's colored coffee . '' 481 | even with a nice rifle . 482 | i did n't see a better movie . 
483 | there 's one possibility , he was sent by a real killer . 484 | the fear makes me furious . 485 | you 'll treat me the same courtesy as my deputies . '' 486 | miss there 's a serenity , divine peace . '' 487 | i think the first league is thinking like me . '' 488 | any sales forecast ? 489 | no , i do n't want to eat . 490 | have you taken part too , mrs . nehru ? 491 | is that another mistake ? 492 | is that all mine ? '' 493 | stop being such an asshole . 494 | in the pregnancy of my mother , my father smoked . 495 | because of your luck . 496 | i 'm not stopping . 497 | put an innocent man in danger ? '' 498 | but that 's crazy . 499 | he did n't write his reasons ? 500 | 'there was nothing to worry about in malacandra , except for oyarsa 501 | i 'll make sure you 're not late 502 | the pain gave rise to anger . 503 | barrenger molested little girls '' 504 | bring the girl home at 9 00 in the evening . 505 | the war was my youth ! 506 | look how smart you are ! ' 507 | remove the mask permanently . '' 508 | it is clear that if i have to choose , i will take merlin with me . 509 | all your friends have sex with virgins , so do n't you ? '' 510 | then we play rock n'roll with a lot of energy . 511 | your mom 's in love '' with this ! 512 | throughout the evacuation . 513 | you 're not listening , you do n't understand . 514 | look how weird it is . . . 515 | not on the machine . 516 | i 'm just not listening . . . 517 | and kicking butt starts with an ass that makes cement . 518 | do you hear the internet joke about a doctor and a bunch of students ? '' 519 | '' will she ever leave her mother alone ? '' 520 | what an amazing feeling . 521 | in technical terms , the evaluation of the test results . 522 | 'everyone knows your various devices 523 | most common in warm periods '' 524 | bears do n't have a raucous ramp . 525 | come on , it belongs to you . 526 | at least an uncomfortable silence for the first time . 527 | this is going to be tough . 
528 | i 'm sick of your late night . 529 | do n't wait half the way to kill her . . . 530 | but it seems like talking like a whore is the usual thing for every woman today . 531 | they were always smiling proudly under these conditions . 532 | all right , but damon is actually drinking hydrochloric acid . 533 | 'you should have made sure there was no violence ' 534 | not an exact number . 535 | shall we go for lunch ? '' 536 | dave 's still playing today . 537 | wish you luck . '' 538 | i have to admit , i have been happy to work with you both . '' 539 | i hope it 's not a drain ? 540 | otherwise expect an increase in the number of casualties . 541 | that the bank considers it too risky . 542 | according to him , he was sent an apology . 543 | i guess the whole family will be in trouble ? 544 | something hidden in the hallway ? '' 545 | the mysterious figure , according to her , created a secret system of checks . 546 | 'yes , i 'm lucky . ' 547 | all of london , they locked up the entire population in this place . 548 | not so quick ! '' 549 | voters simply adore daring actresses . 550 | among the priorities of the commission is an integration of environmental issues into policy dialogue . 551 | with the students , you did n't . 552 | son of a bitch ! 553 | i 'm not gon na murder my best friend , you know that ? 554 | of course , he 's a great architect . 555 | i like him . . . 556 | no motor ! 557 | '' would it hurt my feelings ? '' 558 | we 're going to 50 percent . 559 | i had no idea ! 560 | i 'm sure the planet of xenon will save captain wobbo . 561 | let 's see him today ? 562 | 'there will be no third world country ' 563 | do n't die so fast ! ' 564 | tell me your scream '' 565 | i do n't think two weeks is enough . '' 566 | mr . paddick did n't dance ? 567 | not funny , dad ? '' 568 | you know , some days do n't forget . 569 | peter connelly , originally wanted to fly this evening . 570 | for fun ? 571 | his mood may have changed . 572 | thunder 11 . . . 
all three objectives . '' 573 | did they do it ? '' 574 | go to sleep . '' 575 | there 's my friends . 576 | i 'm sure everything is okay 577 | that did n't bother helena . 578 | 'she opened the door , but she left no prints ? ' 579 | as your majesty commands . 580 | her complications are predictable . 581 | let 's talk about it , okay ? 582 | 'that 's our gym ' 583 | esk bit her lip and then saw how they would send her home in shame . 584 | south park will remain in perpetual darkness ! '' 585 | i do n't think there 's any chance . 586 | familiar to everyone that jasmine can only be taken before dawn . . . 587 | my decision was to write my memoirs . 588 | it 's nothing more than a vague rumor , like an alien in hangar 18 . 589 | stand by vorbis wherever he is and when he wants to do what he did yesterday . '' 590 | stilgar could not answer that question . '' 591 | i 'm having a weird day . 592 | carol , i ca n't go on santa if i do n't find mrs . claus . 593 | you all right , scooby doo ? 594 | we have a great sunday list . . . '' 595 | envoy did not need the commander at present . 596 | i 'd rather die . 597 | what kind of monster did that ? 598 | so we killed him ! 599 | and this unit can be found in all processors because they 're mathematical . 600 | an injury to the body is easy to heal , but emotional scars do n't disappear so easily . 601 | let 's talk about our mental health . 602 | he 'll be back tonight . 603 | they 're so beautiful ! 604 | for many people , such investments raised doubts . . . 605 | forget about it . '' 606 | chopper 's here . . . 607 | all tastes are complementary . . . very nice 608 | with the adjunct wiping the dust out of his gloves he headed north . 609 | we had no choice , '' appius snapped . 610 | 'take it easy ' 611 | i have nothing to worry about 'cause he 's never gon na find a little dick 612 | ca n't work ! ' 613 | i brought dollar bills . 614 | could not connect to properties page . 
'' 615 | you know , someone saw something , you know ? 616 | i have a lot of work to do tomorrow . 617 | did she look cute ? '' 618 | i 'll ignore it , you know . 619 | for the muslims . 620 | we have to operate or he dies . 621 | in the trap , turk ? 622 | you got a description ? 623 | am i kidding ? '' 624 | it 's gon na be hard . 625 | have you seen him since he died ? '' 626 | the only way to feel strong was to throw money away . 627 | let 's climb up the mountain and build a house . '' 628 | the presentation must be saved prior to the start of the record , schedule , or start the broadcast . 629 | we 'd like to talk to your father . 630 | the capacity of the agency must be strengthened further . 631 | 'that decision does not explicitly allow an extension of the deadline 632 | for a young handsome boy . 633 | were the last minutes of his life filled with hatred and rage against all mankind . '' 634 | he 's running out of fear of the vision of st clairvoyant . 635 | with luck , i wo n't hurt anyone ! 636 | we are making great military efforts . 637 | call it elena 's waiting song . '' 638 | a damn trap '' we 've just entered ! 639 | i have plans for the film . 640 | for the last 10 minutes i 've had something like that ! 641 | i prefer bay 's theory about two or three stages of life . 642 | you helped us , you said . 643 | but technology has improved . 644 | i 'm sure shershow 's wedding is in front of me . 645 | i 'll execute you if you bite the dog 646 | we have set out certain requirements for quality . 647 | brock moved his lips and made no sound . '' 648 | i 'm not gon na call your wife a whore . '' 649 | is there any ship capable of resisting the impact of its underwater camera ? '' 650 | we 'll be lucky . 651 | sounds like terrible shrieks '' like terrible waivers . 652 | awake brave journalist '' 653 | there may be a solution to the negotiation . 654 | you 're so annoying . 655 | i was n't joking . 656 | there 's no cd in the stolen cars . 
657 | i had a terrible feeling about her 658 | have you ever seen jaws ? '' 659 | however , morak stepped forward and raised his sword . 660 | he 'll announce it at a banquet tomorrow . 661 | i watched the film . 662 | theresa was surprised to think of something like this . 663 | pleasure in your arrival '' i 'm glad . 664 | 57 seconds to the end ! 665 | no one said a word when his friend looked at him . 666 | 'her secret is what federal scientists are working on every day 667 | we do n't have any alcohol and that 's why . 668 | sofia , i 'm not responsible for your son 's breast feeding . 669 | we just got a cup of coffee . 670 | the funny fool will get me rich ! 671 | i 'm not doing anything wrong . 672 | is that so exciting ? '' 673 | i never worry about such things . . . 674 | i 'll have a nice ceremony . . . 675 | a couple of heads were sticking out of the door , and they looked around the corner . 676 | printer calibrated '' 677 | impossible to become a family . . . 678 | there could be anyone . . . 679 | your crutch and your staff is pleasing me . 680 | dylan wrote about lynching at the circus . '' 681 | that land is private ! 682 | you want me to slap a towel on my butt ? 683 | i want to travel . 684 | an ex member of s . r . p . will arrive tomorrow . '' 685 | so you 're going to pay me 100 if i clean your car . '' 686 | miss the only defect to be perfect . '' 687 | lock all the doors . '' 688 | you have the best chance to find them before they leave town . 689 | their cars are full of radioactive material . 690 | the main project aimed at developing an army of non lethal weapons has been called mclennen forster . 691 | the condition is exactly the same . 692 | the guy should never have died . 693 | as in the past , much greater attention should be paid to geographical indications and differences in manufacturing methods 694 | you 're so cute . 695 | is that gon na hurt him ? ' 696 | that 's what my dad wanted me to do . 697 | so you two ca n't talk . 
'' 698 | it makes me feel like an idiot blonde . 699 | what a beautiful dress . 700 | there were ten against the whole regiment . '' 701 | it 's hydarn 's fault . 702 | in memory of the achievements of the moroccan civilization , forever . 703 | and she had a strange dream . 704 | there 's no one to catch the doorman when he 's asleep . 705 | i know my father , so i 'm not surprised . 706 | after we made about 20 bucks , we waited for the sun to come out . 707 | no more medication . 708 | you know , do you have to dig a hole ? 709 | sometimes things do n't go according to plan . 710 | tom howard nudges him . '' 711 | i 'm not allowed to do this ! 712 | garber never made such a mistake . 713 | i have one task for you 714 | half a century ago , those scavengers destroyed our moon ! 715 | do you have any moisturiser ? 716 | all of these injuries ( weight loss ) 717 | we 'll try to keep up for a few minutes . 718 | i 'm not afraid of anything . 719 | sounds like god to you . 720 | she does n't seem to be six months old 721 | maybe the book will change your life . 722 | want to talk to mr eagles ? 723 | do n't eat four marshmallows , stu . '' 724 | 'so you could learn to ride a bike like this ' 725 | shameful to waste them on such insolent fools ! '' 726 | 'did you see her ? ' 727 | she scared the hell out of them , did n't she , doctor ? 728 | he does n't need rest or cuddling , which is something else than in europe . 729 | i made the missile . . . 730 | i just wanted to check if there 's enough space . . . 731 | the fbi was looking for bondi . 732 | just forget how we handle an ordinary job . 733 | does anyone have a bandolier ? '' 734 | is that why you wo n't stay overnight ? '' 735 | what 's nearby ? 736 | hey , when 's mom coming back ? 737 | which is a european wine whose colour is blood . 738 | if these policies fail , citizens will demand the mayor 's responsibility . 
739 | welcome to the biggest dream 740 | before the fraud starts , the european commission is taking action to protect businesses . 741 | you do n't have clothes ? 742 | 'give it to me and i will make a mistake 743 | but it 's not a good idea to mix vodka with whiskey . 744 | further proposals for simplification are currently under way . 745 | is neeble one of them ? '' 746 | i define three situations in this context '' 747 | how did it respond to the acceleration of the approval procedures for the grid infrastructure ? 748 | i do n't think it 's a good choice if you want someone to do your dirty work . '' 749 | i think it 's because he wanted to feed the thirst of his brain for self respect and alcohol . 750 | so let 's not waste our time ! 751 | from the grasslands came a couple of majestic trees , full of wild white birds . 752 | out of your heart . 753 | in buddhism , the spirit of the 49th day is believed to be reborn . 754 | electronic bills have an advantage of speed and anonymity . 755 | on the toilet where his husband found her . 756 | look , i 'll take care of the hotel tomorrow . 757 | dude , there 's weird stuff going on ! 758 | to lend you a razor ? 759 | i do n't see why we should try to marry her instead of a tomb . 760 | i ca n't marry william whele . 761 | 'gun position , move ! ' 762 | do n't hide him , no one will feed him . '' 763 | the rest . . . they 're toys . 764 | no one told roger about it '' 765 | the classic kind of friends he wants me to be . 766 | kind of like stealing our car . 767 | let 's go to another elevator . 768 | i did n't talk to him . 769 | a mistake in there . 770 | as a result , the average reward for a slow and quick career does not correspond to the reward for an average lifespan 771 | then again his anger flared as a bubble burst . . . 772 | careful to avoid a cockroach in such a hotel . '' 773 | any idea what a surprise wedding would be ? 774 | he said if humanity did n't have a future , he would die . 
775 | in that case , your contract and dola girosa . 776 | not funny to me . 777 | so she did n't have a nice coat , and my sister threw her out . '' 778 | you think it 's hard ? 779 | i wonder what devil 's taste tastes like ! 780 | is that a crime ? 781 | almost as if his moustache had shrunk . 782 | i 'll just take the shifter off my butt . 783 | it 's so beautiful ! 784 | there will be no doubt about the possibility of assistance in some disasters thanks to these improvements . 785 | could not transfer documents to workspace . '' 786 | no one abused me . 787 | in west hollywood they can arrest you if you do n't vaccinate your dog 788 | they marked the asian visa . 789 | i heard the monster poked a hole in the corner and looked at them . '' 790 | you know he 's dead ? 791 | not for a long time ? 792 | the technology he developed . 793 | the one who locked us up . 794 | the colony responded quickly and decisively . 795 | i have a powerful energy to heal from love . 796 | do n't kill the original vampire . '' 797 | before planning pregnancy , consider switching to suitable complementary medicines . 798 | she went to the river you were talking about , jumped into her and died ! 799 | looks like a couple of shots broke his fingers 800 | so , you 're planning on selling the remaining companies . 
801 | -------------------------------------------------------------------------------- /data-processing.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | from helper.utils import * 4 | import tempfile 5 | import pandas as pd 6 | import os 7 | import multiprocessing 8 | import math 9 | from collections import Counter 10 | import itertools 11 | 12 | 13 | 14 | parser = argparse.ArgumentParser() 15 | # for generating the corresponding src file used for bart generation model 16 | parser.add_argument("--input_dir", "-i", help="the input directory", type=str) 17 | parser.add_argument("--output_dir", "-o", help="the output directory", type=str) 18 | parser.add_argument("--use_template", choices=["Y", "N"], help="if we are using template as the syntactic signal") 19 | args = parser.parse_args() 20 | 21 | def generate_non_trim_version(o_dir, src_parses, tgt_pure_parses, src_lines, tgt_lines, signal, 22 | exemplar): 23 | ''' 24 | :param o_dir: output directory for the entire setting 25 | :param src_parses: source parses 26 | :param tgt_pure_parses: target parses 27 | :param src_lines: source sentences 28 | :param tgt_lines: target sentences 29 | :param signal: train/dev/test 30 | :param exemplar: the choices are [exemplar, non-exemplar] 31 | :return: 32 | ''' 33 | direc_path = f"{o_dir}/{exemplar}/no-trim" 34 | if not os.path.exists(direc_path): 35 | os.makedirs(direc_path) 36 | output_source = open(f"{direc_path}/{signal}.source", "w+") 37 | output_tgt = open(f"{direc_path}/{signal}.target", "w+") 38 | for i in range(0, len(src_parses)): 39 | # source line: source sentence, source parse, target parse; target line: target sentence 40 | output_source.write(f"{src_lines[i]}{src_parses[i]}{tgt_pure_parses[i]}\n") 41 | output_tgt.write(f"{tgt_lines[i]}\n") 42 | 43 | def generate_trim_version(o_dir, src_parses, tgt_pure_parses, src_lines, tgt_lines, signal, height, 44 | exemplar): 45 | direc_path = f"{o_dir}/{exemplar}/level{height}" 46 | if not os.path.exists(direc_path): 47 | 
os.makedirs(f"{direc_path}") 48 | output_source = open(f"{direc_path}/{signal}.source", "w+") 49 | output_tgt = open(f"{direc_path}/{signal}.target", "w+") 50 | src_trim, tgt_trim = [], [] 51 | for i in range(0, len(src_parses)): 52 | # source line: source sentence, source parse, trimmed target parse; target line: target sentence 53 | trim_tgt = trim_str(tgt_pure_parses[i], height) 54 | trim_src = trim_str(src_parses[i], height) 55 | src_trim.append(trim_src) 56 | tgt_trim.append(trim_tgt) 57 | output_source.write(f"{src_lines[i]}{src_parses[i]}{trim_tgt}\n") 58 | output_tgt.write(f"{tgt_lines[i]}\n") 59 | return src_trim, tgt_trim 60 | 61 | def generate_tgt_parse(arguments): 62 | result = [] 63 | level_, freq, (src_lines, src_parses), level = arguments # sentence chunk paired with its parse chunk, so chunk-local indices stay aligned 64 | for i in range(0, len(src_lines)): 65 | possible_drawn = step2_rouge(level_, freq, src_lines[i], level)[3] 66 | for possible in possible_drawn: 67 | # output_file.write(f"{src_lines[i]}{src_parses[i]}{possible}\n") 68 | result.append(f"{src_lines[i]}{src_parses[i]}{possible}\n") 69 | return result 70 | 71 | 72 | if __name__ == '__main__': 73 | input_dir = args.input_dir 74 | for signal in ["train", "test", "val"]: 75 | print("signal: ", signal) 76 | if signal != "test": 77 | src, tgt = f"{input_dir}/{signal}/src.txt", f"{input_dir}/{signal}/tgt.txt" 78 | else: 79 | src, tgt = f"{input_dir}/{signal}/src.txt", f"{input_dir}/{signal}/ref.txt" 80 | spe = stanford_parsetree_extractor() 81 | src_pure_parses, src_parses = spe.run(src) 82 | tgt_pure_parses, tgt_parses = spe.run(tgt) 83 | src_lines, tgt_lines = [line.strip("\n") for line in open(src, "r").readlines()], \ 84 | [line.strip("\n") for line in open(tgt, "r").readlines()] 85 | if args.use_template == "N": 86 | generate_non_trim_version(args.output_dir, src_parses, tgt_pure_parses, src_lines, 87 | tgt_lines, signal, "non-exemplar") 88 | for level in range(3, 11): 89 | src_trim, tgt_trim = generate_trim_version(args.output_dir, src_parses, 90 | tgt_pure_parses, src_lines, 91 | tgt_lines, signal, level, "non-exemplar") 92 | if signal != "test": 93 | # write the 
statistics of the combination 94 | path = f"{args.output_dir}/repe_statistics" 95 | if not os.path.exists(path): 96 | os.makedirs(path) 97 | frequency_file = open(f"{path}/repe_para_{level}.txt", "w+") 98 | frequency_dict = Counter(map(tuple, map(sorted, list(zip(src_trim, tgt_trim))))) 99 | for key, value in frequency_dict.items(): 100 | if value >= 1: 101 | print(f"{key}\t{value}\n") 102 | frequency_file.write(f"{key}\t{value}\n") 103 | 104 | elif signal == "test": 105 | print("write diverse source file") 106 | # generate the future target parses from the frequencies list 107 | path = f"{args.output_dir}/repe_statistics" 108 | if not os.path.exists(f"{path}/diverse"): 109 | os.makedirs(f"{path}/diverse") 110 | output_file = open(f"{path}/diverse/level{level}.source", "w+") 111 | frequency_lines = open(f"{path}/repe_para_{level}.txt", "r").readlines() 112 | level_, freq = generate_dict(frequency_lines), generate_counts_dict(frequency_lines) 113 | 114 | if multiprocessing.cpu_count() < len(src_lines): 115 | num_processes = multiprocessing.cpu_count() 116 | else: 117 | num_processes = len(src_lines) 118 | print("num_processes: ", num_processes) 119 | chunk_size = math.ceil(len(src_lines) / num_processes) # ceil, so the trailing lines get their own chunk instead of being dropped 120 | result = [] 121 | chunks = [(src_lines[i:i + chunk_size], src_parses[i:i + chunk_size]) for i in range(0, len(src_lines), chunk_size)] 122 | pool = multiprocessing.Pool(processes=num_processes) 123 | result.extend(pool.map(generate_tgt_parse, zip([level_] * len(chunks), [freq] * len(chunks), chunks, [level] * len(chunks)))) 124 | for line in list(itertools.chain(*result)): 125 | output_file.write(line) 126 | 127 | 128 | if args.use_template == "Y": 129 | # need to access the exemplar dataset to get the syntax info 130 | exemplar = f"{input_dir}/{signal}/tgt.txt" 131 | exemplar_pure_pareses, _ = spe.run(exemplar) 132 | generate_non_trim_version(args.output_dir, src_parses, exemplar_pure_pareses, 133 | src_lines, tgt_lines, signal, "exemplar") 134 | for level in range(3, 11): 135 | 
generate_trim_version(args.output_dir, src_parses, exemplar_pure_pareses, 136 | src_lines, tgt_lines, signal, level, "exemplar") 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | -------------------------------------------------------------------------------- /demo-input-data.txt: -------------------------------------------------------------------------------- 1 | Your drawing is so bright even sunshine adjusts focus! 2 | Singapore's management is so reliable even Singapore will improve the health of a person! 3 | Young artists are so productive even their subject matter focuses your attention on that subject! 4 | Watching the video again with clarity is so hilarious even enjoyment laugh! 5 | 2020 is so sad even sadness feel depression! 6 | Facing difficulties, Helen Keller is so resilient even her mother cry! 7 | LeBron James is so tall even my house fall to grind! 8 | -------------------------------------------------------------------------------- /demo.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from helper.utils import * 3 | 4 | if __name__ == '__main__': 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument("--output_dir", "-o") 7 | args = parser.parse_args() 8 | 9 | file_path = "demo-input-data.txt" 10 | spe = stanford_parsetree_extractor() 11 | src_pure_parses, src_parses = spe.run(file_path) 12 | src_lines = [line.strip("\n") for line in open(file_path, "r").readlines()] 13 | 14 | level = 3 15 | print("write diverse source file") 16 | # generate the future target parses from the frequencies list 17 | path = "processed-data/ParaNMT50-hf-refine/repe_statistics" 18 | 19 | output_file = open(f"{args.output_dir}/level{level}_paranmt.source", "w+") 20 | frequency_lines = open(f"{path}/repe_para_{level}.txt", "r").readlines() 21 | level_, freq = generate_dict(frequency_lines), generate_counts_dict(frequency_lines) 22 | for i in range(0, len(src_lines)): 23 | 
print(i) 24 | possible_drawn = step2_rouge(level_, freq, src_lines[i], level)[3] 25 | for possible in possible_drawn: 26 | output_file.write(f"{src_lines[i]}{src_parses[i]}{possible}\n") 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /downstream-dataset/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PlusLabNLP/AESOP/0f376d1413c1ef605b7a008992e3a562c9020b99/downstream-dataset/.DS_Store -------------------------------------------------------------------------------- /evaluation/candidate_selection.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import argparse 4 | import time 5 | import os 6 | import nltk 7 | from nltk.tokenize import word_tokenize 8 | import editdistance 9 | import rouge 10 | from bleu import * 11 | from nltk.translate.bleu_score import corpus_bleu 12 | 13 | def slice_11(my_list, n): 14 | composite_list = [my_list[x:x+n] for x in range(0, len(my_list),n)] 15 | return composite_list 16 | 17 | 18 | def bleu_scorer(ref, hyp, script='default'): 19 | refsend = [] 20 | for i in range(len(ref)): 21 | refsi = [] 22 | for j in range(len(ref[i])): 23 | refsi.append(ref[i][j].split()) 24 | refsend.append(refsi) 25 | 26 | gensend = [] 27 | for i in range(len(hyp)): 28 | gensend.append(hyp[i].split()) 29 | 30 | if script == 'nltk': 31 | metrics = corpus_bleu(refsend, gensend) 32 | return [metrics] 33 | 34 | metrics = compute_bleu(refsend, gensend) 35 | return metrics 36 | 37 | rouge_eval = rouge.Rouge(metrics=['rouge-1', 'rouge-2', 'rouge-l']) 38 | 39 | def select_posed_bleu(src, df_sub): 40 | poseds = [] 41 | for idx in list(df_sub.index): 42 | syn = df_sub.loc[idx, 'syn_paraphrase'] 43 | temp = df_sub.loc[idx, 'template'] 44 | syn_tags = list(zip(*nltk.pos_tag(word_tokenize(syn))))[1] 45 | 
temp_tags = list(zip(*nltk.pos_tag(word_tokenize(temp))))[1] 46 | posed = editdistance.eval(syn_tags, temp_tags) 47 | poseds.append(posed) 48 | 49 | min_posed = min(poseds) 50 | posed_idx = [i for i in range(len(poseds)) if poseds[i] == min_posed] 51 | max_bleu = -1 52 | final_idx = None 53 | id_start = list(df_sub.index)[0] 54 | for idx in posed_idx: 55 | syn = df_sub.loc[id_start + idx, 'syn_paraphrase'] 56 | bleu = bleu_scorer([[src]], [syn])[0] 57 | if bleu > max_bleu: 58 | max_bleu = bleu 59 | final_idx = id_start + idx 60 | 61 | return final_idx 62 | 63 | def select_rouge(src, df_sub): 64 | max_rouge = -1 65 | max_idx = None 66 | for idx in list(df_sub.index): 67 | syn = df_sub.loc[idx, 'syn_paraphrase'] 68 | rouge = rouge_eval.get_scores([syn], [src])[0]['rouge-1']['f'] 69 | if rouge > max_rouge: 70 | max_rouge = rouge 71 | max_idx = idx 72 | return max_idx 73 | 74 | def ranker_select_rouge(src, df_sub): 75 | max_rouge = -1 76 | max_idx = None 77 | for idx in list(df_sub.index): 78 | syn = df_sub.loc[idx, 'syn_paraphrase'] 79 | rouge1 = rouge_eval.get_scores([syn], [src])[0]['rouge-1']['f'] 80 | rouge2 = rouge_eval.get_scores([syn], [src])[0]['rouge-2']['f'] 81 | rougel = rouge_eval.get_scores([syn], [src])[0]['rouge-l']['f'] 82 | rouge_general = 0.2 * rouge1 + 0.3 * rouge2 + 0.5 * rougel 83 | if rouge_general > max_rouge: 84 | max_rouge = rouge_general 85 | max_idx = idx 86 | return max_idx 87 | 88 | def select_bleu(src, df_sub): 89 | max_bleu = -1 90 | max_idx = None 91 | for idx in list(df_sub.index): 92 | syn = df_sub.loc[idx, 'syn_paraphrase'] 93 | bleu = bleu_scorer([[src]], [syn])[1][0] 94 | if bleu > max_bleu: 95 | max_bleu = bleu 96 | max_idx = idx 97 | return max_idx 98 | 99 | def select_maxht(df_sub): 100 | max_ht = -1 101 | max_idx = None 102 | for idx in list(df_sub.index): 103 | ht = int(df_sub.loc[idx, 'height']) 104 | if ht > max_ht: 105 | max_ht = ht 106 | max_idx = idx 107 | 108 | return max_idx 109 | 110 | if __name__ == "__main__": 111 | 
112 | parser = argparse.ArgumentParser('Convert trees file to sentence file') 113 | parser.add_argument('-mode', default = 'test', help = '') 114 | parser.add_argument('-gen_dir', help = ' ', default="./") 115 | parser.add_argument('-output_file', help="the name of the output_file") 116 | # parser.add_argument('-clean_gen_file', required = True, help = 'name of the file') 117 | # parser.add_argument('-res_file', required = True, help = 'name of the file') 118 | parser.add_argument('-crt', choices = ['posed','rouge', 'bleu', 'maxht', 'rouge-general'], 119 | default ='bleu', 120 | help = "Criteria to select best generation") 121 | parser.add_argument('-sample', type=int, default=10) 122 | parser.add_argument('-scbart_generate', help="the file scbart generated", default="output/template-based-diverse-wr.txt") 123 | parser.add_argument('-target', help="the target file", default="eval_data/template-based3-set1.target") 124 | args = parser.parse_args() 125 | 126 | generate_lines = open(args.scbart_generate, "r").readlines() 127 | target_lines = open(args.target,"r").readlines() 128 | target_lines = [line.split("<sep>")[0].strip() for line in target_lines]  # keep the sentence before the first "<sep>" 129 | assert len(generate_lines) == len(target_lines) 130 | df_ls = [] 131 | for i in range(0, len(generate_lines)): 132 | generate = generate_lines[i] 133 | target = target_lines[i] 134 | df_ls.append({ 135 | "source": target, 136 | "syn_paraphrase": generate 137 | }) 138 | df = pd.DataFrame(df_ls) 139 | 140 | # df = pd.read_csv(os.path.join(args.gen_dir, args.clean_gen_file)) 141 | srcs_unq = [] 142 | idss = [] 143 | ids = [] 144 | prev_src = None 145 | prev_temp = None 146 | it = 0 147 | 148 | srcs_unq = [ls[0].strip("\n") for ls in slice_11(df["source"].values, args.sample)] 149 | idss = slice_11(df["source"].index, args.sample) 150 | 151 | assert len(idss) == len(srcs_unq) 152 | elites = [] 153 | for src, ids in zip(srcs_unq, idss): 154 | df_sub = df.loc[ids] 155 | 156 | if args.crt == 'posed': 157 | final_idx = 
select_posed_bleu(src, df_sub) 158 | elif args.crt == 'bleu': 159 | final_idx = select_bleu(src, df_sub) 160 | elif args.crt == 'maxht': 161 | final_idx = select_maxht(df_sub) 162 | elif args.crt == 'rouge-general': 163 | final_idx = ranker_select_rouge(src, df_sub) 164 | else: 165 | final_idx = select_rouge(src, df_sub) 166 | elites.append(final_idx) 167 | 168 | df_elite = df[df.index.isin(elites)] 169 | 170 | assert len(df_elite) == len(srcs_unq) 171 | try: 172 | references = df_elite['reference'].values 173 | except: 174 | references = [] 175 | syn_paras = df_elite['syn_paraphrase'].values 176 | sources = df_elite['source'].values 177 | 178 | # para_f, source_f = open(os.path.join(args.gen_dir, 'para.txt'), "w+"), \ 179 | # open(os.path.join(args.gen_dir, 'source.txt'), "w+") 180 | # para_f = open(os.path.join(args.gen_dir, 'QQPPos-para.txt'), "w+") 181 | para_f = open(os.path.join(args.gen_dir, args.output_file), "w+") 182 | for i, row in df_elite.iterrows(): 183 | syn_para, source = row["syn_paraphrase"].strip("\n").strip(), row["source"].strip("\n").strip() 184 | para_f.write(syn_para + "\n") 185 | # source_f.write(source + "\n") 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | -------------------------------------------------------------------------------- /evaluation/eval.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import rouge 3 | 4 | from eval_utils import Meteor, stanford_parsetree_extractor, \ 5 | compute_tree_edit_distance 6 | from tqdm import tqdm 7 | import ipdb as pdb 8 | import subprocess 9 | 10 | MULTI_BLEU_PERL = 'apps/multi-bleu.perl' 11 | 12 | def run_multi_bleu(input_file, reference_file): 13 | bleu_output = subprocess.check_output( 14 | "./{} -lc {} < {}".format(MULTI_BLEU_PERL, reference_file, input_file), 15 | stderr=subprocess.STDOUT, shell=True).decode('utf-8') 16 | bleu = float( 17 | bleu_output.strip().split("\n")[-1] 18 | .split(",")[0].split("=")[1][1:]) 19 | return bleu 
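`run_multi_bleu` above shells out to `multi-bleu.perl` and scrapes the corpus score from the last line of its output. That scraping step can be checked in isolation; `parse_multi_bleu_line` is a hypothetical helper (not part of the repo), and the sample output line is illustrative:

```python
def parse_multi_bleu_line(bleu_output):
    """Extract the corpus BLEU score from multi-bleu.perl output.

    Mirrors the parsing in run_multi_bleu: take the last line, the first
    comma-separated field, then the number after "= ".
    """
    last_line = bleu_output.strip().split("\n")[-1]
    return float(last_line.split(",")[0].split("=")[1][1:])


# A typical final line printed by multi-bleu.perl (the numbers are made up):
sample = "BLEU = 26.14, 57.1/32.5/19.1/11.8 (BP=1.000, ratio=0.998, hyp_len=23195, ref_len=23250)"
print(parse_multi_bleu_line(sample))  # -> 26.14
```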
20 | 21 | 22 | parser = argparse.ArgumentParser() 23 | parser.add_argument('--input_file', '-i', type=str) 24 | parser.add_argument('--ref_file', '-r', type=str) 25 | parser.add_argument('--temp_file', '-t', type = str) 26 | args = parser.parse_args() 27 | 28 | n_ref_line = len(list(open(args.ref_file))) 29 | n_inp_line = len(list(open(args.input_file))) 30 | print("#lines - ref: {}, inp: {}".format(n_ref_line, n_inp_line)) 31 | assert n_inp_line == n_ref_line, \ 32 | "#ref {} != #inp {}".format(n_ref_line, n_inp_line) 33 | 34 | bleu_score = run_multi_bleu(args.input_file, args.ref_file) 35 | print("bleu", bleu_score) 36 | spe = stanford_parsetree_extractor() 37 | input_parses = spe.run(args.input_file) 38 | ref_parses = spe.run(args.ref_file) 39 | temp_parses = spe.run(args.temp_file) 40 | spe.cleanup() 41 | assert len(input_parses) == n_inp_line 42 | assert len(ref_parses) == n_inp_line 43 | 44 | all_meteor = [] 45 | all_ted = [] 46 | all_ted_t = [] 47 | all_rouge1 = [] 48 | all_rouge2 = [] 49 | all_rougel = [] 50 | preds = [] 51 | 52 | rouge_eval = rouge.Rouge(metrics=['rouge-n', 'rouge-l'], 53 | max_n=2, 54 | limit_length=True, 55 | length_limit=100, 56 | length_limit_type='words', 57 | apply_avg=False, 58 | apply_best=False, 59 | alpha=0.5, # Default F1_score 60 | weight_factor=1.2, 61 | stemming=True) 62 | meteor = Meteor() 63 | pbar = tqdm(zip(open(args.input_file), 64 | open(args.ref_file), 65 | input_parses, 66 | ref_parses, 67 | temp_parses)) 68 | i = 0 69 | 70 | height = 5 71 | for input_line, ref_line, input_parse, ref_parse,temp_parse in pbar: 72 | ted = compute_tree_edit_distance(input_parse, ref_parse, height) 73 | ted_t = compute_tree_edit_distance(input_parse, temp_parse, height) 74 | ms = meteor._score(input_line.strip(), [ref_line.strip()]) 75 | rs = rouge_eval.get_scores([input_line.strip()], [ref_line.strip()]) 76 | 77 | all_rouge1.append(rs['rouge-1'][0]['f'][0]) 78 | all_rouge2.append(rs['rouge-2'][0]['f'][0]) 79 | 
all_rougel.append(rs['rouge-l'][0]['f'][0]) 80 | all_meteor.append(ms) 81 | all_ted.append(ted) 82 | all_ted_t.append(ted_t) 83 | pbar.set_description( 84 | "bleu: {:.3f}, rouge-1: {:.3f}, rouge-2: {:.3f}, " 85 | "rouge-l: {:.3f}, meteor: {:.3f}, syntax-TED: {:.3f}, Template-TED: {:.3f}".format( 86 | bleu_score, 87 | sum(all_rouge1) / len(all_rouge1) * 100, 88 | sum(all_rouge2) / len(all_rouge1) * 100, 89 | sum(all_rougel) / len(all_rouge1) * 100, 90 | sum(all_meteor) / len(all_meteor) * 100, 91 | sum(all_ted) / len(all_ted), 92 | sum(all_ted_t) / len(all_ted_t))) 93 | 94 | print( 95 | "bleu: {:.3f}, rouge-1: {:.3f}, rouge-2: {:.3f}, " 96 | "rouge-l: {:.3f}, meteor: {:.3f}, syntax-TED: {:.3f}, Template-TED: {:.3f}".format( 97 | bleu_score, 98 | sum(all_rouge1) / len(all_rouge1) * 100, 99 | sum(all_rouge2) / len(all_rouge1) * 100, 100 | sum(all_rougel) / len(all_rouge1) * 100, 101 | sum(all_meteor) / len(all_meteor) * 100, 102 | sum(all_ted) / len(all_ted), 103 | sum(all_ted_t) / len(all_ted_t))) 104 | -------------------------------------------------------------------------------- /evaluation/eval_utils.py: -------------------------------------------------------------------------------- 1 | # Python wrapper for METEOR implementation, by Xinlei Chen 2 | # Acknowledge Michael Denkowski for the generous discussion and help 3 | 4 | import os 5 | import re 6 | import subprocess 7 | import threading 8 | import tempfile 9 | 10 | from nltk.tree import Tree 11 | from zss import simple_distance, Node 12 | import pdb 13 | 14 | STANFORD_CORENLP = 'apps/stanford-corenlp-full-2018-10-05' 15 | METEOR_JAR = 'apps/meteor-1.5.jar' 16 | METEOR_DATA = 'data/paraphrase-en.gz' 17 | 18 | def enc(s): 19 | return s.encode('utf-8') 20 | 21 | 22 | def dec(s): 23 | return s.decode('utf-8') 24 | 25 | 26 | class Meteor: 27 | def __init__(self): 28 | self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, 29 | '-', '-', '-stdio', '-l', 'en', '-norm', '-a', 30 | METEOR_DATA] 31 | self.meteor_p = 
subprocess.Popen( 32 | self.meteor_cmd, 33 | cwd=os.getcwd(), 34 | stdin=subprocess.PIPE, 35 | stdout=subprocess.PIPE, 36 | stderr=subprocess.PIPE) 37 | # Used to guarantee thread safety 38 | self.lock = threading.Lock() 39 | 40 | def compute_score(self, gts, res): 41 | assert(gts.keys() == res.keys()) 42 | imgIds = gts.keys() 43 | scores = [] 44 | 45 | eval_line = 'EVAL' 46 | self.lock.acquire() 47 | for i in imgIds: 48 | assert(len(res[i]) == 1) 49 | stat = self._stat(res[i][0], gts[i]) 50 | eval_line += ' ||| {}'.format(stat) 51 | 52 | self.meteor_p.stdin.write(enc('{}\n'.format(eval_line))) 53 | self.meteor_p.stdin.flush() 54 | for i in range(0, len(imgIds)): 55 | scores.append(float(dec(self.meteor_p.stdout.readline().strip()))) 56 | score = float(dec(self.meteor_p.stdout.readline().strip())) 57 | self.lock.release() 58 | 59 | return score, scores 60 | 61 | def _stat(self, hypothesis_str, reference_list): 62 | # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words 63 | hypothesis_str = hypothesis_str.replace('|||','').replace('  ', ' ') 64 | score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str)) 65 | self.meteor_p.stdin.write(enc(score_line + "\n")) 66 | self.meteor_p.stdin.flush() 67 | return dec(self.meteor_p.stdout.readline()).strip() 68 | 69 | def _score(self, hypothesis_str, reference_list): 70 | # self.lock.acquire() 71 | with self.lock: 72 | # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words 73 | hypothesis_str = hypothesis_str.replace('|||','').replace('  ', ' ') 74 | score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str)) 75 | self.meteor_p.stdin.write(enc(score_line + "\n")) 76 | self.meteor_p.stdin.flush() 77 | stats = dec(self.meteor_p.stdout.readline().strip()) 78 | eval_line = 'EVAL ||| {}'.format(stats) 79 | # EVAL ||| stats 80 | self.meteor_p.stdin.write(enc('{}\n'.format(eval_line))) 81 | self.meteor_p.stdin.flush() 82 | score = 
float(dec(self.meteor_p.stdout.readline()).strip()) 83 | # bug fix: there are two values returned by the jar file, one average, and one all, so do it twice 84 | # thanks for Andrej for pointing this out 85 | score = float(dec(self.meteor_p.stdout.readline().strip())) 86 | # self.lock.release() 87 | return score 88 | 89 | def __del__(self): 90 | self.lock.acquire() 91 | self.meteor_p.stdin.close() 92 | self.meteor_p.kill() 93 | self.meteor_p.wait() 94 | self.lock.release() 95 | 96 | 97 | def deleaf(parse_string): 98 | tree = Tree.fromstring(parse_string.strip(), read_leaf=lambda s: "") 99 | for sub in tree.subtrees(): 100 | for n, child in enumerate(sub): 101 | if isinstance(child, str): 102 | continue 103 | if len(list(child.subtrees(filter=lambda x: x.label() == '-NONE-'))) == len(child.leaves()): 104 | del sub[n] 105 | oneline = tree.pformat(margin=10000, parens=[" ( ", " ) "]) 106 | oneline = re.sub(' +', ' ', oneline) 107 | return oneline 108 | 109 | 110 | def extract_parses(fname): 111 | # extract parses from corenlp output 112 | # based on https://github.com/miyyer/scpn/blob/master/read_paranmt_parses.py 113 | with open(fname, 'r', encoding='utf-8') as f: 114 | 115 | count = 0 116 | sentences = [] 117 | data = {'tokens': [], 'pos': [], 'parse': '', 'deps': []} 118 | for idx, line in enumerate(f): 119 | if idx <= 1: 120 | continue 121 | if line.startswith('Sentence #'): 122 | new_sent = True 123 | new_pos = False 124 | new_parse = False 125 | new_deps = False 126 | if idx == 2: 127 | continue 128 | 129 | sentences.append(data) 130 | count += 1 131 | 132 | data = {'tokens': [], 'pos': [], 'parse': '', 'deps': []} 133 | 134 | # read original sentence 135 | elif new_sent: 136 | new_sent = False 137 | new_pos = True 138 | 139 | elif new_pos and line.startswith("Tokens"): 140 | continue 141 | 142 | # read POS tags 143 | elif new_pos and line.startswith('[Text='): 144 | line = line.strip().split() 145 | w = line[0].split('[Text=')[-1] 146 | pos = 
line[-1].split('PartOfSpeech=')[-1][:-1] 147 | data['tokens'].append(w) 148 | data['pos'].append(pos) 149 | 150 | # start reading const parses 151 | elif (new_pos or new_parse) and len(line.strip()): 152 | if line.startswith("Constituency parse"): 153 | continue 154 | new_pos = False 155 | new_parse = True 156 | data['parse'] += ' ' + line.strip() 157 | 158 | # start reading deps 159 | elif (new_parse and line.strip() == "") or \ 160 | line.startswith("Dependency Parse"): 161 | new_parse = False 162 | new_deps = True 163 | 164 | elif new_deps and len(line.strip()): 165 | line = line.strip()[:-1].split('(', 1) 166 | rel = line[0] 167 | x1, x2 = line[1].split(', ') 168 | x1 = x1.replace("'", "") 169 | x2 = x2.replace("'", "") 170 | x1 = int(x1.rsplit('-', 1)[-1]) 171 | x2 = int(x2.rsplit('-', 1)[-1]) 172 | data['deps'].append((rel, x1 - 1, x2 - 1)) 173 | 174 | else: 175 | new_deps = False 176 | 177 | sentences.append(data) 178 | 179 | return sentences 180 | 181 | 182 | class stanford_parsetree_extractor: 183 | def __init__(self): 184 | self.stanford_corenlp_path = os.path.join(STANFORD_CORENLP, "*") 185 | print("stanford corenlp path:", self.stanford_corenlp_path) 186 | self.output_dir = tempfile.TemporaryDirectory() 187 | self.cmd = ['java', '-cp', self.stanford_corenlp_path, 188 | '-Xmx2G', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', 189 | '-annotators', 'tokenize,ssplit,pos,parse', 190 | '-ssplit.eolonly', '-outputFormat', 'text', 191 | '-outputDirectory', self.output_dir.name, 192 | '-file', None] 193 | 194 | def run(self, file): 195 | print("parsing file:", file) 196 | self.cmd[-1] = file 197 | out = subprocess.run( 198 | self.cmd, 199 | cwd=os.getcwd(), 200 | stdout=subprocess.PIPE, 201 | stderr=subprocess.PIPE) 202 | print(out) 203 | parsed_file = \ 204 | os.path.join( 205 | self.output_dir.name, 206 | os.path.split(file)[1] + ".out") 207 | return [deleaf(e['parse']).strip() for e in extract_parses(parsed_file)] 208 | 209 | def cleanup(self): 210 | 
self.output_dir.cleanup() 211 | 212 | 213 | def build_tree(s): 214 | old_t = Tree.fromstring(s) 215 | new_t = Node("S") 216 | 217 | def create_tree(curr_t, t): 218 | if t.label() and t.label() != "S": 219 | new_t = Node(t.label()) 220 | curr_t.addkid(new_t) 221 | else: 222 | new_t = curr_t 223 | for i in t: 224 | if isinstance(i, Tree): 225 | create_tree(new_t, i) 226 | create_tree(new_t, old_t) 227 | return new_t 228 | 229 | 230 | def strdist(a, b): 231 | if a == b: 232 | return 0 233 | else: 234 | return 1 235 | 236 | def string_comma(string): 237 | start = 0 238 | new_string = '' 239 | while start < len(string): 240 | if string[start:].find(",") == -1: 241 | new_string += string[start:] 242 | break 243 | else: 244 | index = string[start:].find(",") 245 | if string[start - 2] != "(": 246 | new_string += string[start:start + index] 247 | new_string += " " 248 | else: 249 | new_string = new_string[:start-1] +", " 250 | start = start + index + 1 251 | return new_string 252 | 253 | def clean_tuple_str(tuple_str): 254 | new_str_ls = [] 255 | if len(tuple_str) == 1: 256 | new_str_ls.append(tuple_str[0]) 257 | else: 258 | for i in str(tuple_str).split(", "): 259 | if i.count("'") == 2: 260 | new_str_ls.append(i.replace("'", "")) 261 | elif i.count("'") == 1: 262 | new_str_ls.append(i.replace("\"", "")) 263 | str_join = ' '.join(ele for ele in new_str_ls) 264 | return string_comma(str_join) 265 | 266 | def to_tuple(lst): 267 | return tuple(to_tuple(i) if isinstance(i, list) else i for i in lst) 268 | 269 | def trim_tree_nltk(root, height): 270 | try: 271 | root.label() 272 | except AttributeError: 273 | return 274 | 275 | if height < 1: 276 | return 277 | all_child_state = [] 278 | # print(root.label()) 279 | all_child_state.append(root.label()) 280 | 281 | if len(root) >= 1: 282 | for child_index in range(len(root)): 283 | child = root[child_index] 284 | if trim_tree_nltk(child, height - 1): 285 | all_child_state.append(trim_tree_nltk(child, height - 1)) 286 | # 
print(all_child_state) 287 | return all_child_state 288 | 289 | 290 | def trim_str(string, height): 291 | return clean_tuple_str(to_tuple(trim_tree_nltk(Tree.fromstring(string), height))) 292 | 293 | 294 | def compute_tree_edit_distance(pred_parse, ref_parse, height): 295 | if height == 3: 296 | return simple_distance( 297 | build_tree(trim_str(ref_parse, height)), build_tree(trim_str(pred_parse, height)), label_dist=strdist) 298 | else: 299 | return simple_distance( 300 | build_tree(ref_parse), build_tree(pred_parse), label_dist=strdist) -------------------------------------------------------------------------------- /extract_sentence.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | parser = argparse.ArgumentParser() 4 | parser.add_argument("--input_file", help="the input file from bart output") 5 | 6 | 7 | parser.add_argument("--keyword", default="<sep>") 8 | args = parser.parse_args() 9 | 10 | input_lines = open(args.input_file, "r").readlines() 11 | # ref_lines = open(args.ref_file, "r").readlines() 12 | keyword = args.keyword 13 | input_file = args.input_file 14 | if ".txt" in args.input_file: 15 | input_file = input_file.replace(".txt", "") 16 | if keyword == "<sep>": 17 | output_file = open(input_file + f"_sep_extract", "w+") 18 | error_file = open(input_file + f"_sep_error", "w+") 19 | else: 20 | output_file = open(input_file + f"_return_extract", "w+") 21 | error_file = open(input_file + f"_return_error", "w+") 22 | 23 | count = 0 24 | 25 | def deal_non_sep(string, keyword): 26 | if keyword in string: raise Exception("it has <sep>") 27 | else: 28 | if "ROOT" not in string: 29 | # did not try to generate the syntactic parse at all 30 | final_str = string 31 | else: 32 | last_para_count = 0 33 | # find the last element with "(" in it 34 | for i in range(0, len(string.split(" "))): 35 | item = string.split(" ")[i] 36 | if "(" in item or ")" in item: 37 | last_para_count = i 38 | valid_tokens = string.split(" ")[last_para_count + 1:] 39 | final_str = " ".join(token for token in valid_tokens) 40 | if final_str == "": 41 | final_str = "." 42 | return final_str 43 | 44 | 45 | for line in input_lines: 46 | line = line.strip("\n") 47 | if keyword in line: 48 | output_file.write(line.split(keyword)[1] + "\n") 49 | else: 50 | output_file.write(deal_non_sep(line, args.keyword) + "\n") 51 | error_file.write(str(count+1) + "\t" + line + "\n") 52 | error_file.write("extract: " + deal_non_sep(line, args.keyword) + "\n") 53 | # output_file.write("\n") 54 | 55 | count = count + 1 56 | 57 | -------------------------------------------------------------------------------- /finetune_trainer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Copyright 2020 The HuggingFace Team. All rights reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | 16 | import logging 17 | import os 18 | import sys 19 | from dataclasses import dataclass, field 20 | from typing import Optional 21 | 22 | import transformers 23 | from transformers import ( 24 | AutoConfig, 25 | AutoModelForSeq2SeqLM, 26 | AutoTokenizer, 27 | HfArgumentParser, 28 | MBartTokenizer, 29 | Seq2SeqTrainer, 30 | Seq2SeqTrainingArguments, 31 | set_seed, 32 | ) 33 | from transformers.trainer_utils import EvaluationStrategy, is_main_process 34 | from transformers.training_args import ParallelMode 35 | from utils import ( 36 | Seq2SeqDataCollator, 37 | Seq2SeqDataset, 38 | assert_all_frozen, 39 | build_compute_metrics_fn, 40 | check_output_dir, 41 | freeze_embeds, 42 | freeze_params, 43 | lmap, 44 | save_json, 45 | use_task_specific_params, 46 | write_txt_file, 47 | ) 48 | 49 | 50 | logger = logging.getLogger(__name__) 51 | 52 | 53 | @dataclass 54 | class ModelArguments: 55 | """ 56 | Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 57 | """ 58 | 59 | model_name_or_path: str = field( 60 | metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} 61 | ) 62 | config_name: Optional[str] = field( 63 | default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} 64 | ) 65 | tokenizer_name: Optional[str] = field( 66 | default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} 67 | ) 68 | cache_dir: Optional[str] = field( 69 | default=None, 70 | metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, 71 | ) 72 | freeze_encoder: bool = field(default=False, metadata={"help": "Whether to freeze the encoder."}) 73 | freeze_embeds: bool = field(default=False, metadata={"help": "Whether to freeze the embeddings."}) 74 | 75 | 76 | @dataclass 77 | class DataTrainingArguments: 78 | """ 79 | Arguments pertaining to what data we are going to input our model for training and eval. 
80 | """ 81 | 82 | data_dir: str = field( 83 | metadata={"help": "The input data dir. Should contain the .tsv files (or other data files) for the task."} 84 | ) 85 | task: Optional[str] = field( 86 | default="summarization", 87 | metadata={"help": "Task name, summarization (or summarization_{dataset} for pegasus) or translation"}, 88 | ) 89 | max_source_length: Optional[int] = field( 90 | default=1024, 91 | metadata={ 92 | "help": "The maximum total input sequence length after tokenization. Sequences longer " 93 | "than this will be truncated, sequences shorter will be padded." 94 | }, 95 | ) 96 | max_target_length: Optional[int] = field( 97 | default=128, 98 | metadata={ 99 | "help": "The maximum total sequence length for target text after tokenization. Sequences longer " 100 | "than this will be truncated, sequences shorter will be padded." 101 | }, 102 | ) 103 | val_max_target_length: Optional[int] = field( 104 | default=142, 105 | metadata={ 106 | "help": "The maximum total sequence length for validation target text after tokenization. Sequences longer " 107 | "than this will be truncated, sequences shorter will be padded. " 108 | "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " 109 | "during ``evaluate`` and ``predict``." 110 | }, 111 | ) 112 | test_max_target_length: Optional[int] = field( 113 | default=142, 114 | metadata={ 115 | "help": "The maximum total sequence length for test target text after tokenization. Sequences longer " 116 | "than this will be truncated, sequences shorter will be padded." 117 | }, 118 | ) 119 | n_train: Optional[int] = field(default=-1, metadata={"help": "# training examples. -1 means use all."}) 120 | n_val: Optional[int] = field(default=-1, metadata={"help": "# validation examples. -1 means use all."}) 121 | n_test: Optional[int] = field(default=-1, metadata={"help": "# test examples. 
-1 means use all."}) 122 | src_lang: Optional[str] = field(default=None, metadata={"help": "Source language id for translation."}) 123 | tgt_lang: Optional[str] = field(default=None, metadata={"help": "Target language id for translation."}) 124 | eval_beams: Optional[int] = field(default=None, metadata={"help": "# num_beams to use for evaluation."}) 125 | ignore_pad_token_for_loss: bool = field( 126 | default=True, 127 | metadata={"help": "If only pad tokens should be ignored. This assumes that `config.pad_token_id` is defined."}, 128 | ) 129 | 130 | 131 | def handle_metrics(split, metrics, output_dir): 132 | """ 133 | Log and save metrics 134 | 135 | Args: 136 | - split: one of train, val, test 137 | - metrics: metrics dict 138 | - output_dir: where to save the metrics 139 | """ 140 | 141 | logger.info(f"***** {split} metrics *****") 142 | for key in sorted(metrics.keys()): 143 | logger.info(f" {key} = {metrics[key]}") 144 | save_json(metrics, os.path.join(output_dir, f"{split}_results.json")) 145 | 146 | 147 | def main(): 148 | # See all possible arguments in src/transformers/training_args.py 149 | # or by passing the --help flag to this script. 150 | # We now keep distinct sets of args, for a cleaner separation of concerns. 151 | 152 | parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments)) 153 | 154 | if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): 155 | # If we pass only one argument to the script and it's the path to a json file, 156 | # let's parse it to get our arguments. 
157 | model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) 158 | else: 159 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 160 | 161 | check_output_dir(training_args) 162 | 163 | # Setup logging 164 | logging.basicConfig( 165 | format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", 166 | datefmt="%m/%d/%Y %H:%M:%S", 167 | level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN, 168 | ) 169 | logger.warning( 170 | "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", 171 | training_args.local_rank, 172 | training_args.device, 173 | training_args.n_gpu, 174 | bool(training_args.parallel_mode == ParallelMode.DISTRIBUTED), 175 | training_args.fp16, 176 | ) 177 | # Set the verbosity to info of the Transformers logger (on main process only): 178 | if is_main_process(training_args.local_rank): 179 | transformers.utils.logging.set_verbosity_info() 180 | transformers.utils.logging.enable_default_handler() 181 | transformers.utils.logging.enable_explicit_format() 182 | logger.info("Training/evaluation parameters %s", training_args) 183 | 184 | # Set seed 185 | set_seed(training_args.seed) 186 | 187 | # Load pretrained model and tokenizer 188 | # 189 | # Distributed training: 190 | # The .from_pretrained methods guarantee that only one local process can concurrently 191 | # download model & vocab. 
192 | 193 | config = AutoConfig.from_pretrained( 194 | model_args.config_name if model_args.config_name else model_args.model_name_or_path, 195 | cache_dir=model_args.cache_dir, 196 | ) 197 | 198 | extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout") 199 | for p in extra_model_params: 200 | if getattr(training_args, p, None): 201 | assert hasattr(config, p), f"({config.__class__.__name__}) doesn't have a `{p}` attribute" 202 | setattr(config, p, getattr(training_args, p)) 203 | 204 | tokenizer = AutoTokenizer.from_pretrained( 205 | model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, 206 | cache_dir=model_args.cache_dir, 207 | ) 208 | model = AutoModelForSeq2SeqLM.from_pretrained( 209 | model_args.model_name_or_path, 210 | from_tf=".ckpt" in model_args.model_name_or_path, 211 | config=config, 212 | cache_dir=model_args.cache_dir, 213 | ) 214 | 215 | # use task specific params 216 | use_task_specific_params(model, data_args.task) 217 | 218 | # set num_beams for evaluation 219 | if data_args.eval_beams is None: 220 | data_args.eval_beams = model.config.num_beams 221 | 222 | # set decoder_start_token_id for MBart 223 | if model.config.decoder_start_token_id is None and isinstance(tokenizer, MBartTokenizer): 224 | assert ( 225 | data_args.tgt_lang is not None and data_args.src_lang is not None 226 | ), "mBart requires --tgt_lang and --src_lang" 227 | model.config.decoder_start_token_id = tokenizer.lang_code_to_id[data_args.tgt_lang] 228 | 229 | if model_args.freeze_embeds: 230 | freeze_embeds(model) 231 | if model_args.freeze_encoder: 232 | freeze_params(model.get_encoder()) 233 | assert_all_frozen(model.get_encoder()) 234 | 235 | dataset_class = Seq2SeqDataset 236 | 237 | # Get datasets 238 | train_dataset = ( 239 | dataset_class( 240 | tokenizer, 241 | type_path="train", 242 | data_dir=data_args.data_dir, 243 | n_obs=data_args.n_train, 244 | max_target_length=data_args.max_target_length, 
245 | max_source_length=data_args.max_source_length, 246 | prefix=model.config.prefix or "", 247 | ) 248 | if training_args.do_train 249 | else None 250 | ) 251 | eval_dataset = ( 252 | dataset_class( 253 | tokenizer, 254 | type_path="val", 255 | data_dir=data_args.data_dir, 256 | n_obs=data_args.n_val, 257 | max_target_length=data_args.val_max_target_length, 258 | max_source_length=data_args.max_source_length, 259 | prefix=model.config.prefix or "", 260 | ) 261 | if training_args.do_eval or training_args.evaluation_strategy != EvaluationStrategy.NO 262 | else None 263 | ) 264 | test_dataset = ( 265 | dataset_class( 266 | tokenizer, 267 | type_path="test", 268 | data_dir=data_args.data_dir, 269 | n_obs=data_args.n_test, 270 | max_target_length=data_args.test_max_target_length, 271 | max_source_length=data_args.max_source_length, 272 | prefix=model.config.prefix or "", 273 | ) 274 | if training_args.do_predict 275 | else None 276 | ) 277 | 278 | # Initialize our Trainer 279 | compute_metrics_fn = ( 280 | build_compute_metrics_fn(data_args.task, tokenizer) if training_args.predict_with_generate else None 281 | ) 282 | trainer = Seq2SeqTrainer( 283 | model=model, 284 | args=training_args, 285 | train_dataset=train_dataset, 286 | eval_dataset=eval_dataset, 287 | data_collator=Seq2SeqDataCollator(tokenizer, data_args, training_args.tpu_num_cores), 288 | compute_metrics=compute_metrics_fn, 289 | tokenizer=tokenizer, 290 | ) 291 | 292 | all_metrics = {} 293 | # Training 294 | if training_args.do_train: 295 | logger.info("*** Train ***") 296 | 297 | train_result = trainer.train( 298 | model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None 299 | ) 300 | metrics = train_result.metrics 301 | metrics["train_n_objs"] = data_args.n_train 302 | 303 | trainer.save_model() # this also saves the tokenizer 304 | 305 | if trainer.is_world_process_zero(): 306 | handle_metrics("train", metrics, training_args.output_dir) 307 | 
all_metrics.update(metrics) 308 | 309 | # Need to save the state, since Trainer.save_model saves only the tokenizer with the model 310 | trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json")) 311 | 312 | # For convenience, we also re-save the tokenizer to the same directory, 313 | # so that you can share your model easily on huggingface.co/models =) 314 | tokenizer.save_pretrained(training_args.output_dir) 315 | 316 | # Evaluation 317 | if training_args.do_eval: 318 | logger.info("*** Evaluate ***") 319 | 320 | metrics = trainer.evaluate( 321 | metric_key_prefix="val", max_length=data_args.val_max_target_length, num_beams=data_args.eval_beams 322 | ) 323 | metrics["val_n_objs"] = data_args.n_val 324 | metrics["val_loss"] = round(metrics["val_loss"], 4) 325 | 326 | if trainer.is_world_process_zero(): 327 | 328 | handle_metrics("val", metrics, training_args.output_dir) 329 | all_metrics.update(metrics) 330 | 331 | if training_args.do_predict: 332 | logger.info("*** Predict ***") 333 | 334 | test_output = trainer.predict( 335 | test_dataset=test_dataset, 336 | metric_key_prefix="test", 337 | max_length=data_args.val_max_target_length, 338 | num_beams=data_args.eval_beams, 339 | ) 340 | metrics = test_output.metrics 341 | metrics["test_n_objs"] = data_args.n_test 342 | 343 | if trainer.is_world_process_zero(): 344 | metrics["test_loss"] = round(metrics["test_loss"], 4) 345 | handle_metrics("test", metrics, training_args.output_dir) 346 | all_metrics.update(metrics) 347 | 348 | if training_args.predict_with_generate: 349 | test_preds = tokenizer.batch_decode( 350 | test_output.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True 351 | ) 352 | test_preds = lmap(str.strip, test_preds) 353 | write_txt_file(test_preds, os.path.join(training_args.output_dir, "test_generations.txt")) 354 | 355 | if trainer.is_world_process_zero(): 356 | save_json(all_metrics, os.path.join(training_args.output_dir, "all_results.json")) 357 
| 358 | return all_metrics 359 | 360 | 361 | def _mp_fn(index): 362 | # For xla_spawn (TPUs) 363 | main() 364 | 365 | 366 | if __name__ == "__main__": 367 | main() 368 | -------------------------------------------------------------------------------- /helper/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PlusLabNLP/AESOP/0f376d1413c1ef605b7a008992e3a562c9020b99/helper/__init__.py -------------------------------------------------------------------------------- /helper/helper.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | from nltk.tree import Tree 4 | 5 | def remove_leaves_from_tree(root): 6 | # if we get to the leaf nodes, then return 7 | try: 8 | root.label() 9 | except AttributeError: 10 | return 11 | 12 | all_child_state = [] 13 | all_child_state.append([root.label()]) 14 | for child in root: 15 | if remove_leaves_from_tree(child): 16 | all_child_state.append(remove_leaves_from_tree(child)) 17 | 18 | return all_child_state 19 | 20 | 21 | def trim_tree(root, height): 22 | if isinstance(root, str): 23 | return root 24 | if height < 1: 25 | return 26 | all_child_state = [] 27 | # adding itself 28 | all_child_state.extend(root[0]) 29 | for child in root[1:]: 30 | if trim_tree(child, height - 1): 31 | all_child_state.append(trim_tree(child, height - 1)) 32 | return all_child_state 33 | 34 | def clean_tuple_str(tuple_str): 35 | new_str_ls = [] 36 | if len(tuple_str) == 1: 37 | new_str_ls.append(tuple_str[0]) 38 | else: 39 | for i in str(tuple_str).split(", "): 40 | if i.count("'") == 2: 41 | new_str_ls.append(i.replace("'", "")) 42 | elif i.count("'") == 1: 43 | new_str_ls.append(i.replace("\"", "")) 44 | str_join = ' '.join(ele for ele in new_str_ls) 45 | return string_comma(str_join) 46 | 47 | def to_tuple(lst): 48 | return tuple(to_tuple(i) if isinstance(i, list) else i for i in lst) 49 | 50 | def string_comma(string): 51 | start = 0 52 
| new_string = '' 53 | while start < len(string): 54 | if string[start:].find(",") == -1: 55 | new_string += string[start:] 56 | break 57 | else: 58 | index = string[start:].find(",") 59 | if string[start - 2] != "(": 60 | new_string += string[start:start + index] 61 | new_string += " " 62 | else: 63 | new_string = new_string[:start-1] +", " 64 | start = start + index + 1 65 | return new_string 66 | 67 | if __name__ == '__main__': 68 | parse_str = "(ROOT (S (VP (ADVP (RB suddenly)) (FW i) (VP (VBP 've) (VP (VBN gone) (PP (IN from) (S (VP (VBG trying) (S (VP (TO to) (VP (VB find) (PRT (RP out)) (SBAR (IN if) (S (NP (NN anyone)) (VP (MD might) (VP (VB have) (NP (PRP it)) (ADVP (IN in)) (PP (IN for) (NP (NP (NNS marcos)) (PP (TO to) (S (VP (VBG wondering) (SBAR (IN if) (S (NP (EX there)) (VP (VBZ 's) (NP (NP (NN anyone)) (PP (IN at) (NP (NN school))) (SBAR (WHNP (WP who)) (S (VP (VBZ does) (RB n't)))))))))))))))))))))))))) (. .)))" 69 | tree = Tree.fromstring(parse_str) 70 | a = remove_leaves_from_tree(tree) 71 | print(clean_tuple_str(to_tuple(trim_tree(a, 3)))) 72 | -------------------------------------------------------------------------------- /helper/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from nltk.tree import Tree 3 | import os 4 | import re 5 | import subprocess 6 | import threading 7 | import tempfile 8 | import codecs 9 | import ast 10 | from numpy.random import choice 11 | from rouge_score import rouge_scorer 12 | import itertools 13 | 14 | def deleaf(parse_string): 15 | tree = Tree.fromstring(parse_string.strip(), read_leaf=lambda s: "") 16 | for sub in tree.subtrees(): 17 | for n, child in enumerate(sub): 18 | if isinstance(child, str): 19 | continue 20 | if len(list(child.subtrees(filter=lambda x: x.label() == '-NONE-'))) == len(child.leaves()): 21 | del sub[n] 22 | oneline = tree.pformat(margin=10000, parens=[" ( ", " ) "]) 23 | oneline = re.sub(' +', ' ', oneline) 24 | return 
oneline 25 | 26 | def convert_str(string): 27 | new_list= [] 28 | for ele in string.split(" "): 29 | if ")" in ele: 30 | new_list.append(str(re.sub(r'^.*?\)', ')', ele))) 31 | else: 32 | new_list.append(ele) 33 | new_str = " ".join(ele for ele in new_list) 34 | return new_str 35 | 36 | def trim_tree_nltk(root, height): 37 | try: 38 | root.label() 39 | except AttributeError: 40 | return 41 | 42 | if height < 1: 43 | return 44 | all_child_state = [] 45 | # print(root.label()) 46 | all_child_state.append(root.label()) 47 | 48 | if len(root) >= 1: 49 | for child_index in range(len(root)): 50 | child = root[child_index] 51 | if trim_tree_nltk(child, height - 1): 52 | all_child_state.append(trim_tree_nltk(child, height - 1)) 53 | # print(all_child_state) 54 | return all_child_state 55 | 56 | 57 | # extract parses from corenlp output 58 | def extract_parses(fname): 59 | f = codecs.getreader('utf-8')(open(fname, 'rb')) 60 | 61 | count = 0 62 | sentences = [] 63 | data = {'tokens':[], 'pos':[], 'parse':'', 'deps':[]} 64 | for idx, line in enumerate(f): 65 | if line.startswith('Sentence #'): 66 | new_sent = True 67 | new_pos = False 68 | new_parse = False 69 | new_deps = False 70 | if idx == 0: 71 | continue 72 | 73 | # label_sentence(data) 74 | # print ' '.join(data['tokens']) 75 | # data['label'] = dataset[count]['label'] 76 | sentences.append(data) 77 | count += 1 78 | 79 | data = {'tokens':[], 'pos':[], 'parse':'', 'deps':[]} 80 | 81 | # read original sentence 82 | elif new_sent: 83 | # data['sent'] = line.strip() 84 | new_sent = False 85 | new_pos = True 86 | 87 | # read POS tags 88 | elif new_pos and line.startswith('[Text='): 89 | line = line.strip().split() 90 | w = line[0].split('[Text=')[-1] 91 | pos = line[-1].split('PartOfSpeech=')[-1][:-1] 92 | data['tokens'].append(w) 93 | data['pos'].append(pos) 94 | 95 | # start reading const parses 96 | elif (new_pos or new_parse) and line.strip() != '': 97 | new_pos = False 98 | new_parse = True 99 | data['parse'] += ' 
'+line.strip() 100 | data['pure_parse'] = convert_str(data['parse']) 101 | 102 | # start reading deps 103 | elif line.strip() == '': 104 | new_parse = False 105 | new_deps = True 106 | 107 | elif new_deps and line.strip() != '': 108 | line = line.strip()[:-1].split('(',1) 109 | rel = line[0] 110 | x1, x2 = line[1].split(', ') 111 | x1 = x1.replace("'", "") 112 | x2 = x2.replace("'", "") 113 | x1 = int(x1.rsplit('-', 1)[-1]) 114 | x2 = int(x2.rsplit('-', 1)[-1]) 115 | data['deps'].append((rel, x1 - 1, x2 - 1)) 116 | 117 | else: 118 | new_deps = False 119 | 120 | # add last sentence 121 | # label_sentence(data) 122 | # data['label'] = dataset[count]['label'] 123 | sentences.append(data) 124 | 125 | f.close() 126 | 127 | return sentences 128 | 129 | STANFORD_CORENLP = '../evaluation/apps/stanford-corenlp-full-2018-10-05' 130 | class stanford_parsetree_extractor: 131 | def __init__(self): 132 | self.stanford_corenlp_path = os.path.join(STANFORD_CORENLP, "*") 133 | print("stanford corenlp path:", self.stanford_corenlp_path) 134 | self.output_dir = tempfile.TemporaryDirectory() 135 | self.cmd = ['java', '-cp', self.stanford_corenlp_path, 136 | '-Xmx40g', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', 137 | '-parse.model', 'edu/stanford/nlp/models/srparser/englishSR.ser.gz', 138 | '-annotators', 'tokenize,ssplit,pos,parse', 139 | '-ssplit.eolonly', '-outputFormat', 'text', 140 | '-outputDirectory', self.output_dir.name, 141 | '-file', None] 142 | 143 | def run(self, file): 144 | print("parsing file:", file) 145 | self.cmd[-1] = file 146 | out = subprocess.run( 147 | self.cmd, 148 | cwd=os.getcwd(), 149 | stdout=subprocess.PIPE, 150 | stderr=subprocess.PIPE) 151 | print(out) 152 | parsed_file = \ 153 | os.path.join( 154 | self.output_dir.name, 155 | os.path.split(file)[1] + ".out") 156 | return [e['pure_parse'] for e in 157 | extract_parses(parsed_file)], [e['parse'] for e in extract_parses(parsed_file)] 158 | 159 | def cleanup(self): 160 | self.output_dir.cleanup() 161 | 
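To make the parse-trimming helpers in this file concrete, here is a standalone, nltk-free walkthrough: `trim_tree` and `to_tuple` are copied from their definitions later in this file, and the input is a hand-built nested list in the shape `remove_leaves_from_tree` produces for the toy parse `(S (NP x) (VP y))` (the example tree itself is invented for illustration):

```python
# trim_tree / to_tuple as defined in helper/utils.py; the sample tree is the
# leaf-free nested-list form remove_leaves_from_tree yields for (S (NP x) (VP y)).

def trim_tree(root, height):
    if isinstance(root, str):
        return root
    if height < 1:
        return
    all_child_state = []
    all_child_state.extend(root[0])   # root[0] holds the node label, e.g. ['S']
    for child in root[1:]:
        if trim_tree(child, height - 1):
            all_child_state.append(trim_tree(child, height - 1))
    return all_child_state

def to_tuple(lst):
    return tuple(to_tuple(i) if isinstance(i, list) else i for i in lst)

tree = [['S'], [['NP']], [['VP']]]
print(to_tuple(trim_tree(tree, 2)))  # height 2 keeps the children: ('S', ('NP',), ('VP',))
print(to_tuple(trim_tree(tree, 1)))  # height 1 keeps only the root label: ('S',)
```

Trimming at height h therefore discards all structure deeper than h levels below the root, which is exactly how the pruned target parses (h2, h4, ...) in the pretrained models are produced.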
162 | def remove_leaves_from_tree(root): 163 | # if we get to the leaf nodes, then return 164 | try: 165 | root.label() 166 | except AttributeError: 167 | return 168 | all_child_state = [] 169 | all_child_state.append([root.label()]) 170 | for child in root: 171 | if remove_leaves_from_tree(child): 172 | all_child_state.append(remove_leaves_from_tree(child)) 173 | return all_child_state 174 | 175 | def clean_tuple_str(tuple_str): 176 | new_str_ls = [] 177 | if len(tuple_str) == 1: 178 | new_str_ls.append(tuple_str[0]) 179 | else: 180 | for i in str(tuple_str).split(", "): 181 | if i.count("'") == 2: 182 | new_str_ls.append(i.replace("'", "")) 183 | elif i.count("'") == 1: 184 | new_str_ls.append(i.replace("\"", "")) 185 | str_join = ' '.join(ele for ele in new_str_ls) 186 | return string_comma(str_join) 187 | 188 | def prune_tree(parse_string, height): 189 | parse_tree = Tree.fromstring(parse_string) 190 | non_leaf_tree = remove_leaves_from_tree(parse_tree) 191 | final_str = clean_tuple_str(to_tuple(trim_tree(non_leaf_tree, height))) 192 | return final_str 193 | 194 | def rouge_score(string, ls): 195 | scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True) 196 | rouge_result = [scorer.score(string, ls[i]) for i in range(len(ls))] 197 | rouge1 = [i["rouge1"].fmeasure for i in rouge_result] 198 | rouge2 = [i["rouge2"].fmeasure for i in rouge_result] 199 | rougeL = [i["rougeL"].fmeasure for i in rouge_result] 200 | return rouge1, rouge2, rougeL 201 | 202 | def trim_tree(root, height): 203 | if isinstance(root, str): 204 | return root 205 | if height < 1: 206 | return 207 | 208 | all_child_state = [] 209 | all_child_state.extend(root[0]) 210 | 211 | for child in root[1:]: 212 | if trim_tree(child, height - 1): 213 | all_child_state.append(trim_tree(child, height - 1)) 214 | return all_child_state 215 | 216 | def string_comma(string): 217 | start = 0 218 | new_string = '' 219 | while start < len(string): 220 | if string[start:].find(",") == 
-1: 221 | new_string += string[start:] 222 | break 223 | else: 224 | index = string[start:].find(",") 225 | if string[start - 2] != "(": 226 | new_string += string[start:start + index] 227 | new_string += " " 228 | else: 229 | new_string = new_string[:start-1] +", " 230 | start = start + index + 1 231 | return new_string 232 | 233 | def to_tuple(lst): 234 | return tuple(to_tuple(i) if isinstance(i, list) else i for i in lst) 235 | 236 | 237 | def trim_str(string, height): 238 | return clean_tuple_str(to_tuple(trim_tree_nltk(Tree.fromstring(string), height))) 239 | 240 | 241 | 242 | def generate_dict(lines): 243 | result = {} 244 | for line in lines: 245 | line = line.strip("\n").split("\t") 246 | tuple_str = ast.literal_eval(line[0]) 247 | if tuple_str[0] in result: 248 | result[tuple_str[0]].append(tuple_str[1]) 249 | else: 250 | result[tuple_str[0]] = [tuple_str[1]] 251 | 252 | if tuple_str[1] in result: 253 | result[tuple_str[1]].append(tuple_str[0]) 254 | else: 255 | result[tuple_str[1]] = [tuple_str[0]] 256 | return result 257 | 258 | 259 | def generate_counts_dict(lines): 260 | result = {} 261 | for line in lines: 262 | line = line.strip("\n").split("\t") 263 | tuple_str = ast.literal_eval(line[0]) 264 | if tuple_str[0] in result: 265 | result[tuple_str[0]].append(int(line[1])) 266 | else: 267 | result[tuple_str[0]] = [int(line[1])] 268 | 269 | if tuple_str[1] in result: 270 | result[tuple_str[1]].append(int(line[1])) 271 | else: 272 | result[tuple_str[1]] = [int(line[1])] 273 | return result 274 | 275 | def pick_n_parses_freqs(level, freq, res1, n): 276 | # helper function to step2_rouge() 277 | all_parses = list(level.keys()) 278 | return_result = [] 279 | for i in res1: 280 | candidate_ls = level[all_parses[i]] 281 | freq_ls = freq[all_parses[i]] 282 | prob_ls = [item / sum(freq_ls) for item in freq_ls] 283 | if n < len(candidate_ls): 284 | return_result.append(choice(candidate_ls, n, replace=False, p=prob_ls).tolist()) 285 | else: 286 | 
return_result.append(candidate_ls) 287 | return list(itertools.chain.from_iterable(return_result)) 288 | 289 | 290 | def step2_rouge(level, freq, src_str, level_n, k_picks=5, n=2): 291 | """ 292 | :param src_str: the source parse -- string 293 | :param level_n: level_n is the index of the level we are targeting at: (0, 5), (1, 4), (2, 3) 294 | :param k_picks: find k most similar parse strings 295 | :param n: pick n tgt template parse for each similar parse 296 | :return: the returned k_picks*n possible tgt parses based on rouge scores 297 | """ 298 | # level = levels[level_n] 299 | # freq = freqs[level_n] 300 | all_parses = list(level.keys()) 301 | rouge1, rouge2, rougeL = rouge_score(src_str, all_parses) 302 | res1_before_pick = sorted(range(len(rouge1)), key=lambda sub: rouge1[sub]) 303 | res2_before_pick = sorted(range(len(rouge2)), key=lambda sub: rouge2[sub]) 304 | resL_before_pick = sorted(range(len(rougeL)), key=lambda sub: rougeL[sub]) 305 | res1, res2, resL = res1_before_pick[-k_picks:], res2_before_pick[-k_picks:], resL_before_pick[-k_picks:] 306 | 307 | w1, w2, w3 = 0.2, 0.3, 0.5 308 | weighted_res = [w1 * x + w2 * y + w3 * z for x, y, z in zip(res1_before_pick, res2_before_pick, resL_before_pick)] 309 | resW = sorted(range(len(weighted_res)), key=lambda sub: weighted_res[sub])[-k_picks:] 310 | 311 | print("start parsing rouge 1") 312 | # parses_1 = list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in res1])) 313 | parses_1 = pick_n_parses_freqs(level, freq, res1, n) 314 | print("start parsing rouge 2") 315 | # parse_2 = list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in res2])) 316 | parses_2 = pick_n_parses_freqs(level, freq, res2, n) 317 | print("start parsing rouge L") 318 | # parse_L = list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in resL])) 319 | parses_L = pick_n_parses_freqs(level, freq, resL, n) 320 | print("start parsing rouge weighted") 321 | # parses_weighted =
list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in 322 | # resW])) 323 | parses_weighted = pick_n_parses_freqs(level, freq, resW, n) 324 | return parses_1, parses_2, parses_L, parses_weighted 325 | 326 | # add for rebuttal -- adding new experiments for always choose the most frequent target parses 327 | def pick_n_parses_freqs_new(level, freq, res1, n): 328 | # helper function to step2_rouge() 329 | all_parses = list(level.keys()) 330 | return_result = [] 331 | for i in res1: 332 | # the candidate list that contains all the parses 333 | candidate_ls = level[all_parses[i]] 334 | freq_ls = freq[all_parses[i]] 335 | return_result.append([candidate_ls[freq_ls.index(max(freq_ls))]]) 336 | return list(itertools.chain.from_iterable(return_result)) 337 | 338 | def step2_rouge_new(level, freq, src_str, level_n, k_picks=10, n=1): 339 | """ 340 | :param src_str: the source parse -- string 341 | :param level_n: level_n is the index of the level we are targeting at: (0, 5), (1, 4), (2, 3) 342 | :param k_picks: find k most similar parse strings 343 | :param n: pick n tgt template parse for each similar parse 344 | :return: the returned k_picks*n possible tgt parses based on rouge scores 345 | """ 346 | # level = levels[level_n] 347 | # freq = freqs[level_n] 348 | all_parses = list(level.keys()) 349 | rouge1, rouge2, rougeL = rouge_score(src_str, all_parses) 350 | res1_before_pick = sorted(range(len(rouge1)), key=lambda sub: rouge1[sub]) 351 | res2_before_pick = sorted(range(len(rouge2)), key=lambda sub: rouge2[sub]) 352 | resL_before_pick = sorted(range(len(rougeL)), key=lambda sub: rougeL[sub]) 353 | res1, res2, resL = res1_before_pick[-k_picks:], res2_before_pick[-k_picks:], resL_before_pick[-k_picks:] 354 | 355 | w1, w2, w3 = 0.2, 0.3, 0.5 356 | weighted_res = [w1 * x + w2 * y + w3 * z for x, y, z in zip(res1_before_pick, res2_before_pick, resL_before_pick)] 357 | resW = sorted(range(len(weighted_res)), key=lambda sub: weighted_res[sub])[-k_picks:] 358 | 
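The ranking idiom used by both `step2_rouge` and `step2_rouge_new` — score every candidate parse against the source, argsort ascending, keep the last k indices — can be isolated as follows. A toy unigram-F1 scorer stands in for `rouge_scorer`, and all data here is invented:

```python
# Dependency-free sketch of the "sort indices by score, take the last k" idiom
# used in step2_rouge. unigram_f1 is a toy stand-in for rouge_scorer, not real ROUGE.

def unigram_f1(a, b):
    # Overlap F1 over whitespace token sets.
    ta, tb = set(a.split()), set(b.split())
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(tb), overlap / len(ta)
    return 2 * precision * recall / (precision + recall)

def top_k_indices(src, candidates, k):
    scores = [unigram_f1(src, c) for c in candidates]
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending
    return order[-k:]  # indices of the k highest-scoring candidates

src = "S NP VP"
cands = ["S NP VP PP", "FRAG SBAR", "S VP"]
print(top_k_indices(src, cands, 2))  # -> [2, 0]: second-best "S VP", then best "S NP VP PP"
```

The last element of the returned list is the best match, which is why the functions above slice with `[-k_picks:]` rather than reversing the sort.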
359 | print("start parsing rouge 1") 360 | # parses_1 = list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in res1])) 361 | parses_1 = pick_n_parses_freqs(level, freq, res1, n) 362 | print("start parsing rouge 2") 363 | # parse_2 = list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in res2])) 364 | parses_2 = pick_n_parses_freqs(level, freq, res2, n) 365 | print("start parsing rouge L") 366 | # parse_L = list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in resL])) 367 | parses_L = pick_n_parses_freqs(level, freq, resL, n) 368 | print("start parsing rouge weighted") 369 | # parses_weighted = list(itertools.chain.from_iterable([level[all_parses[i]][:n] for i in 370 | # resW])) 371 | parses_weighted = pick_n_parses_freqs_new(level, freq, resW, n) 372 | return parses_1, parses_2, parses_L, parses_weighted 373 | 374 | -------------------------------------------------------------------------------- /requirement.txt: -------------------------------------------------------------------------------- 1 | pytorch==1.8.0 2 | transformers==4.2.0.dev0 3 | attrdict==2.0.1 4 | attrs==19.1.0 5 | boto3==1.9.199 6 | botocore==1.12.199 7 | certifi==2019.6.16 8 | chardet==3.0.4 9 | comet-git-pure==0.19.11 10 | comet-ml==2.0.5 11 | configobj==5.0.6 12 | cycler==0.10.0 13 | docutils==0.14 14 | everett==1.0.2 15 | editdistance==0.3.1 16 | idna==2.8 17 | jmespath==0.9.4 18 | joblib==0.13.2 19 | jsonschema==3.0.1 20 | kiwisolver==1.1.0 21 | matplotlib==3.1.1 22 | netifaces==0.10.9 23 | nltk==3.4.5 24 | numpy==1.17.0 25 | nvidia-ml-py3==7.352.0 26 | pandas==0.24.2 27 | Pillow==6.1.0 28 | pyparsing==2.4.2 29 | pyrouge==0.1.3 30 | pyrsistent==0.15.4 31 | python-dateutil==2.8.0 32 | regex==2019.6.8 33 | requests==2.22.0 34 | s3transfer==0.2.1 35 | scipy==1.3.0 36 | six==1.12.0 37 | torch==1.3.0 38 | tqdm==4.32.2 39 | urllib3==1.25.3 40 | websocket-client==0.56.0 41 | wurlitzer==1.0.3 42 | gensim==3.8.0 43 | zss==1.2.0 44 | h5py==2.10.0 45 | 
subword-nmt 46 | ipdb -------------------------------------------------------------------------------- /run_eval.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Copyright 2020 The HuggingFace Team. All rights reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | import argparse 17 | import datetime 18 | import json 19 | import time 20 | import warnings 21 | from logging import getLogger 22 | from pathlib import Path 23 | from typing import Dict, List 24 | 25 | import torch 26 | from tqdm import tqdm 27 | 28 | from transformers import AutoModelForSeq2SeqLM, AutoTokenizer 29 | from utils import calculate_bleu, calculate_rouge, chunks, parse_numeric_n_bool_cl_kwargs, use_task_specific_params 30 | 31 | 32 | logger = getLogger(__name__) 33 | 34 | 35 | DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu" 36 | 37 | 38 | def generate_summaries_or_translations( 39 | examples: List[str], 40 | out_file: str, 41 | model_name: str, 42 | batch_size: int = 8, 43 | device: str = DEFAULT_DEVICE, 44 | fp16=False, 45 | task="summarization", 46 | prefix=None, 47 | **generate_kwargs, 48 | ) -> Dict: 49 | """Save model.generate results to , and return how long it took.""" 50 | fout = Path(out_file).open("w", encoding="utf-8") 51 | model_name = str(model_name) 52 | model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device) 53 | if fp16: 54 | model = 
model.half() 55 | 56 | tokenizer = AutoTokenizer.from_pretrained(model_name) 57 | logger.info(f"Inferred tokenizer type: {tokenizer.__class__}") # if this is wrong, check config.model_type. 58 | 59 | start_time = time.time() 60 | # update config with task specific params 61 | use_task_specific_params(model, task) 62 | if prefix is None: 63 | prefix = prefix or getattr(model.config, "prefix", "") or "" 64 | for examples_chunk in tqdm(list(chunks(examples, batch_size))): 65 | examples_chunk = [prefix + text for text in examples_chunk] 66 | batch = tokenizer(examples_chunk, return_tensors="pt", truncation=True, padding="longest").to(device) 67 | summaries = model.generate( 68 | input_ids=batch.input_ids, 69 | attention_mask=batch.attention_mask, 70 | **generate_kwargs, 71 | ) 72 | dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False) 73 | for hypothesis in dec: 74 | fout.write(hypothesis + "\n") 75 | fout.flush() 76 | fout.close() 77 | runtime = int(time.time() - start_time) # seconds 78 | n_obs = len(examples) 79 | return dict(n_obs=n_obs, runtime=runtime, seconds_per_sample=round(runtime / n_obs, 4)) 80 | 81 | 82 | def datetime_now(): 83 | return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 84 | 85 | 86 | def run_generate(verbose=True): 87 | """ 88 | 89 | Takes input text, generates output, and then calculates the BLEU scores using the reference. 90 | 91 | The results are saved to a file and returned to the caller, and printed out unless ``verbose=False`` is passed. 92 | 93 | Args: 94 | verbose (:obj:`bool`, `optional`, defaults to :obj:`True`): print results to stdout 95 | 96 | Returns: 97 | a tuple: ``(scores, params)`` 98 | - ``scores``: a dict of scores data ``{'bleu': 39.6501, 'n_obs': 2000, 'runtime': 186, 'seconds_per_sample': 0.093}`` 99 | - ``params``: a dict of custom params, e.g. 
``{'num_beams': 5, 'length_penalty': 0.8}`` 100 | """ 101 | 102 | parser = argparse.ArgumentParser() 103 | parser.add_argument("model_name", type=str, help="like facebook/bart-large-cnn,t5-base, etc.") 104 | parser.add_argument("input_path", type=str, help="like cnn_dm/test.source") 105 | parser.add_argument("save_path", type=str, help="where to save summaries") 106 | parser.add_argument("--reference_path", type=str, required=False, help="like cnn_dm/test.target") 107 | parser.add_argument("--score_path", type=str, required=False, default="metrics.json", help="where to save metrics") 108 | parser.add_argument("--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.") 109 | parser.add_argument( 110 | "--prefix", type=str, required=False, default=None, help="will be added to the beginning of src examples" 111 | ) 112 | parser.add_argument("--task", type=str, default="summarization", help="used for task_specific_params + metrics") 113 | parser.add_argument("--bs", type=int, default=8, required=False, help="batch size") 114 | parser.add_argument( 115 | "--n_obs", type=int, default=-1, required=False, help="How many observations. Defaults to all." 116 | ) 117 | parser.add_argument("--fp16", action="store_true") 118 | parser.add_argument("--dump-args", action="store_true", help="print the custom hparams with the results") 119 | parser.add_argument( 120 | "--info", 121 | nargs="?", 122 | type=str, 123 | const=datetime_now(), 124 | help="use in conjunction w/ --dump-args to print with the results whatever other info you'd like, e.g. lang=en-ru. 
If no value is passed, the current datetime string will be used.", 125 | ) 126 | # Unspecified args like --num_beams=2 --decoder_start_token_id=4 are passed to model.generate 127 | args, rest = parser.parse_known_args() 128 | parsed_args = parse_numeric_n_bool_cl_kwargs(rest) 129 | if parsed_args and verbose: 130 | print(f"parsed the following generate kwargs: {parsed_args}") 131 | examples = [" " + x.rstrip() if "t5" in args.model_name else x.rstrip() for x in open(args.input_path).readlines()] 132 | if args.n_obs > 0: 133 | examples = examples[: args.n_obs] 134 | Path(args.save_path).parent.mkdir(exist_ok=True) 135 | if args.reference_path is None and Path(args.score_path).exists(): 136 | warnings.warn(f"score_path {args.score_path} will be overwritten unless you type ctrl-c.") 137 | runtime_metrics = generate_summaries_or_translations( 138 | examples, 139 | args.save_path, 140 | args.model_name, 141 | batch_size=args.bs, 142 | device=args.device, 143 | fp16=args.fp16, 144 | task=args.task, 145 | prefix=args.prefix, 146 | **parsed_args, 147 | ) 148 | 149 | if args.reference_path is None: 150 | return {} 151 | 152 | # Compute scores 153 | score_fn = calculate_bleu if "translation" in args.task else calculate_rouge 154 | output_lns = [x.rstrip() for x in open(args.save_path).readlines()] 155 | reference_lns = [x.rstrip() for x in open(args.reference_path).readlines()][: len(output_lns)] 156 | scores: dict = score_fn(output_lns, reference_lns) 157 | scores.update(runtime_metrics) 158 | 159 | if args.dump_args: 160 | scores.update(parsed_args) 161 | if args.info: 162 | scores["info"] = args.info 163 | 164 | if verbose: 165 | print(scores) 166 | 167 | if args.score_path is not None: 168 | json.dump(scores, open(args.score_path, "w")) 169 | 170 | return scores 171 | 172 | 173 | if __name__ == "__main__": 174 | # Usage for MT: 175 | # python run_eval.py MODEL_NAME $DATA_DIR/test.source $save_dir/test_translations.txt --reference_path $DATA_DIR/test.target --score_path 
$save_dir/test_bleu.json --task translation $@ 176 | run_generate(verbose=True) 177 | -------------------------------------------------------------------------------- /ted2.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import rouge 3 | 4 | from eval_utils import Meteor, stanford_parsetree_extractor, compute_tree_edit_distance 5 | from tqdm import tqdm 6 | import subprocess 7 | from nltk.tree import Tree 8 | 9 | def trim_tree_nltk(root, height): 10 | try: 11 | root.label() 12 | except AttributeError: 13 | return 14 | 15 | if height < 1: 16 | return 17 | all_child_state = [] 18 | # print(root.label()) 19 | all_child_state.append(root.label()) 20 | 21 | if len(root) >= 1: 22 | for child_index in range(len(root)): 23 | child = root[child_index] 24 | if trim_tree_nltk(child, height - 1): 25 | all_child_state.append(trim_tree_nltk(child, height - 1)) 26 | # print(all_child_state) 27 | return all_child_state 28 | 29 | def string_comma(string): 30 | start = 0 31 | new_string = '' 32 | while start < len(string): 33 | if string[start:].find(",") == -1: 34 | new_string += string[start:] 35 | break 36 | else: 37 | index = string[start:].find(",") 38 | if string[start - 2] != "(": 39 | new_string += string[start:start + index] 40 | new_string += " " 41 | else: 42 | new_string = new_string[:start - 1] + ", " 43 | start = start + index + 1 44 | return new_string 45 | 46 | def clean_tuple_str(tuple_str): 47 | new_str_ls = [] 48 | if len(tuple_str) == 1: 49 | new_str_ls.append(tuple_str[0]) 50 | else: 51 | for i in str(tuple_str).split(", "): 52 | if i.count("'") == 2: 53 | new_str_ls.append(i.replace("'", "")) 54 | elif i.count("'") == 1: 55 | new_str_ls.append(i.replace("\"", "")) 56 | str_join = ' '.join(ele for ele in new_str_ls) 57 | return string_comma(str_join) 58 | 59 | def to_tuple(lst): 60 | return tuple(to_tuple(i) if isinstance(i, list) else i for i in lst) 61 | 62 | def get_syntax_templates(template_file): 
63 | parses = [test_str.split("<sep>")[-1].strip() for test_str in open(template_file).readlines()] 64 | parses = [clean_tuple_str(to_tuple(trim_tree_nltk(Tree.fromstring(parse_str), 3))) for 65 | parse_str in parses] 66 | return parses 67 | 68 | parser = argparse.ArgumentParser() 69 | parser.add_argument('--input_file', '-i', type=str, help="full generated file, ") 70 | parser.add_argument('--select_file', '-s', type=str) 71 | parser.add_argument('--temp_file', '-t', type=str) 72 | args = parser.parse_args() 73 | 74 | n_select_line = len(list(open(args.select_file))) 75 | 76 | input_lines = [line.strip("\n").strip() for line in open(args.input_file, "r").readlines()] 77 | indices = [] 78 | for line in open(args.select_file, "r").readlines(): 79 | new_line = line.strip("\n").strip() 80 | indices.append(input_lines.index(new_line)) 81 | 82 | temp_parses = "" 83 | if "scpn" in args.input_file.lower(): 84 | templates = [ 85 | '( ROOT ( S ( NP ) ( VP ) ( . ) ) )', 86 | '( ROOT ( S ( VP ) ( . ) ) )', 87 | '( ROOT ( NP ( NP ) ( . ) ) )', 88 | '( ROOT ( FRAG ( SBAR ) ( . ) ) )', 89 | '( ROOT ( S ( S ) ( , ) ( CC ) ( S ) ( . ) ) )', 90 | '( ROOT ( S ( LST ) ( VP ) ( . ) ) )', 91 | '( ROOT ( SBARQ ( WHADVP ) ( SQ ) ( . ) ) )', 92 | '( ROOT ( S ( PP ) ( , ) ( NP ) ( VP ) ( . ) ) )', 93 | '( ROOT ( S ( ADVP ) ( NP ) ( VP ) ( . ) ) )', 94 | '( ROOT ( S ( SBAR ) ( , ) ( NP ) ( VP ) ( . 
) ) )' 95 | ] 96 | if "qqpp" in args.input_file.lower(): 97 | temp_parses = templates * 3000 98 | elif "paranmt" in args.input_file.lower(): 99 | temp_parses = templates * 800 100 | else: 101 | temp_parses = get_syntax_templates(args.temp_file) 102 | temp_parses = [clean_tuple_str(to_tuple(trim_tree_nltk(Tree.fromstring(parse_str), 3))) for 103 | parse_str in temp_parses] 104 | 105 | if not isinstance(temp_parses, list): 106 | raise Exception("template parses are not a list!") 107 | temp_parses = [temp_parses[i] for i in indices] 108 | 109 | 110 | print("#lines - select: {}, temp: {}".format(n_select_line, len(temp_parses))) 111 | assert n_select_line == len(temp_parses), \ 112 | "#select {} != #templates {}".format(n_select_line, len(temp_parses)) 113 | 114 | spe = stanford_parsetree_extractor() 115 | select_parses = spe.run(args.select_file) 116 | select_parses = [clean_tuple_str(to_tuple(trim_tree_nltk(Tree.fromstring(parse_str), 3))) for 117 | parse_str in select_parses] 118 | 119 | spe.cleanup() 120 | 121 | all_ted = [] 122 | all_ted_t = [] 123 | 124 | # Default F1_score 125 | pbar = tqdm(zip(select_parses, temp_parses)) 126 | 127 | for select_parse, temp_parse in pbar: 128 | ted_t = compute_tree_edit_distance(select_parse, temp_parse) 129 | all_ted_t.append(ted_t) 130 | pbar.set_description( 131 | "ted-e: {:.3f}".format( 132 | sum(all_ted_t) / len(all_ted_t) 133 | )) 134 | 135 | print("ted-e: {:.3f}".format( 136 | sum(all_ted_t) / len(all_ted_t) 137 | )) -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 The HuggingFace Team. All rights reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import itertools 16 | import json 17 | import linecache 18 | import math 19 | import os 20 | import pickle 21 | import socket 22 | from logging import getLogger 23 | from pathlib import Path 24 | from typing import Callable, Dict, Iterable, List, Tuple, Union 25 | 26 | import git 27 | import numpy as np 28 | import torch 29 | import torch.distributed as dist 30 | from rouge_score import rouge_scorer, scoring 31 | from sacrebleu import corpus_bleu 32 | from torch import nn 33 | from torch.utils.data import Dataset, Sampler 34 | 35 | from sentence_splitter import add_newline_to_end_of_each_sentence 36 | from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer 37 | from transformers.file_utils import cached_property 38 | 39 | 40 | try: 41 | from fairseq.data.data_utils import batch_by_size 42 | 43 | FAIRSEQ_AVAILABLE = True 44 | except (ImportError, ModuleNotFoundError): 45 | FAIRSEQ_AVAILABLE = False 46 | 47 | 48 | def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100): 49 | """From fairseq""" 50 | if target.dim() == lprobs.dim() - 1: 51 | target = target.unsqueeze(-1) 52 | nll_loss = -lprobs.gather(dim=-1, index=target) 53 | smooth_loss = -lprobs.sum(dim=-1, keepdim=True) 54 | if ignore_index is not None: 55 | pad_mask = target.eq(ignore_index) 56 | nll_loss.masked_fill_(pad_mask, 0.0) 57 | smooth_loss.masked_fill_(pad_mask, 0.0) 58 | else: 59 | nll_loss = nll_loss.squeeze(-1) 60 | smooth_loss = smooth_loss.squeeze(-1) 61 | 62 | nll_loss = nll_loss.sum() # mean()? 
Scared to break other math. 63 | smooth_loss = smooth_loss.sum() 64 | eps_i = epsilon / lprobs.size(-1) 65 | loss = (1.0 - epsilon) * nll_loss + eps_i * smooth_loss 66 | return loss, nll_loss 67 | 68 | 69 | def lmap(f: Callable, x: Iterable) -> List: 70 | """list(map(f, x))""" 71 | return list(map(f, x)) 72 | 73 | 74 | def calculate_bleu(output_lns, refs_lns, **kwargs) -> dict: 75 | """Uses sacrebleu's corpus_bleu implementation.""" 76 | return {"bleu": round(corpus_bleu(output_lns, [refs_lns], **kwargs).score, 4)} 77 | 78 | 79 | def build_compute_metrics_fn(task_name: str, tokenizer: PreTrainedTokenizer) -> Callable[[EvalPrediction], Dict]: 80 | def non_pad_len(tokens: np.ndarray) -> int: 81 | return np.count_nonzero(tokens != tokenizer.pad_token_id) 82 | 83 | def decode_pred(pred: EvalPrediction) -> Tuple[List[str], List[str]]: 84 | pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True) 85 | label_str = tokenizer.batch_decode(pred.label_ids, skip_special_tokens=True) 86 | pred_str = lmap(str.strip, pred_str) 87 | label_str = lmap(str.strip, label_str) 88 | return pred_str, label_str 89 | 90 | def summarization_metrics(pred: EvalPrediction) -> Dict: 91 | pred_str, label_str = decode_pred(pred) 92 | rouge: Dict = calculate_rouge(pred_str, label_str) 93 | summ_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1) 94 | rouge.update({"gen_len": summ_len}) 95 | return rouge 96 | 97 | def translation_metrics(pred: EvalPrediction) -> Dict: 98 | pred_str, label_str = decode_pred(pred) 99 | bleu: Dict = calculate_bleu(pred_str, label_str) 100 | gen_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1) 101 | bleu.update({"gen_len": gen_len}) 102 | return bleu 103 | 104 | compute_metrics_fn = summarization_metrics if "summarization" in task_name else translation_metrics 105 | return compute_metrics_fn 106 | 107 | 108 | def trim_batch( 109 | input_ids, 110 | pad_token_id, 111 | attention_mask=None, 112 | ): 113 | """Remove columns 
that are populated exclusively by pad_token_id""" 114 | keep_column_mask = input_ids.ne(pad_token_id).any(dim=0) 115 | if attention_mask is None: 116 | return input_ids[:, keep_column_mask] 117 | else: 118 | return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask]) 119 | 120 | 121 | class AbstractSeq2SeqDataset(Dataset): 122 | def __init__( 123 | self, 124 | tokenizer, 125 | data_dir, 126 | max_source_length, 127 | max_target_length, 128 | type_path="train", 129 | n_obs=None, 130 | prefix="", 131 | **dataset_kwargs 132 | ): 133 | super().__init__() 134 | self.src_file = Path(data_dir).joinpath(type_path + ".source") 135 | self.tgt_file = Path(data_dir).joinpath(type_path + ".target") 136 | self.len_file = Path(data_dir).joinpath(type_path + ".len") 137 | if os.path.exists(self.len_file): 138 | self.src_lens = pickle_load(self.len_file) 139 | self.used_char_len = False 140 | else: 141 | self.src_lens = self.get_char_lens(self.src_file) 142 | self.used_char_len = True 143 | self.max_source_length = max_source_length 144 | self.max_target_length = max_target_length 145 | assert min(self.src_lens) > 0, f"found empty line in {self.src_file}" 146 | self.tokenizer = tokenizer 147 | self.prefix = prefix if prefix is not None else "" 148 | 149 | if n_obs is not None: 150 | self.src_lens = self.src_lens[:n_obs] 151 | self.pad_token_id = self.tokenizer.pad_token_id 152 | self.dataset_kwargs = dataset_kwargs 153 | dataset_kwargs.update({"add_prefix_space": True} if isinstance(self.tokenizer, BartTokenizer) else {}) 154 | 155 | def __len__(self): 156 | return len(self.src_lens) 157 | 158 | @staticmethod 159 | def get_char_lens(data_file): 160 | return [len(x) for x in Path(data_file).open().readlines()] 161 | 162 | @cached_property 163 | def tgt_lens(self): 164 | """Length in characters of target documents""" 165 | return self.get_char_lens(self.tgt_file) 166 | 167 | def make_sortish_sampler(self, batch_size, distributed=False, shuffle=True, **kwargs): 168 | 
if distributed: 169 | return DistributedSortishSampler(self, batch_size, shuffle=shuffle, **kwargs) 170 | else: 171 | return SortishSampler(self.src_lens, batch_size, shuffle=shuffle) 172 | 173 | def make_dynamic_sampler(self, max_tokens_per_batch=1024, **kwargs): 174 | assert FAIRSEQ_AVAILABLE, "Dynamic batch size requires `pip install fairseq`" 175 | assert not self.used_char_len, "You must call python make_len_file.py before calling make_dynamic_sampler" 176 | sorted_indices = list(self.make_sortish_sampler(1024, shuffle=False)) 177 | 178 | def num_tokens_in_example(i): 179 | return min(self.src_lens[i], self.max_target_length) 180 | 181 | # call fairseq cython function 182 | batch_sampler: List[List[int]] = batch_by_size( 183 | sorted_indices, 184 | num_tokens_fn=num_tokens_in_example, 185 | max_tokens=max_tokens_per_batch, 186 | required_batch_size_multiple=64, 187 | ) 188 | shuffled_batches = [batch_sampler[i] for i in np.random.permutation(range(len(batch_sampler)))] 189 | # move the largest batch to the front to OOM quickly (uses an approximation for padding) 190 | approximate_toks_per_batch = [max(self.src_lens[i] for i in batch) * len(batch) for batch in shuffled_batches] 191 | largest_batch_idx = np.argmax(approximate_toks_per_batch) 192 | shuffled_batches[0], shuffled_batches[largest_batch_idx] = ( 193 | shuffled_batches[largest_batch_idx], 194 | shuffled_batches[0], 195 | ) 196 | return shuffled_batches 197 | 198 | def __getitem__(self, item): 199 | raise NotImplementedError("You must implement this") 200 | 201 | def collate_fn(self, batch): 202 | raise NotImplementedError("You must implement this") 203 | 204 | 205 | class LegacySeq2SeqDataset(AbstractSeq2SeqDataset): 206 | def __getitem__(self, index) -> Dict[str, torch.Tensor]: 207 | """Call tokenizer on src and tgt_lines""" 208 | index = index + 1 # linecache starts at 1 209 | source_line = self.prefix + linecache.getline(str(self.src_file), index).rstrip("\n") 210 | tgt_line = 
linecache.getline(str(self.tgt_file), index).rstrip("\n") 211 | assert source_line, f"empty source line for index {index}" 212 | assert tgt_line, f"empty tgt line for index {index}" 213 | source_inputs = self.encode_line(self.tokenizer, source_line, self.max_source_length) 214 | target_inputs = self.encode_line(self.tokenizer, tgt_line, self.max_target_length) 215 | 216 | source_ids = source_inputs["input_ids"].squeeze() 217 | target_ids = target_inputs["input_ids"].squeeze() 218 | src_mask = source_inputs["attention_mask"].squeeze() 219 | return { 220 | "input_ids": source_ids, 221 | "attention_mask": src_mask, 222 | "labels": target_ids, 223 | } 224 | 225 | def encode_line(self, tokenizer, line, max_length, pad_to_max_length=True, return_tensors="pt"): 226 | """Only used by LegacyDataset""" 227 | return tokenizer( 228 | [line], 229 | max_length=max_length, 230 | padding="max_length" if pad_to_max_length else None, 231 | truncation=True, 232 | return_tensors=return_tensors, 233 | **self.dataset_kwargs, 234 | ) 235 | 236 | def collate_fn(self, batch) -> Dict[str, torch.Tensor]: 237 | input_ids = torch.stack([x["input_ids"] for x in batch]) 238 | masks = torch.stack([x["attention_mask"] for x in batch]) 239 | target_ids = torch.stack([x["labels"] for x in batch]) 240 | pad_token_id = self.pad_token_id 241 | y = trim_batch(target_ids, pad_token_id) 242 | source_ids, source_mask = trim_batch(input_ids, pad_token_id, attention_mask=masks) 243 | batch = { 244 | "input_ids": source_ids, 245 | "attention_mask": source_mask, 246 | "labels": y, 247 | } 248 | return batch 249 | 250 | 251 | class Seq2SeqDataset(AbstractSeq2SeqDataset): 252 | """A dataset that calls prepare_seq2seq_batch.""" 253 | 254 | def __getitem__(self, index) -> Dict[str, str]: 255 | index = index + 1 # linecache starts at 1 256 | source_line = self.prefix + linecache.getline(str(self.src_file), index).rstrip("\n") 257 | tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n") 258 | assert 
source_line, f"empty source line for index {index}" 259 | assert tgt_line, f"empty tgt line for index {index}" 260 | return {"tgt_texts": tgt_line, "src_texts": source_line, "id": index - 1} 261 | 262 | def collate_fn(self, batch) -> Dict[str, torch.Tensor]: 263 | """Call prepare_seq2seq_batch.""" 264 | batch_encoding: Dict[str, torch.Tensor] = self.tokenizer.prepare_seq2seq_batch( 265 | [x["src_texts"] for x in batch], 266 | tgt_texts=[x["tgt_texts"] for x in batch], 267 | max_length=self.max_source_length, 268 | max_target_length=self.max_target_length, 269 | return_tensors="pt", 270 | **self.dataset_kwargs, 271 | ).data 272 | batch_encoding["ids"] = torch.tensor([x["id"] for x in batch]) 273 | return batch_encoding 274 | 275 | 276 | class Seq2SeqDataCollator: 277 | def __init__(self, tokenizer, data_args, tpu_num_cores=None): 278 | self.tokenizer = tokenizer 279 | self.pad_token_id = tokenizer.pad_token_id 280 | assert ( 281 | self.pad_token_id is not None 282 | ), f"pad_token_id is not defined for ({self.tokenizer.__class__.__name__}), it must be defined." 
283 | self.data_args = data_args 284 | self.tpu_num_cores = tpu_num_cores 285 | self.dataset_kwargs = {"add_prefix_space": True} if isinstance(tokenizer, BartTokenizer) else {} 286 | if data_args.src_lang is not None: 287 | self.dataset_kwargs["src_lang"] = data_args.src_lang 288 | if data_args.tgt_lang is not None: 289 | self.dataset_kwargs["tgt_lang"] = data_args.tgt_lang 290 | 291 | def __call__(self, batch) -> Dict[str, torch.Tensor]: 292 | if hasattr(self.tokenizer, "prepare_seq2seq_batch"): 293 | batch = self._encode(batch) 294 | input_ids, attention_mask, labels = ( 295 | batch["input_ids"], 296 | batch["attention_mask"], 297 | batch["labels"], 298 | ) 299 | else: 300 | input_ids = torch.stack([x["input_ids"] for x in batch]) 301 | attention_mask = torch.stack([x["attention_mask"] for x in batch]) 302 | labels = torch.stack([x["labels"] for x in batch]) 303 | 304 | labels = trim_batch(labels, self.pad_token_id) 305 | input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask) 306 | 307 | batch = { 308 | "input_ids": input_ids, 309 | "attention_mask": attention_mask, 310 | "labels": labels, 311 | } 312 | return batch 313 | 314 | def _shift_right_t5(self, input_ids): 315 | # shift inputs to the right 316 | shifted_input_ids = input_ids.new_zeros(input_ids.shape) 317 | shifted_input_ids[..., 1:] = input_ids[..., :-1].clone() 318 | shifted_input_ids[..., 0] = self.pad_token_id 319 | return shifted_input_ids 320 | 321 | def _encode(self, batch) -> Dict[str, torch.Tensor]: 322 | batch_encoding = self.tokenizer.prepare_seq2seq_batch( 323 | [x["src_texts"] for x in batch], 324 | tgt_texts=[x["tgt_texts"] for x in batch], 325 | max_length=self.data_args.max_source_length, 326 | max_target_length=self.data_args.max_target_length, 327 | padding="max_length" if self.tpu_num_cores is not None else "longest", # TPU hack 328 | return_tensors="pt", 329 | **self.dataset_kwargs, 330 | ) 331 | return batch_encoding.data 332 | 333 | 334 | 
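The collator above depends on `trim_batch` (defined earlier in this file) to drop columns that are entirely padding before building the final batch. A minimal sketch of that behavior, with `trim_batch` copied from this file and a toy batch (torch assumed available; the token values are made up):

```python
import torch


def trim_batch(input_ids, pad_token_id, attention_mask=None):
    """Remove columns that are populated exclusively by pad_token_id (copied from utils.py)."""
    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
    if attention_mask is None:
        return input_ids[:, keep_column_mask]
    return input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask]


# A toy batch padded to length 5 with pad_token_id=1; the last two columns are all padding.
batch = torch.tensor([[0, 8, 7, 1, 1],
                      [0, 5, 1, 1, 1]])
trimmed = trim_batch(batch, pad_token_id=1)
print(trimmed.shape)  # torch.Size([2, 3]) -- only the all-pad columns are removed
```

Note that the third column survives even though it contains one pad token: a column is dropped only when *every* row is padding, so ragged sequences keep their per-row padding.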
class SortishSampler(Sampler): 335 | "Go through the text data by order of src length with a bit of randomness. From fastai repo." 336 | 337 | def __init__(self, data, batch_size, shuffle=True): 338 | self.data, self.bs, self.shuffle = data, batch_size, shuffle 339 | 340 | def __len__(self) -> int: 341 | return len(self.data) 342 | 343 | def __iter__(self): 344 | return iter(sortish_sampler_indices(self.data, self.bs, shuffle=self.shuffle)) 345 | 346 | 347 | def sortish_sampler_indices(data: List, bs: int, shuffle=True) -> np.array: 348 | "Go through the text data by order of src length with a bit of randomness. From fastai repo." 349 | if not shuffle: 350 | return np.argsort(np.array(data) * -1) 351 | 352 | def key_fn(i): 353 | return data[i] 354 | 355 | idxs = np.random.permutation(len(data)) 356 | sz = bs * 50 357 | ck_idx = [idxs[i : i + sz] for i in range(0, len(idxs), sz)] 358 | sort_idx = np.concatenate([sorted(s, key=key_fn, reverse=True) for s in ck_idx]) 359 | sz = bs 360 | ck_idx = [sort_idx[i : i + sz] for i in range(0, len(sort_idx), sz)] 361 | max_ck = np.argmax([key_fn(ck[0]) for ck in ck_idx]) # find the chunk with the largest key, 362 | ck_idx[0], ck_idx[max_ck] = ck_idx[max_ck], ck_idx[0] # then make sure it goes first. 
363 | sort_idx = np.concatenate(np.random.permutation(ck_idx[1:])) if len(ck_idx) > 1 else np.array([], dtype=int)  # np.int alias removed in modern NumPy 364 | sort_idx = np.concatenate((ck_idx[0], sort_idx)) 365 | return sort_idx 366 | 367 | 368 | class DistributedSortishSampler(Sampler): 369 | """Copied from torch DistributedSampler""" 370 | 371 | def __init__(self, dataset, batch_size, num_replicas=None, rank=None, add_extra_examples=True, shuffle=True): 372 | if num_replicas is None: 373 | if not dist.is_available(): 374 | raise RuntimeError("Requires distributed package to be available") 375 | num_replicas = dist.get_world_size() 376 | if rank is None: 377 | if not dist.is_available(): 378 | raise RuntimeError("Requires distributed package to be available") 379 | rank = dist.get_rank() 380 | self.dataset = dataset 381 | self.num_replicas = num_replicas 382 | self.rank = rank 383 | self.epoch = 0 384 | if add_extra_examples: 385 | self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas)) 386 | self.total_size = self.num_samples * self.num_replicas 387 | else: 388 | self.total_size = len(dataset) 389 | self.num_samples = len(self.available_indices) 390 | self.batch_size = batch_size 391 | self.add_extra_examples = add_extra_examples 392 | self.shuffle = shuffle 393 | 394 | def __iter__(self) -> Iterable: 395 | g = torch.Generator() 396 | g.manual_seed(self.epoch) 397 | 398 | sortish_data = [self.dataset.src_lens[i] for i in self.available_indices] 399 | sortish_indices = sortish_sampler_indices(sortish_data, self.batch_size, shuffle=self.shuffle) 400 | indices = [self.available_indices[i] for i in sortish_indices] 401 | assert len(indices) == self.num_samples 402 | return iter(indices) 403 | 404 | @cached_property 405 | def available_indices(self) -> np.array: 406 | indices = list(range(len(self.dataset))) 407 | # add extra samples to make it evenly divisible 408 | indices += indices[: (self.total_size - len(indices))] 409 | assert len(indices) == self.total_size 410 | # 
subsample 411 | available_indices = indices[self.rank : self.total_size : self.num_replicas] 412 | return available_indices 413 | 414 | def __len__(self): 415 | return self.num_samples 416 | 417 | def set_epoch(self, epoch): 418 | self.epoch = epoch 419 | 420 | 421 | logger = getLogger(__name__) 422 | 423 | 424 | def use_task_specific_params(model, task): 425 | """Update config with summarization specific params.""" 426 | task_specific_params = model.config.task_specific_params 427 | 428 | if task_specific_params is not None: 429 | pars = task_specific_params.get(task, {}) 430 | logger.info(f"setting model.config to task specific params for {task}:\n {pars}") 431 | logger.info("note: command line args may override some of these") 432 | model.config.update(pars) 433 | 434 | 435 | def pickle_load(path): 436 | """pickle.load(path)""" 437 | with open(path, "rb") as f: 438 | return pickle.load(f) 439 | 440 | 441 | def pickle_save(obj, path): 442 | """pickle.dump(obj, path)""" 443 | with open(path, "wb") as f: 444 | return pickle.dump(obj, f) 445 | 446 | 447 | def flatten_list(summary_ids: List[List]): 448 | return [x for x in itertools.chain.from_iterable(summary_ids)] 449 | 450 | 451 | def save_git_info(folder_path: str) -> None: 452 | """Save git information to output_dir/git_log.json""" 453 | repo_infos = get_git_info() 454 | save_json(repo_infos, os.path.join(folder_path, "git_log.json")) 455 | 456 | 457 | def save_json(content, path, indent=4, **json_dump_kwargs): 458 | with open(path, "w") as f: 459 | json.dump(content, f, indent=indent, sort_keys=True, **json_dump_kwargs) 460 | 461 | 462 | def load_json(path): 463 | with open(path) as f: 464 | return json.load(f) 465 | 466 | 467 | def get_git_info(): 468 | try: 469 | repo = git.Repo(search_parent_directories=True) 470 | repo_infos = { 471 | "repo_id": str(repo), 472 | "repo_sha": str(repo.head.object.hexsha), 473 | "repo_branch": str(repo.active_branch), 474 | "hostname": str(socket.gethostname()), 475 | } 476 | 
return repo_infos 477 | except TypeError: 478 | return { 479 | "repo_id": None, 480 | "repo_sha": None, 481 | "repo_branch": None, 482 | "hostname": None, 483 | } 484 | 485 | 486 | ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"] 487 | 488 | 489 | def extract_rouge_mid_statistics(dct): 490 | new_dict = {} 491 | for k1, v1 in dct.items(): 492 | mid = v1.mid 493 | new_dict[k1] = {stat: round(getattr(mid, stat), 4) for stat in ["precision", "recall", "fmeasure"]} 494 | return new_dict 495 | 496 | 497 | def calculate_rouge( 498 | pred_lns: List[str], 499 | tgt_lns: List[str], 500 | use_stemmer=True, 501 | rouge_keys=ROUGE_KEYS, 502 | return_precision_and_recall=False, 503 | bootstrap_aggregation=True, 504 | newline_sep=True, 505 | ) -> Dict: 506 | """Calculate rouge using rouge_scorer package. 507 | 508 | Args: 509 | pred_lns: list of summaries generated by model 510 | tgt_lns: list of groundtruth summaries (e.g. contents of val.target) 511 | use_stemmer: Bool indicating whether Porter stemmer should be used to 512 | strip word suffixes to improve matching. 513 | rouge_keys: which metrics to compute, defaults to rouge1, rouge2, rougeL, rougeLsum 514 | return_precision_and_recall: (False) whether to also return precision and recall. 515 | bootstrap_aggregation: whether to do the typical bootstrap resampling of scores. Defaults to True, if False 516 | this function returns a collections.defaultdict[metric: list of values for each observation for each subscore] 517 | newline_sep: (default=True) whether to add newline between sentences. This is essential for calculating rougeL 518 | on multi sentence summaries (CNN/DM dataset).
519 | 520 | Returns: 521 | Dict[score: value] if aggregate else defaultdict(list) keyed by rouge_keys 522 | 523 | """ 524 | scorer = rouge_scorer.RougeScorer(rouge_keys, use_stemmer=use_stemmer) 525 | aggregator = scoring.BootstrapAggregator() 526 | for pred, tgt in zip(tgt_lns, pred_lns): 527 | # rougeLsum expects "\n" separated sentences within a summary 528 | if newline_sep: 529 | pred = add_newline_to_end_of_each_sentence(pred) 530 | tgt = add_newline_to_end_of_each_sentence(tgt) 531 | scores = scorer.score(pred, tgt) 532 | aggregator.add_scores(scores) 533 | 534 | if bootstrap_aggregation: 535 | result = aggregator.aggregate() 536 | if return_precision_and_recall: 537 | return extract_rouge_mid_statistics(result) # here we return dict 538 | else: 539 | return {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()} 540 | 541 | else: 542 | return aggregator._scores # here we return defaultdict(list) 543 | 544 | 545 | # Utilities for freezing parameters and checking whether they are frozen 546 | 547 | 548 | def freeze_params(model: nn.Module): 549 | """Set requires_grad=False for each of model.parameters()""" 550 | for par in model.parameters(): 551 | par.requires_grad = False 552 | 553 | 554 | def freeze_embeds(model): 555 | """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5.""" 556 | model_type = model.config.model_type 557 | 558 | if model_type == "t5": 559 | freeze_params(model.shared) 560 | for d in [model.encoder, model.decoder]: 561 | freeze_params(d.embed_tokens) 562 | elif model_type == "fsmt": 563 | for d in [model.model.encoder, model.model.decoder]: 564 | freeze_params(d.embed_positions) 565 | freeze_params(d.embed_tokens) 566 | else: 567 | freeze_params(model.model.shared) 568 | for d in [model.model.encoder, model.model.decoder]: 569 | freeze_params(d.embed_positions) 570 | freeze_params(d.embed_tokens) 571 | 572 | 573 | def grad_status(model: nn.Module) -> Iterable: 574 | return (par.requires_grad for 
par in model.parameters()) 575 | 576 | 577 | def any_requires_grad(model: nn.Module) -> bool: 578 | return any(grad_status(model)) 579 | 580 | 581 | def assert_all_frozen(model): 582 | model_grads: List[bool] = list(grad_status(model)) 583 | n_require_grad = sum(lmap(int, model_grads)) 584 | npars = len(model_grads) 585 | assert not any(model_grads), f"{n_require_grad/npars:.1%} of {npars} weights require grad" 586 | 587 | 588 | def assert_not_all_frozen(model): 589 | model_grads: List[bool] = list(grad_status(model)) 590 | npars = len(model_grads) 591 | assert any(model_grads), f"none of {npars} weights require grad" 592 | 593 | 594 | def parse_numeric_n_bool_cl_kwargs(unparsed_args: List[str]) -> Dict[str, Union[int, float, bool]]: 595 | """ 596 | Parse an argv list of unspecified command line args to a dict. 597 | Assumes all values are either numeric or boolean in the form of true/false. 598 | """ 599 | result = {} 600 | assert len(unparsed_args) % 2 == 0, f"got odd number of unparsed args: {unparsed_args}" 601 | num_pairs = len(unparsed_args) // 2 602 | for pair_num in range(num_pairs): 603 | i = 2 * pair_num 604 | assert unparsed_args[i].startswith("--") 605 | if unparsed_args[i + 1].lower() == "true": 606 | value = True 607 | elif unparsed_args[i + 1].lower() == "false": 608 | value = False 609 | else: 610 | try: 611 | value = int(unparsed_args[i + 1]) 612 | except ValueError: 613 | value = float(unparsed_args[i + 1]) # this can raise another informative ValueError 614 | 615 | result[unparsed_args[i][2:]] = value 616 | return result 617 | 618 | 619 | def write_txt_file(ordered_tgt, path): 620 | f = Path(path).open("w") 621 | for ln in ordered_tgt: 622 | f.write(ln + "\n") 623 | f.flush() 624 | 625 | 626 | def chunks(lst, n): 627 | """Yield successive n-sized chunks from lst.""" 628 | for i in range(0, len(lst), n): 629 | yield lst[i : i + n] 630 | 631 | 632 | def check_output_dir(args, expected_items=0): 633 | """ 634 | Checks whether to bail out if 
output_dir already exists and has more than expected_items in it 635 | 636 | `args`: needs to have the following attributes of `args`: 637 | - output_dir 638 | - do_train 639 | - overwrite_output_dir 640 | 641 | `expected_items`: normally 0 (default) - i.e. empty dir, but in some cases a few files are expected (e.g. recovery from OOM) 642 | """ 643 | if ( 644 | os.path.exists(args.output_dir) 645 | and len(os.listdir(args.output_dir)) > expected_items 646 | and args.do_train 647 | and not args.overwrite_output_dir 648 | ): 649 | raise ValueError( 650 | f"Output directory ({args.output_dir}) already exists and " 651 | f"has {len(os.listdir(args.output_dir))} items in it (expected {expected_items} items). " 652 | "Use --overwrite_output_dir to overcome." 653 | ) 654 | --------------------------------------------------------------------------------
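As a quick sanity check of `parse_numeric_n_bool_cl_kwargs` above, the sketch below copies the function verbatim from `utils.py` and runs it on a hypothetical argv tail (the flag names `--eval_beams`, `--length_penalty`, and `--fp16` are illustrative, not a fixed interface):

```python
from typing import Dict, List, Union


def parse_numeric_n_bool_cl_kwargs(unparsed_args: List[str]) -> Dict[str, Union[int, float, bool]]:
    """Copied from utils.py: parse --flag value pairs; values must be numeric or true/false."""
    result = {}
    assert len(unparsed_args) % 2 == 0, f"got odd number of unparsed args: {unparsed_args}"
    num_pairs = len(unparsed_args) // 2
    for pair_num in range(num_pairs):
        i = 2 * pair_num
        assert unparsed_args[i].startswith("--")
        if unparsed_args[i + 1].lower() == "true":
            value = True
        elif unparsed_args[i + 1].lower() == "false":
            value = False
        else:
            try:
                value = int(unparsed_args[i + 1])
            except ValueError:
                value = float(unparsed_args[i + 1])  # this can raise another informative ValueError
        result[unparsed_args[i][2:]] = value
    return result


extra = parse_numeric_n_bool_cl_kwargs(["--eval_beams", "4", "--length_penalty", "0.6", "--fp16", "true"])
print(extra)  # {'eval_beams': 4, 'length_penalty': 0.6, 'fp16': True}
```

Integer parsing is attempted before float, so `"4"` becomes `int` while `"0.6"` falls through to `float`; any non-numeric, non-boolean value raises a `ValueError`.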