├── .gitignore ├── .gitmodules ├── LICENSE ├── README.md ├── data └── quora_dataset │ ├── quora_data_prepro.h5 │ └── quora_data_prepro.json ├── environment.yml ├── misc ├── __init__.py ├── dataloader.py ├── net_utils.py ├── train_util.py └── utils.py ├── models ├── enc_dec_dis.py └── enc_dec_sh_dis.py ├── prepro ├── prepro_quora.py └── quora_prepro.py ├── pycocoevalcap ├── .gitignore ├── README.md ├── bleu │ ├── LICENSE │ ├── bleu.py │ └── bleu_scorer.py ├── cider │ ├── cider.py │ └── cider_scorer.py ├── eval.py ├── license.txt ├── meteor │ ├── data │ │ └── paraphrase-en.gz │ ├── meteor-1.5.jar │ └── meteor.py ├── rouge │ └── rouge.py ├── setup.py └── tokenizer │ ├── ptbtokenizer.py │ └── stanford-corenlp-3.4.1.jar ├── train.sh └── training ├── train_edlp.py └── train_edlps.py /.gitignore: -------------------------------------------------------------------------------- 1 | tst.py 2 | *.pyc 3 | */__pycache__/* 4 | runs/* 5 | runs 6 | ignore.txt 7 | .vscode 8 | .vscode/* 9 | logs 10 | logs/* 11 | save 12 | save/* 13 | samples 14 | samples/* 15 | result 16 | result/* 17 | */__pycache__/* 18 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "pycocoevalcap"] 2 | path = pycocoevalcap 3 | url = https://github.com/salaniz/pycocoevalcap.git 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Dev Chauhan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Paraphrase Question Generator using Shared Discriminator 2 | 3 | PyTorch code for a Paraphrase Question Generator. This codebase is built upon the paper [Learning Semantic Sentence Embeddings using Pair-wise Discriminator](https://www.aclweb.org/anthology/C18-1230.pdf). 4 | 5 | If you want the exact code used for the experiments, head over to the `orig-code` branch. This branch provides the main models from the above paper in a more readable and usable form. 6 | 7 | ### Requirements and Setup 8 | 9 | ##### Use Anaconda or Miniconda 10 | 11 | 1. 
Install a Python 3 distribution of Anaconda or Miniconda from their [downloads site](https://conda.io/docs/user-guide/install/download.html). 12 | 2. Clone this repository and create an environment: 13 | 14 | ``` 15 | git clone https://www.github.com/dev-chauhan/PQG-pytorch 16 | cd PQG-pytorch 17 | conda env create -f environment.yml 18 | 19 | # activate the environment 20 | conda activate PQG 21 | ``` 22 | 3. After that, install [tensorboardX](https://github.com/lanpa/tensorboardX) for logging. 23 | ``` 24 | pip install tensorboardX 25 | ``` 26 | ### Dataset 27 | 28 | You can either download the following files directly into the `data` folder or generate them by following the process shown below. 29 | ##### Data Files 30 | Download all the data files from here. 31 | - [quora_data_prepro.h5](https://figshare.com/s/5463afb24cba05629cdf) 32 | - [quora_data_prepro.json](https://figshare.com/s/5463afb24cba05629cdf) 33 | - [quora_raw_train.json](https://figshare.com/s/5463afb24cba05629cdf) 34 | - [quora_raw_val.json](https://figshare.com/s/5463afb24cba05629cdf) 35 | - [quora_raw_test.json](https://figshare.com/s/5463afb24cba05629cdf) 36 | 37 | 38 | ##### Download Dataset 39 | We referred to [neuraltalk2](https://github.com/karpathy/neuraltalk2) and [Text-to-Image Synthesis](https://github.com/reedscot/icml2016) while preparing our code base. The first thing you need to do is download the Quora Question Pairs dataset from the [Quora Question Pair website](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) and place it in the `data` folder. 40 | 41 | If you want to train from scratch, continue reading; if you just want to evaluate using a pretrained model, head over to the `Data Files` section above, download the data files (put all of them in the `data` folder), and run `score.py` to evaluate the pretrained model. 42 | 43 | Now we need to do some preprocessing. Head over to the `prepro` folder and run 44 | 45 | ``` 46 | $ cd prepro 47 | $ python quora_prepro.py 48 | ``` 49 | 50 | **Note** The above command generates JSON files with 100K question pairs for the train set, 5K for the validation set, and 30K for the test set. 51 | If you instead want to use only 50K question pairs for training (with the rest unchanged), you need to make some minor changes in the above file. This step generates `quora_raw_train.json`, `quora_raw_val.json`, and `quora_raw_test.json` under the `data` folder. 52 | 53 | ##### Preprocess Paraphrase Question 54 | 55 | ``` 56 | $ python prepro_quora.py --input_train_json ../data/quora_raw_train.json --input_test_json ../data/quora_raw_test.json 57 | ``` 58 | This will generate two files in the `data/` folder, `quora_data_prepro.h5` and `quora_data_prepro.json`. 59 | 60 | ### Training 61 | 62 | ``` 63 | $ ./train.sh --n_epoch <number of epochs> 64 | ``` 65 | 66 | You can change the training and validation dataset lengths by adding the arguments `--train_dataset_len` and `--val_dataset_len`, which default to `100000` and `30000`, the maximum values. 67 | 68 | There are other arguments to experiment with as well, such as `--batch_size`, `--learning_rate`, `--drop_prob_lm`, etc. 69 | 70 | You can resume training using the `--start_from` argument, which takes the path of a saved model checkpoint; a sketch of how such a checkpoint can be restored is shown below.
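Checkpoints are written by `save_model` in `misc/train_util.py` with the keys `epoch`, `model`, and `model_opt`. A minimal sketch of restoring one (how `model` and `model_opt` are constructed is assumed to come from your training script):

```python
import torch

def load_checkpoint(path, model, model_opt):
    # checkpoint layout follows misc/train_util.save_model
    checkpoint = torch.load(path, map_location='cpu')
    model.load_state_dict(checkpoint['model'])
    model_opt.load_state_dict(checkpoint['model_opt'])
    return checkpoint['epoch']  # epoch to resume from
```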
71 | ### Save and log 72 | 73 | First you have to make the empty directories `save`, `samples`, and `logs`. 74 | For each training run there will be a directory with a unique name in `save`. Saved models are `.tar` files; one model file per epoch is saved in that directory. 75 | 76 | The `samples` directory contains a directory with the same unique name as above, which holds a `.txt` file per epoch (`<epoch>_train.txt` or `<epoch>_val.txt`) with the paraphrases generated by the model on the corresponding data set at the end of that epoch. 77 | 78 | Logs for training and evaluation are stored in the `logs` directory and can be viewed with `tensorboard` by running the following command. 79 | ``` 80 | tensorboard --logdir <path to logs directory> 81 | ``` 82 | This command will tell you where to view your logs in a browser; commonly it is `localhost:6006`, but you can change the port using the `--port` argument of the above command. 83 | 84 | ### Results 85 | Following are the results of some models on the 100K Quora question pairs dataset. 86 | 87 | Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr | 88 | ---|--|--|--|--|--|--|--| 89 | EDL|0.4162|0.2578|0.1724|0.1219|0.4191|0.3244|0.6189| 90 | EDLPS|0.4754|0.3160|0.2249|0.1672|0.4781|0.3488|1.0949| 91 | 92 | Following are the results of some models on the 50K Quora question pairs dataset. 93 | 94 | Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr | 95 | ---|--|--|--|--|--|--|--| 96 | EDL|0.3877|0.2336|0.1532|0.1067|0.3913|0.3133|0.4550| 97 | EDLPS|0.4553|0.2981 |0.2105|0.1560|0.4583|0.3421|0.9690| 98 | 
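These scores are computed with the bundled `pycocoevalcap` toolkit through `evaluate_scores` in `misc/train_util.py`. A minimal sketch of scoring a few sentence pairs (run from the repository root; the METEOR metric additionally needs Java):

```python
from misc.train_util import evaluate_scores

# toy reference / generated paraphrase pairs, index-aligned
real = ["how do i learn python quickly", "what is machine learning"]
generated = ["how can i learn python fast", "what does machine learning mean"]

scores = evaluate_scores(real, generated)
for name, value in scores.items():  # e.g. Bleu_1 ... CIDEr
    print(name, value)
```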
99 | ### Reference 100 | 101 | If you use this code as part of any published research, please acknowledge the following papers 102 | 103 | ``` 104 | @inproceedings{patro2018learning, 105 | title={Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator}, 106 | author={Patro, Badri Narayana and Kurmi, Vinod Kumar and Kumar, Sandeep and Namboodiri, Vinay}, 107 | booktitle={Proceedings of the 27th International Conference on Computational Linguistics}, 108 | pages={2715--2729}, 109 | year={2018} 110 | } 111 | 112 | 113 | @article{PATRO2021149, 114 | title = {Revisiting paraphrase question generator using pairwise discriminator}, 115 | author = {Badri N. Patro and Dev Chauhan and Vinod K. Kurmi and Vinay P. Namboodiri}, 116 | journal = {Neurocomputing}, 117 | volume = {420}, 118 | pages = {149-161}, 119 | year = {2021}, 120 | issn = {0925-2312}, 121 | doi = {https://doi.org/10.1016/j.neucom.2020.08.022}, 122 | url = {https://www.sciencedirect.com/science/article/pii/S0925231220312820} 123 | } 124 | ``` 125 | 126 | ## Contributors 127 | * [Dev Chauhan][1] (devgiri@iitk.ac.in) 128 | * [Badri N. Patro][2] (badri@iitk.ac.in) 129 | * [Vinod K. Kurmi][3] (vinodkk@iitk.ac.in) 130 | 131 | [1]: https://github.com/dev-chauhan 132 | [2]: https://github.com/badripatro 133 | [3]: https://github.com/vinodkkurmi 134 | 135 | 136 | 137 | -------------------------------------------------------------------------------- /data/quora_dataset/quora_data_prepro.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/data/quora_dataset/quora_data_prepro.h5 -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: PQG 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _libgcc_mutex=0.1=main 6 | - attrs=19.1.0=py37_1 7 | - backcall=0.1.0=py37_0 8 | - blas=1.0=mkl 9 | - bleach=3.1.0=py37_0 10 | - ca-certificates=2019.11.27=0 11 | - certifi=2019.11.28=py37_0 12 | - cffi=1.12.3=py37h2e261b9_0 13 | - cycler=0.10.0=py37_0 14 | - dbus=1.13.6=h746ee38_0 15 | - decorator=4.4.0=py37_1 16 | - defusedxml=0.6.0=py_0 17 | - entrypoints=0.3=py37_0 18 | - expat=2.2.6=he6710b0_0 19 | - fontconfig=2.13.0=h9420a91_0 20 | - freetype=2.9.1=h8a8886c_1 21 | - glib=2.56.2=hd408876_0 22 | - gmp=6.1.2=h6c8ec71_1 23 | - gst-plugins-base=1.14.0=hbbd80ab_1 24 | - gstreamer=1.14.0=hb453b48_1 25 | - h5py=2.9.0=py37h7918eee_0 26 | - hdf5=1.10.4=hb1b8bf9_0 27 | - icu=58.2=h9c2bf20_1 28 | - intel-openmp=2019.4=243 29 | - ipykernel=5.1.1=py37h39e3cac_0 30 | - ipython=7.7.0=py37h39e3cac_0 31 | - ipython_genutils=0.2.0=py37_0 32 | - ipywidgets=7.5.0=py_0 33 | - jedi=0.13.3=py37_0 34 | - jinja2=2.10.1=py37_0 35 | - jpeg=9b=h024ee3a_2 36 | - jsonschema=3.0.1=py37_0 37 | - jupyter=1.0.0=py37_7 38 | - jupyter_client=5.3.1=py_0 39 | - jupyter_console=6.0.0=py37_0 40 | - jupyter_core=4.5.0=py_0 41 | - kiwisolver=1.1.0=py37he6710b0_0 42 | - libedit=3.1.20181209=hc058e9b_0 43 | - libffi=3.2.1=hd88cf55_4 44 | - libgcc-ng=9.1.0=hdf63c60_0 45 | - libgfortran-ng=7.3.0=hdf63c60_0 46 | - libpng=1.6.37=hbc83047_0 47 | - libsodium=1.0.16=h1bed415_0 48 | - libstdcxx-ng=9.1.0=hdf63c60_0 49 | - libtiff=4.0.10=h2733197_2 50 | - libuuid=1.0.3=h1bed415_2 51 | - libxcb=1.13=h1bed415_1 52 | - libxml2=2.9.9=hea5a465_1 53 | - markupsafe=1.1.1=py37h7b6447c_0 54 | - matplotlib=3.1.0=py37h5429711_0 55 | - mistune=0.8.4=py37h7b6447c_0 56 | - mkl=2019.4=243 57 | - mkl-service=2.0.2=py37h7b6447c_0 58 | - mkl_fft=1.0.12=py37ha843d7b_0 59 | - mkl_random=1.0.2=py37hd81dba3_0 60 | - nbconvert=5.5.0=py_0 61 | - nbformat=4.4.0=py37_0 62 | - ncurses=6.1=he6710b0_1 63 | - ninja=1.9.0=py37hfd86e86_0 64 | - nltk=3.4.4=py37_0 65 | - notebook=6.0.0=py37_0 66 | - numpy=1.16.4=py37h7e9f1db_0 67 | - numpy-base=1.16.4=py37hde5b4d6_0 68 | - olefile=0.46=py37_0 69 | - openjdk=8.0.152=h46b5887_1 70 | - openssl=1.1.1d=h7b6447c_3 71 | - pandoc=2.2.3.2=0 72 | - pandocfilters=1.4.2=py37_1 73 | - parso=0.5.0=py_0 74 | - pcre=8.43=he6710b0_0 75 | - pexpect=4.7.0=py37_0 76 | - pickleshare=0.7.5=py37_0 77 | - pillow=6.1.0=py37h34e0f95_0 78 | - pip=19.1.1=py37_0 79 | - prometheus_client=0.7.1=py_0 80 | - prompt_toolkit=2.0.9=py37_0 81 | - ptyprocess=0.6.0=py37_0 82 | - pycparser=2.19=py37_0 83 | - pygments=2.4.2=py_0 84 | - pyparsing=2.4.0=py_0 85 | - pyqt=5.9.2=py37h05f1152_2 86 | - pyrsistent=0.14.11=py37h7b6447c_0 87 | - python=3.7.3=h0371630_0 88 | - python-dateutil=2.8.0=py37_0 89 | - pytz=2019.1=py_0 90 | - pyzmq=18.0.0=py37he6710b0_0 91 | - qt=5.9.7=h5867ecd_1 92 | - 
qtconsole=4.5.2=py_0 93 | - readline=7.0=h7b6447c_5 94 | - scipy=1.3.0=py37h7c811a0_0 95 | - send2trash=1.5.0=py37_0 96 | - setuptools=41.0.1=py37_0 97 | - sip=4.19.8=py37hf484d3e_0 98 | - six=1.12.0=py37_0 99 | - sqlite=3.29.0=h7b6447c_0 100 | - terminado=0.8.2=py37_0 101 | - testpath=0.4.2=py37_0 102 | - tk=8.6.8=hbc83047_0 103 | - tornado=6.0.3=py37h7b6447c_0 104 | - tqdm=4.41.1=py_0 105 | - traitlets=4.3.2=py37_0 106 | - wcwidth=0.1.7=py37_0 107 | - webencodings=0.5.1=py37_1 108 | - wheel=0.33.4=py37_0 109 | - widgetsnbextension=3.5.0=py37_0 110 | - xz=5.2.4=h14c3975_4 111 | - zeromq=4.3.1=he6710b0_3 112 | - zlib=1.2.11=h7b6447c_3 113 | - zstd=1.3.7=h0b5b093_0 114 | - pip: 115 | - bandit==1.6.2 116 | - gitdb2==2.0.6 117 | - gitpython==3.0.5 118 | - imageio==2.6.1 119 | - pbr==5.4.4 120 | - progressbar==2.5 121 | - progressbar2==3.47.0 122 | - protobuf==3.11.2 123 | - python-utils==2.3.0 124 | - pyyaml==5.2 125 | - smmap2==2.0.5 126 | - stevedore==1.31.0 127 | - tensorboardx==2.0 128 | - torch==1.4.0 129 | - torchvision==0.3.0 130 | -------------------------------------------------------------------------------- /misc/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/misc/__init__.py -------------------------------------------------------------------------------- /misc/dataloader.py: -------------------------------------------------------------------------------- 1 | import h5py 2 | import json 3 | import torch 4 | import torch.utils.data as data 5 | 6 | 7 | class Dataloader(data.Dataset): 8 | 9 | def __init__(self, input_json_file_path, input_ques_h5_path): 10 | 11 | super(Dataloader, self).__init__() 12 | print('Reading', input_json_file_path) 13 | 14 | with open(input_json_file_path) as input_file: 15 | data_dict = json.load(input_file) 16 | 17 | self.ix_to_word = {} 18 | 19 | for k in data_dict['ix_to_word']: 20 | self.ix_to_word[int(k)] = data_dict['ix_to_word'][k] 21 | 22 | self.UNK_token = 0 23 | 24 | if 0 not in self.ix_to_word: 25 | self.ix_to_word[0] = '<UNK>' 26 | 27 | else : 28 | raise Exception('index 0 of the vocabulary is reserved for <UNK>') 29 | 30 | self.EOS_token = len(self.ix_to_word) 31 | self.ix_to_word[self.EOS_token] = '<EOS>' 32 | self.PAD_token = len(self.ix_to_word) 33 | self.ix_to_word[self.PAD_token] = '<PAD>' 34 | self.SOS_token = len(self.ix_to_word) 35 | self.ix_to_word[self.SOS_token] = '<SOS>' 36 | self.vocab_size = len(self.ix_to_word) 37 | print('DataLoader loading h5 question file:', input_ques_h5_path) 38 | qa_data = h5py.File(input_ques_h5_path, 'r') 39 | 40 | ques_id_train = torch.from_numpy(qa_data['ques_cap_id_train'][...].astype(int)) 41 | 42 | ques_train, ques_len_train = self.process_data(torch.from_numpy(qa_data['ques_train'][...].astype(int)), torch.from_numpy(qa_data['ques_length_train'][...].astype(int))) 43 | 44 | label_train, label_len_train = self.process_data(torch.from_numpy(qa_data['ques1_train'][...].astype(int)), torch.from_numpy(qa_data['ques1_length_train'][...].astype(int))) 45 | 46 | self.train_id = 0 47 | self.seq_length = ques_train.size()[1] 48 | 49 | print('Training dataset length : ', ques_train.size()[0]) 50 | 51 | 52 | ques_test, ques_len_test = self.process_data(torch.from_numpy(qa_data['ques_test'][...].astype(int)), torch.from_numpy(qa_data['ques_length_test'][...].astype(int))) 53 | 54 | label_test, label_len_test = self.process_data(torch.from_numpy(qa_data['ques1_test'][...].astype(int)), torch.from_numpy(qa_data['ques1_length_test'][...].astype(int))) 55 | 56 | ques_id_test = torch.from_numpy(qa_data['ques_cap_id_test'][...].astype(int)) 57 | 58 | self.test_id = 0 59 | 60 | print('Test dataset length : ', ques_test.size()[0]) 61 | qa_data.close() 62 | 63 | self.ques = torch.cat([ques_train, ques_test]) 64 | self.len = torch.cat([ques_len_train, ques_len_test]) 65 | self.label = torch.cat([label_train, label_test]) 66 | self.label_len = torch.cat([label_len_train, label_len_test]) 67 | self.id = torch.cat([ques_id_train, ques_id_test]) 68 | 69 | def process_data(self, data, data_len):  # wrap each sequence with <SOS> ... <EOS>, pad with <PAD> 70 | N = data.size()[0] 71 | new_data = torch.zeros(N, data.size()[1] + 2, dtype=torch.long) + self.PAD_token 72 | for i in range(N): 73 | new_data[i, 1:data_len[i]+1] = data[i, :data_len[i]] 74 | new_data[i, 0] = self.SOS_token 75 | new_data[i, data_len[i]+1] = self.EOS_token 76 | data_len[i] += 2 77 | return new_data, data_len 78 | 79 | def __len__(self): 80 | return self.len.size()[0] 81 | 82 | def __getitem__(self, idx): 83 | return (self.ques[idx], self.len[idx], self.label[idx], self.label_len[idx], self.id[idx]) 84 | 85 | def getVocabSize(self): 86 | return self.vocab_size 87 | 88 | def getDataNum(self, split): 89 | if split == 1: 90 | return 100000 91 | 92 | if split == 2: 93 | return 30000 94 | 95 | def getSeqLength(self): 96 | return self.seq_length 97 |
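if __name__ == "__main__":
    # Illustrative usage sketch (not part of the original module): wrap the
    # dataset in a standard PyTorch DataLoader for batching. Assumes the
    # preprocessed files from the README exist under data/quora_dataset/.
    dataset = Dataloader('data/quora_dataset/quora_data_prepro.json',
                         'data/quora_dataset/quora_data_prepro.h5')
    loader = data.DataLoader(dataset, batch_size=150, shuffle=True)
    ques, ques_len, label, label_len, pair_id = next(iter(loader))
    print(ques.size(), label.size())  # (batch, seq_len + 2) index tensors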
-------------------------------------------------------------------------------- /misc/net_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import misc.utils as utils 3 | 4 | 5 | def decode_sequence(ix_to_word, seq): 6 | N, D = seq.size()[0], seq.size()[1] 7 | out = [] 8 | EOS_flag = False 9 | for i in range(N): 10 | txt = '' 11 | for j in range(D): 12 | ix = seq[i, j] 13 | if int(ix.item()) not in ix_to_word: 14 | print("UNK token ", str(ix.item())) 15 | word = ix_to_word[len(ix_to_word) - 1] 16 | else: 17 | word = ix_to_word[int(ix.item())] 18 | if word == '<EOS>': 19 | txt = txt + ' ' 20 | txt = txt + word 21 | break 22 | if word == '<PAD>' or word == '<SOS>': 23 | # skip padding and start-of-sequence tokens 24 | continue 25 | if j > 0: 26 | txt = txt + ' ' 27 | txt = txt + word 28 | out.append(txt) 29 | return out 30 | 31 | def prob2pred(prob): 32 | # sample one word index per time step from log-probabilities 33 | return torch.multinomial(torch.exp(prob.view(-1, prob.size(-1))), 1).view(prob.size(0), prob.size(1)) 34 | 35 | def JointEmbeddingLoss(feature_emb1, feature_emb2): 36 | # ranking loss: push each mismatched pair (i, j) at least a margin of 1 below the matched pair (j, j) 37 | batch_size = feature_emb1.size()[0] 38 | 39 | return torch.sum( 40 | torch.clamp( 41 | torch.mm(feature_emb1, feature_emb2.t()) - torch.sum(feature_emb1 * feature_emb2, dim=-1) + 1, 42 | min=0.0 43 | ) 44 | ) / (batch_size * batch_size) 45 |
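if __name__ == "__main__":
    # Illustrative sketch (not part of the original module): sample word ids
    # from a toy uniform distribution and decode them back to text.
    vocab = {0: '<UNK>', 1: 'how', 2: 'are', 3: 'you', 4: '<EOS>'}  # toy vocab
    logp = torch.log(torch.ones(2, 3, len(vocab)) / len(vocab))  # (batch, steps, vocab)
    pred = prob2pred(logp)  # (2, 3) tensor of sampled word ids
    print(decode_sequence(vocab, pred))  # e.g. ['how are you', 'are <EOS>']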
-------------------------------------------------------------------------------- /misc/train_util.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import misc.utils as utils 3 | import misc.net_utils as net_utils 4 | import torch.nn as nn 5 | from pycocoevalcap.eval import COCOEvalCap 6 | import os 7 | 8 | 9 | def getObjsForScores(real_sents, pred_sents): 10 | class coco: 11 | def __init__(self, sents): 12 | self.sents = sents 13 | self.imgToAnns = [[{'caption': sents[i]}] for i in range(len(sents))] 14 | 15 | def getImgIds(self): 16 | return [i for i in range(len(self.sents))] 17 | 18 | return coco(real_sents), coco(pred_sents) 19 | 20 | 21 | def evaluate_scores(s1, s2): 22 | 23 | ''' 24 | calculates scores and returns a dict mapping score names to values 25 | ''' 26 | coco, cocoRes = getObjsForScores(s1, s2) 27 | 28 | evalObj = COCOEvalCap(coco, cocoRes) 29 | 30 | evalObj.evaluate() 31 | 32 | return evalObj.eval 33 | 34 | 35 | def dump_samples(ph, pph, gpph, file_name): 36 | # write (phrase, paraphrase, generated paraphrase) triples to a text file 37 | file = open(file_name, "w") 38 | 39 | for r, s, t in zip(ph, pph, gpph): 40 | file.write("ph : " + r + "\npph : " + s + "\ngpph : " + t + '\n\n') 41 | file.close() 42 | 43 | 44 | # def save_model(encoder, generator, model_optim, epoch, it, local_loss, global_loss, save_folder, folder, discriminator=None, discriminatorg=None): 45 | 46 | # PATH = os.path.join(save_folder, folder, str(epoch) + '_' + str(it) + '.tar') 47 | 48 | # checkpoint = { 49 | # 'epoch': epoch, 50 | # 'iter': it, 51 | # 'encoder_state_dict': encoder.state_dict(), 52 | # 'generator_state_dict': generator.state_dict(), 53 | # 'optimizer_state_dict': model_optim.state_dict(), 54 | # 'local_loss': local_loss, 55 | # 'global_loss': global_loss 56 | # } 57 | # if discriminator is not None: 58 | # checkpoint['discriminator_state_dict'] = discriminator.state_dict() 59 | # if discriminatorg is not None: 60 | # checkpoint['discriminatorg_state_dict'] = discriminatorg.state_dict() 61 | 62 | # torch.save(checkpoint, PATH) 63 | 64 | def save_model(model, model_opt, epoch, save_file): 65 | 66 | checkpoint = { 67 | 'epoch': epoch, 68 | 'model': model.state_dict(), 69 | 'model_opt': model_opt.state_dict() 70 | } 71 | 72 | torch.save(checkpoint, save_file) 73 | -------------------------------------------------------------------------------- /misc/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import argparse 3 | 4 | 5 | def one_hot(t, c): 6 | return torch.zeros(*t.size(), c, device=t.device).scatter_(-1, t.unsqueeze(-1), 1) 7 |
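# Worked example of one_hot (toy values assumed):
#   t = tensor([[1, 0]]) with shape (1, 2) and c = 3 gives shape (1, 2, 3):
#   [[[0., 1., 0.],
#     [1., 0., 0.]]]
# i.e. a 1 is scattered along a new trailing axis at each token's index.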
8 | 9 | def make_parser(): 10 | parser = argparse.ArgumentParser() 11 | 12 | parser.add_argument('--input_ques_h5',default='data/quora_dataset/quora_data_prepro.h5',help='path to the h5file containing the preprocessed dataset') 13 | parser.add_argument('--input_json',default='data/quora_dataset/quora_data_prepro.json',help='path to the json file containing additional info and vocab') 14 | 15 | # starting point 16 | parser.add_argument('--start_from', default='None', help='path to a model checkpoint to initialize model weights from. Empty = don\'t') 17 | 18 | # # Model settings 19 | parser.add_argument('--model', default='EDLPS', help='which model to use? EDL|EDP|EDLP|EDLPS|EDLPG|EDLPGS|EDG|EDPG') 20 | parser.add_argument('--batch_size', type=int, default=150, help='what is the batch size in number of images per batch? (there will be x seq_per_img sentences)') 21 | parser.add_argument('--input_encoding_size', type=int, default=512,help='the encoding size of each token in the vocabulary, and the image.') 22 | parser.add_argument('--att_size', type=int, default=512, help='size of the attention vector, which refers to k in the paper') 23 | parser.add_argument('--emb_size',type=int, default=512, help='the size after embedding from one-hot') 24 | parser.add_argument('--rnn_layers',type=int, default=1, help='number of rnn layers') 25 | parser.add_argument('--train_dataset_len', type=int, default=100000, help='length of train dataset') 26 | parser.add_argument('--val_dataset_len', type=int, default=30000, help='length of validation dataset') 27 | 28 | # Optimization 29 | parser.add_argument('--optim',default='rmsprop',help='what update to use? rmsprop|sgd|sgdmom|adagrad|adam') 30 | parser.add_argument('--learning_rate',default=0.0008,help='learning rate', type=float)#0.0001,#0.0002,#0.005 31 | parser.add_argument('--learning_rate_decay_start', default=5, type=int, help='at what epoch to start decaying learning rate? (-1 = don\'t)')#learning_rate_decay_start', 100, 32 | parser.add_argument('--learning_rate_decay_every', type=int, default=5, help='every how many epochs thereafter to drop LR by half?')#-learning_rate_decay_every', 1500, 33 | parser.add_argument('--momentum',type=float, default=0.9,help='momentum') 34 | parser.add_argument('--optim_alpha',type=float, default=0.8,help='alpha for adagrad/rmsprop/momentum/adam')#optim_alpha',0.99 35 | parser.add_argument('--optim_beta',type=float, default=0.999,help='beta used for adam')#optim_beta',0.995 36 | parser.add_argument('--optim_epsilon',type=float, default=1e-8,help='epsilon that goes into denominator in rmsprop') 37 | parser.add_argument('--max_iters', type=int, default=-1, help='max number of iterations to run for (-1 = run forever)') 38 | parser.add_argument('--iterPerEpoch', default=1250, type=int) 39 | parser.add_argument('--drop_prob_lm', type=float, default=0.5, help='strength of dropout in the Language Model RNN') 40 | parser.add_argument('--n_epoch', type=int, default=1, help='number of epochs during training') 41 | 42 | # Evaluation/Checkpointing 43 | 44 | parser.add_argument('--save', default='Results', help='save directory') 45 | parser.add_argument('--checkpoint_dir', default='Results/checkpoints', help='folder to save checkpoints into (empty = this folder)') 46 | parser.add_argument('--language_eval', type=int, default=1, help='Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 47 | parser.add_argument('--val_images_use', type=int, default=24800, help='how many images to use when periodically evaluating the validation loss? (-1 = all)') 48 | parser.add_argument('--save_every', type=int, default=500, help='how often to save a model checkpoint?') 49 | parser.add_argument('--log_every', type=int , default=100, help='How often do we snapshot losses, for inclusion in the progress dump? (0 = disable)') 50 | 51 | # misc 52 | parser.add_argument('--backend', default='cudnn', help='nn|cudnn') 53 | parser.add_argument('--name', default='', help='an id identifying this run/job. used in cross-val and appended when writing progress files') 54 | parser.add_argument('--seed', type=int, default=1234, help='random number generator seed to use') 55 | parser.add_argument('--gpuid', type=int, default=-1, help='which gpu to use. 
-1 = use CPU') 56 | parser.add_argument('--nGPU', type=int, default=3, help='Number of GPUs to use by default') 57 | 58 | #text encoder 59 | parser.add_argument('--emb_dim',type=int, default=512,help='dim of word embedding') 60 | parser.add_argument('--emb_hid_dim', type=int, default=256,help='hidden dim of word embedding') 61 | parser.add_argument('--enc_dropout', type=float, default=0.5,help='dropout for encoder module') 62 | parser.add_argument('--enc_rnn_dim', default=512, type=int, help='size of the rnn in number of hidden nodes in each layer of gru in encoder') 63 | parser.add_argument('--enc_dim', type=int, default=512,help='size of the encoded sentence') 64 | parser.add_argument('--gen_rnn_dim', default=512, type=int, help='size of the rnn in number of hidden nodes in each layer of lstm in generator') 65 | parser.add_argument('--gen_dropout',type=float, default=0.5,help='dropout for generator module') 66 | 67 | return parser 68 | -------------------------------------------------------------------------------- /models/enc_dec_dis.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | import misc.utils as utils 5 | 6 | 7 | class ParaphraseGenerator(nn.Module): 8 | """ 9 | pytorch module which generates paraphrase of given phrase 10 | """ 11 | def __init__(self, op): 12 | 13 | super(ParaphraseGenerator, self).__init__() 14 | 15 | # encoder : 16 | self.emb_layer = nn.Sequential( 17 | nn.Linear(op["vocab_sz"], op["emb_hid_dim"]), 18 | nn.Threshold(0.000001, 0), 19 | nn.Linear(op["emb_hid_dim"], op["emb_dim"]), 20 | nn.Threshold(0.000001, 0)) 21 | self.enc_rnn = nn.GRU(op["emb_dim"], op["enc_rnn_dim"]) 22 | self.enc_lin = nn.Sequential( 23 | nn.Dropout(op["enc_dropout"]), 24 | nn.Linear(op["enc_rnn_dim"], op["enc_dim"])) 25 | 26 | # generator : 27 | self.gen_emb = nn.Embedding(op["vocab_sz"], op["emb_dim"]) 28 | self.gen_rnn = nn.LSTM(op["enc_dim"], op["gen_rnn_dim"]) 29 | self.gen_lin = nn.Sequential( 30 | nn.Dropout(op["gen_dropout"]), 31 | nn.Linear(op["gen_rnn_dim"], op["vocab_sz"]), 32 | nn.LogSoftmax(dim=-1)) 33 | 34 | # pair-wise discriminator : 35 | self.dis_emb_layer = nn.Sequential( 36 | nn.Linear(op["vocab_sz"], op["emb_hid_dim"]), 37 | nn.Threshold(0.000001, 0), 38 | nn.Linear(op["emb_hid_dim"], op["emb_dim"]), 39 | nn.Threshold(0.000001, 0), 40 | ) 41 | self.dis_rnn = nn.GRU(op["emb_dim"], op["enc_rnn_dim"]) 42 | self.dis_lin = nn.Sequential( 43 | nn.Dropout(op["enc_dropout"]), 44 | nn.Linear(op["enc_rnn_dim"], op["enc_dim"])) 45 | 46 | # some useful constants : 47 | self.max_seq_len = op["max_seq_len"] 48 | self.vocab_sz = op["vocab_sz"] 49 | 50 | def forward(self, phrase, sim_phrase=None, train=False): 51 | """ 52 | forward pass 53 | 54 | inputs :- 55 | 56 | phrase : given phrase , shape = (max sequence length, batch size) 57 | sim_phrase : paraphrase of the given phrase (used as the target when train == True), shape = (max seq length, batch sz) 58 | train : if true, teacher forcing is used to train the module 59 | 60 | outputs :- 61 | 62 | out : generated paraphrase (log-probabilities), shape = (max sequence length, batch size, vocab size) 63 | enc_out : encoded generated paraphrase, shape=(batch size, enc_dim) 64 | enc_sim_phrase : encoded sim_phrase, shape=(batch size, enc_dim) 65 | 66 | """ 67 | 68 | if sim_phrase is None: 69 | sim_phrase = phrase 70 | 71 | if train: 72 | 73 | # encode input phrase 74 | enc_phrase = self.enc_lin( 75 | self.enc_rnn( 76 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 77 | 78 | # generate similar phrase using teacher forcing
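            # Sketch of the tensor flow at this step (T = max sequence length,
            # B = batch size; enc_dim == emb_dim in the default config):
            #   enc_phrase          : (1, B, enc_dim)  summary of the input phrase
            #   gen_emb(sim_phrase) : (T, B, emb_dim)  embedded ground-truth paraphrase
            # The encoder summary is prepended, the last target token dropped, so
            # the LSTM conditions every step on the ground-truth previous word
            # instead of its own prediction (teacher forcing).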
79 | emb_sim_phrase_gen = self.gen_emb(sim_phrase) 80 | out_rnn, _ = self.gen_rnn( 81 | torch.cat([enc_phrase, emb_sim_phrase_gen[:-1, :]], dim=0)) 82 | out = self.gen_lin(out_rnn) 83 | 84 | # propagated from shared discriminator to calculate 85 | # pair-wise discriminator loss 86 | enc_sim_phrase = self.dis_lin( 87 | self.dis_rnn( 88 | self.dis_emb_layer(utils.one_hot(sim_phrase, 89 | self.vocab_sz)))[1]) 90 | enc_out = self.dis_lin( 91 | self.dis_rnn(self.dis_emb_layer(torch.exp(out)))[1]) 92 | 93 | else: 94 | 95 | # encode input phrase 96 | enc_phrase = self.enc_lin( 97 | self.enc_rnn( 98 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 99 | 100 | # generate paraphrase by sampling one word at a time (no teacher forcing) 101 | words = [] 102 | h = None 103 | for __ in range(self.max_seq_len): 104 | word, h = self.gen_rnn(enc_phrase, hx=h) 105 | word = self.gen_lin(word) 106 | words.append(word) 107 | word = torch.multinomial(torch.exp(word[0]), 1) 108 | word = word.t() 109 | enc_phrase = self.gen_emb(word) 110 | out = torch.cat(words, dim=0) 111 | 112 | # propagated from shared discriminator to calculate 113 | # pair-wise discriminator loss 114 | enc_sim_phrase = self.dis_lin( 115 | self.dis_rnn( 116 | self.dis_emb_layer(utils.one_hot(sim_phrase, 117 | self.vocab_sz)))[1]) 118 | enc_out = self.dis_lin( 119 | self.dis_rnn(self.dis_emb_layer(torch.exp(out)))[1]) 120 | 121 | enc_out.squeeze_(0) 122 | enc_sim_phrase.squeeze_(0) 123 | return out, enc_out, enc_sim_phrase 124 |
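if __name__ == "__main__":
    # Illustrative sketch (not part of the original module): one forward pass
    # on random token ids. Hidden sizes follow the defaults in misc/utils.py;
    # vocab_sz and max_seq_len here are made-up toy values.
    op = {"vocab_sz": 50, "emb_hid_dim": 256, "emb_dim": 512,
          "enc_rnn_dim": 512, "enc_dim": 512, "gen_rnn_dim": 512,
          "enc_dropout": 0.5, "gen_dropout": 0.5, "max_seq_len": 7}
    model = ParaphraseGenerator(op)
    phrase = torch.randint(0, op["vocab_sz"], (op["max_seq_len"], 4))  # (T, B)
    out, enc_out, enc_sim = model(phrase, sim_phrase=phrase, train=True)
    print(out.size(), enc_out.size(), enc_sim.size())  # (T, B, V) (B, enc_dim) (B, enc_dim)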
-------------------------------------------------------------------------------- /models/enc_dec_sh_dis.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | import misc.utils as utils 5 | 6 | 7 | class ParaphraseGenerator(nn.Module): 8 | """ 9 | pytorch module which generates paraphrase of given phrase 10 | """ 11 | def __init__(self, op): 12 | 13 | super(ParaphraseGenerator, self).__init__() 14 | 15 | # encoder | shared pair-wise discriminator: 16 | self.emb_layer = nn.Sequential( 17 | nn.Linear(op["vocab_sz"], op["emb_hid_dim"]), 18 | nn.Threshold(0.000001, 0), 19 | nn.Linear(op["emb_hid_dim"], op["emb_dim"]), 20 | nn.Threshold(0.000001, 0)) 21 | self.enc_rnn = nn.GRU(op["emb_dim"], op["enc_rnn_dim"]) 22 | self.enc_lin = nn.Sequential( 23 | nn.Dropout(op["enc_dropout"]), 24 | nn.Linear(op["enc_rnn_dim"], op["enc_dim"])) 25 | 26 | # generator : 27 | self.gen_emb = nn.Embedding(op["vocab_sz"], op["emb_dim"]) 28 | self.gen_rnn = nn.LSTM(op["enc_dim"], op["gen_rnn_dim"]) 29 | self.gen_lin = nn.Sequential( 30 | nn.Dropout(op["gen_dropout"]), 31 | nn.Linear(op["gen_rnn_dim"], op["vocab_sz"]), 32 | nn.LogSoftmax(dim=-1)) 33 | 34 | # some useful constants : 35 | self.max_seq_len = op["max_seq_len"] 36 | self.vocab_sz = op["vocab_sz"] 37 | 38 | def forward(self, phrase, sim_phrase=None, train=False): 39 | """ 40 | forward pass 41 | 42 | inputs :- 43 | 44 | phrase : given phrase , shape = (max sequence length, batch size) 45 | sim_phrase : paraphrase of the given phrase (used as the target when train == True), shape = (max seq length, batch sz) 46 | train : if true, teacher forcing is used to train the module 47 | 48 | outputs :- 49 | 50 | out : generated paraphrase (log-probabilities), shape = (max sequence length, batch size, vocab size) 51 | enc_out : encoded generated paraphrase, shape=(batch size, enc_dim) 52 | enc_sim_phrase : encoded sim_phrase, shape=(batch size, enc_dim) 53 | 54 | """ 55 | 56 | if sim_phrase is None: 57 | sim_phrase = phrase 58 | 59 | if train: 60 | 61 | # encode input phrase 62 | enc_phrase = self.enc_lin( self.enc_rnn( 63 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 64 | 65 | # generate similar phrase using teacher forcing 66 | emb_sim_phrase_gen = self.gen_emb(sim_phrase) 67 | out_rnn, _ = self.gen_rnn( 68 | torch.cat([enc_phrase, emb_sim_phrase_gen[:-1, :]], dim=0)) 69 | out = self.gen_lin(out_rnn) 70 | 71 | # encode similar phrase and generated output 72 | # (propagated from shared discriminator) to calculate 73 | # pair-wise discriminator loss 74 | enc_sim_phrase = self.enc_lin( 75 | self.enc_rnn( 76 | self.emb_layer(utils.one_hot(sim_phrase, 77 | self.vocab_sz)))[1]) 78 | enc_out = self.enc_lin( 79 | self.enc_rnn(self.emb_layer(torch.exp(out)))[1]) 80 | 81 | else: 82 | 83 | # encode input phrase 84 | enc_phrase = self.enc_lin( 85 | self.enc_rnn( 86 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 87 | 88 | # generate paraphrase by sampling one word at a time (no teacher forcing) 89 | words = [] 90 | h = None 91 | for __ in range(self.max_seq_len): 92 | word, h = self.gen_rnn(enc_phrase, hx=h) 93 | word = self.gen_lin(word) 94 | words.append(word) 95 | word = torch.multinomial(torch.exp(word[0]), 1) 96 | word = word.t() 97 | enc_phrase = self.gen_emb(word) 98 | out = torch.cat(words, dim=0) 99 | 100 | # encode similar phrase and generated output 101 | # (propagated from shared discriminator) to calculate 102 | # pair-wise discriminator loss 103 | enc_sim_phrase = self.enc_lin( 104 | self.enc_rnn( 105 | self.emb_layer(utils.one_hot(sim_phrase, 106 | self.vocab_sz)))[1]) 107 | enc_out = self.enc_lin( 108 | self.enc_rnn(self.emb_layer(torch.exp(out)))[1]) 109 | 110 | enc_out.squeeze_(0) 111 | enc_sim_phrase.squeeze_(0) 112 | return out, enc_out, enc_sim_phrase 113 | -------------------------------------------------------------------------------- /prepro/prepro_quora.py: -------------------------------------------------------------------------------- 1 | """ 2 | Preprocesses a raw json dataset into hdf5/json files. 3 | python prepro_quora.py --input_train_json ../data/quora_raw_train.json --input_test_json ../data/quora_raw_test.json 4 | To get the question features. You will also see some question statistics in the terminal output. This will generate two files in the data directory, quora_data_prepro.h5 and quora_data_prepro.json. 5 | Also in this code a lot of places you will find things related to captions, but they actually correspond to paraphrases. Reuse of previous code :p 6 | """ 7 | import copy 8 | from random import shuffle, seed 9 | import sys 10 | import os.path 11 | import argparse 12 | import glob 13 | import numpy as np 14 | 15 | import scipy.io 16 | import pdb 17 | import string 18 | import h5py 19 | from nltk.tokenize import word_tokenize 20 | import json 21 | 22 | import re 23 | import math 24 | 25 | 26 | def tokenize(sentence): 27 | return [i for i in re.split(r"([-.\"',:? 
!\$#@~()*&\^%;\[\]/\\\+<>\n=])", sentence) if i != '' and i != ' ' and i != '\n'] 28 | 29 | def prepro_question(imgs, params): 30 | 31 | # preprocess all the questions 32 | print('example processed tokens:') 33 | for i,img in enumerate(imgs): 34 | s = img['question'] 35 | if params['token_method'] == 'nltk': 36 | txt = word_tokenize(s) 37 | else: 38 | txt = tokenize(s) 39 | img['processed_tokens'] = txt 40 | if i < 10: print(txt) 41 | if i % 1000 == 0: 42 | sys.stdout.write("processing %d/%d (%.2f%% done) \r" % (i, len(imgs), i*100.0/len(imgs)) ) 43 | sys.stdout.flush() 44 | return imgs 45 | 46 | def prepro_question1(imgs, params): 47 | 48 | # preprocess all the paraphrase questions 49 | print ('example processed tokens:') 50 | for i,img in enumerate(imgs): 51 | s = img['question1'] 52 | if params['token_method'] == 'nltk': 53 | txt_c = word_tokenize(s) 54 | else: 55 | txt_c = tokenize(s) 56 | 57 | img['processed_tokens_caption'] = txt_c #this name is a bit misleading, it is for paraphrase questions actually. 58 | if i < 10: print (txt_c) 59 | if i % 1000 == 0: 60 | sys.stdout.write("processing %d/%d (%.2f%% done) \r" % (i, len(imgs), i*100.0/len(imgs)) ) 61 | sys.stdout.flush() 62 | return imgs 63 | 64 | 65 | def build_vocab_question(imgs5, params):#imgs1,imgs2,imgs3,imgs4,imgs5,imgs6,imgs7,imgs8, 66 | # build vocabulary for question and answers. 67 | 68 | count_thr = params['word_count_threshold'] 69 | 70 | # count up the number of words 71 | counts = {} 72 | 73 | for img in imgs5: 74 | for w in img['processed_tokens']: 75 | counts[w] = counts.get(w, 0) + 1 76 | 77 | cw = sorted([(count,w) for w,count in counts.items()], reverse=True) 78 | print ('top words and their counts:') 79 | print ('\n'.join(map(str,cw[:20]))) 80 | 81 | # print some stats 82 | total_words = sum(counts.values()) 83 | print ('total words:', total_words) 84 | bad_words = [w for w,n in counts.items() if n <= count_thr] 85 | vocab = [w for w,n in counts.items() if n > count_thr] # will incorporate vocab for both caption and question 86 | bad_count = sum(counts[w] for w in bad_words) 87 | print ('number of bad words: %d/%d = %.2f%%' % (len(bad_words), len(counts), len(bad_words)*100.0/len(counts))) 88 | print ('number of words in vocab would be %d' % (len(vocab), )) 89 | print ('number of UNKs: %d/%d = %.2f%%' % (bad_count, total_words, bad_count*100.0/total_words)) 90 | 91 | 92 | # lets now produce the final annotation 93 | # additional special UNK token we will use below to map infrequent words to 94 | print ('inserting the special UNK token') 95 | vocab.append('UNK') 96 | 97 | 98 | for img in imgs5: 99 | txt = img['processed_tokens'] 100 | question = [w if counts.get(w,0) > count_thr else 'UNK' for w in txt] 101 | img['final_question'] = question 102 | txt_c = img['processed_tokens_caption'] 103 | caption = [w if counts.get(w,0) > count_thr else 'UNK' for w in txt_c] 104 | img['final_caption'] = caption 105 | 106 | return imgs5,vocab#, imgs1,imgs2,imgs3,imgs4,imgs5,imgs6,imgs7,imgs8, vocab 107 | 108 | def apply_vocab_question(imgs, wtoi): ## this is for val or test question and caption 109 | # apply the vocab on test. 
110 | for img in imgs: 111 | txt = img['processed_tokens'] 112 | question = [w if wtoi.get(w,len(wtoi)+1) != (len(wtoi)+1) else 'UNK' for w in txt] 113 | img['final_question'] = question 114 | txt_c = img['processed_tokens_caption'] 115 | caption = [w if w in wtoi else 'UNK' for w in txt_c] 116 | img['final_caption'] = caption 117 | 118 | return imgs 119 | 120 | def encode_question2(imgs, params, wtoi): 121 | 122 | max_length = params['max_length'] 123 | N = len(imgs) 124 | 125 | label_arrays = np.zeros((N, max_length), dtype='uint32') 126 | label_length = np.zeros(N, dtype='uint32') 127 | question_id = np.zeros(N, dtype='uint32') 128 | question_counter = 0 129 | 130 | caption_arrays = np.zeros((N, max_length), dtype='uint32') # will store encoding caption words 131 | caption_length = np.zeros(N, dtype='uint32')# will store encoding caption words 132 | 133 | 134 | for i,img in enumerate(imgs): 135 | question_id[question_counter] = img['id'] #unique_id 136 | label_length[question_counter] = min(max_length, len(img['final_question'])) # record the length of this question sequence 137 | caption_length[question_counter] = min(max_length, len(img['final_caption'])) # record the length of this caption sequence 138 | question_counter += 1 139 | for k,w in enumerate(img['final_question']): 140 | if k < max_length: 141 | label_arrays[i,k] = wtoi[w] 142 | for k,w in enumerate(img['final_caption']): ## this is for caption 143 | if k < max_length: 144 | caption_arrays[i,k] = wtoi[w] 145 | 146 | return label_arrays, label_length, question_id, caption_arrays, caption_length 147 | 148 | 149 | def main(params): 150 | 151 | imgs_train5 = json.load(open(params['input_train_json5'], 'r')) 152 | imgs_test5 = json.load(open(params['input_test_json5'], 'r')) 153 | 154 | 155 | 156 | ##seed(123) # make reproducible 157 | ##shuffle(imgs_train) # shuffle the order 158 | 159 | 160 | # tokenization and preprocessing training question 161 | imgs_train5 = prepro_question(imgs_train5, params) 162 | # tokenization and preprocessing test question 163 | imgs_test5 = prepro_question(imgs_test5, params) 164 | 165 | # tokenization and preprocessing training paraphrase question 166 | imgs_train5 = prepro_question1(imgs_train5, params) 167 | # tokenization and preprocessing test paraphrase question 168 | imgs_test5 = prepro_question1(imgs_test5, params) 169 | 170 | 171 | # create the vocab for question 172 | imgs_train5,vocab = build_vocab_question(imgs_train5, params) 173 | 174 | 175 | itow = {i+1:w for i,w in enumerate(vocab)} # a 1-indexed vocab translation table 176 | wtoi = {w:i+1 for i,w in enumerate(vocab)} # inverse table 177 | 178 | 179 | ques_train5, ques_length_train5, question_id_train5 , cap_train5, cap_length_train5 = encode_question2(imgs_train5, params, wtoi) 180 | 181 | 182 | imgs_test5 = apply_vocab_question(imgs_test5, wtoi) 183 | 184 | 185 | ques_test5, ques_length_test5, question_id_test5 , cap_test5, cap_length_test5 = encode_question2(imgs_test5, params, wtoi) 186 | 187 | 188 | 189 | 190 | N = len(imgs_train5) 191 | f = h5py.File(params['output_h55'], "w") 192 | ## for train information 193 | f.create_dataset("ques_train", dtype='uint32', data=ques_train5) 194 | f.create_dataset("ques_length_train", dtype='uint32', data=ques_length_train5) 195 | f.create_dataset("ques_cap_id_train", dtype='uint32', data=question_id_train5)#this is actually the ques_cap_id 196 | f.create_dataset("ques1_train", dtype='uint32', data=cap_train5) 197 | f.create_dataset("ques1_length_train", dtype='uint32', 
data=cap_length_train5) 198 | 199 | 200 | ## for test information 201 | f.create_dataset("ques_test", dtype='uint32', data=ques_test5) 202 | f.create_dataset("ques_length_test", dtype='uint32', data=ques_length_test5) 203 | f.create_dataset("ques_cap_id_test", dtype='uint32', data=question_id_test5) 204 | f.create_dataset("ques1_test", dtype='uint32', data=cap_test5) 205 | f.create_dataset("ques1_length_test", dtype='uint32', data=cap_length_test5) 206 | 207 | f.close() 208 | print ('wrote ', params['output_h55']) 209 | 210 | # create output json file 211 | 212 | out = {} 213 | out['ix_to_word'] = itow # encode the (1-indexed) vocab 214 | json.dump(out, open(params['output_json5'], 'w')) 215 | print ('wrote ', params['output_json5']) 216 | 217 | 218 | if __name__ == "__main__": 219 | 220 | parser = argparse.ArgumentParser() 221 | 222 | # input and output jsons and h5 223 | parser.add_argument('--input_train_json5', default='../data/quora_raw_train.json', help='input json file to process into hdf5') 224 | parser.add_argument('--input_test_json5', default='../data/quora_raw_test.json', help='input json file to process into hdf5') 225 | parser.add_argument('--num_ans', default=1000, type=int, help='number of top answers for the final classifications.') 226 | parser.add_argument('--output_json5', default='../data/quora_data_prepro.json', help='output json file') 227 | parser.add_argument('--output_h55', default='../data/quora_data_prepro.h5', help='output h5 file') 228 | 229 | 230 | # options 231 | parser.add_argument('--max_length', default=26, type=int, help='max length of a caption, in number of words. captions longer than this get clipped.') 232 | parser.add_argument('--word_count_threshold', default=0, type=int, help='only words that occur more than this number of times will be put in vocab') 233 | parser.add_argument('--token_method', default='nltk', help='tokenization method.') 234 | parser.add_argument('--num_test', default=0, type=int, help='number of test images (to withhold until the very end)') 235 | parser.add_argument('--batch_size', default=10, type=int) 236 | 237 | args = parser.parse_args() 238 | params = vars(args) # convert to ordinary dict 239 | print ('parsed input parameters:') 240 | print (json.dumps(params, indent = 2)) 241 | main(params) 242 | -------------------------------------------------------------------------------- /prepro/quora_prepro.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import json 3 | import os 4 | 5 | def main(): 6 | out = [] 7 | outtest = [] 8 | outval = [] 9 | with open('../data/quora_duplicate_questions.tsv','r') as tsvin: 10 | tsvin = csv.reader(tsvin, delimiter='\t')#read the tsv file of quora question pairs
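        # Expected layout of each TSV row (Quora question pairs release):
        #   row[0] id            unique id for the pair
        #   row[1] qid1          id of the first question
        #   row[2] qid2          id of the second (paraphrase) question
        #   row[3] question1     first question text
        #   row[4] question2     second question text
        #   row[5] is_duplicate  '1' if the two questions are paraphrases, else '0'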
11 | count0 = 1 12 | count1 = 1 13 | counter = 1 14 | for row in tsvin: 15 | counter = counter+1 16 | if row[5]=='0' and row[4][-1:]=='?':#the 6th entry in every row has value 0 or 1 and it represents paraphrases if that value is 1 17 | count0=count0+1 18 | elif row[5]=='1' and row[4][-1:]=='?': 19 | count1=count1+1 20 | if count1>1 and count1<100002:#taking the starting 1 lakh pairs as the train set. Change this to 50002 for taking the starting 50K examples as the train set 21 | # get the question and unique id from the tsv file 22 | quesid = row[1] #first question id 23 | ques = row[3] #first question 24 | img_id = row[0] #unique id for every pair 25 | ques1 = row[4]#paraphrase question 26 | quesid1 =row[2]#paraphrase question id 27 | 28 | # set the parameters of json file for writing 29 | jimg = {} 30 | 31 | jimg['question'] = ques 32 | jimg['question1'] = ques1 33 | jimg['ques_id'] = quesid 34 | jimg['ques_id1'] = quesid1 35 | jimg['id'] = img_id 36 | 37 | out.append(jimg) 38 | 39 | elif count1>100001 and count1<130002:#next 30k as the test set according to https://arxiv.org/pdf/1711.00279.pdf 40 | quesid = row[1] 41 | ques = row[3] 42 | img_id = row[0] 43 | ques1 = row[4] 44 | quesid1 =row[2] 45 | 46 | jimg = {} 47 | 48 | jimg['question'] = ques 49 | jimg['question1'] = ques1 50 | jimg['ques_id'] = quesid 51 | jimg['ques_id1'] = quesid1 52 | jimg['id'] = img_id 53 | 54 | outtest.append(jimg) 55 | else :#rest as val 56 | quesid = row[1] 57 | ques = row[3] 58 | img_id = row[0] 59 | ques1 = row[4] 60 | quesid1 =row[2] 61 | 62 | jimg = {} 63 | jimg['question'] = ques 64 | jimg['question1'] = ques1 65 | jimg['ques_id'] = quesid 66 | jimg['ques_id1'] = quesid1 67 | jimg['id'] = img_id 68 | 69 | outval.append(jimg) 70 | #write the json files for train test and val 71 | print(len(out)) 72 | json.dump(out, open('../data/quora_raw_train.json', 'w')) 73 | print(len(outtest)) 74 | json.dump(outtest, open('../data/quora_raw_test.json', 'w')) 75 | print(len(outval)) 76 | json.dump(outval, open('../data/quora_raw_val.json', 'w')) 77 | 78 | 79 | if __name__ == "__main__": 80 | main() 81 | -------------------------------------------------------------------------------- /pycocoevalcap/.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | -------------------------------------------------------------------------------- /pycocoevalcap/README.md: -------------------------------------------------------------------------------- 1 | Microsoft COCO Caption Evaluation 2 | =================== 3 | 4 | Evaluation codes for MS COCO caption generation. 5 | 6 | ## Description ## 7 | This repository provides Python 3 support for the caption evaluation metrics used for the MS COCO dataset. 8 | 9 | The code is derived from the original repository that supports Python 2.7: https://github.com/tylin/coco-caption. 10 | Caption evaluation depends on the COCO API that natively supports Python 3 (see Requirements). 11 | 12 | ## Requirements ## 13 | - Java 1.8.0 14 | - Python 3 15 | - pycocotools (COCO Python API): https://github.com/cocodataset/cocoapi 16 | 17 | ## Installation ## 18 | To install pycocoevalcap and the pycocotools dependency, run: 19 | ``` 20 | pip install git+https://github.com/salaniz/pycocoevalcap 21 | ``` 22 | 23 | ## Files ## 24 | ./ 25 | - eval.py: The file includes the COCOEvalCap class that can be used to evaluate results on COCO. 
26 | - tokenizer: Python wrapper of Stanford CoreNLP PTBTokenizer 27 | - bleu: Bleu evaluation codes 28 | - meteor: Meteor evaluation codes 29 | - rouge: Rouge-L evaluation codes 30 | - cider: CIDEr evaluation codes 31 | 32 | ## References ## 33 | 34 | - [Microsoft COCO Captions: Data Collection and Evaluation Server](http://arxiv.org/abs/1504.00325) 35 | - PTBTokenizer: We use the [Stanford Tokenizer](http://nlp.stanford.edu/software/tokenizer.shtml) which is included in [Stanford CoreNLP 3.4.1](http://nlp.stanford.edu/software/corenlp.shtml). 36 | - BLEU: [BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf) 37 | - Meteor: [Project page](http://www.cs.cmu.edu/~alavie/METEOR/) with related publications. We use the latest version (1.5) of the [Code](https://github.com/mjdenkowski/meteor). Changes have been made to the source code to properly aggregate the statistics for the entire corpus. 38 | - Rouge-L: [ROUGE: A Package for Automatic Evaluation of Summaries](http://anthology.aclweb.org/W/W04/W04-1013.pdf) 39 | - CIDEr: [CIDEr: Consensus-based Image Description Evaluation](http://arxiv.org/pdf/1411.5726.pdf) 40 | 41 | ## Developers ## 42 | - Xinlei Chen (CMU) 43 | - Hao Fang (University of Washington) 44 | - Tsung-Yi Lin (Cornell) 45 | - Ramakrishna Vedantam (Virginia Tech) 46 | 47 | ## Acknowledgement ## 48 | - David Chiang (University of Notre Dame) 49 | - Michael Denkowski (CMU) 50 | - Alexander Rush (Harvard University) 51 | 
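A minimal sketch of calling one of the scorers directly (dictionaries are keyed by an arbitrary id; values are lists of sentences):

```python
from pycocoevalcap.bleu.bleu import Bleu

gts = {0: ["how do i learn python quickly"], 1: ["what is machine learning"]}
res = {0: ["how can i learn python fast"], 1: ["what is machine learning"]}

score, scores = Bleu(4).compute_score(gts, res)
print(score)  # corpus-level [Bleu_1, Bleu_2, Bleu_3, Bleu_4]
```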
-------------------------------------------------------------------------------- /pycocoevalcap/bleu/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015 Xinlei Chen, Hao Fang, Tsung-Yi Lin, and Ramakrishna Vedantam 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 20 | -------------------------------------------------------------------------------- /pycocoevalcap/bleu/bleu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : bleu.py 4 | # 5 | # Description : Wrapper for BLEU scorer. 6 | # 7 | # Creation Date : 06-01-2015 8 | # Last Modified : Thu 19 Mar 2015 09:13:28 PM PDT 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | from .bleu_scorer import BleuScorer 12 | 13 | 14 | class Bleu: 15 | def __init__(self, n=4): 16 | # default: compute Bleu score up to 4-grams 17 | self._n = n 18 | self._hypo_for_image = {} 19 | self.ref_for_image = {} 20 | 21 | def compute_score(self, gts, res): 22 | 23 | assert(gts.keys() == res.keys()) 24 | imgIds = gts.keys() 25 | 26 | bleu_scorer = BleuScorer(n=self._n) 27 | for id in imgIds: 28 | hypo = res[id] 29 | ref = gts[id] 30 | 31 | # Sanity check. 32 | assert(type(hypo) is list) 33 | assert(len(hypo) == 1) 34 | assert(type(ref) is list) 35 | assert(len(ref) >= 1) 36 | 37 | bleu_scorer += (hypo[0], ref) 38 | 39 | #score, scores = bleu_scorer.compute_score(option='shortest') 40 | score, scores = bleu_scorer.compute_score(option='closest', verbose=1) 41 | #score, scores = bleu_scorer.compute_score(option='average', verbose=1) 42 | 43 | # return (bleu, bleu_info) 44 | return score, scores 45 | 46 | def method(self): 47 | return "Bleu" 48 | -------------------------------------------------------------------------------- /pycocoevalcap/bleu/bleu_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # bleu_scorer.py 4 | # David Chiang 5 | 6 | # Copyright (c) 2004-2006 University of Maryland. All rights 7 | # reserved. Do not redistribute without permission from the 8 | # author. Not for commercial use. 9 | 10 | # Modified by: 11 | # Hao Fang 12 | # Tsung-Yi Lin 13 | 14 | '''Provides: 15 | cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test(). 16 | cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked(). 17 | ''' 18 | 19 | import copy 20 | import sys, math, re 21 | from collections import defaultdict 22 | 23 | def precook(s, n=4, out=False): 24 | """Takes a string as input and returns an object that can be given to 25 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 26 | can take string arguments as well.""" 27 | words = s.split() 28 | counts = defaultdict(int) 29 | for k in range(1,n+1): 30 | for i in range(len(words)-k+1): 31 | ngram = tuple(words[i:i+k]) 32 | counts[ngram] += 1 33 | return (len(words), counts) 34 | 35 | def cook_refs(refs, eff=None, n=4): ## lhuang: oracle will call with "average" 36 | '''Takes a list of reference sentences for a single segment 37 | and returns an object that encapsulates everything that BLEU 38 | needs to know about them.''' 39 | 40 | reflen = [] 41 | maxcounts = {} 42 | for ref in refs: 43 | rl, counts = precook(ref, n) 44 | reflen.append(rl) 45 | for (ngram,count) in counts.items(): 46 | maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 47 | 48 | # Calculate effective reference sentence length. 49 | if eff == "shortest": 50 | reflen = min(reflen) 51 | elif eff == "average": 52 | reflen = float(sum(reflen))/len(reflen) 53 | 54 | ## lhuang: N.B.: leave reflen computation to the very end!! 55 | 56 | ## lhuang: N.B.: in case of "closest", keep a list of reflens!! 
(bad design) 57 | 58 | return (reflen, maxcounts) 59 | 60 | def cook_test(test, refs, eff=None, n=4): 61 | '''Takes a test sentence and returns an object that 62 | encapsulates everything that BLEU needs to know about it.''' 63 | 64 | reflen, refmaxcounts = refs 65 | testlen, counts = precook(test, n, True) 66 | 67 | result = {} 68 | 69 | # Calculate effective reference sentence length. 70 | 71 | if eff == "closest": 72 | result["reflen"] = min((abs(l-testlen), l) for l in reflen)[1] 73 | else: ## i.e., "average" or "shortest" or None 74 | result["reflen"] = reflen 75 | 76 | result["testlen"] = testlen 77 | 78 | result["guess"] = [max(0,testlen-k+1) for k in range(1,n+1)] 79 | 80 | result['correct'] = [0]*n 81 | for (ngram, count) in counts.items(): 82 | result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count) 83 | 84 | return result 85 | 86 | class BleuScorer(object): 87 | """Bleu scorer. 88 | """ 89 | 90 | __slots__ = "n", "crefs", "ctest", "_score", "_ratio", "_testlen", "_reflen", "special_reflen" 91 | # special_reflen is used in oracle (proportional effective ref len for a node). 92 | 93 | def copy(self): 94 | ''' copy the refs.''' 95 | new = BleuScorer(n=self.n) 96 | new.ctest = copy.copy(self.ctest) 97 | new.crefs = copy.copy(self.crefs) 98 | new._score = None 99 | return new 100 | 101 | def __init__(self, test=None, refs=None, n=4, special_reflen=None): 102 | ''' singular instance ''' 103 | 104 | self.n = n 105 | self.crefs = [] 106 | self.ctest = [] 107 | self.cook_append(test, refs) 108 | self.special_reflen = special_reflen 109 | 110 | def cook_append(self, test, refs): 111 | '''called by constructor and __iadd__ to avoid creating new instances.''' 112 | 113 | if refs is not None: 114 | self.crefs.append(cook_refs(refs)) 115 | if test is not None: 116 | cooked_test = cook_test(test, self.crefs[-1]) 117 | self.ctest.append(cooked_test) ## N.B.: -1 118 | else: 119 | self.ctest.append(None) # lens of crefs and ctest have to match 120 | 121 | self._score = None ## need to recompute 122 | 123 | def ratio(self, option=None): 124 | self.compute_score(option=option) 125 | return self._ratio 126 | 127 | def score_ratio(self, option=None): 128 | '''return (bleu, len_ratio) pair''' 129 | return (self.fscore(option=option), self.ratio(option=option)) 130 | 131 | def score_ratio_str(self, option=None): 132 | return "%.4f (%.2f)" % self.score_ratio(option) 133 | 134 | def reflen(self, option=None): 135 | self.compute_score(option=option) 136 | return self._reflen 137 | 138 | def testlen(self, option=None): 139 | self.compute_score(option=option) 140 | return self._testlen 141 | 142 | def retest(self, new_test): 143 | if type(new_test) is str: 144 | new_test = [new_test] 145 | assert len(new_test) == len(self.crefs), new_test 146 | self.ctest = [] 147 | for t, rs in zip(new_test, self.crefs): 148 | self.ctest.append(cook_test(t, rs)) 149 | self._score = None 150 | 151 | return self 152 | 153 | def rescore(self, new_test): 154 | ''' replace test(s) with new test(s), and returns the new score.''' 155 | 156 | return self.retest(new_test).compute_score() 157 | 158 | def size(self): 159 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! 
%d<>%d" % (len(self.crefs), len(self.ctest)) 160 | return len(self.crefs) 161 | 162 | def __iadd__(self, other): 163 | '''add an instance (e.g., from another sentence).''' 164 | 165 | if type(other) is tuple: 166 | ## avoid creating new BleuScorer instances 167 | self.cook_append(other[0], other[1]) 168 | else: 169 | assert self.compatible(other), "incompatible BLEUs." 170 | self.ctest.extend(other.ctest) 171 | self.crefs.extend(other.crefs) 172 | self._score = None ## need to recompute 173 | 174 | return self 175 | 176 | def compatible(self, other): 177 | return isinstance(other, BleuScorer) and self.n == other.n 178 | 179 | def single_reflen(self, option="average"): 180 | return self._single_reflen(self.crefs[0][0], option) 181 | 182 | def _single_reflen(self, reflens, option=None, testlen=None): 183 | 184 | if option == "shortest": 185 | reflen = min(reflens) 186 | elif option == "average": 187 | reflen = float(sum(reflens))/len(reflens) 188 | elif option == "closest": 189 | reflen = min((abs(l-testlen), l) for l in reflens)[1] 190 | else: 191 | assert False, "unsupported reflen option %s" % option 192 | 193 | return reflen 194 | 195 | def recompute_score(self, option=None, verbose=0): 196 | self._score = None 197 | return self.compute_score(option, verbose) 198 | 199 | def compute_score(self, option=None, verbose=0): 200 | n = self.n 201 | small = 1e-9 202 | tiny = 1e-15 ## so that if guess is 0 still return 0 203 | bleu_list = [[] for _ in range(n)] 204 | 205 | if self._score is not None: 206 | return self._score 207 | 208 | if option is None: 209 | option = "average" if len(self.crefs) == 1 else "closest" 210 | 211 | self._testlen = 0 212 | self._reflen = 0 213 | totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n} 214 | 215 | # for each sentence 216 | for comps in self.ctest: 217 | testlen = comps['testlen'] 218 | self._testlen += testlen 219 | 220 | if self.special_reflen is None: ## need computation 221 | reflen = self._single_reflen(comps['reflen'], option, testlen) 222 | else: 223 | reflen = self.special_reflen 224 | 225 | self._reflen += reflen 226 | 227 | for key in ['guess','correct']: 228 | for k in range(n): 229 | totalcomps[key][k] += comps[key][k] 230 | 231 | # append per image bleu score 232 | bleu = 1. 233 | for k in range(n): 234 | bleu *= (float(comps['correct'][k]) + tiny) \ 235 | /(float(comps['guess'][k]) + small) 236 | bleu_list[k].append(bleu ** (1./(k+1))) 237 | ratio = (testlen + tiny) / (reflen + small) ## N.B.: avoid zero division 238 | if ratio < 1: 239 | for k in range(n): 240 | bleu_list[k][-1] *= math.exp(1 - 1/ratio) 241 | 242 | if verbose > 1: 243 | print(comps, reflen) 244 | 245 | totalcomps['reflen'] = self._reflen 246 | totalcomps['testlen'] = self._testlen 247 | 248 | bleus = [] 249 | bleu = 1. 
250 | for k in range(n): 251 | bleu *= float(totalcomps['correct'][k] + tiny) \ 252 | / (totalcomps['guess'][k] + small) 253 | bleus.append(bleu ** (1./(k+1))) 254 | ratio = (self._testlen + tiny) / (self._reflen + small) ## N.B.: avoid zero division 255 | if ratio < 1: 256 | for k in range(n): 257 | bleus[k] *= math.exp(1 - 1/ratio) 258 | 259 | if verbose > 0: 260 | print(totalcomps) 261 | print("ratio:", ratio) 262 | 263 | self._score = bleus 264 | return self._score, bleu_list 265 | -------------------------------------------------------------------------------- /pycocoevalcap/cider/cider.py: -------------------------------------------------------------------------------- 1 | # Filename: cider.py 2 | # 3 | # Description: Describes the class to compute the CIDEr (Consensus-Based Image Description Evaluation) Metric 4 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 5 | # 6 | # Creation Date: Sun Feb 8 14:16:54 2015 7 | # 8 | # Authors: Ramakrishna Vedantam and Tsung-Yi Lin 9 | 10 | from .cider_scorer import CiderScorer 11 | import pdb 12 | 13 | class Cider: 14 | """ 15 | Main Class to compute the CIDEr metric 16 | 17 | """ 18 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 19 | # set cider to sum over 1 to 4-grams 20 | self._n = n 21 | # set the standard deviation parameter for gaussian penalty 22 | self._sigma = sigma 23 | 24 | def compute_score(self, gts, res): 25 | """ 26 | Main function to compute CIDEr score 27 | :param hypo_for_image (dict) : dictionary with key and value 28 | ref_for_image (dict) : dictionary with key and value 29 | :return: cider (float) : computed CIDEr score for the corpus 30 | """ 31 | 32 | assert(gts.keys() == res.keys()) 33 | imgIds = gts.keys() 34 | 35 | cider_scorer = CiderScorer(n=self._n, sigma=self._sigma) 36 | 37 | for id in imgIds: 38 | hypo = res[id] 39 | ref = gts[id] 40 | 41 | # Sanity check. 42 | assert(type(hypo) is list) 43 | assert(len(hypo) == 1) 44 | assert(type(ref) is list) 45 | assert(len(ref) > 0) 46 | 47 | cider_scorer += (hypo[0], ref) 48 | 49 | (score, scores) = cider_scorer.compute_score() 50 | 51 | return score, scores 52 | 53 | def method(self): 54 | return "CIDEr" 55 | -------------------------------------------------------------------------------- /pycocoevalcap/cider/cider_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | 5 | import copy 6 | from collections import defaultdict 7 | import numpy as np 8 | import pdb 9 | import math 10 | 11 | def precook(s, n=4, out=False): 12 | """ 13 | Takes a string as input and returns an object that can be given to 14 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 15 | can take string arguments as well. 16 | :param s: string : sentence to be converted into ngrams 17 | :param n: int : number of ngrams for which representation is calculated 18 | :return: term frequency vector for occuring ngrams 19 | """ 20 | words = s.split() 21 | counts = defaultdict(int) 22 | for k in range(1,n+1): 23 | for i in range(len(words)-k+1): 24 | ngram = tuple(words[i:i+k]) 25 | counts[ngram] += 1 26 | return counts 27 | 28 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 29 | '''Takes a list of reference sentences for a single segment 30 | and returns an object that encapsulates everything that BLEU 31 | needs to know about them. 
32 | :param refs: list of string : reference sentences for some image 33 | :param n: int : number of ngrams for which (ngram) representation is calculated 34 | :return: result (list of dict) 35 | ''' 36 | return [precook(ref, n) for ref in refs] 37 | 38 | def cook_test(test, n=4): 39 | '''Takes a test sentence and returns an object that 40 | encapsulates everything that BLEU needs to know about it. 41 | :param test: list of string : hypothesis sentence for some image 42 | :param n: int : number of ngrams for which (ngram) representation is calculated 43 | :return: result (dict) 44 | ''' 45 | return precook(test, n, True) 46 | 47 | class CiderScorer(object): 48 | """CIDEr scorer. 49 | """ 50 | 51 | def copy(self): 52 | ''' copy the refs.''' 53 | new = CiderScorer(n=self.n) 54 | new.ctest = copy.copy(self.ctest) 55 | new.crefs = copy.copy(self.crefs) 56 | return new 57 | 58 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 59 | ''' singular instance ''' 60 | self.n = n 61 | self.sigma = sigma 62 | self.crefs = [] 63 | self.ctest = [] 64 | self.document_frequency = defaultdict(float) 65 | self.cook_append(test, refs) 66 | self.ref_len = None 67 | 68 | def cook_append(self, test, refs): 69 | '''called by constructor and __iadd__ to avoid creating new instances.''' 70 | 71 | if refs is not None: 72 | self.crefs.append(cook_refs(refs)) 73 | if test is not None: 74 | self.ctest.append(cook_test(test)) ## N.B.: -1 75 | else: 76 | self.ctest.append(None) # lens of crefs and ctest have to match 77 | 78 | def size(self): 79 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 80 | return len(self.crefs) 81 | 82 | def __iadd__(self, other): 83 | '''add an instance (e.g., from another sentence).''' 84 | 85 | if type(other) is tuple: 86 | ## avoid creating new CiderScorer instances 87 | self.cook_append(other[0], other[1]) 88 | else: 89 | self.ctest.extend(other.ctest) 90 | self.crefs.extend(other.crefs) 91 | 92 | return self 93 | def compute_doc_freq(self): 94 | ''' 95 | Compute term frequency for reference data. 96 | This will be used to compute idf (inverse document frequency later) 97 | The term frequency is stored in the object 98 | :return: None 99 | ''' 100 | for refs in self.crefs: 101 | # refs, k ref captions of one image 102 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.items()]): 103 | self.document_frequency[ngram] += 1 104 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 105 | 106 | def compute_cider(self): 107 | def counts2vec(cnts): 108 | """ 109 | Function maps counts of ngram to vector of tfidf weights. 110 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 111 | The n-th entry of array denotes length of n-grams. 112 | :param cnts: 113 | :return: vec (array of dict), norm (array of float), length (int) 114 | """ 115 | vec = [defaultdict(float) for _ in range(self.n)] 116 | length = 0 117 | norm = [0.0 for _ in range(self.n)] 118 | for (ngram,term_freq) in cnts.items(): 119 | # give word count 1 if it doesn't appear in reference corpus 120 | df = np.log(max(1.0, self.document_frequency[ngram])) 121 | # ngram index 122 | n = len(ngram)-1 123 | # tf (term_freq) * idf (precomputed idf) for n-grams 124 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 125 | # compute norm for the vector. 
the norm will be used for computing similarity 126 | norm[n] += pow(vec[n][ngram], 2) 127 | 128 | if n == 1: 129 | length += term_freq 130 | norm = [np.sqrt(n) for n in norm] 131 | return vec, norm, length 132 | 133 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 134 | ''' 135 | Compute the cosine similarity of two vectors. 136 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 137 | :param vec_ref: array of dictionary for vector corresponding to reference 138 | :param norm_hyp: array of float for vector corresponding to hypothesis 139 | :param norm_ref: array of float for vector corresponding to reference 140 | :param length_hyp: int containing length of hypothesis 141 | :param length_ref: int containing length of reference 142 | :return: array of score for each n-grams cosine similarity 143 | ''' 144 | delta = float(length_hyp - length_ref) 145 | # measure consine similarity 146 | val = np.array([0.0 for _ in range(self.n)]) 147 | for n in range(self.n): 148 | # ngram 149 | for (ngram,count) in vec_hyp[n].items(): 150 | # vrama91 : added clipping 151 | val[n] += min(vec_hyp[n][ngram], vec_ref[n][ngram]) * vec_ref[n][ngram] 152 | 153 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 154 | val[n] /= (norm_hyp[n]*norm_ref[n]) 155 | 156 | assert(not math.isnan(val[n])) 157 | # vrama91: added a length based gaussian penalty 158 | val[n] *= np.e**(-(delta**2)/(2*self.sigma**2)) 159 | return val 160 | 161 | # compute log reference length 162 | self.ref_len = np.log(float(len(self.crefs))) 163 | 164 | scores = [] 165 | for test, refs in zip(self.ctest, self.crefs): 166 | # compute vector for test captions 167 | vec, norm, length = counts2vec(test) 168 | # compute vector for ref captions 169 | score = np.array([0.0 for _ in range(self.n)]) 170 | for ref in refs: 171 | vec_ref, norm_ref, length_ref = counts2vec(ref) 172 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 173 | # change by vrama91 - mean of ngram scores, instead of sum 174 | score_avg = np.mean(score) 175 | # divide by number of references 176 | score_avg /= len(refs) 177 | # multiply score by 10 178 | score_avg *= 10.0 179 | # append score of an image to the score list 180 | scores.append(score_avg) 181 | return scores 182 | 183 | def compute_score(self, option=None, verbose=0): 184 | # compute idf 185 | self.compute_doc_freq() 186 | # assert to check document frequency 187 | assert(len(self.ctest) >= max(self.document_frequency.values())) 188 | # compute cider score 189 | score = self.compute_cider() 190 | # debug 191 | # print score 192 | return np.mean(np.array(score)), np.array(score) 193 | -------------------------------------------------------------------------------- /pycocoevalcap/eval.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | from .tokenizer.ptbtokenizer import PTBTokenizer 3 | from .bleu.bleu import Bleu 4 | from .meteor.meteor import Meteor 5 | from .rouge.rouge import Rouge 6 | from .cider.cider import Cider 7 | 8 | class COCOEvalCap: 9 | def __init__(self, coco, cocoRes): 10 | self.evalImgs = [] 11 | self.eval = {} 12 | self.imgToEval = {} 13 | self.coco = coco 14 | self.cocoRes = cocoRes 15 | self.params = {'image_id': coco.getImgIds()} 16 | 17 | def evaluate(self): 18 | imgIds = self.params['image_id'] 19 | # imgIds = self.coco.getImgIds() 20 | gts = {} 21 | res = {} 22 | for imgId in imgIds: 23 | gts[imgId] = self.coco.imgToAnns[imgId] 24 | res[imgId] = self.cocoRes.imgToAnns[imgId] 
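        # gts and res map each image id to a list of annotation dicts carrying
        # a 'caption' string; the PTB tokenizer below flattens them to lists of
        # plain tokenized strings, the input format every scorer expects.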
25 | 26 | # ================================================= 27 | # Set up scorers 28 | # ================================================= 29 | print('tokenization...') 30 | tokenizer = PTBTokenizer() 31 | gts = tokenizer.tokenize(gts) 32 | res = tokenizer.tokenize(res) 33 | 34 | # ================================================= 35 | # Set up scorers 36 | # ================================================= 37 | print('setting up scorers...') 38 | scorers = [ 39 | (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]), 40 | (Meteor(),"METEOR"), 41 | (Rouge(), "ROUGE_L"), 42 | (Cider(), "CIDEr") 43 | ] 44 | 45 | # ================================================= 46 | # Compute scores 47 | # ================================================= 48 | for scorer, method in scorers: 49 | print('computing %s score...'%(scorer.method())) 50 | score, scores = scorer.compute_score(gts, res) 51 | if type(method) == list: 52 | for sc, scs, m in zip(score, scores, method): 53 | self.setEval(sc, m) 54 | self.setImgToEvalImgs(scs, gts.keys(), m) 55 | print("%s: %0.3f"%(m, sc)) 56 | else: 57 | self.setEval(score, method) 58 | self.setImgToEvalImgs(scores, gts.keys(), method) 59 | print("%s: %0.3f"%(method, score)) 60 | self.setEvalImgs() 61 | 62 | def setEval(self, score, method): 63 | self.eval[method] = score 64 | 65 | def setImgToEvalImgs(self, scores, imgIds, method): 66 | for imgId, score in zip(imgIds, scores): 67 | if not imgId in self.imgToEval: 68 | self.imgToEval[imgId] = {} 69 | self.imgToEval[imgId]["image_id"] = imgId 70 | self.imgToEval[imgId][method] = score 71 | 72 | def setEvalImgs(self): 73 | self.evalImgs = [eval for imgId, eval in self.imgToEval.items()] 74 | -------------------------------------------------------------------------------- /pycocoevalcap/license.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015, Xinlei Chen, Hao Fang, Tsung-Yi Lin, and Ramakrishna Vedantam 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the FreeBSD Project. 
27 | 
--------------------------------------------------------------------------------
/pycocoevalcap/meteor/data/paraphrase-en.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/pycocoevalcap/meteor/data/paraphrase-en.gz
--------------------------------------------------------------------------------
/pycocoevalcap/meteor/meteor-1.5.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/pycocoevalcap/meteor/meteor-1.5.jar
--------------------------------------------------------------------------------
/pycocoevalcap/meteor/meteor.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | # Python wrapper for METEOR implementation, by Xinlei Chen
4 | # Acknowledge Michael Denkowski for the generous discussion and help
5 | 
6 | import os
7 | import sys
8 | import subprocess
9 | import threading
10 | 
11 | # Assumes meteor-1.5.jar is in the same directory as meteor.py. Change as needed.
12 | METEOR_JAR = 'meteor-1.5.jar'
13 | # print METEOR_JAR
14 | 
15 | class Meteor:
16 | 
17 |     def __init__(self):
18 |         self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, \
19 |                 '-', '-', '-stdio', '-l', 'en', '-norm']
20 |         self.meteor_p = subprocess.Popen(self.meteor_cmd, \
21 |                 cwd=os.path.dirname(os.path.abspath(__file__)), \
22 |                 stdin=subprocess.PIPE, \
23 |                 stdout=subprocess.PIPE, \
24 |                 stderr=subprocess.PIPE)
25 |         # Used to guarantee thread safety
26 |         self.lock = threading.Lock()
27 | 
28 |     def compute_score(self, gts, res):
29 |         assert(gts.keys() == res.keys())
30 |         imgIds = gts.keys()
31 |         scores = []
32 | 
33 |         eval_line = 'EVAL'
34 |         self.lock.acquire()
35 |         for i in imgIds:
36 |             assert(len(res[i]) == 1)
37 |             stat = self._stat(res[i][0], gts[i])
38 |             eval_line += ' ||| {}'.format(stat)
39 | 
40 |         self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
41 |         self.meteor_p.stdin.flush()
42 |         for i in range(0,len(imgIds)):
43 |             scores.append(float(self.meteor_p.stdout.readline().strip()))
44 |         score = float(self.meteor_p.stdout.readline().strip())
45 |         self.lock.release()
46 | 
47 |         return score, scores
48 | 
49 |     def method(self):
50 |         return "METEOR"
51 | 
52 |     def _stat(self, hypothesis_str, reference_list):
53 |         # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
54 |         hypothesis_str = hypothesis_str.replace('|||','').replace('  ',' ')
55 |         score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str))
56 |         self.meteor_p.stdin.write('{}\n'.format(score_line).encode())
57 |         self.meteor_p.stdin.flush()
58 |         return self.meteor_p.stdout.readline().decode().strip()
59 | 
60 |     def _score(self, hypothesis_str, reference_list):
61 |         self.lock.acquire()
62 |         # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
63 |         hypothesis_str = hypothesis_str.replace('|||','').replace('  ',' ')
64 |         score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str))
65 |         self.meteor_p.stdin.write('{}\n'.format(score_line).encode())
66 |         stats = self.meteor_p.stdout.readline().decode().strip()
67 |         eval_line = 'EVAL ||| {}'.format(stats)
68 |         # EVAL ||| stats
69 |         self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
70 |         score = float(self.meteor_p.stdout.readline().strip())
71 |         # bug fix: there are two values returned by the jar file, one average,
and one all, so do it twice 72 | # thanks for Andrej for pointing this out 73 | score = float(self.meteor_p.stdout.readline().strip()) 74 | self.lock.release() 75 | return score 76 | 77 | def __del__(self): 78 | self.lock.acquire() 79 | self.meteor_p.stdin.close() 80 | self.meteor_p.kill() 81 | self.meteor_p.wait() 82 | self.lock.release() 83 | -------------------------------------------------------------------------------- /pycocoevalcap/rouge/rouge.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : rouge.py 4 | # 5 | # Description : Computes ROUGE-L metric as described by Lin and Hovey (2004) 6 | # 7 | # Creation Date : 2015-01-07 06:03 8 | # Author : Ramakrishna Vedantam 9 | 10 | import numpy as np 11 | import pdb 12 | 13 | def my_lcs(string, sub): 14 | """ 15 | Calculates longest common subsequence for a pair of tokenized strings 16 | :param string : list of str : tokens from a string split using whitespace 17 | :param sub : list of str : shorter string, also split using whitespace 18 | :returns: length (list of int): length of the longest common subsequence between the two strings 19 | 20 | Note: my_lcs only gives length of the longest common subsequence, not the actual LCS 21 | """ 22 | if(len(string)< len(sub)): 23 | sub, string = string, sub 24 | 25 | lengths = [[0 for i in range(0,len(sub)+1)] for j in range(0,len(string)+1)] 26 | 27 | for j in range(1,len(sub)+1): 28 | for i in range(1,len(string)+1): 29 | if(string[i-1] == sub[j-1]): 30 | lengths[i][j] = lengths[i-1][j-1] + 1 31 | else: 32 | lengths[i][j] = max(lengths[i-1][j] , lengths[i][j-1]) 33 | 34 | return lengths[len(string)][len(sub)] 35 | 36 | class Rouge(): 37 | ''' 38 | Class for computing ROUGE-L score for a set of candidate sentences for the MS COCO test set 39 | 40 | ''' 41 | def __init__(self): 42 | # vrama91: updated the value below based on discussion with Hovey 43 | self.beta = 1.2 44 | 45 | def calc_score(self, candidate, refs): 46 | """ 47 | Compute ROUGE-L score given one candidate and references for an image 48 | :param candidate: str : candidate sentence to be evaluated 49 | :param refs: list of str : COCO reference sentences for the particular image to be evaluated 50 | :returns score: int (ROUGE-L score for the candidate evaluated against references) 51 | """ 52 | assert(len(candidate)==1) 53 | assert(len(refs)>0) 54 | prec = [] 55 | rec = [] 56 | 57 | # split into tokens 58 | token_c = candidate[0].split(" ") 59 | 60 | for reference in refs: 61 | # split into tokens 62 | token_r = reference.split(" ") 63 | # compute the longest common subsequence 64 | lcs = my_lcs(token_r, token_c) 65 | prec.append(lcs/float(len(token_c))) 66 | rec.append(lcs/float(len(token_r))) 67 | 68 | prec_max = max(prec) 69 | rec_max = max(rec) 70 | 71 | if(prec_max!=0 and rec_max !=0): 72 | score = ((1 + self.beta**2)*prec_max*rec_max)/float(rec_max + self.beta**2*prec_max) 73 | else: 74 | score = 0.0 75 | return score 76 | 77 | def compute_score(self, gts, res): 78 | """ 79 | Computes Rouge-L score given a set of reference and candidate sentences for the dataset 80 | Invoked by evaluate_captions.py 81 | :param hypo_for_image: dict : candidate / test sentences with "image name" key and "tokenized sentences" as values 82 | :param ref_for_image: dict : reference MS-COCO sentences with "image name" key and "tokenized sentences" as values 83 | :returns: average_score: float (mean ROUGE-L score computed by averaging scores for all the images) 84 | """ 
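        # Per image, calc_score takes the maximum LCS-based precision and
        # recall over all references and combines them into an F-measure
        # skewed toward recall by beta = 1.2; the corpus score returned by
        # compute_score is the plain mean over images.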
85 | assert(gts.keys() == res.keys()) 86 | imgIds = gts.keys() 87 | 88 | score = [] 89 | for id in imgIds: 90 | hypo = res[id] 91 | ref = gts[id] 92 | 93 | score.append(self.calc_score(hypo, ref)) 94 | 95 | # Sanity check. 96 | assert(type(hypo) is list) 97 | assert(len(hypo) == 1) 98 | assert(type(ref) is list) 99 | assert(len(ref) > 0) 100 | 101 | average_score = np.mean(np.array(score)) 102 | return average_score, np.array(score) 103 | 104 | def method(self): 105 | return "Rouge" 106 | -------------------------------------------------------------------------------- /pycocoevalcap/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_namespace_packages 2 | 3 | # Prepend pycocoevalcap to package names 4 | package_names = ['pycocoevalcap.'+p for p in find_namespace_packages()] 5 | 6 | setup( 7 | name='pycocoevalcap', 8 | version=1.0, 9 | packages=['pycocoevalcap']+package_names, 10 | package_dir={'pycocoevalcap': '.'}, 11 | package_data={'': ['*.jar', '*.gz']}, 12 | install_requires=['pycocotools>=2.0.0'] 13 | ) 14 | -------------------------------------------------------------------------------- /pycocoevalcap/tokenizer/ptbtokenizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : ptbtokenizer.py 4 | # 5 | # Description : Do the PTB Tokenization and remove punctuations. 6 | # 7 | # Creation Date : 29-12-2014 8 | # Last Modified : Thu Mar 19 09:53:35 2015 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | import os 12 | import sys 13 | import subprocess 14 | import tempfile 15 | import itertools 16 | import contextlib 17 | 18 | # path to the stanford corenlp jar 19 | STANFORD_CORENLP_3_4_1_JAR = 'stanford-corenlp-3.4.1.jar' 20 | 21 | # punctuations to be removed from the sentences 22 | PUNCTUATIONS = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-", \ 23 | ".", "?", "!", ",", ":", "-", "--", "...", ";"] 24 | 25 | class PTBTokenizer: 26 | """Python wrapper of Stanford PTBTokenizer""" 27 | 28 | def tokenize(self, captions_for_image): 29 | cmd = ['java', '-cp', STANFORD_CORENLP_3_4_1_JAR, \ 30 | 'edu.stanford.nlp.process.PTBTokenizer', \ 31 | '-preserveLines', '-lowerCase'] 32 | 33 | # ====================================================== 34 | # prepare data for PTB Tokenizer 35 | # ====================================================== 36 | final_tokenized_captions_for_image = {} 37 | image_id = [k for k, v in captions_for_image.items() for _ in range(len(v))] 38 | sentences = '\n'.join([c['caption'].replace('\n', ' ') for k, v in captions_for_image.items() for c in v]) 39 | 40 | # ====================================================== 41 | # save sentences to temporary file 42 | # ====================================================== 43 | path_to_jar_dirname=os.path.dirname(os.path.abspath(__file__)) 44 | tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=path_to_jar_dirname) 45 | tmp_file.write(sentences.encode()) 46 | tmp_file.close() 47 | 48 | # ====================================================== 49 | # tokenize sentence 50 | # ====================================================== 51 | cmd.append(os.path.basename(tmp_file.name)) 52 | p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, \ 53 | stdout=subprocess.PIPE) 54 | 55 | token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0] 56 | 57 | token_lines = token_lines.decode() 58 | lines = token_lines.split('\n') 59 | # remove temp file 60 
| os.remove(tmp_file.name) 61 | 62 | # ====================================================== 63 | # create dictionary for tokenized captions 64 | # ====================================================== 65 | for k, line in zip(image_id, lines): 66 | if not k in final_tokenized_captions_for_image: 67 | final_tokenized_captions_for_image[k] = [] 68 | tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \ 69 | if w not in PUNCTUATIONS]) 70 | final_tokenized_captions_for_image[k].append(tokenized_caption) 71 | 72 | return final_tokenized_captions_for_image 73 | -------------------------------------------------------------------------------- /pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar -------------------------------------------------------------------------------- /train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | function help(){ 4 | printf "Usage :\n"; 5 | printf " $0 \n" 6 | exit; 7 | } 8 | 9 | if [[ $# < 1 ]] 10 | then 11 | help; 12 | fi 13 | 14 | file="train_$1"; 15 | 16 | if [ ! -f "training/$file.py" ] 17 | then 18 | printf "File ${file} not found\n"; 19 | printf "enter correct model name \n"; 20 | exit; 21 | fi 22 | 23 | python -m training.$file "${@:2}"; 24 | -------------------------------------------------------------------------------- /training/train_edlp.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import time 4 | 5 | import torch 6 | import torch.nn as nn 7 | import torch.optim as optim 8 | import torch.utils.data as Data 9 | from tensorboardX import SummaryWriter 10 | from tqdm import tqdm 11 | 12 | from misc import net_utils, utils 13 | from misc.dataloader import Dataloader 14 | from misc.train_util import dump_samples, evaluate_scores, save_model 15 | from models.enc_dec_dis import ParaphraseGenerator 16 | 17 | 18 | def main(): 19 | 20 | # get arguments --- 21 | 22 | parser = utils.make_parser() 23 | args = parser.parse_args() 24 | 25 | # build model 26 | 27 | # # get data 28 | data = Dataloader(args.input_json, args.input_ques_h5) 29 | 30 | # # make op 31 | op = { 32 | "vocab_sz": data.getVocabSize(), 33 | "max_seq_len": data.getSeqLength(), 34 | "emb_hid_dim": args.emb_hid_dim, 35 | "emb_dim": args.emb_dim, 36 | "enc_dim": args.enc_dim, 37 | "enc_dropout": args.enc_dropout, 38 | "enc_rnn_dim": args.enc_rnn_dim, 39 | "gen_rnn_dim": args.gen_rnn_dim, 40 | "gen_dropout": args.gen_dropout, 41 | "lr": args.learning_rate, 42 | "epochs": args.n_epoch 43 | } 44 | 45 | # # instantiate paraphrase generator 46 | pgen = ParaphraseGenerator(op) 47 | 48 | # setup logging 49 | logger = SummaryWriter(os.path.join(LOG_DIR, TIME + args.name)) 50 | # subprocess.run(['mkdir', os.path.join(GEN_DIR, TIME)], check=False) 51 | os.makedirs(os.path.join(GEN_DIR, TIME), exist_ok=True) 52 | # ready model for training 53 | 54 | train_loader = Data.DataLoader( 55 | Data.Subset(data, range(args.train_dataset_len)), 56 | batch_size=args.batch_size, 57 | shuffle=True, 58 | ) 59 | test_loader = Data.DataLoader(Data.Subset( 60 | data, 61 | range(args.train_dataset_len, 62 | args.train_dataset_len + args.val_dataset_len)), 63 | batch_size=args.batch_size, 64 | shuffle=True) 65 | 66 | pgen_optim = 
optim.RMSprop(pgen.parameters(), lr=op["lr"]) 67 | pgen.train() 68 | 69 | # train model 70 | pgen = pgen.to(DEVICE) 71 | cross_entropy_loss = nn.CrossEntropyLoss(ignore_index=data.PAD_token) 72 | 73 | for epoch in range(op["epochs"]): 74 | 75 | epoch_l1 = 0 76 | epoch_l2 = 0 77 | itr = 0 78 | ph = [] 79 | pph = [] 80 | gpph = [] 81 | pgen.train() 82 | 83 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 84 | train_loader, ascii=True, desc="epoch" + str(epoch)): 85 | 86 | phrase = phrase.to(DEVICE) 87 | para_phrase = para_phrase.to(DEVICE) 88 | 89 | out, enc_out, enc_sim_phrase = pgen( 90 | phrase.t(), 91 | sim_phrase=para_phrase.t(), 92 | train=True, 93 | ) 94 | 95 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 96 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 97 | 98 | pgen_optim.zero_grad() 99 | (loss_1 + loss_2).backward() 100 | 101 | pgen_optim.step() 102 | 103 | # accumulate results 104 | 105 | epoch_l1 += loss_1.item() 106 | epoch_l2 += loss_2.item() 107 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 108 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 109 | gpph += net_utils.decode_sequence(data.ix_to_word, 110 | torch.argmax(out, dim=-1).t()) 111 | 112 | itr += 1 113 | torch.cuda.empty_cache() 114 | 115 | # log results 116 | 117 | logger.add_scalar("l2_train", epoch_l2 / itr, epoch) 118 | logger.add_scalar("l1_train", epoch_l1 / itr, epoch) 119 | 120 | scores = evaluate_scores(gpph, pph) 121 | 122 | for key in scores: 123 | logger.add_scalar(key + "_train", scores[key], epoch) 124 | 125 | dump_samples(ph, pph, gpph, 126 | os.path.join(GEN_DIR, TIME, 127 | str(epoch) + "_train.txt")) 128 | 129 | # start validation 130 | 131 | epoch_l1 = 0 132 | epoch_l2 = 0 133 | itr = 0 134 | ph = [] 135 | pph = [] 136 | gpph = [] 137 | pgen.eval() 138 | 139 | with torch.no_grad(): 140 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 141 | test_loader, ascii=True, desc="val" + str(epoch)): 142 | 143 | phrase = phrase.to(DEVICE) 144 | para_phrase = para_phrase.to(DEVICE) 145 | 146 | out, enc_out, enc_sim_phrase = pgen(phrase.t(), 147 | sim_phrase=para_phrase.t()) 148 | 149 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 150 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 151 | 152 | epoch_l1 += loss_1.item() 153 | epoch_l2 += loss_2.item() 154 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 155 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 156 | gpph += net_utils.decode_sequence( 157 | data.ix_to_word, 158 | torch.argmax(out, dim=-1).t()) 159 | 160 | itr += 1 161 | torch.cuda.empty_cache() 162 | 163 | logger.add_scalar("l2_val", epoch_l2 / itr, epoch) 164 | logger.add_scalar("l1_val", epoch_l1 / itr, epoch) 165 | 166 | scores = evaluate_scores(gpph, pph) 167 | 168 | for key in scores: 169 | logger.add_scalar(key + "_val", scores[key], epoch) 170 | 171 | dump_samples(ph, pph, gpph, 172 | os.path.join(GEN_DIR, TIME, 173 | str(epoch) + "_val.txt")) 174 | save_model(pgen, pgen_optim, epoch, os.path.join(SAVE_DIR, TIME, str(epoch))) 175 | 176 | # wrap ups 177 | logger.close() 178 | print("Done !!") 179 | 180 | 181 | if __name__ == "__main__": 182 | 183 | LOG_DIR = 'logs' 184 | SAVE_DIR = 'save' 185 | GEN_DIR = 'samples' 186 | HOME = './' 187 | TIME = time.strftime("%Y%m%d_%H%M%S") 188 | DEVICE = torch.device( 189 | 'cuda') if torch.cuda.is_available() else torch.device('cpu') 190 | main() 191 | 
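The `JointEmbeddingLoss` term in the trainer above is the pair-wise discriminator signal from `misc/net_utils.py`, which this dump does not include. As a rough, non-authoritative sketch of that pair-wise idea only — the margin value, the dot-product similarity, and the mean reduction here are assumptions, not the repository's actual code:

```python
import torch

def joint_embedding_loss_sketch(enc_out, enc_sim, margin=0.5):
    # enc_out, enc_sim: (batch, dim) encodings of the generated paraphrase and
    # the ground-truth paraphrase; row i of the two tensors forms a true pair,
    # and every cross-row combination acts as a negative pair.
    sim = enc_out @ enc_sim.t()      # (batch, batch) pairwise similarities
    pos = sim.diag().unsqueeze(1)    # similarity of each true pair
    # hinge: each mismatched pair should score at least `margin` below the
    # true pair from the same row
    loss = (margin + sim - pos).clamp(min=0)
    loss.fill_diagonal_(0)           # the true pairs themselves carry no loss
    return loss.mean()
```

Per `train.sh` above, this trainer is selected by model name, e.g. `./train.sh edlp --batch_size 32 --n_epoch 20` (the flag values are illustrative; `train.sh` forwards everything after the model name to `training/train_edlp.py`).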
-------------------------------------------------------------------------------- /training/train_edlps.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import time 4 | 5 | import torch 6 | import torch.nn as nn 7 | import torch.optim as optim 8 | import torch.utils.data as Data 9 | from tensorboardX import SummaryWriter 10 | from tqdm import tqdm 11 | 12 | from misc import net_utils, utils 13 | from misc.dataloader import Dataloader 14 | from misc.train_util import dump_samples, evaluate_scores, save_model 15 | from models.enc_dec_sh_dis import ParaphraseGenerator 16 | 17 | 18 | def main(): 19 | 20 | # get arguments --- 21 | 22 | parser = utils.make_parser() 23 | args = parser.parse_args() 24 | 25 | # build model 26 | 27 | # # get data 28 | data = Dataloader(args.input_json, args.input_ques_h5) 29 | 30 | # # make op 31 | op = { 32 | "vocab_sz": data.getVocabSize(), 33 | "max_seq_len": data.getSeqLength(), 34 | "emb_hid_dim": args.emb_hid_dim, 35 | "emb_dim": args.emb_dim, 36 | "enc_dim": args.enc_dim, 37 | "enc_dropout": args.enc_dropout, 38 | "enc_rnn_dim": args.enc_rnn_dim, 39 | "gen_rnn_dim": args.gen_rnn_dim, 40 | "gen_dropout": args.gen_dropout, 41 | "lr": args.learning_rate, 42 | "epochs": args.n_epoch 43 | } 44 | 45 | # # instantiate paraphrase generator 46 | pgen = ParaphraseGenerator(op) 47 | 48 | # setup logging 49 | logger = SummaryWriter(os.path.join(LOG_DIR, TIME + args.name)) 50 | # subprocess.run(['mkdir', os.path.join(GEN_DIR, TIME), os.path.join(SAVE_DIR, TIME)], check=False) 51 | os.makedirs(os.path.join(GEN_DIR, TIME), exist_ok=True) 52 | # ready model for training 53 | 54 | train_loader = Data.DataLoader( 55 | Data.Subset(data, range(args.train_dataset_len)), 56 | batch_size=args.batch_size, 57 | shuffle=True, 58 | ) 59 | test_loader = Data.DataLoader(Data.Subset( 60 | data, 61 | range(args.train_dataset_len, 62 | args.train_dataset_len + args.val_dataset_len)), 63 | batch_size=args.batch_size, 64 | shuffle=True) 65 | 66 | pgen_optim = optim.RMSprop(pgen.parameters(), lr=op["lr"]) 67 | pgen.train() 68 | 69 | # train model 70 | pgen = pgen.to(DEVICE) 71 | cross_entropy_loss = nn.CrossEntropyLoss(ignore_index=data.PAD_token) 72 | 73 | for epoch in range(op["epochs"]): 74 | 75 | epoch_l1 = 0 76 | epoch_l2 = 0 77 | itr = 0 78 | ph = [] 79 | pph = [] 80 | gpph = [] 81 | pgen.train() 82 | 83 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 84 | train_loader, ascii=True, desc="epoch" + str(epoch)): 85 | 86 | phrase = phrase.to(DEVICE) 87 | para_phrase = para_phrase.to(DEVICE) 88 | 89 | out, enc_out, enc_sim_phrase = pgen( 90 | phrase.t(), 91 | sim_phrase=para_phrase.t(), 92 | train=True, 93 | ) 94 | 95 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 96 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 97 | 98 | pgen_optim.zero_grad() 99 | (loss_1 + loss_2).backward() 100 | 101 | pgen_optim.step() 102 | 103 | # accumulate results 104 | 105 | epoch_l1 += loss_1.item() 106 | epoch_l2 += loss_2.item() 107 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 108 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 109 | gpph += net_utils.decode_sequence(data.ix_to_word, 110 | torch.argmax(out, dim=-1).t()) 111 | 112 | itr += 1 113 | torch.cuda.empty_cache() 114 | 115 | # log results 116 | 117 | logger.add_scalar("l2_train", epoch_l2 / itr, epoch) 118 | logger.add_scalar("l1_train", epoch_l1 / itr, epoch) 119 | 120 | scores = 
evaluate_scores(gpph, pph) 121 | 122 | for key in scores: 123 | logger.add_scalar(key + "_train", scores[key], epoch) 124 | 125 | dump_samples(ph, pph, gpph, 126 | os.path.join(GEN_DIR, TIME, 127 | str(epoch) + "_train.txt")) 128 | # start validation 129 | 130 | epoch_l1 = 0 131 | epoch_l2 = 0 132 | itr = 0 133 | ph = [] 134 | pph = [] 135 | gpph = [] 136 | pgen.eval() 137 | 138 | with torch.no_grad(): 139 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 140 | test_loader, ascii=True, desc="val" + str(epoch)): 141 | 142 | phrase = phrase.to(DEVICE) 143 | para_phrase = para_phrase.to(DEVICE) 144 | 145 | out, enc_out, enc_sim_phrase = pgen(phrase.t(), 146 | sim_phrase=para_phrase.t()) 147 | 148 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 149 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 150 | 151 | epoch_l1 += loss_1.item() 152 | epoch_l2 += loss_2.item() 153 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 154 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 155 | gpph += net_utils.decode_sequence( 156 | data.ix_to_word, 157 | torch.argmax(out, dim=-1).t()) 158 | 159 | itr += 1 160 | torch.cuda.empty_cache() 161 | 162 | logger.add_scalar("l2_val", epoch_l2 / itr, epoch) 163 | logger.add_scalar("l1_val", epoch_l1 / itr, epoch) 164 | 165 | scores = evaluate_scores(gpph, pph) 166 | 167 | for key in scores: 168 | logger.add_scalar(key + "_val", scores[key], epoch) 169 | 170 | dump_samples(ph, pph, gpph, 171 | os.path.join(GEN_DIR, TIME, 172 | str(epoch) + "_val.txt")) 173 | 174 | save_model(pgen, pgen_optim, epoch, os.path.join(SAVE_DIR, TIME, str(epoch))) 175 | 176 | # wrap ups 177 | logger.close() 178 | print("Done !!") 179 | 180 | 181 | if __name__ == "__main__": 182 | 183 | LOG_DIR = 'logs' 184 | SAVE_DIR = 'save' 185 | GEN_DIR = 'samples' 186 | HOME = './' 187 | TIME = time.strftime("%Y%m%d_%H%M%S") 188 | DEVICE = torch.device( 189 | 'cuda') if torch.cuda.is_available() else torch.device('cpu') 190 | main() 191 | --------------------------------------------------------------------------------
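For quick sanity checks outside the training loop, the vendored scorers can be driven directly with the dict format asserted in each `compute_score` above. A minimal sketch — the ids and sentences are made up; `Meteor` and `PTBTokenizer` additionally need Java on the PATH, and CIDEr's idf statistics are only meaningful over a reasonably large reference set:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# one candidate per id in res, one or more references per id in gts;
# sentences must already be tokenized (whitespace-separated)
res = {0: ["what is the best way to learn python"]}
gts = {0: ["how do i learn python quickly", "how can i learn python fast"]}

for scorer in (Bleu(4), Rouge(), Cider()):
    score, per_image = scorer.compute_score(gts, res)
    print(scorer.method(), score)
```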