├── .gitignore ├── .gitmodules ├── LICENSE ├── README.md ├── data └── quora_dataset │ ├── quora_data_prepro.h5 │ └── quora_data_prepro.json ├── environment.yml ├── misc ├── __init__.py ├── dataloader.py ├── net_utils.py ├── train_util.py └── utils.py ├── models ├── enc_dec_dis.py └── enc_dec_sh_dis.py ├── prepro ├── prepro_quora.py └── quora_prepro.py ├── pycocoevalcap ├── .gitignore ├── README.md ├── bleu │ ├── LICENSE │ ├── bleu.py │ └── bleu_scorer.py ├── cider │ ├── cider.py │ └── cider_scorer.py ├── eval.py ├── license.txt ├── meteor │ ├── data │ │ └── paraphrase-en.gz │ ├── meteor-1.5.jar │ └── meteor.py ├── rouge │ └── rouge.py ├── setup.py └── tokenizer │ ├── ptbtokenizer.py │ └── stanford-corenlp-3.4.1.jar ├── train.sh └── training ├── train_edlp.py └── train_edlps.py /.gitignore: -------------------------------------------------------------------------------- 1 | tst.py 2 | *.pyc 3 | */__pycache__/* 4 | runs/* 5 | runs 6 | ignore.txt 7 | .vscode 8 | .vscode/* 9 | logs 10 | logs/* 11 | save 12 | save/* 13 | samples 14 | samples/* 15 | result 16 | result/* 17 | */__pycache__/* 18 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "pycocoevalcap"] 2 | path = pycocoevalcap 3 | url = https://github.com/salaniz/pycocoevalcap.git 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Dev Chauhan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Paraphrase Question Generator using Shared Discriminator 2 | 3 | PyTorch code for a Paraphrase Question Generator. This codebase is built upon the paper [Learning Semantic Sentence Embeddings using Pair-wise Discriminator](https://www.aclweb.org/anthology/C18-1230.pdf). 4 | 5 | If you want the exact code used for the experiments, head over to the `orig-code` branch. This branch provides the main models from the above paper in a more readable and usable form. 6 | 7 | ### Requirements and Setup 8 | 9 | ##### Use Anaconda or Miniconda 10 | 11 | 1. 
Install a Python 3 distribution of Anaconda or Miniconda from their [downloads site](https://conda.io/docs/user-guide/install/download.html). 12 | 2. Clone this repository and create an environment: 13 | 14 | ``` 15 | git clone https://www.github.com/dev-chauhan/PQG-pytorch 16 | cd PQG-pytorch 17 | conda env create -f environment.yml 18 | 19 | # activate the environment 20 | conda activate PQG 21 | ``` 22 | 3. After that, install [tensorboardX](https://github.com/lanpa/tensorboardX) for logging. 23 | ``` 24 | pip install tensorboardX 25 | ``` 26 | ### Dataset 27 | 28 | You can either download the following files directly into the `data` folder or generate them by following the process shown below. 29 | ##### Data Files 30 | Download all the data files from here. 31 | - [quora_data_prepro.h5](https://figshare.com/s/5463afb24cba05629cdf) 32 | - [quora_data_prepro.json](https://figshare.com/s/5463afb24cba05629cdf) 33 | - [quora_raw_train.json](https://figshare.com/s/5463afb24cba05629cdf) 34 | - [quora_raw_val.json](https://figshare.com/s/5463afb24cba05629cdf) 35 | - [quora_raw_test.json](https://figshare.com/s/5463afb24cba05629cdf) 36 | 37 | 38 | ##### Download Dataset 39 | We referred to [neuraltalk2](https://github.com/karpathy/neuraltalk2) and [Text-to-Image Synthesis](https://github.com/reedscot/icml2016) while preparing our code base. The first thing you need to do is download the Quora Question Pairs dataset from the [Quora Question Pair website](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) and place it in the `data` folder. 40 | 41 | If you want to train from scratch, continue reading; if you just want to evaluate using a pretrained model, head over to the `Data Files` section above, download the data files (put all of them in the `data` folder), and run `score.py` to evaluate the pretrained model. 42 | 43 | Now we need to do some preprocessing. Head over to the `prepro` folder and run 44 | 45 | ``` 46 | $ cd prepro 47 | $ python quora_prepro.py 48 | ``` 49 | 50 | **Note** The above command generates JSON files with 100K question pairs for the train set, 5K for the validation set, and 30K for the test set. 51 | If you instead want to use only 50K question pairs for training (with the rest unchanged), you need to make some minor changes in the above file. This step generates `quora_raw_train.json`, `quora_raw_val.json`, and `quora_raw_test.json` under the `data` folder. 52 | 53 | ##### Preprocess Paraphrase Question 54 | 55 | ``` 56 | $ python prepro_quora.py --input_train_json ../data/quora_raw_train.json --input_test_json ../data/quora_raw_test.json 57 | ``` 58 | This will generate two files in the `data/` folder, `quora_data_prepro.h5` and `quora_data_prepro.json`. 59 | 60 | ### Training 61 | 62 | ``` 63 | $ ./train.sh --n_epoch <number of epochs> 64 | ``` 65 | 66 | You can change the training and validation dataset lengths by adding the arguments `--train_dataset_len` and `--val_dataset_len`, which default to `100000` and `30000`, the maximum values. 67 | 68 | There are other arguments to experiment with as well, such as `--batch_size`, `--learning_rate`, `--drop_prob_lm`, etc. 69 | 70 | You can resume training using the `--start_from` argument, which takes the path of a saved model checkpoint; a sketch of how such a checkpoint can be restored is shown below.
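Checkpoints are written by `save_model` in `misc/train_util.py` with the keys `epoch`, `model`, and `model_opt`. A minimal sketch of restoring one (how `model` and `model_opt` are constructed is assumed to come from your training script):

```python
import torch

def load_checkpoint(path, model, model_opt):
    # checkpoint layout follows misc/train_util.save_model
    checkpoint = torch.load(path, map_location='cpu')
    model.load_state_dict(checkpoint['model'])
    model_opt.load_state_dict(checkpoint['model_opt'])
    return checkpoint['epoch']  # epoch to resume from
```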
71 | ### Save and log 72 | 73 | First you have to make the empty directories `save`, `samples`, and `logs`. 74 | For each training run there will be a directory with a unique name in `save`. Saved models are `.tar` files; one model file per epoch is saved in that directory. 75 | 76 | The `samples` directory contains a directory with the same unique name as above, which holds a `.txt` file per epoch (`<epoch>_train.txt` or `<epoch>_val.txt`) with the paraphrases generated by the model on the corresponding data set at the end of that epoch. 77 | 78 | Logs for training and evaluation are stored in the `logs` directory and can be viewed with `tensorboard` by running the following command. 79 | ``` 80 | tensorboard --logdir <path to logs directory> 81 | ``` 82 | This command will tell you where to view your logs in a browser; commonly it is `localhost:6006`, but you can change the port using the `--port` argument of the above command. 83 | 84 | ### Results 85 | Following are the results of some models on the 100K Quora question pairs dataset. 86 | 87 | Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr | 88 | ---|--|--|--|--|--|--|--| 89 | EDL|0.4162|0.2578|0.1724|0.1219|0.4191|0.3244|0.6189| 90 | EDLPS|0.4754|0.3160|0.2249|0.1672|0.4781|0.3488|1.0949| 91 | 92 | Following are the results of some models on the 50K Quora question pairs dataset. 93 | 94 | Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr | 95 | ---|--|--|--|--|--|--|--| 96 | EDL|0.3877|0.2336|0.1532|0.1067|0.3913|0.3133|0.4550| 97 | EDLPS|0.4553|0.2981 |0.2105|0.1560|0.4583|0.3421|0.9690| 98 | 
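These scores are computed with the bundled `pycocoevalcap` toolkit through `evaluate_scores` in `misc/train_util.py`. A minimal sketch of scoring a few sentence pairs (run from the repository root; the METEOR metric additionally needs Java):

```python
from misc.train_util import evaluate_scores

# toy reference / generated paraphrase pairs, index-aligned
real = ["how do i learn python quickly", "what is machine learning"]
generated = ["how can i learn python fast", "what does machine learning mean"]

scores = evaluate_scores(real, generated)
for name, value in scores.items():  # e.g. Bleu_1 ... CIDEr
    print(name, value)
```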
99 | ### Reference 100 | 101 | If you use this code as part of any published research, please acknowledge the following papers 102 | 103 | ``` 104 | @inproceedings{patro2018learning, 105 | title={Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator}, 106 | author={Patro, Badri Narayana and Kurmi, Vinod Kumar and Kumar, Sandeep and Namboodiri, Vinay}, 107 | booktitle={Proceedings of the 27th International Conference on Computational Linguistics}, 108 | pages={2715--2729}, 109 | year={2018} 110 | } 111 | 112 | 113 | @article{PATRO2021149, 114 | title = {Revisiting paraphrase question generator using pairwise discriminator}, 115 | author = {Badri N. Patro and Dev Chauhan and Vinod K. Kurmi and Vinay P. Namboodiri}, 116 | journal = {Neurocomputing}, 117 | volume = {420}, 118 | pages = {149-161}, 119 | year = {2021}, 120 | issn = {0925-2312}, 121 | doi = {https://doi.org/10.1016/j.neucom.2020.08.022}, 122 | url = {https://www.sciencedirect.com/science/article/pii/S0925231220312820} 123 | } 124 | ``` 125 | 126 | ## Contributors 127 | * [Dev Chauhan][1] (devgiri@iitk.ac.in) 128 | * [Badri N. Patro][2] (badri@iitk.ac.in) 129 | * [Vinod K. Kurmi][3] (vinodkk@iitk.ac.in) 130 | 131 | [1]: https://github.com/dev-chauhan 132 | [2]: https://github.com/badripatro 133 | [3]: https://github.com/vinodkkurmi 134 | 135 | 136 | 137 | -------------------------------------------------------------------------------- /data/quora_dataset/quora_data_prepro.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/data/quora_dataset/quora_data_prepro.h5 -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: PQG 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _libgcc_mutex=0.1=main 6 | - attrs=19.1.0=py37_1 7 | - backcall=0.1.0=py37_0 8 | - blas=1.0=mkl 9 | - bleach=3.1.0=py37_0 10 | - ca-certificates=2019.11.27=0 11 | - certifi=2019.11.28=py37_0 12 | - cffi=1.12.3=py37h2e261b9_0 13 | - cycler=0.10.0=py37_0 14 | - dbus=1.13.6=h746ee38_0 15 | - decorator=4.4.0=py37_1 16 | - defusedxml=0.6.0=py_0 17 | - entrypoints=0.3=py37_0 18 | - expat=2.2.6=he6710b0_0 19 | - fontconfig=2.13.0=h9420a91_0 20 | - freetype=2.9.1=h8a8886c_1 21 | - glib=2.56.2=hd408876_0 22 | - gmp=6.1.2=h6c8ec71_1 23 | - gst-plugins-base=1.14.0=hbbd80ab_1 24 | - gstreamer=1.14.0=hb453b48_1 25 | - h5py=2.9.0=py37h7918eee_0 26 | - hdf5=1.10.4=hb1b8bf9_0 27 | - icu=58.2=h9c2bf20_1 28 | - intel-openmp=2019.4=243 29 | - ipykernel=5.1.1=py37h39e3cac_0 30 | - ipython=7.7.0=py37h39e3cac_0 31 | - ipython_genutils=0.2.0=py37_0 32 | - ipywidgets=7.5.0=py_0 33 | - jedi=0.13.3=py37_0 34 | - jinja2=2.10.1=py37_0 35 | - jpeg=9b=h024ee3a_2 36 | - jsonschema=3.0.1=py37_0 37 | - jupyter=1.0.0=py37_7 38 | - jupyter_client=5.3.1=py_0 39 | - jupyter_console=6.0.0=py37_0 40 | - jupyter_core=4.5.0=py_0 41 | - kiwisolver=1.1.0=py37he6710b0_0 42 | - libedit=3.1.20181209=hc058e9b_0 43 | - libffi=3.2.1=hd88cf55_4 44 | - libgcc-ng=9.1.0=hdf63c60_0 45 | - libgfortran-ng=7.3.0=hdf63c60_0 46 | - libpng=1.6.37=hbc83047_0 47 | - libsodium=1.0.16=h1bed415_0 48 | - libstdcxx-ng=9.1.0=hdf63c60_0 49 | - libtiff=4.0.10=h2733197_2 50 | - libuuid=1.0.3=h1bed415_2 51 | - libxcb=1.13=h1bed415_1 52 | - libxml2=2.9.9=hea5a465_1 53 | - markupsafe=1.1.1=py37h7b6447c_0 54 | - matplotlib=3.1.0=py37h5429711_0 55 | - mistune=0.8.4=py37h7b6447c_0 56 | - mkl=2019.4=243 57 | - mkl-service=2.0.2=py37h7b6447c_0 58 | - mkl_fft=1.0.12=py37ha843d7b_0 59 | - mkl_random=1.0.2=py37hd81dba3_0 60 | - nbconvert=5.5.0=py_0 61 | - nbformat=4.4.0=py37_0 62 | - ncurses=6.1=he6710b0_1 63 | - ninja=1.9.0=py37hfd86e86_0 64 | - nltk=3.4.4=py37_0 65 | - notebook=6.0.0=py37_0 66 | - numpy=1.16.4=py37h7e9f1db_0 67 | - numpy-base=1.16.4=py37hde5b4d6_0 68 | - olefile=0.46=py37_0 69 | - openjdk=8.0.152=h46b5887_1 70 | - openssl=1.1.1d=h7b6447c_3 71 | - pandoc=2.2.3.2=0 72 | - pandocfilters=1.4.2=py37_1 73 | - parso=0.5.0=py_0 74 | - pcre=8.43=he6710b0_0 75 | - pexpect=4.7.0=py37_0 76 | - pickleshare=0.7.5=py37_0 77 | - pillow=6.1.0=py37h34e0f95_0 78 | - pip=19.1.1=py37_0 79 | - prometheus_client=0.7.1=py_0 80 | - prompt_toolkit=2.0.9=py37_0 81 | - ptyprocess=0.6.0=py37_0 82 | - pycparser=2.19=py37_0 83 | - pygments=2.4.2=py_0 84 | - pyparsing=2.4.0=py_0 85 | - pyqt=5.9.2=py37h05f1152_2 86 | - pyrsistent=0.14.11=py37h7b6447c_0 87 | - python=3.7.3=h0371630_0 88 | - python-dateutil=2.8.0=py37_0 89 | - pytz=2019.1=py_0 90 | - pyzmq=18.0.0=py37he6710b0_0 91 | - qt=5.9.7=h5867ecd_1 92 | - 
qtconsole=4.5.2=py_0 93 | - readline=7.0=h7b6447c_5 94 | - scipy=1.3.0=py37h7c811a0_0 95 | - send2trash=1.5.0=py37_0 96 | - setuptools=41.0.1=py37_0 97 | - sip=4.19.8=py37hf484d3e_0 98 | - six=1.12.0=py37_0 99 | - sqlite=3.29.0=h7b6447c_0 100 | - terminado=0.8.2=py37_0 101 | - testpath=0.4.2=py37_0 102 | - tk=8.6.8=hbc83047_0 103 | - tornado=6.0.3=py37h7b6447c_0 104 | - tqdm=4.41.1=py_0 105 | - traitlets=4.3.2=py37_0 106 | - wcwidth=0.1.7=py37_0 107 | - webencodings=0.5.1=py37_1 108 | - wheel=0.33.4=py37_0 109 | - widgetsnbextension=3.5.0=py37_0 110 | - xz=5.2.4=h14c3975_4 111 | - zeromq=4.3.1=he6710b0_3 112 | - zlib=1.2.11=h7b6447c_3 113 | - zstd=1.3.7=h0b5b093_0 114 | - pip: 115 | - bandit==1.6.2 116 | - gitdb2==2.0.6 117 | - gitpython==3.0.5 118 | - imageio==2.6.1 119 | - pbr==5.4.4 120 | - progressbar==2.5 121 | - progressbar2==3.47.0 122 | - protobuf==3.11.2 123 | - python-utils==2.3.0 124 | - pyyaml==5.2 125 | - smmap2==2.0.5 126 | - stevedore==1.31.0 127 | - tensorboardx==2.0 128 | - torch==1.4.0 129 | - torchvision==0.3.0 130 | -------------------------------------------------------------------------------- /misc/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/misc/__init__.py -------------------------------------------------------------------------------- /misc/dataloader.py: -------------------------------------------------------------------------------- 1 | import h5py 2 | import json 3 | import torch 4 | import torch.utils.data as data 5 | 6 | 7 | class Dataloader(data.Dataset): 8 | 9 | def __init__(self, input_json_file_path, input_ques_h5_path): 10 | 11 | super(Dataloader, self).__init__() 12 | print('Reading', input_json_file_path) 13 | 14 | with open(input_json_file_path) as input_file: 15 | data_dict = json.load(input_file) 16 | 17 | self.ix_to_word = {} 18 | 19 | for k in data_dict['ix_to_word']: 20 | self.ix_to_word[int(k)] = data_dict['ix_to_word'][k] 21 | 22 | self.UNK_token = 0 23 | 24 | if 0 not in self.ix_to_word: 25 | self.ix_to_word[0] = '<UNK>' 26 | 27 | else : 28 | raise Exception('index 0 of the vocabulary is reserved for <UNK>') 29 | 30 | self.EOS_token = len(self.ix_to_word) 31 | self.ix_to_word[self.EOS_token] = '<EOS>' 32 | self.PAD_token = len(self.ix_to_word) 33 | self.ix_to_word[self.PAD_token] = '<PAD>' 34 | self.SOS_token = len(self.ix_to_word) 35 | self.ix_to_word[self.SOS_token] = '<SOS>' 36 | self.vocab_size = len(self.ix_to_word) 37 | print('DataLoader loading h5 question file:', input_ques_h5_path) 38 | qa_data = h5py.File(input_ques_h5_path, 'r') 39 | 40 | ques_id_train = torch.from_numpy(qa_data['ques_cap_id_train'][...].astype(int)) 41 | 42 | ques_train, ques_len_train = self.process_data(torch.from_numpy(qa_data['ques_train'][...].astype(int)), torch.from_numpy(qa_data['ques_length_train'][...].astype(int))) 43 | 44 | label_train, label_len_train = self.process_data(torch.from_numpy(qa_data['ques1_train'][...].astype(int)), torch.from_numpy(qa_data['ques1_length_train'][...].astype(int))) 45 | 46 | self.train_id = 0 47 | self.seq_length = ques_train.size()[1] 48 | 49 | print('Training dataset length : ', ques_train.size()[0]) 50 | 51 | 52 | ques_test, ques_len_test = self.process_data(torch.from_numpy(qa_data['ques_test'][...].astype(int)), torch.from_numpy(qa_data['ques_length_test'][...].astype(int))) 53 | 54 | label_test, label_len_test = self.process_data(torch.from_numpy(qa_data['ques1_test'][...].astype(int)), torch.from_numpy(qa_data['ques1_length_test'][...].astype(int))) 55 | 56 | ques_id_test = torch.from_numpy(qa_data['ques_cap_id_test'][...].astype(int)) 57 | 58 | self.test_id = 0 59 | 60 | print('Test dataset length : ', ques_test.size()[0]) 61 | qa_data.close() 62 | 63 | self.ques = torch.cat([ques_train, ques_test]) 64 | self.len = torch.cat([ques_len_train, ques_len_test]) 65 | self.label = torch.cat([label_train, label_test]) 66 | self.label_len = torch.cat([label_len_train, label_len_test]) 67 | self.id = torch.cat([ques_id_train, ques_id_test]) 68 | 69 | def process_data(self, data, data_len):  # wrap each sequence with <SOS> ... <EOS>, pad with <PAD> 70 | N = data.size()[0] 71 | new_data = torch.zeros(N, data.size()[1] + 2, dtype=torch.long) + self.PAD_token 72 | for i in range(N): 73 | new_data[i, 1:data_len[i]+1] = data[i, :data_len[i]] 74 | new_data[i, 0] = self.SOS_token 75 | new_data[i, data_len[i]+1] = self.EOS_token 76 | data_len[i] += 2 77 | return new_data, data_len 78 | 79 | def __len__(self): 80 | return self.len.size()[0] 81 | 82 | def __getitem__(self, idx): 83 | return (self.ques[idx], self.len[idx], self.label[idx], self.label_len[idx], self.id[idx]) 84 | 85 | def getVocabSize(self): 86 | return self.vocab_size 87 | 88 | def getDataNum(self, split): 89 | if split == 1: 90 | return 100000 91 | 92 | if split == 2: 93 | return 30000 94 | 95 | def getSeqLength(self): 96 | return self.seq_length 97 |
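if __name__ == "__main__":
    # Illustrative usage sketch (not part of the original module): wrap the
    # dataset in a standard PyTorch DataLoader for batching. Assumes the
    # preprocessed files from the README exist under data/quora_dataset/.
    dataset = Dataloader('data/quora_dataset/quora_data_prepro.json',
                         'data/quora_dataset/quora_data_prepro.h5')
    loader = data.DataLoader(dataset, batch_size=150, shuffle=True)
    ques, ques_len, label, label_len, pair_id = next(iter(loader))
    print(ques.size(), label.size())  # (batch, seq_len + 2) index tensors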
-------------------------------------------------------------------------------- /misc/net_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import misc.utils as utils 3 | 4 | 5 | def decode_sequence(ix_to_word, seq): 6 | N, D = seq.size()[0], seq.size()[1] 7 | out = [] 8 | EOS_flag = False 9 | for i in range(N): 10 | txt = '' 11 | for j in range(D): 12 | ix = seq[i, j] 13 | if int(ix.item()) not in ix_to_word: 14 | print("UNK token ", str(ix.item())) 15 | word = ix_to_word[len(ix_to_word) - 1] 16 | else: 17 | word = ix_to_word[int(ix.item())] 18 | if word == '<EOS>': 19 | txt = txt + ' ' 20 | txt = txt + word 21 | break 22 | if word == '<PAD>' or word == '<SOS>': 23 | # skip padding and start-of-sequence tokens 24 | continue 25 | if j > 0: 26 | txt = txt + ' ' 27 | txt = txt + word 28 | out.append(txt) 29 | return out 30 | 31 | def prob2pred(prob): 32 | # sample one word index per time step from log-probabilities 33 | return torch.multinomial(torch.exp(prob.view(-1, prob.size(-1))), 1).view(prob.size(0), prob.size(1)) 34 | 35 | def JointEmbeddingLoss(feature_emb1, feature_emb2): 36 | # ranking loss: push each mismatched pair (i, j) at least a margin of 1 below the matched pair (j, j) 37 | batch_size = feature_emb1.size()[0] 38 | 39 | return torch.sum( 40 | torch.clamp( 41 | torch.mm(feature_emb1, feature_emb2.t()) - torch.sum(feature_emb1 * feature_emb2, dim=-1) + 1, 42 | min=0.0 43 | ) 44 | ) / (batch_size * batch_size) 45 |
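if __name__ == "__main__":
    # Illustrative sketch (not part of the original module): sample word ids
    # from a toy uniform distribution and decode them back to text.
    vocab = {0: '<UNK>', 1: 'how', 2: 'are', 3: 'you', 4: '<EOS>'}  # toy vocab
    logp = torch.log(torch.ones(2, 3, len(vocab)) / len(vocab))  # (batch, steps, vocab)
    pred = prob2pred(logp)  # (2, 3) tensor of sampled word ids
    print(decode_sequence(vocab, pred))  # e.g. ['how are you', 'are <EOS>']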
-------------------------------------------------------------------------------- /misc/train_util.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import misc.utils as utils 3 | import misc.net_utils as net_utils 4 | import torch.nn as nn 5 | from pycocoevalcap.eval import COCOEvalCap 6 | import os 7 | 8 | 9 | def getObjsForScores(real_sents, pred_sents): 10 | class coco: 11 | def __init__(self, sents): 12 | self.sents = sents 13 | self.imgToAnns = [[{'caption': sents[i]}] for i in range(len(sents))] 14 | 15 | def getImgIds(self): 16 | return [i for i in range(len(self.sents))] 17 | 18 | return coco(real_sents), coco(pred_sents) 19 | 20 | 21 | def evaluate_scores(s1, s2): 22 | 23 | ''' 24 | calculates scores and returns a dict mapping score names to values 25 | ''' 26 | coco, cocoRes = getObjsForScores(s1, s2) 27 | 28 | evalObj = COCOEvalCap(coco, cocoRes) 29 | 30 | evalObj.evaluate() 31 | 32 | return evalObj.eval 33 | 34 | 35 | def dump_samples(ph, pph, gpph, file_name): 36 | # write (phrase, paraphrase, generated paraphrase) triples to a text file 37 | file = open(file_name, "w") 38 | 39 | for r, s, t in zip(ph, pph, gpph): 40 | file.write("ph : " + r + "\npph : " + s + "\ngpph : " + t + '\n\n') 41 | file.close() 42 | 43 | 44 | # def save_model(encoder, generator, model_optim, epoch, it, local_loss, global_loss, save_folder, folder, discriminator=None, discriminatorg=None): 45 | 46 | # PATH = os.path.join(save_folder, folder, str(epoch) + '_' + str(it) + '.tar') 47 | 48 | # checkpoint = { 49 | # 'epoch': epoch, 50 | # 'iter': it, 51 | # 'encoder_state_dict': encoder.state_dict(), 52 | # 'generator_state_dict': generator.state_dict(), 53 | # 'optimizer_state_dict': model_optim.state_dict(), 54 | # 'local_loss': local_loss, 55 | # 'global_loss': global_loss 56 | # } 57 | # if discriminator is not None: 58 | # checkpoint['discriminator_state_dict'] = discriminator.state_dict() 59 | # if discriminatorg is not None: 60 | # checkpoint['discriminatorg_state_dict'] = discriminatorg.state_dict() 61 | 62 | # torch.save(checkpoint, PATH) 63 | 64 | def save_model(model, model_opt, epoch, save_file): 65 | 66 | checkpoint = { 67 | 'epoch': epoch, 68 | 'model': model.state_dict(), 69 | 'model_opt': model_opt.state_dict() 70 | } 71 | 72 | torch.save(checkpoint, save_file) 73 | -------------------------------------------------------------------------------- /misc/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import argparse 3 | 4 | 5 | def one_hot(t, c): 6 | return torch.zeros(*t.size(), c, device=t.device).scatter_(-1, t.unsqueeze(-1), 1) 7 |
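# Worked example of one_hot (toy values assumed):
#   t = tensor([[1, 0]]) with shape (1, 2) and c = 3 gives shape (1, 2, 3):
#   [[[0., 1., 0.],
#     [1., 0., 0.]]]
# i.e. a 1 is scattered along a new trailing axis at each token's index.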
8 | 9 | def make_parser(): 10 | parser = argparse.ArgumentParser() 11 | 12 | parser.add_argument('--input_ques_h5',default='data/quora_dataset/quora_data_prepro.h5',help='path to the h5file containing the preprocessed dataset') 13 | parser.add_argument('--input_json',default='data/quora_dataset/quora_data_prepro.json',help='path to the json file containing additional info and vocab') 14 | 15 | # starting point 16 | parser.add_argument('--start_from', default='None', help='path to a model checkpoint to initialize model weights from. Empty = don\'t') 17 | 18 | # # Model settings 19 | parser.add_argument('--model', default='EDLPS', help='which model to use? EDL|EDP|EDLP|EDLPS|EDLPG|EDLPGS|EDG|EDPG') 20 | parser.add_argument('--batch_size', type=int, default=150, help='what is the batch size in number of images per batch? (there will be x seq_per_img sentences)') 21 | parser.add_argument('--input_encoding_size', type=int, default=512,help='the encoding size of each token in the vocabulary, and the image.') 22 | parser.add_argument('--att_size', type=int, default=512, help='size of the attention vector, which refers to k in the paper') 23 | parser.add_argument('--emb_size',type=int, default=512, help='the size after embedding from one-hot') 24 | parser.add_argument('--rnn_layers',type=int, default=1, help='number of rnn layers') 25 | parser.add_argument('--train_dataset_len', type=int, default=100000, help='length of train dataset') 26 | parser.add_argument('--val_dataset_len', type=int, default=30000, help='length of validation dataset') 27 | 28 | # Optimization 29 | parser.add_argument('--optim',default='rmsprop',help='what update to use? rmsprop|sgd|sgdmom|adagrad|adam') 30 | parser.add_argument('--learning_rate',default=0.0008,help='learning rate', type=float)#0.0001,#0.0002,#0.005 31 | parser.add_argument('--learning_rate_decay_start', default=5, type=int, help='at what epoch to start decaying learning rate? (-1 = don\'t)')#learning_rate_decay_start', 100, 32 | parser.add_argument('--learning_rate_decay_every', type=int, default=5, help='every how many epochs thereafter to drop LR by half?')#-learning_rate_decay_every', 1500, 33 | parser.add_argument('--momentum',type=float, default=0.9,help='momentum') 34 | parser.add_argument('--optim_alpha',type=float, default=0.8,help='alpha for adagrad/rmsprop/momentum/adam')#optim_alpha',0.99 35 | parser.add_argument('--optim_beta',type=float, default=0.999,help='beta used for adam')#optim_beta',0.995 36 | parser.add_argument('--optim_epsilon',type=float, default=1e-8,help='epsilon that goes into denominator in rmsprop') 37 | parser.add_argument('--max_iters', type=int, default=-1, help='max number of iterations to run for (-1 = run forever)') 38 | parser.add_argument('--iterPerEpoch', default=1250, type=int) 39 | parser.add_argument('--drop_prob_lm', type=float, default=0.5, help='strength of dropout in the Language Model RNN') 40 | parser.add_argument('--n_epoch', type=int, default=1, help='number of epochs during training') 41 | 42 | # Evaluation/Checkpointing 43 | 44 | parser.add_argument('--save', default='Results', help='save directory') 45 | parser.add_argument('--checkpoint_dir', default='Results/checkpoints', help='folder to save checkpoints into (empty = this folder)') 46 | parser.add_argument('--language_eval', type=int, default=1, help='Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 47 | parser.add_argument('--val_images_use', type=int, default=24800, help='how many images to use when periodically evaluating the validation loss? (-1 = all)') 48 | parser.add_argument('--save_every', type=int, default=500, help='how often to save a model checkpoint?') 49 | parser.add_argument('--log_every', type=int , default=100, help='How often do we snapshot losses, for inclusion in the progress dump? (0 = disable)') 50 | 51 | # misc 52 | parser.add_argument('--backend', default='cudnn', help='nn|cudnn') 53 | parser.add_argument('--name', default='', help='an id identifying this run/job. used in cross-val and appended when writing progress files') 54 | parser.add_argument('--seed', type=int, default=1234, help='random number generator seed to use') 55 | parser.add_argument('--gpuid', type=int, default=-1, help='which gpu to use. 
-1 = use CPU') 56 | parser.add_argument('--nGPU', type=int, default=3, help='Number of GPUs to use by default') 57 | 58 | #text encoder 59 | parser.add_argument('--emb_dim',type=int, default=512,help='dim of word embedding') 60 | parser.add_argument('--emb_hid_dim', type=int, default=256,help='hidden dim of word embedding') 61 | parser.add_argument('--enc_dropout', type=float, default=0.5,help='dropout for encoder module') 62 | parser.add_argument('--enc_rnn_dim', default=512, type=int, help='size of the rnn in number of hidden nodes in each layer of gru in encoder') 63 | parser.add_argument('--enc_dim', type=int, default=512,help='size of the encoded sentence') 64 | parser.add_argument('--gen_rnn_dim', default=512, type=int, help='size of the rnn in number of hidden nodes in each layer of lstm in generator') 65 | parser.add_argument('--gen_dropout',type=float, default=0.5,help='dropout for generator module') 66 | 67 | return parser 68 | -------------------------------------------------------------------------------- /models/enc_dec_dis.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | import misc.utils as utils 5 | 6 | 7 | class ParaphraseGenerator(nn.Module): 8 | """ 9 | pytorch module which generates paraphrase of given phrase 10 | """ 11 | def __init__(self, op): 12 | 13 | super(ParaphraseGenerator, self).__init__() 14 | 15 | # encoder : 16 | self.emb_layer = nn.Sequential( 17 | nn.Linear(op["vocab_sz"], op["emb_hid_dim"]), 18 | nn.Threshold(0.000001, 0), 19 | nn.Linear(op["emb_hid_dim"], op["emb_dim"]), 20 | nn.Threshold(0.000001, 0)) 21 | self.enc_rnn = nn.GRU(op["emb_dim"], op["enc_rnn_dim"]) 22 | self.enc_lin = nn.Sequential( 23 | nn.Dropout(op["enc_dropout"]), 24 | nn.Linear(op["enc_rnn_dim"], op["enc_dim"])) 25 | 26 | # generator : 27 | self.gen_emb = nn.Embedding(op["vocab_sz"], op["emb_dim"]) 28 | self.gen_rnn = nn.LSTM(op["enc_dim"], op["gen_rnn_dim"]) 29 | self.gen_lin = nn.Sequential( 30 | nn.Dropout(op["gen_dropout"]), 31 | nn.Linear(op["gen_rnn_dim"], op["vocab_sz"]), 32 | nn.LogSoftmax(dim=-1)) 33 | 34 | # pair-wise discriminator : 35 | self.dis_emb_layer = nn.Sequential( 36 | nn.Linear(op["vocab_sz"], op["emb_hid_dim"]), 37 | nn.Threshold(0.000001, 0), 38 | nn.Linear(op["emb_hid_dim"], op["emb_dim"]), 39 | nn.Threshold(0.000001, 0), 40 | ) 41 | self.dis_rnn = nn.GRU(op["emb_dim"], op["enc_rnn_dim"]) 42 | self.dis_lin = nn.Sequential( 43 | nn.Dropout(op["enc_dropout"]), 44 | nn.Linear(op["enc_rnn_dim"], op["enc_dim"])) 45 | 46 | # some useful constants : 47 | self.max_seq_len = op["max_seq_len"] 48 | self.vocab_sz = op["vocab_sz"] 49 | 50 | def forward(self, phrase, sim_phrase=None, train=False): 51 | """ 52 | forward pass 53 | 54 | inputs :- 55 | 56 | phrase : given phrase , shape = (max sequence length, batch size) 57 | sim_phrase : paraphrase of the given phrase (used as the target when train == True), shape = (max seq length, batch sz) 58 | train : if true, teacher forcing is used to train the module 59 | 60 | outputs :- 61 | 62 | out : generated paraphrase (log-probabilities), shape = (max sequence length, batch size, vocab size) 63 | enc_out : encoded generated paraphrase, shape=(batch size, enc_dim) 64 | enc_sim_phrase : encoded sim_phrase, shape=(batch size, enc_dim) 65 | 66 | """ 67 | 68 | if sim_phrase is None: 69 | sim_phrase = phrase 70 | 71 | if train: 72 | 73 | # encode input phrase 74 | enc_phrase = self.enc_lin( 75 | self.enc_rnn( 76 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 77 | 78 | # generate similar phrase using teacher forcing
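            # Sketch of the tensor flow at this step (T = max sequence length,
            # B = batch size; enc_dim == emb_dim in the default config):
            #   enc_phrase          : (1, B, enc_dim)  summary of the input phrase
            #   gen_emb(sim_phrase) : (T, B, emb_dim)  embedded ground-truth paraphrase
            # The encoder summary is prepended, the last target token dropped, so
            # the LSTM conditions every step on the ground-truth previous word
            # instead of its own prediction (teacher forcing).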
79 | emb_sim_phrase_gen = self.gen_emb(sim_phrase) 80 | out_rnn, _ = self.gen_rnn( 81 | torch.cat([enc_phrase, emb_sim_phrase_gen[:-1, :]], dim=0)) 82 | out = self.gen_lin(out_rnn) 83 | 84 | # propagated from shared discriminator to calculate 85 | # pair-wise discriminator loss 86 | enc_sim_phrase = self.dis_lin( 87 | self.dis_rnn( 88 | self.dis_emb_layer(utils.one_hot(sim_phrase, 89 | self.vocab_sz)))[1]) 90 | enc_out = self.dis_lin( 91 | self.dis_rnn(self.dis_emb_layer(torch.exp(out)))[1]) 92 | 93 | else: 94 | 95 | # encode input phrase 96 | enc_phrase = self.enc_lin( 97 | self.enc_rnn( 98 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 99 | 100 | # generate paraphrase by sampling one word at a time (no teacher forcing) 101 | words = [] 102 | h = None 103 | for __ in range(self.max_seq_len): 104 | word, h = self.gen_rnn(enc_phrase, hx=h) 105 | word = self.gen_lin(word) 106 | words.append(word) 107 | word = torch.multinomial(torch.exp(word[0]), 1) 108 | word = word.t() 109 | enc_phrase = self.gen_emb(word) 110 | out = torch.cat(words, dim=0) 111 | 112 | # propagated from shared discriminator to calculate 113 | # pair-wise discriminator loss 114 | enc_sim_phrase = self.dis_lin( 115 | self.dis_rnn( 116 | self.dis_emb_layer(utils.one_hot(sim_phrase, 117 | self.vocab_sz)))[1]) 118 | enc_out = self.dis_lin( 119 | self.dis_rnn(self.dis_emb_layer(torch.exp(out)))[1]) 120 | 121 | enc_out.squeeze_(0) 122 | enc_sim_phrase.squeeze_(0) 123 | return out, enc_out, enc_sim_phrase 124 |
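if __name__ == "__main__":
    # Illustrative sketch (not part of the original module): one forward pass
    # on random token ids. Hidden sizes follow the defaults in misc/utils.py;
    # vocab_sz and max_seq_len here are made-up toy values.
    op = {"vocab_sz": 50, "emb_hid_dim": 256, "emb_dim": 512,
          "enc_rnn_dim": 512, "enc_dim": 512, "gen_rnn_dim": 512,
          "enc_dropout": 0.5, "gen_dropout": 0.5, "max_seq_len": 7}
    model = ParaphraseGenerator(op)
    phrase = torch.randint(0, op["vocab_sz"], (op["max_seq_len"], 4))  # (T, B)
    out, enc_out, enc_sim = model(phrase, sim_phrase=phrase, train=True)
    print(out.size(), enc_out.size(), enc_sim.size())  # (T, B, V) (B, enc_dim) (B, enc_dim)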
-------------------------------------------------------------------------------- /models/enc_dec_sh_dis.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | import misc.utils as utils 5 | 6 | 7 | class ParaphraseGenerator(nn.Module): 8 | """ 9 | pytorch module which generates paraphrase of given phrase 10 | """ 11 | def __init__(self, op): 12 | 13 | super(ParaphraseGenerator, self).__init__() 14 | 15 | # encoder | shared pair-wise discriminator: 16 | self.emb_layer = nn.Sequential( 17 | nn.Linear(op["vocab_sz"], op["emb_hid_dim"]), 18 | nn.Threshold(0.000001, 0), 19 | nn.Linear(op["emb_hid_dim"], op["emb_dim"]), 20 | nn.Threshold(0.000001, 0)) 21 | self.enc_rnn = nn.GRU(op["emb_dim"], op["enc_rnn_dim"]) 22 | self.enc_lin = nn.Sequential( 23 | nn.Dropout(op["enc_dropout"]), 24 | nn.Linear(op["enc_rnn_dim"], op["enc_dim"])) 25 | 26 | # generator : 27 | self.gen_emb = nn.Embedding(op["vocab_sz"], op["emb_dim"]) 28 | self.gen_rnn = nn.LSTM(op["enc_dim"], op["gen_rnn_dim"]) 29 | self.gen_lin = nn.Sequential( 30 | nn.Dropout(op["gen_dropout"]), 31 | nn.Linear(op["gen_rnn_dim"], op["vocab_sz"]), 32 | nn.LogSoftmax(dim=-1)) 33 | 34 | # some useful constants : 35 | self.max_seq_len = op["max_seq_len"] 36 | self.vocab_sz = op["vocab_sz"] 37 | 38 | def forward(self, phrase, sim_phrase=None, train=False): 39 | """ 40 | forward pass 41 | 42 | inputs :- 43 | 44 | phrase : given phrase , shape = (max sequence length, batch size) 45 | sim_phrase : paraphrase of the given phrase (used as the target when train == True), shape = (max seq length, batch sz) 46 | train : if true, teacher forcing is used to train the module 47 | 48 | outputs :- 49 | 50 | out : generated paraphrase (log-probabilities), shape = (max sequence length, batch size, vocab size) 51 | enc_out : encoded generated paraphrase, shape=(batch size, enc_dim) 52 | enc_sim_phrase : encoded sim_phrase, shape=(batch size, enc_dim) 53 | 54 | """ 55 | 56 | if sim_phrase is None: 57 | sim_phrase = phrase 58 | 59 | if train: 60 | 61 | # encode input phrase 62 | enc_phrase = self.enc_lin( self.enc_rnn( 63 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 64 | 65 | # generate similar phrase using teacher forcing 66 | emb_sim_phrase_gen = self.gen_emb(sim_phrase) 67 | out_rnn, _ = self.gen_rnn( 68 | torch.cat([enc_phrase, emb_sim_phrase_gen[:-1, :]], dim=0)) 69 | out = self.gen_lin(out_rnn) 70 | 71 | # encode similar phrase and generated output 72 | # (propagated from shared discriminator) to calculate 73 | # pair-wise discriminator loss 74 | enc_sim_phrase = self.enc_lin( 75 | self.enc_rnn( 76 | self.emb_layer(utils.one_hot(sim_phrase, 77 | self.vocab_sz)))[1]) 78 | enc_out = self.enc_lin( 79 | self.enc_rnn(self.emb_layer(torch.exp(out)))[1]) 80 | 81 | else: 82 | 83 | # encode input phrase 84 | enc_phrase = self.enc_lin( 85 | self.enc_rnn( 86 | self.emb_layer(utils.one_hot(phrase, self.vocab_sz)))[1]) 87 | 88 | # generate paraphrase by sampling one word at a time (no teacher forcing) 89 | words = [] 90 | h = None 91 | for __ in range(self.max_seq_len): 92 | word, h = self.gen_rnn(enc_phrase, hx=h) 93 | word = self.gen_lin(word) 94 | words.append(word) 95 | word = torch.multinomial(torch.exp(word[0]), 1) 96 | word = word.t() 97 | enc_phrase = self.gen_emb(word) 98 | out = torch.cat(words, dim=0) 99 | 100 | # encode similar phrase and generated output 101 | # (propagated from shared discriminator) to calculate 102 | # pair-wise discriminator loss 103 | enc_sim_phrase = self.enc_lin( 104 | self.enc_rnn( 105 | self.emb_layer(utils.one_hot(sim_phrase, 106 | self.vocab_sz)))[1]) 107 | enc_out = self.enc_lin( 108 | self.enc_rnn(self.emb_layer(torch.exp(out)))[1]) 109 | 110 | enc_out.squeeze_(0) 111 | enc_sim_phrase.squeeze_(0) 112 | return out, enc_out, enc_sim_phrase 113 | -------------------------------------------------------------------------------- /prepro/prepro_quora.py: -------------------------------------------------------------------------------- 1 | """ 2 | Preprocesses a raw json dataset into hdf5/json files. 3 | python prepro_quora.py --input_train_json ../data/quora_raw_train.json --input_test_json ../data/quora_raw_test.json 4 | To get the question features. You will also see some question statistics in the terminal output. This will generate two files in the data directory, quora_data_prepro.h5 and quora_data_prepro.json. 5 | Also in this code a lot of places you will find things related to captions, but they actually correspond to paraphrases. Reuse of previous code :p 6 | """ 7 | import copy 8 | from random import shuffle, seed 9 | import sys 10 | import os.path 11 | import argparse 12 | import glob 13 | import numpy as np 14 | 15 | import scipy.io 16 | import pdb 17 | import string 18 | import h5py 19 | from nltk.tokenize import word_tokenize 20 | import json 21 | 22 | import re 23 | import math 24 | 25 | 26 | def tokenize(sentence): 27 | return [i for i in re.split(r"([-.\"',:? 
!\$#@~()*&\^%;\[\]/\\\+<>\n=])", sentence) if i != '' and i != ' ' and i != '\n'] 28 | 29 | def prepro_question(imgs, params): 30 | 31 | # preprocess all the questions 32 | print('example processed tokens:') 33 | for i,img in enumerate(imgs): 34 | s = img['question'] 35 | if params['token_method'] == 'nltk': 36 | txt = word_tokenize(s) 37 | else: 38 | txt = tokenize(s) 39 | img['processed_tokens'] = txt 40 | if i < 10: print(txt) 41 | if i % 1000 == 0: 42 | sys.stdout.write("processing %d/%d (%.2f%% done) \r" % (i, len(imgs), i*100.0/len(imgs)) ) 43 | sys.stdout.flush() 44 | return imgs 45 | 46 | def prepro_question1(imgs, params): 47 | 48 | # preprocess all the paraphrase questions 49 | print ('example processed tokens:') 50 | for i,img in enumerate(imgs): 51 | s = img['question1'] 52 | if params['token_method'] == 'nltk': 53 | txt_c = word_tokenize(s) 54 | else: 55 | txt_c = tokenize(s) 56 | 57 | img['processed_tokens_caption'] = txt_c #this name is a bit misleading, it is for paraphrase questions actually. 58 | if i < 10: print (txt_c) 59 | if i % 1000 == 0: 60 | sys.stdout.write("processing %d/%d (%.2f%% done) \r" % (i, len(imgs), i*100.0/len(imgs)) ) 61 | sys.stdout.flush() 62 | return imgs 63 | 64 | 65 | def build_vocab_question(imgs5, params):#imgs1,imgs2,imgs3,imgs4,imgs5,imgs6,imgs7,imgs8, 66 | # build vocabulary for question and answers. 67 | 68 | count_thr = params['word_count_threshold'] 69 | 70 | # count up the number of words 71 | counts = {} 72 | 73 | for img in imgs5: 74 | for w in img['processed_tokens']: 75 | counts[w] = counts.get(w, 0) + 1 76 | 77 | cw = sorted([(count,w) for w,count in counts.items()], reverse=True) 78 | print ('top words and their counts:') 79 | print ('\n'.join(map(str,cw[:20]))) 80 | 81 | # print some stats 82 | total_words = sum(counts.values()) 83 | print ('total words:', total_words) 84 | bad_words = [w for w,n in counts.items() if n <= count_thr] 85 | vocab = [w for w,n in counts.items() if n > count_thr] # will incorporate vocab for both caption and question 86 | bad_count = sum(counts[w] for w in bad_words) 87 | print ('number of bad words: %d/%d = %.2f%%' % (len(bad_words), len(counts), len(bad_words)*100.0/len(counts))) 88 | print ('number of words in vocab would be %d' % (len(vocab), )) 89 | print ('number of UNKs: %d/%d = %.2f%%' % (bad_count, total_words, bad_count*100.0/total_words)) 90 | 91 | 92 | # lets now produce the final annotation 93 | # additional special UNK token we will use below to map infrequent words to 94 | print ('inserting the special UNK token') 95 | vocab.append('UNK') 96 | 97 | 98 | for img in imgs5: 99 | txt = img['processed_tokens'] 100 | question = [w if counts.get(w,0) > count_thr else 'UNK' for w in txt] 101 | img['final_question'] = question 102 | txt_c = img['processed_tokens_caption'] 103 | caption = [w if counts.get(w,0) > count_thr else 'UNK' for w in txt_c] 104 | img['final_caption'] = caption 105 | 106 | return imgs5,vocab#, imgs1,imgs2,imgs3,imgs4,imgs5,imgs6,imgs7,imgs8, vocab 107 | 108 | def apply_vocab_question(imgs, wtoi): ## this is for val or test question and caption 109 | # apply the vocab on test. 
110 | for img in imgs: 111 | txt = img['processed_tokens'] 112 | question = [w if wtoi.get(w,len(wtoi)+1) != (len(wtoi)+1) else 'UNK' for w in txt] 113 | img['final_question'] = question 114 | txt_c = img['processed_tokens_caption'] 115 | caption = [w if w in wtoi else 'UNK' for w in txt_c] 116 | img['final_caption'] = caption 117 | 118 | return imgs 119 | 120 | def encode_question2(imgs, params, wtoi): 121 | 122 | max_length = params['max_length'] 123 | N = len(imgs) 124 | 125 | label_arrays = np.zeros((N, max_length), dtype='uint32') 126 | label_length = np.zeros(N, dtype='uint32') 127 | question_id = np.zeros(N, dtype='uint32') 128 | question_counter = 0 129 | 130 | caption_arrays = np.zeros((N, max_length), dtype='uint32') # will store encoding caption words 131 | caption_length = np.zeros(N, dtype='uint32')# will store encoding caption words 132 | 133 | 134 | for i,img in enumerate(imgs): 135 | question_id[question_counter] = img['id'] #unique_id 136 | label_length[question_counter] = min(max_length, len(img['final_question'])) # record the length of this question sequence 137 | caption_length[question_counter] = min(max_length, len(img['final_caption'])) # record the length of this caption sequence 138 | question_counter += 1 139 | for k,w in enumerate(img['final_question']): 140 | if k < max_length: 141 | label_arrays[i,k] = wtoi[w] 142 | for k,w in enumerate(img['final_caption']): ## this is for caption 143 | if k < max_length: 144 | caption_arrays[i,k] = wtoi[w] 145 | 146 | return label_arrays, label_length, question_id, caption_arrays, caption_length 147 | 148 | 149 | def main(params): 150 | 151 | imgs_train5 = json.load(open(params['input_train_json5'], 'r')) 152 | imgs_test5 = json.load(open(params['input_test_json5'], 'r')) 153 | 154 | 155 | 156 | ##seed(123) # make reproducible 157 | ##shuffle(imgs_train) # shuffle the order 158 | 159 | 160 | # tokenization and preprocessing training question 161 | imgs_train5 = prepro_question(imgs_train5, params) 162 | # tokenization and preprocessing test question 163 | imgs_test5 = prepro_question(imgs_test5, params) 164 | 165 | # tokenization and preprocessing training paraphrase question 166 | imgs_train5 = prepro_question1(imgs_train5, params) 167 | # tokenization and preprocessing test paraphrase question 168 | imgs_test5 = prepro_question1(imgs_test5, params) 169 | 170 | 171 | # create the vocab for question 172 | imgs_train5,vocab = build_vocab_question(imgs_train5, params) 173 | 174 | 175 | itow = {i+1:w for i,w in enumerate(vocab)} # a 1-indexed vocab translation table 176 | wtoi = {w:i+1 for i,w in enumerate(vocab)} # inverse table 177 | 178 | 179 | ques_train5, ques_length_train5, question_id_train5 , cap_train5, cap_length_train5 = encode_question2(imgs_train5, params, wtoi) 180 | 181 | 182 | imgs_test5 = apply_vocab_question(imgs_test5, wtoi) 183 | 184 | 185 | ques_test5, ques_length_test5, question_id_test5 , cap_test5, cap_length_test5 = encode_question2(imgs_test5, params, wtoi) 186 | 187 | 188 | 189 | 190 | N = len(imgs_train5) 191 | f = h5py.File(params['output_h55'], "w") 192 | ## for train information 193 | f.create_dataset("ques_train", dtype='uint32', data=ques_train5) 194 | f.create_dataset("ques_length_train", dtype='uint32', data=ques_length_train5) 195 | f.create_dataset("ques_cap_id_train", dtype='uint32', data=question_id_train5)#this is actually the ques_cap_id 196 | f.create_dataset("ques1_train", dtype='uint32', data=cap_train5) 197 | f.create_dataset("ques1_length_train", dtype='uint32', 
data=cap_length_train5) 198 | 199 | 200 | ## for test information 201 | f.create_dataset("ques_test", dtype='uint32', data=ques_test5) 202 | f.create_dataset("ques_length_test", dtype='uint32', data=ques_length_test5) 203 | f.create_dataset("ques_cap_id_test", dtype='uint32', data=question_id_test5) 204 | f.create_dataset("ques1_test", dtype='uint32', data=cap_test5) 205 | f.create_dataset("ques1_length_test", dtype='uint32', data=cap_length_test5) 206 | 207 | f.close() 208 | print ('wrote ', params['output_h55']) 209 | 210 | # create output json file 211 | 212 | out = {} 213 | out['ix_to_word'] = itow # encode the (1-indexed) vocab 214 | json.dump(out, open(params['output_json5'], 'w')) 215 | print ('wrote ', params['output_json5']) 216 | 217 | 218 | if __name__ == "__main__": 219 | 220 | parser = argparse.ArgumentParser() 221 | 222 | # input and output jsons and h5 223 | parser.add_argument('--input_train_json5', default='../data/quora_raw_train.json', help='input json file to process into hdf5') 224 | parser.add_argument('--input_test_json5', default='../data/quora_raw_test.json', help='input json file to process into hdf5') 225 | parser.add_argument('--num_ans', default=1000, type=int, help='number of top answers for the final classifications.') 226 | parser.add_argument('--output_json5', default='../data/quora_data_prepro.json', help='output json file') 227 | parser.add_argument('--output_h55', default='../data/quora_data_prepro.h5', help='output h5 file') 228 | 229 | 230 | # options 231 | parser.add_argument('--max_length', default=26, type=int, help='max length of a caption, in number of words. captions longer than this get clipped.') 232 | parser.add_argument('--word_count_threshold', default=0, type=int, help='only words that occur more than this number of times will be put in vocab') 233 | parser.add_argument('--token_method', default='nltk', help='tokenization method.') 234 | parser.add_argument('--num_test', default=0, type=int, help='number of test images (to withhold until the very end)') 235 | parser.add_argument('--batch_size', default=10, type=int) 236 | 237 | args = parser.parse_args() 238 | params = vars(args) # convert to ordinary dict 239 | print ('parsed input parameters:') 240 | print (json.dumps(params, indent = 2)) 241 | main(params) 242 | -------------------------------------------------------------------------------- /prepro/quora_prepro.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import json 3 | import os 4 | 5 | def main(): 6 | out = [] 7 | outtest = [] 8 | outval = [] 9 | with open('../data/quora_duplicate_questions.tsv','r') as tsvin: 10 | tsvin = csv.reader(tsvin, delimiter='\t')#read the tsv file of quora question pairs
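        # Expected layout of each TSV row (Quora question pairs release):
        #   row[0] id            unique id for the pair
        #   row[1] qid1          id of the first question
        #   row[2] qid2          id of the second (paraphrase) question
        #   row[3] question1     first question text
        #   row[4] question2     second question text
        #   row[5] is_duplicate  '1' if the two questions are paraphrases, else '0'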
11 | count0 = 1 12 | count1 = 1 13 | counter = 1 14 | for row in tsvin: 15 | counter = counter+1 16 | if row[5]=='0' and row[4][-1:]=='?':#the 6th entry in every row has value 0 or 1 and it represents paraphrases if that value is 1 17 | count0=count0+1 18 | elif row[5]=='1' and row[4][-1:]=='?': 19 | count1=count1+1 20 | if count1>1 and count1<100002:#taking the starting 1 lakh pairs as the train set. Change this to 50002 for taking the starting 50K examples as the train set 21 | # get the question and unique id from the tsv file 22 | quesid = row[1] #first question id 23 | ques = row[3] #first question 24 | img_id = row[0] #unique id for every pair 25 | ques1 = row[4]#paraphrase question 26 | quesid1 =row[2]#paraphrase question id 27 | 28 | # set the parameters of json file for writing 29 | jimg = {} 30 | 31 | jimg['question'] = ques 32 | jimg['question1'] = ques1 33 | jimg['ques_id'] = quesid 34 | jimg['ques_id1'] = quesid1 35 | jimg['id'] = img_id 36 | 37 | out.append(jimg) 38 | 39 | elif count1>100001 and count1<130002:#next 30k as the test set according to https://arxiv.org/pdf/1711.00279.pdf 40 | quesid = row[1] 41 | ques = row[3] 42 | img_id = row[0] 43 | ques1 = row[4] 44 | quesid1 =row[2] 45 | 46 | jimg = {} 47 | 48 | jimg['question'] = ques 49 | jimg['question1'] = ques1 50 | jimg['ques_id'] = quesid 51 | jimg['ques_id1'] = quesid1 52 | jimg['id'] = img_id 53 | 54 | outtest.append(jimg) 55 | else :#rest as val 56 | quesid = row[1] 57 | ques = row[3] 58 | img_id = row[0] 59 | ques1 = row[4] 60 | quesid1 =row[2] 61 | 62 | jimg = {} 63 | jimg['question'] = ques 64 | jimg['question1'] = ques1 65 | jimg['ques_id'] = quesid 66 | jimg['ques_id1'] = quesid1 67 | jimg['id'] = img_id 68 | 69 | outval.append(jimg) 70 | #write the json files for train test and val 71 | print(len(out)) 72 | json.dump(out, open('../data/quora_raw_train.json', 'w')) 73 | print(len(outtest)) 74 | json.dump(outtest, open('../data/quora_raw_test.json', 'w')) 75 | print(len(outval)) 76 | json.dump(outval, open('../data/quora_raw_val.json', 'w')) 77 | 78 | 79 | if __name__ == "__main__": 80 | main() 81 | -------------------------------------------------------------------------------- /pycocoevalcap/.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | -------------------------------------------------------------------------------- /pycocoevalcap/README.md: -------------------------------------------------------------------------------- 1 | Microsoft COCO Caption Evaluation 2 | =================== 3 | 4 | Evaluation codes for MS COCO caption generation. 5 | 6 | ## Description ## 7 | This repository provides Python 3 support for the caption evaluation metrics used for the MS COCO dataset. 8 | 9 | The code is derived from the original repository that supports Python 2.7: https://github.com/tylin/coco-caption. 10 | Caption evaluation depends on the COCO API that natively supports Python 3 (see Requirements). 11 | 12 | ## Requirements ## 13 | - Java 1.8.0 14 | - Python 3 15 | - pycocotools (COCO Python API): https://github.com/cocodataset/cocoapi 16 | 17 | ## Installation ## 18 | To install pycocoevalcap and the pycocotools dependency, run: 19 | ``` 20 | pip install git+https://github.com/salaniz/pycocoevalcap 21 | ``` 22 | 23 | ## Files ## 24 | ./ 25 | - eval.py: The file includes the COCOEvalCap class that can be used to evaluate results on COCO. 
26 | - tokenizer: Python wrapper of Stanford CoreNLP PTBTokenizer 27 | - bleu: Bleu evaluation codes 28 | - meteor: Meteor evaluation codes 29 | - rouge: Rouge-L evaluation codes 30 | - cider: CIDEr evaluation codes 31 | 32 | ## References ## 33 | 34 | - [Microsoft COCO Captions: Data Collection and Evaluation Server](http://arxiv.org/abs/1504.00325) 35 | - PTBTokenizer: We use the [Stanford Tokenizer](http://nlp.stanford.edu/software/tokenizer.shtml) which is included in [Stanford CoreNLP 3.4.1](http://nlp.stanford.edu/software/corenlp.shtml). 36 | - BLEU: [BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf) 37 | - Meteor: [Project page](http://www.cs.cmu.edu/~alavie/METEOR/) with related publications. We use the latest version (1.5) of the [Code](https://github.com/mjdenkowski/meteor). Changes have been made to the source code to properly aggregate the statistics for the entire corpus. 38 | - Rouge-L: [ROUGE: A Package for Automatic Evaluation of Summaries](http://anthology.aclweb.org/W/W04/W04-1013.pdf) 39 | - CIDEr: [CIDEr: Consensus-based Image Description Evaluation](http://arxiv.org/pdf/1411.5726.pdf) 40 | 41 | ## Developers ## 42 | - Xinlei Chen (CMU) 43 | - Hao Fang (University of Washington) 44 | - Tsung-Yi Lin (Cornell) 45 | - Ramakrishna Vedantam (Virginia Tech) 46 | 47 | ## Acknowledgement ## 48 | - David Chiang (University of Notre Dame) 49 | - Michael Denkowski (CMU) 50 | - Alexander Rush (Harvard University) 51 | 
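A minimal sketch of calling one of the scorers directly (dictionaries are keyed by an arbitrary id; values are lists of sentences):

```python
from pycocoevalcap.bleu.bleu import Bleu

gts = {0: ["how do i learn python quickly"], 1: ["what is machine learning"]}
res = {0: ["how can i learn python fast"], 1: ["what is machine learning"]}

score, scores = Bleu(4).compute_score(gts, res)
print(score)  # corpus-level [Bleu_1, Bleu_2, Bleu_3, Bleu_4]
```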
-------------------------------------------------------------------------------- /pycocoevalcap/bleu/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015 Xinlei Chen, Hao Fang, Tsung-Yi Lin, and Ramakrishna Vedantam 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 20 | -------------------------------------------------------------------------------- /pycocoevalcap/bleu/bleu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : bleu.py 4 | # 5 | # Description : Wrapper for BLEU scorer. 6 | # 7 | # Creation Date : 06-01-2015 8 | # Last Modified : Thu 19 Mar 2015 09:13:28 PM PDT 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | from .bleu_scorer import BleuScorer 12 | 13 | 14 | class Bleu: 15 | def __init__(self, n=4): 16 | # default: compute Bleu score up to 4-grams 17 | self._n = n 18 | self._hypo_for_image = {} 19 | self.ref_for_image = {} 20 | 21 | def compute_score(self, gts, res): 22 | 23 | assert(gts.keys() == res.keys()) 24 | imgIds = gts.keys() 25 | 26 | bleu_scorer = BleuScorer(n=self._n) 27 | for id in imgIds: 28 | hypo = res[id] 29 | ref = gts[id] 30 | 31 | # Sanity check. 32 | assert(type(hypo) is list) 33 | assert(len(hypo) == 1) 34 | assert(type(ref) is list) 35 | assert(len(ref) >= 1) 36 | 37 | bleu_scorer += (hypo[0], ref) 38 | 39 | #score, scores = bleu_scorer.compute_score(option='shortest') 40 | score, scores = bleu_scorer.compute_score(option='closest', verbose=1) 41 | #score, scores = bleu_scorer.compute_score(option='average', verbose=1) 42 | 43 | # return (bleu, bleu_info) 44 | return score, scores 45 | 46 | def method(self): 47 | return "Bleu" 48 | -------------------------------------------------------------------------------- /pycocoevalcap/bleu/bleu_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # bleu_scorer.py 4 | # David Chiang 5 | 6 | # Copyright (c) 2004-2006 University of Maryland. All rights 7 | # reserved. Do not redistribute without permission from the 8 | # author. Not for commercial use. 9 | 10 | # Modified by: 11 | # Hao Fang 12 | # Tsung-Yi Lin 13 | 14 | '''Provides: 15 | cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test(). 16 | cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked(). 17 | ''' 18 | 19 | import copy 20 | import sys, math, re 21 | from collections import defaultdict 22 | 23 | def precook(s, n=4, out=False): 24 | """Takes a string as input and returns an object that can be given to 25 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 26 | can take string arguments as well.""" 27 | words = s.split() 28 | counts = defaultdict(int) 29 | for k in range(1,n+1): 30 | for i in range(len(words)-k+1): 31 | ngram = tuple(words[i:i+k]) 32 | counts[ngram] += 1 33 | return (len(words), counts) 34 | 35 | def cook_refs(refs, eff=None, n=4): ## lhuang: oracle will call with "average" 36 | '''Takes a list of reference sentences for a single segment 37 | and returns an object that encapsulates everything that BLEU 38 | needs to know about them.''' 39 | 40 | reflen = [] 41 | maxcounts = {} 42 | for ref in refs: 43 | rl, counts = precook(ref, n) 44 | reflen.append(rl) 45 | for (ngram,count) in counts.items(): 46 | maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 47 | 48 | # Calculate effective reference sentence length. 49 | if eff == "shortest": 50 | reflen = min(reflen) 51 | elif eff == "average": 52 | reflen = float(sum(reflen))/len(reflen) 53 | 54 | ## lhuang: N.B.: leave reflen computation to the very end!! 55 | 56 | ## lhuang: N.B.: in case of "closest", keep a list of reflens!! 
(bad design) 57 | 58 | return (reflen, maxcounts) 59 | 60 | def cook_test(test, refs, eff=None, n=4): 61 | '''Takes a test sentence and returns an object that 62 | encapsulates everything that BLEU needs to know about it.''' 63 | 64 | reflen, refmaxcounts = refs 65 | testlen, counts = precook(test, n, True) 66 | 67 | result = {} 68 | 69 | # Calculate effective reference sentence length. 70 | 71 | if eff == "closest": 72 | result["reflen"] = min((abs(l-testlen), l) for l in reflen)[1] 73 | else: ## i.e., "average" or "shortest" or None 74 | result["reflen"] = reflen 75 | 76 | result["testlen"] = testlen 77 | 78 | result["guess"] = [max(0,testlen-k+1) for k in range(1,n+1)] 79 | 80 | result['correct'] = [0]*n 81 | for (ngram, count) in counts.items(): 82 | result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count) 83 | 84 | return result 85 | 86 | class BleuScorer(object): 87 | """Bleu scorer. 88 | """ 89 | 90 | __slots__ = "n", "crefs", "ctest", "_score", "_ratio", "_testlen", "_reflen", "special_reflen" 91 | # special_reflen is used in oracle (proportional effective ref len for a node). 92 | 93 | def copy(self): 94 | ''' copy the refs.''' 95 | new = BleuScorer(n=self.n) 96 | new.ctest = copy.copy(self.ctest) 97 | new.crefs = copy.copy(self.crefs) 98 | new._score = None 99 | return new 100 | 101 | def __init__(self, test=None, refs=None, n=4, special_reflen=None): 102 | ''' singular instance ''' 103 | 104 | self.n = n 105 | self.crefs = [] 106 | self.ctest = [] 107 | self.cook_append(test, refs) 108 | self.special_reflen = special_reflen 109 | 110 | def cook_append(self, test, refs): 111 | '''called by constructor and __iadd__ to avoid creating new instances.''' 112 | 113 | if refs is not None: 114 | self.crefs.append(cook_refs(refs)) 115 | if test is not None: 116 | cooked_test = cook_test(test, self.crefs[-1]) 117 | self.ctest.append(cooked_test) ## N.B.: -1 118 | else: 119 | self.ctest.append(None) # lens of crefs and ctest have to match 120 | 121 | self._score = None ## need to recompute 122 | 123 | def ratio(self, option=None): 124 | self.compute_score(option=option) 125 | return self._ratio 126 | 127 | def score_ratio(self, option=None): 128 | '''return (bleu, len_ratio) pair''' 129 | return (self.fscore(option=option), self.ratio(option=option)) 130 | 131 | def score_ratio_str(self, option=None): 132 | return "%.4f (%.2f)" % self.score_ratio(option) 133 | 134 | def reflen(self, option=None): 135 | self.compute_score(option=option) 136 | return self._reflen 137 | 138 | def testlen(self, option=None): 139 | self.compute_score(option=option) 140 | return self._testlen 141 | 142 | def retest(self, new_test): 143 | if type(new_test) is str: 144 | new_test = [new_test] 145 | assert len(new_test) == len(self.crefs), new_test 146 | self.ctest = [] 147 | for t, rs in zip(new_test, self.crefs): 148 | self.ctest.append(cook_test(t, rs)) 149 | self._score = None 150 | 151 | return self 152 | 153 | def rescore(self, new_test): 154 | ''' replace test(s) with new test(s), and returns the new score.''' 155 | 156 | return self.retest(new_test).compute_score() 157 | 158 | def size(self): 159 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! 
%d<>%d" % (len(self.crefs), len(self.ctest)) 160 | return len(self.crefs) 161 | 162 | def __iadd__(self, other): 163 | '''add an instance (e.g., from another sentence).''' 164 | 165 | if type(other) is tuple: 166 | ## avoid creating new BleuScorer instances 167 | self.cook_append(other[0], other[1]) 168 | else: 169 | assert self.compatible(other), "incompatible BLEUs." 170 | self.ctest.extend(other.ctest) 171 | self.crefs.extend(other.crefs) 172 | self._score = None ## need to recompute 173 | 174 | return self 175 | 176 | def compatible(self, other): 177 | return isinstance(other, BleuScorer) and self.n == other.n 178 | 179 | def single_reflen(self, option="average"): 180 | return self._single_reflen(self.crefs[0][0], option) 181 | 182 | def _single_reflen(self, reflens, option=None, testlen=None): 183 | 184 | if option == "shortest": 185 | reflen = min(reflens) 186 | elif option == "average": 187 | reflen = float(sum(reflens))/len(reflens) 188 | elif option == "closest": 189 | reflen = min((abs(l-testlen), l) for l in reflens)[1] 190 | else: 191 | assert False, "unsupported reflen option %s" % option 192 | 193 | return reflen 194 | 195 | def recompute_score(self, option=None, verbose=0): 196 | self._score = None 197 | return self.compute_score(option, verbose) 198 | 199 | def compute_score(self, option=None, verbose=0): 200 | n = self.n 201 | small = 1e-9 202 | tiny = 1e-15 ## so that if guess is 0 still return 0 203 | bleu_list = [[] for _ in range(n)] 204 | 205 | if self._score is not None: 206 | return self._score 207 | 208 | if option is None: 209 | option = "average" if len(self.crefs) == 1 else "closest" 210 | 211 | self._testlen = 0 212 | self._reflen = 0 213 | totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n} 214 | 215 | # for each sentence 216 | for comps in self.ctest: 217 | testlen = comps['testlen'] 218 | self._testlen += testlen 219 | 220 | if self.special_reflen is None: ## need computation 221 | reflen = self._single_reflen(comps['reflen'], option, testlen) 222 | else: 223 | reflen = self.special_reflen 224 | 225 | self._reflen += reflen 226 | 227 | for key in ['guess','correct']: 228 | for k in range(n): 229 | totalcomps[key][k] += comps[key][k] 230 | 231 | # append per image bleu score 232 | bleu = 1. 233 | for k in range(n): 234 | bleu *= (float(comps['correct'][k]) + tiny) \ 235 | /(float(comps['guess'][k]) + small) 236 | bleu_list[k].append(bleu ** (1./(k+1))) 237 | ratio = (testlen + tiny) / (reflen + small) ## N.B.: avoid zero division 238 | if ratio < 1: 239 | for k in range(n): 240 | bleu_list[k][-1] *= math.exp(1 - 1/ratio) 241 | 242 | if verbose > 1: 243 | print(comps, reflen) 244 | 245 | totalcomps['reflen'] = self._reflen 246 | totalcomps['testlen'] = self._testlen 247 | 248 | bleus = [] 249 | bleu = 1. 
250 | for k in range(n): 251 | bleu *= float(totalcomps['correct'][k] + tiny) \ 252 | / (totalcomps['guess'][k] + small) 253 | bleus.append(bleu ** (1./(k+1))) 254 | ratio = (self._testlen + tiny) / (self._reflen + small) ## N.B.: avoid zero division 255 | if ratio < 1: 256 | for k in range(n): 257 | bleus[k] *= math.exp(1 - 1/ratio) 258 | 259 | if verbose > 0: 260 | print(totalcomps) 261 | print("ratio:", ratio) 262 | 263 | self._score = bleus 264 | return self._score, bleu_list 265 | -------------------------------------------------------------------------------- /pycocoevalcap/cider/cider.py: -------------------------------------------------------------------------------- 1 | # Filename: cider.py 2 | # 3 | # Description: Describes the class to compute the CIDEr (Consensus-Based Image Description Evaluation) Metric 4 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 5 | # 6 | # Creation Date: Sun Feb 8 14:16:54 2015 7 | # 8 | # Authors: Ramakrishna Vedantam and Tsung-Yi Lin 9 | 10 | from .cider_scorer import CiderScorer 11 | import pdb 12 | 13 | class Cider: 14 | """ 15 | Main Class to compute the CIDEr metric 16 | 17 | """ 18 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 19 | # set cider to sum over 1 to 4-grams 20 | self._n = n 21 | # set the standard deviation parameter for gaussian penalty 22 | self._sigma = sigma 23 | 24 | def compute_score(self, gts, res): 25 | """ 26 | Main function to compute CIDEr score 27 | :param hypo_for_image (dict) : dictionary with key and value 28 | ref_for_image (dict) : dictionary with key and value 29 | :return: cider (float) : computed CIDEr score for the corpus 30 | """ 31 | 32 | assert(gts.keys() == res.keys()) 33 | imgIds = gts.keys() 34 | 35 | cider_scorer = CiderScorer(n=self._n, sigma=self._sigma) 36 | 37 | for id in imgIds: 38 | hypo = res[id] 39 | ref = gts[id] 40 | 41 | # Sanity check. 42 | assert(type(hypo) is list) 43 | assert(len(hypo) == 1) 44 | assert(type(ref) is list) 45 | assert(len(ref) > 0) 46 | 47 | cider_scorer += (hypo[0], ref) 48 | 49 | (score, scores) = cider_scorer.compute_score() 50 | 51 | return score, scores 52 | 53 | def method(self): 54 | return "CIDEr" 55 | -------------------------------------------------------------------------------- /pycocoevalcap/cider/cider_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | 5 | import copy 6 | from collections import defaultdict 7 | import numpy as np 8 | import pdb 9 | import math 10 | 11 | def precook(s, n=4, out=False): 12 | """ 13 | Takes a string as input and returns an object that can be given to 14 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 15 | can take string arguments as well. 16 | :param s: string : sentence to be converted into ngrams 17 | :param n: int : number of ngrams for which representation is calculated 18 | :return: term frequency vector for occuring ngrams 19 | """ 20 | words = s.split() 21 | counts = defaultdict(int) 22 | for k in range(1,n+1): 23 | for i in range(len(words)-k+1): 24 | ngram = tuple(words[i:i+k]) 25 | counts[ngram] += 1 26 | return counts 27 | 28 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 29 | '''Takes a list of reference sentences for a single segment 30 | and returns an object that encapsulates everything that BLEU 31 | needs to know about them. 
32 | :param refs: list of string : reference sentences for some image 33 | :param n: int : number of ngrams for which (ngram) representation is calculated 34 | :return: result (list of dict) 35 | ''' 36 | return [precook(ref, n) for ref in refs] 37 | 38 | def cook_test(test, n=4): 39 | '''Takes a test sentence and returns an object that 40 | encapsulates everything that BLEU needs to know about it. 41 | :param test: list of string : hypothesis sentence for some image 42 | :param n: int : number of ngrams for which (ngram) representation is calculated 43 | :return: result (dict) 44 | ''' 45 | return precook(test, n, True) 46 | 47 | class CiderScorer(object): 48 | """CIDEr scorer. 49 | """ 50 | 51 | def copy(self): 52 | ''' copy the refs.''' 53 | new = CiderScorer(n=self.n) 54 | new.ctest = copy.copy(self.ctest) 55 | new.crefs = copy.copy(self.crefs) 56 | return new 57 | 58 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 59 | ''' singular instance ''' 60 | self.n = n 61 | self.sigma = sigma 62 | self.crefs = [] 63 | self.ctest = [] 64 | self.document_frequency = defaultdict(float) 65 | self.cook_append(test, refs) 66 | self.ref_len = None 67 | 68 | def cook_append(self, test, refs): 69 | '''called by constructor and __iadd__ to avoid creating new instances.''' 70 | 71 | if refs is not None: 72 | self.crefs.append(cook_refs(refs)) 73 | if test is not None: 74 | self.ctest.append(cook_test(test)) ## N.B.: -1 75 | else: 76 | self.ctest.append(None) # lens of crefs and ctest have to match 77 | 78 | def size(self): 79 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 80 | return len(self.crefs) 81 | 82 | def __iadd__(self, other): 83 | '''add an instance (e.g., from another sentence).''' 84 | 85 | if type(other) is tuple: 86 | ## avoid creating new CiderScorer instances 87 | self.cook_append(other[0], other[1]) 88 | else: 89 | self.ctest.extend(other.ctest) 90 | self.crefs.extend(other.crefs) 91 | 92 | return self 93 | def compute_doc_freq(self): 94 | ''' 95 | Compute term frequency for reference data. 96 | This will be used to compute idf (inverse document frequency later) 97 | The term frequency is stored in the object 98 | :return: None 99 | ''' 100 | for refs in self.crefs: 101 | # refs, k ref captions of one image 102 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.items()]): 103 | self.document_frequency[ngram] += 1 104 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 105 | 106 | def compute_cider(self): 107 | def counts2vec(cnts): 108 | """ 109 | Function maps counts of ngram to vector of tfidf weights. 110 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 111 | The n-th entry of array denotes length of n-grams. 112 | :param cnts: 113 | :return: vec (array of dict), norm (array of float), length (int) 114 | """ 115 | vec = [defaultdict(float) for _ in range(self.n)] 116 | length = 0 117 | norm = [0.0 for _ in range(self.n)] 118 | for (ngram,term_freq) in cnts.items(): 119 | # give word count 1 if it doesn't appear in reference corpus 120 | df = np.log(max(1.0, self.document_frequency[ngram])) 121 | # ngram index 122 | n = len(ngram)-1 123 | # tf (term_freq) * idf (precomputed idf) for n-grams 124 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 125 | # compute norm for the vector. 
the norm will be used for computing similarity 126 | norm[n] += pow(vec[n][ngram], 2) 127 | 128 | if n == 1: 129 | length += term_freq 130 | norm = [np.sqrt(n) for n in norm] 131 | return vec, norm, length 132 | 133 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 134 | ''' 135 | Compute the cosine similarity of two vectors. 136 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 137 | :param vec_ref: array of dictionary for vector corresponding to reference 138 | :param norm_hyp: array of float for vector corresponding to hypothesis 139 | :param norm_ref: array of float for vector corresponding to reference 140 | :param length_hyp: int containing length of hypothesis 141 | :param length_ref: int containing length of reference 142 | :return: array of score for each n-grams cosine similarity 143 | ''' 144 | delta = float(length_hyp - length_ref) 145 | # measure consine similarity 146 | val = np.array([0.0 for _ in range(self.n)]) 147 | for n in range(self.n): 148 | # ngram 149 | for (ngram,count) in vec_hyp[n].items(): 150 | # vrama91 : added clipping 151 | val[n] += min(vec_hyp[n][ngram], vec_ref[n][ngram]) * vec_ref[n][ngram] 152 | 153 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 154 | val[n] /= (norm_hyp[n]*norm_ref[n]) 155 | 156 | assert(not math.isnan(val[n])) 157 | # vrama91: added a length based gaussian penalty 158 | val[n] *= np.e**(-(delta**2)/(2*self.sigma**2)) 159 | return val 160 | 161 | # compute log reference length 162 | self.ref_len = np.log(float(len(self.crefs))) 163 | 164 | scores = [] 165 | for test, refs in zip(self.ctest, self.crefs): 166 | # compute vector for test captions 167 | vec, norm, length = counts2vec(test) 168 | # compute vector for ref captions 169 | score = np.array([0.0 for _ in range(self.n)]) 170 | for ref in refs: 171 | vec_ref, norm_ref, length_ref = counts2vec(ref) 172 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 173 | # change by vrama91 - mean of ngram scores, instead of sum 174 | score_avg = np.mean(score) 175 | # divide by number of references 176 | score_avg /= len(refs) 177 | # multiply score by 10 178 | score_avg *= 10.0 179 | # append score of an image to the score list 180 | scores.append(score_avg) 181 | return scores 182 | 183 | def compute_score(self, option=None, verbose=0): 184 | # compute idf 185 | self.compute_doc_freq() 186 | # assert to check document frequency 187 | assert(len(self.ctest) >= max(self.document_frequency.values())) 188 | # compute cider score 189 | score = self.compute_cider() 190 | # debug 191 | # print score 192 | return np.mean(np.array(score)), np.array(score) 193 | -------------------------------------------------------------------------------- /pycocoevalcap/eval.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | from .tokenizer.ptbtokenizer import PTBTokenizer 3 | from .bleu.bleu import Bleu 4 | from .meteor.meteor import Meteor 5 | from .rouge.rouge import Rouge 6 | from .cider.cider import Cider 7 | 8 | class COCOEvalCap: 9 | def __init__(self, coco, cocoRes): 10 | self.evalImgs = [] 11 | self.eval = {} 12 | self.imgToEval = {} 13 | self.coco = coco 14 | self.cocoRes = cocoRes 15 | self.params = {'image_id': coco.getImgIds()} 16 | 17 | def evaluate(self): 18 | imgIds = self.params['image_id'] 19 | # imgIds = self.coco.getImgIds() 20 | gts = {} 21 | res = {} 22 | for imgId in imgIds: 23 | gts[imgId] = self.coco.imgToAnns[imgId] 24 | res[imgId] = self.cocoRes.imgToAnns[imgId] 
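        # gts and res map each image id to a list of annotation dicts carrying
        # a 'caption' string; the PTB tokenizer below flattens them to lists of
        # plain tokenized strings, the input format every scorer expects.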
25 | 26 | # ================================================= 27 | # Set up scorers 28 | # ================================================= 29 | print('tokenization...') 30 | tokenizer = PTBTokenizer() 31 | gts = tokenizer.tokenize(gts) 32 | res = tokenizer.tokenize(res) 33 | 34 | # ================================================= 35 | # Set up scorers 36 | # ================================================= 37 | print('setting up scorers...') 38 | scorers = [ 39 | (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]), 40 | (Meteor(),"METEOR"), 41 | (Rouge(), "ROUGE_L"), 42 | (Cider(), "CIDEr") 43 | ] 44 | 45 | # ================================================= 46 | # Compute scores 47 | # ================================================= 48 | for scorer, method in scorers: 49 | print('computing %s score...'%(scorer.method())) 50 | score, scores = scorer.compute_score(gts, res) 51 | if type(method) == list: 52 | for sc, scs, m in zip(score, scores, method): 53 | self.setEval(sc, m) 54 | self.setImgToEvalImgs(scs, gts.keys(), m) 55 | print("%s: %0.3f"%(m, sc)) 56 | else: 57 | self.setEval(score, method) 58 | self.setImgToEvalImgs(scores, gts.keys(), method) 59 | print("%s: %0.3f"%(method, score)) 60 | self.setEvalImgs() 61 | 62 | def setEval(self, score, method): 63 | self.eval[method] = score 64 | 65 | def setImgToEvalImgs(self, scores, imgIds, method): 66 | for imgId, score in zip(imgIds, scores): 67 | if not imgId in self.imgToEval: 68 | self.imgToEval[imgId] = {} 69 | self.imgToEval[imgId]["image_id"] = imgId 70 | self.imgToEval[imgId][method] = score 71 | 72 | def setEvalImgs(self): 73 | self.evalImgs = [eval for imgId, eval in self.imgToEval.items()] 74 | -------------------------------------------------------------------------------- /pycocoevalcap/license.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015, Xinlei Chen, Hao Fang, Tsung-Yi Lin, and Ramakrishna Vedantam 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the FreeBSD Project. 
27 | 
--------------------------------------------------------------------------------
/pycocoevalcap/meteor/data/paraphrase-en.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/pycocoevalcap/meteor/data/paraphrase-en.gz
--------------------------------------------------------------------------------
/pycocoevalcap/meteor/meteor-1.5.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/pycocoevalcap/meteor/meteor-1.5.jar
--------------------------------------------------------------------------------
/pycocoevalcap/meteor/meteor.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | # Python wrapper for METEOR implementation, by Xinlei Chen
4 | # Acknowledge Michael Denkowski for the generous discussion and help
5 | 
6 | import os
7 | import sys
8 | import subprocess
9 | import threading
10 | 
11 | # Assumes meteor-1.5.jar is in the same directory as meteor.py. Change as needed.
12 | METEOR_JAR = 'meteor-1.5.jar'
13 | # print METEOR_JAR
14 | 
15 | class Meteor:
16 | 
17 |     def __init__(self):
18 |         self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, \
19 |                 '-', '-', '-stdio', '-l', 'en', '-norm']
20 |         self.meteor_p = subprocess.Popen(self.meteor_cmd, \
21 |                 cwd=os.path.dirname(os.path.abspath(__file__)), \
22 |                 stdin=subprocess.PIPE, \
23 |                 stdout=subprocess.PIPE, \
24 |                 stderr=subprocess.PIPE)
25 |         # Used to guarantee thread safety
26 |         self.lock = threading.Lock()
27 | 
28 |     def compute_score(self, gts, res):
29 |         assert(gts.keys() == res.keys())
30 |         imgIds = gts.keys()
31 |         scores = []
32 | 
33 |         eval_line = 'EVAL'
34 |         self.lock.acquire()
35 |         for i in imgIds:
36 |             assert(len(res[i]) == 1)
37 |             stat = self._stat(res[i][0], gts[i])
38 |             eval_line += ' ||| {}'.format(stat)
39 | 
40 |         self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
41 |         self.meteor_p.stdin.flush()
42 |         for i in range(0,len(imgIds)):
43 |             scores.append(float(self.meteor_p.stdout.readline().strip()))
44 |         score = float(self.meteor_p.stdout.readline().strip())
45 |         self.lock.release()
46 | 
47 |         return score, scores
48 | 
49 |     def method(self):
50 |         return "METEOR"
51 | 
52 |     def _stat(self, hypothesis_str, reference_list):
53 |         # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
54 |         hypothesis_str = hypothesis_str.replace('|||','').replace('  ',' ')
55 |         score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str))
56 |         self.meteor_p.stdin.write('{}\n'.format(score_line).encode())
57 |         self.meteor_p.stdin.flush()
58 |         return self.meteor_p.stdout.readline().decode().strip()
59 | 
60 |     def _score(self, hypothesis_str, reference_list):
61 |         self.lock.acquire()
62 |         # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
63 |         hypothesis_str = hypothesis_str.replace('|||','').replace('  ',' ')
64 |         score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str))
65 |         self.meteor_p.stdin.write('{}\n'.format(score_line).encode())
66 |         stats = self.meteor_p.stdout.readline().decode().strip()
67 |         eval_line = 'EVAL ||| {}'.format(stats)
68 |         # EVAL ||| stats
69 |         self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
70 |         score = float(self.meteor_p.stdout.readline().strip())
71 |         # bug fix: there are two values returned by the jar file, one average,
and one all, so do it twice 72 | # thanks for Andrej for pointing this out 73 | score = float(self.meteor_p.stdout.readline().strip()) 74 | self.lock.release() 75 | return score 76 | 77 | def __del__(self): 78 | self.lock.acquire() 79 | self.meteor_p.stdin.close() 80 | self.meteor_p.kill() 81 | self.meteor_p.wait() 82 | self.lock.release() 83 | -------------------------------------------------------------------------------- /pycocoevalcap/rouge/rouge.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : rouge.py 4 | # 5 | # Description : Computes ROUGE-L metric as described by Lin and Hovey (2004) 6 | # 7 | # Creation Date : 2015-01-07 06:03 8 | # Author : Ramakrishna Vedantam 9 | 10 | import numpy as np 11 | import pdb 12 | 13 | def my_lcs(string, sub): 14 | """ 15 | Calculates longest common subsequence for a pair of tokenized strings 16 | :param string : list of str : tokens from a string split using whitespace 17 | :param sub : list of str : shorter string, also split using whitespace 18 | :returns: length (list of int): length of the longest common subsequence between the two strings 19 | 20 | Note: my_lcs only gives length of the longest common subsequence, not the actual LCS 21 | """ 22 | if(len(string)< len(sub)): 23 | sub, string = string, sub 24 | 25 | lengths = [[0 for i in range(0,len(sub)+1)] for j in range(0,len(string)+1)] 26 | 27 | for j in range(1,len(sub)+1): 28 | for i in range(1,len(string)+1): 29 | if(string[i-1] == sub[j-1]): 30 | lengths[i][j] = lengths[i-1][j-1] + 1 31 | else: 32 | lengths[i][j] = max(lengths[i-1][j] , lengths[i][j-1]) 33 | 34 | return lengths[len(string)][len(sub)] 35 | 36 | class Rouge(): 37 | ''' 38 | Class for computing ROUGE-L score for a set of candidate sentences for the MS COCO test set 39 | 40 | ''' 41 | def __init__(self): 42 | # vrama91: updated the value below based on discussion with Hovey 43 | self.beta = 1.2 44 | 45 | def calc_score(self, candidate, refs): 46 | """ 47 | Compute ROUGE-L score given one candidate and references for an image 48 | :param candidate: str : candidate sentence to be evaluated 49 | :param refs: list of str : COCO reference sentences for the particular image to be evaluated 50 | :returns score: int (ROUGE-L score for the candidate evaluated against references) 51 | """ 52 | assert(len(candidate)==1) 53 | assert(len(refs)>0) 54 | prec = [] 55 | rec = [] 56 | 57 | # split into tokens 58 | token_c = candidate[0].split(" ") 59 | 60 | for reference in refs: 61 | # split into tokens 62 | token_r = reference.split(" ") 63 | # compute the longest common subsequence 64 | lcs = my_lcs(token_r, token_c) 65 | prec.append(lcs/float(len(token_c))) 66 | rec.append(lcs/float(len(token_r))) 67 | 68 | prec_max = max(prec) 69 | rec_max = max(rec) 70 | 71 | if(prec_max!=0 and rec_max !=0): 72 | score = ((1 + self.beta**2)*prec_max*rec_max)/float(rec_max + self.beta**2*prec_max) 73 | else: 74 | score = 0.0 75 | return score 76 | 77 | def compute_score(self, gts, res): 78 | """ 79 | Computes Rouge-L score given a set of reference and candidate sentences for the dataset 80 | Invoked by evaluate_captions.py 81 | :param hypo_for_image: dict : candidate / test sentences with "image name" key and "tokenized sentences" as values 82 | :param ref_for_image: dict : reference MS-COCO sentences with "image name" key and "tokenized sentences" as values 83 | :returns: average_score: float (mean ROUGE-L score computed by averaging scores for all the images) 84 | """ 
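        # Per image, calc_score takes the maximum LCS-based precision and
        # recall over all references and combines them into an F-measure
        # skewed toward recall by beta = 1.2; the corpus score returned by
        # compute_score is the plain mean over images.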
85 | assert(gts.keys() == res.keys()) 86 | imgIds = gts.keys() 87 | 88 | score = [] 89 | for id in imgIds: 90 | hypo = res[id] 91 | ref = gts[id] 92 | 93 | score.append(self.calc_score(hypo, ref)) 94 | 95 | # Sanity check. 96 | assert(type(hypo) is list) 97 | assert(len(hypo) == 1) 98 | assert(type(ref) is list) 99 | assert(len(ref) > 0) 100 | 101 | average_score = np.mean(np.array(score)) 102 | return average_score, np.array(score) 103 | 104 | def method(self): 105 | return "Rouge" 106 | -------------------------------------------------------------------------------- /pycocoevalcap/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_namespace_packages 2 | 3 | # Prepend pycocoevalcap to package names 4 | package_names = ['pycocoevalcap.'+p for p in find_namespace_packages()] 5 | 6 | setup( 7 | name='pycocoevalcap', 8 | version=1.0, 9 | packages=['pycocoevalcap']+package_names, 10 | package_dir={'pycocoevalcap': '.'}, 11 | package_data={'': ['*.jar', '*.gz']}, 12 | install_requires=['pycocotools>=2.0.0'] 13 | ) 14 | -------------------------------------------------------------------------------- /pycocoevalcap/tokenizer/ptbtokenizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : ptbtokenizer.py 4 | # 5 | # Description : Do the PTB Tokenization and remove punctuations. 6 | # 7 | # Creation Date : 29-12-2014 8 | # Last Modified : Thu Mar 19 09:53:35 2015 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | import os 12 | import sys 13 | import subprocess 14 | import tempfile 15 | import itertools 16 | import contextlib 17 | 18 | # path to the stanford corenlp jar 19 | STANFORD_CORENLP_3_4_1_JAR = 'stanford-corenlp-3.4.1.jar' 20 | 21 | # punctuations to be removed from the sentences 22 | PUNCTUATIONS = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-", \ 23 | ".", "?", "!", ",", ":", "-", "--", "...", ";"] 24 | 25 | class PTBTokenizer: 26 | """Python wrapper of Stanford PTBTokenizer""" 27 | 28 | def tokenize(self, captions_for_image): 29 | cmd = ['java', '-cp', STANFORD_CORENLP_3_4_1_JAR, \ 30 | 'edu.stanford.nlp.process.PTBTokenizer', \ 31 | '-preserveLines', '-lowerCase'] 32 | 33 | # ====================================================== 34 | # prepare data for PTB Tokenizer 35 | # ====================================================== 36 | final_tokenized_captions_for_image = {} 37 | image_id = [k for k, v in captions_for_image.items() for _ in range(len(v))] 38 | sentences = '\n'.join([c['caption'].replace('\n', ' ') for k, v in captions_for_image.items() for c in v]) 39 | 40 | # ====================================================== 41 | # save sentences to temporary file 42 | # ====================================================== 43 | path_to_jar_dirname=os.path.dirname(os.path.abspath(__file__)) 44 | tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=path_to_jar_dirname) 45 | tmp_file.write(sentences.encode()) 46 | tmp_file.close() 47 | 48 | # ====================================================== 49 | # tokenize sentence 50 | # ====================================================== 51 | cmd.append(os.path.basename(tmp_file.name)) 52 | p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, \ 53 | stdout=subprocess.PIPE) 54 | 55 | token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0] 56 | 57 | token_lines = token_lines.decode() 58 | lines = token_lines.split('\n') 59 | # remove temp file 60 
| os.remove(tmp_file.name) 61 | 62 | # ====================================================== 63 | # create dictionary for tokenized captions 64 | # ====================================================== 65 | for k, line in zip(image_id, lines): 66 | if not k in final_tokenized_captions_for_image: 67 | final_tokenized_captions_for_image[k] = [] 68 | tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \ 69 | if w not in PUNCTUATIONS]) 70 | final_tokenized_captions_for_image[k].append(tokenized_caption) 71 | 72 | return final_tokenized_captions_for_image 73 | -------------------------------------------------------------------------------- /pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dev-chauhan/PQG-pytorch/0090547517daf520b6c85b06dac1992d928979c0/pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar -------------------------------------------------------------------------------- /train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | function help(){ 4 | printf "Usage :\n"; 5 | printf " $0 \n" 6 | exit; 7 | } 8 | 9 | if [[ $# < 1 ]] 10 | then 11 | help; 12 | fi 13 | 14 | file="train_$1"; 15 | 16 | if [ ! -f "training/$file.py" ] 17 | then 18 | printf "File ${file} not found\n"; 19 | printf "enter correct model name \n"; 20 | exit; 21 | fi 22 | 23 | python -m training.$file "${@:2}"; 24 | -------------------------------------------------------------------------------- /training/train_edlp.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import time 4 | 5 | import torch 6 | import torch.nn as nn 7 | import torch.optim as optim 8 | import torch.utils.data as Data 9 | from tensorboardX import SummaryWriter 10 | from tqdm import tqdm 11 | 12 | from misc import net_utils, utils 13 | from misc.dataloader import Dataloader 14 | from misc.train_util import dump_samples, evaluate_scores, save_model 15 | from models.enc_dec_dis import ParaphraseGenerator 16 | 17 | 18 | def main(): 19 | 20 | # get arguments --- 21 | 22 | parser = utils.make_parser() 23 | args = parser.parse_args() 24 | 25 | # build model 26 | 27 | # # get data 28 | data = Dataloader(args.input_json, args.input_ques_h5) 29 | 30 | # # make op 31 | op = { 32 | "vocab_sz": data.getVocabSize(), 33 | "max_seq_len": data.getSeqLength(), 34 | "emb_hid_dim": args.emb_hid_dim, 35 | "emb_dim": args.emb_dim, 36 | "enc_dim": args.enc_dim, 37 | "enc_dropout": args.enc_dropout, 38 | "enc_rnn_dim": args.enc_rnn_dim, 39 | "gen_rnn_dim": args.gen_rnn_dim, 40 | "gen_dropout": args.gen_dropout, 41 | "lr": args.learning_rate, 42 | "epochs": args.n_epoch 43 | } 44 | 45 | # # instantiate paraphrase generator 46 | pgen = ParaphraseGenerator(op) 47 | 48 | # setup logging 49 | logger = SummaryWriter(os.path.join(LOG_DIR, TIME + args.name)) 50 | # subprocess.run(['mkdir', os.path.join(GEN_DIR, TIME)], check=False) 51 | os.makedirs(os.path.join(GEN_DIR, TIME), exist_ok=True) 52 | # ready model for training 53 | 54 | train_loader = Data.DataLoader( 55 | Data.Subset(data, range(args.train_dataset_len)), 56 | batch_size=args.batch_size, 57 | shuffle=True, 58 | ) 59 | test_loader = Data.DataLoader(Data.Subset( 60 | data, 61 | range(args.train_dataset_len, 62 | args.train_dataset_len + args.val_dataset_len)), 63 | batch_size=args.batch_size, 64 | shuffle=True) 65 | 66 | pgen_optim = 
optim.RMSprop(pgen.parameters(), lr=op["lr"]) 67 | pgen.train() 68 | 69 | # train model 70 | pgen = pgen.to(DEVICE) 71 | cross_entropy_loss = nn.CrossEntropyLoss(ignore_index=data.PAD_token) 72 | 73 | for epoch in range(op["epochs"]): 74 | 75 | epoch_l1 = 0 76 | epoch_l2 = 0 77 | itr = 0 78 | ph = [] 79 | pph = [] 80 | gpph = [] 81 | pgen.train() 82 | 83 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 84 | train_loader, ascii=True, desc="epoch" + str(epoch)): 85 | 86 | phrase = phrase.to(DEVICE) 87 | para_phrase = para_phrase.to(DEVICE) 88 | 89 | out, enc_out, enc_sim_phrase = pgen( 90 | phrase.t(), 91 | sim_phrase=para_phrase.t(), 92 | train=True, 93 | ) 94 | 95 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 96 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 97 | 98 | pgen_optim.zero_grad() 99 | (loss_1 + loss_2).backward() 100 | 101 | pgen_optim.step() 102 | 103 | # accumulate results 104 | 105 | epoch_l1 += loss_1.item() 106 | epoch_l2 += loss_2.item() 107 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 108 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 109 | gpph += net_utils.decode_sequence(data.ix_to_word, 110 | torch.argmax(out, dim=-1).t()) 111 | 112 | itr += 1 113 | torch.cuda.empty_cache() 114 | 115 | # log results 116 | 117 | logger.add_scalar("l2_train", epoch_l2 / itr, epoch) 118 | logger.add_scalar("l1_train", epoch_l1 / itr, epoch) 119 | 120 | scores = evaluate_scores(gpph, pph) 121 | 122 | for key in scores: 123 | logger.add_scalar(key + "_train", scores[key], epoch) 124 | 125 | dump_samples(ph, pph, gpph, 126 | os.path.join(GEN_DIR, TIME, 127 | str(epoch) + "_train.txt")) 128 | 129 | # start validation 130 | 131 | epoch_l1 = 0 132 | epoch_l2 = 0 133 | itr = 0 134 | ph = [] 135 | pph = [] 136 | gpph = [] 137 | pgen.eval() 138 | 139 | with torch.no_grad(): 140 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 141 | test_loader, ascii=True, desc="val" + str(epoch)): 142 | 143 | phrase = phrase.to(DEVICE) 144 | para_phrase = para_phrase.to(DEVICE) 145 | 146 | out, enc_out, enc_sim_phrase = pgen(phrase.t(), 147 | sim_phrase=para_phrase.t()) 148 | 149 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 150 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 151 | 152 | epoch_l1 += loss_1.item() 153 | epoch_l2 += loss_2.item() 154 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 155 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 156 | gpph += net_utils.decode_sequence( 157 | data.ix_to_word, 158 | torch.argmax(out, dim=-1).t()) 159 | 160 | itr += 1 161 | torch.cuda.empty_cache() 162 | 163 | logger.add_scalar("l2_val", epoch_l2 / itr, epoch) 164 | logger.add_scalar("l1_val", epoch_l1 / itr, epoch) 165 | 166 | scores = evaluate_scores(gpph, pph) 167 | 168 | for key in scores: 169 | logger.add_scalar(key + "_val", scores[key], epoch) 170 | 171 | dump_samples(ph, pph, gpph, 172 | os.path.join(GEN_DIR, TIME, 173 | str(epoch) + "_val.txt")) 174 | save_model(pgen, pgen_optim, epoch, os.path.join(SAVE_DIR, TIME, str(epoch))) 175 | 176 | # wrap ups 177 | logger.close() 178 | print("Done !!") 179 | 180 | 181 | if __name__ == "__main__": 182 | 183 | LOG_DIR = 'logs' 184 | SAVE_DIR = 'save' 185 | GEN_DIR = 'samples' 186 | HOME = './' 187 | TIME = time.strftime("%Y%m%d_%H%M%S") 188 | DEVICE = torch.device( 189 | 'cuda') if torch.cuda.is_available() else torch.device('cpu') 190 | main() 191 | 
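The `JointEmbeddingLoss` term in the trainer above is the pair-wise discriminator signal from `misc/net_utils.py`, which this dump does not include. As a rough, non-authoritative sketch of that pair-wise idea only — the margin value, the dot-product similarity, and the mean reduction here are assumptions, not the repository's actual code:

```python
import torch

def joint_embedding_loss_sketch(enc_out, enc_sim, margin=0.5):
    # enc_out, enc_sim: (batch, dim) encodings of the generated paraphrase and
    # the ground-truth paraphrase; row i of the two tensors forms a true pair,
    # and every cross-row combination acts as a negative pair.
    sim = enc_out @ enc_sim.t()      # (batch, batch) pairwise similarities
    pos = sim.diag().unsqueeze(1)    # similarity of each true pair
    # hinge: each mismatched pair should score at least `margin` below the
    # true pair from the same row
    loss = (margin + sim - pos).clamp(min=0)
    loss.fill_diagonal_(0)           # the true pairs themselves carry no loss
    return loss.mean()
```

Per `train.sh` above, this trainer is selected by model name, e.g. `./train.sh edlp --batch_size 32 --n_epoch 20` (the flag values are illustrative; `train.sh` forwards everything after the model name to `training/train_edlp.py`).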
-------------------------------------------------------------------------------- /training/train_edlps.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import time 4 | 5 | import torch 6 | import torch.nn as nn 7 | import torch.optim as optim 8 | import torch.utils.data as Data 9 | from tensorboardX import SummaryWriter 10 | from tqdm import tqdm 11 | 12 | from misc import net_utils, utils 13 | from misc.dataloader import Dataloader 14 | from misc.train_util import dump_samples, evaluate_scores, save_model 15 | from models.enc_dec_sh_dis import ParaphraseGenerator 16 | 17 | 18 | def main(): 19 | 20 | # get arguments --- 21 | 22 | parser = utils.make_parser() 23 | args = parser.parse_args() 24 | 25 | # build model 26 | 27 | # # get data 28 | data = Dataloader(args.input_json, args.input_ques_h5) 29 | 30 | # # make op 31 | op = { 32 | "vocab_sz": data.getVocabSize(), 33 | "max_seq_len": data.getSeqLength(), 34 | "emb_hid_dim": args.emb_hid_dim, 35 | "emb_dim": args.emb_dim, 36 | "enc_dim": args.enc_dim, 37 | "enc_dropout": args.enc_dropout, 38 | "enc_rnn_dim": args.enc_rnn_dim, 39 | "gen_rnn_dim": args.gen_rnn_dim, 40 | "gen_dropout": args.gen_dropout, 41 | "lr": args.learning_rate, 42 | "epochs": args.n_epoch 43 | } 44 | 45 | # # instantiate paraphrase generator 46 | pgen = ParaphraseGenerator(op) 47 | 48 | # setup logging 49 | logger = SummaryWriter(os.path.join(LOG_DIR, TIME + args.name)) 50 | # subprocess.run(['mkdir', os.path.join(GEN_DIR, TIME), os.path.join(SAVE_DIR, TIME)], check=False) 51 | os.makedirs(os.path.join(GEN_DIR, TIME), exist_ok=True) 52 | # ready model for training 53 | 54 | train_loader = Data.DataLoader( 55 | Data.Subset(data, range(args.train_dataset_len)), 56 | batch_size=args.batch_size, 57 | shuffle=True, 58 | ) 59 | test_loader = Data.DataLoader(Data.Subset( 60 | data, 61 | range(args.train_dataset_len, 62 | args.train_dataset_len + args.val_dataset_len)), 63 | batch_size=args.batch_size, 64 | shuffle=True) 65 | 66 | pgen_optim = optim.RMSprop(pgen.parameters(), lr=op["lr"]) 67 | pgen.train() 68 | 69 | # train model 70 | pgen = pgen.to(DEVICE) 71 | cross_entropy_loss = nn.CrossEntropyLoss(ignore_index=data.PAD_token) 72 | 73 | for epoch in range(op["epochs"]): 74 | 75 | epoch_l1 = 0 76 | epoch_l2 = 0 77 | itr = 0 78 | ph = [] 79 | pph = [] 80 | gpph = [] 81 | pgen.train() 82 | 83 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 84 | train_loader, ascii=True, desc="epoch" + str(epoch)): 85 | 86 | phrase = phrase.to(DEVICE) 87 | para_phrase = para_phrase.to(DEVICE) 88 | 89 | out, enc_out, enc_sim_phrase = pgen( 90 | phrase.t(), 91 | sim_phrase=para_phrase.t(), 92 | train=True, 93 | ) 94 | 95 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 96 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 97 | 98 | pgen_optim.zero_grad() 99 | (loss_1 + loss_2).backward() 100 | 101 | pgen_optim.step() 102 | 103 | # accumulate results 104 | 105 | epoch_l1 += loss_1.item() 106 | epoch_l2 += loss_2.item() 107 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 108 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 109 | gpph += net_utils.decode_sequence(data.ix_to_word, 110 | torch.argmax(out, dim=-1).t()) 111 | 112 | itr += 1 113 | torch.cuda.empty_cache() 114 | 115 | # log results 116 | 117 | logger.add_scalar("l2_train", epoch_l2 / itr, epoch) 118 | logger.add_scalar("l1_train", epoch_l1 / itr, epoch) 119 | 120 | scores = 
evaluate_scores(gpph, pph) 121 | 122 | for key in scores: 123 | logger.add_scalar(key + "_train", scores[key], epoch) 124 | 125 | dump_samples(ph, pph, gpph, 126 | os.path.join(GEN_DIR, TIME, 127 | str(epoch) + "_train.txt")) 128 | # start validation 129 | 130 | epoch_l1 = 0 131 | epoch_l2 = 0 132 | itr = 0 133 | ph = [] 134 | pph = [] 135 | gpph = [] 136 | pgen.eval() 137 | 138 | with torch.no_grad(): 139 | for phrase, phrase_len, para_phrase, para_phrase_len, _ in tqdm( 140 | test_loader, ascii=True, desc="val" + str(epoch)): 141 | 142 | phrase = phrase.to(DEVICE) 143 | para_phrase = para_phrase.to(DEVICE) 144 | 145 | out, enc_out, enc_sim_phrase = pgen(phrase.t(), 146 | sim_phrase=para_phrase.t()) 147 | 148 | loss_1 = cross_entropy_loss(out.permute(1, 2, 0), para_phrase) 149 | loss_2 = net_utils.JointEmbeddingLoss(enc_out, enc_sim_phrase) 150 | 151 | epoch_l1 += loss_1.item() 152 | epoch_l2 += loss_2.item() 153 | ph += net_utils.decode_sequence(data.ix_to_word, phrase) 154 | pph += net_utils.decode_sequence(data.ix_to_word, para_phrase) 155 | gpph += net_utils.decode_sequence( 156 | data.ix_to_word, 157 | torch.argmax(out, dim=-1).t()) 158 | 159 | itr += 1 160 | torch.cuda.empty_cache() 161 | 162 | logger.add_scalar("l2_val", epoch_l2 / itr, epoch) 163 | logger.add_scalar("l1_val", epoch_l1 / itr, epoch) 164 | 165 | scores = evaluate_scores(gpph, pph) 166 | 167 | for key in scores: 168 | logger.add_scalar(key + "_val", scores[key], epoch) 169 | 170 | dump_samples(ph, pph, gpph, 171 | os.path.join(GEN_DIR, TIME, 172 | str(epoch) + "_val.txt")) 173 | 174 | save_model(pgen, pgen_optim, epoch, os.path.join(SAVE_DIR, TIME, str(epoch))) 175 | 176 | # wrap ups 177 | logger.close() 178 | print("Done !!") 179 | 180 | 181 | if __name__ == "__main__": 182 | 183 | LOG_DIR = 'logs' 184 | SAVE_DIR = 'save' 185 | GEN_DIR = 'samples' 186 | HOME = './' 187 | TIME = time.strftime("%Y%m%d_%H%M%S") 188 | DEVICE = torch.device( 189 | 'cuda') if torch.cuda.is_available() else torch.device('cpu') 190 | main() 191 | --------------------------------------------------------------------------------
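For quick sanity checks outside the training loop, the vendored scorers can be driven directly with the dict format asserted in each `compute_score` above. A minimal sketch — the ids and sentences are made up; `Meteor` and `PTBTokenizer` additionally need Java on the PATH, and CIDEr's idf statistics are only meaningful over a reasonably large reference set:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# one candidate per id in res, one or more references per id in gts;
# sentences must already be tokenized (whitespace-separated)
res = {0: ["what is the best way to learn python"]}
gts = {0: ["how do i learn python quickly", "how can i learn python fast"]}

for scorer in (Bleu(4), Rouge(), Cider()):
    score, per_image = scorer.compute_score(gts, res)
    print(scorer.method(), score)
```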