├── .gitignore
├── LICENSE
├── README.md
├── cache_elmo.py
├── coref_kernels.cc
├── coref_ops.py
├── data
│   ├── test.vispro.1.1.jsonlines
│   ├── train.vispro.1.1.jsonlines
│   └── val.vispro.1.1.jsonlines
├── evaluate.py
├── experiments.conf
├── fig
│   ├── case_study1.png
│   ├── data_example.PNG
│   └── dialog_example.PNG
├── filter_embeddings.py
├── get_char_vocab.py
├── get_im_fc.py
├── metrics.py
├── model.py
├── predict.py
├── requirements.txt
├── setup_all.sh
├── setup_training.sh
├── train.py
└── util.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.so
3 | char_vocab*.txt
4 | glove*.txt
5 | glove*.txt.filtered
6 | *.hdf5
7 | logs
8 | output
9 | venv
10 | *.tgz
11 | .idea
12 | sftp-config.json
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 HKUST-KnowComp
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Visual Pronoun Coreference Resolution in Dialogues
2 |
3 | ## Introduction
4 | This is the data and the source code for the EMNLP 2019 paper "What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues". [[PAPER](https://www.aclweb.org/anthology/D19-1516.pdf)][[PPT](https://drive.google.com/open?id=1T5911qE1XrToNcMTOhKoAFEiqEgco2Sv)]
5 |
6 | ### Abstract
7 | Grounding a pronoun to the visual object it refers to requires complex reasoning from various information sources, especially in conversational scenarios.
8 | For example, when people in a conversation talk about something all speakers can see (e.g., the statue), they often directly use pronouns (e.g., it) to refer to it without a previous introduction.
9 | This fact brings a huge challenge for modern natural language understanding systems, particularly conventional context-based pronoun coreference models.
10 | To tackle this challenge, in this paper, we formally define the task of visual-aware pronoun coreference resolution (PCR), and introduce VisPro, a large-scale dialogue PCR dataset, to investigate whether and how the visual information can help resolve pronouns in dialogues.
11 | We then propose a novel visual-aware PCR model, VisCoref, for this task and conduct comprehensive experiments and case studies on our dataset.
12 | Results demonstrate the importance of the visual information in this PCR case and show the effectiveness of the proposed model.
13 |
14 |
15 |
16 |
17 |
18 | You are welcome to star/fork this repository and use it to train your own models, reproduce our experiments, and follow our future work. Please kindly cite our paper:
19 | ```
20 | @inproceedings{DBLP:conf/emnlp/YuZSSZ19,
21 |   author    = {Xintong Yu and
22 |                Hongming Zhang and
23 |                Yangqiu Song and
24 |                Yan Song and
25 |                Changshui Zhang},
26 |   title     = {What You See is What You Get: Visual Pronoun Coreference Resolution
27 |                in Dialogues},
28 |   booktitle = {Proceedings of {EMNLP-IJCNLP} 2019},
29 |   pages     = {5122--5131},
30 |   publisher = {Association for Computational Linguistics},
31 |   year      = {2019},
32 |   url       = {https://doi.org/10.18653/v1/D19-1516},
33 |   doi       = {10.18653/v1/D19-1516},
34 | }
35 | ```
36 |
37 |
38 |
39 | ## VisPro Dataset
40 | The VisPro dataset contains coreference annotations of 29,722 pronouns from 5,000 dialogues.
41 |
42 | The train, validation, and test splits of VisPro are in the `data` directory.
43 |
44 | ### An example of VisPro
45 |
46 |
47 |
48 | Mentions in the same coreference cluster are in the same color.
49 |
50 | ### Annotation Format
51 | Each line contains the annotation of one dialog.
52 | ```
53 | {
54 |   "doc_key": str, # e.g. in "dl:train:0", "dl" indicates the "dialog" genre (kept for compatibility with the CoNLL format) and is the same for all dialogs in VisPro; "train" means that the dialog comes from the train split of VisDial (note that the split of VisPro is not the same as that of VisDial); "0" is the original index among the randomly selected 5,000 VisDial dialogs; this key serves as an index of the dialog
55 |   "image_file": str, # the image filename of the dialog
56 |   "object_detection": list, # the ids of object labels from the 80 categories of the MSCOCO object detection challenge
57 |   "sentences": list,
58 |   "speakers": list,
59 |   "cluster": list, # each element is a cluster, and each element within a cluster is a mention
60 |   "correct_caption_NPs": list, # the noun phrases in the caption
61 |   "pronoun_info": list
62 | }
63 | ```
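Since the files are in jsonlines format, every dialog can be read independently. Below is a minimal loading sketch (not part of the released code; it assumes the splits have been moved into the `data` directory as described in the Getting Started section):

```
import json

# Each line of a .jsonlines file is one self-contained dialog annotation.
with open("data/val.vispro.1.1.jsonlines") as f:
    for line in f:
        dialog = json.loads(line)
        print(dialog["doc_key"], dialog["image_file"], len(dialog["sentences"]))
```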
64 |
65 | Each element of `"pronoun_info"` contains the annotation of one pronoun.
66 | ```
67 | {
68 | "current_pronoun": list,
69 | "reference_type": int,
70 | "not_discussed": bool,
71 | "candidate_NPs": list,
72 | "correct_NPs": list
73 | }
74 | ```
75 | Text spans are denoted as [index_start, index_end], giving their positions in the whole dialogue; the indices are counted over the concatenation of all sentences of the dialogue.
76 |
77 | `"current_pronoun"`, `"candidate_NPs"`, and `"correct_NPs"` are the positions of the pronoun, the candidate noun phrases, and the correct antecedent noun phrases, respectively.
78 |
79 | `"reference_type"` has 3 values: 0 for pronouns that refer to noun phrases in the text, 1 for pronouns whose antecedents are not in the candidate list, and 2 for non-referential pronouns.
80 |
81 | `"not_discussed"` indicates whether the antecedents of the pronoun are discussed in the dialogue text.
82 |
83 | Take the first dialog in the test split of VisPro as an example:
84 | ```
85 | {
86 |   "pronoun_info": [{"current_pronoun": [15, 15], "candidate_NPs": [[0, 1], [3, 4], [6, 8], [10, 10], [12, 12]], "reference_type": 0, "correct_NPs": [[0, 1], [10, 10]], "not_discussed": false}],
87 |   "sentences": [["A", "firefighter", "rests", "his", "arm", "on", "a", "parking", "meter", "as", "another", "walks", "past", "."], ["Is", "he", "in", "his", "gear", "?"]],
88 |   "doc_key": "dl:train:152"
89 | }
90 | ```
91 | Here [0, 1] indicates the phrase "a firefighter", [3, 4] indicates "his arm", [6, 8] indicates "a parking meter", [10, 10] indicates "another", [12, 12] indicates "past", and [15, 15] indicates "he".
92 | For the current pronoun "he", "candidate_NPs" means that "a firefighter", "his arm", "a parking meter", "another", and "past" all serve as candidates for antecedents, while "correct_NPs" means that only "a firefighter" and "another" are correct antecedents.
93 |
94 | The "doc_key" means that it is the dialog with index 152 among the selected dialogs from the train split of VisDial.
95 |
96 |
97 | ## Usage of VisCoref
98 |
99 | ### An Example of VisCoref Prediction
100 |
101 |
102 |
103 |
104 | The figure shows an example of a VisCoref prediction with the image, the relevant part of the dialogue, the prediction result, and the heatmap of the text-object similarity. We indicate the target pronoun with an *underlined italic* font and the candidate mentions with a bold font. Each row of the heatmap corresponds to a mention in the context, and each column corresponds to a detected object label from the image.
105 |
106 | ### Getting Started
107 | * Install Python 3.7 and the requirements: `pip install -r requirements.txt`. Make sure Python 3.7 is the default `python` on your system.
108 | * Download the supplementary data for training VisCoref and the pretrained model from [Data](https://drive.google.com/open?id=1dSeGz5k57bU2GXCt7sY9krykLvmnbiVx) and extract them: `tar -xzvf VisCoref.tar.gz`.
109 | * Move the VisPro data and the supplementary data ending with `.jsonlines` to the `data` directory, and move the pretrained model to the `logs` directory.
110 | * Download GloVe embeddings and build the custom kernels by running `setup_all.sh`.
111 |   * There are 3 platform-dependent ways to build the custom TensorFlow kernels. Please comment/uncomment the appropriate lines in the script.
112 | * Set up the training files by running `setup_training.sh`.
113 |
114 | ### Training Instructions
115 |
116 | * Experiment configurations are found in `experiments.conf`.
117 | * Choose an experiment that you would like to run, e.g. `best`.
118 | * For training and prediction, set the `GPU` environment variable, which the code treats as shorthand for `CUDA_VISIBLE_DEVICES`.
119 | * (optional) For the "End-to-end + Visual" baseline, first download images from [VisDial](https://visualdialog.org/data) to the `data/images` folder, then run `python get_im_fc.py` to get image features.
120 | * Training: `python train.py <experiment>`
121 |   * Results are stored in the `logs` directory and can be viewed via TensorBoard.
122 | * Prediction: `python predict.py <experiment>`
123 | * Evaluation: `python evaluate.py <experiment>`
124 |
125 | ## Acknowledgment
126 | VisPro dataset is based on [VisDial v1.0](https://visualdialog.org/).
127 |
128 | We built the training framework based on the original [End-to-end Coreference Resolution](https://github.com/kentonl/e2e-coref).
129 |
130 | ## Others
131 | If you have questions about the data or the code, you are welcome to open an issue or send me an email; I will respond as soon as possible.
--------------------------------------------------------------------------------
/cache_elmo.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 |
4 | import numpy as np
5 | import tensorflow as tf
6 | import tensorflow_hub as hub
7 | import h5py
8 | import json
9 | import argparse
10 |
11 | def parse_args():
12 |     parser = argparse.ArgumentParser(description='cache elmo embedding')
13 |
14 |     parser.add_argument('--dataset', type=str, default='vispro',
15 |                         help='dataset: vispro, vispro_cdd, vispro_mscoco')
16 |
17 |     args = parser.parse_args()
18 |     return args
19 |
20 | def build_elmo():
21 |     token_ph = tf.placeholder(tf.string, [None, None])
22 |     len_ph = tf.placeholder(tf.int32, [None])
23 |     elmo_module = hub.Module("https://tfhub.dev/google/elmo/2")
24 |     lm_embeddings = elmo_module(
25 |         inputs={"tokens": token_ph, "sequence_len": len_ph},
26 |         signature="tokens", as_dict=True)
27 |     word_emb = lm_embeddings["word_emb"]
28 |     lm_emb = tf.stack([tf.concat([word_emb, word_emb], -1),
29 |                        lm_embeddings["lstm_outputs1"],
30 |                        lm_embeddings["lstm_outputs2"]], -1)
31 |     return token_ph, len_ph, lm_emb
32 |
33 | def cache_dataset(data_path, session, dataset, token_ph, len_ph, lm_emb, out_file):
34 |     with open(data_path) as in_file:
35 |         for doc_num, line in enumerate(in_file.readlines()):
36 |             example = json.loads(line)
37 |             sentences = example["sentences"]
38 |
39 |             if dataset == 'vispro':
40 |                 caption = sentences.pop(0)
41 |
42 |             max_sentence_length = max(len(s) for s in sentences)
43 |             tokens = [[""] * max_sentence_length for _ in sentences]
44 |             text_len = np.array([len(s) for s in sentences])
45 |
46 |             for i, sentence in enumerate(sentences):
47 |                 for j, word in enumerate(sentence):
48 |                     tokens[i][j] = word
49 |             tokens = np.array(tokens)
50 |
51 |             if dataset == 'vispro':
52 |                 # extract dialog
53 |                 tf_lm_emb_dial = session.run(lm_emb, feed_dict={
54 |                     token_ph: tokens,
55 |                     len_ph: text_len
56 |                 })
57 |                 file_key = example["doc_key"].replace("/", ":")
58 |                 group = out_file.create_group(file_key)
59 |                 for i, (e, l) in enumerate(zip(tf_lm_emb_dial, text_len)):
60 |                     e = e[:l, :, :]
61 |                     group[str(i + 1)] = e
62 |
63 |                 # extract caption alone
64 |                 # extract spans from caption
65 |                 caption_NPs = example['correct_caption_NPs']
66 |                 file_key = file_key + ':cap'
67 |                 group = out_file.create_group(file_key)
68 |                 # caption_NPs might be empty
69 |                 if len(caption_NPs) == 0:
70 |                     continue
71 |                 # extract elmo feature for all spans
72 |                 span_len = [c[1] - c[0] + 1 for c in caption_NPs]
73 |                 span_list = [[""] * max(span_len) for _ in caption_NPs]
74 |                 for i, (span_start, span_end) in enumerate(caption_NPs):
75 |                     for j, index in enumerate(range(span_start, span_end + 1)):
76 |                         span_list[i][j] = caption[index].lower()
77 |                 span_list = np.array(span_list)
78 |                 tf_lm_emb_cap = session.run(lm_emb, feed_dict={
79 |                     token_ph: span_list,
80 |                     len_ph: span_len
81 |                 })
82 |                 for i, (e, l) in enumerate(zip(tf_lm_emb_cap, span_len)):
83 |                     e = e[:l, :, :]
84 |                     group[str(i)] = e
85 |
86 |             else:
87 |                 tf_lm_emb = session.run(lm_emb, feed_dict={
88 |                     token_ph: tokens,
89 |                     len_ph: text_len
90 |                 })
91 |                 file_key = example["doc_key"].replace("/", ":")
92 |                 group = out_file.create_group(file_key)
93 |                 for i, (e, l) in enumerate(zip(tf_lm_emb, text_len)):
94 |                     e = e[:l, :, :]
95 |                     group[str(i)] = e
96 |
97 |             if doc_num % 10 == 0:
98 |                 print(f"Cached {doc_num + 1} documents in {data_path}")
99 |
100 | if __name__ == "__main__":
101 |     token_ph, len_ph, lm_emb = build_elmo()
102 |
103 |     args = parse_args()
104 |     if args.dataset == 'vispro':
105 |         json_filenames = ['data/' + s + '.vispro.1.1.jsonlines'
106 |                           for s in ['train', 'val', 'test']]
107 |     elif args.dataset == 'vispro_cdd':
108 |         json_filenames = ['data/cdd_np.vispro.1.1.jsonlines']
109 |     elif args.dataset == 'vispro_mscoco':
110 |         json_filenames = ['data/mscoco_label.jsonlines']
111 |     config = tf.ConfigProto()
112 |     config.gpu_options.allow_growth = True
113 |     with tf.Session(config=config) as session:
114 |         session.run(tf.global_variables_initializer())
115 |         h5_filename = "data/elmo_cache.%s.hdf5" % args.dataset
116 |         out_file = h5py.File(h5_filename, "w")
117 |         for json_filename in json_filenames:
118 |             cache_dataset(json_filename, session, args.dataset, token_ph, len_ph, lm_emb, out_file)
119 |         out_file.close()
120 |
--------------------------------------------------------------------------------
/coref_kernels.cc:
--------------------------------------------------------------------------------
1 | #include