├── .DS_Store
├── .gitignore
├── LICENSE
├── README.md
├── VideoGPT2.py
├── dataset.py
├── generate.py
├── images
│   ├── Figure1.png
│   └── Figure2.png
├── requirements.txt
└── train.py

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ictnlp/DSTC8-AVSD/e9578fe1dc0d982928b4be8b5e133036664ad05c/.DS_Store
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 ICTNLP

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DSTC8-AVSD

We ranked 1st in the DSTC8 Audio-Visual Scene-Aware Dialog challenge. This is the source code for our AAAI 2020 DSTC8-AVSD workshop paper [Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog](https://arxiv.org/abs/2002.00163). Zekang Li, Zongjia Li, Jinchao Zhang, Yang Feng, Cheng Niu, Jie Zhou. AAAI 2020.

## News

Our paper has been accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP). [url]()

## Abstract

Audio-Visual Scene-Aware Dialog (AVSD) is the task of generating responses in a conversation about a given video; it is organized as a track of the 8th Dialog System Technology Challenge (DSTC8). To solve the task, we propose a universal multimodal transformer and introduce a multi-task learning method to learn joint representations across modalities and to generate informative and fluent responses. Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task. Our system achieved the best performance in both the objective and subjective evaluations of the challenge.

![A dialogue sampled from the DSTC8-AVSD dataset. For each dialogue, there are a video, audio, a video caption, a dialogue summary, and 10 turns of conversation about the video.](./images/Figure1.png)

## Model Architecture

![](./images/Figure2.png)



## How to Run

### Requirements

Python 3.6

torch==1.0.1
pytorch-ignite==0.2.1
transformers==2.1.1
tqdm==4.36.1

```shell
pip install -r requirements.txt
```

### Data

Download the DSTC8 [dataset](https://drive.google.com/drive/folders/1SlZTySJAk_2tiMG5F8ivxCfOl_OWwd_Q), which includes the training, validation, and test dialogues as well as features of the Charades videos extracted with the VGGish and I3D models.

All the data should be saved in the `data/` folder in the repository root. A quick way to sanity-check the download is sketched below.
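The snippet below is only a convenience sketch for checking that the download landed where the code expects it: it walks `data/` and summarizes the files found there. The actual file names and formats are defined by the DSTC8 release and consumed by `dataset.py`, not by this snippet.

```python
import os
from collections import Counter

DATA_DIR = "data"  # folder described above; adjust if you stored the data elsewhere

counts = Counter()
total_bytes = 0
for root, _, files in os.walk(DATA_DIR):
    for name in files:
        path = os.path.join(root, name)
        counts[os.path.splitext(name)[1] or "<no extension>"] += 1
        total_bytes += os.path.getsize(path)

print(f"{sum(counts.values())} files, {total_bytes / 1e9:.2f} GB")
for ext, n in counts.most_common():
    print(f"  {ext}: {n}")
```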
### Train

```shell
python train.py --log_path log/
```

### Generate

```shell
python generate.py --model_checkpoint log/ --output result.json --beam_search
```



## Citation

If you use this code in your research, please cite our AAAI 2020 DSTC8 workshop paper:

```
@article{li2020bridging,
  title={Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog},
  author={Zekang Li and Zongjia Li and Jinchao Zhang and Yang Feng and Cheng Niu and Jie Zhou},
  year={2020},
  eprint={2002.00163},
  archivePrefix={arXiv},
  journal={CoRR},
  primaryClass={cs.CL}
}
```

--------------------------------------------------------------------------------
/VideoGPT2.py:
--------------------------------------------------------------------------------
from transformers import *
import math
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss


def gelu(x):
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))


class Attention(nn.Module):
    def __init__(self, nx, n_ctx, config, scale=False):
        super(Attention, self).__init__()
        self.output_attentions = config.output_attentions

        n_state = nx  # in Attention: n_state=768 (nx=n_embd)
        # [switch nx => n_state from Block to Attention to keep identical to TF implem]
        assert n_state % config.n_head == 0
        self.register_buffer("bias", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))
        self.n_head = config.n_head
        self.split_size = n_state
        self.scale = scale

        self.c_attn = Conv1D(n_state * 3, nx)
        self.c_proj = Conv1D(n_state, nx)
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        self.pruned_heads = set()

    def prune_heads(self, heads):
        if len(heads) == 0:
            return
        mask = torch.ones(self.n_head, self.split_size // self.n_head)
        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads
        for head in heads:
            # Compute how many pruned heads are before the head and move the index accordingly
            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
            mask[head] = 0
        mask = mask.view(-1).contiguous().eq(1)
        index = torch.arange(len(mask))[mask].long()
        index_attn = torch.cat([index, index + self.split_size, index + (2*self.split_size)])

        # Prune conv1d layers
        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)
        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)

        # Update hyper params
        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))
        self.n_head = self.n_head - len(heads)
        self.pruned_heads = self.pruned_heads.union(heads)

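    # Note on masking: `_attn` below deviates from the stock GPT-2 attention in
    # that `attention_mask` is expected to be a two-element list/tuple, as
    # prepared by VideoGPT2Model.forward():
    #   attention_mask[0] is added to the causal (lower-triangular) bias `b`;
    #     every position where the sum is positive may be attended to, so this
    #     mask can open up non-causal attention over part of the sequence
    #     (e.g. the video/audio prefix).
    #   attention_mask[1] is a 0/1 keep-mask (typically for padding); masked
    #     positions receive a large negative score (-1e18) before the softmax.
    # If no mask is passed, only the plain causal bias is applied.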
    def _attn(self, q, k, v, attention_mask=None, head_mask=None):
        w = torch.matmul(q, k)
        if self.scale:
            w = w / math.sqrt(v.size(-1))
        nd, ns = w.size(-2), w.size(-1)
        b = self.bias[:, :, ns-nd:ns, :ns]
        #w = w * b - 1e18 * (1 - b)

        if attention_mask is not None:
            # Apply the attention mask
            b = torch.gt(b + attention_mask[0], 0).float()
            w = w * b - 1e18 * (1 - b)
            w = w - 1e18 * (1 - attention_mask[1])
        else:
            w = w * b - 1e18 * (1 - b)

        w = nn.Softmax(dim=-1)(w)
        w = self.attn_dropout(w)

        # Mask heads if we want to
        if head_mask is not None:
            w = w * head_mask

        outputs = [torch.matmul(w, v)]
        if self.output_attentions:
            outputs.append(w)
        return outputs

    def merge_heads(self, x):
        x = x.permute(0, 2, 1, 3).contiguous()
        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)
        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states

    def split_heads(self, x, k=False):
        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)
        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states
        if k:
            return x.permute(0, 2, 3, 1)  # (batch, head, head_features, seq_length)
        else:
            return x.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)

    def forward(self, x, layer_past=None, attention_mask=None, head_mask=None):
        x = self.c_attn(x)
        query, key, value = x.split(self.split_size, dim=2)
        query = self.split_heads(query)
        key = self.split_heads(key, k=True)
        value = self.split_heads(value)
        if layer_past is not None:
            past_key, past_value = layer_past[0].transpose(-2, -1), layer_past[1]  # transpose back cf below
            key = torch.cat((past_key, key), dim=-1)
            value = torch.cat((past_value, value), dim=-2)
        present = torch.stack((key.transpose(-2, -1), value))  # transpose to have same shapes for stacking

        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)
        a = attn_outputs[0]

        a = self.merge_heads(a)
        a = self.c_proj(a)
        a = self.resid_dropout(a)

        outputs = [a, present] + attn_outputs[1:]
        return outputs  # a, present, (attentions)


class MLP(nn.Module):
    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)
        super(MLP, self).__init__()
        nx = config.n_embd
        self.c_fc = Conv1D(n_state, nx)
        self.c_proj = Conv1D(nx, n_state)
        self.act = gelu
        self.dropout = nn.Dropout(config.resid_pdrop)

    def forward(self, x):
        h = self.act(self.c_fc(x))
        h2 = self.c_proj(h)
        return self.dropout(h2)


class Block(nn.Module):
    def __init__(self, n_ctx, config, scale=False):
        super(Block, self).__init__()
        nx = config.n_embd
        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)
        self.attn = Attention(nx, n_ctx, config, scale)
        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)
        self.mlp = MLP(4 * nx, config)

    def forward(self, x, layer_past=None, attention_mask=None, head_mask=None):
        output_attn = self.attn(self.ln_1(x),
                                layer_past=layer_past,
                                attention_mask=attention_mask,
                                head_mask=head_mask)
        a = output_attn[0]  # output_attn: a, present, (attentions)

        x = x + a
        m = self.mlp(self.ln_2(x))
        x = x + m

        outputs = [x] + output_attn[1:]
        return outputs  # x, present, (attentions)


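# VideoGPT2Model keeps the GPT-2 backbone (wte, wpe, drop and ln_f come from
# GPT2Model) but rebuilds the layer stack with the local Block/Attention classes
# so that the two-part attention mask described in Attention is supported.
# Unlike the stock GPT2Model, forward() takes pre-computed input embeddings
# (`input_embs`) rather than token ids, which lets the caller mix projected
# video/audio features with word embeddings before they enter the transformer.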
class VideoGPT2Model(GPT2Model):

    def __init__(self, config):
        super(VideoGPT2Model, self).__init__(config)
        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])

    def forward(self, input_embs, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None):
        if past is None:
            past_length = 0
            past = [None] * len(self.h)
        else:
            past_length = past[0][0].size(-2)
        if position_ids is None:
            position_ids = torch.arange(past_length, input_embs.size(-2) + past_length, dtype=torch.long, device=input_embs.device)
            position_ids = position_ids.unsqueeze(0).expand_as(input_embs[:, :, 0])

        # Attention mask.
        if attention_mask is not None:
            # We create a broadcastable attention mask from a 2D tensor mask.
            # Sizes are [batch_size, 1, 1, to_seq_length]
            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
            # this attention mask is simpler than the triangular masking of causal attention
            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
            attention_mask[0] = attention_mask[0].unsqueeze(1).unsqueeze(2)
            attention_mask[1] = attention_mask[1].unsqueeze(1).unsqueeze(2)

            # attention_mask is 1.0 for positions we want to attend and 0.0 for
            # masked positions. Here the masks are only reshaped and cast; the
            # actual masking (adding a large negative value before the softmax)
            # is applied inside Attention._attn.
            attention_mask[0] = attention_mask[0].to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
            attention_mask[1] = attention_mask[1].to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
            #attention_mask = (1.0 - attention_mask) * -1e18

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # head_mask has shape n_layer x batch x n_heads x N x N
        if head_mask is not None:
            if head_mask.dim() == 1:
                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
                head_mask = head_mask.expand(self.config.n_layer, -1, -1, -1, -1)
            elif head_mask.dim() == 2:
                head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer
            head_mask = head_mask.to(dtype=next(self.parameters()).dtype)  # switch to float if needed + fp16 compatibility
        else:
            head_mask = [None] * self.config.n_layer

        input_shape = input_embs.size()[:2]
        # input_ids = input_ids.view(-1, input_ids.size(-1))
        position_ids = position_ids.view(-1, position_ids.size(-1))

        # inputs_embeds = self.wte(input_ids)
        inputs_embeds = input_embs
        position_embeds = self.wpe(position_ids)
        if token_type_ids is not None:
            token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))
            token_type_embeds = self.wte(token_type_ids)
        else:
            token_type_embeds = 0
        hidden_states = inputs_embeds + position_embeds + token_type_embeds
        hidden_states = self.drop(hidden_states)

        output_shape = input_shape + (hidden_states.size(-1),)

        presents = ()
        all_attentions = []
        all_hidden_states = ()
        for i, (block, layer_past) in enumerate(zip(self.h, past)):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)

            outputs = block(hidden_states,
                            layer_past=layer_past,
                            attention_mask=attention_mask,
                            head_mask=head_mask[i])

            hidden_states, present = outputs[:2]
            presents = presents + (present,)

            if self.output_attentions:
                all_attentions.append(outputs[2])

        hidden_states = self.ln_f(hidden_states)

        hidden_states = hidden_states.view(*output_shape)
        # Add last hidden state
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = (hidden_states, presents)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            # let the number of heads free (-1) so we can extract attention even after head pruning
            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]
            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)
            outputs = outputs + (all_attentions,)
        return outputs  # last hidden state, presents, (all hidden_states), (attentions)


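# VideoGPT2LMHeadModel adds three heads on top of VideoGPT2Model:
#   lm_head          - tied to the word-embedding matrix, used for response
#                      generation;
#   video_ff         - projects 4224-d video/audio features into the embedding
#                      space (4224 presumably corresponds to the concatenated
#                      I3D and VGGish features mentioned in the README; the
#                      exact split is not documented in this file). It is not
#                      called inside forward() below, so it is presumably
#                      applied where the input embeddings are built.
#   video_inverse_ff - maps hidden states back to the 4224-d feature space for
#                      the video-feature regression loss.
# With mode="reply", forward() computes the usual shifted cross-entropy loss on
# the text labels (labels[0]); otherwise it computes an MSE loss between the
# regressed features and the target video features (labels[1]).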
class VideoGPT2LMHeadModel(GPT2PreTrainedModel):
    def __init__(self, config):
        super(VideoGPT2LMHeadModel, self).__init__(config)
        self.transformer = VideoGPT2Model(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.video_ff = nn.Linear(4224, config.n_embd)
        self.video_inverse_ff = nn.Linear(config.n_embd, 4224)

        self.init_weights()
        self.tie_weights()

    def tie_weights(self):
        """ Make sure we are sharing the input and output embeddings.
            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
        """
        self._tie_or_clone_weights(self.lm_head,
                                   self.transformer.wte)

    def forward(self, input_embs, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None,
                labels=None, mode="reply"):
        transformer_outputs = self.transformer(input_embs,
                                               past=past,
                                               attention_mask=attention_mask,
                                               token_type_ids=token_type_ids,
                                               position_ids=position_ids,
                                               head_mask=head_mask)
        hidden_states = transformer_outputs[0]

        lm_logits = self.lm_head(hidden_states)

        outputs = (lm_logits,) + transformer_outputs[1:]
        if labels is not None:
            # Shift so that tokens < n predict n
            if mode == "reply":
                shift_logits = lm_logits[..., :-1, :].contiguous()
                shift_labels = labels[0][..., 1:].contiguous()
                # Flatten the tokens
                loss_text_fct = CrossEntropyLoss(ignore_index=-1)
                loss_text = loss_text_fct(shift_logits.view(-1, shift_logits.size(-1)),
                                          shift_labels.view(-1))
                loss = loss_text
            else:
                lm_video_regs = self.video_inverse_ff(hidden_states[:, :labels[1].size(1), :])
                shift_video_regs = lm_video_regs[..., :-1, :].contiguous()
                shift_video_labels = labels[1][..., :-1, :].contiguous()
                loss_video_fct = MSELoss(reduce=True, size_average=True)
                loss_video = loss_video_fct(shift_video_regs, shift_video_labels)
                loss = loss_video
            outputs = (loss,) + outputs

        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)

--------------------------------------------------------------------------------
/dataset.py:
--------------------------------------------------------------------------------
# coding: utf-8
# author: noctli
import json
import pickle
import logging
import numpy as np
import torch
import torch.utils.data
from torch.utils.data import Dataset
from itertools import chain
# from train import SPECIAL_TOKENS, MODEL_INPUTS, PADDED_INPUTS
SPECIAL_TOKENS = ["", "", "", "","", "