├── referring_segmentation ├── model │ ├── backbone │ │ ├── deeplab_resnet.py │ │ ├── __pycache__ │ │ │ ├── resnet.cpython-38.pyc │ │ │ ├── mobilenet.cpython-38.pyc │ │ │ └── frozen_batchnorm.cpython-38.pyc │ │ ├── frozen_batchnorm.py │ │ └── mobilenet.py │ ├── __pycache__ │ │ └── model.cpython-38.pyc │ └── module │ │ ├── __pycache__ │ │ ├── TCN.cpython-38.pyc │ │ ├── aspp.cpython-38.pyc │ │ └── attention.cpython-38.pyc │ │ ├── aspp.py │ │ ├── attention.py │ │ └── TCN.py ├── utils │ ├── __pycache__ │ │ ├── loss.cpython-38.pyc │ │ ├── utils.cpython-38.pyc │ │ ├── tester.cpython-38.pyc │ │ ├── trainer.cpython-38.pyc │ │ ├── check_datas.cpython-38.pyc │ │ └── video_reader.cpython-38.pyc │ ├── utils.py │ ├── video_reader.py │ ├── tester.py │ └── loss.py ├── dataset │ ├── __pycache__ │ │ ├── dataset.cpython-38.pyc │ │ └── augmentation.cpython-38.pyc │ └── dataset.py ├── pre_proc │ ├── video2imgs.py │ └── generate_data_list.py ├── json │ ├── config_a2d_sentences.json │ └── config_jhmdb_sentences.json └── main.py ├── docs └── net.png ├── temporal_grounding ├── model │ ├── __pycache__ │ │ ├── model.cpython-38.pyc │ │ └── decoder.cpython-38.pyc │ ├── module │ │ ├── __pycache__ │ │ │ ├── TCN.cpython-38.pyc │ │ │ ├── aspp.cpython-38.pyc │ │ │ ├── encoder.cpython-38.pyc │ │ │ ├── attention.cpython-38.pyc │ │ │ ├── tanmodule.cpython-38.pyc │ │ │ └── RefTransformer.cpython-38.pyc │ │ ├── attention.py │ │ └── RefTransformer.py │ ├── backbone │ │ ├── __pycache__ │ │ │ ├── C3D.cpython-38.pyc │ │ │ ├── resnet.cpython-38.pyc │ │ │ ├── mobilenet.cpython-38.pyc │ │ │ ├── pytorch_i3d.cpython-38.pyc │ │ │ └── frozen_batchnorm.cpython-38.pyc │ │ └── C3D.py │ └── model.py ├── utils │ ├── __pycache__ │ │ ├── losses.cpython-38.pyc │ │ ├── tester.cpython-38.pyc │ │ ├── utils.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── optimizer.cpython-38.pyc │ │ ├── scheduler.cpython-38.pyc │ │ ├── trainer.cpython-38.pyc │ │ ├── adam_optimizer.cpython-38.pyc │ │ └── generate_batch.cpython-38.pyc │ ├── __init__.py │ ├── generate_batch.py │ ├── losses.py │ ├── scheduler.py │ ├── tester.py │ ├── optimizer.py │ ├── adam_optimizer.py │ └── utils.py ├── json │ ├── config_TACoS_C3D_anchor.json │ ├── config_Charades-STA_I3D_anchor.json │ ├── config_Charades-STA_I3D_regression.json │ ├── config_TACoS_C3D_regression.json │ ├── config_Charades-STA_VGG_anchor.json │ ├── config_Charades-STA_VGG_regression.json │ ├── config_ActivityNet_C3D_anchor.json │ └── config_ActivityNet_C3D_regression.json ├── main.py └── dataset.py ├── spatiotemporal_grounding ├── util │ ├── __pycache__ │ │ ├── dist.cpython-38.pyc │ │ ├── misc.cpython-38.pyc │ │ ├── optim.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── box_ops.cpython-38.pyc │ │ └── metrics.cpython-38.pyc │ ├── __init__.py │ ├── optim.py │ ├── box_ops.py │ └── metrics.py ├── models │ ├── __pycache__ │ │ ├── model.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── criterion.cpython-38.pyc │ │ ├── anchor_utils.cpython-38.pyc │ │ └── postprocessors.cpython-38.pyc │ ├── module │ │ ├── __pycache__ │ │ │ ├── TCN.cpython-310.pyc │ │ │ ├── TCN.cpython-38.pyc │ │ │ ├── aspp.cpython-310.pyc │ │ │ ├── aspp.cpython-38.pyc │ │ │ ├── attention.cpython-310.pyc │ │ │ └── attention.cpython-38.pyc │ │ ├── RefTransformer.py │ │ ├── attention.py │ │ └── decoder.py │ ├── __init__.py │ ├── anchor_utils.py │ ├── utils.py │ ├── postprocessors.py │ └── criterion.py ├── datasets │ ├── __pycache__ │ │ ├── hcstvg.cpython-38.pyc │ │ ├── vidstg.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── 
hcstvg_eval.cpython-38.pyc │ │ ├── vidstg_eval.cpython-38.pyc │ │ ├── torch_videovision.cpython-38.pyc │ │ └── video_transforms.cpython-38.pyc │ ├── __init__.py │ └── torch_videovision.py └── config │ ├── hcstvg.json │ └── vidstg.json └── README.md /referring_segmentation/model/backbone/deeplab_resnet.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/docs/net.png -------------------------------------------------------------------------------- /temporal_grounding/model/__pycache__/model.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/__pycache__/model.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/losses.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/losses.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/tester.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/tester.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/__pycache__/model.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/__pycache__/model.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/loss.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/loss.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/dist.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/dist.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/misc.cpython-38.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/misc.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/__pycache__/decoder.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/__pycache__/decoder.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/optimizer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/optimizer.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/scheduler.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/scheduler.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/trainer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/trainer.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/tester.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/tester.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/trainer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/trainer.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/optim.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/optim.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/TCN.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/TCN.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/aspp.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/aspp.cpython-38.pyc 
-------------------------------------------------------------------------------- /referring_segmentation/dataset/__pycache__/dataset.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/dataset/__pycache__/dataset.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/module/__pycache__/TCN.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/module/__pycache__/TCN.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/model.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/model.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/box_ops.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/box_ops.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/metrics.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/metrics.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/C3D.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/C3D.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/encoder.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/encoder.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/adam_optimizer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/adam_optimizer.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/generate_batch.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/generate_batch.cpython-38.pyc -------------------------------------------------------------------------------- 
/referring_segmentation/model/module/__pycache__/aspp.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/module/__pycache__/aspp.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/check_datas.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/check_datas.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/video_reader.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/video_reader.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/hcstvg.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/hcstvg.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/vidstg.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/vidstg.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/criterion.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/criterion.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/resnet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/resnet.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/attention.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/attention.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/tanmodule.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/tanmodule.cpython-38.pyc -------------------------------------------------------------------------------- 
/referring_segmentation/dataset/__pycache__/augmentation.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/dataset/__pycache__/augmentation.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/__pycache__/resnet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/backbone/__pycache__/resnet.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/anchor_utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/anchor_utils.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-310.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-310.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/mobilenet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/mobilenet.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/module/__pycache__/attention.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/module/__pycache__/attention.cpython-38.pyc 
-------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/hcstvg_eval.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/hcstvg_eval.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/vidstg_eval.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/vidstg_eval.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/postprocessors.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/postprocessors.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/pytorch_i3d.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/pytorch_i3d.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/RefTransformer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/RefTransformer.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/__pycache__/mobilenet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/backbone/__pycache__/mobilenet.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/attention.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/attention.cpython-310.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/attention.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/attention.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/torch_videovision.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/torch_videovision.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/video_transforms.cpython-38.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/video_transforms.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/config/hcstvg.json: -------------------------------------------------------------------------------- 1 | { 2 | "combine_datasets": [ 3 | "hcstvg" 4 | ], 5 | "combine_datasets_val": [ 6 | "hcstvg" 7 | ], 8 | "hcstvg_vid_path": "", 9 | "hcstvg_ann_path": "./annotations/HC-STVG/v1" 10 | } -------------------------------------------------------------------------------- /spatiotemporal_grounding/config/vidstg.json: -------------------------------------------------------------------------------- 1 | { 2 | "combine_datasets": [ 3 | "vidstg" 4 | ], 5 | "combine_datasets_val": [ 6 | "vidstg" 7 | ], 8 | "vidstg_vid_path": "", 9 | "vidstg_ann_path": "./annotations/VidSTG" 10 | } -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__init__.py: -------------------------------------------------------------------------------- 1 | # Adapted from 2 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 3 | # Copyright (c) Facebook, Inc. and its affiliates. 
All Rights Reserved 4 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from .vidstg import build as build_vidstg 2 | from .hcstvg import build as build_hcstvg 3 | 4 | 5 | def build_dataset(dataset_file: str, image_set: str, args): 6 | if dataset_file == "vidstg": 7 | return build_vidstg(image_set, args) 8 | if dataset_file == "hcstvg": 9 | return build_hcstvg(image_set, args) 10 | raise ValueError(f"dataset {dataset_file} not supported") 11 | -------------------------------------------------------------------------------- /referring_segmentation/pre_proc/video2imgs.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | from tqdm import tqdm 3 | import os 4 | 5 | def video2imgs(videos_path, imgs_save_path): 6 | videos = os.listdir(videos_path) 7 | for video_name in tqdm(videos): 8 | file_name = video_name.split('.')[0] 9 | img_save_path = os.path.join(imgs_save_path, file_name) 10 | if not os.path.exists(img_save_path): 11 | os.makedirs(img_save_path) 12 | 13 | vc = cv2.VideoCapture(os.path.join(videos_path, video_name)) 14 | i_frame = 0 15 | rval=vc.isOpened() 16 | 17 | while rval: 18 | i_frame = i_frame + 1 19 | rval, frame = vc.read() 20 | if rval: 21 | cv2.imwrite(os.path.join(img_save_path, '{:05d}.jpg'.format(i_frame)), frame) 22 | else: 23 | break 24 | vc.release() 25 | 26 | if __name__ =='__main__': 27 | videos_path = '/media/wwk/HDD1/dataset/referring_video_segmentation/a2d_sentences/Release/clips320H' 28 | imgs_save_path = '/media/wwk/HDD2/datasets/referring_video_segmentation/a2d_sentences/Rename_Images' 29 | video2imgs(videos_path, imgs_save_path) 30 | -------------------------------------------------------------------------------- /temporal_grounding/json/config_TACoS_C3D_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "TACoS", 6 | "testing_datasets": "TACoS", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 4, 11 | 9, 12 | 16, 13 | 29, 14 | 64 15 | ], 16 | "segment_num": 200, 17 | "embedding_type": "glove_840B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "anchor_based", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 4096, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 4, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/TACoS_anchor", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 8, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/TACoS_anchor/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_I3D_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | 
"seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "I3D", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "anchor_based", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 1024, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_I3D_anchor", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_I3D_anchor/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_I3D_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "I3D", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "regression", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 1024, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_I3D_regression", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_I3D_regression/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 
7 | 8 | import importlib 9 | import os 10 | 11 | from .optimizer import FairseqLRScheduler 12 | 13 | 14 | LR_SCHEDULER_REGISTRY = {} 15 | 16 | 17 | def build_lr_scheduler(args, optimizer): 18 | return LR_SCHEDULER_REGISTRY[args.lr_scheduler](args, optimizer) 19 | 20 | 21 | def register_lr_scheduler(name): 22 | """Decorator to register a new LR scheduler.""" 23 | 24 | def register_lr_scheduler_cls(cls): 25 | if name in LR_SCHEDULER_REGISTRY: 26 | raise ValueError('Cannot register duplicate LR scheduler ({})'.format(name)) 27 | if not issubclass(cls, FairseqLRScheduler): 28 | raise ValueError('LR Scheduler ({}: {}) must extend FairseqLRScheduler'.format(name, cls.__name__)) 29 | LR_SCHEDULER_REGISTRY[name] = cls 30 | return cls 31 | 32 | return register_lr_scheduler_cls 33 | 34 | 35 | # automatically import any Python files in the optimizer/lr_scheduler/ directory 36 | # for file in os.listdir(os.path.dirname(__file__)): 37 | # if file.endswith('.py') and not file.startswith('_'): 38 | # module = file[:file.find('.py')] 39 | # importlib.import_module('optimizer.lr_scheduler.' + module) 40 | -------------------------------------------------------------------------------- /temporal_grounding/json/config_TACoS_C3D_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "TACoS", 6 | "testing_datasets": "TACoS", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 4, 11 | 9, 12 | 16, 13 | 29, 14 | 64 15 | ], 16 | "segment_num": 200, 17 | "embedding_type": "glove_840B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "regression", 20 | "dropout": 0.1, 21 | "prenorm": false, 22 | "embedding_dim": 300, 23 | "video_fea_dim": 4096, 24 | "attention_dim": 256, 25 | "MLP_dim": 256, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 4, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.0005, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/TACoS_regression", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 8, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/TACoS_regression/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_VGG_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "VGG", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "anchor_based", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 4096, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 
0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_VGG_anchor", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_VGG_anchor/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_VGG_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "VGG", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "regression", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 4096, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_VGG_regression", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_VGG_regression/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_ActivityNet_C3D_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "ActivityNet", 6 | "testing_datasets": "ActivityNet", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 16, 11 | 32, 12 | 64, 13 | 96, 14 | 128, 15 | 160, 16 | 192 17 | ], 18 | "segment_num": 200, 19 | "embedding_type": "glove_840B_300", 20 | "embedding_length": 40, 21 | "decoder_type": "anchor_based", 22 | "dropout": 0.1, 23 | "embedding_dim": 300, 24 | "video_fea_dim": 500, 25 | "attention_dim": 256, 26 | "MLP_dim": 256, 27 | "padding_type": "circle", 28 | "layer_num": 10, 29 | "prenorm": false, 30 | "with_text": true, 31 | "with_attention": true, 32 | "with_mlp": true, 33 | "groups": 4, 34 | "gru_bidirection": true, 35 | "thres_score": 0.3, 36 | "thres_adjmat": 0.8, 37 | "resume": "", 38 | "batch_size": 128, 39 | "epochs": 20, 40 | "lr": 1e-3, 41 | "optimizer": "AdamW", 42 | "lr_schedule": "StepLR", 43 | "decay_epochs": 15, 44 | "decay_ratio": 0.1, 45 | "with_weight_loss": true, 46 | "loss_weight": [ 47 | 1, 48 | 0.001, 49 | 0.001 50 | ], 51 | "log_root": "./logs/ActivityNet_anchor", 52 | "save_temp_iters": 10, 53 | "save_iters": 5, 54 | "num_worker": 8, 55 | "test_savefold": "./result", 56 | "checkpoint": 
"./logs/ActivityNet_C3D_anchor/checkpoints/best_model.pth" 57 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_ActivityNet_C3D_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "ActivityNet", 6 | "testing_datasets": "ActivityNet", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 16, 11 | 32, 12 | 64, 13 | 96, 14 | 128, 15 | 160, 16 | 192 17 | ], 18 | "segment_num": 64, 19 | "embedding_type": "glove_840B_300", 20 | "embedding_length": 40, 21 | "decoder_type": "regression", 22 | "dropout": 0.1, 23 | "embedding_dim": 300, 24 | "video_fea_dim": 500, 25 | "attention_dim": 256, 26 | "MLP_dim": 256, 27 | "padding_type": "circle", 28 | "layer_num": 10, 29 | "prenorm": true, 30 | "with_text": true, 31 | "with_attention": true, 32 | "with_mlp": true, 33 | "groups": 4, 34 | "gru_bidirection": true, 35 | "thres_score": 0.3, 36 | "thres_adjmat": 0.8, 37 | "resume": "", 38 | "batch_size": 128, 39 | "epochs": 20, 40 | "lr": 0.0005, 41 | "optimizer": "AdamW", 42 | "lr_schedule": "StepLR", 43 | "decay_epochs": 15, 44 | "decay_ratio": 0.1, 45 | "with_weight_loss": true, 46 | "loss_weight": [ 47 | 1, 48 | 0.001, 49 | 0.001 50 | ], 51 | "log_root": "./logs/ActivityNet_regression", 52 | "save_temp_iters": 10, 53 | "save_iters": 5, 54 | "num_worker": 8, 55 | "test_savefold": "./result", 56 | "checkpoint": "./logs/ActivityNet_C3D_regression/checkpoints/best_model.pth" 57 | } -------------------------------------------------------------------------------- /referring_segmentation/json/config_a2d_sentences.json: -------------------------------------------------------------------------------- 1 | { 2 | "setting_config": { 3 | "cuda": true, 4 | "gpu_id": "0", 5 | "seed": 20 6 | }, 7 | "data_config": { 8 | "training_datasets": [ 9 | "a2d_sentences" 10 | ], 11 | "testing_datasets": [ 12 | "a2d_sentences" 13 | ], 14 | "datasets_root": "/media/wwk/HDD2/datasets/referring_video_segmentation/a2d_sentences", 15 | "input_size": [ 16 | 320, 17 | 320 18 | ], 19 | "clip_size": 8, 20 | "augmentations": { 21 | "random_crop": true, 22 | "random_flip": false 23 | }, 24 | "embedding_type": "glove_840B_300", 25 | "max_embedding_length": 20 26 | }, 27 | "model_config": { 28 | "backbone": "deeplab_resnet101", 29 | "input_dim": 3, 30 | "os": 16, 31 | "train_backbone": false, 32 | "backbone_path": "", 33 | "backbone_multi_scale": true, 34 | "video_feature_dim": 256, 35 | "text_feature_dim": 256, 36 | "TCN_feature_dim": 256, 37 | "gru_bidirection": true, 38 | "attention_dim": 256, 39 | "TCN_hidden_dim": 64, 40 | "embedding_dim": 300, 41 | "layer_num": 5, 42 | "is_local_attention": true, 43 | "groups": 4, 44 | "is_global_attention": true, 45 | "conv_type": "2D", 46 | "filter_type": "global", 47 | "global_fuse_type": "mutan", 48 | "local_fuse_type": "relevance_filter", 49 | "padding_type": "circle", 50 | "norm_type": "GroupNorm", 51 | "frozen_batchnorm": true, 52 | "frozen_backbone": true 53 | }, 54 | "training_config": { 55 | "resume": "", 56 | "batch_size": 8, 57 | "epochs": 20, 58 | 59 | "lr_backbone": 5e-05, 60 | "lr_branch": 0.0005, 61 | "optimizer": "AdamW", 62 | "lr_schedule": "Step", 63 | "loss_function": "SSIM", 64 | "log_root": "./logs", 65 | "save_iters": 100, 66 | "num_worker": 8, 67 | "lr_weight": 0.1 68 | }, 69 | "testing_config": { 70 | "test_savefold": "./result", 71 | "checkpoint": 
"./logs/checkpoints/checkpoint_best.pth" 72 | } 73 | } 74 | -------------------------------------------------------------------------------- /referring_segmentation/json/config_jhmdb_sentences.json: -------------------------------------------------------------------------------- 1 | { 2 | "setting_config": { 3 | "cuda": true, 4 | "gpu_id": "0", 5 | "seed": 20 6 | }, 7 | "data_config": { 8 | "training_datasets": [ 9 | "jhmdb_sentences" 10 | ], 11 | "testing_datasets": [ 12 | "jhmdb_sentences" 13 | ], 14 | "datasets_root": "/media/wwk/HDD1/dataset/referring_video_segmentation/jhmdb_sentences", 15 | "input_size": [ 16 | 320, 17 | 320 18 | ], 19 | "clip_size": 8, 20 | "augmentations": { 21 | "random_crop": true, 22 | "random_flip": false 23 | }, 24 | "embedding_type": "glove_840B_300", 25 | "max_embedding_length": 20 26 | }, 27 | "model_config": { 28 | "backbone": "deeplab_resnet101", 29 | "input_dim": 3, 30 | "os": 16, 31 | "train_backbone": false, 32 | "backbone_path": "", 33 | "backbone_multi_scale": true, 34 | "video_feature_dim": 256, 35 | "text_feature_dim": 256, 36 | "TCN_feature_dim": 256, 37 | "gru_bidirection": true, 38 | "attention_dim": 256, 39 | "TCN_hidden_dim": 64, 40 | "embedding_dim": 300, 41 | "layer_num": 5, 42 | "is_local_attention": true, 43 | "groups": 4, 44 | "is_global_attention": true, 45 | "conv_type": "2D", 46 | "filter_type": "global", 47 | "global_fuse_type": "mutan", 48 | "local_fuse_type": "relevance_filter", 49 | "padding_type": "circle", 50 | "norm_type": "GroupNorm", 51 | "frozen_batchnorm": true, 52 | "frozen_backbone": true 53 | }, 54 | "training_config": { 55 | "resume": "", 56 | "batch_size": 8, 57 | "epochs": 20, 58 | 59 | "lr_backbone": 5e-05, 60 | "lr_branch": 0.0005, 61 | "optimizer": "AdamW", 62 | "lr_schedule": "Step", 63 | "loss_function": "SSIM", 64 | "log_root": "./logs", 65 | "save_iters": 100, 66 | "num_worker": 8, 67 | "lr_weight": 0.1 68 | }, 69 | "testing_config": { 70 | "test_savefold": "./result", 71 | "checkpoint": "./logs/checkpoints/checkpoint_best.pth" 72 | } 73 | } 74 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__init__.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from .model import build_model 3 | 4 | def generate_anchors(windows): 5 | widths = torch.tensor(windows) 6 | center = 7.5 7 | start = center - 0.5 * (widths - 1) 8 | end = center + 0.5 * (widths - 1) 9 | return torch.stack([start, end], -1) 10 | 11 | def generate_proposals(max_num_frames, windows): 12 | anchors = generate_anchors(windows) 13 | widths = (anchors[:, 1] - anchors[:, 0] + 1) # [num_anchors] 14 | centers = torch.arange(0, max_num_frames) # [video_len] 15 | start = centers[:, None] - 0.5 * (widths[None, :] - 1) 16 | end = centers[:, None] + 0.5 * (widths[None, :] - 1) 17 | proposals = torch.stack([start, end], -1) # [video_len, num_anchors, 2] 18 | return proposals.view(-1, 2) 19 | 20 | def calculate_IoU_batch(i0, i1): 21 | union = (torch.min(torch.stack([i0[0], i1[0]], 0), 0)[0], torch.max(torch.stack([i0[1], i1[1]], 0), 0)[0]) 22 | inter = (torch.max(torch.stack([i0[0], i1[0]], 0), 0)[0], torch.min(torch.stack([i0[1], i1[1]], 0), 0)[0]) 23 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 24 | iou[union[1] - union[0] < -1e-5] = 0 25 | iou[iou < 0] = 0.0 26 | return iou 27 | 28 | 29 | def generate_scores(proposals, label, max_num_frames, thres_score): 30 | illegal = torch.logical_or(proposals[:, 0] < 0, proposals[:, 
1] >= max_num_frames) 31 | label1 = label[None, :].repeat(proposals.shape[0], 1) 32 | # label1 = np.repeat(np.expand_dims(label, 0), proposals.shape[0], 0) 33 | IoUs = calculate_IoU_batch((proposals[:, 0], proposals[:, 1]), 34 | (label1[:, 0], label1[:, 1])) 35 | IoUs[illegal] = 0.0 # [video_len * num_anchors] 36 | max_IoU = torch.max(IoUs) 37 | IoUs[IoUs < thres_score * max_IoU] = 0.0 38 | IoUs = IoUs / (max_IoU + 1e-4) 39 | 40 | scores = IoUs.float() 41 | scores_mask = (1 - illegal.float()) 42 | return scores, scores_mask 43 | -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/frozen_batchnorm.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class FrozenBatchNorm2d(torch.nn.Module): 6 | """ 7 | BatchNorm2d where the batch statistics and the affine parameters are fixed. 8 | 9 | Copy-paste from torchvision.misc.ops with added eps before rqsrt, 10 | without which any other models than torchvision.models.resnet[18,34,50,101] 11 | produce nans. 12 | """ 13 | 14 | def __init__(self, n): 15 | super(FrozenBatchNorm2d, self).__init__() 16 | self.register_buffer("weight", torch.ones(n)) 17 | self.register_buffer("bias", torch.zeros(n)) 18 | self.register_buffer("running_mean", torch.zeros(n)) 19 | self.register_buffer("running_var", torch.ones(n)) 20 | 21 | def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, 22 | missing_keys, unexpected_keys, error_msgs): 23 | num_batches_tracked_key = prefix + 'num_batches_tracked' 24 | if num_batches_tracked_key in state_dict: 25 | del state_dict[num_batches_tracked_key] 26 | 27 | super(FrozenBatchNorm2d, self)._load_from_state_dict( 28 | state_dict, prefix, local_metadata, strict, 29 | missing_keys, unexpected_keys, error_msgs) 30 | 31 | def forward(self, x): 32 | # move reshapes to the beginning 33 | # to make it fuser-friendly 34 | w = self.weight.reshape(1, -1, 1, 1) 35 | b = self.bias.reshape(1, -1, 1, 1) 36 | rv = self.running_var.reshape(1, -1, 1, 1) 37 | rm = self.running_mean.reshape(1, -1, 1, 1) 38 | eps = 1e-5 39 | scale = w * (rv + eps).rsqrt() 40 | bias = b - rm * scale 41 | return x * scale + bias 42 | 43 | 44 | def convert_to_frozen_batchnorm(module): 45 | new_module = module 46 | if isinstance(module, nn.BatchNorm2d): 47 | new_module = FrozenBatchNorm2d(module.num_features) 48 | for name, child in module.named_children(): 49 | new_module.add_module(name, convert_to_frozen_batchnorm(child)) 50 | return new_module -------------------------------------------------------------------------------- /temporal_grounding/utils/generate_batch.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | def collect_fn(batch): 4 | """ 5 | heat, offset, d, mask, embedding, fea, ratio, embedding_length, label 6 | """ 7 | heats = [] 8 | offsets = [] 9 | ds = [] 10 | ratios = [] 11 | feas = [] 12 | embeddings = [] 13 | masks = [] 14 | lengths = [] 15 | labels = [] 16 | max_fea_len = 0 17 | max_embedding_len = 0 18 | for b in batch: 19 | fea = b[5] 20 | embedding = b[4] 21 | lengths.append(b[7]) 22 | if max_fea_len < fea.shape[-1]: 23 | max_fea_len = fea.shape[-1] 24 | if max_embedding_len < embedding.shape[0]: 25 | max_embedding_len = embedding.shape[0] 26 | lengths = torch.tensor(lengths) 27 | sorted, indices = lengths.sort(descending=True) 28 | for index in indices: 29 | b = batch[index] 30 | seq_length = b[0].shape[0] 31 | 
fea_padded = torch.zeros((b[5].shape[0], max_fea_len)) 32 | embedding_padded = torch.zeros((max_embedding_len, b[4].shape[1])) 33 | heat_padded = torch.zeros(max_fea_len) 34 | mask_padded = torch.zeros(max_fea_len) 35 | d_padded = torch.zeros(max_fea_len) 36 | fea_padded[:, : seq_length] = b[5] 37 | embedding_padded[:b[4].shape[0], :] = b[4] 38 | mask_padded[: seq_length] = b[3] 39 | heat_padded[: seq_length] = b[0] 40 | d_padded[: seq_length] = b[2] 41 | 42 | offsets.append(b[1]) 43 | ds.append(d_padded) 44 | feas.append(fea_padded) 45 | heats.append(heat_padded) 46 | embeddings.append(embedding_padded) 47 | masks.append(mask_padded) 48 | ratios.append(b[6]) 49 | labels.append(b[-1]) 50 | feas = torch.stack(feas, dim=0) 51 | heats = torch.stack(heats, dim=0) 52 | embeddings = torch.stack(embeddings, dim=0) 53 | masks = torch.stack(masks, dim=0) 54 | ratios = torch.stack(ratios, dim=0) 55 | offsets = torch.stack(offsets, dim=0) 56 | ds = torch.stack(ds, dim=0) 57 | labels = torch.stack(labels, dim=0) 58 | return heats, offsets, ds, masks, embeddings, feas, ratios, sorted, labels 59 | -------------------------------------------------------------------------------- /referring_segmentation/main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | import numpy as np 5 | import random 6 | import torch 7 | from dataset.dataset import MyDataset 8 | from model.model import Model 9 | import torch.nn as nn 10 | from utils.trainer import Trainer 11 | from utils.tester import Tester 12 | 13 | 14 | def main(args): 15 | with open(args.json_file, 'r') as f: 16 | config = json.load(f) 17 | 18 | setting = config['setting_config'] 19 | data_config = config['data_config'] 20 | model_config = config['model_config'] 21 | training_config = config['training_config'] 22 | testing_config = config['testing_config'] 23 | is_cuda = setting['cuda'] 24 | if is_cuda: 25 | os.environ["CUDA_VISIBLE_DEVICES"] = setting['gpu_id'] 26 | if args.checkpoint is not None: 27 | testing_config['checkpoint'] = args.checkpoint 28 | 29 | torch.manual_seed(setting['seed']) 30 | np.random.seed(setting['seed']) 31 | random.seed(setting['seed']) 32 | 33 | training_config['cuda'] = is_cuda 34 | testing_config['cuda'] = is_cuda 35 | training_config['train_backbone'] = model_config['train_backbone'] 36 | 37 | # init_distributed_mode() 38 | if args.mode == 'train': 39 | dataset = MyDataset(data_config, 'train') 40 | val_dataset = MyDataset(data_config, 'test') 41 | elif args.mode == 'test': 42 | dataset = MyDataset(data_config, 'test') 43 | else: 44 | raise NotImplementedError 45 | 46 | model = Model(model_config) 47 | if is_cuda: 48 | model = nn.DataParallel(model) 49 | model = model.cuda() 50 | 51 | if not os.path.exists(training_config['log_root']): 52 | os.mkdir(training_config['log_root']) 53 | if args.mode == 'train': 54 | trainer = Trainer(training_config) 55 | trainer.train(model, dataset, val_dataset) 56 | elif args.mode == 'test': 57 | tester = Tester(testing_config) 58 | tester.test(model, dataset) 59 | else: 60 | raise NotImplementedError 61 | 62 | 63 | if __name__ == '__main__': 64 | 65 | parser = argparse.ArgumentParser() 66 | parser.add_argument('--json_file', type=str, default='json/config_a2d_sentences.json') 67 | parser.add_argument('--mode', type=str, default='train') 68 | parser.add_argument('--checkpoint', type=str, default=None) 69 | args = parser.parse_args() 70 | main(args) 71 | 
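
The entry point above simply wires the four JSON config blocks (setting/data/model/training or testing) from json/config_a2d_sentences.json into the classes it imports. As a rough orientation, a minimal programmatic equivalent of its train branch is sketched below; it assumes the code is run from the referring_segmentation/ directory with the shipped config file, and it is an illustrative sketch only, not part of the repository.

# Sketch: mirrors the "train" branch of referring_segmentation/main.py above.
import json
import torch.nn as nn
from dataset.dataset import MyDataset
from model.model import Model
from utils.trainer import Trainer

with open("json/config_a2d_sentences.json", "r") as f:
    config = json.load(f)

# main() copies these two flags into the training config before building the trainer.
training_config = config["training_config"]
training_config["cuda"] = config["setting_config"]["cuda"]
training_config["train_backbone"] = config["model_config"]["train_backbone"]

train_set = MyDataset(config["data_config"], "train")
val_set = MyDataset(config["data_config"], "test")

model = Model(config["model_config"])
if training_config["cuda"]:
    model = nn.DataParallel(model).cuda()

Trainer(training_config).train(model, train_set, val_set)

For evaluation, the same script is invoked with --mode test and, optionally, --checkpoint to override testing_config["checkpoint"], exactly as in the argparse block above.
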
-------------------------------------------------------------------------------- /temporal_grounding/main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | import numpy as np 5 | import random 6 | import torch 7 | from dataset import MyDataset 8 | from model.model import Model 9 | import torch.nn as nn 10 | from utils.trainer import Trainer 11 | from utils.tester import Tester 12 | 13 | 14 | def main(args): 15 | with open(args.json_file, 'r') as f: 16 | config = json.load(f) 17 | config['mode'] = args.mode 18 | if args.checkpoint is not None: 19 | config['checkpoint'] = args.checkpoint 20 | 21 | is_cuda = config['cuda'] 22 | if is_cuda: 23 | os.environ["CUDA_VISIBLE_DEVICES"] = config['gpu_id'] 24 | 25 | torch.manual_seed(config['seed']) 26 | np.random.seed(config['seed']) 27 | random.seed(config['seed']) 28 | 29 | # init_distributed_mode() 30 | if config['mode'] == 'train': 31 | dataset = MyDataset(config, 'train') 32 | eval_dataset = MyDataset(config, 'test') 33 | elif config['mode'] == 'test': 34 | dataset = MyDataset(config, 'test') 35 | else: 36 | raise NotImplementedError 37 | 38 | model = Model(config) 39 | if is_cuda: 40 | model = nn.DataParallel(model) 41 | model = model.cuda() 42 | 43 | if not os.path.exists(config['log_root']): 44 | os.makedirs(config['log_root']) 45 | if config['mode'] == 'train': 46 | trainer = Trainer(config) 47 | trainer.train(model, dataset, eval_dataset) 48 | elif config['mode'] == 'test': 49 | tester = Tester(config) 50 | tester.test(model, dataset) 51 | else: 52 | raise NotImplementedError 53 | 54 | 55 | if __name__ == '__main__': 56 | # python main.py --json_file=json/config_Charades-STA_I3D_regression.json --mode=train 57 | # python main.py --json_file=json/config_Charades-STA_I3D_anchor.json --mode=train 58 | # python main.py --json_file=json/config_ActivityNet_C3D_regression.json --mode=train 59 | # python main.py --json_file=json/config_ActivityNet_C3D_anchor.json --mode=train 60 | parser = argparse.ArgumentParser() 61 | parser.add_argument('--json_file', type=str, 62 | default='json/config_Charades-STA_I3D_regression.json', required=True) 63 | parser.add_argument('--mode', type=str, 64 | default='train', required=True, choices=['train', 'test']) 65 | parser.add_argument('--checkpoint', type=str, default=None) 66 | args = parser.parse_args() 67 | main(args) 68 | -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/C3D.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | import torch.nn as nn 4 | 5 | 6 | class C3D(nn.Module): 7 | """ 8 | nb_classes: nb_classes in classification task, 101 for UCF101 dataset 9 | """ 10 | 11 | def __init__(self, nb_classes): 12 | super(C3D, self).__init__() 13 | 14 | self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 15 | self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)) 16 | 17 | self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 18 | self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)) 19 | 20 | self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 21 | self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 22 | self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)) 23 | 24 | self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 25 | self.conv4b = 
nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 26 | self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)) 27 | 28 | self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 29 | self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 30 | self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1)) 31 | 32 | self.fc6 = nn.Linear(8192, 4096) 33 | self.fc7 = nn.Linear(4096, 4096) 34 | self.fc8 = nn.Linear(4096, nb_classes) 35 | 36 | self.dropout = nn.Dropout(p=0.5) 37 | 38 | self.relu = nn.ReLU() 39 | 40 | def forward(self, x): 41 | h = self.relu(self.conv1(x)) 42 | h = self.pool1(h) 43 | h = self.relu(self.conv2(h)) 44 | h = self.pool2(h) 45 | 46 | h = self.relu(self.conv3a(h)) 47 | h = self.relu(self.conv3b(h)) 48 | h = self.pool3(h) 49 | 50 | h = self.relu(self.conv4a(h)) 51 | h = self.relu(self.conv4b(h)) 52 | h = self.pool4(h) 53 | 54 | h = self.relu(self.conv5a(h)) 55 | h = self.relu(self.conv5b(h)) 56 | h = self.pool5(h) 57 | 58 | h = h.view(-1, 8192) 59 | h = self.relu(self.fc6(h)) 60 | out = h 61 | # h = self.dropout(h) 62 | # h = self.relu(self.fc7(h)) 63 | # out = h if feature_layer == 7 and out == None else out 64 | # h = self.dropout(h) 65 | # logits = self.fc8(h) 66 | return out 67 | 68 | 69 | 70 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/anchor_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | def generate_anchors(windows): 5 | widths = torch.tensor(windows) 6 | center = 7.5 7 | start = center - 0.5 * (widths - 1) 8 | end = center + 0.5 * (widths - 1) 9 | return torch.stack([start, end], -1) 10 | 11 | 12 | def generate_proposals(max_num_frames, windows): 13 | anchors = generate_anchors(windows) 14 | widths = (anchors[:, 1] - anchors[:, 0] + 1) # [num_anchors] 15 | centers = torch.arange(0, max_num_frames) # [video_len] 16 | start = centers[:, None] - 0.5 * (widths[None, :] - 1) 17 | end = centers[:, None] + 0.5 * (widths[None, :] - 1) 18 | proposals = torch.stack([start, end], -1) # [video_len, num_anchors, 2] 19 | return proposals.view(-1, 2) 20 | 21 | 22 | def calculate_IoU_batch(i0, i1): 23 | union = (torch.min(torch.stack([i0[0], i1[0]], 0), 0)[ 24 | 0], torch.max(torch.stack([i0[1], i1[1]], 0), 0)[0]) 25 | inter = (torch.max(torch.stack([i0[0], i1[0]], 0), 0)[ 26 | 0], torch.min(torch.stack([i0[1], i1[1]], 0), 0)[0]) 27 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 28 | iou[union[1] - union[0] < -1e-5] = 0 29 | iou[iou < 0] = 0.0 30 | return iou 31 | 32 | 33 | def generate_scores(proposals, label, max_num_frames, thres_score): 34 | illegal = torch.logical_or( 35 | proposals[:, 0] < 0, proposals[:, 1] >= max_num_frames) 36 | label1 = label[None, :].repeat(proposals.shape[0], 1) 37 | # label1 = np.repeat(np.expand_dims(label, 0), proposals.shape[0], 0) 38 | IoUs = calculate_IoU_batch((proposals[:, 0], proposals[:, 1]), 39 | (label1[:, 0], label1[:, 1])) 40 | IoUs[illegal] = 0.0 # [video_len * num_anchors] 41 | max_IoU = torch.max(IoUs) 42 | IoUs[IoUs < thres_score * max_IoU] = 0.0 43 | IoUs = IoUs / (max_IoU + 1e-4) 44 | 45 | scores = IoUs.float() 46 | scores_mask = (1 - illegal.float()) 47 | return scores, scores_mask 48 | 49 | 50 | def generate_2d_gaussian(boxes, w, h, delta=0.05): 51 | # boxes: k*4 cxcywh 52 | n_boxes = len(boxes) 53 | ww = torch.linspace(0, 1, w) 54 | hh = torch.linspace(0, 1, h) 55 | 
gridh, gridw = torch.meshgrid(hh, ww) 56 | grid = torch.stack([gridw, gridh], dim=0)[None, ...].repeat( 57 | n_boxes, 1, 1, 1).to(boxes.device) # k*2*h*w 58 | boxes = boxes[..., None, None].repeat(1, 1, h, w) 59 | gaussian = torch.exp(-(boxes[:, 0]-grid[:, 0])**2/(delta*boxes[:, 2]**2)) *\ 60 | torch.exp(-(boxes[:, 1]-grid[:, 1])**2/(delta*boxes[:, 3]**2)) # k*h*w 61 | gaussian[gaussian < 0.05] = 0 62 | return gaussian 63 | -------------------------------------------------------------------------------- /temporal_grounding/utils/losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | def _neg_loss(pred, gt): 6 | pos_inds = gt.eq(1).float() 7 | neg_inds = gt.lt(1).float() 8 | neg_weights = torch.pow(1 - gt, 4) 9 | 10 | loss = 0 11 | pos_loss = torch.log(pred) * torch.pow(1 - pred, 2) * pos_inds 12 | neg_loss = torch.log(1 - pred) * torch.pow(pred, 2) * neg_weights * neg_inds 13 | 14 | num_pos = pos_inds.float().sum() 15 | pos_loss = (pos_loss).sum() 16 | neg_loss = (neg_loss).sum() 17 | 18 | if num_pos == 0: 19 | loss = loss - neg_loss 20 | else: 21 | loss = loss - (pos_loss + neg_loss) / num_pos 22 | return loss 23 | 24 | 25 | class FocalLoss(nn.Module): 26 | '''nn.Module warpper for focal loss''' 27 | def __init__(self): 28 | super(FocalLoss, self).__init__() 29 | self.neg_loss = _neg_loss 30 | 31 | def forward(self, out, target): 32 | return self.neg_loss(out, target) 33 | 34 | 35 | class RegL1Loss(nn.Module): 36 | def __init__(self): 37 | super(RegL1Loss, self).__init__() 38 | 39 | def forward(self, output, ind, target): 40 | pred = output.gather(1, ind) 41 | loss = F.l1_loss(pred, target, size_average=False) 42 | return loss 43 | 44 | def generate_weight(target): 45 | 46 | pos = torch.where(target>0, torch.ones_like(target), torch.zeros_like(target)) 47 | neg = torch.where(target==0, torch.ones_like(target), torch.zeros_like(target)) 48 | # ing = ((torch.gt(target, 0) & torch.lt(target, 1))).float() 49 | 50 | num_pos = torch.sum(pos) 51 | num_neg = torch.sum(neg) 52 | num_total = num_pos + num_neg 53 | 54 | alpha = num_neg / num_total 55 | beta = 1.1 * num_pos / num_total 56 | # target pixel = 1 -> weight beta 57 | # target pixel = 0 -> weight 1-beta 58 | # if num_pos == 0: 59 | # weights = alpha * neg 60 | # else: 61 | weights = alpha * pos + beta * neg 62 | 63 | return weights 64 | 65 | 66 | class IoULoss(nn.Module): 67 | def __init__(self): 68 | super(IoULoss, self).__init__() 69 | 70 | def forward(self, box_a, box_b): 71 | inter_max_xy = torch.min(box_a[:, -1], box_b[:, -1]) 72 | inter_min_xy = torch.max(box_a[:, 0], box_b[:, 0]) 73 | inter = torch.clamp((inter_max_xy - inter_min_xy), min=0) 74 | 75 | # calculate union 76 | union_max_xy = torch.max(box_a[:, -1], box_b[:, -1]) 77 | union_min_xy = torch.min(box_a[:, 0], box_b[:, 0]) 78 | union = torch.clamp((union_max_xy - union_min_xy), min=0) 79 | 80 | iou = inter / (union + 1e-6) 81 | 82 | return 1 - iou.mean() 83 | 84 | class TAGLoss(nn.Module): 85 | def __init__(self): 86 | super(TAGLoss, self).__init__() 87 | 88 | def forward(self, net_outs, gts): 89 | 90 | ac_loss = (-gts*torch.log(net_outs+1e-8)).sum(1) / gts.sum(1) 91 | ac_loss = ac_loss.mean() 92 | 93 | return ac_loss -------------------------------------------------------------------------------- /referring_segmentation/model/module/aspp.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import 
torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | 6 | class ASPP_module(nn.Module): 7 | def __init__(self, inplanes, planes, rate, norm_type='GroupNorm'): 8 | super(ASPP_module, self).__init__() 9 | if rate == 1: 10 | kernel_size = 1 11 | padding = 0 12 | else: 13 | kernel_size = 3 14 | padding = rate 15 | if norm_type=='GroupNorm': 16 | self.bn = nn.GroupNorm(8, planes) 17 | else: 18 | self.bn = nn.BatchNorm2d(planes) 19 | self.atrous_convolution = nn.Conv2d(inplanes, planes, kernel_size=kernel_size, stride=1, padding=padding, dilation=rate, bias=False) 20 | self.relu = nn.ReLU() 21 | 22 | def forward(self, x): 23 | x = self.atrous_convolution(x) 24 | x = self.bn(x) 25 | 26 | return self.relu(x) 27 | 28 | 29 | class ASPP(nn.Module): 30 | def __init__(self, inplanes, planes, rates, norm_type='GroupNorm'): 31 | super(ASPP, self).__init__() 32 | 33 | self.aspp1 = ASPP_module(inplanes, planes, rate=rates[0], norm_type=norm_type) 34 | self.aspp2 = ASPP_module(inplanes, planes, rate=rates[1], norm_type=norm_type) 35 | self.aspp3 = ASPP_module(inplanes, planes, rate=rates[2], norm_type=norm_type) 36 | self.aspp4 = ASPP_module(inplanes, planes, rate=rates[3], norm_type=norm_type) 37 | 38 | self.relu = nn.ReLU() 39 | 40 | if norm_type=='GroupNorm': 41 | self.global_avg_pool = nn.Sequential( 42 | nn.AdaptiveAvgPool2d((1, 1)), 43 | nn.Conv2d(inplanes, planes, 1, stride=1, bias=False), 44 | nn.GroupNorm(8, planes), 45 | nn.ReLU() 46 | ) 47 | self.bn1 = nn.GroupNorm(8, planes) 48 | else: 49 | self.global_avg_pool = nn.Sequential( 50 | nn.AdaptiveAvgPool2d((1, 1)), 51 | nn.Conv2d(inplanes, planes, 1, stride=1, bias=False), 52 | nn.BatchNorm2d(planes), 53 | nn.ReLU() 54 | ) 55 | self.bn1 = nn.BatchNorm2d(planes) 56 | 57 | self.conv1 = nn.Conv2d(planes*5, planes, 1, bias=False) 58 | self.__init_weight() 59 | 60 | def __init_weight(self): 61 | for m in self.modules(): 62 | if isinstance(m, nn.Conv2d): 63 | torch.nn.init.kaiming_normal_(m.weight) 64 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 65 | m.weight.data.fill_(1) 66 | m.bias.data.zero_() 67 | 68 | def forward(self, x): 69 | x1 = self.aspp1(x) 70 | x2 = self.aspp2(x) 71 | x3 = self.aspp3(x) 72 | x4 = self.aspp4(x) 73 | x5 = self.global_avg_pool(x) 74 | x5 = F.interpolate(x5, size=x4.size()[2:], mode='bilinear', align_corners=True) 75 | 76 | x = torch.cat((x1, x2, x3, x4, x5), dim=1) 77 | 78 | x = self.conv1(x) 79 | x = self.bn1(x) 80 | x = self.relu(x) 81 | return x -------------------------------------------------------------------------------- /referring_segmentation/utils/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.distributed as dist 3 | import os 4 | import numpy as np 5 | 6 | 7 | def setup_for_distributed(is_master): 8 | """ 9 | This function disables printing when not in master process 10 | """ 11 | import builtins as __builtin__ 12 | builtin_print = __builtin__.print 13 | 14 | def print(*args, **kwargs): 15 | force = kwargs.pop('force', False) 16 | if is_master or force: 17 | builtin_print(*args, **kwargs) 18 | 19 | __builtin__.print = print 20 | 21 | 22 | def init_distributed_mode(): 23 | if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ: 24 | rank = int(os.environ["RANK"]) 25 | world_size = int(os.environ['WORLD_SIZE']) 26 | gpu = int(os.environ['LOCAL_RANK']) 27 | elif 'SLURM_PROCID' in os.environ: 28 | rank = int(os.environ['SLURM_PROCID']) 29 | gpu = rank % torch.cuda.device_count() 30 | else: 31 | print('Not using distributed mode') 
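        # RANK, WORLD_SIZE and LOCAL_RANK are exported by torch.distributed launchers such as
        # torchrun (or python -m torch.distributed.launch --use_env), and SLURM_PROCID is set
        # inside SLURM jobs; when none of them is present the function falls back to
        # single-process execution here. Note that the SLURM branch above does not assign
        # world_size, so only the environment-variable launch path is fully initialized below.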
32 | return 33 | 34 | torch.cuda.set_device(gpu) 35 | dist_backend = 'nccl' 36 | dist_url = 'env://' 37 | print('| distributed init (rank {}): {}'.format( 38 | rank, dist_url), flush=True) 39 | torch.distributed.init_process_group(backend=dist_backend, init_method=dist_url, 40 | world_size=world_size, rank=rank) 41 | torch.distributed.barrier() 42 | setup_for_distributed(rank == 0) 43 | 44 | 45 | 46 | def calculate_IoU(pred, gt): 47 | SMOOTH = 1e-6 48 | IArea = (pred & (gt == 1.0)).astype(float).sum() 49 | OArea = (pred | (gt == 1.0)).astype(float).sum() 50 | IoU = (IArea + SMOOTH) / (OArea + SMOOTH) 51 | return IoU, IArea, OArea 52 | 53 | 54 | def report_result(preds, labels): 55 | print(len(preds)) 56 | MeanIoU, IArea, OArea, Overlap = [], [], [], [] 57 | for i in range(len(preds)): 58 | iou, iarea, oarea = calculate_IoU(preds[i], labels[i]) 59 | MeanIoU.append(iou) 60 | IArea.append(iarea) 61 | OArea.append(oarea) 62 | Overlap.append(iou) 63 | 64 | prec5, prec6, prec7, prec8, prec9 = np.zeros((len(Overlap), 1)), np.zeros((len(Overlap), 1)), np.zeros((len(Overlap), 1)), \ 65 | np.zeros((len(Overlap), 1)), np.zeros((len(Overlap), 1)) 66 | for i in range(len(Overlap)): 67 | if Overlap[i] >= 0.5: 68 | prec5[i] = 1 69 | if Overlap[i] >= 0.6: 70 | prec6[i] = 1 71 | if Overlap[i] >= 0.7: 72 | prec7[i] = 1 73 | if Overlap[i] >= 0.8: 74 | prec8[i] = 1 75 | if Overlap[i] >= 0.9: 76 | prec9[i] = 1 77 | 78 | mAP_thres_list = list(range(50, 95+1, 5)) 79 | mAP = [] 80 | for i in range(len(mAP_thres_list)): 81 | tmp = np.zeros((len(Overlap), 1)) 82 | for j in range(len(Overlap)): 83 | if Overlap[j] >= mAP_thres_list[i] / 100.0: 84 | tmp[j] = 1 85 | mAP.append(tmp.sum() / tmp.shape[0]) 86 | 87 | return np.mean(np.array(MeanIoU)), np.array(IArea).sum() / np.array(OArea).sum(), \ 88 | prec5.sum() / prec5.shape[0], prec6.sum() / prec6.shape[0], prec7.sum() / prec7.shape[0], \ 89 | prec8.sum() / prec8.shape[0], prec9.sum() / prec9.shape[0], np.mean(np.array(mAP)) 90 | -------------------------------------------------------------------------------- /temporal_grounding/utils/scheduler.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 7 | 8 | from .optimizer import FairseqLRScheduler 9 | from . import register_lr_scheduler 10 | import argparse 11 | 12 | @register_lr_scheduler('inverse_sqrt') 13 | class InverseSquareRootSchedule(FairseqLRScheduler): 14 | """Decay the LR based on the inverse square root of the update number. 15 | 16 | We also support a warmup phase where we linearly increase the learning rate 17 | from some initial learning rate (``--warmup-init-lr``) until the configured 18 | learning rate (``--lr``). Thereafter we decay proportional to the number of 19 | updates, with a decay factor set to align with the configured learning rate. 
20 | 21 | During warmup:: 22 | 23 | lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates) 24 | lr = lrs[update_num] 25 | 26 | After warmup:: 27 | 28 | decay_factor = args.lr * sqrt(args.warmup_updates) 29 | lr = decay_factor / sqrt(update_num) 30 | """ 31 | 32 | def __init__(self, args, optimizer): 33 | super().__init__(args, optimizer) 34 | # if len(args.lr) > 1: 35 | # raise ValueError( 36 | # 'Cannot use a fixed learning rate schedule with inverse_sqrt.' 37 | # ' Consider --lr-scheduler=fixed instead.' 38 | # ) 39 | warmup_end_lr = args.lr 40 | if args.warmup_init_lr < 0: 41 | args.warmup_init_lr = warmup_end_lr 42 | 43 | # linearly warmup for the first args.warmup_updates 44 | self.lr_step = (warmup_end_lr - args.warmup_init_lr) / \ 45 | args.warmup_updates 46 | 47 | # then, decay prop. to the inverse square root of the update number 48 | self.decay_factor = warmup_end_lr * args.warmup_updates**0.5 49 | 50 | # initial learning rate 51 | self.lr = args.warmup_init_lr 52 | self.optimizer.set_lr(self.lr) 53 | 54 | @staticmethod 55 | def add_args(parser): 56 | """Add arguments to the parser for this LR scheduler.""" 57 | # fmt: off 58 | parser.add_argument('--warmup-updates', default=300, type=int, metavar='N', 59 | help='warmup the learning rate linearly for the first N updates') 60 | parser.add_argument('--warmup-init-lr', default=1e-06, type=float, metavar='LR', 61 | help='initial learning rate during warmup phase; default is args.lr') 62 | # fmt: on 63 | 64 | def step(self, epoch, val_loss=None): 65 | """Update the learning rate at the end of the given epoch.""" 66 | super().step(epoch, val_loss) 67 | # we don't change the learning rate at epoch boundaries 68 | return self.optimizer.get_lr() 69 | 70 | def step_update(self, num_updates): 71 | """Update the learning rate after each update.""" 72 | if num_updates < self.args.warmup_updates: 73 | self.lr = self.args.warmup_init_lr + num_updates*self.lr_step 74 | else: 75 | self.lr = self.decay_factor * num_updates**-0.5 76 | self.optimizer.set_lr(self.lr) 77 | return self.lr 78 | -------------------------------------------------------------------------------- /temporal_grounding/model/module/attention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from einops import rearrange, repeat 5 | from torch import einsum 6 | import warnings 7 | warnings.filterwarnings("ignore") 8 | 9 | 10 | class GlobalTextPresentation(nn.Module): 11 | def __init__(self, text_dim): 12 | super(GlobalTextPresentation, self).__init__() 13 | self.W_txt = nn.Linear(text_dim, text_dim) 14 | 15 | def forward(self, fea_text, mask=None): 16 | fea_text = fea_text 17 | weight_text = self.W_txt(fea_text) # B*L*C 18 | if mask is not None: 19 | mask = mask.permute(0, 2, 1) 20 | weight_text = weight_text.masked_fill(mask == 0, -1e9) 21 | weight_text = weight_text.softmax(dim=1) 22 | weight_text_global_out = weight_text.mean(dim=2) # B*L 23 | fea_text_global = fea_text * weight_text 24 | fea_text_global = fea_text_global.sum(dim=1, keepdim=True) # B*C*1*1 25 | return fea_text_global, weight_text_global_out 26 | 27 | 28 | class Attention(nn.Module): 29 | def __init__(self, videodim, textdim, attentiondim, groups): 30 | super(Attention, self).__init__() 31 | 32 | self.groups = groups 33 | self.q = nn.Linear(textdim, attentiondim) 34 | self.kv = nn.Linear(videodim, 2*attentiondim) 35 | 36 | def forward(self, videofea, textfea): 37 | videofea = 
videofea.permute(0, 2, 1) # b*t*c 38 | q = self.q(textfea) # b*l*c 39 | kv = self.kv(videofea) 40 | k, v = kv.chunk(2, dim=-1) 41 | q = rearrange(q, 'b l (g d) -> b g l d', g=self.groups) 42 | k = rearrange(k, 'b t (g d) -> b g t d', g=self.groups) 43 | v = rearrange(v, 'b t (g d) -> b g t d', g=self.groups) 44 | A = einsum('bgld,bgtd->bglt', [q, k] 45 | ).mean(dim=2, keepdim=True) # b*g*l*t 46 | 47 | att = torch.sigmoid(A) 48 | out = v.permute(0, 1, 3, 2) * att # b*g*d*t 49 | out = rearrange(out, 'b g d t -> b (g d) t') 50 | return A.mean(dim=[1, 2]), out 51 | 52 | 53 | class MutanFusion(nn.Module): 54 | def __init__(self, input_dim, out_dim, num_layers): 55 | super(MutanFusion, self).__init__() 56 | self.input_dim = input_dim 57 | self.out_dim = out_dim 58 | self.num_layers = num_layers 59 | 60 | hv = [] 61 | for i in range(self.num_layers): 62 | do = nn.Dropout(p=0.5) 63 | lin = nn.Linear(input_dim, out_dim) 64 | 65 | hv.append(nn.Sequential(do, lin, nn.Tanh())) 66 | # 67 | self.image_transformation_layers = nn.ModuleList(hv) 68 | # 69 | hq = [] 70 | for i in range(self.num_layers): 71 | do = nn.Dropout(p=0.5) 72 | lin = nn.Linear(input_dim, out_dim) 73 | hq.append(nn.Sequential(do, lin, nn.Tanh())) 74 | # 75 | self.ques_transformation_layers = nn.ModuleList(hq) 76 | 77 | def forward(self, ques_emb, img_emb): 78 | # Pdb().set_trace() 79 | batch_size = img_emb.size()[0] 80 | x_mm = [] 81 | for i in range(self.num_layers): 82 | x_hv = img_emb 83 | x_hv = self.image_transformation_layers[i](x_hv) 84 | 85 | x_hq = ques_emb 86 | x_hq = self.ques_transformation_layers[i](x_hq) 87 | x_mm.append(torch.mul(x_hq, x_hv)) 88 | 89 | x_mm = torch.stack(x_mm, dim=1) 90 | x_mm = x_mm.sum(1).view(batch_size, img_emb.shape[1], self.out_dim) 91 | x_mm = F.tanh(x_mm) 92 | return x_mm 93 | -------------------------------------------------------------------------------- /temporal_grounding/model/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from .module.attention import GlobalTextPresentation 5 | from .module.RefTransformer import RefTransformer 6 | from .decoder import AnchorBasedDecoder, RegressionDecoder 7 | 8 | 9 | class TextEncoder(nn.Module): 10 | def __init__(self, config): 11 | self.config = config 12 | super(TextEncoder, self).__init__() 13 | if config['gru_bidirection']: 14 | self.backbone = nn.GRU( 15 | config['embedding_dim'], config['attention_dim'], batch_first=True, bidirectional=True) 16 | else: 17 | self.backbone = nn.GRU( 18 | config['embedding_dim'], config['attention_dim'], batch_first=True, bidirectional=False) 19 | self.dropout0 = nn.Dropout(config['dropout']) 20 | self.dropout = nn.Dropout(config['dropout']) 21 | 22 | def forward(self, text, embedding_length): 23 | text = torch.nn.utils.rnn.pack_padded_sequence( 24 | text, list(embedding_length), True, enforce_sorted=False) 25 | word_embedding, _ = self.backbone(text) 26 | word_embedding, _ = torch.nn.utils.rnn.pad_packed_sequence( 27 | word_embedding, True) 28 | if self.config['gru_bidirection']: 29 | word_embedding = word_embedding.view( 30 | word_embedding.shape[0], word_embedding.shape[1], 2, -1) 31 | word_embedding = torch.mean(word_embedding, dim=2) 32 | return word_embedding 33 | 34 | 35 | class Model(nn.Module): 36 | def __init__(self, config): 37 | 38 | super(Model, self).__init__() 39 | self.config = config 40 | self.text_encoder = TextEncoder(config) 41 | self.video_encoder = nn.Linear( 42 | 
config['video_fea_dim'], config['attention_dim']) 43 | self.global_text = GlobalTextPresentation(config['attention_dim']) 44 | self.pos_embedding = nn.Parameter(torch.randn( 45 | 1, config['attention_dim'], config['segment_num'])) 46 | self.prenorm = nn.LayerNorm(config['attention_dim']) 47 | 48 | self.TCN = RefTransformer(config) 49 | if config['decoder_type'] == 'anchor_based': 50 | self.decoder = AnchorBasedDecoder(config) 51 | elif config['decoder_type'] == 'regression': 52 | self.decoder = RegressionDecoder(config) 53 | 54 | def forward(self, video_fea, embedding, embedding_length, score=None, gt_reg=None, score_mask=None, score_nm=None, proposals=None, adj_mat=None, mode='train'): 55 | text_feal = self.text_encoder( 56 | embedding.float(), embedding_length) # b*l*c 57 | embedding_mask = torch.zeros((text_feal.shape[0], 1, text_feal.shape[1])).to( 58 | text_feal.device) 59 | 60 | for b in range(embedding_mask.shape[0]): 61 | embedding_mask[b, :, :int(embedding_length[b])] = 1 62 | text_feag, text_weight = self.global_text( 63 | text_feal, embedding_mask) # b*1*d 64 | 65 | video_fea = self.video_encoder(video_fea.float()) # b*c*t 66 | if self.config['with_text']: 67 | video_fea = video_fea + text_feag 68 | 69 | video_fea = video_fea.permute(0, 2, 1) 70 | out_fea, weights = self.TCN( 71 | video_fea, text_feag, self.pos_embedding, embedding_mask) # b*c*t 72 | if self.config['decoder_type'] == 'anchor_based': 73 | return self.decoder(out_fea, weights, score, gt_reg, score_mask, score_nm, proposals, adj_mat, mode) 74 | elif self.config['decoder_type'] == 'regression': 75 | return self.decoder(out_fea, weights, gt_reg, score_nm, mode) 76 | 77 | 78 | if __name__ == '__main__': 79 | import json 80 | feav = torch.randn((4, 1024, 100)) 81 | feat = torch.randn((4, 20, 300)) 82 | with open('../json/config.json') as f: 83 | config = json.load(f)['model_config'] 84 | model = Model(config) 85 | fea, weight = model(feav, feat, torch.tensor([15, 13, 12, 10])) 86 | print(fea.shape) 87 | print(len(weight)) 88 | -------------------------------------------------------------------------------- /temporal_grounding/model/module/RefTransformer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from .attention import Attention 5 | 6 | 7 | class RefTransformer(nn.Module): 8 | def __init__(self, config): 9 | super(RefTransformer, self).__init__() 10 | self.config = config 11 | self.padding_type = config['padding_type'] 12 | self.conv = nn.ModuleList() 13 | self.dilations = [] 14 | self.attention = nn.ModuleList() 15 | self.weight_conv = nn.ModuleList() 16 | self.prenorms = nn.ModuleList() 17 | for i in range(config['layer_num']): 18 | dilation = torch.pow(torch.tensor(2), i) 19 | dilation = int(dilation) 20 | # dilation = i+1 21 | self.prenorms.append(nn.LayerNorm(config['attention_dim'])) 22 | self.dilations.append(dilation) 23 | if config['with_attention']: 24 | self.attention.append(Attention( 25 | config['attention_dim'], config['attention_dim'], config['attention_dim'], groups=config['groups'])) 26 | 27 | if config['with_mlp']: 28 | self.conv.append( 29 | nn.Sequential( 30 | nn.Conv1d(config['attention_dim'], config['MLP_dim'], 31 | 3, 1, dilation=dilation, padding=0, bias=False), 32 | nn.GroupNorm(4, config['MLP_dim']), 33 | nn.Dropout(config['dropout']), 34 | nn.ReLU(), 35 | nn.Conv1d(config['MLP_dim'], 36 | config['attention_dim'], 1, 1, bias=False), 37 | nn.GroupNorm(4, 
config['attention_dim']) 38 | ) 39 | ) 40 | 41 | # self.__init_weight() 42 | 43 | def __init_weight(self): 44 | for m in self.modules(): 45 | if isinstance(m, nn.Conv1d): 46 | torch.nn.init.kaiming_normal_(m.weight) 47 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 48 | m.weight.data.fill_(1) 49 | m.bias.data.zero_() 50 | 51 | def forward(self, fea, text_fea, position, mask=None): 52 | fea = fea + position 53 | weights = [] 54 | for i in range(len(self.attention)): 55 | if self.config['prenorm']: 56 | fea = self.prenorms[i](fea.permute(0, 2, 1)).permute(0, 2, 1) 57 | res0 = fea 58 | if self.config['with_attention']: 59 | weight, fea = self.attention[i](fea, text_fea) 60 | fea = fea + res0 61 | res1 = fea 62 | weights.append(weight) 63 | if self.config['with_mlp']: 64 | if self.padding_type == 'circle': 65 | fea = circle_padding(self.dilations[i], fea) 66 | elif self.padding_type == 'zero': 67 | fea = F.pad( 68 | fea, (self.dilations[i], self.dilations[i]), mode='constant', value=0) 69 | else: 70 | fea = F.pad( 71 | fea, (self.dilations[i], self.dilations[i]), mode='replicate') 72 | fea = self.conv[i](fea) 73 | fea = res1 + fea 74 | return fea, weights 75 | 76 | 77 | def circle_padding(padding, feature): 78 | length_times = feature.shape[-1] 79 | index = list(range(0, length_times)) + list(range(length_times - 2, 0, -1)) 80 | total_num = 2 * padding + length_times 81 | num_c = padding // len(index) 82 | if num_c * len(index) < padding: 83 | num_c = num_c + 1 84 | expand_number = num_c * len(index) - padding 85 | index_f = [] 86 | for n in range(num_c): 87 | index = index + index + index 88 | for i in range(expand_number, expand_number + total_num): 89 | index_f.append(index[i]) 90 | 91 | feas = [] 92 | for idf in index_f: 93 | feas.append(feature[:, :, idf]) 94 | feas = torch.stack(feas, dim=2) 95 | return feas 96 | -------------------------------------------------------------------------------- /temporal_grounding/utils/tester.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import os 3 | import time 4 | import datetime 5 | from torch.utils import data 6 | import torch.nn.functional as F 7 | from tqdm import tqdm 8 | import numpy as np 9 | from utils.utils import CountMeter, compute_IoU_recall 10 | import collections 11 | 12 | 13 | class Tester(object): 14 | def __init__(self, config): 15 | self.config = config 16 | self.checkpoint = config['checkpoint'] 17 | assert os.path.exists(self.checkpoint), 'incorrect checkpoint path!' 
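    # Usage sketch (illustrative): config is the flat JSON loaded by temporal_grounding/main.py,
    # and config['checkpoint'] comes either from that JSON or from the --checkpoint flag, e.g.
    #   python main.py --json_file=json/config_Charades-STA_I3D_anchor.json --mode=test --checkpoint=<path/to/checkpoint.pth>
    # The checkpoint file is expected to hold a dict with a 'state_dict' entry, since test() below
    # restores weights via self.model.module.load_state_dict(checkpoint['state_dict']) and therefore
    # assumes the model was wrapped in nn.DataParallel (the cuda=True path in main.py).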
18 | 19 | def model_info(self): 20 | print(self.model) 21 | num_params = 0 22 | for p in self.model.parameters(): 23 | num_params += p.numel() # numel() returns the number of elements in a tensor 24 | print("The total number of parameters: {}".format(num_params)) 25 | 26 | def test(self, model, dataset): 27 | self.model = model 28 | self.model.eval() 29 | self.model_info() 30 | print('loading checkpoint ....') 31 | checkpoint = torch.load(self.checkpoint) 32 | self.model.module.load_state_dict(checkpoint['state_dict']) 33 | loader = data.DataLoader( 34 | dataset, self.config['batch_size'], False, num_workers=8) 35 | meters_5 = collections.defaultdict(lambda: CountMeter()) 36 | recall_metrics = (1, 5) 37 | iou_metrics = (0.1, 0.3, 0.5, 0.7) 38 | table = [['Rank@{},mIoU@{}'.format(i, j) 39 | for i in recall_metrics for j in iou_metrics]] 40 | 41 | for i, data_batch in tqdm(enumerate(loader), total=len(loader)): 42 | fea, embedding, score, score_mask, embedding_length, label, proposals, score_nm, adj_mat = \ 43 | data_batch['feat'], data_batch['embedding'], data_batch['score'], data_batch['score_mask'], \ 44 | data_batch['embedding_length'], data_batch['label'], data_batch['proposals'], data_batch['score_nm'], \ 45 | data_batch['adj_mat'] 46 | if self.config['cuda']: 47 | fea, embedding, score, score_mask, label, proposals, score_nm, adj_mat = \ 48 | fea.cuda(), embedding.cuda(), score.cuda(), score_mask.cuda(), label.cuda(), proposals.cuda(), \ 49 | score_nm.cuda(), adj_mat.cuda() 50 | 51 | with torch.no_grad(): 52 | predict_boxes, score = self.model(fea, embedding, embedding_length, score, 53 | label, score_mask, score_nm, proposals, adj_mat, 'test') 54 | predict_boxes_old = np.round( 55 | predict_boxes.cpu().numpy()).astype(np.int32) 56 | for k in range(predict_boxes.shape[0]): 57 | gt_boxes = label[k] 58 | predict_boxes = predict_boxes_old[k] 59 | predict_flatten = score[k] 60 | gt_starts, gt_ends = gt_boxes[0], gt_boxes[1] 61 | predict_starts, predict_ends = predict_boxes[:, 62 | 0], predict_boxes[:, 1] 63 | predict_starts[predict_starts < 0] = 0 64 | seq_len = self.config['segment_num'] 65 | predict_ends[predict_ends >= seq_len] = seq_len - 1 66 | predict_flatten = predict_flatten.cpu().numpy() 67 | predict_boxes[:, 0], predict_boxes[:, 68 | 1] = predict_starts, predict_ends 69 | 70 | topn_IoU_matric = compute_IoU_recall( 71 | predict_flatten, predict_boxes, gt_boxes) 72 | meters_5['mIoU'].update(topn_IoU_matric, 1) 73 | 74 | IoU_threshs = [0.1, 0.3, 0.5, 0.7] 75 | top_n_list = [1, 5] 76 | topn_IoU_matric, count = meters_5['mIoU'].val, meters_5['mIoU'].count 77 | for i in range(2): 78 | for j in range(4): 79 | print('{}, {:.4f}'.format('IoU@' + str(top_n_list[i]) + '@' + str(IoU_threshs[j]), 80 | topn_IoU_matric[i, j] / count), end=' | ') 81 | -------------------------------------------------------------------------------- /referring_segmentation/utils/video_reader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import h5py 3 | import numpy as np 4 | 5 | 6 | def clip_annotation_reader(images_path, annotations_path, instances, clip_size=7, annotation_center=False, dataset='A2D'): 7 | 8 | datas = [] 9 | frames = os.listdir(images_path) 10 | frames.sort() 11 | annotations = os.listdir(annotations_path) 12 | annotations.sort() 13 | for annotation in annotations: 14 | name = annotation.split('.')[0] 15 | name_int = int(name) 16 | with h5py.File(os.path.join(annotations_path, annotation)) as label: 17 | instances_anno = list(label['instance'][:]) 18 | for instance in
instances.keys(): 19 | # if instance < int(instance_num): 20 | if int(instance) in instances_anno: 21 | step = 1 22 | if not annotation_center: 23 | range_frames = step * np.random.randint(- (clip_size - 1), 1) 24 | else: 25 | range_frames = - step * (clip_size // 2) 26 | 27 | initial_frame = name_int + range_frames 28 | data = {} 29 | data['dataset'] = dataset 30 | data['video'] = images_path.split('/')[-1] 31 | data['frames'] = [] 32 | data['label'] = [] 33 | data['instance'] = instance 34 | data['sentence'] = instances[instance] 35 | annotation_num = 0 36 | for i in range(0, clip_size * step, step): 37 | n_frame = initial_frame + i - 1 38 | if n_frame < 0: 39 | n_frame = 0 40 | elif n_frame >= len(frames): 41 | n_frame = len(frames) - 1 42 | data['frames'].append(os.path.join(images_path, frames[n_frame])) 43 | 44 | is_anno = 0 45 | for anno in annotations: 46 | if frames[n_frame].split('.')[0] == anno.split('.')[0]: 47 | data['label'].append(os.path.join(annotations_path, anno)) 48 | annotation_num += 1 49 | is_anno += 1 50 | 51 | if is_anno == 0: 52 | data['label'].append('None') 53 | if annotation_num > 0: 54 | datas.append(data) 55 | return datas 56 | 57 | 58 | def sequence_reader(images_path, annotations_path, instances, dataset='A2D'): 59 | 60 | datas = [] 61 | frames = [f for f in os.listdir(images_path) if '.png' in f] 62 | frames.sort() 63 | annotations = os.listdir(annotations_path) 64 | annotations.sort() 65 | for instance in instances.keys(): 66 | data = {} 67 | data['dataset'] = dataset 68 | data['video'] = images_path.split('/')[-1] 69 | data['frames'] = [] 70 | data['label'] = [] 71 | data['sentence'] = instances[instance] 72 | data['instance'] = instance 73 | for frame in frames: 74 | data['frames'].append(os.path.join(images_path, frame)) 75 | name = frame.split('.')[0] 76 | is_annotated = 0 77 | if dataset == 'A2D': 78 | for annotation in annotations: 79 | if annotation.split('.')[0] == name: 80 | is_annotated += 1 81 | with h5py.File(os.path.join(annotations_path, annotation)) as label: 82 | instances_anno = list(label['instance'][:]) 83 | if int(instance) in instances_anno: 84 | data['label'].append(os.path.join(annotations_path, annotation)) 85 | else: 86 | data['label'].append('None') 87 | if is_annotated == 0: 88 | data['label'].append('None') 89 | elif dataset == 'JHMDB': 90 | data['label'].append(os.path.join(annotations_path, annotations[0])) 91 | 92 | datas.append(data) 93 | return datas 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/optim.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 2 | """Collections of utilities related to optimization.""" 3 | from bisect import bisect_right 4 | 5 | import torch 6 | 7 | 8 | def update_ema(model, model_ema, decay): 9 | """Apply exponential moving average update. 
10 | 11 | The weights are updated in-place as follow: 12 | w_ema = w_ema * decay + (1 - decay) * w 13 | Args: 14 | model: active model that is being optimized 15 | model_ema: running average model 16 | decay: exponential decay parameter 17 | """ 18 | with torch.no_grad(): 19 | if hasattr(model, "module"): 20 | # unwrapping DDP 21 | model = model.module 22 | msd = model.state_dict() 23 | for k, ema_v in model_ema.state_dict().items(): 24 | model_v = msd[k].detach() 25 | ema_v.copy_(ema_v * decay + (1.0 - decay) * model_v) 26 | 27 | 28 | def adjust_learning_rate( 29 | optimizer, 30 | epoch: int, 31 | curr_step: int, 32 | num_training_steps: int, 33 | args, 34 | ): 35 | """Adjust the lr according to the schedule. 36 | 37 | Args: 38 | Optimizer: torch optimizer to update. 39 | epoch(int): number of the current epoch. 40 | curr_step(int): number of optimization step taken so far. 41 | num_training_step(int): total number of optimization steps. 42 | args: additional training dependent args: 43 | - lr_drop(int): number of epochs before dropping the learning rate. 44 | - fraction_warmup_steps(float) fraction of steps over which the lr will be increased to its peak. 45 | - lr(float): base learning rate 46 | - lr_backbone(float): learning rate of the backbone 47 | - text_encoder_backbone(float): learning rate of the text encoder 48 | - schedule(str): the requested learning rate schedule: 49 | "step": all lrs divided by 10 after lr_drop epochs 50 | "multistep": divided by 2 after lr_drop epochs, then by 2 after every 50 epochs 51 | "linear_with_warmup": same as "step" for backbone + transformer, but for the text encoder, linearly 52 | increase for a fraction of the training, then linearly decrease back to 0. 53 | "all_linear_with_warmup": same as "linear_with_warmup" for all learning rates involved. 
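            A worked example may help (the numbers are illustrative, not defaults): with
            schedule="step", lr_drop=10, lr=1e-4 and lr_backbone=1e-5, epochs 0-9 keep
            gamma = 0.1 ** 0 = 1.0, while epochs 10-19 use gamma = 0.1 ** 1 = 0.1, i.e. the
            transformer lr drops to 1e-5 and the backbone lr to 1e-6. For the "*_with_warmup"
            schedules, num_warmup_steps = round(fraction_warmup_steps * num_training_steps);
            with 1000 training steps and fraction_warmup_steps=0.1 the affected learning rates
            ramp up linearly over the first 100 steps, then decay linearly to 0 at step 1000.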
54 | 55 | """ 56 | num_warmup_steps: int = round(args.fraction_warmup_steps * num_training_steps) 57 | if args.schedule == "step": 58 | gamma = 0.1 ** (epoch // args.lr_drop) 59 | text_encoder_gamma = gamma 60 | elif args.schedule == "multistep": 61 | milestones = list(range(args.lr_drop, args.epochs, 50)) 62 | gamma = 0.5 ** bisect_right(milestones, epoch) 63 | text_encoder_gamma = gamma 64 | elif args.schedule == "linear_with_warmup": 65 | gamma = 0.1 ** (epoch // args.lr_drop) 66 | if curr_step < num_warmup_steps: 67 | text_encoder_gamma = float(curr_step) / float(max(1, num_warmup_steps)) 68 | else: 69 | text_encoder_gamma = max( 70 | 0.0, 71 | float(num_training_steps - curr_step) 72 | / float(max(1, num_training_steps - num_warmup_steps)), 73 | ) 74 | elif args.schedule == "all_linear_with_warmup": 75 | if curr_step < num_warmup_steps: 76 | text_encoder_gamma = float(curr_step) / float(max(1, num_warmup_steps)) 77 | else: 78 | text_encoder_gamma = max( 79 | 0.0, 80 | float(num_training_steps - curr_step) 81 | / float(max(1, num_training_steps - num_warmup_steps)), 82 | ) 83 | gamma = text_encoder_gamma 84 | else: 85 | raise NotImplementedError 86 | 87 | base_lrs = [args.lr, args.lr_backbone, args.text_encoder_lr] 88 | gammas = [gamma, gamma, text_encoder_gamma] 89 | assert len(optimizer.param_groups) == len(base_lrs) 90 | for param_group, lr, gamma_group in zip(optimizer.param_groups, base_lrs, gammas): 91 | param_group["lr"] = lr * gamma_group 92 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/RefTransformer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from models.module.attention import RelevanceFilter 5 | from models.utils import temporal_separate_to_stack, temporal_stacked_to_separate 6 | 7 | 8 | class RefTransformer(nn.Module): 9 | def __init__(self, text_dim, inchannel, hidden_channel, outchannel, layers=8, padding_type='circle', groups=8, dropout=0.1): 10 | super(RefTransformer, self).__init__() 11 | self.padding_type = padding_type 12 | self.conv_time = nn.ModuleList() 13 | self.conv_spatial = nn.ModuleList() 14 | self.conv_convert = nn.ModuleList() 15 | self.dropout1 = nn.ModuleList() 16 | self.dropout2 = nn.ModuleList() 17 | self.dilations = [] 18 | self.local_attention = nn.ModuleList() 19 | for i in range(layers): 20 | dilation = torch.pow(torch.tensor(2), i) 21 | dilation = int(dilation) 22 | self.dilations.append(dilation) 23 | self.local_attention.append(RelevanceFilter( 24 | text_dim, inchannel, inchannel, groups, (1, 1, 1), (1, 1, 1), phase='3D')) 25 | 26 | self.conv_spatial.append( 27 | nn.Sequential( 28 | nn.Conv3d(inchannel, hidden_channel, (1, 3, 3), 29 | 1, (0, 1, 1), (1, 1, 1), bias=False), 30 | nn.GroupNorm(4, hidden_channel), 31 | nn.ReLU(inplace=True) 32 | ) 33 | ) 34 | 35 | self.conv_time.append( 36 | nn.Sequential( 37 | nn.Conv3d(hidden_channel, hidden_channel, (3, 1, 1), 38 | (1, 1, 1), (0, 0, 0), (dilation, 1, 1), bias=False), 39 | nn.GroupNorm(4, hidden_channel), 40 | nn.ReLU(inplace=True) 41 | ) 42 | ) 43 | 44 | self.conv_convert.append( 45 | nn.Sequential( 46 | nn.Conv3d(hidden_channel, outchannel, 1, 1, bias=False), 47 | nn.GroupNorm(4, outchannel) 48 | ) 49 | ) 50 | self.dropout1.append(nn.Dropout(dropout)) 51 | self.dropout2.append(nn.Dropout(dropout)) 52 | self.__init_weight() 53 | 54 | def __init_weight(self): 55 | for m in self.modules(): 56 | if 
isinstance(m, nn.Conv3d) or isinstance(m, nn.Conv2d): 57 | torch.nn.init.kaiming_normal_(m.weight) 58 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 59 | m.weight.data.fill_(1) 60 | m.bias.data.zero_() 61 | 62 | def forward(self, fea, fea_text, frame_mask, durations): 63 | maps_layers = [] 64 | for i in range(len(self.conv_time)): 65 | res0 = fea 66 | maps, fea = self.local_attention[i](fea, fea_text, frame_mask) 67 | maps = temporal_separate_to_stack(maps.transpose(1, 2), durations) 68 | maps_layers.append(maps) 69 | fea = res0 + self.dropout1[i](fea) # [current] 70 | res1 = fea 71 | fea = self.conv_spatial[i](fea) 72 | 73 | if self.padding_type == 'circle': 74 | fea = circle_padding(self.dilations[i], fea) 75 | elif self.padding_type == 'zero': 76 | fea = F.pad( 77 | fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='constant', value=0) 78 | else: 79 | fea = F.pad( 80 | fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='circular') 81 | 82 | fea = self.conv_time[i](fea) # B*C*T 83 | 84 | fea = self.conv_convert[i](fea) 85 | fea = res1 + self.dropout2[i](fea) 86 | return fea, maps_layers 87 | 88 | 89 | def circle_padding(padding, feature): 90 | length_times = feature.shape[2] 91 | index = list(range(0, length_times)) + list(range(length_times - 2, 0, -1)) 92 | total_num = 2 * padding + length_times 93 | num_c = padding // len(index) 94 | if num_c * len(index) < padding: 95 | num_c = num_c + 1 96 | expand_number = num_c * len(index) - padding 97 | index_f = [] 98 | for n in range(num_c): 99 | index = index + index + index 100 | for i in range(expand_number, expand_number + total_num): 101 | index_f.append(index[i]) 102 | 103 | feas = [] 104 | for idf in index_f: 105 | feas.append(feature[:, :, idf, :, :].unsqueeze(2)) 106 | feas = torch.cat(feas, dim=2) 107 | return feas 108 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/box_ops.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 2 | # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved 3 | """ 4 | Utilities for bounding box manipulation and GIoU. 5 | """ 6 | import torch 7 | import numpy as np 8 | from torchvision.ops.boxes import box_area 9 | from typing import Tuple 10 | 11 | #### Bounding box utilities imported from torchvision and converted to numpy 12 | def np_box_area(boxes: np.array) -> np.array: 13 | """ 14 | Computes the area of a set of bounding boxes, which are specified by its 15 | (x1, y1, x2, y2) coordinates. 16 | 17 | Args: 18 | boxes (Tensor[N, 4]): boxes for which the area will be computed. They 19 | are expected to be in (x1, y1, x2, y2) format with 20 | ``0 <= x1 < x2`` and ``0 <= y1 < y2``. 
21 | 22 | Returns: 23 | area (Tensor[N]): area for each box 24 | """ 25 | assert boxes.ndim == 2 and boxes.shape[-1] == 4 26 | return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) 27 | 28 | 29 | # implementation from https://github.com/kuangliu/torchcv/blob/master/torchcv/utils/box.py 30 | # with slight modifications 31 | def _box_inter_union(boxes1: np.array, boxes2: np.array) -> Tuple[np.array, np.array]: 32 | area1 = np_box_area(boxes1) 33 | area2 = np_box_area(boxes2) 34 | 35 | lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] 36 | rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] 37 | 38 | wh = (rb - lt).clip(min=0) # [N,M,2] 39 | inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] 40 | 41 | union = area1[:, None] + area2 - inter 42 | 43 | return inter, union 44 | 45 | 46 | def np_box_iou(boxes1: np.array, boxes2: np.array) -> np.array: 47 | """ 48 | Return intersection-over-union (Jaccard index) of boxes. 49 | 50 | Both sets of boxes are expected to be in ``(x1, y1, x2, y2)`` format with 51 | ``0 <= x1 < x2`` and ``0 <= y1 < y2``. 52 | 53 | Args: 54 | boxes1 (Tensor[N, 4]) 55 | boxes2 (Tensor[M, 4]) 56 | 57 | Returns: 58 | iou (Tensor[N, M]): the NxM matrix containing the pairwise IoU values for every element in boxes1 and boxes2 59 | """ 60 | inter, union = _box_inter_union(boxes1, boxes2) 61 | iou = inter / union 62 | return iou 63 | 64 | 65 | def box_cxcywh_to_xyxy(x): 66 | x_c, y_c, w, h = x.unbind(-1) 67 | b = [(x_c - 0.5 * w), (y_c - 0.5 * h), (x_c + 0.5 * w), (y_c + 0.5 * h)] 68 | return torch.stack(b, dim=-1) 69 | 70 | 71 | def box_xyxy_to_cxcywh(x): 72 | x0, y0, x1, y1 = x.unbind(-1) 73 | b = [(x0 + x1) / 2, (y0 + y1) / 2, (x1 - x0), (y1 - y0)] 74 | return torch.stack(b, dim=-1) 75 | 76 | 77 | # modified from torchvision to also return the union 78 | def box_iou(boxes1, boxes2): 79 | area1 = box_area(boxes1) 80 | area2 = box_area(boxes2) 81 | 82 | lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] 83 | rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] 84 | 85 | wh = (rb - lt).clamp(min=0) # [N,M,2] 86 | inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] 87 | 88 | union = area1[:, None] + area2 - inter 89 | 90 | iou = inter / union 91 | return iou, union 92 | 93 | 94 | def generalized_box_iou(boxes1, boxes2): 95 | """ 96 | Generalized IoU from https://giou.stanford.edu/ 97 | 98 | The boxes should be in [x0, y0, x1, y1] format 99 | 100 | Returns a [N, M] pairwise matrix, where N = len(boxes1) 101 | and M = len(boxes2) 102 | """ 103 | # degenerate boxes gives inf / nan results 104 | # so do an early check 105 | assert (boxes1[:, 2:] >= boxes1[:, :2]).all() 106 | assert (boxes2[:, 2:] >= boxes2[:, :2]).all() 107 | iou, union = box_iou(boxes1, boxes2) 108 | 109 | lt = torch.min(boxes1[:, None, :2], boxes2[:, :2]) 110 | rb = torch.max(boxes1[:, None, 2:], boxes2[:, 2:]) 111 | 112 | wh = (rb - lt).clamp(min=0) # [N,M,2] 113 | area = wh[:, :, 0] * wh[:, :, 1] 114 | 115 | return iou - (area - union) / area 116 | 117 | 118 | def masks_to_boxes(masks): 119 | """Compute the bounding boxes around the provided masks 120 | 121 | The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions. 
122 | 123 | Returns a [N, 4] tensors, with the boxes in xyxy format 124 | """ 125 | if masks.numel() == 0: 126 | return torch.zeros((0, 4), device=masks.device) 127 | 128 | h, w = masks.shape[-2:] 129 | 130 | y = torch.arange(0, h, dtype=torch.float) 131 | x = torch.arange(0, w, dtype=torch.float) 132 | y, x = torch.meshgrid(y, x) 133 | 134 | x_mask = masks * x.unsqueeze(0) 135 | x_max = x_mask.flatten(1).max(-1)[0] 136 | x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0] 137 | 138 | y_mask = masks * y.unsqueeze(0) 139 | y_max = y_mask.flatten(1).max(-1)[0] 140 | y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0] 141 | 142 | return torch.stack([x_min, y_min, x_max, y_max], 1) 143 | -------------------------------------------------------------------------------- /temporal_grounding/utils/optimizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 7 | 8 | import torch 9 | import math 10 | 11 | class FairseqOptimizer(object): 12 | 13 | def __init__(self, args, params): 14 | super().__init__() 15 | self.args = args 16 | self.params = list(params) 17 | 18 | @staticmethod 19 | def add_args(parser): 20 | """Add optimizer-specific arguments to the parser.""" 21 | pass 22 | 23 | @property 24 | def optimizer(self): 25 | """Return a torch.optimizer.optimizer.Optimizer instance.""" 26 | if not hasattr(self, '_optimizer'): 27 | raise NotImplementedError 28 | if not isinstance(self._optimizer, torch.optim.Optimizer): 29 | raise ValueError('_optimizer must be an instance of torch.optimizer.Optimizer') 30 | return self._optimizer 31 | 32 | @property 33 | def optimizer_config(self): 34 | """ 35 | Return a kwarg dictionary that will be used to override optimizer 36 | args stored in checkpoints. This allows us to load a checkpoint and 37 | resume training using a different set of optimizer args, e.g., with a 38 | different learning rate. 39 | """ 40 | raise NotImplementedError 41 | 42 | def get_lr(self): 43 | """Return the current learning rate.""" 44 | return self.optimizer.param_groups[0]['lr'] 45 | 46 | def set_lr(self, lr): 47 | """Set the learning rate.""" 48 | for param_group in self.optimizer.param_groups: 49 | param_group['lr'] = lr 50 | 51 | def state_dict(self): 52 | """Return the optimizer's state dict.""" 53 | return self.optimizer.state_dict() 54 | 55 | def load_state_dict(self, state_dict, optimizer_overrides=None): 56 | """Load an optimizer state dict. 57 | 58 | In general we should prefer the configuration of the existing optimizer 59 | instance (e.g., learning rate) over that found in the state_dict. This 60 | allows us to resume training from a checkpoint using a new set of 61 | optimizer args. 62 | """ 63 | self.optimizer.load_state_dict(state_dict) 64 | 65 | if optimizer_overrides is not None and len(optimizer_overrides) > 0: 66 | # override learning rate, momentum, etc. with latest values 67 | for group in self.optimizer.param_groups: 68 | group.update(optimizer_overrides) 69 | 70 | def backward(self, loss): 71 | """Computes the sum of gradients of the given tensor w.r.t. 
graph leaves.""" 72 | loss.backward() 73 | 74 | def multiply_grads(self, c): 75 | """Multiplies grads by a constant *c*.""" 76 | for p in self.params: 77 | if p.grad is not None: 78 | p.grad.data.mul_(c) 79 | 80 | def clip_grad_norm(self, max_norm): 81 | """Clips gradient norm.""" 82 | if max_norm > 0: 83 | return torch.nn.utils.clip_grad_norm_(self.params, max_norm) 84 | else: 85 | return math.sqrt(sum(p.grad.data.norm()**2 for p in self.params if p.grad is not None)) 86 | 87 | def step(self, closure=None): 88 | """Performs a single optimization step.""" 89 | self.optimizer.step(closure) 90 | 91 | def zero_grad(self): 92 | """Clears the gradients of all optimized parameters.""" 93 | for group in self.optimizer.param_groups: 94 | for p in group['params']: 95 | p.grad = None 96 | self.optimizer.zero_grad() 97 | 98 | 99 | class FairseqLRScheduler(object): 100 | 101 | def __init__(self, args, optimizer): 102 | super().__init__() 103 | if not isinstance(optimizer, FairseqOptimizer): 104 | raise ValueError('optimizer must be an instance of FairseqOptimizer') 105 | self.args = args 106 | self.optimizer = optimizer 107 | self.best = None 108 | 109 | @staticmethod 110 | def add_args(parser): 111 | """Add arguments to the parser for this LR scheduler.""" 112 | pass 113 | 114 | def state_dict(self): 115 | """Return the LR scheduler state dict.""" 116 | return {'best': self.best} 117 | 118 | def load_state_dict(self, state_dict): 119 | """Load an LR scheduler state dict.""" 120 | self.best = state_dict['best'] 121 | 122 | def step(self, epoch, val_loss=None): 123 | """Update the learning rate at the end of the given epoch.""" 124 | if val_loss is not None: 125 | if self.best is None: 126 | self.best = val_loss 127 | else: 128 | self.best = min(self.best, val_loss) 129 | 130 | def step_update(self, num_updates): 131 | """Update the learning rate after each update.""" 132 | return self.optimizer.get_lr() 133 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/utils.py: -------------------------------------------------------------------------------- 1 | from torch import Tensor, tensor 2 | from typing import List 3 | import torch 4 | import torch.nn.functional as F 5 | 6 | 7 | def temporal_stacked_to_separate(feature: Tensor, durations: List) -> Tensor: 8 | """ 9 | Args: 10 | feature: [\sigma t_i *] 11 | durations: [t_1, t_2, ..., t_b] 12 | Return: 13 | out_feature: [b, t, *] 14 | """ 15 | shape_len = len(feature.shape) - 1 16 | max_seq_len = max(durations) 17 | padding_value = [] 18 | for i in range(shape_len): 19 | padding_value += [0, 0] 20 | feature_splits = feature.split(durations, dim=0) 21 | out_feature = torch.stack([F.pad(f, padding_value + [0, max_seq_len-len(f)]) 22 | for f in feature_splits]) # [b, t, *] 23 | return out_feature 24 | 25 | 26 | def temporal_separate_to_stack(feature: Tensor, durations: List) -> Tensor: 27 | """ 28 | Args: 29 | feature: [b, t, *] 30 | durations: [t_1, t_2, ..., t_b] 31 | Return: 32 | out_feature: [\sigma t_i *] 33 | """ 34 | out_feature = torch.cat([feature[i][:durations[i]] 35 | for i in range(len(feature))], dim=0) # [\sigma t_i *] 36 | return out_feature 37 | 38 | 39 | def generate_anchor_scores(proposals, label, seq_len, thres_score): 40 | """ 41 | Args: 42 | proposals: [b, t*n_windows, 2] 43 | label: [b, 2] 44 | Return: 45 | scores: [b, t*n_windows] 46 | scores_mask: [b, t*n_windows] 47 | """ 48 | illegal = torch.logical_or( 49 | proposals[..., 0] < 0, proposals[..., 1] >= seq_len) 50 | label 
= label[:, None].repeat(1, proposals.shape[1], 1) 51 | IoUs = calculate_IoU_batch_temporal(proposals, label) 52 | IoUs[illegal] = 0.0 53 | max_IoU = torch.max(IoUs, dim=1)[0] 54 | IoUs[IoUs < thres_score * max_IoU[:, None]] = 0.0 55 | IoUs = IoUs / (max_IoU[:, None] + 1e-4) 56 | scores = IoUs.float() 57 | scores_mask = (1 - illegal.float()) 58 | return scores, scores_mask 59 | 60 | 61 | def calculate_IoU_batch_temporal(box0: Tensor, box1: Tensor) -> Tensor: 62 | """ 63 | Args: 64 | box0: [b, n_boxes, 2] 65 | box1: [b, n_boxes, 2] 66 | Return: 67 | iou: [b, n_boxes] 68 | """ 69 | union = (torch.min(torch.stack([box0[..., 0], box1[..., 0]], 0), 0)[ 70 | 0], torch.max(torch.stack([box0[..., 1], box1[..., 1]], 0), 0)[0]) 71 | inter = (torch.max(torch.stack([box0[..., 0], box1[..., 0]], 0), 0)[ 72 | 0], torch.min(torch.stack([box0[..., 1], box1[..., 1]], 0), 0)[0]) 73 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 74 | iou[union[1] - union[0] < -1e-5] = 0 75 | iou[iou < 0] = 0.0 76 | return iou 77 | 78 | 79 | def generate_2d_gaussian(boxes, w, h, delta=0.05): 80 | """ 81 | generate gaussian according to the input boxes, normalized to [0, 1] [checked] 82 | Args: 83 | boxes: [k, 4] in the form of cxcywh 84 | w: the width of gaussian map 85 | h: the height of gaussian map 86 | delta: gaussian parameter 87 | Return: 88 | gaussian: [k, h, w] 89 | """ 90 | n_boxes = len(boxes) 91 | ww = torch.linspace(0, w-1, w) 92 | hh = torch.linspace(0, h-1, h) 93 | gridh, gridw = torch.meshgrid(hh, ww) 94 | grid = torch.stack([gridw, gridh], dim=0)[None, ...].repeat( 95 | n_boxes, 1, 1, 1).to(boxes.device) # [k, 2, h, w] 96 | boxes = boxes[..., None, None].repeat(1, 1, h, w) 97 | gaussian = torch.exp(-(boxes[:, 0]-grid[:, 0])**2/(delta*boxes[:, 2]**2)) *\ 98 | torch.exp(-(boxes[:, 1]-grid[:, 1])**2 / 99 | (delta*boxes[:, 3]**2)) # [k, h, w] 100 | gaussian[gaussian < 0.05] = 0 101 | return gaussian 102 | 103 | 104 | def compute_temporal_reg_tar(label, score): 105 | """ 106 | Args: 107 | label: [b, 2] 108 | score: [b, t] 109 | Return: 110 | label_reg: [b, t, 2] 111 | """ 112 | label = label.unsqueeze(1) 113 | segment_num = score.shape[1] 114 | index_s = torch.arange(0, segment_num).unsqueeze( 115 | 0).unsqueeze(-1).to(score.device) 116 | index_e = torch.arange(0, segment_num).unsqueeze( 117 | 0).unsqueeze(-1).to(score.device) 118 | 119 | label_reg_s = index_s - label[:, :, 0].unsqueeze(-1) 120 | label_reg_e = label[:, :, 1].unsqueeze(-1) - index_e 121 | 122 | label_reg = torch.cat([label_reg_s, label_reg_e], dim=-1) 123 | label_reg = label_reg * score.unsqueeze(-1) 124 | return label_reg 125 | 126 | 127 | def segment_tiou(box_a, box_b): 128 | 129 | # gt: [batch, 1, 2], detections: [batch, k, 2] 130 | # calculate interaction 131 | inter_max_xy = torch.min(box_a[:, :, -1], box_b[:, :, -1]) 132 | inter_min_xy = torch.max(box_a[:, :, 0], box_b[:, :, 0]) 133 | inter = torch.clamp((inter_max_xy - inter_min_xy), min=0) 134 | 135 | # calculate union 136 | union_max_xy = torch.max(box_a[:, :, -1], box_b[:, :, -1]) 137 | union_min_xy = torch.min(box_a[:, :, 0], box_b[:, :, 0]) 138 | union = torch.clamp((union_max_xy - union_min_xy), min=0) 139 | 140 | iou = inter / (union+1e-6) 141 | 142 | return iou 143 | -------------------------------------------------------------------------------- /referring_segmentation/pre_proc/generate_data_list.py: -------------------------------------------------------------------------------- 1 | import os 2 | import csv 3 | import json 4 | 5 | 6 | def 
generate_annotation_dict(root): 7 | annotation_file = open(os.path.join(root)) 8 | annotation_list = list(annotation_file.read().splitlines()) 9 | annotations = {} 10 | for i in range(1, len(annotation_list)): 11 | if 'a2d' in root: 12 | name, instance, desc = annotation_list[i].split(',') 13 | else: 14 | name, desc = annotation_list[i].split(',') 15 | instance = '0' 16 | if name not in annotations.keys(): 17 | annotations[name] = {} 18 | annotations[name][instance] = desc 19 | return annotations 20 | 21 | 22 | def generate_data_list_a2d(dataset_root='/media/wwk/HDD2/datasets/referring_video_segmentation/a2d_sentences/', save_root='./data'): 23 | if not os.path.exists(save_root): 24 | os.mkdir(save_root) 25 | assert os.path.exists( 26 | dataset_root), ('Incorrect dataset path: {}'.format(dataset_root)) 27 | num_videos = 0 28 | num_annotated_videos = 0 29 | videos = os.listdir(os.path.join(dataset_root, 'Rename_Images')) 30 | ignore_file = open(os.path.join( 31 | dataset_root, 'a2d_missed_videos.txt'), 'r') 32 | ignore_data_list = ignore_file.read().splitlines() 33 | videoset_file = open(os.path.join( 34 | dataset_root, 'videoset.csv'), 'r') 35 | videosets = list(csv.reader(videoset_file)) 36 | instances = generate_annotation_dict(os.path.join(dataset_root, 'a2d_annotation.txt')) 37 | 38 | json_train = {} 39 | json_test = {} 40 | for videoset in videosets: 41 | video_name = videoset[0] 42 | assert video_name in videos, ('Incorrect video name: {} in csv file: {}'.format( 43 | video_name, os.path.join(dataset_root, 'videoset.csv'))) 44 | if video_name not in ignore_data_list: 45 | num_videos += 1 46 | frames_root = os.path.join( 47 | dataset_root, 'Rename_Images', video_name) 48 | annotations_root = os.path.join( 49 | dataset_root, 'a2d_annotation_with_instances', video_name) 50 | if os.path.exists(annotations_root): 51 | num_annotated_videos += 1 52 | if videoset[-1] == '0': 53 | json_train[video_name] = { 54 | 'frames': os.path.join('Rename_Images', video_name), 'labels': os.path.join('a2d_annotation_with_instances', video_name), 'instances': instances[video_name]} 55 | else: 56 | json_test[video_name] = { 57 | 'frames': os.path.join('Rename_Images', video_name), 'labels': os.path.join('a2d_annotation_with_instances', video_name), 'instances': instances[video_name]} 58 | else: 59 | print( 60 | 'Annotation of video {} in A2D dataset not exits'.format(video_name)) 61 | 62 | with open(os.path.join(save_root, 'a2d_sentences_train.json'), 'w+') as json_train_file: 63 | json.dump(json_train, json_train_file, indent=1) 64 | with open(os.path.join(save_root, 'a2d_sentences_test.json'), 'w+') as json_test_file: 65 | json.dump(json_test, json_test_file, indent=1) 66 | print('A2D dataset : Total videos : {} | Annotated videos : {}'.format( 67 | num_videos, num_annotated_videos)) 68 | 69 | 70 | def generate_data_list_jhmdb(dataset_root='/media/wwk/HDD1/dataset/referring_video_segmentation/jhmdb_sentences', save_root='./data'): 71 | assert os.path.exists( 72 | dataset_root), ('Incorrect dataset path: {}'.format(dataset_root)) 73 | num_videos = 0 74 | num_annotated_videos = 0 75 | video_groups = [f for f in os.listdir(os.path.join( 76 | dataset_root, 'Rename_Images')) if '.' not in f] 77 | instances = generate_annotation_dict(os.path.join(dataset_root, 'jhmdb_annotation.txt')) 78 | json_test = {} 79 | for video_group in video_groups: 80 | videos_root = os.path.join( 81 | dataset_root, 'Rename_Images', video_group) 82 | videos = [f for f in os.listdir(videos_root) if '.' 
not in f] 83 | for video in videos: 84 | annotation_root = os.path.join( 85 | dataset_root, 'puppet_mask', video_group, video) 86 | num_videos += 1 87 | if os.path.exists(annotation_root): 88 | json_test[video] = {'frames': os.path.join('Rename_Images', video), 'labels': os.path.join('puppet_mask', video_group, video), 89 | 'instances': instances[video]} 90 | num_annotated_videos += 1 91 | else: 92 | print( 93 | 'Annotation of video {}/{} in JHMDB dataset not exits'.format(video_group, video)) 94 | with open(os.path.join(save_root, 'jhmdb_sentences_test.json'), 'w+') as json_test_file: 95 | json.dump(json_test, json_test_file, indent=1) 96 | print('JHMDB dataset : Total videos : {} | Annotated videos : {}'.format( 97 | num_videos, num_annotated_videos)) 98 | 99 | 100 | if __name__ == '__main__': 101 | save_root='./data' 102 | a2d_dataset_root='' 103 | jhmdb_dataset_root='' 104 | generate_data_list_a2d(a2d_dataset_root, save_root) 105 | generate_data_list_jhmdb(jhmdb_dataset_root, save_root) 106 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/torch_videovision.py: -------------------------------------------------------------------------------- 1 | # Adapted from https://github.com/hassony2/torch_videovision 2 | from PIL import Image 3 | import numbers 4 | import torch 5 | import cv2 6 | import numpy as np 7 | import PIL 8 | from PIL import Image 9 | 10 | 11 | def convert_img(img): 12 | """Converts (H, W, C) numpy.ndarray to (C, W, H) format""" 13 | if len(img.shape) == 3: 14 | img = img.transpose(2, 0, 1) 15 | if len(img.shape) == 2: 16 | img = np.expand_dims(img, 0) 17 | return img 18 | 19 | 20 | class ClipToTensor(object): 21 | """Convert a list of m (H x W x C) numpy.ndarrays in the range [0, 255] 22 | to a torch.FloatTensor of shape (C x m x H x W) in the range [0, 1.0] 23 | """ 24 | 25 | def __init__(self, channel_nb=3, div_255=True, numpy=False): 26 | self.channel_nb = channel_nb 27 | self.div_255 = div_255 28 | self.numpy = numpy 29 | 30 | def __call__(self, clip): 31 | """ 32 | Args: clip (list of numpy.ndarray): clip (list of images) 33 | to be converted to tensor. 
34 | """ 35 | # Retrieve shape 36 | if isinstance(clip[0], np.ndarray): 37 | h, w, ch = clip[0].shape 38 | assert ch == self.channel_nb, "Got {0} instead of 3 channels".format(ch) 39 | elif isinstance(clip[0], Image.Image): 40 | w, h = clip[0].size 41 | else: 42 | raise TypeError( 43 | "Expected numpy.ndarray or PIL.Image\ 44 | but got list of {0}".format( 45 | type(clip[0]) 46 | ) 47 | ) 48 | 49 | np_clip = np.zeros([self.channel_nb, len(clip), int(h), int(w)]) 50 | 51 | # Convert 52 | for img_idx, img in enumerate(clip): 53 | if isinstance(img, np.ndarray): 54 | pass 55 | elif isinstance(img, Image.Image): 56 | img = np.array(img, copy=False) 57 | else: 58 | raise TypeError( 59 | "Expected numpy.ndarray or PIL.Image\ 60 | but got list of {0}".format( 61 | type(clip[0]) 62 | ) 63 | ) 64 | img = convert_img(img) 65 | np_clip[:, img_idx, :, :] = img 66 | if self.numpy: 67 | if self.div_255: 68 | np_clip = np_clip / 255 69 | return np_clip 70 | 71 | else: 72 | tensor_clip = torch.from_numpy(np_clip) 73 | 74 | if not isinstance(tensor_clip, torch.FloatTensor): 75 | tensor_clip = tensor_clip.float() 76 | if self.div_255: 77 | tensor_clip = tensor_clip.div(255) 78 | return tensor_clip 79 | 80 | 81 | def _is_tensor_clip(clip): 82 | return torch.is_tensor(clip) and clip.ndimension() == 4 83 | 84 | 85 | def crop_clip(clip, min_h, min_w, h, w): 86 | if isinstance(clip[0], np.ndarray): 87 | cropped = [img[min_h : min_h + h, min_w : min_w + w, :] for img in clip] 88 | 89 | elif isinstance(clip[0], PIL.Image.Image): 90 | cropped = [img.crop((min_w, min_h, min_w + w, min_h + h)) for img in clip] 91 | else: 92 | raise TypeError( 93 | "Expected numpy.ndarray or PIL.Image" 94 | + "but got list of {0}".format(type(clip[0])) 95 | ) 96 | return cropped 97 | 98 | 99 | def normalize(clip, mean, std, inplace=False): 100 | if not _is_tensor_clip(clip): 101 | raise TypeError("tensor is not a torch clip.") 102 | 103 | if not inplace: 104 | clip = clip.clone() 105 | 106 | dtype = clip.dtype 107 | mean = torch.as_tensor(mean, dtype=dtype, device=clip.device) 108 | std = torch.as_tensor(std, dtype=dtype, device=clip.device) 109 | clip.sub_(mean[:, None, None, None]).div_(std[:, None, None, None]) 110 | 111 | return clip 112 | 113 | 114 | def get_resize_sizes(im_h, im_w, size): 115 | if im_w < im_h: 116 | ow = size 117 | oh = int(size * im_h / im_w) 118 | else: 119 | oh = size 120 | ow = int(size * im_w / im_h) 121 | return oh, ow 122 | 123 | 124 | def resize_clip(clip, size, interpolation="bilinear"): 125 | if isinstance(clip[0], np.ndarray): 126 | if isinstance(size, numbers.Number): 127 | im_h, im_w, im_c = clip[0].shape 128 | # Min spatial dim already matches minimal size 129 | if (im_w <= im_h and im_w == size) or (im_h <= im_w and im_h == size): 130 | return clip 131 | new_h, new_w = get_resize_sizes(im_h, im_w, size) 132 | size = (new_w, new_h) 133 | else: 134 | size = size[1], size[0] 135 | if interpolation == "bilinear": 136 | np_inter = cv2.INTER_LINEAR 137 | else: 138 | np_inter = cv2.INTER_NEAREST 139 | scaled = [cv2.resize(img, size, interpolation=np_inter) for img in clip] 140 | elif isinstance(clip[0], PIL.Image.Image): 141 | if isinstance(size, numbers.Number): 142 | im_w, im_h = clip[0].size 143 | # Min spatial dim already matches minimal size 144 | if (im_w <= im_h and im_w == size) or (im_h <= im_w and im_h == size): 145 | return clip 146 | new_h, new_w = get_resize_sizes(im_h, im_w, size) 147 | size = (new_w, new_h) 148 | else: 149 | size = size[1], size[0] 150 | if interpolation == "bilinear": 151 
| pil_inter = PIL.Image.NEAREST 152 | else: 153 | pil_inter = PIL.Image.BILINEAR 154 | scaled = [img.resize(size, pil_inter) for img in clip] 155 | else: 156 | raise TypeError( 157 | "Expected numpy.ndarray or PIL.Image" 158 | + "but got list of {0}".format(type(clip[0])) 159 | ) 160 | return scaled 161 | -------------------------------------------------------------------------------- /temporal_grounding/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils import data 3 | import h5py 4 | import json 5 | from utils.utils import * 6 | from tqdm import tqdm 7 | import torch.nn as nn 8 | import torch 9 | import numpy as np 10 | import torchtext 11 | 12 | 13 | class MyDataset(data.Dataset): 14 | def __init__(self, config, mode='train'): 15 | super(MyDataset, self).__init__() 16 | self.config = config 17 | self.dataset = config['{}ing_datasets'.format(mode)] 18 | self.embedding_type = config['embedding_type'] 19 | self.segment_num = config['segment_num'] 20 | self.mode = mode 21 | self.max_embedding_length = config['embedding_length'] 22 | print('Preparing dataset: {}'.format(self.dataset)) 23 | self.datas = [] 24 | 25 | with open(os.path.join('./data', self.dataset, '{}.json'.format(mode)), 'r') as f: 26 | videosets = json.load(f) 27 | for n, video in tqdm(enumerate(videosets), total=len(videosets)): 28 | data = {} 29 | data['vid'] = video[0] 30 | data['timestamp'] = video[2] 31 | data['duration'] = video[1] 32 | data['words'] = video[3] 33 | data['index'] = n 34 | if (data['timestamp'][1] - data['timestamp'][0]) > 0 and data['timestamp'][1] <= data['duration'] and data['timestamp'][0] <= data['duration']: 35 | self.datas.append(data) 36 | self.feat = h5py.File(config['datasets_root']) 37 | 38 | self.proposals = generate_proposals( 39 | config['segment_num'], config['window_width']) 40 | embedding_name, embedding_dim = self.config['embedding_type'].split( 41 | '_')[1], int(self.config['embedding_type'].split('_')[2]) 42 | self.vocab = torchtext.vocab.GloVe( 43 | name=embedding_name, dim=embedding_dim) 44 | self.vocab.itos.extend(['']) 45 | self.vocab.stoi[''] = self.vocab.vectors.shape[0] 46 | self.vocab.vectors = torch.cat( 47 | [self.vocab.vectors, torch.zeros(1, self.vocab.dim)], dim=0) 48 | self.word_embedding = nn.Embedding.from_pretrained(self.vocab.vectors) 49 | 50 | def generate_label_feats(self, feat, label): 51 | ori_video_len = feat.shape[0] 52 | index = np.linspace(start=0, stop=ori_video_len - 1, 53 | num=self.segment_num).astype(np.int32) 54 | new_video = [] 55 | for i in range(len(index) - 1): 56 | start = index[i] 57 | end = index[i + 1] 58 | if start == end or start + 1 == end: 59 | new_video.append(feat[start]) 60 | else: 61 | new_video.append(np.mean(feat[start: end], 0)) 62 | new_video.append(feat[-1]) 63 | feat = np.stack(new_video, 0) 64 | try: 65 | label[0] = min(np.where(index >= label[0])[0]) 66 | except: 67 | print(label, index) 68 | if label[1] == ori_video_len - 1: 69 | label[1] = self.segment_num - 1 70 | else: 71 | label[1] = max(np.where(index <= label[1])[0]) 72 | if label[1] < label[0]: 73 | label[0] = label[1] 74 | return feat, label 75 | 76 | def __len__(self): 77 | return len(self.datas) 78 | 79 | def __getitem__(self, item): 80 | if self.dataset != 'ActivityNet': 81 | feat = self.feat[self.datas[item]['vid']][:] 82 | else: 83 | feat = self.feat[self.datas[item]['vid']]['c3d_features'][:] 84 | 85 | duration = self.datas[item]['duration'] 86 | timestamp = self.datas[item]['timestamp'] 
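# The feature sequence is resampled to a fixed number of segments below, and the
# ground-truth moment (given in seconds) is mapped to segment indices via
# segment_num * timestamp / duration, with the start floored at 0 and the end
# capped at segment_num - 1. E.g. with segment_num = 128, duration = 60.0 and
# timestamp = [12.0, 30.0] (illustrative numbers only), label becomes [25, 64]
# after truncation to int32.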
87 | 88 | feat = torch.from_numpy(feat) 89 | feat = average_to_fixed_length(feat, self.segment_num) 90 | 91 | start_frame = max(self.segment_num * timestamp[0] / duration, 0) 92 | end_frame = min(self.segment_num * 93 | timestamp[1] / duration, self.segment_num-1) 94 | if start_frame > end_frame: 95 | start_frame = end_frame 96 | label = np.asarray([start_frame, end_frame]).astype(np.int32) 97 | 98 | word_idxs = torch.tensor([self.vocab.stoi.get(w, len(self.vocab.stoi)-1) 99 | for w in self.datas[item]['words'].strip().split()], dtype=torch.long) 100 | embedding = self.word_embedding(word_idxs) 101 | embedding_length = embedding.shape[0] 102 | 103 | if embedding_length > self.max_embedding_length: 104 | embedding_padded = embedding[: self.max_embedding_length, :] 105 | embedding_length = self.max_embedding_length 106 | else: 107 | embedding_padded = torch.zeros( 108 | (self.max_embedding_length, embedding.shape[1])) 109 | embedding_padded[: embedding.shape[0], :] = embedding 110 | 111 | scores, scores_mask, adj_mat = generate_scores( 112 | self.proposals, label, self.segment_num, self.config['thres_score'], self.config['thres_adjmat']) 113 | 114 | score_nm = [] 115 | for i in range(self.segment_num): 116 | if i >= label[0] and i <= label[1]: 117 | score_nm.append(1) 118 | else: 119 | score_nm.append(0) 120 | score_nm = torch.tensor(score_nm).float() 121 | 122 | return { 123 | 'embedding': embedding_padded, 124 | 'feat': feat, 125 | 'embedding_length': embedding_length, 126 | 'label': label, 127 | 'duration': duration, 128 | 'vid': self.datas[item]['vid'], 129 | 'score': scores, 130 | 'score_nm': score_nm, 131 | 'score_mask': scores_mask, 132 | 'proposals': self.proposals.astype(np.float32), 133 | 'adj_mat': adj_mat, 134 | 'index': self.datas[item]['index'] 135 | } 136 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/postprocessors.py: -------------------------------------------------------------------------------- 1 | # Adapted from 2 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. 
All Rights Reserved 3 | """Postprocessors class to transform TubeDETR output according to the downstream task""" 4 | import imp 5 | from typing import Dict 6 | 7 | import torch 8 | import torch.nn.functional as F 9 | from torch import nn 10 | 11 | from .anchor_utils import generate_proposals 12 | 13 | 14 | class PostProcessSTVG(nn.Module): 15 | def __init__(self, args): 16 | super().__init__() 17 | self.args = args 18 | 19 | @torch.no_grad() 20 | def forward(self, outputs, frames_id=None, video_ids=None, time_mask=None): 21 | """ 22 | :param outputs: must contain a key pred_sted mapped to a [B, T, 2] tensor of logits for the start and end predictions 23 | :param frames_id: list of B lists which contains the increasing list of frame ids corresponding to the indexes of the decoder outputs 24 | :param video_ids: list of B video_ids, used to ensemble predictions when video_max_len_train < video_max_len 25 | :param time_mask: [B, T] tensor with False on the padded positions, used to take out padded frames from the possible predictions 26 | :return: list of B [start_frame, end_frame] for each video 27 | """ 28 | 29 | if self.args.temporal_decoder_type == 'anchor': 30 | temporal_score, temporal_offset = outputs['temporal_score'], outputs['temporal_offset'] 31 | max_length = temporal_score.shape[1]//len( 32 | self.args.temporal_window_width) 33 | proposals = generate_proposals( 34 | max_length, self.args.temporal_window_width).to(temporal_score.device) 35 | refined_boxes = proposals[None, :] + \ 36 | temporal_offset # [b, t*n_windows, 2] 37 | _, ind = torch.topk(temporal_score, 1, -1) 38 | pred_steds = torch.gather(refined_boxes, 1, ind[..., None].repeat( 39 | 1, 1, 2)).squeeze(1).long() # b*2 40 | pred_steds = pred_steds.clamp(0, max_length-1) 41 | elif self.args.temporal_decoder_type == 'regression': 42 | temporal_score, temporal_reg = outputs['temporal_score'], outputs['temporal_reg'] 43 | max_length = temporal_score.shape[1] 44 | index = torch.as_tensor([i for i in range(max_length)]).to( 45 | temporal_score.device)[None] 46 | pred_start = index - temporal_reg[:, :, 0] 47 | pred_end = index + temporal_reg[:, :, 1] 48 | predictions = torch.stack([pred_start, pred_end], dim=-1) 49 | _, ind = torch.topk(temporal_score, 1, -1) 50 | pred_steds = torch.gather(predictions, 1, ind[..., None].repeat( 51 | 1, 1, 2)).squeeze(1).long() # b*2 52 | pred_steds = pred_steds.clamp(0, max_length-1) 53 | 54 | frames_id = ( 55 | torch.tensor([row + [0] * (max_length - len(row)) 56 | for row in frames_id]) 57 | .long() 58 | .to(pred_steds.device) 59 | ) # padded up to BxT 60 | # get corresponding frames id from the indexes 61 | pred_steds = torch.gather(frames_id, 1, pred_steds) 62 | pred_steds = pred_steds.float() 63 | pred_steds[:, 1] += 1 # the end frame is excluded in evaluation 64 | 65 | pred_steds = pred_steds.cpu().tolist() 66 | 67 | return pred_steds 68 | 69 | 70 | class PostProcess(nn.Module): 71 | """This module converts the model's output into the format expected by the coco api""" 72 | 73 | @torch.no_grad() 74 | def forward(self, outputs, target_sizes): 75 | """Perform the computation 76 | Parameters: 77 | outputs: raw outputs of the model 78 | target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch 79 | For evaluation, this must be the original image size (before any data augmentation) 80 | For visualization, this should be the image size after data augment, but before padding 81 | """ 82 | hm_s, wh_s = outputs['spatial_map'].squeeze(1), outputs['spatial_wh'] 
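# Shape convention assumed by the decoding below: hm_s is the per-frame centre
# heat map, [T, H, W] after squeezing the channel dim, and wh_s stores the
# predicted box width/height at every location, [T, 2, H, W]. The peak of each
# frame's heat map gives the box centre, the width/height gathered at that peak
# gives its size; the resulting xyxy box in heat-map coordinates is then
# rescaled to the original image size.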
83 | 84 | # Find the top response in heat map 85 | time, height, width = hm_s.size() 86 | topk_scores, topk_inds = torch.topk(hm_s.view(time, -1), 1) 87 | topk_inds = topk_inds % (height * width) 88 | topk_ys = (topk_inds / width).int().float() # t*1 89 | topk_xs = (topk_inds % width).int().float() # t*1 90 | 91 | pre_wh = torch.gather(wh_s.view(wh_s.shape[0], wh_s.shape[1], -1), -1, 92 | topk_inds.unsqueeze(1).repeat(1, 2, 1)) # t*2*1 93 | out_bbox = torch.cat([topk_xs - pre_wh[:, 0, :] / 2, 94 | topk_ys - pre_wh[:, 1, :] / 2, 95 | topk_xs + pre_wh[:, 0, :] / 2, 96 | topk_ys + pre_wh[:, 1, :] / 2], dim=-1) # t*4 97 | out_bbox[:, 2].clamp(0, width) 98 | out_bbox[:, 3].clamp(0, height) 99 | 100 | img_h, img_w = target_sizes.unbind(1) 101 | scale_fct_out = torch.tensor( 102 | [width, height, width, height]).float().unsqueeze(0).to(out_bbox.device) 103 | scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1) 104 | boxes = (out_bbox / scale_fct_out) * scale_fct 105 | 106 | results = [{"boxes": b} for b in boxes] 107 | 108 | return results 109 | 110 | 111 | def build_postprocessors(args, dataset_name=None) -> Dict[str, nn.Module]: 112 | postprocessors: Dict[str, nn.Module] = {"bbox": PostProcess()} 113 | 114 | if dataset_name: 115 | if dataset_name in ["vidstg", "hcstvg"]: 116 | postprocessors[dataset_name] = PostProcessSTVG(args) 117 | 118 | return postprocessors 119 | -------------------------------------------------------------------------------- /referring_segmentation/utils/tester.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import os 3 | import time 4 | import datetime 5 | from torch.utils import data 6 | import torch.nn.functional as F 7 | from tqdm import tqdm 8 | from utils.utils import report_result 9 | import numpy as np 10 | import cv2 11 | 12 | 13 | class Tester(object): 14 | def __init__(self, config): 15 | self.config = config 16 | self.save_fold = config['test_savefold'] 17 | self.checkpoint = config['checkpoint'] 18 | if not os.path.exists(self.save_fold): 19 | os.mkdir(self.save_fold) 20 | assert os.path.exists(self.checkpoint), 'incorrect checkpoint path!' 
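    # Typical call pattern (illustrative; the concrete values come from the test
    # config): construct the tester from a config providing 'test_savefold',
    # 'checkpoint' and 'cuda', then hand it a DataParallel-wrapped network
    # (test() below loads the weights through self.model.module) together with a
    # test-mode dataset, e.g.
    #   tester = Tester(config)
    #   tester.test(torch.nn.DataParallel(model).cuda(), MyDataset(config, mode='test'))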
21 | 22 | def model_info(self): 23 | print(self.model) 24 | num_params = 0 25 | for p in self.model.parameters(): 26 | num_params += p.numel() # 返回一个tensor变量内所有元素个数 27 | print("The total number of parameters: {}".format(num_params)) 28 | 29 | def test(self, model, dataset): 30 | self.model = model 31 | self.model.eval() 32 | self.model_info() 33 | print('loading checkpoint ....') 34 | checkpoint = torch.load(self.checkpoint) 35 | self.model.module.load_state_dict(checkpoint['state_dict']) 36 | loader = data.DataLoader(dataset, 1, False, num_workers=8) 37 | 38 | num_frames = 0 39 | total_times = 0 40 | pres = [] 41 | gts = [] 42 | 43 | pres_s = [] 44 | pres_m = [] 45 | pres_l = [] 46 | gts_s = [] 47 | gts_m = [] 48 | gts_l = [] 49 | 50 | with torch.no_grad(): 51 | print('video sequence num: {}'.format(len(loader))) 52 | print('testing.....') 53 | 54 | for data_batch in tqdm(loader): 55 | frames, labels, embedding = data_batch['frames'], data_batch['label'], data_batch['word_embedding'] 56 | embedding_length = data_batch['embedding_length'] 57 | is_annotated = data_batch['is_annotated'] 58 | video = data_batch['video'] 59 | name = data_batch['name'] 60 | instance = data_batch['instance'] 61 | # if not os.path.exists(os.path.join(self.save_fold, video[0])): 62 | # os.mkdir(os.path.join(self.save_fold, video[0])) 63 | # os.mkdir(os.path.join(self.save_fold, video[0], 'pre')) 64 | # os.mkdir(os.path.join(self.save_fold, video[0], 'gt')) 65 | video_save_root = os.path.join(self.save_fold, video[0], 'pre') 66 | gt_save_fold = os.path.join(self.save_fold, video[0], 'gt') 67 | if self.config['cuda']: 68 | for f in range(len(frames)): 69 | frames[f] = frames[f].cuda() 70 | labels[f] = labels[f].cuda() 71 | embedding = embedding.cuda() 72 | num_frames += len(frames) 73 | start_time = time.time() 74 | predictions, maps, _, _ = model(frames, embedding, embedding_length) 75 | 76 | for j, prediction in enumerate(predictions): 77 | if is_annotated[j][0] == 1: 78 | pre = torch.sigmoid(prediction) 79 | pre = (pre - pre.min()) / (pre.max() - pre.min()) 80 | pre = F.interpolate(pre, labels[j].shape[2:], mode='bilinear', align_corners=True) 81 | pre_thres = torch.where(pre>0.5, torch.ones_like(pre), torch.zeros_like(pre)) 82 | gts.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 83 | pres.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 84 | # if len(predictions) > 80 and len(predictions) < 150: 85 | # gts_m.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 86 | # pres_m.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 87 | if len(predictions) > 100: 88 | gts_l.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 89 | pres_l.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 90 | else: 91 | gts_s.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 92 | pres_s.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 93 | 94 | total_times += time.time() - start_time 95 | 96 | 97 | total_times = datetime.timedelta(seconds=int(total_times)) 98 | time_per_frame = total_times / num_frames 99 | 100 | print('prediction time per frame: {}s'.format(time_per_frame)) 101 | 102 | print('evaluation...') 103 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres, gts) 104 | print('evaluation results: meanIOU: {} | overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format(meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 105 | 106 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres_s, gts_s) 107 | print( 108 | 'evaluation short results: meanIOU: {} 
| overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format( 109 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 110 | 111 | # meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres_m, gts_m) 112 | # print( 113 | # 'evaluation middle results: meanIOU: {} | overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format( 114 | # meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 115 | 116 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres_l, gts_l) 117 | print( 118 | 'evaluation long results: meanIOU: {} | overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format( 119 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 120 | 121 | 122 | 123 | 124 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/attention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from einops import rearrange, repeat 5 | from torch import nn 6 | 7 | 8 | class GlobalTextPresentation(nn.Module): 9 | def __init__(self, text_dim): 10 | super(GlobalTextPresentation, self).__init__() 11 | self.W_txt = nn.Linear(text_dim, text_dim) 12 | 13 | def forward(self, fea_text, mask=None): 14 | weight_text = self.W_txt(fea_text) # B*L*C 15 | if mask is not None: 16 | weight_text = weight_text.masked_fill(mask == 0, -1e9) 17 | weight_text = weight_text.softmax(dim=1) 18 | fea_text_global = fea_text * weight_text 19 | fea_text_global = fea_text_global.sum(dim=1) # B*C 20 | return fea_text_global 21 | 22 | 23 | class MuTan(nn.Module): 24 | def __init__(self, video_fea_dim, text_fea_dim, out_fea_dim, heads=5): 25 | super(MuTan, self).__init__() 26 | 27 | self.heads = heads 28 | self.Wv = nn.ModuleList( 29 | [nn.Conv2d(video_fea_dim+8, out_fea_dim, 1, 1) for i in range(heads)]) 30 | self.Wt = nn.ModuleList( 31 | [nn.Conv2d(text_fea_dim, out_fea_dim, 1, 1) for i in range(heads)]) 32 | 33 | def forward(self, video_fea, text_fea, spatial): 34 | video_fea = torch.cat([video_fea, spatial], dim=1) 35 | fea_outs = [] 36 | for i in range(self.heads): 37 | fea_v = self.Wv[i](video_fea) 38 | fea_v = torch.tanh(fea_v) # B*C*H*W 39 | 40 | fea_t = self.Wt[i](text_fea) 41 | fea_t = torch.tanh(fea_t) # B*C*1*1 42 | 43 | fea_out = fea_v * fea_t 44 | fea_outs.append(fea_out.unsqueeze(-1)) 45 | fea_outs = torch.cat(fea_outs, dim=-1) 46 | fea_outs = torch.sum(fea_outs, dim=-1) 47 | mutan_fea = torch.tanh(fea_outs) 48 | mutan_fea = F.normalize(mutan_fea, dim=1) 49 | return mutan_fea 50 | 51 | 52 | class RelevanceFilter(nn.Module): 53 | def __init__(self, text_fea_dim, video_fea_dim, attention_dim, groups=8, kernelsize=(1, 1), dilation=(1, 1), phase='3D'): 54 | super().__init__() 55 | assert phase in ['1D', '2D', '3D'] 56 | assert text_fea_dim % groups == 0 57 | assert attention_dim % groups == 0 58 | self.phase = phase 59 | self.groups = groups 60 | self.kernel_size = kernelsize 61 | self.dilation = dilation 62 | if phase == '1D': 63 | assert len(kernelsize) == 1 and len(dilation) == 1 64 | self.Wkv = nn.Conv1d(video_fea_dim, 2*attention_dim, 1, 1) 65 | self.Wt = nn.Linear(text_fea_dim, attention_dim * kernelsize[0]) 66 | self.padding = (kernelsize[0]//2)*dilation[0] 67 | elif phase == '2D': 68 | assert len(kernelsize) == 2 and len(dilation) == 2 69 | self.Wkv = nn.Conv2d(video_fea_dim, 2*attention_dim, 1, 1) 70 | self.Wt = nn.Linear(text_fea_dim, attention_dim * 71 | kernelsize[0] * 
kernelsize[1]) 72 | self.padding = ( 73 | (kernelsize[0]//2)*dilation[0], (kernelsize[1]//2)*dilation[1]) 74 | elif phase == '3D': 75 | assert len(kernelsize) == 3 and len(dilation) == 3 76 | self.Wkv = nn.Conv3d(video_fea_dim, 2*attention_dim, 1, 1) 77 | self.Wt = nn.Linear(text_fea_dim, attention_dim * 78 | kernelsize[0] * kernelsize[1] * kernelsize[2]) 79 | self.padding = ((kernelsize[0]//2)*dilation[0], (kernelsize[1]//2) 80 | * dilation[1], (kernelsize[2]//2)*dilation[2]) 81 | 82 | def forward(self, video_fea, text_fea, masks=None): 83 | b = video_fea.shape[0] 84 | 85 | kv = self.Wkv(video_fea) 86 | k, v = kv.chunk(2, dim=1) 87 | kernel = self.Wt(text_fea) 88 | 89 | if self.phase == '1D': 90 | kernel = repeat(kernel, 'b (g c k0) -> (b g) c k0', 91 | k0=self.kernel_size[0], g=self.groups) 92 | k = repeat(k, 'b c l0 -> n (b c) l0', n=1) 93 | att = F.conv1d(k, kernel, padding=self.padding, 94 | dilation=self.dilation[0], groups=b*self.groups) 95 | att = rearrange(att, 'n (b g c) l0 -> (n b) g c l0', 96 | b=b, g=self.groups) 97 | v = rearrange(v, 'b (g c) l0 -> b g c l0', g=self.groups) 98 | elif self.phase == '2D': 99 | kernel = repeat(kernel, 'b (g c k0 k1) -> (b g) c k0 k1', 100 | k0=self.kernel_size[0], k1=self.kernel_size[1], g=self.groups) 101 | k = repeat(k, 'b c l0 l1 -> n (b c) l0 l1', n=1) 102 | att = F.conv2d(k, kernel, padding=self.padding, 103 | dilation=self.dilation, groups=b*self.groups) 104 | att = rearrange( 105 | att, 'n (b g c) l0 l1 -> (n b) g c l0 l1', b=b, g=self.groups) 106 | v = rearrange(v, 'b (g c) l0 l1 -> b g c l0 l1', g=self.groups) 107 | elif self.phase == '3D': 108 | kernel = repeat(kernel, 'b (g c k0 k1 k2) -> (b g) c k0 k1 k2', 109 | k0=self.kernel_size[0], k1=self.kernel_size[1], k2=self.kernel_size[2], g=self.groups) 110 | k = repeat(k, 'b c l0 l1 l2 -> n (b c) l0 l1 l2', n=1) 111 | att = F.conv3d(k, kernel, padding=self.padding, 112 | dilation=self.dilation, groups=b*self.groups) 113 | att = rearrange( 114 | att, 'n (b g c) l0 l1 l2 -> (n b) g c l0 l1 l2', b=b, g=self.groups) 115 | v = rearrange( 116 | v, 'b (g c) l0 l1 l2 -> b g c l0 l1 l2', g=self.groups) 117 | active_map = att.mean(dim=1) 118 | out = v * torch.sigmoid(att) 119 | out = torch.flatten(out, 1, 2) 120 | 121 | if masks is not None: 122 | out = out * masks 123 | active_map = active_map.sigmoid() * masks 124 | return active_map, out 125 | -------------------------------------------------------------------------------- /temporal_grounding/utils/adam_optimizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 
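# Usage sketch for the Adam class defined below (illustrative values; `model`,
# `criterion`, `batch` and `target` are placeholders, not objects from this repo):
#   optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
#                    eps=1e-8, weight_decay=1e-4)
#   loss = criterion(model(batch), target)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()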
7 | 8 | import math 9 | import torch 10 | import torch.optim 11 | 12 | from .optimizer import FairseqOptimizer 13 | 14 | 15 | class AdamOptimizer(FairseqOptimizer): 16 | def __init__(self, args, params): 17 | super().__init__(args, params) 18 | self._optimizer = Adam(params, **self.optimizer_config) 19 | 20 | @staticmethod 21 | def add_args(parser): 22 | """Add optimizer-specific arguments to the parser.""" 23 | # fmt: off 24 | parser.add_argument('--adam-betas', default='(0.9, 0.999)', metavar='B', 25 | help='betas for Adam optimizer') 26 | parser.add_argument('--adam-eps', type=float, default=1e-8, metavar='D', 27 | help='epsilon for Adam optimizer') 28 | # fmt: on 29 | 30 | @property 31 | def optimizer_config(self): 32 | """ 33 | Return a kwarg dictionary that will be used to override optimizer 34 | args stored in checkpoints. This allows us to load a checkpoint and 35 | resume training using a different set of optimizer args, e.g., with a 36 | different learning rate. 37 | """ 38 | return { 39 | 'lr': self.args.lr, 40 | 'betas': eval(self.args.adam_betas), 41 | 'eps': self.args.adam_eps, 42 | 'weight_decay': self.args.weight_decay, 43 | } 44 | 45 | 46 | class Adam(torch.optim.Optimizer): 47 | """Implements Adam algorithm. 48 | 49 | This implementation is modified from torch.optimizer.Adam based on: 50 | `Fixed Weight Decay Regularization in Adam` 51 | (see https://arxiv.org/abs/1711.05101) 52 | 53 | It has been proposed in `Adam: A Method for Stochastic Optimization`_. 54 | 55 | Arguments: 56 | params (iterable): iterable of parameters to optimize or dicts defining 57 | parameter groups 58 | lr (float, optional): learning rate (default: 1e-3) 59 | betas (Tuple[float, float], optional): coefficients used for computing 60 | running averages of gradient and its square (default: (0.9, 0.999)) 61 | eps (float, optional): term added to the denominator to improve 62 | numerical stability (default: 1e-8) 63 | weight_decay (float, optional): weight decay (L2 penalty) (default: 0) 64 | amsgrad (boolean, optional): whether to use the AMSGrad variant of this 65 | algorithm from the paper `On the Convergence of Adam and Beyond`_ 66 | 67 | .. _Adam\: A Method for Stochastic Optimization: 68 | https://arxiv.org/abs/1412.6980 69 | .. _On the Convergence of Adam and Beyond: 70 | https://openreview.net/forum?id=ryQu7f-RZ 71 | """ 72 | 73 | def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, 74 | weight_decay=0, amsgrad=False): 75 | defaults = dict(lr=lr, betas=betas, eps=eps, 76 | weight_decay=weight_decay, amsgrad=amsgrad) 77 | super(Adam, self).__init__(params, defaults) 78 | 79 | def step(self, closure=None): 80 | """Performs a single optimization step. 81 | 82 | Arguments: 83 | closure (callable, optional): A closure that reevaluates the model 84 | and returns the loss. 
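        Note: unlike the stock torch.optim.Adam, the weight decay used here is
        decoupled from the adaptive update; it is applied directly to the
        parameters and scaled by the learning rate, following the fixed
        weight-decay formulation cited in the class docstring.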
85 | """ 86 | loss = None 87 | if closure is not None: 88 | loss = closure() 89 | 90 | for group in self.param_groups: 91 | for p in group['params']: 92 | if p.grad is None: 93 | continue 94 | grad = p.grad.data 95 | if grad.is_sparse: 96 | raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead') 97 | amsgrad = group['amsgrad'] 98 | 99 | state = self.state[p] 100 | 101 | # State initialization 102 | if len(state) == 0: 103 | state['step'] = 0 104 | # Exponential moving average of gradient values 105 | state['exp_avg'] = torch.zeros_like(p.data) 106 | # Exponential moving average of squared gradient values 107 | state['exp_avg_sq'] = torch.zeros_like(p.data) 108 | if amsgrad: 109 | # Maintains max of all exp. moving avg. of sq. grad. values 110 | state['max_exp_avg_sq'] = torch.zeros_like(p.data) 111 | 112 | exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] 113 | if amsgrad: 114 | max_exp_avg_sq = state['max_exp_avg_sq'] 115 | beta1, beta2 = group['betas'] 116 | 117 | state['step'] += 1 118 | 119 | # Decay the first and second moment running average coefficient 120 | exp_avg.mul_(beta1).add_(1 - beta1, grad) 121 | exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) 122 | if amsgrad: 123 | # Maintains the maximum of all 2nd moment running avg. till now 124 | torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq) 125 | # Use the max. for normalizing running avg. of gradient 126 | denom = max_exp_avg_sq.sqrt().add_(group['eps']) 127 | else: 128 | denom = exp_avg_sq.sqrt().add_(group['eps']) 129 | 130 | bias_correction1 = 1 - beta1 ** state['step'] 131 | bias_correction2 = 1 - beta2 ** state['step'] 132 | step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1 133 | 134 | if group['weight_decay'] != 0: 135 | p.data.add_(-group['weight_decay'] * group['lr'], p.data) 136 | 137 | p.data.addcdiv_(-step_size, exp_avg, denom) 138 | 139 | return loss 140 | -------------------------------------------------------------------------------- /referring_segmentation/dataset/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | from PIL import Image 3 | from torch.utils import data 4 | import numpy as np 5 | import h5py 6 | from torchvision import transforms 7 | from dataset import augmentation 8 | import torch 9 | from tqdm import tqdm 10 | from utils.video_reader import clip_annotation_reader, sequence_reader 11 | import json 12 | import torchtext 13 | import torch.nn as nn 14 | 15 | 16 | class MyDataset(data.Dataset): 17 | def __init__(self, config, mode='train'): 18 | super(MyDataset, self).__init__() 19 | self.input_size = config['input_size'] 20 | self.clip_size = config['clip_size'] 21 | self.datasets = config['{}ing_datasets'.format(mode)] 22 | self.dataset_root = config['datasets_root'] 23 | self.max_embedding_length = config['max_embedding_length'] 24 | self.mode = mode 25 | if type(self.datasets) != list: 26 | self.datasets = [self.datasets] 27 | print('Preparing datasets: {}'.format(self.datasets)) 28 | self.datas = [] 29 | augmen = [augmentation.FixedResize(self.input_size)] 30 | if mode == 'train': 31 | if config['augmentations']['random_crop']: 32 | augmen.append(augmentation.RandomScale((1.0, 1.1))) 33 | augmen.append(augmentation.ExtRandomCrop(self.input_size, pad_if_needed=True)) 34 | if config['augmentations']['random_flip']: 35 | augmen.append(augmentation.RandomHorizontalFlip()) 36 | augmen.append(augmentation.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 
0.224, 0.225))) 37 | augmen.append(augmentation.ToTensor()) 38 | self.transformation = transforms.Compose(augmen) 39 | 40 | for dataset in self.datasets: 41 | assert os.path.exists('./data/{}_{}.json'.format(dataset.lower(), mode)), 'json file not exist: {}'.format('./data/{}_{}.json'.format(dataset.lower(), mode)) 42 | with open('./data/{}_{}.json'.format(dataset.lower(), mode), 'r') as f: 43 | videosets = json.load(f) 44 | 45 | for video_file, attribute in tqdm(videosets.items()): 46 | video_root, annotation_root, instances = attribute['frames'], attribute['labels'], attribute['instances'] 47 | if mode == 'train': 48 | video_data = clip_annotation_reader(os.path.join(self.dataset_root, video_root), os.path.join(self.dataset_root, annotation_root), \ 49 | instances, self.clip_size, annotation_center=False, dataset=dataset) 50 | else: 51 | video_data = sequence_reader(os.path.join(self.dataset_root, video_root), os.path.join(self.dataset_root, annotation_root), instances, dataset=dataset) 52 | self.datas += video_data 53 | 54 | embedding_name, embedding_dim = config['embedding_type'].split( 55 | '_')[1], int(config['embedding_type'].split('_')[2]) 56 | self.vocab = torchtext.vocab.GloVe( 57 | name=embedding_name, dim=embedding_dim) 58 | self.vocab.itos.extend(['']) 59 | self.vocab.stoi[''] = self.vocab.vectors.shape[0] 60 | self.vocab.vectors = torch.cat( 61 | [self.vocab.vectors, torch.zeros(1, self.vocab.dim)], dim=0) 62 | self.word_embedding = nn.Embedding.from_pretrained(self.vocab.vectors) 63 | 64 | def __len__(self): 65 | return len(self.datas) 66 | 67 | def __getitem__(self, item): 68 | frames = [] 69 | annotations = [] 70 | is_annotated = [] 71 | frame_names = [] 72 | instance = self.datas[item]['instance'] 73 | 74 | word_idxs = torch.tensor([self.vocab.stoi.get(w, len(self.vocab.stoi)-1) 75 | for w in self.datas[item]['sentence'].strip().split()], dtype=torch.long) 76 | embedding = self.word_embedding(word_idxs) 77 | embedding_length = embedding.shape[0] 78 | if embedding_length > self.max_embedding_length: 79 | embedding_padded = embedding[: self.max_embedding_length, :] 80 | embedding_length = self.max_embedding_length 81 | else: 82 | embedding_padded = torch.zeros( 83 | (self.max_embedding_length, embedding.shape[1])) 84 | embedding_padded[: embedding.shape[0], :] = embedding 85 | 86 | for i in range(len(self.datas[item]['frames'])): 87 | frame_names.append(self.datas[item]['frames'][i].split('/')[-1].split('.')[0]) 88 | frame = Image.open(self.datas[item]['frames'][i]).convert('RGB') 89 | frames.append(frame) 90 | w, h = frame.size 91 | 92 | sign = True 93 | if self.datas[item]['label'][i] != 'None': 94 | with h5py.File(self.datas[item]['label'][i], 'r') as file_annotation: 95 | if int(instance) not in list(file_annotation['instance']): 96 | annotation = Image.new('L', (w, h)) 97 | else: 98 | if len(file_annotation['reMask'].shape) != 3: 99 | annotation = file_annotation['reMask'][:] 100 | else: 101 | annotation = file_annotation['reMask'][np.where(file_annotation['instance'][:] == int(instance))][0] 102 | annotation = Image.fromarray(annotation.T) 103 | else: 104 | annotation = Image.new('L', (w, h)) 105 | sign = False 106 | annotations.append(annotation) 107 | is_annotated.append(sign) 108 | 109 | sample = {} 110 | sample['frames'] = frames 111 | sample['label'] = annotations 112 | sample = self.transformation(sample) 113 | sample['word_embedding'] = embedding_padded 114 | sample['embedding_length'] = embedding_length 115 | sample['is_annotated'] = is_annotated 116 | 
sample['video'] = self.datas[item]['video'] 117 | sample['name'] = frame_names 118 | sample['dataset'] = self.datas[item]['dataset'] 119 | sample['instance'] = instance 120 | 121 | return sample 122 | 123 | 124 | 125 | 126 | 127 | -------------------------------------------------------------------------------- /temporal_grounding/utils/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def generate_anchors(windows): 6 | widths = np.array(windows) 7 | center = 7.5 8 | start = center - 0.5 * (widths - 1) 9 | end = center + 0.5 * (widths - 1) 10 | return np.stack([start, end], -1) 11 | 12 | def generate_proposals(max_num_frames, windows): 13 | anchors = generate_anchors(windows) 14 | widths = (anchors[:, 1] - anchors[:, 0] + 1) # [num_anchors] 15 | centers = np.arange(0, max_num_frames) # [video_len] 16 | start = np.expand_dims(centers, 1) - 0.5 * (np.expand_dims(widths, 0) - 1) 17 | end = np.expand_dims(centers, 1) + 0.5 * (np.expand_dims(widths, 0) - 1) 18 | proposals = np.stack([start, end], -1) # [video_len, num_anchors, 2] 19 | return proposals 20 | 21 | def generate_scores(proposals, label, max_num_frames, thres_score, thres_adjmat): 22 | proposals = np.reshape(proposals, [-1, 2]) 23 | illegal = np.logical_or(proposals[:, 0] < 0, proposals[:, 1] >= max_num_frames) 24 | label1 = np.repeat(np.expand_dims(label, 0), proposals.shape[0], 0) 25 | IoUs = calculate_IoU_batch((proposals[:, 0], proposals[:, 1]), 26 | (label1[:, 0], label1[:, 1])) 27 | IoUs[illegal] = 0.0 # [video_len * num_anchors] 28 | max_IoU = np.max(IoUs) 29 | IoUs[IoUs < thres_score * max_IoU] = 0.0 30 | IoUs = IoUs / (max_IoU + 1e-4) 31 | adj_mat = IoUs.copy() 32 | adj_mat[adj_mat < thres_adjmat] = 0.0 # best 0.7 * max_IoU 33 | 34 | scores = IoUs.astype(np.float32) 35 | scores_mask = (1 - illegal).astype(np.uint8) 36 | return scores, scores_mask, adj_mat 37 | 38 | def calculate_IoU_batch(i0, i1): 39 | union = (np.min(np.stack([i0[0], i1[0]], 0), 0), np.max(np.stack([i0[1], i1[1]], 0), 0)) 40 | inter = (np.max(np.stack([i0[0], i1[0]], 0), 0), np.min(np.stack([i0[1], i1[1]], 0), 0)) 41 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 42 | iou[union[1] - union[0] < -1e-5] = 0 43 | iou[iou < 0] = 0.0 44 | return iou 45 | 46 | def calculate_IoU(i0, i1): 47 | union = (min(i0[0], i1[0]), max(i0[1], i1[1])) 48 | inter = (max(i0[0], i1[0]), min(i0[1], i1[1])) 49 | 50 | if union[1] - union[0] < -1e-5: 51 | return 0 52 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 53 | return iou if iou >= 0.0 else 0.0 54 | 55 | 56 | def average_to_fixed_length(visual_input, num_sample_clips): 57 | num_clips = visual_input.shape[0] 58 | idxs = torch.arange(0, num_sample_clips+1, 1.0)/num_sample_clips*num_clips 59 | idxs = torch.min(torch.round(idxs).long(), torch.tensor(num_clips-1)) 60 | new_visual_input = [] 61 | for i in range(num_sample_clips): 62 | s_idx, e_idx = idxs[i].item(), idxs[i+1].item() 63 | if s_idx < e_idx: 64 | new_visual_input.append(torch.mean( 65 | visual_input[s_idx:e_idx], dim=0)) 66 | else: 67 | new_visual_input.append(visual_input[s_idx]) 68 | new_visual_input = torch.stack(new_visual_input, dim=0) 69 | return new_visual_input 70 | 71 | 72 | def nms_temporal(predict_score, predict_windows, overlap): 73 | pick = list() 74 | starts = predict_windows[:, 0] 75 | ends = predict_windows[:, 1] 76 | scores = predict_score 77 | assert len(starts) == len(scores) 78 | if len(starts) == 0: 79 | return pick 80 
| 81 | unions = ends - starts 82 | # sort and get index 83 | indexs = [x[0] for x in sorted(enumerate(scores), key=lambda x:x[1])] 84 | 85 | while len(indexs) > 0: 86 | i = indexs[-1] 87 | pick.append(i) 88 | 89 | lefts = [max(starts[i], starts[j]) for j in indexs[:-1]] 90 | rights = [min(ends[i], ends[j]) for j in indexs[:-1]] 91 | inters = [max(0.0, right-left) for left, right in zip(lefts, rights)] 92 | laps = [inters[u]/(unions[i] + unions[indexs[u]] - inters[u]) 93 | for u in range(len(indexs)-1)] 94 | indexs_new = [] 95 | for j in range(len(laps)): 96 | if laps[j] <= overlap: 97 | indexs_new.append(indexs[j]) 98 | indexs = indexs_new 99 | 100 | return pick 101 | 102 | 103 | def compute_IoU_recall_top_n(predict_windows, gt_windows, picks, top_n, IoU_thresh): 104 | 105 | correct = 0 106 | if top_n < len(picks): 107 | cur_picks = picks[0:top_n] 108 | else: 109 | cur_picks = picks 110 | for index in cur_picks: 111 | pred_start = predict_windows[index][0] 112 | pred_end = predict_windows[index][1] 113 | iou = calculate_IoU(gt_windows, (pred_start, pred_end)) 114 | if iou >= IoU_thresh: 115 | correct += 1 116 | break 117 | 118 | return correct 119 | 120 | 121 | def compute_IoU_recall(predict_score, predict_windows, gt_windows): 122 | 123 | IoU_threshs = [0.1, 0.3, 0.5, 0.7] 124 | top_n_list = [1, 5] 125 | topn_IoU_matric = np.zeros([2, 4], dtype=np.float32) 126 | 127 | for i, IoU_thresh in enumerate(IoU_threshs): 128 | picks = nms_temporal(predict_score, predict_windows, IoU_thresh-0.05) 129 | 130 | for j, top_n in enumerate(top_n_list): 131 | correct = compute_IoU_recall_top_n( 132 | predict_windows, gt_windows, picks, top_n, IoU_thresh) 133 | topn_IoU_matric[j, i] = correct 134 | 135 | return topn_IoU_matric 136 | 137 | 138 | class CountMeter(object): 139 | """Computes and stores the average and current value""" 140 | 141 | def __init__(self): 142 | self.reset() 143 | 144 | def reset(self): 145 | self.val = np.zeros([2, 4], dtype=np.float32) 146 | self.count = 0 147 | 148 | def update(self, val, n=1): 149 | self.val += val 150 | self.count += n 151 | 152 | 153 | class AverageMeter(object): 154 | """Computes and stores the average and current value""" 155 | 156 | def __init__(self): 157 | self.reset() 158 | 159 | def reset(self): 160 | self.val = 0 161 | self.avg = 0 162 | self.sum = 0 163 | self.count = 0 164 | 165 | def update(self, val, n=1): 166 | self.val = val 167 | self.sum += val * n 168 | self.count += n 169 | self.avg = self.sum / self.count 170 | -------------------------------------------------------------------------------- /referring_segmentation/model/module/attention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from einops import rearrange, repeat 5 | from torch import nn, einsum 6 | 7 | 8 | class GlobalTextPresentation(nn.Module): 9 | def __init__(self, text_dim): 10 | super(GlobalTextPresentation, self).__init__() 11 | self.W_txt = nn.Linear(text_dim, text_dim) 12 | 13 | def forward(self, fea_text, mask=None): 14 | fea_text = fea_text.permute(0, 2, 1) # B*L*C2 15 | weight_text = self.W_txt(fea_text) # B*L*C 16 | if mask is not None: 17 | mask = mask.permute(0, 2, 1) 18 | weight_text = weight_text.masked_fill(mask == 0, -1e9) 19 | weight_text = weight_text.softmax(dim=1) 20 | fea_text_global = fea_text * weight_text 21 | fea_text_global = fea_text_global.sum(dim=1, keepdim=True).permute(0, 2, 1).unsqueeze(-1) # B*C*1*1 22 | return fea_text_global 23 | 
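# Toy shape check for GlobalTextPresentation (sizes are only illustrative):
#   gtp = GlobalTextPresentation(text_dim=300)
#   fea_text = torch.randn(2, 300, 20)   # B*C*L word-level text features
#   mask = torch.ones(2, 1, 20)          # B*1*L, zeros mark padded words
#   out = gtp(fea_text, mask)            # B*C*1*1 -> torch.Size([2, 300, 1, 1])
# i.e. the word features are softmax-weighted along L and pooled into a single
# global sentence vector that can be broadcast over a spatial feature map.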
24 | 25 | class GlobalAttention(nn.Module): 26 | def __init__(self, video_feature_dim, text_dim, global_attention_dim): 27 | super(GlobalAttention, self).__init__() 28 | self.scale = global_attention_dim ** -0.5 29 | 30 | self.Q = nn.Linear(video_feature_dim+text_dim+8, global_attention_dim) 31 | self.K = nn.Linear(text_dim, global_attention_dim) 32 | self.V = nn.Linear(text_dim, global_attention_dim) 33 | 34 | def forward(self, fea_video, fea_text): 35 | """ 36 | :param fea_video: B*(C1+C2+8)*H*W 37 | :param fea_text: B*C2*1*1 38 | :param mask: B*1*L 39 | :return: 40 | """ 41 | B, C1, H, W = fea_video.shape 42 | B, C2, _, _ = fea_text.shape 43 | fea_video = fea_video.view(B, C1, -1).permute(0, 2, 1) 44 | fea_text = fea_text.view(B, C2, -1).permute(0, 2, 1) 45 | 46 | 47 | q = self.Q(fea_video) 48 | k = self.K(fea_text) 49 | v = self.V(fea_text) 50 | 51 | att = torch.matmul(q, k.permute(0, 2, 1)) * self.scale # B*HW*1 52 | att = att.softmax(-1) 53 | out = torch.matmul(att, v) # B*HW*C 54 | out = out.permute(0, 2, 1).view(B, -1, H, W) 55 | return out 56 | 57 | 58 | class LocalAttention(nn.Module): 59 | def __init__(self, video_feature_dim, text_dim, attention_dim): 60 | super(LocalAttention, self).__init__() 61 | self.scale = attention_dim ** -0.5 62 | 63 | self.Q = nn.Linear(video_feature_dim, attention_dim) 64 | self.K = nn.Linear(text_dim, attention_dim) 65 | self.V = nn.Linear(text_dim, attention_dim) 66 | 67 | def forward(self, fea_video, fea_text, mask): 68 | """ 69 | :param fea_video: B*C*T*H*W 70 | :param fea_text: B*C*L 71 | :param mask: B*HW*L 72 | :return: 73 | """ 74 | 75 | B, C, T, H, W = fea_video.shape 76 | fea_frames = fea_video.chunk(T, dim=2) 77 | fea_text = fea_text.permute(0, 2, 1) # B*L*C 78 | outs = [] 79 | for fea_frame in fea_frames: 80 | fea_frame = fea_frame.view(B, C, -1).permute(0, 2, 1) # B*HW*C 81 | 82 | q = self.Q(fea_frame) 83 | k = self.K(fea_text) 84 | v = self.V(fea_text) 85 | 86 | att = torch.matmul(q, k.permute(0, 2, 1)) * self.scale # B*HW*L 87 | if mask is not None: 88 | att = att.masked_fill(mask == 0, -1e9) 89 | att = att.softmax(-1) 90 | out = torch.matmul(att, v) # B*HW*C 91 | out = out.permute(0, 2, 1).view(B, C, H, W).unsqueeze(2) 92 | outs.append(out) 93 | outs = torch.cat(outs, dim=2) 94 | return outs 95 | 96 | 97 | class MuTan(nn.Module): 98 | def __init__(self, video_fea_dim, text_fea_dim, out_fea_dim, heads = 5): 99 | super(MuTan, self).__init__() 100 | 101 | self.heads = heads 102 | self.Wv = nn.ModuleList([nn.Conv2d(video_fea_dim+8, out_fea_dim, 1, 1) for i in range(heads)]) 103 | self.Wt = nn.ModuleList([nn.Conv2d(text_fea_dim, out_fea_dim, 1, 1) for i in range(heads)]) 104 | 105 | def forward(self, video_fea, text_fea, spatial): 106 | video_fea = torch.cat([video_fea, spatial], dim=1) 107 | fea_outs = [] 108 | for i in range(self.heads): 109 | fea_v = self.Wv[i](video_fea) 110 | fea_v = torch.tanh(fea_v) # B*C*H*W 111 | 112 | fea_t = self.Wt[i](text_fea) 113 | fea_t = torch.tanh(fea_t) # B*C*1*1 114 | 115 | fea_out = fea_v * fea_t 116 | fea_outs.append(fea_out.unsqueeze(-1)) 117 | fea_outs = torch.cat(fea_outs, dim=-1) 118 | fea_outs = torch.sum(fea_outs, dim=-1) 119 | mutan_fea = torch.tanh(fea_outs) 120 | mutan_fea = F.normalize(mutan_fea, dim=1) 121 | return mutan_fea 122 | 123 | class RelevanceFilter(nn.Module): 124 | def __init__(self, text_fea_dim, video_fea_dim, attention_dim, groups=8, kernelsize=(1, 1, 1)): 125 | super(RelevanceFilter, self).__init__() 126 | assert text_fea_dim % groups == 0 127 | assert attention_dim % groups == 
0 128 | self.groups = groups 129 | self.Wv = nn.Conv3d(video_fea_dim, 2 * attention_dim, 1, 1) 130 | 131 | self.Wt = nn.Linear(text_fea_dim, attention_dim * 132 | kernelsize[0] * kernelsize[1] * kernelsize[2]) 133 | self.kernel_size = kernelsize 134 | 135 | def forward(self, video_fea, text_fea): 136 | 137 | fea = self.Wv(video_fea) # B*C*T*H*W 138 | B, C, T, H, W = video_fea.shape 139 | k, v = fea.chunk(2, dim=1) 140 | kernel = self.Wt(text_fea) # B*L*(C*K*K) 141 | kernel = repeat(kernel, 'b l (g c t h w) -> (b g l) c t h w', 142 | t=self.kernel_size[0], h=self.kernel_size[1], w=self.kernel_size[2], g=self.groups) 143 | k = repeat(k, 'b c t h w -> n (b c) t h w', n=1) 144 | att = F.conv3d(k, kernel, padding=( 145 | self.kernel_size[0]//2, self.kernel_size[1]//2, self.kernel_size[2]//2), groups=B*self.groups) 146 | att = rearrange( 147 | att, 'n (b g c) t h w -> (n b) g c t h w', b=B, g=self.groups) 148 | active_map = att.mean(dim=1) 149 | v = rearrange(v, 'b (g c) t h w -> b g c t h w', g=self.groups) 150 | out = v * torch.sigmoid(att) 151 | out = rearrange(out, 'b g c t h w -> b (g c) t h w') 152 | maps = active_map.permute( 153 | 2, 0, 1, 3, 4) 154 | maps = [maps[i] for i in range(T)] 155 | return maps, out 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/criterion.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import torch.nn as nn 4 | from .anchor_utils import generate_proposals, generate_scores, generate_2d_gaussian 5 | from einops import repeat, rearrange 6 | from .utils import generate_anchor_scores, compute_temporal_reg_tar, segment_tiou 7 | from .utils import generate_2d_gaussian as generate_2d_gaussian_new 8 | 9 | 10 | class SetCriterion(nn.Module): 11 | def __init__(self, cfg): 12 | super().__init__() 13 | self.cfg = cfg 14 | self.temporal_reg_loss = nn.SmoothL1Loss() 15 | self.temporal_cls_loss = nn.BCELoss() 16 | self.spatial_hm_loss = nn.BCELoss() 17 | self.spatial_wh_loss = nn.SmoothL1Loss() 18 | 19 | def loss_spatial(self, outputs, targets, inter_idx): 20 | inter_idx = inter_idx[0] 21 | h, w = outputs['spatial_map'].shape[-2:] 22 | box_gt = [targets[i]['boxes'] for i in range(len(targets))] 23 | box_gt = torch.cat(box_gt, dim=0) # k*4, cxcywh 24 | 25 | size_gt = [targets[i]['size'] 26 | for i in range(len(targets))] # current input frame size 27 | size_gt = torch.stack(size_gt) # [\sigma t_i, 2] 28 | padded_size_gt = torch.max(size_gt, dim=0)[0] # [2] 29 | box_gt_unnormed = box_gt * \ 30 | torch.stack([size_gt[:, 1], size_gt[:, 0], 31 | size_gt[:, 1], size_gt[:, 0]], dim=-1) 32 | padded_box_gt = box_gt_unnormed / \ 33 | torch.stack([padded_size_gt[1], padded_size_gt[0], 34 | padded_size_gt[1], padded_size_gt[0]], dim=-1)[None] 35 | gaussian_gt = generate_2d_gaussian(padded_box_gt, w, h, delta=0.05)[ 36 | :, None] # k*1*h*w 37 | wh_gt = (padded_box_gt[:, 2:] * torch.as_tensor([w, h])[None, 38 | :].to(box_gt.device))[..., None, None].repeat(1, 1, h, w) 39 | 40 | pred_hm = outputs['spatial_map'] # k*1*h*w 41 | pred_wh = outputs['spatial_wh'] # k*2*h*w 42 | 43 | loss_hm = self.spatial_hm_loss(pred_hm, gaussian_gt) 44 | loss_wh = self.spatial_wh_loss(pred_wh*gaussian_gt, wh_gt*gaussian_gt) 45 | loss_map = 0 46 | for map in outputs['maps']: 47 | map = F.interpolate( 48 | map, (h, w), mode='bilinear', align_corners=True) 49 | loss_map += self.spatial_hm_loss(map, 
gaussian_gt) 50 | 51 | return { 52 | 'spatial_hm_loss': loss_hm, 53 | 'spatial_wh_loss': loss_wh, 54 | 'spatial_map_loss': loss_map 55 | }, gaussian_gt 56 | 57 | def loss_temporal(self, outputs, durations, inter_idx): 58 | device = outputs['spatial_map'].device 59 | seq_len = max(durations) 60 | b = len(durations) 61 | inter_idx = torch.as_tensor(inter_idx).float().to(device) # [b, 2] 62 | index = torch.as_tensor([i for i in range(seq_len)]).to(device)[ 63 | None].repeat(b, 1) # [b, t] 64 | inter_idx_expand = inter_idx[:, None].repeat( 65 | 1, seq_len, 1) # [b, t, 2] 66 | # [b, t], 1 for moments when action happens, otherwise 0 67 | action_gt = ((index >= inter_idx_expand[..., 0]) & ( 68 | index <= inter_idx_expand[..., 1])).float() 69 | 70 | # [b, t] "True" represent the padded moment 71 | time_mask = torch.ones(b, seq_len).bool().to(device) 72 | for i_dur, duration in enumerate(durations): 73 | time_mask[i_dur, :duration] = False 74 | if self.cfg.temporal_decoder_type == 'anchor': 75 | proposals = generate_proposals(seq_len, self.cfg.temporal_window_width)[ 76 | None].repeat(b, 1, 1).to(device) # [b, t*n_window, 2] 77 | score_gt, score_mask = generate_anchor_scores( 78 | proposals, inter_idx, seq_len, self.cfg.temporal_score_thres) 79 | time_mask_expanded = repeat( 80 | time_mask, 'b t -> b (t n)', n=len(self.cfg.temporal_window_width)) 81 | score_mask[time_mask_expanded] = True # [b, t*n_window] 82 | score_pos = (score_gt >= self.cfg.temporal_valid_thres).float() 83 | score_pos = score_pos.masked_fill(time_mask_expanded, 0.) 84 | reg_gt = inter_idx[:, None].repeat(1, proposals.shape[1], 1) 85 | refined_box = outputs['temporal_offset'] + \ 86 | proposals # [b, t*n_window, 2] 87 | loss_reg = self.temporal_reg_loss( 88 | refined_box*score_pos[..., None], reg_gt*score_pos[..., None]) 89 | loss_cls = self.temporal_cls_loss(outputs['temporal_score'].masked_fill(time_mask_expanded[..., None], 0.), 90 | score_gt.masked_fill(time_mask_expanded[..., None], 0.)) 91 | return { 92 | 'temporal_cls_loss': loss_cls, 93 | 'temporal_align_loss': loss_reg 94 | } 95 | 96 | elif self.cfg.temporal_decoder_type == 'regression': 97 | pred_start = index - outputs['temporal_reg'][:, :, 0] 98 | pred_end = index + outputs['temporal_reg'][:, :, 1] 99 | predictions = torch.stack([pred_start, pred_end], dim=-1) / seq_len 100 | predictions = torch.clamp(predictions, 0, 1) 101 | label_reg = compute_temporal_reg_tar(inter_idx, action_gt) 102 | label_iou = segment_tiou(predictions, inter_idx[:, None] / seq_len) 103 | iou_pos_ind = label_iou > 0.5 104 | pos_iou_target = label_iou[iou_pos_ind] 105 | pos_iou_pred = outputs['temporal_iou'][iou_pos_ind] 106 | loss_reg = self.temporal_reg_loss( 107 | outputs['temporal_reg'] * action_gt.unsqueeze(-1), label_reg) 108 | loss_score = self.temporal_cls_loss( 109 | outputs['temporal_score'], action_gt) 110 | if iou_pos_ind.sum().item() == 0: 111 | loss_iou = 0 112 | else: 113 | loss_iou = self.temporal_cls_loss( 114 | pos_iou_pred, pos_iou_target.detach()) 115 | return { 116 | 'temporal_score_loss': loss_score, 117 | 'temporal_reg_loss': loss_reg, 118 | 'temporal_iou_loss': loss_iou 119 | } 120 | else: 121 | raise NotImplementedError 122 | 123 | def forward(self, outputs, durations, inter_idx, targets): 124 | loss_dict = self.loss_temporal(outputs, durations, inter_idx) 125 | loss_dict_s, gaussian_gt = self.loss_spatial( 126 | outputs, targets, inter_idx) 127 | loss_dict.update(loss_dict_s) 128 | return loss_dict, gaussian_gt 129 | 
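The temporal branch above leans on the small helpers imported from models/utils.py (compute_temporal_reg_tar and segment_tiou). A minimal sanity check with made-up numbers, assuming it is run from the spatiotemporal_grounding directory so that the models package resolves:

import torch
from models.utils import compute_temporal_reg_tar, segment_tiou

# One video, 8 temporal positions, ground-truth moment spanning frames 2..5.
inter_idx = torch.tensor([[2., 5.]])        # [b, 2] start/end indices
action_gt = torch.zeros(1, 8)
action_gt[0, 2:6] = 1.0                     # [b, t] "action happens here" mask

# Per-frame regression targets: (distance past start, distance to end),
# zeroed outside the ground-truth span.
reg_tar = compute_temporal_reg_tar(inter_idx, action_gt)   # [1, 8, 2]
print(reg_tar[0, 3])                        # tensor([1., 2.]) for frame 3

# Temporal IoU of a candidate segment against the ground truth.
pred = torch.tensor([[[1., 4.]]])           # [b, k, 2]
print(segment_tiou(pred, inter_idx[:, None]))  # ~0.5 (intersection 2, union 4)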
-------------------------------------------------------------------------------- /referring_segmentation/model/backbone/mobilenet.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import torch.nn as nn 4 | import math 5 | # from modeling.sync_batchnorm.batchnorm import SynchronizedBatchNorm2d 6 | import torch.utils.model_zoo as model_zoo 7 | 8 | 9 | 10 | def conv_bn(inp, oup, stride, BatchNorm): 11 | return nn.Sequential( 12 | nn.Conv2d(inp, oup, 3, stride, 1, bias=False), 13 | BatchNorm(oup), 14 | nn.ReLU6(inplace=True) 15 | ) 16 | 17 | 18 | def fixed_padding(inputs, kernel_size, dilation): 19 | kernel_size_effective = kernel_size + (kernel_size - 1) * (dilation - 1) 20 | pad_total = kernel_size_effective - 1 21 | pad_beg = pad_total // 2 22 | pad_end = pad_total - pad_beg 23 | padded_inputs = F.pad(inputs, (pad_beg, pad_end, pad_beg, pad_end)) 24 | return padded_inputs 25 | 26 | 27 | class InvertedResidual(nn.Module): 28 | def __init__(self, inp, oup, stride, dilation, expand_ratio, BatchNorm): 29 | super(InvertedResidual, self).__init__() 30 | self.stride = stride 31 | assert stride in [1, 2] 32 | 33 | hidden_dim = round(inp * expand_ratio) 34 | self.use_res_connect = self.stride == 1 and inp == oup 35 | self.kernel_size = 3 36 | self.dilation = dilation 37 | 38 | if expand_ratio == 1: 39 | self.conv = nn.Sequential( 40 | # dw 41 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 0, dilation, groups=hidden_dim, bias=False), 42 | BatchNorm(hidden_dim), 43 | nn.ReLU6(inplace=True), 44 | # pw-linear 45 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, 1, 1, bias=False), 46 | BatchNorm(oup), 47 | ) 48 | else: 49 | self.conv = nn.Sequential( 50 | # pw 51 | nn.Conv2d(inp, hidden_dim, 1, 1, 0, 1, bias=False), 52 | BatchNorm(hidden_dim), 53 | nn.ReLU6(inplace=True), 54 | # dw 55 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 0, dilation, groups=hidden_dim, bias=False), 56 | BatchNorm(hidden_dim), 57 | nn.ReLU6(inplace=True), 58 | # pw-linear 59 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, 1, bias=False), 60 | BatchNorm(oup), 61 | ) 62 | 63 | def forward(self, x): 64 | x_pad = fixed_padding(x, self.kernel_size, dilation=self.dilation) 65 | if self.use_res_connect: 66 | x = x + self.conv(x_pad) 67 | else: 68 | x = self.conv(x_pad) 69 | return x 70 | 71 | 72 | class MobileNetV2(nn.Module): 73 | def __init__(self, inchannel, output_stride=8, BatchNorm=None, width_mult=1., pretrained=True): 74 | super(MobileNetV2, self).__init__() 75 | block = InvertedResidual 76 | input_channel = 32 77 | current_stride = 1 78 | rate = 1 79 | interverted_residual_setting = [ 80 | # t, c, n, s 81 | [1, 16, 1, 1], 82 | [6, 24, 2, 2], 83 | [6, 32, 3, 2], 84 | [6, 64, 4, 2], 85 | [6, 96, 3, 1], 86 | [6, 160, 3, 2], 87 | [6, 320, 1, 1], 88 | ] 89 | 90 | # building first layer 91 | input_channel = int(input_channel * width_mult) 92 | self.features = [conv_bn(inchannel, input_channel, 2, BatchNorm)] 93 | current_stride *= 2 94 | # building inverted residual blocks 95 | for t, c, n, s in interverted_residual_setting: 96 | if current_stride == output_stride: 97 | stride = 1 98 | dilation = rate 99 | rate *= s 100 | else: 101 | stride = s 102 | dilation = 1 103 | current_stride *= s 104 | output_channel = int(c * width_mult) 105 | for i in range(n): 106 | if i == 0: 107 | self.features.append(block(input_channel, output_channel, stride, dilation, t, BatchNorm)) 108 | else: 109 | self.features.append(block(input_channel, output_channel, 1, dilation, t, 
BatchNorm)) 110 | input_channel = output_channel 111 | self.features = nn.Sequential(*self.features) 112 | self._initialize_weights() 113 | 114 | if pretrained: 115 | self._load_pretrained_model() 116 | 117 | self.low_level_features = self.features[0:4] 118 | self.high_level_features = self.features[4:] 119 | 120 | def forward(self, x): 121 | low_level_feat = self.low_level_features(x) 122 | x = self.high_level_features(low_level_feat) 123 | return x, low_level_feat 124 | 125 | def _load_pretrained_model(self): 126 | pretrain_dict = model_zoo.load_url('http://jeff95.me/models/mobilenet_v2-6a65762b.pth') 127 | model_dict = {} 128 | state_dict = self.state_dict() 129 | for k, v in pretrain_dict.items(): 130 | if k in state_dict: 131 | model_dict[k] = v 132 | state_dict.update(model_dict) 133 | self.load_state_dict(state_dict) 134 | 135 | def _initialize_weights(self): 136 | for m in self.modules(): 137 | if isinstance(m, nn.Conv2d): 138 | # n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels 139 | # m.weight.data.normal_(0, math.sqrt(2. / n)) 140 | torch.nn.init.kaiming_normal_(m.weight) 141 | # elif isinstance(m, SynchronizedBatchNorm2d): 142 | # m.weight.data.fill_(1) 143 | # m.bias.data.zero_() 144 | elif isinstance(m, nn.BatchNorm2d): 145 | m.weight.data.fill_(1) 146 | m.bias.data.zero_() 147 | 148 | 149 | class Mobilenet_deeplab(nn.Module): 150 | def __init__(self, inchannel=3, os=16, pretrained=False): 151 | super(Mobilenet_deeplab, self).__init__() 152 | self.backbone = MobileNetV2(inchannel, os, BatchNorm=nn.BatchNorm2d, pretrained=False) 153 | if pretrained: 154 | self._load_pretrained_model() 155 | 156 | def _load_pretrained_model(self): 157 | root = '/media/HardDisk/wwk/video_text/codes/code1/model/pretrained/deeplab-mobilenet.pth.tar' 158 | pretrain_dict = torch.load(root) 159 | state_dict = {k: v for k, v in pretrain_dict['state_dict'].items() if k in self.state_dict().keys()} 160 | # model_dict = {} 161 | # state_dict = self.state_dict() 162 | # for k, v in pretrain_dict.items(): 163 | # if k in state_dict: 164 | # model_dict[k] = v 165 | # state_dict.update(model_dict) 166 | self.load_state_dict(state_dict) 167 | 168 | def forward(self, x): 169 | x, low_level_feat = self.backbone(x) 170 | return x, low_level_feat 171 | 172 | -------------------------------------------------------------------------------- /referring_segmentation/model/module/TCN.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from model.module.attention import LocalAttention, RelevanceFilter 5 | 6 | 7 | class TCN(nn.Module): 8 | def __init__(self, text_dim, inchannel, hidden_channel, outchannel, layers=8, padding_type='zero', with_local_attention=True, conv_type='3D', local_attention_type='relevance_filter', groups=8, norm_type='GroupNorm'): 9 | super(TCN, self).__init__() 10 | self.padding_type = padding_type 11 | self.with_local_attention = with_local_attention 12 | self.local_attention_type = local_attention_type 13 | self.conv_time = nn.ModuleList() 14 | self.conv_spatial = nn.ModuleList() 15 | self.conv_convert = nn.ModuleList() 16 | self.dilations = [] 17 | self.local_attention = nn.ModuleList() 18 | # self.global_txt_W = nn.ModuleList() 19 | for i in range(layers): 20 | # self.global_txt_W.append(nn.Linear(text_dim, hidden_channel)) 21 | dilation = torch.pow(torch.tensor(2), i) 22 | dilation = int(dilation) 23 | self.dilations.append(dilation) 24 | if with_local_attention: 25 | if 
local_attention_type == 'attention': 26 | self.local_attention.append(LocalAttention(inchannel, text_dim, inchannel)) 27 | else: 28 | self.local_attention.append(RelevanceFilter(text_dim, inchannel, inchannel, groups=groups)) 29 | else: 30 | self.local_attention.append(nn.Identity()) 31 | 32 | if conv_type == '3D': 33 | self.conv_spatial.append(nn.Identity()) 34 | if norm_type == "GroupNorm": 35 | self.conv_time.append( 36 | nn.Sequential( 37 | nn.Conv3d(inchannel, hidden_channel, (3, 3, 3), 1, (0, 1, 1), (dilation, 1, 1), bias=False), 38 | nn.GroupNorm(8, hidden_channel), 39 | nn.ReLU(inplace=True)) 40 | ) 41 | else: 42 | self.conv_time.append( 43 | nn.Sequential( 44 | nn.Conv3d(inchannel, hidden_channel, (3, 3, 3), 1, (0, 1, 1), (dilation, 1, 1), bias=False), 45 | nn.BatchNorm3d(hidden_channel), 46 | nn.ReLU(inplace=True)) 47 | ) 48 | 49 | else: 50 | if norm_type == "GroupNorm": 51 | self.conv_spatial.append( 52 | nn.Sequential( 53 | nn.Conv3d(inchannel, hidden_channel, (1, 3, 3), 1, (0, 1, 1), (1, 1, 1), bias=False), 54 | nn.GroupNorm(8, hidden_channel), 55 | nn.ReLU(inplace=True) 56 | ) 57 | ) 58 | self.conv_time.append( 59 | nn.Sequential( 60 | nn.Conv3d(hidden_channel, hidden_channel, (3, 1, 1), (1, 1, 1), (0, 0, 0), (dilation, 1, 1), bias=False), 61 | nn.GroupNorm(8, hidden_channel), 62 | nn.ReLU(inplace=True) 63 | ) 64 | ) 65 | else: 66 | self.conv_spatial.append( 67 | nn.Sequential( 68 | nn.Conv3d(inchannel, hidden_channel, (1, 3, 3), 1, (0, 1, 1), (1, 1, 1), bias=False), 69 | nn.BatchNorm3d(hidden_channel), 70 | nn.ReLU(inplace=True) 71 | ) 72 | ) 73 | self.conv_time.append( 74 | nn.Sequential( 75 | nn.Conv3d(hidden_channel, hidden_channel, (3, 1, 1), (1, 1, 1), (0, 0, 0), (dilation, 1, 1), bias=False), 76 | nn.BatchNorm3d(hidden_channel), 77 | nn.ReLU(inplace=True) 78 | ) 79 | ) 80 | if norm_type == "GroupNorm": 81 | self.conv_convert.append( 82 | nn.Sequential( 83 | nn.Conv3d(hidden_channel, outchannel, 1, 1, bias=False), 84 | nn.GroupNorm(8, outchannel) 85 | ) 86 | ) 87 | else: 88 | self.conv_convert.append( 89 | nn.Sequential( 90 | nn.Conv3d(hidden_channel, outchannel, 1, 1, bias=False), 91 | nn.BatchNorm3d(outchannel) 92 | ) 93 | ) 94 | self.__init_weight() 95 | 96 | def __init_weight(self): 97 | for m in self.modules(): 98 | if isinstance(m, nn.Conv3d): 99 | torch.nn.init.kaiming_normal_(m.weight) 100 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 101 | m.weight.data.fill_(1) 102 | m.bias.data.zero_() 103 | 104 | def forward(self, fea, fea_text, mask_local): 105 | fea_text = fea_text.permute(0, 2, 1) # B*L*C 106 | maps_layers = [] 107 | for i in range(len(self.conv_time)): 108 | res0 = fea 109 | 110 | if self.with_local_attention: 111 | if self.local_attention_type == 'attention': 112 | fea = self.local_attention[i](fea, fea_text, mask_local) 113 | else: 114 | maps, fea = self.local_attention[i](fea, fea_text) 115 | maps_layers.append(maps) 116 | fea = res0 + fea 117 | 118 | res1 = fea 119 | fea = self.conv_spatial[i](fea) 120 | 121 | if self.padding_type == 'circle': 122 | fea = circle_padding(self.dilations[i], fea) 123 | elif self.padding_type == 'zero': 124 | fea = F.pad(fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='constant', value=0) 125 | else: 126 | fea = F.pad(fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='circular') 127 | 128 | fea = self.conv_time[i](fea) # B*C*T 129 | fea = self.conv_convert[i](fea) 130 | fea = fea + res1 131 | return fea, maps_layers 132 | 133 | 134 | def circle_padding(padding, feature): 135 | 
length_times = feature.shape[2] 136 | index = list(range(0, length_times)) + list(range(length_times - 2, 0, -1)) 137 | total_num = 2 * padding + length_times 138 | num_c = padding // len(index) 139 | if num_c * len(index) < padding: 140 | num_c = num_c + 1 141 | expand_number = num_c * len(index) - padding 142 | index_f = [] 143 | for n in range(num_c): 144 | index = index + index + index 145 | for i in range(expand_number, expand_number + total_num): 146 | index_f.append(index[i]) 147 | 148 | feas = [] 149 | for idf in index_f: 150 | feas.append(feature[:, :, idf, :, :].unsqueeze(2)) 151 | feas = torch.cat(feas, dim=2) 152 | return feas 153 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/metrics.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 2 | """ 3 | Various utilities related to track and report metrics 4 | """ 5 | import datetime 6 | import time 7 | from collections import defaultdict, deque 8 | 9 | import torch 10 | import torch.distributed as dist 11 | 12 | from util.dist import is_dist_avail_and_initialized 13 | 14 | 15 | class SmoothedValue: 16 | """Track a series of values and provide access to smoothed values over a 17 | window or the global series average. 18 | """ 19 | 20 | def __init__(self, window_size=20, fmt=None): 21 | if fmt is None: 22 | fmt = "{median:.4f} ({global_avg:.4f})" 23 | self.deque = deque(maxlen=window_size) 24 | self.total = 0.0 25 | self.count = 0 26 | self.fmt = fmt 27 | 28 | def update(self, value, num=1): 29 | self.deque.append(value) 30 | self.count += num 31 | self.total += value * num 32 | 33 | def synchronize_between_processes(self): 34 | """ 35 | Distributed synchronization of the metric 36 | Warning: does not synchronize the deque! 
37 | """ 38 | if not is_dist_avail_and_initialized(): 39 | return 40 | t = torch.tensor([self.count, self.total], dtype=torch.float64, device="cuda") 41 | dist.barrier() 42 | dist.all_reduce(t) 43 | t = t.tolist() 44 | self.count = int(t[0]) 45 | self.total = t[1] 46 | 47 | @property 48 | def median(self): 49 | d = torch.tensor(list(self.deque)) 50 | return d.median().item() 51 | 52 | @property 53 | def avg(self): 54 | d = torch.tensor(list(self.deque), dtype=torch.float32) 55 | return d.mean().item() 56 | 57 | @property 58 | def global_avg(self): 59 | return self.total / self.count 60 | 61 | @property 62 | def max(self): 63 | return max(self.deque) 64 | 65 | @property 66 | def value(self): 67 | return self.deque[-1] 68 | 69 | def __str__(self): 70 | return self.fmt.format( 71 | median=self.median, 72 | avg=self.avg, 73 | global_avg=self.global_avg, 74 | max=self.max, 75 | value=self.value, 76 | ) 77 | 78 | 79 | class MetricLogger(object): 80 | def __init__(self, delimiter="\t"): 81 | self.meters = defaultdict(SmoothedValue) 82 | self.delimiter = delimiter 83 | 84 | def update(self, **kwargs): 85 | for k, v in kwargs.items(): 86 | if isinstance(v, torch.Tensor): 87 | v = v.item() 88 | assert isinstance(v, (float, int)) 89 | self.meters[k].update(v) 90 | 91 | def __getattr__(self, attr): 92 | if attr in self.meters: 93 | return self.meters[attr] 94 | if attr in self.__dict__: 95 | return self.__dict__[attr] 96 | raise AttributeError( 97 | "'{}' object has no attribute '{}'".format(type(self).__name__, attr) 98 | ) 99 | 100 | def __str__(self): 101 | loss_str = [] 102 | for name, meter in self.meters.items(): 103 | loss_str.append("{}: {}".format(name, str(meter))) 104 | return self.delimiter.join(loss_str) 105 | 106 | def synchronize_between_processes(self): 107 | for meter in self.meters.values(): 108 | meter.synchronize_between_processes() 109 | 110 | def add_meter(self, name, meter): 111 | self.meters[name] = meter 112 | 113 | def log_every(self, iterable, print_freq, header=None): 114 | i = 0 115 | if not header: 116 | header = "" 117 | start_time = time.time() 118 | end = time.time() 119 | iter_time = SmoothedValue(fmt="{avg:.4f}") 120 | data_time = SmoothedValue(fmt="{avg:.4f}") 121 | space_fmt = ":" + str(len(str(len(iterable)))) + "d" 122 | if torch.cuda.is_available(): 123 | log_msg = self.delimiter.join( 124 | [ 125 | header, 126 | "[{0" + space_fmt + "}/{1}]", 127 | "eta: {eta}", 128 | "{meters}", 129 | "time: {time}", 130 | "data: {data}", 131 | "max mem: {memory:.0f}", 132 | ] 133 | ) 134 | else: 135 | log_msg = self.delimiter.join( 136 | [ 137 | header, 138 | "[{0" + space_fmt + "}/{1}]", 139 | "eta: {eta}", 140 | "{meters}", 141 | "time: {time}", 142 | "data: {data}", 143 | ] 144 | ) 145 | MB = 1024.0 * 1024.0 146 | for obj in iterable: 147 | data_time.update(time.time() - end) 148 | yield obj 149 | iter_time.update(time.time() - end) 150 | if i % print_freq == 0 or i == len(iterable) - 1: 151 | eta_seconds = iter_time.global_avg * (len(iterable) - i) 152 | eta_string = str(datetime.timedelta(seconds=int(eta_seconds))) 153 | if torch.cuda.is_available(): 154 | print( 155 | log_msg.format( 156 | i, 157 | len(iterable), 158 | eta=eta_string, 159 | meters=str(self), 160 | time=str(iter_time), 161 | data=str(data_time), 162 | memory=torch.cuda.max_memory_allocated() / MB, 163 | ) 164 | ) 165 | else: 166 | print( 167 | log_msg.format( 168 | i, 169 | len(iterable), 170 | eta=eta_string, 171 | meters=str(self), 172 | time=str(iter_time), 173 | data=str(data_time), 174 | ) 175 | ) 
176 | i += 1 177 | end = time.time() 178 | total_time = time.time() - start_time 179 | total_time_str = str(datetime.timedelta(seconds=int(total_time))) 180 | print( 181 | "{} Total time: {} ({:.4f} s / it)".format( 182 | header, total_time_str, total_time / len(iterable) 183 | ) 184 | ) 185 | torch.cuda.reset_peak_memory_stats() 186 | 187 | 188 | @torch.no_grad() 189 | def accuracy(output, target, topk=(1,)): 190 | """Computes the precision@k for the specified values of k""" 191 | if target.numel() == 0: 192 | return [torch.zeros([], device=output.device)] 193 | maxk = max(topk) 194 | batch_size = target.size(0) 195 | 196 | _, pred = output.topk(maxk, 1, True, True) 197 | pred = pred.t() 198 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 199 | 200 | res = [] 201 | for k in topk: 202 | correct_k = correct[:k].view(-1).float().sum(0) 203 | res.append(correct_k.mul_(100.0 / batch_size)) 204 | return res 205 | -------------------------------------------------------------------------------- /referring_segmentation/utils/loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.autograd import Variable 4 | import numpy as np 5 | from math import exp 6 | 7 | 8 | def gaussian(window_size, sigma): 9 | gauss = torch.Tensor( 10 | [exp(-(x - window_size//2)**2/float(2*sigma**2)) for x in range(window_size)]) 11 | return gauss/gauss.sum() 12 | 13 | 14 | def create_window(window_size, channel): 15 | _1D_window = gaussian(window_size, 1.5).unsqueeze(1) 16 | _2D_window = _1D_window.mm( 17 | _1D_window.t()).float().unsqueeze(0).unsqueeze(0) 18 | window = Variable(_2D_window.expand( 19 | channel, 1, window_size, window_size).contiguous()) 20 | return window 21 | 22 | 23 | def _ssim(img1, img2, window, window_size, channel, size_average=True): 24 | mu1 = F.conv2d(img1, window, padding=window_size//2, groups=channel) 25 | mu2 = F.conv2d(img2, window, padding=window_size//2, groups=channel) 26 | 27 | mu1_sq = mu1.pow(2) 28 | mu2_sq = mu2.pow(2) 29 | mu1_mu2 = mu1*mu2 30 | 31 | sigma1_sq = F.conv2d( 32 | img1*img1, window, padding=window_size//2, groups=channel) - mu1_sq 33 | sigma2_sq = F.conv2d( 34 | img2*img2, window, padding=window_size//2, groups=channel) - mu2_sq 35 | sigma12 = F.conv2d(img1*img2, window, padding=window_size // 36 | 2, groups=channel) - mu1_mu2 37 | 38 | C1 = 0.01**2 39 | C2 = 0.03**2 40 | 41 | ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2)) / \ 42 | ((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2)) 43 | 44 | if size_average: 45 | return ssim_map.mean() 46 | else: 47 | return ssim_map.mean(1).mean(1).mean(1) 48 | 49 | 50 | class SSIM(torch.nn.Module): 51 | def __init__(self, window_size=11, size_average=True): 52 | super(SSIM, self).__init__() 53 | self.window_size = window_size 54 | self.size_average = size_average 55 | self.channel = 1 56 | self.window = create_window(window_size, self.channel) 57 | 58 | def forward(self, img1, img2): 59 | (_, channel, _, _) = img1.size() 60 | 61 | if channel == self.channel and self.window.data.type() == img1.data.type(): 62 | window = self.window 63 | else: 64 | window = create_window(self.window_size, channel) 65 | 66 | if img1.is_cuda: 67 | window = window.cuda(img1.get_device()) 68 | window = window.type_as(img1) 69 | 70 | self.window = window 71 | self.channel = channel 72 | 73 | return _ssim(img1, img2, window, self.window_size, channel, self.size_average) 74 | 75 | 76 | def _logssim(img1, img2, window, window_size, channel, 
size_average=True): 77 | mu1 = F.conv2d(img1, window, padding=window_size//2, groups=channel) 78 | mu2 = F.conv2d(img2, window, padding=window_size//2, groups=channel) 79 | 80 | mu1_sq = mu1.pow(2) 81 | mu2_sq = mu2.pow(2) 82 | mu1_mu2 = mu1*mu2 83 | 84 | sigma1_sq = F.conv2d( 85 | img1*img1, window, padding=window_size//2, groups=channel) - mu1_sq 86 | sigma2_sq = F.conv2d( 87 | img2*img2, window, padding=window_size//2, groups=channel) - mu2_sq 88 | sigma12 = F.conv2d(img1*img2, window, padding=window_size // 89 | 2, groups=channel) - mu1_mu2 90 | 91 | C1 = 0.01**2 92 | C2 = 0.03**2 93 | 94 | ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2)) / \ 95 | ((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2)) 96 | ssim_map = (ssim_map - torch.min(ssim_map)) / \ 97 | (torch.max(ssim_map)-torch.min(ssim_map)) 98 | ssim_map = -torch.log(ssim_map + 1e-8) 99 | 100 | if size_average: 101 | return ssim_map.mean() 102 | else: 103 | return ssim_map.mean(1).mean(1).mean(1) 104 | 105 | 106 | class LOGSSIM(torch.nn.Module): 107 | def __init__(self, window_size=11, size_average=True): 108 | super(LOGSSIM, self).__init__() 109 | self.window_size = window_size 110 | self.size_average = size_average 111 | self.channel = 1 112 | self.window = create_window(window_size, self.channel) 113 | 114 | def forward(self, img1, img2): 115 | (_, channel, _, _) = img1.size() 116 | 117 | if channel == self.channel and self.window.data.type() == img1.data.type(): 118 | window = self.window 119 | else: 120 | window = create_window(self.window_size, channel) 121 | 122 | if img1.is_cuda: 123 | window = window.cuda(img1.get_device()) 124 | window = window.type_as(img1) 125 | 126 | self.window = window 127 | self.channel = channel 128 | 129 | return _logssim(img1, img2, window, self.window_size, channel, self.size_average) 130 | 131 | 132 | def ssim(img1, img2, window_size=11, size_average=True): 133 | (_, channel, _, _) = img1.size() 134 | window = create_window(window_size, channel) 135 | 136 | if img1.is_cuda: 137 | window = window.cuda(img1.get_device()) 138 | window = window.type_as(img1) 139 | 140 | return _ssim(img1, img2, window, window_size, channel, size_average) 141 | 142 | 143 | def _iou(pred, target, size_average=True): 144 | b = pred.shape[0] 145 | IoU = 0.0 146 | for i in range(0, b): 147 | # compute the IoU of the foreground 148 | Iand1 = torch.sum(target[i, :, :, :] * pred[i, :, :, :]) 149 | Ior1 = torch.sum(target[i, :, :, :]) + \ 150 | torch.sum(pred[i, :, :, :]) - Iand1 151 | IoU1 = Iand1 / Ior1 152 | 153 | # IoU loss is (1-IoU1) 154 | IoU = IoU + (1 - IoU1) 155 | 156 | return IoU / b 157 | 158 | 159 | class IOU(torch.nn.Module): 160 | def __init__(self, size_average=True): 161 | super(IOU, self).__init__() 162 | self.size_average = size_average 163 | 164 | def forward(self, pred, target): 165 | return _iou(pred, target, self.size_average) 166 | 167 | def dice_loss(inputs, targets, num_masks): 168 | """ 169 | Compute the DICE loss, similar to generalized IOU for masks 170 | Args: 171 | inputs: A float tensor of arbitrary shape. 172 | The predictions for each example. 173 | targets: A float tensor with the same shape as inputs. Stores the binary 174 | classification label for each element in inputs 175 | (0 for the negative class and 1 for the positive class). 
176 | """ 177 | inputs = inputs.sigmoid() 178 | numerator = 2 * (inputs * targets).sum(1) 179 | denominator = inputs.sum(-1) + targets.sum(-1) 180 | loss = 1 - (numerator + 1) / (denominator + 1) 181 | return loss.sum() / num_masks 182 | 183 | 184 | def sigmoid_focal_loss(inputs, targets, num_masks, alpha: float = 0.25, gamma: float = 2): 185 | """ 186 | Loss used in RetinaNet for dense detection: https://arxiv.org/abs/1708.02002. 187 | Args: 188 | inputs: A float tensor of arbitrary shape. 189 | The predictions for each example. 190 | targets: A float tensor with the same shape as inputs. Stores the binary 191 | classification label for each element in inputs 192 | (0 for the negative class and 1 for the positive class). 193 | alpha: (optional) Weighting factor in range (0,1) to balance 194 | positive vs negative examples, or -1 for no weighting. Default = 0.25. 195 | gamma: Exponent of the modulating factor (1 - p_t) to 196 | balance easy vs hard examples. 197 | Returns: 198 | Loss tensor 199 | """ 200 | prob = inputs.sigmoid() 201 | ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none") 202 | p_t = prob * targets + (1 - prob) * (1 - targets) 203 | loss = ce_loss * ((1 - p_t) ** gamma) 204 | 205 | if alpha >= 0: 206 | alpha_t = alpha * targets + (1 - alpha) * (1 - targets) 207 | loss = alpha_t * loss 208 | 209 | return loss.mean(1).sum() / num_masks -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/decoder.py: -------------------------------------------------------------------------------- 1 | from typing import Any, List, Tuple, Optional 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | from torch import Tensor 6 | 7 | 8 | class SpatialDecoderBlock(nn.Module): 9 | def __init__(self, in_dim: int = 256, hidden_dim: int = 64, scale_layer_num: int = 3, 10 | out_hm_dim: int = 1, out_reg_dim: int = 1, phase: str = '1D'): 11 | super().__init__() 12 | self.phase = phase 13 | if phase == '2D': 14 | conv = nn.Conv2d 15 | upsample_mode = 'bilinear' 16 | elif phase == '1D': 17 | conv = nn.Conv1d 18 | upsample_mode = 'linear' 19 | else: 20 | raise NotImplementedError 21 | 22 | up = [] 23 | for i, _ in enumerate(range(scale_layer_num)): 24 | if i == 0: 25 | up.append(conv(in_dim, hidden_dim, 3, 1, 1, bias=False),) 26 | else: 27 | up.append(conv(hidden_dim, hidden_dim, 3, 1, 1, bias=False)) 28 | up.append(nn.GroupNorm(4, hidden_dim)) 29 | up.append(nn.ReLU()) 30 | up.append(nn.Upsample(scale_factor=2, 31 | mode=upsample_mode, align_corners=True)) 32 | self.up = nn.Sequential(*up) 33 | self.hm = nn.Sequential( 34 | conv(hidden_dim, hidden_dim, 3, 1, 1, bias=False), 35 | nn.GroupNorm(4, hidden_dim), 36 | nn.ReLU(), 37 | conv(hidden_dim, out_hm_dim, 1) 38 | ) 39 | self.reg = nn.Sequential( 40 | conv(hidden_dim, hidden_dim, 3, 1, 1, bias=False), 41 | nn.GroupNorm(4, hidden_dim), 42 | nn.ReLU(), 43 | conv(hidden_dim, out_reg_dim, 1) 44 | ) 45 | self.__init_weight() 46 | 47 | def __init_weight(self): 48 | for m in self.modules(): 49 | if isinstance(m, nn.Conv2d): 50 | torch.nn.init.kaiming_normal_(m.weight) 51 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 52 | m.weight.data.fill_(1) 53 | m.bias.data.zero_() 54 | 55 | def forward(self, feature: Tensor) -> Tuple[Tensor, Tensor]: 56 | feature = self.up(feature) 57 | heatmap = self.hm(feature) 58 | regression = self.reg(feature) 59 | return heatmap, regression 60 | 61 | 62 | class SpatialDecoder2D(nn.Module): 63 | def __init__(self,
model_dim: int = 256, decoder_hidden_dim: int = 64, dilation: bool = False): 64 | super().__init__() 65 | self.decoder_type = 'spatial_2d' 66 | scale_layer_num = 2 if dilation else 3 67 | self.decoder = SpatialDecoderBlock( 68 | model_dim, decoder_hidden_dim, scale_layer_num, 1, 2, '2D') 69 | 70 | def forward(self, feature: Tensor, frame_mask: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]: 71 | """ 72 | Args: 73 | feature: [\sigma t_i, c, h, w] 74 | frame_mask: [\sigma t_i, h, w] 75 | Return: 76 | heatmap: [\sigma t_i, 1, h, w] 77 | regression: [\sigma ti, 2, h, w] 78 | """ 79 | heatmap, regression = self.decoder(feature) 80 | heatmap = heatmap.sigmoid() 81 | if frame_mask is not None: 82 | frame_mask = F.interpolate( 83 | frame_mask[:, None].float(), heatmap.shape[-2:], mode='nearest').bool() 84 | heatmap = heatmap.masked_fill(frame_mask, 0.) 85 | regression = regression.masked_fill(frame_mask, 0.) 86 | return { 87 | 'spatial_map': heatmap, 88 | 'spatial_wh': regression 89 | } 90 | 91 | 92 | class TemporalDecoderAnchor(nn.Module): 93 | def __init__(self, model_dim: int = 256, temporal_window_width: Optional[List] = None, dropout: float = 0.1): 94 | super().__init__() 95 | self.decoder_type = 'temporal_anchor' 96 | self.temporal_window_width = temporal_window_width 97 | self.reg_head = MLP(2*model_dim, model_dim, 98 | len(temporal_window_width) * 2, 2, dropout=dropout) 99 | self.cls_head = MLP(2*model_dim, model_dim, 100 | len(temporal_window_width), 2, dropout=dropout) 101 | 102 | def forward(self, feature_global: Tensor, feature_obj: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]: 103 | """ 104 | Args: 105 | feature_global: [b, c, t] 106 | feature_obj: [b, c, t] 107 | Return: 108 | offset: [b, t*n_window, 2] 109 | pred_score: [b, t*n_window] 110 | """ 111 | 112 | feature_flatten = torch.cat( 113 | [feature_obj, feature_global], dim=1).transpose(1, 2) # [b, t, 2*c] 114 | offset = self.reg_head(feature_flatten) # [b, t, 2*n_window] 115 | offset = offset.contiguous().view(-1, 116 | offset.shape[1] * len(self.temporal_window_width), 2) # [b, t*n_window, 2] 117 | pred_score = self.cls_head(feature_flatten) # [b, t, n_window] 118 | pred_score = torch.sigmoid(pred_score).contiguous().view( 119 | pred_score.size(0), -1) # [b, t*n_window] 120 | return { 121 | 'temporal_offset': offset, 122 | 'temporal_score': pred_score 123 | } 124 | 125 | 126 | class TemporalDecoderRegression(nn.Module): 127 | def __init__(self, model_dim: int = 256, dropout: float = 0.1): 128 | super().__init__() 129 | self.decoder_type = 'temporal_regression' 130 | self.score_head = MLP(2*model_dim, model_dim, 1, 2, dropout=dropout) 131 | self.iou_head = MLP(2*model_dim, model_dim, 1, 2, dropout=dropout) 132 | self.reg_head = MLP(2*model_dim, model_dim, 2, 2, dropout=dropout) 133 | 134 | def forward(self, feature_global: Tensor, feature_obj: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]: 135 | """ 136 | Args: 137 | feature_global: [b, c, t] 138 | feature_obj: [b, c, t] 139 | Return: 140 | pred_score: [b, t] 141 | pred_reg: [b, t, 2] 142 | pred_iou: [b, t] 143 | """ 144 | 145 | feature_flatten = torch.cat( 146 | [feature_obj, feature_global], dim=1).transpose(1, 2) # [b, t, 2*c] 147 | pred_score = self.score_head( 148 | feature_flatten).squeeze(-1).sigmoid() # [b, t] 149 | pred_reg = self.reg_head(feature_flatten) # [b, t, 2] 150 | pred_iou = self.iou_head( 151 | feature_flatten).squeeze(-1).sigmoid() # [b, t] 152 | return { 153 | 'temporal_score': pred_score, 154 | 'temporal_reg': pred_reg, 155 | 'temporal_iou': 
pred_iou, 156 | } 157 | 158 | 159 | class MLP(nn.Module): 160 | def __init__(self, input_dim: int = 256, hidden_dim: int = 256, output_dim: int = 256, 161 | num_layers: int = 2, normalization: bool = True, dropout: float = 0.1): 162 | super().__init__() 163 | self.num_layers = num_layers 164 | h = [hidden_dim] * (num_layers - 1) 165 | if normalization: 166 | self.layers = nn.ModuleList( 167 | nn.Sequential( 168 | nn.Linear(n, k), 169 | nn.LayerNorm(k), 170 | ) if idx < self.num_layers - 1 else nn.Linear(n, k) 171 | for idx, (n, k) in enumerate(zip([input_dim] + h, h + [output_dim])) 172 | ) 173 | else: 174 | self.layers = nn.ModuleList( 175 | nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]) 176 | ) 177 | self.dropout = dropout 178 | if dropout: 179 | self.dropout = nn.Dropout(dropout) 180 | 181 | def forward(self, x: Tensor) -> Tensor: 182 | for i, layer in enumerate(self.layers): 183 | x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x) 184 | if self.dropout and i < self.num_layers: 185 | x = self.dropout(x) 186 | return x 187 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SAW 2 | 3 | The official implementation of the paper "Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query" \[[Paper](https://ieeexplore.ieee.org/document/10043827)\] 4 | 5 | ![](./docs/net.png) 6 | 7 | We propose a unified framework that handles the whole video sequentially, with long-range and dense visual-linguistic interaction, in an end-to-end manner. Specifically, a lightweight relevance filtering based transformer (Ref-Transformer) is designed, which is composed of relevance filtering based attention and a temporally expanded MLP. The text-relevant spatial regions and temporal clips in the video can be efficiently highlighted through relevance filtering and then propagated across the whole video sequence with the temporally expanded MLP. The unified framework can be applied to various video-text action localization tasks, e.g., referring video segmentation, temporal sentence grounding, and spatiotemporal video grounding. 8 | 9 | ## Requirements 10 | 11 | * python 3.8 12 | 13 | * pytorch 1.9.1 14 | 15 | * torchtext 0.10.1 16 | 17 | 18 | ## Referring Video Segmentation 19 | 20 | Run `cd referring_segmentation` for the referring video segmentation task. 21 | 22 | ### 1. Dataset 23 | 24 | Download the A2D Sentences dataset and the J-HMDB Sentences dataset from [https://kgavrilyuk.github.io/publication/actor_action/](https://kgavrilyuk.github.io/publication/actor_action/) and convert the videos to RGB frames. 25 | 26 | For the A2D Sentences dataset, run `python pre_proc/video2imgs.py` to convert videos to RGB frames. The following directory structure is expected: 27 | 28 | ``` 29 | -a2d_sentences 30 | -Rename_Images 31 | -a2d_annotation_with_instances 32 | -videoset.csv 33 | -a2d_missed_videos.txt 34 | -a2d_annotation.txt 35 | -jhmdb_sentences 36 | -Rename_Images 37 | -puppet_mask 38 | -jhmdb_annotation.txt 39 | ``` 40 | 41 | Edit the item `datasets_root` in `json/config_$DATASET$.json` to point to your dataset path. 42 | 43 | Run `python pre_proc/generate_data_list.py` to generate the training and testing data splits. 44 | 45 | ### 2.
Backbone 46 | 47 | Download the pretrained DeepLabResNet from [https://github.com/VainF/DeepLabV3Plus-Pytorch](https://github.com/VainF/DeepLabV3Plus-Pytorch) and put it into `model/pretrained/`. 48 | 49 | ### 3. Training 50 | 51 | Only the A2D Sentences dataset is used for training. Run: 52 | 53 | ``` 54 | python main.py --json_file=json/config_a2d_sentences.json --mode=train 55 | ``` 56 | 57 | ### 4. Evaluation 58 | 59 | For the A2D Sentences dataset, run: 60 | 61 | ``` 62 | python main.py --json_file=json/config_a2d_sentences.json --mode=test 63 | ``` 64 | 65 | For the JHMDB Sentences dataset, run: 66 | 67 | ``` 68 | python main.py --json_file=json/config_jhmdb_sentences.json --mode=test 69 | ``` 70 | 71 | ## Temporal Sentence Grounding 72 | 73 | Run `cd temporal_grounding` for the temporal sentence grounding task. 74 | 75 | ### 1. Dataset 76 | 77 | * For the Charades-STA dataset, download the pre-extracted I3D features following [LGI4temporalgrounding](https://github.com/JonghwanMun/LGI4temporalgrounding) and the pre-extracted VGG features following [2D-TAN](https://github.com/microsoft/VideoX/tree/master/2D-TAN). 78 | 79 | * For the TACoS dataset, download the pre-extracted C3D features following [2D-TAN](https://github.com/microsoft/VideoX/tree/master/2D-TAN). 80 | 81 | * For the ActivityNet Captions dataset, download the pre-extracted C3D features from [http://activity-net.org/challenges/2016/download.html](http://activity-net.org/challenges/2016/download.html). 82 | 83 | ### 2. Training and Evaluation 84 | 85 | The config files can be found in `./json`, and the following model settings are supported: 86 | 87 | ``` 88 | -config_ActivityNet_C3D_anchor.json 89 | -config_ActivityNet_C3D_regression.json 90 | -config_Charades-STA_I3D_anchor.json 91 | -config_Charades-STA_I3D_regression.json 92 | -config_Charades-STA_VGG_anchor.json 93 | -config_Charades-STA_VGG_regression.json 94 | -config_TACoS_C3D_anchor.json 95 | -config_TACoS_C3D_regression.json 96 | ``` 97 | 98 | Set `"datasets_root"` in each config file to your feature path.
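For example, point `datasets_root` at your local feature directory (the path below is illustrative, and only this field is shown; the remaining entries of the config are left unchanged):

```
{
    "datasets_root": "/data/temporal_grounding/charades_sta_i3d_features"
}
```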
99 | 100 | To train on different datasets with different grounding heads, run 101 | 102 | ``` 103 | python main.py --json_file=$JSON_FILE_PATH$ --mode=train 104 | ``` 105 | 106 | For evaluation, run 107 | 108 | ``` 109 | python main.py --json_file=$JSON_FILE_PATH$ --mode=test --checkpoint=$CHECKPOINT_PATH$ 110 | ``` 111 | 112 | The pretrained models and their corresponding performance are shown below: 113 | 114 | | Datasets | Feature | Decoder | Checkpoints | 115 | |--------------|---------|------------|-------------| 116 | | Charades-STA | I3D | Regression | \[[Baidu](https://pan.baidu.com/s/1GQBkElQITd-exS1njNZrwQ) \| gj54 \] | 117 | | Charades-STA | I3D | Anchor | \[[Baidu](https://pan.baidu.com/s/1MXZqAEBLOzauR8cOLjo3QA) \| 5j3a \] | 118 | | Charades-STA | VGG | Regression | \[[Baidu](https://pan.baidu.com/s/1Yacke_tkaAELzMY_ePyIhw) \| 52xf \] | 119 | | Charades-STA | VGG | Anchor | \[[Baidu](https://pan.baidu.com/s/1PcIZ7QEWcYnfzne1dkMsng) \| rdmx \] | 120 | | ActivityNet | C3D | Regression | \[[Baidu](https://pan.baidu.com/s/1zlH64seHimscTOtNry-6Ag) \| 6sbh \] | 121 | | ActivityNet | C3D | Anchor | \[[Baidu](https://pan.baidu.com/s/1mi8M2wBUAqskWQQqHdmi2Q) \| ysr5 \] | 122 | | TACoS | C3D | Regression | \[[Baidu](https://pan.baidu.com/s/140m-9geYbktSRfP7Pa1rzA) \| iwx2 \] | 123 | | TACoS | C3D | Anchor | \[[Baidu](https://pan.baidu.com/s/1dzIIb4dKQY9t-oAF-N2sLw) \| 1ube \] | 124 | 125 | 126 | ## Spatiotemporal Video Grounding 127 | 128 | Run `cd spatiotemporal_grounding` for the spatiotemporal video grounding task. The code for spatiotemporal grounding is built on the [TubeDETR codebase](https://github.com/antoyang/TubeDETR). 129 | 130 | ### 1. Dataset 131 | 132 | We prepare the `HC-STVG` and `VidSTG` datasets following [TubeDETR](https://github.com/antoyang/TubeDETR). The annotation format of the VidSTG dataset has been optimized to reduce training memory usage. 133 | 134 | **videos** 135 | 136 | VidSTG dataset: Download VidOR videos from [the VidOR dataset providers](https://xdshang.github.io/docs/vidor.html). 137 | 138 | HC-STVG dataset: Download HC-STVG videos from [the HC-STVG dataset providers](https://github.com/tzhhhh123/HC-STVG). 139 | 140 | Edit the item `vidstg_vid_path` in `spatiotemporal_grounding/config/vidstg.json` and the item `hcstvg_vid_path` in `spatiotemporal_grounding/config/hcstvg.json` to point to your video path. 141 | 142 | **annotations** 143 | 144 | Download the preprocessed annotation files from \[[https://pan.baidu.com/s/1oiV9PmtRqRxxdxMvqrJj_w](https://pan.baidu.com/s/1oiV9PmtRqRxxdxMvqrJj_w), password: n6y4\]. Then put the downloaded `annotations` into `spatiotemporal_grounding`. 145 | 146 | ### 2.
Training and Evaluation 147 | 148 | To train on the HC-STVG dataset, run: 149 | 150 | ``` 151 | python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result 152 | ``` 153 | 154 | To train on the VidSTG dataset, run: 155 | 156 | ``` 157 | python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result 158 | ``` 159 | 160 | To evaluate on the HC-STVG dataset, run: 161 | 162 | ``` 163 | python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result --eval --resume=$CHECKPOINT_PATH$ 164 | ``` 165 | 166 | To evaluate on the VidSTG dataset, run: 167 | 168 | ``` 169 | python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result --eval --resume=$CHECKPOINT_PATH$ 170 | ``` 171 | 172 | ## Citation 173 | 174 | ``` 175 | @article{2023saw, 176 | title = {Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query}, 177 | author = {Yuting Su and Weikang Wang and Jing Liu and Shuang Ma and Xiaokang Yang}, 178 | journal = {IEEE Transactions on Image Processing}, 179 | year = {2023} 180 | } 181 | ``` --------------------------------------------------------------------------------