├── referring_segmentation ├── model │ ├── backbone │ │ ├── deeplab_resnet.py │ │ ├── __pycache__ │ │ │ ├── resnet.cpython-38.pyc │ │ │ ├── mobilenet.cpython-38.pyc │ │ │ └── frozen_batchnorm.cpython-38.pyc │ │ ├── frozen_batchnorm.py │ │ └── mobilenet.py │ ├── __pycache__ │ │ └── model.cpython-38.pyc │ └── module │ │ ├── __pycache__ │ │ ├── TCN.cpython-38.pyc │ │ ├── aspp.cpython-38.pyc │ │ └── attention.cpython-38.pyc │ │ ├── aspp.py │ │ ├── attention.py │ │ └── TCN.py ├── utils │ ├── __pycache__ │ │ ├── loss.cpython-38.pyc │ │ ├── utils.cpython-38.pyc │ │ ├── tester.cpython-38.pyc │ │ ├── trainer.cpython-38.pyc │ │ ├── check_datas.cpython-38.pyc │ │ └── video_reader.cpython-38.pyc │ ├── utils.py │ ├── video_reader.py │ ├── tester.py │ └── loss.py ├── dataset │ ├── __pycache__ │ │ ├── dataset.cpython-38.pyc │ │ └── augmentation.cpython-38.pyc │ └── dataset.py ├── pre_proc │ ├── video2imgs.py │ └── generate_data_list.py ├── json │ ├── config_a2d_sentences.json │ └── config_jhmdb_sentences.json └── main.py ├── docs └── net.png ├── temporal_grounding ├── model │ ├── __pycache__ │ │ ├── model.cpython-38.pyc │ │ └── decoder.cpython-38.pyc │ ├── module │ │ ├── __pycache__ │ │ │ ├── TCN.cpython-38.pyc │ │ │ ├── aspp.cpython-38.pyc │ │ │ ├── encoder.cpython-38.pyc │ │ │ ├── attention.cpython-38.pyc │ │ │ ├── tanmodule.cpython-38.pyc │ │ │ └── RefTransformer.cpython-38.pyc │ │ ├── attention.py │ │ └── RefTransformer.py │ ├── backbone │ │ ├── __pycache__ │ │ │ ├── C3D.cpython-38.pyc │ │ │ ├── resnet.cpython-38.pyc │ │ │ ├── mobilenet.cpython-38.pyc │ │ │ ├── pytorch_i3d.cpython-38.pyc │ │ │ └── frozen_batchnorm.cpython-38.pyc │ │ └── C3D.py │ └── model.py ├── utils │ ├── __pycache__ │ │ ├── losses.cpython-38.pyc │ │ ├── tester.cpython-38.pyc │ │ ├── utils.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── optimizer.cpython-38.pyc │ │ ├── scheduler.cpython-38.pyc │ │ ├── trainer.cpython-38.pyc │ │ ├── adam_optimizer.cpython-38.pyc │ │ └── generate_batch.cpython-38.pyc │ ├── __init__.py │ ├── generate_batch.py │ ├── losses.py │ ├── scheduler.py │ ├── tester.py │ ├── optimizer.py │ ├── adam_optimizer.py │ └── utils.py ├── json │ ├── config_TACoS_C3D_anchor.json │ ├── config_Charades-STA_I3D_anchor.json │ ├── config_Charades-STA_I3D_regression.json │ ├── config_TACoS_C3D_regression.json │ ├── config_Charades-STA_VGG_anchor.json │ ├── config_Charades-STA_VGG_regression.json │ ├── config_ActivityNet_C3D_anchor.json │ └── config_ActivityNet_C3D_regression.json ├── main.py └── dataset.py ├── spatiotemporal_grounding ├── util │ ├── __pycache__ │ │ ├── dist.cpython-38.pyc │ │ ├── misc.cpython-38.pyc │ │ ├── optim.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── box_ops.cpython-38.pyc │ │ └── metrics.cpython-38.pyc │ ├── __init__.py │ ├── optim.py │ ├── box_ops.py │ └── metrics.py ├── models │ ├── __pycache__ │ │ ├── model.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── criterion.cpython-38.pyc │ │ ├── anchor_utils.cpython-38.pyc │ │ └── postprocessors.cpython-38.pyc │ ├── module │ │ ├── __pycache__ │ │ │ ├── TCN.cpython-310.pyc │ │ │ ├── TCN.cpython-38.pyc │ │ │ ├── aspp.cpython-310.pyc │ │ │ ├── aspp.cpython-38.pyc │ │ │ ├── attention.cpython-310.pyc │ │ │ └── attention.cpython-38.pyc │ │ ├── RefTransformer.py │ │ ├── attention.py │ │ └── decoder.py │ ├── __init__.py │ ├── anchor_utils.py │ ├── utils.py │ ├── postprocessors.py │ └── criterion.py ├── datasets │ ├── __pycache__ │ │ ├── hcstvg.cpython-38.pyc │ │ ├── vidstg.cpython-38.pyc │ │ ├── __init__.cpython-38.pyc │ │ ├── 
hcstvg_eval.cpython-38.pyc │ │ ├── vidstg_eval.cpython-38.pyc │ │ ├── torch_videovision.cpython-38.pyc │ │ └── video_transforms.cpython-38.pyc │ ├── __init__.py │ └── torch_videovision.py └── config │ ├── hcstvg.json │ └── vidstg.json └── README.md /referring_segmentation/model/backbone/deeplab_resnet.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/docs/net.png -------------------------------------------------------------------------------- /temporal_grounding/model/__pycache__/model.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/__pycache__/model.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/losses.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/losses.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/tester.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/tester.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/__pycache__/model.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/__pycache__/model.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/loss.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/loss.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/dist.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/dist.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/misc.cpython-38.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/misc.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/__pycache__/decoder.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/__pycache__/decoder.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/optimizer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/optimizer.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/scheduler.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/scheduler.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/trainer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/trainer.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/tester.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/tester.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/trainer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/trainer.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/optim.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/optim.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/TCN.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/TCN.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/aspp.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/aspp.cpython-38.pyc 
-------------------------------------------------------------------------------- /referring_segmentation/dataset/__pycache__/dataset.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/dataset/__pycache__/dataset.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/module/__pycache__/TCN.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/module/__pycache__/TCN.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/model.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/model.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/box_ops.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/box_ops.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__pycache__/metrics.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/util/__pycache__/metrics.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/C3D.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/C3D.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/encoder.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/encoder.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/adam_optimizer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/adam_optimizer.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/utils/__pycache__/generate_batch.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/utils/__pycache__/generate_batch.cpython-38.pyc -------------------------------------------------------------------------------- 
/referring_segmentation/model/module/__pycache__/aspp.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/module/__pycache__/aspp.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/check_datas.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/check_datas.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/utils/__pycache__/video_reader.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/utils/__pycache__/video_reader.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/hcstvg.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/hcstvg.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/vidstg.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/vidstg.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/criterion.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/criterion.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/resnet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/resnet.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/attention.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/attention.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/tanmodule.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/tanmodule.cpython-38.pyc -------------------------------------------------------------------------------- 
/referring_segmentation/dataset/__pycache__/augmentation.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/dataset/__pycache__/augmentation.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/__pycache__/resnet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/backbone/__pycache__/resnet.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/anchor_utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/anchor_utils.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-310.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/TCN.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-310.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/aspp.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/mobilenet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/mobilenet.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/module/__pycache__/attention.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/module/__pycache__/attention.cpython-38.pyc 
-------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/hcstvg_eval.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/hcstvg_eval.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/vidstg_eval.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/vidstg_eval.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__pycache__/postprocessors.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/__pycache__/postprocessors.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/pytorch_i3d.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/pytorch_i3d.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/module/__pycache__/RefTransformer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/module/__pycache__/RefTransformer.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/__pycache__/mobilenet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/backbone/__pycache__/mobilenet.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/attention.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/attention.cpython-310.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/__pycache__/attention.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/models/module/__pycache__/attention.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/torch_videovision.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/torch_videovision.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__pycache__/video_transforms.cpython-38.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/spatiotemporal_grounding/datasets/__pycache__/video_transforms.cpython-38.pyc -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/temporal_grounding/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TJUMMG/SAW/HEAD/referring_segmentation/model/backbone/__pycache__/frozen_batchnorm.cpython-38.pyc -------------------------------------------------------------------------------- /spatiotemporal_grounding/config/hcstvg.json: -------------------------------------------------------------------------------- 1 | { 2 | "combine_datasets": [ 3 | "hcstvg" 4 | ], 5 | "combine_datasets_val": [ 6 | "hcstvg" 7 | ], 8 | "hcstvg_vid_path": "", 9 | "hcstvg_ann_path": "./annotations/HC-STVG/v1" 10 | } -------------------------------------------------------------------------------- /spatiotemporal_grounding/config/vidstg.json: -------------------------------------------------------------------------------- 1 | { 2 | "combine_datasets": [ 3 | "vidstg" 4 | ], 5 | "combine_datasets_val": [ 6 | "vidstg" 7 | ], 8 | "vidstg_vid_path": "", 9 | "vidstg_ann_path": "./annotations/VidSTG" 10 | } -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/__init__.py: -------------------------------------------------------------------------------- 1 | # Adapted from 2 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 3 | # Copyright (c) Facebook, Inc. and its affiliates. 
All Rights Reserved 4 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from .vidstg import build as build_vidstg 2 | from .hcstvg import build as build_hcstvg 3 | 4 | 5 | def build_dataset(dataset_file: str, image_set: str, args): 6 | if dataset_file == "vidstg": 7 | return build_vidstg(image_set, args) 8 | if dataset_file == "hcstvg": 9 | return build_hcstvg(image_set, args) 10 | raise ValueError(f"dataset {dataset_file} not supported") 11 | -------------------------------------------------------------------------------- /referring_segmentation/pre_proc/video2imgs.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | from tqdm import tqdm 3 | import os 4 | 5 | def video2imgs(videos_path, imgs_save_path): 6 | videos = os.listdir(videos_path) 7 | for video_name in tqdm(videos): 8 | file_name = video_name.split('.')[0] 9 | img_save_path = os.path.join(imgs_save_path, file_name) 10 | if not os.path.exists(img_save_path): 11 | os.makedirs(img_save_path) 12 | 13 | vc = cv2.VideoCapture(os.path.join(videos_path, video_name)) 14 | i_frame = 0 15 | rval=vc.isOpened() 16 | 17 | while rval: 18 | i_frame = i_frame + 1 19 | rval, frame = vc.read() 20 | if rval: 21 | cv2.imwrite(os.path.join(img_save_path, '{:05d}.jpg'.format(i_frame)), frame) 22 | else: 23 | break 24 | vc.release() 25 | 26 | if __name__ =='__main__': 27 | videos_path = '/media/wwk/HDD1/dataset/referring_video_segmentation/a2d_sentences/Release/clips320H' 28 | imgs_save_path = '/media/wwk/HDD2/datasets/referring_video_segmentation/a2d_sentences/Rename_Images' 29 | video2imgs(videos_path, imgs_save_path) 30 | -------------------------------------------------------------------------------- /temporal_grounding/json/config_TACoS_C3D_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "TACoS", 6 | "testing_datasets": "TACoS", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 4, 11 | 9, 12 | 16, 13 | 29, 14 | 64 15 | ], 16 | "segment_num": 200, 17 | "embedding_type": "glove_840B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "anchor_based", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 4096, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 4, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/TACoS_anchor", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 8, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/TACoS_anchor/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_I3D_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | 
"seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "I3D", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "anchor_based", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 1024, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_I3D_anchor", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_I3D_anchor/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_I3D_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "I3D", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "regression", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 1024, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_I3D_regression", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_I3D_regression/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 
7 | 8 | import importlib 9 | import os 10 | 11 | from .optimizer import FairseqLRScheduler 12 | 13 | 14 | LR_SCHEDULER_REGISTRY = {} 15 | 16 | 17 | def build_lr_scheduler(args, optimizer): 18 | return LR_SCHEDULER_REGISTRY[args.lr_scheduler](args, optimizer) 19 | 20 | 21 | def register_lr_scheduler(name): 22 | """Decorator to register a new LR scheduler.""" 23 | 24 | def register_lr_scheduler_cls(cls): 25 | if name in LR_SCHEDULER_REGISTRY: 26 | raise ValueError('Cannot register duplicate LR scheduler ({})'.format(name)) 27 | if not issubclass(cls, FairseqLRScheduler): 28 | raise ValueError('LR Scheduler ({}: {}) must extend FairseqLRScheduler'.format(name, cls.__name__)) 29 | LR_SCHEDULER_REGISTRY[name] = cls 30 | return cls 31 | 32 | return register_lr_scheduler_cls 33 | 34 | 35 | # automatically import any Python files in the optimizer/lr_scheduler/ directory 36 | # for file in os.listdir(os.path.dirname(__file__)): 37 | # if file.endswith('.py') and not file.startswith('_'): 38 | # module = file[:file.find('.py')] 39 | # importlib.import_module('optimizer.lr_scheduler.' + module) 40 | -------------------------------------------------------------------------------- /temporal_grounding/json/config_TACoS_C3D_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "TACoS", 6 | "testing_datasets": "TACoS", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 4, 11 | 9, 12 | 16, 13 | 29, 14 | 64 15 | ], 16 | "segment_num": 200, 17 | "embedding_type": "glove_840B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "regression", 20 | "dropout": 0.1, 21 | "prenorm": false, 22 | "embedding_dim": 300, 23 | "video_fea_dim": 4096, 24 | "attention_dim": 256, 25 | "MLP_dim": 256, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 4, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.0005, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/TACoS_regression", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 8, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/TACoS_regression/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_VGG_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "VGG", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "anchor_based", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 4096, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 
0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_VGG_anchor", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_VGG_anchor/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_Charades-STA_VGG_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "Charades-STA", 6 | "testing_datasets": "Charades-STA", 7 | "datasets_root": "", 8 | "video_fea_type": "VGG", 9 | "window_width": [ 10 | 6, 11 | 12, 12 | 24, 13 | 48, 14 | 72 15 | ], 16 | "segment_num": 75, 17 | "embedding_type": "glove_6B_300", 18 | "embedding_length": 20, 19 | "decoder_type": "regression", 20 | "dropout": 0.1, 21 | "embedding_dim": 300, 22 | "video_fea_dim": 4096, 23 | "attention_dim": 256, 24 | "MLP_dim": 256, 25 | "prenorm": false, 26 | "padding_type": "circle", 27 | "layer_num": 10, 28 | "with_text": true, 29 | "with_attention": true, 30 | "with_mlp": true, 31 | "groups": 8, 32 | "gru_bidirection": true, 33 | "thres_score": 0.3, 34 | "thres_adjmat": 0.8, 35 | "resume": "", 36 | "batch_size": 128, 37 | "epochs": 20, 38 | "lr": 0.001, 39 | "optimizer": "AdamW", 40 | "lr_schedule": "StepLR", 41 | "decay_epochs": 15, 42 | "decay_ratio": 0.1, 43 | "with_weight_loss": true, 44 | "loss_weight": [ 45 | 1, 46 | 0.001, 47 | 0.001 48 | ], 49 | "log_root": "./logs/Charades-STA_VGG_regression", 50 | "save_temp_iters": 10, 51 | "save_iters": 5, 52 | "num_worker": 4, 53 | "test_savefold": "./result", 54 | "checkpoint": "./logs/Charades-STA_VGG_regression/checkpoints/best_model.pth" 55 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_ActivityNet_C3D_anchor.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "ActivityNet", 6 | "testing_datasets": "ActivityNet", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 16, 11 | 32, 12 | 64, 13 | 96, 14 | 128, 15 | 160, 16 | 192 17 | ], 18 | "segment_num": 200, 19 | "embedding_type": "glove_840B_300", 20 | "embedding_length": 40, 21 | "decoder_type": "anchor_based", 22 | "dropout": 0.1, 23 | "embedding_dim": 300, 24 | "video_fea_dim": 500, 25 | "attention_dim": 256, 26 | "MLP_dim": 256, 27 | "padding_type": "circle", 28 | "layer_num": 10, 29 | "prenorm": false, 30 | "with_text": true, 31 | "with_attention": true, 32 | "with_mlp": true, 33 | "groups": 4, 34 | "gru_bidirection": true, 35 | "thres_score": 0.3, 36 | "thres_adjmat": 0.8, 37 | "resume": "", 38 | "batch_size": 128, 39 | "epochs": 20, 40 | "lr": 1e-3, 41 | "optimizer": "AdamW", 42 | "lr_schedule": "StepLR", 43 | "decay_epochs": 15, 44 | "decay_ratio": 0.1, 45 | "with_weight_loss": true, 46 | "loss_weight": [ 47 | 1, 48 | 0.001, 49 | 0.001 50 | ], 51 | "log_root": "./logs/ActivityNet_anchor", 52 | "save_temp_iters": 10, 53 | "save_iters": 5, 54 | "num_worker": 8, 55 | "test_savefold": "./result", 56 | "checkpoint": 
"./logs/ActivityNet_C3D_anchor/checkpoints/best_model.pth" 57 | } -------------------------------------------------------------------------------- /temporal_grounding/json/config_ActivityNet_C3D_regression.json: -------------------------------------------------------------------------------- 1 | { 2 | "cuda": true, 3 | "gpu_id": "0", 4 | "seed": 1, 5 | "training_datasets": "ActivityNet", 6 | "testing_datasets": "ActivityNet", 7 | "datasets_root": "", 8 | "video_fea_type": "C3D", 9 | "window_width": [ 10 | 16, 11 | 32, 12 | 64, 13 | 96, 14 | 128, 15 | 160, 16 | 192 17 | ], 18 | "segment_num": 64, 19 | "embedding_type": "glove_840B_300", 20 | "embedding_length": 40, 21 | "decoder_type": "regression", 22 | "dropout": 0.1, 23 | "embedding_dim": 300, 24 | "video_fea_dim": 500, 25 | "attention_dim": 256, 26 | "MLP_dim": 256, 27 | "padding_type": "circle", 28 | "layer_num": 10, 29 | "prenorm": true, 30 | "with_text": true, 31 | "with_attention": true, 32 | "with_mlp": true, 33 | "groups": 4, 34 | "gru_bidirection": true, 35 | "thres_score": 0.3, 36 | "thres_adjmat": 0.8, 37 | "resume": "", 38 | "batch_size": 128, 39 | "epochs": 20, 40 | "lr": 0.0005, 41 | "optimizer": "AdamW", 42 | "lr_schedule": "StepLR", 43 | "decay_epochs": 15, 44 | "decay_ratio": 0.1, 45 | "with_weight_loss": true, 46 | "loss_weight": [ 47 | 1, 48 | 0.001, 49 | 0.001 50 | ], 51 | "log_root": "./logs/ActivityNet_regression", 52 | "save_temp_iters": 10, 53 | "save_iters": 5, 54 | "num_worker": 8, 55 | "test_savefold": "./result", 56 | "checkpoint": "./logs/ActivityNet_C3D_regression/checkpoints/best_model.pth" 57 | } -------------------------------------------------------------------------------- /referring_segmentation/json/config_a2d_sentences.json: -------------------------------------------------------------------------------- 1 | { 2 | "setting_config": { 3 | "cuda": true, 4 | "gpu_id": "0", 5 | "seed": 20 6 | }, 7 | "data_config": { 8 | "training_datasets": [ 9 | "a2d_sentences" 10 | ], 11 | "testing_datasets": [ 12 | "a2d_sentences" 13 | ], 14 | "datasets_root": "/media/wwk/HDD2/datasets/referring_video_segmentation/a2d_sentences", 15 | "input_size": [ 16 | 320, 17 | 320 18 | ], 19 | "clip_size": 8, 20 | "augmentations": { 21 | "random_crop": true, 22 | "random_flip": false 23 | }, 24 | "embedding_type": "glove_840B_300", 25 | "max_embedding_length": 20 26 | }, 27 | "model_config": { 28 | "backbone": "deeplab_resnet101", 29 | "input_dim": 3, 30 | "os": 16, 31 | "train_backbone": false, 32 | "backbone_path": "", 33 | "backbone_multi_scale": true, 34 | "video_feature_dim": 256, 35 | "text_feature_dim": 256, 36 | "TCN_feature_dim": 256, 37 | "gru_bidirection": true, 38 | "attention_dim": 256, 39 | "TCN_hidden_dim": 64, 40 | "embedding_dim": 300, 41 | "layer_num": 5, 42 | "is_local_attention": true, 43 | "groups": 4, 44 | "is_global_attention": true, 45 | "conv_type": "2D", 46 | "filter_type": "global", 47 | "global_fuse_type": "mutan", 48 | "local_fuse_type": "relevance_filter", 49 | "padding_type": "circle", 50 | "norm_type": "GroupNorm", 51 | "frozen_batchnorm": true, 52 | "frozen_backbone": true 53 | }, 54 | "training_config": { 55 | "resume": "", 56 | "batch_size": 8, 57 | "epochs": 20, 58 | 59 | "lr_backbone": 5e-05, 60 | "lr_branch": 0.0005, 61 | "optimizer": "AdamW", 62 | "lr_schedule": "Step", 63 | "loss_function": "SSIM", 64 | "log_root": "./logs", 65 | "save_iters": 100, 66 | "num_worker": 8, 67 | "lr_weight": 0.1 68 | }, 69 | "testing_config": { 70 | "test_savefold": "./result", 71 | "checkpoint": 
"./logs/checkpoints/checkpoint_best.pth" 72 | } 73 | } 74 | -------------------------------------------------------------------------------- /referring_segmentation/json/config_jhmdb_sentences.json: -------------------------------------------------------------------------------- 1 | { 2 | "setting_config": { 3 | "cuda": true, 4 | "gpu_id": "0", 5 | "seed": 20 6 | }, 7 | "data_config": { 8 | "training_datasets": [ 9 | "jhmdb_sentences" 10 | ], 11 | "testing_datasets": [ 12 | "jhmdb_sentences" 13 | ], 14 | "datasets_root": "/media/wwk/HDD1/dataset/referring_video_segmentation/jhmdb_sentences", 15 | "input_size": [ 16 | 320, 17 | 320 18 | ], 19 | "clip_size": 8, 20 | "augmentations": { 21 | "random_crop": true, 22 | "random_flip": false 23 | }, 24 | "embedding_type": "glove_840B_300", 25 | "max_embedding_length": 20 26 | }, 27 | "model_config": { 28 | "backbone": "deeplab_resnet101", 29 | "input_dim": 3, 30 | "os": 16, 31 | "train_backbone": false, 32 | "backbone_path": "", 33 | "backbone_multi_scale": true, 34 | "video_feature_dim": 256, 35 | "text_feature_dim": 256, 36 | "TCN_feature_dim": 256, 37 | "gru_bidirection": true, 38 | "attention_dim": 256, 39 | "TCN_hidden_dim": 64, 40 | "embedding_dim": 300, 41 | "layer_num": 5, 42 | "is_local_attention": true, 43 | "groups": 4, 44 | "is_global_attention": true, 45 | "conv_type": "2D", 46 | "filter_type": "global", 47 | "global_fuse_type": "mutan", 48 | "local_fuse_type": "relevance_filter", 49 | "padding_type": "circle", 50 | "norm_type": "GroupNorm", 51 | "frozen_batchnorm": true, 52 | "frozen_backbone": true 53 | }, 54 | "training_config": { 55 | "resume": "", 56 | "batch_size": 8, 57 | "epochs": 20, 58 | 59 | "lr_backbone": 5e-05, 60 | "lr_branch": 0.0005, 61 | "optimizer": "AdamW", 62 | "lr_schedule": "Step", 63 | "loss_function": "SSIM", 64 | "log_root": "./logs", 65 | "save_iters": 100, 66 | "num_worker": 8, 67 | "lr_weight": 0.1 68 | }, 69 | "testing_config": { 70 | "test_savefold": "./result", 71 | "checkpoint": "./logs/checkpoints/checkpoint_best.pth" 72 | } 73 | } 74 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/__init__.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from .model import build_model 3 | 4 | def generate_anchors(windows): 5 | widths = torch.tensor(windows) 6 | center = 7.5 7 | start = center - 0.5 * (widths - 1) 8 | end = center + 0.5 * (widths - 1) 9 | return torch.stack([start, end], -1) 10 | 11 | def generate_proposals(max_num_frames, windows): 12 | anchors = generate_anchors(windows) 13 | widths = (anchors[:, 1] - anchors[:, 0] + 1) # [num_anchors] 14 | centers = torch.arange(0, max_num_frames) # [video_len] 15 | start = centers[:, None] - 0.5 * (widths[None, :] - 1) 16 | end = centers[:, None] + 0.5 * (widths[None, :] - 1) 17 | proposals = torch.stack([start, end], -1) # [video_len, num_anchors, 2] 18 | return proposals.view(-1, 2) 19 | 20 | def calculate_IoU_batch(i0, i1): 21 | union = (torch.min(torch.stack([i0[0], i1[0]], 0), 0)[0], torch.max(torch.stack([i0[1], i1[1]], 0), 0)[0]) 22 | inter = (torch.max(torch.stack([i0[0], i1[0]], 0), 0)[0], torch.min(torch.stack([i0[1], i1[1]], 0), 0)[0]) 23 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 24 | iou[union[1] - union[0] < -1e-5] = 0 25 | iou[iou < 0] = 0.0 26 | return iou 27 | 28 | 29 | def generate_scores(proposals, label, max_num_frames, thres_score): 30 | illegal = torch.logical_or(proposals[:, 0] < 0, proposals[:, 
1] >= max_num_frames) 31 | label1 = label[None, :].repeat(proposals.shape[0], 1) 32 | # label1 = np.repeat(np.expand_dims(label, 0), proposals.shape[0], 0) 33 | IoUs = calculate_IoU_batch((proposals[:, 0], proposals[:, 1]), 34 | (label1[:, 0], label1[:, 1])) 35 | IoUs[illegal] = 0.0 # [video_len * num_anchors] 36 | max_IoU = torch.max(IoUs) 37 | IoUs[IoUs < thres_score * max_IoU] = 0.0 38 | IoUs = IoUs / (max_IoU + 1e-4) 39 | 40 | scores = IoUs.float() 41 | scores_mask = (1 - illegal.float()) 42 | return scores, scores_mask 43 | -------------------------------------------------------------------------------- /referring_segmentation/model/backbone/frozen_batchnorm.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class FrozenBatchNorm2d(torch.nn.Module): 6 | """ 7 | BatchNorm2d where the batch statistics and the affine parameters are fixed. 8 | 9 | Copy-paste from torchvision.misc.ops with added eps before rqsrt, 10 | without which any other models than torchvision.models.resnet[18,34,50,101] 11 | produce nans. 12 | """ 13 | 14 | def __init__(self, n): 15 | super(FrozenBatchNorm2d, self).__init__() 16 | self.register_buffer("weight", torch.ones(n)) 17 | self.register_buffer("bias", torch.zeros(n)) 18 | self.register_buffer("running_mean", torch.zeros(n)) 19 | self.register_buffer("running_var", torch.ones(n)) 20 | 21 | def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, 22 | missing_keys, unexpected_keys, error_msgs): 23 | num_batches_tracked_key = prefix + 'num_batches_tracked' 24 | if num_batches_tracked_key in state_dict: 25 | del state_dict[num_batches_tracked_key] 26 | 27 | super(FrozenBatchNorm2d, self)._load_from_state_dict( 28 | state_dict, prefix, local_metadata, strict, 29 | missing_keys, unexpected_keys, error_msgs) 30 | 31 | def forward(self, x): 32 | # move reshapes to the beginning 33 | # to make it fuser-friendly 34 | w = self.weight.reshape(1, -1, 1, 1) 35 | b = self.bias.reshape(1, -1, 1, 1) 36 | rv = self.running_var.reshape(1, -1, 1, 1) 37 | rm = self.running_mean.reshape(1, -1, 1, 1) 38 | eps = 1e-5 39 | scale = w * (rv + eps).rsqrt() 40 | bias = b - rm * scale 41 | return x * scale + bias 42 | 43 | 44 | def convert_to_frozen_batchnorm(module): 45 | new_module = module 46 | if isinstance(module, nn.BatchNorm2d): 47 | new_module = FrozenBatchNorm2d(module.num_features) 48 | for name, child in module.named_children(): 49 | new_module.add_module(name, convert_to_frozen_batchnorm(child)) 50 | return new_module -------------------------------------------------------------------------------- /temporal_grounding/utils/generate_batch.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | def collect_fn(batch): 4 | """ 5 | heat, offset, d, mask, embedding, fea, ratio, embedding_length, label 6 | """ 7 | heats = [] 8 | offsets = [] 9 | ds = [] 10 | ratios = [] 11 | feas = [] 12 | embeddings = [] 13 | masks = [] 14 | lengths = [] 15 | labels = [] 16 | max_fea_len = 0 17 | max_embedding_len = 0 18 | for b in batch: 19 | fea = b[5] 20 | embedding = b[4] 21 | lengths.append(b[7]) 22 | if max_fea_len < fea.shape[-1]: 23 | max_fea_len = fea.shape[-1] 24 | if max_embedding_len < embedding.shape[0]: 25 | max_embedding_len = embedding.shape[0] 26 | lengths = torch.tensor(lengths) 27 | sorted, indices = lengths.sort(descending=True) 28 | for index in indices: 29 | b = batch[index] 30 | seq_length = b[0].shape[0] 31 | 
fea_padded = torch.zeros((b[5].shape[0], max_fea_len)) 32 | embedding_padded = torch.zeros((max_embedding_len, b[4].shape[1])) 33 | heat_padded = torch.zeros(max_fea_len) 34 | mask_padded = torch.zeros(max_fea_len) 35 | d_padded = torch.zeros(max_fea_len) 36 | fea_padded[:, : seq_length] = b[5] 37 | embedding_padded[:b[4].shape[0], :] = b[4] 38 | mask_padded[: seq_length] = b[3] 39 | heat_padded[: seq_length] = b[0] 40 | d_padded[: seq_length] = b[2] 41 | 42 | offsets.append(b[1]) 43 | ds.append(d_padded) 44 | feas.append(fea_padded) 45 | heats.append(heat_padded) 46 | embeddings.append(embedding_padded) 47 | masks.append(mask_padded) 48 | ratios.append(b[6]) 49 | labels.append(b[-1]) 50 | feas = torch.stack(feas, dim=0) 51 | heats = torch.stack(heats, dim=0) 52 | embeddings = torch.stack(embeddings, dim=0) 53 | masks = torch.stack(masks, dim=0) 54 | ratios = torch.stack(ratios, dim=0) 55 | offsets = torch.stack(offsets, dim=0) 56 | ds = torch.stack(ds, dim=0) 57 | labels = torch.stack(labels, dim=0) 58 | return heats, offsets, ds, masks, embeddings, feas, ratios, sorted, labels 59 | -------------------------------------------------------------------------------- /referring_segmentation/main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | import numpy as np 5 | import random 6 | import torch 7 | from dataset.dataset import MyDataset 8 | from model.model import Model 9 | import torch.nn as nn 10 | from utils.trainer import Trainer 11 | from utils.tester import Tester 12 | 13 | 14 | def main(args): 15 | with open(args.json_file, 'r') as f: 16 | config = json.load(f) 17 | 18 | setting = config['setting_config'] 19 | data_config = config['data_config'] 20 | model_config = config['model_config'] 21 | training_config = config['training_config'] 22 | testing_config = config['testing_config'] 23 | is_cuda = setting['cuda'] 24 | if is_cuda: 25 | os.environ["CUDA_VISIBLE_DEVICES"] = setting['gpu_id'] 26 | if args.checkpoint is not None: 27 | testing_config['checkpoint'] = args.checkpoint 28 | 29 | torch.manual_seed(setting['seed']) 30 | np.random.seed(setting['seed']) 31 | random.seed(setting['seed']) 32 | 33 | training_config['cuda'] = is_cuda 34 | testing_config['cuda'] = is_cuda 35 | training_config['train_backbone'] = model_config['train_backbone'] 36 | 37 | # init_distributed_mode() 38 | if args.mode == 'train': 39 | dataset = MyDataset(data_config, 'train') 40 | val_dataset = MyDataset(data_config, 'test') 41 | elif args.mode == 'test': 42 | dataset = MyDataset(data_config, 'test') 43 | else: 44 | raise NotImplementedError 45 | 46 | model = Model(model_config) 47 | if is_cuda: 48 | model = nn.DataParallel(model) 49 | model = model.cuda() 50 | 51 | if not os.path.exists(training_config['log_root']): 52 | os.mkdir(training_config['log_root']) 53 | if args.mode == 'train': 54 | trainer = Trainer(training_config) 55 | trainer.train(model, dataset, val_dataset) 56 | elif args.mode == 'test': 57 | tester = Tester(testing_config) 58 | tester.test(model, dataset) 59 | else: 60 | raise NotImplementedError 61 | 62 | 63 | if __name__ == '__main__': 64 | 65 | parser = argparse.ArgumentParser() 66 | parser.add_argument('--json_file', type=str, default='json/config_a2d_sentences.json') 67 | parser.add_argument('--mode', type=str, default='train') 68 | parser.add_argument('--checkpoint', type=str, default=None) 69 | args = parser.parse_args() 70 | main(args) 71 | 
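
The entry point above simply wires the four JSON config blocks (setting/data/model/training or testing) from json/config_a2d_sentences.json into the classes it imports. As a rough orientation, a minimal programmatic equivalent of its train branch is sketched below; it assumes the code is run from the referring_segmentation/ directory with the shipped config file, and it is an illustrative sketch only, not part of the repository.

# Sketch: mirrors the "train" branch of referring_segmentation/main.py above.
import json
import torch.nn as nn
from dataset.dataset import MyDataset
from model.model import Model
from utils.trainer import Trainer

with open("json/config_a2d_sentences.json", "r") as f:
    config = json.load(f)

# main() copies these two flags into the training config before building the trainer.
training_config = config["training_config"]
training_config["cuda"] = config["setting_config"]["cuda"]
training_config["train_backbone"] = config["model_config"]["train_backbone"]

train_set = MyDataset(config["data_config"], "train")
val_set = MyDataset(config["data_config"], "test")

model = Model(config["model_config"])
if training_config["cuda"]:
    model = nn.DataParallel(model).cuda()

Trainer(training_config).train(model, train_set, val_set)

For evaluation, the same script is invoked with --mode test and, optionally, --checkpoint to override testing_config["checkpoint"], exactly as in the argparse block above.
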
-------------------------------------------------------------------------------- /temporal_grounding/main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | import numpy as np 5 | import random 6 | import torch 7 | from dataset import MyDataset 8 | from model.model import Model 9 | import torch.nn as nn 10 | from utils.trainer import Trainer 11 | from utils.tester import Tester 12 | 13 | 14 | def main(args): 15 | with open(args.json_file, 'r') as f: 16 | config = json.load(f) 17 | config['mode'] = args.mode 18 | if args.checkpoint is not None: 19 | config['checkpoint'] = args.checkpoint 20 | 21 | is_cuda = config['cuda'] 22 | if is_cuda: 23 | os.environ["CUDA_VISIBLE_DEVICES"] = config['gpu_id'] 24 | 25 | torch.manual_seed(config['seed']) 26 | np.random.seed(config['seed']) 27 | random.seed(config['seed']) 28 | 29 | # init_distributed_mode() 30 | if config['mode'] == 'train': 31 | dataset = MyDataset(config, 'train') 32 | eval_dataset = MyDataset(config, 'test') 33 | elif config['mode'] == 'test': 34 | dataset = MyDataset(config, 'test') 35 | else: 36 | raise NotImplementedError 37 | 38 | model = Model(config) 39 | if is_cuda: 40 | model = nn.DataParallel(model) 41 | model = model.cuda() 42 | 43 | if not os.path.exists(config['log_root']): 44 | os.makedirs(config['log_root']) 45 | if config['mode'] == 'train': 46 | trainer = Trainer(config) 47 | trainer.train(model, dataset, eval_dataset) 48 | elif config['mode'] == 'test': 49 | tester = Tester(config) 50 | tester.test(model, dataset) 51 | else: 52 | raise NotImplementedError 53 | 54 | 55 | if __name__ == '__main__': 56 | # python main.py --json_file=json/config_Charades-STA_I3D_regression.json --mode=train 57 | # python main.py --json_file=json/config_Charades-STA_I3D_anchor.json --mode=train 58 | # python main.py --json_file=json/config_ActivityNet_C3D_regression.json --mode=train 59 | # python main.py --json_file=json/config_ActivityNet_C3D_anchor.json --mode=train 60 | parser = argparse.ArgumentParser() 61 | parser.add_argument('--json_file', type=str, 62 | default='json/config_Charades-STA_I3D_regression.json', required=True) 63 | parser.add_argument('--mode', type=str, 64 | default='train', required=True, choices=['train', 'test']) 65 | parser.add_argument('--checkpoint', type=str, default=None) 66 | args = parser.parse_args() 67 | main(args) 68 | -------------------------------------------------------------------------------- /temporal_grounding/model/backbone/C3D.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | import torch.nn as nn 4 | 5 | 6 | class C3D(nn.Module): 7 | """ 8 | nb_classes: nb_classes in classification task, 101 for UCF101 dataset 9 | """ 10 | 11 | def __init__(self, nb_classes): 12 | super(C3D, self).__init__() 13 | 14 | self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 15 | self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)) 16 | 17 | self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 18 | self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)) 19 | 20 | self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 21 | self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 22 | self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)) 23 | 24 | self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 25 | self.conv4b = 
nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 26 | self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)) 27 | 28 | self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 29 | self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1)) 30 | self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1)) 31 | 32 | self.fc6 = nn.Linear(8192, 4096) 33 | self.fc7 = nn.Linear(4096, 4096) 34 | self.fc8 = nn.Linear(4096, nb_classes) 35 | 36 | self.dropout = nn.Dropout(p=0.5) 37 | 38 | self.relu = nn.ReLU() 39 | 40 | def forward(self, x): 41 | h = self.relu(self.conv1(x)) 42 | h = self.pool1(h) 43 | h = self.relu(self.conv2(h)) 44 | h = self.pool2(h) 45 | 46 | h = self.relu(self.conv3a(h)) 47 | h = self.relu(self.conv3b(h)) 48 | h = self.pool3(h) 49 | 50 | h = self.relu(self.conv4a(h)) 51 | h = self.relu(self.conv4b(h)) 52 | h = self.pool4(h) 53 | 54 | h = self.relu(self.conv5a(h)) 55 | h = self.relu(self.conv5b(h)) 56 | h = self.pool5(h) 57 | 58 | h = h.view(-1, 8192) 59 | h = self.relu(self.fc6(h)) 60 | out = h 61 | # h = self.dropout(h) 62 | # h = self.relu(self.fc7(h)) 63 | # out = h if feature_layer == 7 and out == None else out 64 | # h = self.dropout(h) 65 | # logits = self.fc8(h) 66 | return out 67 | 68 | 69 | 70 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/anchor_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | def generate_anchors(windows): 5 | widths = torch.tensor(windows) 6 | center = 7.5 7 | start = center - 0.5 * (widths - 1) 8 | end = center + 0.5 * (widths - 1) 9 | return torch.stack([start, end], -1) 10 | 11 | 12 | def generate_proposals(max_num_frames, windows): 13 | anchors = generate_anchors(windows) 14 | widths = (anchors[:, 1] - anchors[:, 0] + 1) # [num_anchors] 15 | centers = torch.arange(0, max_num_frames) # [video_len] 16 | start = centers[:, None] - 0.5 * (widths[None, :] - 1) 17 | end = centers[:, None] + 0.5 * (widths[None, :] - 1) 18 | proposals = torch.stack([start, end], -1) # [video_len, num_anchors, 2] 19 | return proposals.view(-1, 2) 20 | 21 | 22 | def calculate_IoU_batch(i0, i1): 23 | union = (torch.min(torch.stack([i0[0], i1[0]], 0), 0)[ 24 | 0], torch.max(torch.stack([i0[1], i1[1]], 0), 0)[0]) 25 | inter = (torch.max(torch.stack([i0[0], i1[0]], 0), 0)[ 26 | 0], torch.min(torch.stack([i0[1], i1[1]], 0), 0)[0]) 27 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 28 | iou[union[1] - union[0] < -1e-5] = 0 29 | iou[iou < 0] = 0.0 30 | return iou 31 | 32 | 33 | def generate_scores(proposals, label, max_num_frames, thres_score): 34 | illegal = torch.logical_or( 35 | proposals[:, 0] < 0, proposals[:, 1] >= max_num_frames) 36 | label1 = label[None, :].repeat(proposals.shape[0], 1) 37 | # label1 = np.repeat(np.expand_dims(label, 0), proposals.shape[0], 0) 38 | IoUs = calculate_IoU_batch((proposals[:, 0], proposals[:, 1]), 39 | (label1[:, 0], label1[:, 1])) 40 | IoUs[illegal] = 0.0 # [video_len * num_anchors] 41 | max_IoU = torch.max(IoUs) 42 | IoUs[IoUs < thres_score * max_IoU] = 0.0 43 | IoUs = IoUs / (max_IoU + 1e-4) 44 | 45 | scores = IoUs.float() 46 | scores_mask = (1 - illegal.float()) 47 | return scores, scores_mask 48 | 49 | 50 | def generate_2d_gaussian(boxes, w, h, delta=0.05): 51 | # boxes: k*4 cxcywh 52 | n_boxes = len(boxes) 53 | ww = torch.linspace(0, 1, w) 54 | hh = torch.linspace(0, 1, h) 55 | 
gridh, gridw = torch.meshgrid(hh, ww) 56 | grid = torch.stack([gridw, gridh], dim=0)[None, ...].repeat( 57 | n_boxes, 1, 1, 1).to(boxes.device) # k*2*h*w 58 | boxes = boxes[..., None, None].repeat(1, 1, h, w) 59 | gaussian = torch.exp(-(boxes[:, 0]-grid[:, 0])**2/(delta*boxes[:, 2]**2)) *\ 60 | torch.exp(-(boxes[:, 1]-grid[:, 1])**2/(delta*boxes[:, 3]**2)) # k*h*w 61 | gaussian[gaussian < 0.05] = 0 62 | return gaussian 63 | -------------------------------------------------------------------------------- /temporal_grounding/utils/losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | def _neg_loss(pred, gt): 6 | pos_inds = gt.eq(1).float() 7 | neg_inds = gt.lt(1).float() 8 | neg_weights = torch.pow(1 - gt, 4) 9 | 10 | loss = 0 11 | pos_loss = torch.log(pred) * torch.pow(1 - pred, 2) * pos_inds 12 | neg_loss = torch.log(1 - pred) * torch.pow(pred, 2) * neg_weights * neg_inds 13 | 14 | num_pos = pos_inds.float().sum() 15 | pos_loss = (pos_loss).sum() 16 | neg_loss = (neg_loss).sum() 17 | 18 | if num_pos == 0: 19 | loss = loss - neg_loss 20 | else: 21 | loss = loss - (pos_loss + neg_loss) / num_pos 22 | return loss 23 | 24 | 25 | class FocalLoss(nn.Module): 26 | '''nn.Module warpper for focal loss''' 27 | def __init__(self): 28 | super(FocalLoss, self).__init__() 29 | self.neg_loss = _neg_loss 30 | 31 | def forward(self, out, target): 32 | return self.neg_loss(out, target) 33 | 34 | 35 | class RegL1Loss(nn.Module): 36 | def __init__(self): 37 | super(RegL1Loss, self).__init__() 38 | 39 | def forward(self, output, ind, target): 40 | pred = output.gather(1, ind) 41 | loss = F.l1_loss(pred, target, size_average=False) 42 | return loss 43 | 44 | def generate_weight(target): 45 | 46 | pos = torch.where(target>0, torch.ones_like(target), torch.zeros_like(target)) 47 | neg = torch.where(target==0, torch.ones_like(target), torch.zeros_like(target)) 48 | # ing = ((torch.gt(target, 0) & torch.lt(target, 1))).float() 49 | 50 | num_pos = torch.sum(pos) 51 | num_neg = torch.sum(neg) 52 | num_total = num_pos + num_neg 53 | 54 | alpha = num_neg / num_total 55 | beta = 1.1 * num_pos / num_total 56 | # target pixel = 1 -> weight beta 57 | # target pixel = 0 -> weight 1-beta 58 | # if num_pos == 0: 59 | # weights = alpha * neg 60 | # else: 61 | weights = alpha * pos + beta * neg 62 | 63 | return weights 64 | 65 | 66 | class IoULoss(nn.Module): 67 | def __init__(self): 68 | super(IoULoss, self).__init__() 69 | 70 | def forward(self, box_a, box_b): 71 | inter_max_xy = torch.min(box_a[:, -1], box_b[:, -1]) 72 | inter_min_xy = torch.max(box_a[:, 0], box_b[:, 0]) 73 | inter = torch.clamp((inter_max_xy - inter_min_xy), min=0) 74 | 75 | # calculate union 76 | union_max_xy = torch.max(box_a[:, -1], box_b[:, -1]) 77 | union_min_xy = torch.min(box_a[:, 0], box_b[:, 0]) 78 | union = torch.clamp((union_max_xy - union_min_xy), min=0) 79 | 80 | iou = inter / (union + 1e-6) 81 | 82 | return 1 - iou.mean() 83 | 84 | class TAGLoss(nn.Module): 85 | def __init__(self): 86 | super(TAGLoss, self).__init__() 87 | 88 | def forward(self, net_outs, gts): 89 | 90 | ac_loss = (-gts*torch.log(net_outs+1e-8)).sum(1) / gts.sum(1) 91 | ac_loss = ac_loss.mean() 92 | 93 | return ac_loss -------------------------------------------------------------------------------- /referring_segmentation/model/module/aspp.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import 
torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | 6 | class ASPP_module(nn.Module): 7 | def __init__(self, inplanes, planes, rate, norm_type='GroupNorm'): 8 | super(ASPP_module, self).__init__() 9 | if rate == 1: 10 | kernel_size = 1 11 | padding = 0 12 | else: 13 | kernel_size = 3 14 | padding = rate 15 | if norm_type=='GroupNorm': 16 | self.bn = nn.GroupNorm(8, planes) 17 | else: 18 | self.bn = nn.BatchNorm2d(planes) 19 | self.atrous_convolution = nn.Conv2d(inplanes, planes, kernel_size=kernel_size, stride=1, padding=padding, dilation=rate, bias=False) 20 | self.relu = nn.ReLU() 21 | 22 | def forward(self, x): 23 | x = self.atrous_convolution(x) 24 | x = self.bn(x) 25 | 26 | return self.relu(x) 27 | 28 | 29 | class ASPP(nn.Module): 30 | def __init__(self, inplanes, planes, rates, norm_type='GroupNorm'): 31 | super(ASPP, self).__init__() 32 | 33 | self.aspp1 = ASPP_module(inplanes, planes, rate=rates[0], norm_type=norm_type) 34 | self.aspp2 = ASPP_module(inplanes, planes, rate=rates[1], norm_type=norm_type) 35 | self.aspp3 = ASPP_module(inplanes, planes, rate=rates[2], norm_type=norm_type) 36 | self.aspp4 = ASPP_module(inplanes, planes, rate=rates[3], norm_type=norm_type) 37 | 38 | self.relu = nn.ReLU() 39 | 40 | if norm_type=='GroupNorm': 41 | self.global_avg_pool = nn.Sequential( 42 | nn.AdaptiveAvgPool2d((1, 1)), 43 | nn.Conv2d(inplanes, planes, 1, stride=1, bias=False), 44 | nn.GroupNorm(8, planes), 45 | nn.ReLU() 46 | ) 47 | self.bn1 = nn.GroupNorm(8, planes) 48 | else: 49 | self.global_avg_pool = nn.Sequential( 50 | nn.AdaptiveAvgPool2d((1, 1)), 51 | nn.Conv2d(inplanes, planes, 1, stride=1, bias=False), 52 | nn.BatchNorm2d(planes), 53 | nn.ReLU() 54 | ) 55 | self.bn1 = nn.BatchNorm2d(planes) 56 | 57 | self.conv1 = nn.Conv2d(planes*5, planes, 1, bias=False) 58 | self.__init_weight() 59 | 60 | def __init_weight(self): 61 | for m in self.modules(): 62 | if isinstance(m, nn.Conv2d): 63 | torch.nn.init.kaiming_normal_(m.weight) 64 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 65 | m.weight.data.fill_(1) 66 | m.bias.data.zero_() 67 | 68 | def forward(self, x): 69 | x1 = self.aspp1(x) 70 | x2 = self.aspp2(x) 71 | x3 = self.aspp3(x) 72 | x4 = self.aspp4(x) 73 | x5 = self.global_avg_pool(x) 74 | x5 = F.interpolate(x5, size=x4.size()[2:], mode='bilinear', align_corners=True) 75 | 76 | x = torch.cat((x1, x2, x3, x4, x5), dim=1) 77 | 78 | x = self.conv1(x) 79 | x = self.bn1(x) 80 | x = self.relu(x) 81 | return x -------------------------------------------------------------------------------- /referring_segmentation/utils/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.distributed as dist 3 | import os 4 | import numpy as np 5 | 6 | 7 | def setup_for_distributed(is_master): 8 | """ 9 | This function disables printing when not in master process 10 | """ 11 | import builtins as __builtin__ 12 | builtin_print = __builtin__.print 13 | 14 | def print(*args, **kwargs): 15 | force = kwargs.pop('force', False) 16 | if is_master or force: 17 | builtin_print(*args, **kwargs) 18 | 19 | __builtin__.print = print 20 | 21 | 22 | def init_distributed_mode(): 23 | if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ: 24 | rank = int(os.environ["RANK"]) 25 | world_size = int(os.environ['WORLD_SIZE']) 26 | gpu = int(os.environ['LOCAL_RANK']) 27 | elif 'SLURM_PROCID' in os.environ: 28 | rank = int(os.environ['SLURM_PROCID']) 29 | gpu = rank % torch.cuda.device_count() 30 | else: 31 | print('Not using distributed mode') 
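        # RANK, WORLD_SIZE and LOCAL_RANK are exported by torch.distributed launchers such as
        # torchrun (or python -m torch.distributed.launch --use_env), and SLURM_PROCID is set
        # inside SLURM jobs; when none of them is present the function falls back to
        # single-process execution here. Note that the SLURM branch above does not assign
        # world_size, so only the environment-variable launch path is fully initialized below.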
32 | return 33 | 34 | torch.cuda.set_device(gpu) 35 | dist_backend = 'nccl' 36 | dist_url = 'env://' 37 | print('| distributed init (rank {}): {}'.format( 38 | rank, dist_url), flush=True) 39 | torch.distributed.init_process_group(backend=dist_backend, init_method=dist_url, 40 | world_size=world_size, rank=rank) 41 | torch.distributed.barrier() 42 | setup_for_distributed(rank == 0) 43 | 44 | 45 | 46 | def calculate_IoU(pred, gt): 47 | SMOOTH = 1e-6 48 | IArea = (pred & (gt == 1.0)).astype(float).sum() 49 | OArea = (pred | (gt == 1.0)).astype(float).sum() 50 | IoU = (IArea + SMOOTH) / (OArea + SMOOTH) 51 | return IoU, IArea, OArea 52 | 53 | 54 | def report_result(preds, labels): 55 | print(len(preds)) 56 | MeanIoU, IArea, OArea, Overlap = [], [], [], [] 57 | for i in range(len(preds)): 58 | iou, iarea, oarea = calculate_IoU(preds[i], labels[i]) 59 | MeanIoU.append(iou) 60 | IArea.append(iarea) 61 | OArea.append(oarea) 62 | Overlap.append(iou) 63 | 64 | prec5, prec6, prec7, prec8, prec9 = np.zeros((len(Overlap), 1)), np.zeros((len(Overlap), 1)), np.zeros((len(Overlap), 1)), \ 65 | np.zeros((len(Overlap), 1)), np.zeros((len(Overlap), 1)) 66 | for i in range(len(Overlap)): 67 | if Overlap[i] >= 0.5: 68 | prec5[i] = 1 69 | if Overlap[i] >= 0.6: 70 | prec6[i] = 1 71 | if Overlap[i] >= 0.7: 72 | prec7[i] = 1 73 | if Overlap[i] >= 0.8: 74 | prec8[i] = 1 75 | if Overlap[i] >= 0.9: 76 | prec9[i] = 1 77 | 78 | mAP_thres_list = list(range(50, 95+1, 5)) 79 | mAP = [] 80 | for i in range(len(mAP_thres_list)): 81 | tmp = np.zeros((len(Overlap), 1)) 82 | for j in range(len(Overlap)): 83 | if Overlap[j] >= mAP_thres_list[i] / 100.0: 84 | tmp[j] = 1 85 | mAP.append(tmp.sum() / tmp.shape[0]) 86 | 87 | return np.mean(np.array(MeanIoU)), np.array(IArea).sum() / np.array(OArea).sum(), \ 88 | prec5.sum() / prec5.shape[0], prec6.sum() / prec6.shape[0], prec7.sum() / prec7.shape[0], \ 89 | prec8.sum() / prec8.shape[0], prec9.sum() / prec9.shape[0], np.mean(np.array(mAP)) 90 | -------------------------------------------------------------------------------- /temporal_grounding/utils/scheduler.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 7 | 8 | from .optimizer import FairseqLRScheduler 9 | from . import register_lr_scheduler 10 | import argparse 11 | 12 | @register_lr_scheduler('inverse_sqrt') 13 | class InverseSquareRootSchedule(FairseqLRScheduler): 14 | """Decay the LR based on the inverse square root of the update number. 15 | 16 | We also support a warmup phase where we linearly increase the learning rate 17 | from some initial learning rate (``--warmup-init-lr``) until the configured 18 | learning rate (``--lr``). Thereafter we decay proportional to the number of 19 | updates, with a decay factor set to align with the configured learning rate. 
20 | 21 | During warmup:: 22 | 23 | lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates) 24 | lr = lrs[update_num] 25 | 26 | After warmup:: 27 | 28 | decay_factor = args.lr * sqrt(args.warmup_updates) 29 | lr = decay_factor / sqrt(update_num) 30 | """ 31 | 32 | def __init__(self, args, optimizer): 33 | super().__init__(args, optimizer) 34 | # if len(args.lr) > 1: 35 | # raise ValueError( 36 | # 'Cannot use a fixed learning rate schedule with inverse_sqrt.' 37 | # ' Consider --lr-scheduler=fixed instead.' 38 | # ) 39 | warmup_end_lr = args.lr 40 | if args.warmup_init_lr < 0: 41 | args.warmup_init_lr = warmup_end_lr 42 | 43 | # linearly warmup for the first args.warmup_updates 44 | self.lr_step = (warmup_end_lr - args.warmup_init_lr) / \ 45 | args.warmup_updates 46 | 47 | # then, decay prop. to the inverse square root of the update number 48 | self.decay_factor = warmup_end_lr * args.warmup_updates**0.5 49 | 50 | # initial learning rate 51 | self.lr = args.warmup_init_lr 52 | self.optimizer.set_lr(self.lr) 53 | 54 | @staticmethod 55 | def add_args(parser): 56 | """Add arguments to the parser for this LR scheduler.""" 57 | # fmt: off 58 | parser.add_argument('--warmup-updates', default=300, type=int, metavar='N', 59 | help='warmup the learning rate linearly for the first N updates') 60 | parser.add_argument('--warmup-init-lr', default=1e-06, type=float, metavar='LR', 61 | help='initial learning rate during warmup phase; default is args.lr') 62 | # fmt: on 63 | 64 | def step(self, epoch, val_loss=None): 65 | """Update the learning rate at the end of the given epoch.""" 66 | super().step(epoch, val_loss) 67 | # we don't change the learning rate at epoch boundaries 68 | return self.optimizer.get_lr() 69 | 70 | def step_update(self, num_updates): 71 | """Update the learning rate after each update.""" 72 | if num_updates < self.args.warmup_updates: 73 | self.lr = self.args.warmup_init_lr + num_updates*self.lr_step 74 | else: 75 | self.lr = self.decay_factor * num_updates**-0.5 76 | self.optimizer.set_lr(self.lr) 77 | return self.lr 78 | -------------------------------------------------------------------------------- /temporal_grounding/model/module/attention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from einops import rearrange, repeat 5 | from torch import einsum 6 | import warnings 7 | warnings.filterwarnings("ignore") 8 | 9 | 10 | class GlobalTextPresentation(nn.Module): 11 | def __init__(self, text_dim): 12 | super(GlobalTextPresentation, self).__init__() 13 | self.W_txt = nn.Linear(text_dim, text_dim) 14 | 15 | def forward(self, fea_text, mask=None): 16 | fea_text = fea_text 17 | weight_text = self.W_txt(fea_text) # B*L*C 18 | if mask is not None: 19 | mask = mask.permute(0, 2, 1) 20 | weight_text = weight_text.masked_fill(mask == 0, -1e9) 21 | weight_text = weight_text.softmax(dim=1) 22 | weight_text_global_out = weight_text.mean(dim=2) # B*L 23 | fea_text_global = fea_text * weight_text 24 | fea_text_global = fea_text_global.sum(dim=1, keepdim=True) # B*C*1*1 25 | return fea_text_global, weight_text_global_out 26 | 27 | 28 | class Attention(nn.Module): 29 | def __init__(self, videodim, textdim, attentiondim, groups): 30 | super(Attention, self).__init__() 31 | 32 | self.groups = groups 33 | self.q = nn.Linear(textdim, attentiondim) 34 | self.kv = nn.Linear(videodim, 2*attentiondim) 35 | 36 | def forward(self, videofea, textfea): 37 | videofea = 
videofea.permute(0, 2, 1) # b*t*c 38 | q = self.q(textfea) # b*l*c 39 | kv = self.kv(videofea) 40 | k, v = kv.chunk(2, dim=-1) 41 | q = rearrange(q, 'b l (g d) -> b g l d', g=self.groups) 42 | k = rearrange(k, 'b t (g d) -> b g t d', g=self.groups) 43 | v = rearrange(v, 'b t (g d) -> b g t d', g=self.groups) 44 | A = einsum('bgld,bgtd->bglt', [q, k] 45 | ).mean(dim=2, keepdim=True) # b*g*l*t 46 | 47 | att = torch.sigmoid(A) 48 | out = v.permute(0, 1, 3, 2) * att # b*g*d*t 49 | out = rearrange(out, 'b g d t -> b (g d) t') 50 | return A.mean(dim=[1, 2]), out 51 | 52 | 53 | class MutanFusion(nn.Module): 54 | def __init__(self, input_dim, out_dim, num_layers): 55 | super(MutanFusion, self).__init__() 56 | self.input_dim = input_dim 57 | self.out_dim = out_dim 58 | self.num_layers = num_layers 59 | 60 | hv = [] 61 | for i in range(self.num_layers): 62 | do = nn.Dropout(p=0.5) 63 | lin = nn.Linear(input_dim, out_dim) 64 | 65 | hv.append(nn.Sequential(do, lin, nn.Tanh())) 66 | # 67 | self.image_transformation_layers = nn.ModuleList(hv) 68 | # 69 | hq = [] 70 | for i in range(self.num_layers): 71 | do = nn.Dropout(p=0.5) 72 | lin = nn.Linear(input_dim, out_dim) 73 | hq.append(nn.Sequential(do, lin, nn.Tanh())) 74 | # 75 | self.ques_transformation_layers = nn.ModuleList(hq) 76 | 77 | def forward(self, ques_emb, img_emb): 78 | # Pdb().set_trace() 79 | batch_size = img_emb.size()[0] 80 | x_mm = [] 81 | for i in range(self.num_layers): 82 | x_hv = img_emb 83 | x_hv = self.image_transformation_layers[i](x_hv) 84 | 85 | x_hq = ques_emb 86 | x_hq = self.ques_transformation_layers[i](x_hq) 87 | x_mm.append(torch.mul(x_hq, x_hv)) 88 | 89 | x_mm = torch.stack(x_mm, dim=1) 90 | x_mm = x_mm.sum(1).view(batch_size, img_emb.shape[1], self.out_dim) 91 | x_mm = F.tanh(x_mm) 92 | return x_mm 93 | -------------------------------------------------------------------------------- /temporal_grounding/model/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from .module.attention import GlobalTextPresentation 5 | from .module.RefTransformer import RefTransformer 6 | from .decoder import AnchorBasedDecoder, RegressionDecoder 7 | 8 | 9 | class TextEncoder(nn.Module): 10 | def __init__(self, config): 11 | self.config = config 12 | super(TextEncoder, self).__init__() 13 | if config['gru_bidirection']: 14 | self.backbone = nn.GRU( 15 | config['embedding_dim'], config['attention_dim'], batch_first=True, bidirectional=True) 16 | else: 17 | self.backbone = nn.GRU( 18 | config['embedding_dim'], config['attention_dim'], batch_first=True, bidirectional=False) 19 | self.dropout0 = nn.Dropout(config['dropout']) 20 | self.dropout = nn.Dropout(config['dropout']) 21 | 22 | def forward(self, text, embedding_length): 23 | text = torch.nn.utils.rnn.pack_padded_sequence( 24 | text, list(embedding_length), True, enforce_sorted=False) 25 | word_embedding, _ = self.backbone(text) 26 | word_embedding, _ = torch.nn.utils.rnn.pad_packed_sequence( 27 | word_embedding, True) 28 | if self.config['gru_bidirection']: 29 | word_embedding = word_embedding.view( 30 | word_embedding.shape[0], word_embedding.shape[1], 2, -1) 31 | word_embedding = torch.mean(word_embedding, dim=2) 32 | return word_embedding 33 | 34 | 35 | class Model(nn.Module): 36 | def __init__(self, config): 37 | 38 | super(Model, self).__init__() 39 | self.config = config 40 | self.text_encoder = TextEncoder(config) 41 | self.video_encoder = nn.Linear( 42 | 
config['video_fea_dim'], config['attention_dim']) 43 | self.global_text = GlobalTextPresentation(config['attention_dim']) 44 | self.pos_embedding = nn.Parameter(torch.randn( 45 | 1, config['attention_dim'], config['segment_num'])) 46 | self.prenorm = nn.LayerNorm(config['attention_dim']) 47 | 48 | self.TCN = RefTransformer(config) 49 | if config['decoder_type'] == 'anchor_based': 50 | self.decoder = AnchorBasedDecoder(config) 51 | elif config['decoder_type'] == 'regression': 52 | self.decoder = RegressionDecoder(config) 53 | 54 | def forward(self, video_fea, embedding, embedding_length, score=None, gt_reg=None, score_mask=None, score_nm=None, proposals=None, adj_mat=None, mode='train'): 55 | text_feal = self.text_encoder( 56 | embedding.float(), embedding_length) # b*l*c 57 | embedding_mask = torch.zeros((text_feal.shape[0], 1, text_feal.shape[1])).to( 58 | text_feal.device) 59 | 60 | for b in range(embedding_mask.shape[0]): 61 | embedding_mask[b, :, :int(embedding_length[b])] = 1 62 | text_feag, text_weight = self.global_text( 63 | text_feal, embedding_mask) # b*1*d 64 | 65 | video_fea = self.video_encoder(video_fea.float()) # b*c*t 66 | if self.config['with_text']: 67 | video_fea = video_fea + text_feag 68 | 69 | video_fea = video_fea.permute(0, 2, 1) 70 | out_fea, weights = self.TCN( 71 | video_fea, text_feag, self.pos_embedding, embedding_mask) # b*c*t 72 | if self.config['decoder_type'] == 'anchor_based': 73 | return self.decoder(out_fea, weights, score, gt_reg, score_mask, score_nm, proposals, adj_mat, mode) 74 | elif self.config['decoder_type'] == 'regression': 75 | return self.decoder(out_fea, weights, gt_reg, score_nm, mode) 76 | 77 | 78 | if __name__ == '__main__': 79 | import json 80 | feav = torch.randn((4, 1024, 100)) 81 | feat = torch.randn((4, 20, 300)) 82 | with open('../json/config.json') as f: 83 | config = json.load(f)['model_config'] 84 | model = Model(config) 85 | fea, weight = model(feav, feat, torch.tensor([15, 13, 12, 10])) 86 | print(fea.shape) 87 | print(len(weight)) 88 | -------------------------------------------------------------------------------- /temporal_grounding/model/module/RefTransformer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from .attention import Attention 5 | 6 | 7 | class RefTransformer(nn.Module): 8 | def __init__(self, config): 9 | super(RefTransformer, self).__init__() 10 | self.config = config 11 | self.padding_type = config['padding_type'] 12 | self.conv = nn.ModuleList() 13 | self.dilations = [] 14 | self.attention = nn.ModuleList() 15 | self.weight_conv = nn.ModuleList() 16 | self.prenorms = nn.ModuleList() 17 | for i in range(config['layer_num']): 18 | dilation = torch.pow(torch.tensor(2), i) 19 | dilation = int(dilation) 20 | # dilation = i+1 21 | self.prenorms.append(nn.LayerNorm(config['attention_dim'])) 22 | self.dilations.append(dilation) 23 | if config['with_attention']: 24 | self.attention.append(Attention( 25 | config['attention_dim'], config['attention_dim'], config['attention_dim'], groups=config['groups'])) 26 | 27 | if config['with_mlp']: 28 | self.conv.append( 29 | nn.Sequential( 30 | nn.Conv1d(config['attention_dim'], config['MLP_dim'], 31 | 3, 1, dilation=dilation, padding=0, bias=False), 32 | nn.GroupNorm(4, config['MLP_dim']), 33 | nn.Dropout(config['dropout']), 34 | nn.ReLU(), 35 | nn.Conv1d(config['MLP_dim'], 36 | config['attention_dim'], 1, 1, bias=False), 37 | nn.GroupNorm(4, 
config['attention_dim']) 38 | ) 39 | ) 40 | 41 | # self.__init_weight() 42 | 43 | def __init_weight(self): 44 | for m in self.modules(): 45 | if isinstance(m, nn.Conv1d): 46 | torch.nn.init.kaiming_normal_(m.weight) 47 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 48 | m.weight.data.fill_(1) 49 | m.bias.data.zero_() 50 | 51 | def forward(self, fea, text_fea, position, mask=None): 52 | fea = fea + position 53 | weights = [] 54 | for i in range(len(self.attention)): 55 | if self.config['prenorm']: 56 | fea = self.prenorms[i](fea.permute(0, 2, 1)).permute(0, 2, 1) 57 | res0 = fea 58 | if self.config['with_attention']: 59 | weight, fea = self.attention[i](fea, text_fea) 60 | fea = fea + res0 61 | res1 = fea 62 | weights.append(weight) 63 | if self.config['with_mlp']: 64 | if self.padding_type == 'circle': 65 | fea = circle_padding(self.dilations[i], fea) 66 | elif self.padding_type == 'zero': 67 | fea = F.pad( 68 | fea, (self.dilations[i], self.dilations[i]), mode='constant', value=0) 69 | else: 70 | fea = F.pad( 71 | fea, (self.dilations[i], self.dilations[i]), mode='replicate') 72 | fea = self.conv[i](fea) 73 | fea = res1 + fea 74 | return fea, weights 75 | 76 | 77 | def circle_padding(padding, feature): 78 | length_times = feature.shape[-1] 79 | index = list(range(0, length_times)) + list(range(length_times - 2, 0, -1)) 80 | total_num = 2 * padding + length_times 81 | num_c = padding // len(index) 82 | if num_c * len(index) < padding: 83 | num_c = num_c + 1 84 | expand_number = num_c * len(index) - padding 85 | index_f = [] 86 | for n in range(num_c): 87 | index = index + index + index 88 | for i in range(expand_number, expand_number + total_num): 89 | index_f.append(index[i]) 90 | 91 | feas = [] 92 | for idf in index_f: 93 | feas.append(feature[:, :, idf]) 94 | feas = torch.stack(feas, dim=2) 95 | return feas 96 | -------------------------------------------------------------------------------- /temporal_grounding/utils/tester.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import os 3 | import time 4 | import datetime 5 | from torch.utils import data 6 | import torch.nn.functional as F 7 | from tqdm import tqdm 8 | import numpy as np 9 | from utils.utils import CountMeter, compute_IoU_recall 10 | import collections 11 | 12 | 13 | class Tester(object): 14 | def __init__(self, config): 15 | self.config = config 16 | self.checkpoint = config['checkpoint'] 17 | assert os.path.exists(self.checkpoint), 'incorrect checkpoint path!' 
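    # Usage sketch (illustrative): config is the flat JSON loaded by temporal_grounding/main.py,
    # and config['checkpoint'] comes either from that JSON or from the --checkpoint flag, e.g.
    #   python main.py --json_file=json/config_Charades-STA_I3D_anchor.json --mode=test --checkpoint=<path/to/checkpoint.pth>
    # The checkpoint file is expected to hold a dict with a 'state_dict' entry, since test() below
    # restores weights via self.model.module.load_state_dict(checkpoint['state_dict']) and therefore
    # assumes the model was wrapped in nn.DataParallel (the cuda=True path in main.py).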
18 | 19 | def model_info(self): 20 | print(self.model) 21 | num_params = 0 22 | for p in self.model.parameters(): 23 | num_params += p.numel() # numel() returns the number of elements in a tensor 24 | print("The total number of parameters: {}".format(num_params)) 25 | 26 | def test(self, model, dataset): 27 | self.model = model 28 | self.model.eval() 29 | self.model_info() 30 | print('loading checkpoint ....') 31 | checkpoint = torch.load(self.checkpoint) 32 | self.model.module.load_state_dict(checkpoint['state_dict']) 33 | loader = data.DataLoader( 34 | dataset, self.config['batch_size'], False, num_workers=8) 35 | meters_5 = collections.defaultdict(lambda: CountMeter()) 36 | recall_metrics = (1, 5) 37 | iou_metrics = (0.1, 0.3, 0.5, 0.7) 38 | table = [['Rank@{},mIoU@{}'.format(i, j) 39 | for i in recall_metrics for j in iou_metrics]] 40 | 41 | for i, data_batch in tqdm(enumerate(loader), total=len(loader)): 42 | fea, embedding, score, score_mask, embedding_length, label, proposals, score_nm, adj_mat = \ 43 | data_batch['feat'], data_batch['embedding'], data_batch['score'], data_batch['score_mask'], \ 44 | data_batch['embedding_length'], data_batch['label'], data_batch['proposals'], data_batch['score_nm'], \ 45 | data_batch['adj_mat'] 46 | if self.config['cuda']: 47 | fea, embedding, score, score_mask, label, proposals, score_nm, adj_mat = \ 48 | fea.cuda(), embedding.cuda(), score.cuda(), score_mask.cuda(), label.cuda(), proposals.cuda(), \ 49 | score_nm.cuda(), adj_mat.cuda() 50 | 51 | with torch.no_grad(): 52 | predict_boxes, score = self.model(fea, embedding, embedding_length, score, 53 | label, score_mask, score_nm, proposals, adj_mat, 'test') 54 | predict_boxes_old = np.round( 55 | predict_boxes.cpu().numpy()).astype(np.int32) 56 | for k in range(predict_boxes.shape[0]): 57 | gt_boxes = label[k] 58 | predict_boxes = predict_boxes_old[k] 59 | predict_flatten = score[k] 60 | gt_starts, gt_ends = gt_boxes[0], gt_boxes[1] 61 | predict_starts, predict_ends = predict_boxes[:, 62 | 0], predict_boxes[:, 1] 63 | predict_starts[predict_starts < 0] = 0 64 | seq_len = self.config['segment_num'] 65 | predict_ends[predict_ends >= seq_len] = seq_len - 1 66 | predict_flatten = predict_flatten.cpu().numpy() 67 | predict_boxes[:, 0], predict_boxes[:, 68 | 1] = predict_starts, predict_ends 69 | 70 | topn_IoU_matric = compute_IoU_recall( 71 | predict_flatten, predict_boxes, gt_boxes) 72 | meters_5['mIoU'].update(topn_IoU_matric, 1) 73 | 74 | IoU_threshs = [0.1, 0.3, 0.5, 0.7] 75 | top_n_list = [1, 5] 76 | topn_IoU_matric, count = meters_5['mIoU'].val, meters_5['mIoU'].count 77 | for i in range(2): 78 | for j in range(4): 79 | print('{}, {:.4f}'.format('IoU@' + str(top_n_list[i]) + '@' + str(IoU_threshs[j]), 80 | topn_IoU_matric[i, j] / count), end=' | ') 81 | -------------------------------------------------------------------------------- /referring_segmentation/utils/video_reader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import h5py 3 | import numpy as np 4 | 5 | 6 | def clip_annotation_reader(images_path, annotations_path, instances, clip_size=7, annotation_center=False, dataset='A2D'): 7 | 8 | datas = [] 9 | frames = os.listdir(images_path) 10 | frames.sort() 11 | annotations = os.listdir(annotations_path) 12 | annotations.sort() 13 | for annotation in annotations: 14 | name = annotation.split('.')[0] 15 | name_int = int(name) 16 | with h5py.File(os.path.join(annotations_path, annotation)) as label: 17 | instances_anno = list(label['instance'][:]) 18 | for instance in
instances.keys(): 19 | # if instance < int(instance_num): 20 | if int(instance) in instances_anno: 21 | step = 1 22 | if not annotation_center: 23 | range_frames = step * np.random.randint(- (clip_size - 1), 1) 24 | else: 25 | range_frames = - step * (clip_size // 2) 26 | 27 | initial_frame = name_int + range_frames 28 | data = {} 29 | data['dataset'] = dataset 30 | data['video'] = images_path.split('/')[-1] 31 | data['frames'] = [] 32 | data['label'] = [] 33 | data['instance'] = instance 34 | data['sentence'] = instances[instance] 35 | annotation_num = 0 36 | for i in range(0, clip_size * step, step): 37 | n_frame = initial_frame + i - 1 38 | if n_frame < 0: 39 | n_frame = 0 40 | elif n_frame >= len(frames): 41 | n_frame = len(frames) - 1 42 | data['frames'].append(os.path.join(images_path, frames[n_frame])) 43 | 44 | is_anno = 0 45 | for anno in annotations: 46 | if frames[n_frame].split('.')[0] == anno.split('.')[0]: 47 | data['label'].append(os.path.join(annotations_path, anno)) 48 | annotation_num += 1 49 | is_anno += 1 50 | 51 | if is_anno == 0: 52 | data['label'].append('None') 53 | if annotation_num > 0: 54 | datas.append(data) 55 | return datas 56 | 57 | 58 | def sequence_reader(images_path, annotations_path, instances, dataset='A2D'): 59 | 60 | datas = [] 61 | frames = [f for f in os.listdir(images_path) if '.png' in f] 62 | frames.sort() 63 | annotations = os.listdir(annotations_path) 64 | annotations.sort() 65 | for instance in instances.keys(): 66 | data = {} 67 | data['dataset'] = dataset 68 | data['video'] = images_path.split('/')[-1] 69 | data['frames'] = [] 70 | data['label'] = [] 71 | data['sentence'] = instances[instance] 72 | data['instance'] = instance 73 | for frame in frames: 74 | data['frames'].append(os.path.join(images_path, frame)) 75 | name = frame.split('.')[0] 76 | is_annotated = 0 77 | if dataset == 'A2D': 78 | for annotation in annotations: 79 | if annotation.split('.')[0] == name: 80 | is_annotated += 1 81 | with h5py.File(os.path.join(annotations_path, annotation)) as label: 82 | instances_anno = list(label['instance'][:]) 83 | if int(instance) in instances_anno: 84 | data['label'].append(os.path.join(annotations_path, annotation)) 85 | else: 86 | data['label'].append('None') 87 | if is_annotated == 0: 88 | data['label'].append('None') 89 | elif dataset == 'JHMDB': 90 | data['label'].append(os.path.join(annotations_path, annotations[0])) 91 | 92 | datas.append(data) 93 | return datas 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/optim.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 2 | """Collections of utilities related to optimization.""" 3 | from bisect import bisect_right 4 | 5 | import torch 6 | 7 | 8 | def update_ema(model, model_ema, decay): 9 | """Apply exponential moving average update. 
10 | 11 | The weights are updated in-place as follow: 12 | w_ema = w_ema * decay + (1 - decay) * w 13 | Args: 14 | model: active model that is being optimized 15 | model_ema: running average model 16 | decay: exponential decay parameter 17 | """ 18 | with torch.no_grad(): 19 | if hasattr(model, "module"): 20 | # unwrapping DDP 21 | model = model.module 22 | msd = model.state_dict() 23 | for k, ema_v in model_ema.state_dict().items(): 24 | model_v = msd[k].detach() 25 | ema_v.copy_(ema_v * decay + (1.0 - decay) * model_v) 26 | 27 | 28 | def adjust_learning_rate( 29 | optimizer, 30 | epoch: int, 31 | curr_step: int, 32 | num_training_steps: int, 33 | args, 34 | ): 35 | """Adjust the lr according to the schedule. 36 | 37 | Args: 38 | Optimizer: torch optimizer to update. 39 | epoch(int): number of the current epoch. 40 | curr_step(int): number of optimization step taken so far. 41 | num_training_step(int): total number of optimization steps. 42 | args: additional training dependent args: 43 | - lr_drop(int): number of epochs before dropping the learning rate. 44 | - fraction_warmup_steps(float) fraction of steps over which the lr will be increased to its peak. 45 | - lr(float): base learning rate 46 | - lr_backbone(float): learning rate of the backbone 47 | - text_encoder_backbone(float): learning rate of the text encoder 48 | - schedule(str): the requested learning rate schedule: 49 | "step": all lrs divided by 10 after lr_drop epochs 50 | "multistep": divided by 2 after lr_drop epochs, then by 2 after every 50 epochs 51 | "linear_with_warmup": same as "step" for backbone + transformer, but for the text encoder, linearly 52 | increase for a fraction of the training, then linearly decrease back to 0. 53 | "all_linear_with_warmup": same as "linear_with_warmup" for all learning rates involved. 
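            A worked example may help (the numbers are illustrative, not defaults): with
            schedule="step", lr_drop=10, lr=1e-4 and lr_backbone=1e-5, epochs 0-9 keep
            gamma = 0.1 ** 0 = 1.0, while epochs 10-19 use gamma = 0.1 ** 1 = 0.1, i.e. the
            transformer lr drops to 1e-5 and the backbone lr to 1e-6. For the "*_with_warmup"
            schedules, num_warmup_steps = round(fraction_warmup_steps * num_training_steps);
            with 1000 training steps and fraction_warmup_steps=0.1 the affected learning rates
            ramp up linearly over the first 100 steps, then decay linearly to 0 at step 1000.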
54 | 55 | """ 56 | num_warmup_steps: int = round(args.fraction_warmup_steps * num_training_steps) 57 | if args.schedule == "step": 58 | gamma = 0.1 ** (epoch // args.lr_drop) 59 | text_encoder_gamma = gamma 60 | elif args.schedule == "multistep": 61 | milestones = list(range(args.lr_drop, args.epochs, 50)) 62 | gamma = 0.5 ** bisect_right(milestones, epoch) 63 | text_encoder_gamma = gamma 64 | elif args.schedule == "linear_with_warmup": 65 | gamma = 0.1 ** (epoch // args.lr_drop) 66 | if curr_step < num_warmup_steps: 67 | text_encoder_gamma = float(curr_step) / float(max(1, num_warmup_steps)) 68 | else: 69 | text_encoder_gamma = max( 70 | 0.0, 71 | float(num_training_steps - curr_step) 72 | / float(max(1, num_training_steps - num_warmup_steps)), 73 | ) 74 | elif args.schedule == "all_linear_with_warmup": 75 | if curr_step < num_warmup_steps: 76 | text_encoder_gamma = float(curr_step) / float(max(1, num_warmup_steps)) 77 | else: 78 | text_encoder_gamma = max( 79 | 0.0, 80 | float(num_training_steps - curr_step) 81 | / float(max(1, num_training_steps - num_warmup_steps)), 82 | ) 83 | gamma = text_encoder_gamma 84 | else: 85 | raise NotImplementedError 86 | 87 | base_lrs = [args.lr, args.lr_backbone, args.text_encoder_lr] 88 | gammas = [gamma, gamma, text_encoder_gamma] 89 | assert len(optimizer.param_groups) == len(base_lrs) 90 | for param_group, lr, gamma_group in zip(optimizer.param_groups, base_lrs, gammas): 91 | param_group["lr"] = lr * gamma_group 92 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/RefTransformer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from models.module.attention import RelevanceFilter 5 | from models.utils import temporal_separate_to_stack, temporal_stacked_to_separate 6 | 7 | 8 | class RefTransformer(nn.Module): 9 | def __init__(self, text_dim, inchannel, hidden_channel, outchannel, layers=8, padding_type='circle', groups=8, dropout=0.1): 10 | super(RefTransformer, self).__init__() 11 | self.padding_type = padding_type 12 | self.conv_time = nn.ModuleList() 13 | self.conv_spatial = nn.ModuleList() 14 | self.conv_convert = nn.ModuleList() 15 | self.dropout1 = nn.ModuleList() 16 | self.dropout2 = nn.ModuleList() 17 | self.dilations = [] 18 | self.local_attention = nn.ModuleList() 19 | for i in range(layers): 20 | dilation = torch.pow(torch.tensor(2), i) 21 | dilation = int(dilation) 22 | self.dilations.append(dilation) 23 | self.local_attention.append(RelevanceFilter( 24 | text_dim, inchannel, inchannel, groups, (1, 1, 1), (1, 1, 1), phase='3D')) 25 | 26 | self.conv_spatial.append( 27 | nn.Sequential( 28 | nn.Conv3d(inchannel, hidden_channel, (1, 3, 3), 29 | 1, (0, 1, 1), (1, 1, 1), bias=False), 30 | nn.GroupNorm(4, hidden_channel), 31 | nn.ReLU(inplace=True) 32 | ) 33 | ) 34 | 35 | self.conv_time.append( 36 | nn.Sequential( 37 | nn.Conv3d(hidden_channel, hidden_channel, (3, 1, 1), 38 | (1, 1, 1), (0, 0, 0), (dilation, 1, 1), bias=False), 39 | nn.GroupNorm(4, hidden_channel), 40 | nn.ReLU(inplace=True) 41 | ) 42 | ) 43 | 44 | self.conv_convert.append( 45 | nn.Sequential( 46 | nn.Conv3d(hidden_channel, outchannel, 1, 1, bias=False), 47 | nn.GroupNorm(4, outchannel) 48 | ) 49 | ) 50 | self.dropout1.append(nn.Dropout(dropout)) 51 | self.dropout2.append(nn.Dropout(dropout)) 52 | self.__init_weight() 53 | 54 | def __init_weight(self): 55 | for m in self.modules(): 56 | if 
isinstance(m, nn.Conv3d) or isinstance(m, nn.Conv2d): 57 | torch.nn.init.kaiming_normal_(m.weight) 58 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 59 | m.weight.data.fill_(1) 60 | m.bias.data.zero_() 61 | 62 | def forward(self, fea, fea_text, frame_mask, durations): 63 | maps_layers = [] 64 | for i in range(len(self.conv_time)): 65 | res0 = fea 66 | maps, fea = self.local_attention[i](fea, fea_text, frame_mask) 67 | maps = temporal_separate_to_stack(maps.transpose(1, 2), durations) 68 | maps_layers.append(maps) 69 | fea = res0 + self.dropout1[i](fea) # [current] 70 | res1 = fea 71 | fea = self.conv_spatial[i](fea) 72 | 73 | if self.padding_type == 'circle': 74 | fea = circle_padding(self.dilations[i], fea) 75 | elif self.padding_type == 'zero': 76 | fea = F.pad( 77 | fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='constant', value=0) 78 | else: 79 | fea = F.pad( 80 | fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='circular') 81 | 82 | fea = self.conv_time[i](fea) # B*C*T 83 | 84 | fea = self.conv_convert[i](fea) 85 | fea = res1 + self.dropout2[i](fea) 86 | return fea, maps_layers 87 | 88 | 89 | def circle_padding(padding, feature): 90 | length_times = feature.shape[2] 91 | index = list(range(0, length_times)) + list(range(length_times - 2, 0, -1)) 92 | total_num = 2 * padding + length_times 93 | num_c = padding // len(index) 94 | if num_c * len(index) < padding: 95 | num_c = num_c + 1 96 | expand_number = num_c * len(index) - padding 97 | index_f = [] 98 | for n in range(num_c): 99 | index = index + index + index 100 | for i in range(expand_number, expand_number + total_num): 101 | index_f.append(index[i]) 102 | 103 | feas = [] 104 | for idf in index_f: 105 | feas.append(feature[:, :, idf, :, :].unsqueeze(2)) 106 | feas = torch.cat(feas, dim=2) 107 | return feas 108 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/box_ops.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 2 | # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved 3 | """ 4 | Utilities for bounding box manipulation and GIoU. 5 | """ 6 | import torch 7 | import numpy as np 8 | from torchvision.ops.boxes import box_area 9 | from typing import Tuple 10 | 11 | #### Bounding box utilities imported from torchvision and converted to numpy 12 | def np_box_area(boxes: np.array) -> np.array: 13 | """ 14 | Computes the area of a set of bounding boxes, which are specified by its 15 | (x1, y1, x2, y2) coordinates. 16 | 17 | Args: 18 | boxes (Tensor[N, 4]): boxes for which the area will be computed. They 19 | are expected to be in (x1, y1, x2, y2) format with 20 | ``0 <= x1 < x2`` and ``0 <= y1 < y2``. 
21 | 22 | Returns: 23 | area (Tensor[N]): area for each box 24 | """ 25 | assert boxes.ndim == 2 and boxes.shape[-1] == 4 26 | return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) 27 | 28 | 29 | # implementation from https://github.com/kuangliu/torchcv/blob/master/torchcv/utils/box.py 30 | # with slight modifications 31 | def _box_inter_union(boxes1: np.array, boxes2: np.array) -> Tuple[np.array, np.array]: 32 | area1 = np_box_area(boxes1) 33 | area2 = np_box_area(boxes2) 34 | 35 | lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] 36 | rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] 37 | 38 | wh = (rb - lt).clip(min=0) # [N,M,2] 39 | inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] 40 | 41 | union = area1[:, None] + area2 - inter 42 | 43 | return inter, union 44 | 45 | 46 | def np_box_iou(boxes1: np.array, boxes2: np.array) -> np.array: 47 | """ 48 | Return intersection-over-union (Jaccard index) of boxes. 49 | 50 | Both sets of boxes are expected to be in ``(x1, y1, x2, y2)`` format with 51 | ``0 <= x1 < x2`` and ``0 <= y1 < y2``. 52 | 53 | Args: 54 | boxes1 (Tensor[N, 4]) 55 | boxes2 (Tensor[M, 4]) 56 | 57 | Returns: 58 | iou (Tensor[N, M]): the NxM matrix containing the pairwise IoU values for every element in boxes1 and boxes2 59 | """ 60 | inter, union = _box_inter_union(boxes1, boxes2) 61 | iou = inter / union 62 | return iou 63 | 64 | 65 | def box_cxcywh_to_xyxy(x): 66 | x_c, y_c, w, h = x.unbind(-1) 67 | b = [(x_c - 0.5 * w), (y_c - 0.5 * h), (x_c + 0.5 * w), (y_c + 0.5 * h)] 68 | return torch.stack(b, dim=-1) 69 | 70 | 71 | def box_xyxy_to_cxcywh(x): 72 | x0, y0, x1, y1 = x.unbind(-1) 73 | b = [(x0 + x1) / 2, (y0 + y1) / 2, (x1 - x0), (y1 - y0)] 74 | return torch.stack(b, dim=-1) 75 | 76 | 77 | # modified from torchvision to also return the union 78 | def box_iou(boxes1, boxes2): 79 | area1 = box_area(boxes1) 80 | area2 = box_area(boxes2) 81 | 82 | lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] 83 | rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] 84 | 85 | wh = (rb - lt).clamp(min=0) # [N,M,2] 86 | inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] 87 | 88 | union = area1[:, None] + area2 - inter 89 | 90 | iou = inter / union 91 | return iou, union 92 | 93 | 94 | def generalized_box_iou(boxes1, boxes2): 95 | """ 96 | Generalized IoU from https://giou.stanford.edu/ 97 | 98 | The boxes should be in [x0, y0, x1, y1] format 99 | 100 | Returns a [N, M] pairwise matrix, where N = len(boxes1) 101 | and M = len(boxes2) 102 | """ 103 | # degenerate boxes gives inf / nan results 104 | # so do an early check 105 | assert (boxes1[:, 2:] >= boxes1[:, :2]).all() 106 | assert (boxes2[:, 2:] >= boxes2[:, :2]).all() 107 | iou, union = box_iou(boxes1, boxes2) 108 | 109 | lt = torch.min(boxes1[:, None, :2], boxes2[:, :2]) 110 | rb = torch.max(boxes1[:, None, 2:], boxes2[:, 2:]) 111 | 112 | wh = (rb - lt).clamp(min=0) # [N,M,2] 113 | area = wh[:, :, 0] * wh[:, :, 1] 114 | 115 | return iou - (area - union) / area 116 | 117 | 118 | def masks_to_boxes(masks): 119 | """Compute the bounding boxes around the provided masks 120 | 121 | The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions. 
122 | 123 | Returns a [N, 4] tensors, with the boxes in xyxy format 124 | """ 125 | if masks.numel() == 0: 126 | return torch.zeros((0, 4), device=masks.device) 127 | 128 | h, w = masks.shape[-2:] 129 | 130 | y = torch.arange(0, h, dtype=torch.float) 131 | x = torch.arange(0, w, dtype=torch.float) 132 | y, x = torch.meshgrid(y, x) 133 | 134 | x_mask = masks * x.unsqueeze(0) 135 | x_max = x_mask.flatten(1).max(-1)[0] 136 | x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0] 137 | 138 | y_mask = masks * y.unsqueeze(0) 139 | y_max = y_mask.flatten(1).max(-1)[0] 140 | y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0] 141 | 142 | return torch.stack([x_min, y_min, x_max, y_max], 1) 143 | -------------------------------------------------------------------------------- /temporal_grounding/utils/optimizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 7 | 8 | import torch 9 | import math 10 | 11 | class FairseqOptimizer(object): 12 | 13 | def __init__(self, args, params): 14 | super().__init__() 15 | self.args = args 16 | self.params = list(params) 17 | 18 | @staticmethod 19 | def add_args(parser): 20 | """Add optimizer-specific arguments to the parser.""" 21 | pass 22 | 23 | @property 24 | def optimizer(self): 25 | """Return a torch.optimizer.optimizer.Optimizer instance.""" 26 | if not hasattr(self, '_optimizer'): 27 | raise NotImplementedError 28 | if not isinstance(self._optimizer, torch.optim.Optimizer): 29 | raise ValueError('_optimizer must be an instance of torch.optimizer.Optimizer') 30 | return self._optimizer 31 | 32 | @property 33 | def optimizer_config(self): 34 | """ 35 | Return a kwarg dictionary that will be used to override optimizer 36 | args stored in checkpoints. This allows us to load a checkpoint and 37 | resume training using a different set of optimizer args, e.g., with a 38 | different learning rate. 39 | """ 40 | raise NotImplementedError 41 | 42 | def get_lr(self): 43 | """Return the current learning rate.""" 44 | return self.optimizer.param_groups[0]['lr'] 45 | 46 | def set_lr(self, lr): 47 | """Set the learning rate.""" 48 | for param_group in self.optimizer.param_groups: 49 | param_group['lr'] = lr 50 | 51 | def state_dict(self): 52 | """Return the optimizer's state dict.""" 53 | return self.optimizer.state_dict() 54 | 55 | def load_state_dict(self, state_dict, optimizer_overrides=None): 56 | """Load an optimizer state dict. 57 | 58 | In general we should prefer the configuration of the existing optimizer 59 | instance (e.g., learning rate) over that found in the state_dict. This 60 | allows us to resume training from a checkpoint using a new set of 61 | optimizer args. 62 | """ 63 | self.optimizer.load_state_dict(state_dict) 64 | 65 | if optimizer_overrides is not None and len(optimizer_overrides) > 0: 66 | # override learning rate, momentum, etc. with latest values 67 | for group in self.optimizer.param_groups: 68 | group.update(optimizer_overrides) 69 | 70 | def backward(self, loss): 71 | """Computes the sum of gradients of the given tensor w.r.t. 
graph leaves.""" 72 | loss.backward() 73 | 74 | def multiply_grads(self, c): 75 | """Multiplies grads by a constant *c*.""" 76 | for p in self.params: 77 | if p.grad is not None: 78 | p.grad.data.mul_(c) 79 | 80 | def clip_grad_norm(self, max_norm): 81 | """Clips gradient norm.""" 82 | if max_norm > 0: 83 | return torch.nn.utils.clip_grad_norm_(self.params, max_norm) 84 | else: 85 | return math.sqrt(sum(p.grad.data.norm()**2 for p in self.params if p.grad is not None)) 86 | 87 | def step(self, closure=None): 88 | """Performs a single optimization step.""" 89 | self.optimizer.step(closure) 90 | 91 | def zero_grad(self): 92 | """Clears the gradients of all optimized parameters.""" 93 | for group in self.optimizer.param_groups: 94 | for p in group['params']: 95 | p.grad = None 96 | self.optimizer.zero_grad() 97 | 98 | 99 | class FairseqLRScheduler(object): 100 | 101 | def __init__(self, args, optimizer): 102 | super().__init__() 103 | if not isinstance(optimizer, FairseqOptimizer): 104 | raise ValueError('optimizer must be an instance of FairseqOptimizer') 105 | self.args = args 106 | self.optimizer = optimizer 107 | self.best = None 108 | 109 | @staticmethod 110 | def add_args(parser): 111 | """Add arguments to the parser for this LR scheduler.""" 112 | pass 113 | 114 | def state_dict(self): 115 | """Return the LR scheduler state dict.""" 116 | return {'best': self.best} 117 | 118 | def load_state_dict(self, state_dict): 119 | """Load an LR scheduler state dict.""" 120 | self.best = state_dict['best'] 121 | 122 | def step(self, epoch, val_loss=None): 123 | """Update the learning rate at the end of the given epoch.""" 124 | if val_loss is not None: 125 | if self.best is None: 126 | self.best = val_loss 127 | else: 128 | self.best = min(self.best, val_loss) 129 | 130 | def step_update(self, num_updates): 131 | """Update the learning rate after each update.""" 132 | return self.optimizer.get_lr() 133 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/utils.py: -------------------------------------------------------------------------------- 1 | from torch import Tensor, tensor 2 | from typing import List 3 | import torch 4 | import torch.nn.functional as F 5 | 6 | 7 | def temporal_stacked_to_separate(feature: Tensor, durations: List) -> Tensor: 8 | """ 9 | Args: 10 | feature: [\sigma t_i *] 11 | durations: [t_1, t_2, ..., t_b] 12 | Return: 13 | out_feature: [b, t, *] 14 | """ 15 | shape_len = len(feature.shape) - 1 16 | max_seq_len = max(durations) 17 | padding_value = [] 18 | for i in range(shape_len): 19 | padding_value += [0, 0] 20 | feature_splits = feature.split(durations, dim=0) 21 | out_feature = torch.stack([F.pad(f, padding_value + [0, max_seq_len-len(f)]) 22 | for f in feature_splits]) # [b, t, *] 23 | return out_feature 24 | 25 | 26 | def temporal_separate_to_stack(feature: Tensor, durations: List) -> Tensor: 27 | """ 28 | Args: 29 | feature: [b, t, *] 30 | durations: [t_1, t_2, ..., t_b] 31 | Return: 32 | out_feature: [\sigma t_i *] 33 | """ 34 | out_feature = torch.cat([feature[i][:durations[i]] 35 | for i in range(len(feature))], dim=0) # [\sigma t_i *] 36 | return out_feature 37 | 38 | 39 | def generate_anchor_scores(proposals, label, seq_len, thres_score): 40 | """ 41 | Args: 42 | proposals: [b, t*n_windows, 2] 43 | label: [b, 2] 44 | Return: 45 | scores: [b, t*n_windows] 46 | scores_mask: [b, t*n_windows] 47 | """ 48 | illegal = torch.logical_or( 49 | proposals[..., 0] < 0, proposals[..., 1] >= seq_len) 50 | label 
= label[:, None].repeat(1, proposals.shape[1], 1) 51 | IoUs = calculate_IoU_batch_temporal(proposals, label) 52 | IoUs[illegal] = 0.0 53 | max_IoU = torch.max(IoUs, dim=1)[0] 54 | IoUs[IoUs < thres_score * max_IoU[:, None]] = 0.0 55 | IoUs = IoUs / (max_IoU[:, None] + 1e-4) 56 | scores = IoUs.float() 57 | scores_mask = (1 - illegal.float()) 58 | return scores, scores_mask 59 | 60 | 61 | def calculate_IoU_batch_temporal(box0: Tensor, box1: Tensor) -> Tensor: 62 | """ 63 | Args: 64 | box0: [b, n_boxes, 2] 65 | box1: [b, n_boxes, 2] 66 | Return: 67 | iou: [b, n_boxes] 68 | """ 69 | union = (torch.min(torch.stack([box0[..., 0], box1[..., 0]], 0), 0)[ 70 | 0], torch.max(torch.stack([box0[..., 1], box1[..., 1]], 0), 0)[0]) 71 | inter = (torch.max(torch.stack([box0[..., 0], box1[..., 0]], 0), 0)[ 72 | 0], torch.min(torch.stack([box0[..., 1], box1[..., 1]], 0), 0)[0]) 73 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 74 | iou[union[1] - union[0] < -1e-5] = 0 75 | iou[iou < 0] = 0.0 76 | return iou 77 | 78 | 79 | def generate_2d_gaussian(boxes, w, h, delta=0.05): 80 | """ 81 | generate gaussian according to the input boxes, normalized to [0, 1] [checked] 82 | Args: 83 | boxes: [k, 4] in the form of cxcywh 84 | w: the width of gaussian map 85 | h: the height of gaussian map 86 | delta: gaussian parameter 87 | Return: 88 | gaussian: [k, h, w] 89 | """ 90 | n_boxes = len(boxes) 91 | ww = torch.linspace(0, w-1, w) 92 | hh = torch.linspace(0, h-1, h) 93 | gridh, gridw = torch.meshgrid(hh, ww) 94 | grid = torch.stack([gridw, gridh], dim=0)[None, ...].repeat( 95 | n_boxes, 1, 1, 1).to(boxes.device) # [k, 2, h, w] 96 | boxes = boxes[..., None, None].repeat(1, 1, h, w) 97 | gaussian = torch.exp(-(boxes[:, 0]-grid[:, 0])**2/(delta*boxes[:, 2]**2)) *\ 98 | torch.exp(-(boxes[:, 1]-grid[:, 1])**2 / 99 | (delta*boxes[:, 3]**2)) # [k, h, w] 100 | gaussian[gaussian < 0.05] = 0 101 | return gaussian 102 | 103 | 104 | def compute_temporal_reg_tar(label, score): 105 | """ 106 | Args: 107 | label: [b, 2] 108 | score: [b, t] 109 | Return: 110 | label_reg: [b, t, 2] 111 | """ 112 | label = label.unsqueeze(1) 113 | segment_num = score.shape[1] 114 | index_s = torch.arange(0, segment_num).unsqueeze( 115 | 0).unsqueeze(-1).to(score.device) 116 | index_e = torch.arange(0, segment_num).unsqueeze( 117 | 0).unsqueeze(-1).to(score.device) 118 | 119 | label_reg_s = index_s - label[:, :, 0].unsqueeze(-1) 120 | label_reg_e = label[:, :, 1].unsqueeze(-1) - index_e 121 | 122 | label_reg = torch.cat([label_reg_s, label_reg_e], dim=-1) 123 | label_reg = label_reg * score.unsqueeze(-1) 124 | return label_reg 125 | 126 | 127 | def segment_tiou(box_a, box_b): 128 | 129 | # gt: [batch, 1, 2], detections: [batch, k, 2] 130 | # calculate interaction 131 | inter_max_xy = torch.min(box_a[:, :, -1], box_b[:, :, -1]) 132 | inter_min_xy = torch.max(box_a[:, :, 0], box_b[:, :, 0]) 133 | inter = torch.clamp((inter_max_xy - inter_min_xy), min=0) 134 | 135 | # calculate union 136 | union_max_xy = torch.max(box_a[:, :, -1], box_b[:, :, -1]) 137 | union_min_xy = torch.min(box_a[:, :, 0], box_b[:, :, 0]) 138 | union = torch.clamp((union_max_xy - union_min_xy), min=0) 139 | 140 | iou = inter / (union+1e-6) 141 | 142 | return iou 143 | -------------------------------------------------------------------------------- /referring_segmentation/pre_proc/generate_data_list.py: -------------------------------------------------------------------------------- 1 | import os 2 | import csv 3 | import json 4 | 5 | 6 | def 
generate_annotation_dict(root): 7 | annotation_file = open(os.path.join(root)) 8 | annotation_list = list(annotation_file.read().splitlines()) 9 | annotations = {} 10 | for i in range(1, len(annotation_list)): 11 | if 'a2d' in root: 12 | name, instance, desc = annotation_list[i].split(',') 13 | else: 14 | name, desc = annotation_list[i].split(',') 15 | instance = '0' 16 | if name not in annotations.keys(): 17 | annotations[name] = {} 18 | annotations[name][instance] = desc 19 | return annotations 20 | 21 | 22 | def generate_data_list_a2d(dataset_root='/media/wwk/HDD2/datasets/referring_video_segmentation/a2d_sentences/', save_root='./data'): 23 | if not os.path.exists(save_root): 24 | os.mkdir(save_root) 25 | assert os.path.exists( 26 | dataset_root), ('Incorrect dataset path: {}'.format(dataset_root)) 27 | num_videos = 0 28 | num_annotated_videos = 0 29 | videos = os.listdir(os.path.join(dataset_root, 'Rename_Images')) 30 | ignore_file = open(os.path.join( 31 | dataset_root, 'a2d_missed_videos.txt'), 'r') 32 | ignore_data_list = ignore_file.read().splitlines() 33 | videoset_file = open(os.path.join( 34 | dataset_root, 'videoset.csv'), 'r') 35 | videosets = list(csv.reader(videoset_file)) 36 | instances = generate_annotation_dict(os.path.join(dataset_root, 'a2d_annotation.txt')) 37 | 38 | json_train = {} 39 | json_test = {} 40 | for videoset in videosets: 41 | video_name = videoset[0] 42 | assert video_name in videos, ('Incorrect video name: {} in csv file: {}'.format( 43 | video_name, os.path.join(dataset_root, 'videoset.csv'))) 44 | if video_name not in ignore_data_list: 45 | num_videos += 1 46 | frames_root = os.path.join( 47 | dataset_root, 'Rename_Images', video_name) 48 | annotations_root = os.path.join( 49 | dataset_root, 'a2d_annotation_with_instances', video_name) 50 | if os.path.exists(annotations_root): 51 | num_annotated_videos += 1 52 | if videoset[-1] == '0': 53 | json_train[video_name] = { 54 | 'frames': os.path.join('Rename_Images', video_name), 'labels': os.path.join('a2d_annotation_with_instances', video_name), 'instances': instances[video_name]} 55 | else: 56 | json_test[video_name] = { 57 | 'frames': os.path.join('Rename_Images', video_name), 'labels': os.path.join('a2d_annotation_with_instances', video_name), 'instances': instances[video_name]} 58 | else: 59 | print( 60 | 'Annotation of video {} in A2D dataset not exits'.format(video_name)) 61 | 62 | with open(os.path.join(save_root, 'a2d_sentences_train.json'), 'w+') as json_train_file: 63 | json.dump(json_train, json_train_file, indent=1) 64 | with open(os.path.join(save_root, 'a2d_sentences_test.json'), 'w+') as json_test_file: 65 | json.dump(json_test, json_test_file, indent=1) 66 | print('A2D dataset : Total videos : {} | Annotated videos : {}'.format( 67 | num_videos, num_annotated_videos)) 68 | 69 | 70 | def generate_data_list_jhmdb(dataset_root='/media/wwk/HDD1/dataset/referring_video_segmentation/jhmdb_sentences', save_root='./data'): 71 | assert os.path.exists( 72 | dataset_root), ('Incorrect dataset path: {}'.format(dataset_root)) 73 | num_videos = 0 74 | num_annotated_videos = 0 75 | video_groups = [f for f in os.listdir(os.path.join( 76 | dataset_root, 'Rename_Images')) if '.' not in f] 77 | instances = generate_annotation_dict(os.path.join(dataset_root, 'jhmdb_annotation.txt')) 78 | json_test = {} 79 | for video_group in video_groups: 80 | videos_root = os.path.join( 81 | dataset_root, 'Rename_Images', video_group) 82 | videos = [f for f in os.listdir(videos_root) if '.' 
not in f] 83 | for video in videos: 84 | annotation_root = os.path.join( 85 | dataset_root, 'puppet_mask', video_group, video) 86 | num_videos += 1 87 | if os.path.exists(annotation_root): 88 | json_test[video] = {'frames': os.path.join('Rename_Images', video), 'labels': os.path.join('puppet_mask', video_group, video), 89 | 'instances': instances[video]} 90 | num_annotated_videos += 1 91 | else: 92 | print( 93 | 'Annotation of video {}/{} in JHMDB dataset not exits'.format(video_group, video)) 94 | with open(os.path.join(save_root, 'jhmdb_sentences_test.json'), 'w+') as json_test_file: 95 | json.dump(json_test, json_test_file, indent=1) 96 | print('JHMDB dataset : Total videos : {} | Annotated videos : {}'.format( 97 | num_videos, num_annotated_videos)) 98 | 99 | 100 | if __name__ == '__main__': 101 | save_root='./data' 102 | a2d_dataset_root='' 103 | jhmdb_dataset_root='' 104 | generate_data_list_a2d(a2d_dataset_root, save_root) 105 | generate_data_list_jhmdb(jhmdb_dataset_root, save_root) 106 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/datasets/torch_videovision.py: -------------------------------------------------------------------------------- 1 | # Adapted from https://github.com/hassony2/torch_videovision 2 | from PIL import Image 3 | import numbers 4 | import torch 5 | import cv2 6 | import numpy as np 7 | import PIL 8 | from PIL import Image 9 | 10 | 11 | def convert_img(img): 12 | """Converts (H, W, C) numpy.ndarray to (C, W, H) format""" 13 | if len(img.shape) == 3: 14 | img = img.transpose(2, 0, 1) 15 | if len(img.shape) == 2: 16 | img = np.expand_dims(img, 0) 17 | return img 18 | 19 | 20 | class ClipToTensor(object): 21 | """Convert a list of m (H x W x C) numpy.ndarrays in the range [0, 255] 22 | to a torch.FloatTensor of shape (C x m x H x W) in the range [0, 1.0] 23 | """ 24 | 25 | def __init__(self, channel_nb=3, div_255=True, numpy=False): 26 | self.channel_nb = channel_nb 27 | self.div_255 = div_255 28 | self.numpy = numpy 29 | 30 | def __call__(self, clip): 31 | """ 32 | Args: clip (list of numpy.ndarray): clip (list of images) 33 | to be converted to tensor. 
34 | """ 35 | # Retrieve shape 36 | if isinstance(clip[0], np.ndarray): 37 | h, w, ch = clip[0].shape 38 | assert ch == self.channel_nb, "Got {0} instead of 3 channels".format(ch) 39 | elif isinstance(clip[0], Image.Image): 40 | w, h = clip[0].size 41 | else: 42 | raise TypeError( 43 | "Expected numpy.ndarray or PIL.Image\ 44 | but got list of {0}".format( 45 | type(clip[0]) 46 | ) 47 | ) 48 | 49 | np_clip = np.zeros([self.channel_nb, len(clip), int(h), int(w)]) 50 | 51 | # Convert 52 | for img_idx, img in enumerate(clip): 53 | if isinstance(img, np.ndarray): 54 | pass 55 | elif isinstance(img, Image.Image): 56 | img = np.array(img, copy=False) 57 | else: 58 | raise TypeError( 59 | "Expected numpy.ndarray or PIL.Image\ 60 | but got list of {0}".format( 61 | type(clip[0]) 62 | ) 63 | ) 64 | img = convert_img(img) 65 | np_clip[:, img_idx, :, :] = img 66 | if self.numpy: 67 | if self.div_255: 68 | np_clip = np_clip / 255 69 | return np_clip 70 | 71 | else: 72 | tensor_clip = torch.from_numpy(np_clip) 73 | 74 | if not isinstance(tensor_clip, torch.FloatTensor): 75 | tensor_clip = tensor_clip.float() 76 | if self.div_255: 77 | tensor_clip = tensor_clip.div(255) 78 | return tensor_clip 79 | 80 | 81 | def _is_tensor_clip(clip): 82 | return torch.is_tensor(clip) and clip.ndimension() == 4 83 | 84 | 85 | def crop_clip(clip, min_h, min_w, h, w): 86 | if isinstance(clip[0], np.ndarray): 87 | cropped = [img[min_h : min_h + h, min_w : min_w + w, :] for img in clip] 88 | 89 | elif isinstance(clip[0], PIL.Image.Image): 90 | cropped = [img.crop((min_w, min_h, min_w + w, min_h + h)) for img in clip] 91 | else: 92 | raise TypeError( 93 | "Expected numpy.ndarray or PIL.Image" 94 | + "but got list of {0}".format(type(clip[0])) 95 | ) 96 | return cropped 97 | 98 | 99 | def normalize(clip, mean, std, inplace=False): 100 | if not _is_tensor_clip(clip): 101 | raise TypeError("tensor is not a torch clip.") 102 | 103 | if not inplace: 104 | clip = clip.clone() 105 | 106 | dtype = clip.dtype 107 | mean = torch.as_tensor(mean, dtype=dtype, device=clip.device) 108 | std = torch.as_tensor(std, dtype=dtype, device=clip.device) 109 | clip.sub_(mean[:, None, None, None]).div_(std[:, None, None, None]) 110 | 111 | return clip 112 | 113 | 114 | def get_resize_sizes(im_h, im_w, size): 115 | if im_w < im_h: 116 | ow = size 117 | oh = int(size * im_h / im_w) 118 | else: 119 | oh = size 120 | ow = int(size * im_w / im_h) 121 | return oh, ow 122 | 123 | 124 | def resize_clip(clip, size, interpolation="bilinear"): 125 | if isinstance(clip[0], np.ndarray): 126 | if isinstance(size, numbers.Number): 127 | im_h, im_w, im_c = clip[0].shape 128 | # Min spatial dim already matches minimal size 129 | if (im_w <= im_h and im_w == size) or (im_h <= im_w and im_h == size): 130 | return clip 131 | new_h, new_w = get_resize_sizes(im_h, im_w, size) 132 | size = (new_w, new_h) 133 | else: 134 | size = size[1], size[0] 135 | if interpolation == "bilinear": 136 | np_inter = cv2.INTER_LINEAR 137 | else: 138 | np_inter = cv2.INTER_NEAREST 139 | scaled = [cv2.resize(img, size, interpolation=np_inter) for img in clip] 140 | elif isinstance(clip[0], PIL.Image.Image): 141 | if isinstance(size, numbers.Number): 142 | im_w, im_h = clip[0].size 143 | # Min spatial dim already matches minimal size 144 | if (im_w <= im_h and im_w == size) or (im_h <= im_w and im_h == size): 145 | return clip 146 | new_h, new_w = get_resize_sizes(im_h, im_w, size) 147 | size = (new_w, new_h) 148 | else: 149 | size = size[1], size[0] 150 | if interpolation == "bilinear": 151 
| pil_inter = PIL.Image.NEAREST 152 | else: 153 | pil_inter = PIL.Image.BILINEAR 154 | scaled = [img.resize(size, pil_inter) for img in clip] 155 | else: 156 | raise TypeError( 157 | "Expected numpy.ndarray or PIL.Image" 158 | + "but got list of {0}".format(type(clip[0])) 159 | ) 160 | return scaled 161 | -------------------------------------------------------------------------------- /temporal_grounding/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils import data 3 | import h5py 4 | import json 5 | from utils.utils import * 6 | from tqdm import tqdm 7 | import torch.nn as nn 8 | import torch 9 | import numpy as np 10 | import torchtext 11 | 12 | 13 | class MyDataset(data.Dataset): 14 | def __init__(self, config, mode='train'): 15 | super(MyDataset, self).__init__() 16 | self.config = config 17 | self.dataset = config['{}ing_datasets'.format(mode)] 18 | self.embedding_type = config['embedding_type'] 19 | self.segment_num = config['segment_num'] 20 | self.mode = mode 21 | self.max_embedding_length = config['embedding_length'] 22 | print('Preparing dataset: {}'.format(self.dataset)) 23 | self.datas = [] 24 | 25 | with open(os.path.join('./data', self.dataset, '{}.json'.format(mode)), 'r') as f: 26 | videosets = json.load(f) 27 | for n, video in tqdm(enumerate(videosets), total=len(videosets)): 28 | data = {} 29 | data['vid'] = video[0] 30 | data['timestamp'] = video[2] 31 | data['duration'] = video[1] 32 | data['words'] = video[3] 33 | data['index'] = n 34 | if (data['timestamp'][1] - data['timestamp'][0]) > 0 and data['timestamp'][1] <= data['duration'] and data['timestamp'][0] <= data['duration']: 35 | self.datas.append(data) 36 | self.feat = h5py.File(config['datasets_root']) 37 | 38 | self.proposals = generate_proposals( 39 | config['segment_num'], config['window_width']) 40 | embedding_name, embedding_dim = self.config['embedding_type'].split( 41 | '_')[1], int(self.config['embedding_type'].split('_')[2]) 42 | self.vocab = torchtext.vocab.GloVe( 43 | name=embedding_name, dim=embedding_dim) 44 | self.vocab.itos.extend(['']) 45 | self.vocab.stoi[''] = self.vocab.vectors.shape[0] 46 | self.vocab.vectors = torch.cat( 47 | [self.vocab.vectors, torch.zeros(1, self.vocab.dim)], dim=0) 48 | self.word_embedding = nn.Embedding.from_pretrained(self.vocab.vectors) 49 | 50 | def generate_label_feats(self, feat, label): 51 | ori_video_len = feat.shape[0] 52 | index = np.linspace(start=0, stop=ori_video_len - 1, 53 | num=self.segment_num).astype(np.int32) 54 | new_video = [] 55 | for i in range(len(index) - 1): 56 | start = index[i] 57 | end = index[i + 1] 58 | if start == end or start + 1 == end: 59 | new_video.append(feat[start]) 60 | else: 61 | new_video.append(np.mean(feat[start: end], 0)) 62 | new_video.append(feat[-1]) 63 | feat = np.stack(new_video, 0) 64 | try: 65 | label[0] = min(np.where(index >= label[0])[0]) 66 | except: 67 | print(label, index) 68 | if label[1] == ori_video_len - 1: 69 | label[1] = self.segment_num - 1 70 | else: 71 | label[1] = max(np.where(index <= label[1])[0]) 72 | if label[1] < label[0]: 73 | label[0] = label[1] 74 | return feat, label 75 | 76 | def __len__(self): 77 | return len(self.datas) 78 | 79 | def __getitem__(self, item): 80 | if self.dataset != 'ActivityNet': 81 | feat = self.feat[self.datas[item]['vid']][:] 82 | else: 83 | feat = self.feat[self.datas[item]['vid']]['c3d_features'][:] 84 | 85 | duration = self.datas[item]['duration'] 86 | timestamp = self.datas[item]['timestamp'] 
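# The feature sequence is resampled to a fixed number of segments below, and the
# ground-truth moment (given in seconds) is mapped to segment indices via
# segment_num * timestamp / duration, with the start floored at 0 and the end
# capped at segment_num - 1. E.g. with segment_num = 128, duration = 60.0 and
# timestamp = [12.0, 30.0] (illustrative numbers only), label becomes [25, 64]
# after truncation to int32.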
87 | 88 | feat = torch.from_numpy(feat) 89 | feat = average_to_fixed_length(feat, self.segment_num) 90 | 91 | start_frame = max(self.segment_num * timestamp[0] / duration, 0) 92 | end_frame = min(self.segment_num * 93 | timestamp[1] / duration, self.segment_num-1) 94 | if start_frame > end_frame: 95 | start_frame = end_frame 96 | label = np.asarray([start_frame, end_frame]).astype(np.int32) 97 | 98 | word_idxs = torch.tensor([self.vocab.stoi.get(w, len(self.vocab.stoi)-1) 99 | for w in self.datas[item]['words'].strip().split()], dtype=torch.long) 100 | embedding = self.word_embedding(word_idxs) 101 | embedding_length = embedding.shape[0] 102 | 103 | if embedding_length > self.max_embedding_length: 104 | embedding_padded = embedding[: self.max_embedding_length, :] 105 | embedding_length = self.max_embedding_length 106 | else: 107 | embedding_padded = torch.zeros( 108 | (self.max_embedding_length, embedding.shape[1])) 109 | embedding_padded[: embedding.shape[0], :] = embedding 110 | 111 | scores, scores_mask, adj_mat = generate_scores( 112 | self.proposals, label, self.segment_num, self.config['thres_score'], self.config['thres_adjmat']) 113 | 114 | score_nm = [] 115 | for i in range(self.segment_num): 116 | if i >= label[0] and i <= label[1]: 117 | score_nm.append(1) 118 | else: 119 | score_nm.append(0) 120 | score_nm = torch.tensor(score_nm).float() 121 | 122 | return { 123 | 'embedding': embedding_padded, 124 | 'feat': feat, 125 | 'embedding_length': embedding_length, 126 | 'label': label, 127 | 'duration': duration, 128 | 'vid': self.datas[item]['vid'], 129 | 'score': scores, 130 | 'score_nm': score_nm, 131 | 'score_mask': scores_mask, 132 | 'proposals': self.proposals.astype(np.float32), 133 | 'adj_mat': adj_mat, 134 | 'index': self.datas[item]['index'] 135 | } 136 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/postprocessors.py: -------------------------------------------------------------------------------- 1 | # Adapted from 2 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. 
All Rights Reserved 3 | """Postprocessors class to transform TubeDETR output according to the downstream task""" 4 | import imp 5 | from typing import Dict 6 | 7 | import torch 8 | import torch.nn.functional as F 9 | from torch import nn 10 | 11 | from .anchor_utils import generate_proposals 12 | 13 | 14 | class PostProcessSTVG(nn.Module): 15 | def __init__(self, args): 16 | super().__init__() 17 | self.args = args 18 | 19 | @torch.no_grad() 20 | def forward(self, outputs, frames_id=None, video_ids=None, time_mask=None): 21 | """ 22 | :param outputs: must contain a key pred_sted mapped to a [B, T, 2] tensor of logits for the start and end predictions 23 | :param frames_id: list of B lists which contains the increasing list of frame ids corresponding to the indexes of the decoder outputs 24 | :param video_ids: list of B video_ids, used to ensemble predictions when video_max_len_train < video_max_len 25 | :param time_mask: [B, T] tensor with False on the padded positions, used to take out padded frames from the possible predictions 26 | :return: list of B [start_frame, end_frame] for each video 27 | """ 28 | 29 | if self.args.temporal_decoder_type == 'anchor': 30 | temporal_score, temporal_offset = outputs['temporal_score'], outputs['temporal_offset'] 31 | max_length = temporal_score.shape[1]//len( 32 | self.args.temporal_window_width) 33 | proposals = generate_proposals( 34 | max_length, self.args.temporal_window_width).to(temporal_score.device) 35 | refined_boxes = proposals[None, :] + \ 36 | temporal_offset # [b, t*n_windows, 2] 37 | _, ind = torch.topk(temporal_score, 1, -1) 38 | pred_steds = torch.gather(refined_boxes, 1, ind[..., None].repeat( 39 | 1, 1, 2)).squeeze(1).long() # b*2 40 | pred_steds = pred_steds.clamp(0, max_length-1) 41 | elif self.args.temporal_decoder_type == 'regression': 42 | temporal_score, temporal_reg = outputs['temporal_score'], outputs['temporal_reg'] 43 | max_length = temporal_score.shape[1] 44 | index = torch.as_tensor([i for i in range(max_length)]).to( 45 | temporal_score.device)[None] 46 | pred_start = index - temporal_reg[:, :, 0] 47 | pred_end = index + temporal_reg[:, :, 1] 48 | predictions = torch.stack([pred_start, pred_end], dim=-1) 49 | _, ind = torch.topk(temporal_score, 1, -1) 50 | pred_steds = torch.gather(predictions, 1, ind[..., None].repeat( 51 | 1, 1, 2)).squeeze(1).long() # b*2 52 | pred_steds = pred_steds.clamp(0, max_length-1) 53 | 54 | frames_id = ( 55 | torch.tensor([row + [0] * (max_length - len(row)) 56 | for row in frames_id]) 57 | .long() 58 | .to(pred_steds.device) 59 | ) # padded up to BxT 60 | # get corresponding frames id from the indexes 61 | pred_steds = torch.gather(frames_id, 1, pred_steds) 62 | pred_steds = pred_steds.float() 63 | pred_steds[:, 1] += 1 # the end frame is excluded in evaluation 64 | 65 | pred_steds = pred_steds.cpu().tolist() 66 | 67 | return pred_steds 68 | 69 | 70 | class PostProcess(nn.Module): 71 | """This module converts the model's output into the format expected by the coco api""" 72 | 73 | @torch.no_grad() 74 | def forward(self, outputs, target_sizes): 75 | """Perform the computation 76 | Parameters: 77 | outputs: raw outputs of the model 78 | target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch 79 | For evaluation, this must be the original image size (before any data augmentation) 80 | For visualization, this should be the image size after data augment, but before padding 81 | """ 82 | hm_s, wh_s = outputs['spatial_map'].squeeze(1), outputs['spatial_wh'] 
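# Shape convention assumed by the decoding below: hm_s is the per-frame centre
# heat map, [T, H, W] after squeezing the channel dim, and wh_s stores the
# predicted box width/height at every location, [T, 2, H, W]. The peak of each
# frame's heat map gives the box centre, the width/height gathered at that peak
# gives its size; the resulting xyxy box in heat-map coordinates is then
# rescaled to the original image size.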
83 | 84 | # Find the top response in heat map 85 | time, height, width = hm_s.size() 86 | topk_scores, topk_inds = torch.topk(hm_s.view(time, -1), 1) 87 | topk_inds = topk_inds % (height * width) 88 | topk_ys = (topk_inds / width).int().float() # t*1 89 | topk_xs = (topk_inds % width).int().float() # t*1 90 | 91 | pre_wh = torch.gather(wh_s.view(wh_s.shape[0], wh_s.shape[1], -1), -1, 92 | topk_inds.unsqueeze(1).repeat(1, 2, 1)) # t*2*1 93 | out_bbox = torch.cat([topk_xs - pre_wh[:, 0, :] / 2, 94 | topk_ys - pre_wh[:, 1, :] / 2, 95 | topk_xs + pre_wh[:, 0, :] / 2, 96 | topk_ys + pre_wh[:, 1, :] / 2], dim=-1) # t*4 97 | out_bbox[:, 2].clamp(0, width) 98 | out_bbox[:, 3].clamp(0, height) 99 | 100 | img_h, img_w = target_sizes.unbind(1) 101 | scale_fct_out = torch.tensor( 102 | [width, height, width, height]).float().unsqueeze(0).to(out_bbox.device) 103 | scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1) 104 | boxes = (out_bbox / scale_fct_out) * scale_fct 105 | 106 | results = [{"boxes": b} for b in boxes] 107 | 108 | return results 109 | 110 | 111 | def build_postprocessors(args, dataset_name=None) -> Dict[str, nn.Module]: 112 | postprocessors: Dict[str, nn.Module] = {"bbox": PostProcess()} 113 | 114 | if dataset_name: 115 | if dataset_name in ["vidstg", "hcstvg"]: 116 | postprocessors[dataset_name] = PostProcessSTVG(args) 117 | 118 | return postprocessors 119 | -------------------------------------------------------------------------------- /referring_segmentation/utils/tester.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import os 3 | import time 4 | import datetime 5 | from torch.utils import data 6 | import torch.nn.functional as F 7 | from tqdm import tqdm 8 | from utils.utils import report_result 9 | import numpy as np 10 | import cv2 11 | 12 | 13 | class Tester(object): 14 | def __init__(self, config): 15 | self.config = config 16 | self.save_fold = config['test_savefold'] 17 | self.checkpoint = config['checkpoint'] 18 | if not os.path.exists(self.save_fold): 19 | os.mkdir(self.save_fold) 20 | assert os.path.exists(self.checkpoint), 'incorrect checkpoint path!' 
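    # Typical call pattern (illustrative; the concrete values come from the test
    # config): construct the tester from a config providing 'test_savefold',
    # 'checkpoint' and 'cuda', then hand it a DataParallel-wrapped network
    # (test() below loads the weights through self.model.module) together with a
    # test-mode dataset, e.g.
    #   tester = Tester(config)
    #   tester.test(torch.nn.DataParallel(model).cuda(), MyDataset(config, mode='test'))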
21 | 22 | def model_info(self): 23 | print(self.model) 24 | num_params = 0 25 | for p in self.model.parameters(): 26 | num_params += p.numel() # 返回一个tensor变量内所有元素个数 27 | print("The total number of parameters: {}".format(num_params)) 28 | 29 | def test(self, model, dataset): 30 | self.model = model 31 | self.model.eval() 32 | self.model_info() 33 | print('loading checkpoint ....') 34 | checkpoint = torch.load(self.checkpoint) 35 | self.model.module.load_state_dict(checkpoint['state_dict']) 36 | loader = data.DataLoader(dataset, 1, False, num_workers=8) 37 | 38 | num_frames = 0 39 | total_times = 0 40 | pres = [] 41 | gts = [] 42 | 43 | pres_s = [] 44 | pres_m = [] 45 | pres_l = [] 46 | gts_s = [] 47 | gts_m = [] 48 | gts_l = [] 49 | 50 | with torch.no_grad(): 51 | print('video sequence num: {}'.format(len(loader))) 52 | print('testing.....') 53 | 54 | for data_batch in tqdm(loader): 55 | frames, labels, embedding = data_batch['frames'], data_batch['label'], data_batch['word_embedding'] 56 | embedding_length = data_batch['embedding_length'] 57 | is_annotated = data_batch['is_annotated'] 58 | video = data_batch['video'] 59 | name = data_batch['name'] 60 | instance = data_batch['instance'] 61 | # if not os.path.exists(os.path.join(self.save_fold, video[0])): 62 | # os.mkdir(os.path.join(self.save_fold, video[0])) 63 | # os.mkdir(os.path.join(self.save_fold, video[0], 'pre')) 64 | # os.mkdir(os.path.join(self.save_fold, video[0], 'gt')) 65 | video_save_root = os.path.join(self.save_fold, video[0], 'pre') 66 | gt_save_fold = os.path.join(self.save_fold, video[0], 'gt') 67 | if self.config['cuda']: 68 | for f in range(len(frames)): 69 | frames[f] = frames[f].cuda() 70 | labels[f] = labels[f].cuda() 71 | embedding = embedding.cuda() 72 | num_frames += len(frames) 73 | start_time = time.time() 74 | predictions, maps, _, _ = model(frames, embedding, embedding_length) 75 | 76 | for j, prediction in enumerate(predictions): 77 | if is_annotated[j][0] == 1: 78 | pre = torch.sigmoid(prediction) 79 | pre = (pre - pre.min()) / (pre.max() - pre.min()) 80 | pre = F.interpolate(pre, labels[j].shape[2:], mode='bilinear', align_corners=True) 81 | pre_thres = torch.where(pre>0.5, torch.ones_like(pre), torch.zeros_like(pre)) 82 | gts.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 83 | pres.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 84 | # if len(predictions) > 80 and len(predictions) < 150: 85 | # gts_m.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 86 | # pres_m.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 87 | if len(predictions) > 100: 88 | gts_l.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 89 | pres_l.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 90 | else: 91 | gts_s.append(labels[j][0][0].cpu().numpy().astype(np.uint8)) 92 | pres_s.append(pre_thres[0][0].cpu().numpy().astype(np.uint8)) 93 | 94 | total_times += time.time() - start_time 95 | 96 | 97 | total_times = datetime.timedelta(seconds=int(total_times)) 98 | time_per_frame = total_times / num_frames 99 | 100 | print('prediction time per frame: {}s'.format(time_per_frame)) 101 | 102 | print('evaluation...') 103 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres, gts) 104 | print('evaluation results: meanIOU: {} | overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format(meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 105 | 106 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres_s, gts_s) 107 | print( 108 | 'evaluation short results: meanIOU: {} 
| overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format( 109 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 110 | 111 | # meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres_m, gts_m) 112 | # print( 113 | # 'evaluation middle results: meanIOU: {} | overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format( 114 | # meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 115 | 116 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP = report_result(pres_l, gts_l) 117 | print( 118 | 'evaluation long results: meanIOU: {} | overallIOU: {} | P@5: {} | P@6: {} | P@7: {} | P@8: {} | P@9: {} | mAP: {}'.format( 119 | meaIOU, overallIOU, P5, P6, P7, P8, P9, mAP)) 120 | 121 | 122 | 123 | 124 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/attention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from einops import rearrange, repeat 5 | from torch import nn 6 | 7 | 8 | class GlobalTextPresentation(nn.Module): 9 | def __init__(self, text_dim): 10 | super(GlobalTextPresentation, self).__init__() 11 | self.W_txt = nn.Linear(text_dim, text_dim) 12 | 13 | def forward(self, fea_text, mask=None): 14 | weight_text = self.W_txt(fea_text) # B*L*C 15 | if mask is not None: 16 | weight_text = weight_text.masked_fill(mask == 0, -1e9) 17 | weight_text = weight_text.softmax(dim=1) 18 | fea_text_global = fea_text * weight_text 19 | fea_text_global = fea_text_global.sum(dim=1) # B*C 20 | return fea_text_global 21 | 22 | 23 | class MuTan(nn.Module): 24 | def __init__(self, video_fea_dim, text_fea_dim, out_fea_dim, heads=5): 25 | super(MuTan, self).__init__() 26 | 27 | self.heads = heads 28 | self.Wv = nn.ModuleList( 29 | [nn.Conv2d(video_fea_dim+8, out_fea_dim, 1, 1) for i in range(heads)]) 30 | self.Wt = nn.ModuleList( 31 | [nn.Conv2d(text_fea_dim, out_fea_dim, 1, 1) for i in range(heads)]) 32 | 33 | def forward(self, video_fea, text_fea, spatial): 34 | video_fea = torch.cat([video_fea, spatial], dim=1) 35 | fea_outs = [] 36 | for i in range(self.heads): 37 | fea_v = self.Wv[i](video_fea) 38 | fea_v = torch.tanh(fea_v) # B*C*H*W 39 | 40 | fea_t = self.Wt[i](text_fea) 41 | fea_t = torch.tanh(fea_t) # B*C*1*1 42 | 43 | fea_out = fea_v * fea_t 44 | fea_outs.append(fea_out.unsqueeze(-1)) 45 | fea_outs = torch.cat(fea_outs, dim=-1) 46 | fea_outs = torch.sum(fea_outs, dim=-1) 47 | mutan_fea = torch.tanh(fea_outs) 48 | mutan_fea = F.normalize(mutan_fea, dim=1) 49 | return mutan_fea 50 | 51 | 52 | class RelevanceFilter(nn.Module): 53 | def __init__(self, text_fea_dim, video_fea_dim, attention_dim, groups=8, kernelsize=(1, 1), dilation=(1, 1), phase='3D'): 54 | super().__init__() 55 | assert phase in ['1D', '2D', '3D'] 56 | assert text_fea_dim % groups == 0 57 | assert attention_dim % groups == 0 58 | self.phase = phase 59 | self.groups = groups 60 | self.kernel_size = kernelsize 61 | self.dilation = dilation 62 | if phase == '1D': 63 | assert len(kernelsize) == 1 and len(dilation) == 1 64 | self.Wkv = nn.Conv1d(video_fea_dim, 2*attention_dim, 1, 1) 65 | self.Wt = nn.Linear(text_fea_dim, attention_dim * kernelsize[0]) 66 | self.padding = (kernelsize[0]//2)*dilation[0] 67 | elif phase == '2D': 68 | assert len(kernelsize) == 2 and len(dilation) == 2 69 | self.Wkv = nn.Conv2d(video_fea_dim, 2*attention_dim, 1, 1) 70 | self.Wt = nn.Linear(text_fea_dim, attention_dim * 71 | kernelsize[0] * 
kernelsize[1]) 72 | self.padding = ( 73 | (kernelsize[0]//2)*dilation[0], (kernelsize[1]//2)*dilation[1]) 74 | elif phase == '3D': 75 | assert len(kernelsize) == 3 and len(dilation) == 3 76 | self.Wkv = nn.Conv3d(video_fea_dim, 2*attention_dim, 1, 1) 77 | self.Wt = nn.Linear(text_fea_dim, attention_dim * 78 | kernelsize[0] * kernelsize[1] * kernelsize[2]) 79 | self.padding = ((kernelsize[0]//2)*dilation[0], (kernelsize[1]//2) 80 | * dilation[1], (kernelsize[2]//2)*dilation[2]) 81 | 82 | def forward(self, video_fea, text_fea, masks=None): 83 | b = video_fea.shape[0] 84 | 85 | kv = self.Wkv(video_fea) 86 | k, v = kv.chunk(2, dim=1) 87 | kernel = self.Wt(text_fea) 88 | 89 | if self.phase == '1D': 90 | kernel = repeat(kernel, 'b (g c k0) -> (b g) c k0', 91 | k0=self.kernel_size[0], g=self.groups) 92 | k = repeat(k, 'b c l0 -> n (b c) l0', n=1) 93 | att = F.conv1d(k, kernel, padding=self.padding, 94 | dilation=self.dilation[0], groups=b*self.groups) 95 | att = rearrange(att, 'n (b g c) l0 -> (n b) g c l0', 96 | b=b, g=self.groups) 97 | v = rearrange(v, 'b (g c) l0 -> b g c l0', g=self.groups) 98 | elif self.phase == '2D': 99 | kernel = repeat(kernel, 'b (g c k0 k1) -> (b g) c k0 k1', 100 | k0=self.kernel_size[0], k1=self.kernel_size[1], g=self.groups) 101 | k = repeat(k, 'b c l0 l1 -> n (b c) l0 l1', n=1) 102 | att = F.conv2d(k, kernel, padding=self.padding, 103 | dilation=self.dilation, groups=b*self.groups) 104 | att = rearrange( 105 | att, 'n (b g c) l0 l1 -> (n b) g c l0 l1', b=b, g=self.groups) 106 | v = rearrange(v, 'b (g c) l0 l1 -> b g c l0 l1', g=self.groups) 107 | elif self.phase == '3D': 108 | kernel = repeat(kernel, 'b (g c k0 k1 k2) -> (b g) c k0 k1 k2', 109 | k0=self.kernel_size[0], k1=self.kernel_size[1], k2=self.kernel_size[2], g=self.groups) 110 | k = repeat(k, 'b c l0 l1 l2 -> n (b c) l0 l1 l2', n=1) 111 | att = F.conv3d(k, kernel, padding=self.padding, 112 | dilation=self.dilation, groups=b*self.groups) 113 | att = rearrange( 114 | att, 'n (b g c) l0 l1 l2 -> (n b) g c l0 l1 l2', b=b, g=self.groups) 115 | v = rearrange( 116 | v, 'b (g c) l0 l1 l2 -> b g c l0 l1 l2', g=self.groups) 117 | active_map = att.mean(dim=1) 118 | out = v * torch.sigmoid(att) 119 | out = torch.flatten(out, 1, 2) 120 | 121 | if masks is not None: 122 | out = out * masks 123 | active_map = active_map.sigmoid() * masks 124 | return active_map, out 125 | -------------------------------------------------------------------------------- /temporal_grounding/utils/adam_optimizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the LICENSE file in 5 | # the root directory of this source tree. An additional grant of patent rights 6 | # can be found in the PATENTS file in the same directory. 
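# Usage sketch for the Adam class defined below (illustrative values; `model`,
# `criterion`, `batch` and `target` are placeholders, not objects from this repo):
#   optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
#                    eps=1e-8, weight_decay=1e-4)
#   loss = criterion(model(batch), target)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()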
7 | 8 | import math 9 | import torch 10 | import torch.optim 11 | 12 | from .optimizer import FairseqOptimizer 13 | 14 | 15 | class AdamOptimizer(FairseqOptimizer): 16 | def __init__(self, args, params): 17 | super().__init__(args, params) 18 | self._optimizer = Adam(params, **self.optimizer_config) 19 | 20 | @staticmethod 21 | def add_args(parser): 22 | """Add optimizer-specific arguments to the parser.""" 23 | # fmt: off 24 | parser.add_argument('--adam-betas', default='(0.9, 0.999)', metavar='B', 25 | help='betas for Adam optimizer') 26 | parser.add_argument('--adam-eps', type=float, default=1e-8, metavar='D', 27 | help='epsilon for Adam optimizer') 28 | # fmt: on 29 | 30 | @property 31 | def optimizer_config(self): 32 | """ 33 | Return a kwarg dictionary that will be used to override optimizer 34 | args stored in checkpoints. This allows us to load a checkpoint and 35 | resume training using a different set of optimizer args, e.g., with a 36 | different learning rate. 37 | """ 38 | return { 39 | 'lr': self.args.lr, 40 | 'betas': eval(self.args.adam_betas), 41 | 'eps': self.args.adam_eps, 42 | 'weight_decay': self.args.weight_decay, 43 | } 44 | 45 | 46 | class Adam(torch.optim.Optimizer): 47 | """Implements Adam algorithm. 48 | 49 | This implementation is modified from torch.optimizer.Adam based on: 50 | `Fixed Weight Decay Regularization in Adam` 51 | (see https://arxiv.org/abs/1711.05101) 52 | 53 | It has been proposed in `Adam: A Method for Stochastic Optimization`_. 54 | 55 | Arguments: 56 | params (iterable): iterable of parameters to optimize or dicts defining 57 | parameter groups 58 | lr (float, optional): learning rate (default: 1e-3) 59 | betas (Tuple[float, float], optional): coefficients used for computing 60 | running averages of gradient and its square (default: (0.9, 0.999)) 61 | eps (float, optional): term added to the denominator to improve 62 | numerical stability (default: 1e-8) 63 | weight_decay (float, optional): weight decay (L2 penalty) (default: 0) 64 | amsgrad (boolean, optional): whether to use the AMSGrad variant of this 65 | algorithm from the paper `On the Convergence of Adam and Beyond`_ 66 | 67 | .. _Adam\: A Method for Stochastic Optimization: 68 | https://arxiv.org/abs/1412.6980 69 | .. _On the Convergence of Adam and Beyond: 70 | https://openreview.net/forum?id=ryQu7f-RZ 71 | """ 72 | 73 | def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, 74 | weight_decay=0, amsgrad=False): 75 | defaults = dict(lr=lr, betas=betas, eps=eps, 76 | weight_decay=weight_decay, amsgrad=amsgrad) 77 | super(Adam, self).__init__(params, defaults) 78 | 79 | def step(self, closure=None): 80 | """Performs a single optimization step. 81 | 82 | Arguments: 83 | closure (callable, optional): A closure that reevaluates the model 84 | and returns the loss. 
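        Note: unlike the stock torch.optim.Adam, the weight decay used here is
        decoupled from the adaptive update; it is applied directly to the
        parameters and scaled by the learning rate, following the fixed
        weight-decay formulation cited in the class docstring.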
85 | """ 86 | loss = None 87 | if closure is not None: 88 | loss = closure() 89 | 90 | for group in self.param_groups: 91 | for p in group['params']: 92 | if p.grad is None: 93 | continue 94 | grad = p.grad.data 95 | if grad.is_sparse: 96 | raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead') 97 | amsgrad = group['amsgrad'] 98 | 99 | state = self.state[p] 100 | 101 | # State initialization 102 | if len(state) == 0: 103 | state['step'] = 0 104 | # Exponential moving average of gradient values 105 | state['exp_avg'] = torch.zeros_like(p.data) 106 | # Exponential moving average of squared gradient values 107 | state['exp_avg_sq'] = torch.zeros_like(p.data) 108 | if amsgrad: 109 | # Maintains max of all exp. moving avg. of sq. grad. values 110 | state['max_exp_avg_sq'] = torch.zeros_like(p.data) 111 | 112 | exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] 113 | if amsgrad: 114 | max_exp_avg_sq = state['max_exp_avg_sq'] 115 | beta1, beta2 = group['betas'] 116 | 117 | state['step'] += 1 118 | 119 | # Decay the first and second moment running average coefficient 120 | exp_avg.mul_(beta1).add_(1 - beta1, grad) 121 | exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) 122 | if amsgrad: 123 | # Maintains the maximum of all 2nd moment running avg. till now 124 | torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq) 125 | # Use the max. for normalizing running avg. of gradient 126 | denom = max_exp_avg_sq.sqrt().add_(group['eps']) 127 | else: 128 | denom = exp_avg_sq.sqrt().add_(group['eps']) 129 | 130 | bias_correction1 = 1 - beta1 ** state['step'] 131 | bias_correction2 = 1 - beta2 ** state['step'] 132 | step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1 133 | 134 | if group['weight_decay'] != 0: 135 | p.data.add_(-group['weight_decay'] * group['lr'], p.data) 136 | 137 | p.data.addcdiv_(-step_size, exp_avg, denom) 138 | 139 | return loss 140 | -------------------------------------------------------------------------------- /referring_segmentation/dataset/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | from PIL import Image 3 | from torch.utils import data 4 | import numpy as np 5 | import h5py 6 | from torchvision import transforms 7 | from dataset import augmentation 8 | import torch 9 | from tqdm import tqdm 10 | from utils.video_reader import clip_annotation_reader, sequence_reader 11 | import json 12 | import torchtext 13 | import torch.nn as nn 14 | 15 | 16 | class MyDataset(data.Dataset): 17 | def __init__(self, config, mode='train'): 18 | super(MyDataset, self).__init__() 19 | self.input_size = config['input_size'] 20 | self.clip_size = config['clip_size'] 21 | self.datasets = config['{}ing_datasets'.format(mode)] 22 | self.dataset_root = config['datasets_root'] 23 | self.max_embedding_length = config['max_embedding_length'] 24 | self.mode = mode 25 | if type(self.datasets) != list: 26 | self.datasets = [self.datasets] 27 | print('Preparing datasets: {}'.format(self.datasets)) 28 | self.datas = [] 29 | augmen = [augmentation.FixedResize(self.input_size)] 30 | if mode == 'train': 31 | if config['augmentations']['random_crop']: 32 | augmen.append(augmentation.RandomScale((1.0, 1.1))) 33 | augmen.append(augmentation.ExtRandomCrop(self.input_size, pad_if_needed=True)) 34 | if config['augmentations']['random_flip']: 35 | augmen.append(augmentation.RandomHorizontalFlip()) 36 | augmen.append(augmentation.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 
0.224, 0.225))) 37 | augmen.append(augmentation.ToTensor()) 38 | self.transformation = transforms.Compose(augmen) 39 | 40 | for dataset in self.datasets: 41 | assert os.path.exists('./data/{}_{}.json'.format(dataset.lower(), mode)), 'json file not exist: {}'.format('./data/{}_{}.json'.format(dataset.lower(), mode)) 42 | with open('./data/{}_{}.json'.format(dataset.lower(), mode), 'r') as f: 43 | videosets = json.load(f) 44 | 45 | for video_file, attribute in tqdm(videosets.items()): 46 | video_root, annotation_root, instances = attribute['frames'], attribute['labels'], attribute['instances'] 47 | if mode == 'train': 48 | video_data = clip_annotation_reader(os.path.join(self.dataset_root, video_root), os.path.join(self.dataset_root, annotation_root), \ 49 | instances, self.clip_size, annotation_center=False, dataset=dataset) 50 | else: 51 | video_data = sequence_reader(os.path.join(self.dataset_root, video_root), os.path.join(self.dataset_root, annotation_root), instances, dataset=dataset) 52 | self.datas += video_data 53 | 54 | embedding_name, embedding_dim = config['embedding_type'].split( 55 | '_')[1], int(config['embedding_type'].split('_')[2]) 56 | self.vocab = torchtext.vocab.GloVe( 57 | name=embedding_name, dim=embedding_dim) 58 | self.vocab.itos.extend(['']) 59 | self.vocab.stoi[''] = self.vocab.vectors.shape[0] 60 | self.vocab.vectors = torch.cat( 61 | [self.vocab.vectors, torch.zeros(1, self.vocab.dim)], dim=0) 62 | self.word_embedding = nn.Embedding.from_pretrained(self.vocab.vectors) 63 | 64 | def __len__(self): 65 | return len(self.datas) 66 | 67 | def __getitem__(self, item): 68 | frames = [] 69 | annotations = [] 70 | is_annotated = [] 71 | frame_names = [] 72 | instance = self.datas[item]['instance'] 73 | 74 | word_idxs = torch.tensor([self.vocab.stoi.get(w, len(self.vocab.stoi)-1) 75 | for w in self.datas[item]['sentence'].strip().split()], dtype=torch.long) 76 | embedding = self.word_embedding(word_idxs) 77 | embedding_length = embedding.shape[0] 78 | if embedding_length > self.max_embedding_length: 79 | embedding_padded = embedding[: self.max_embedding_length, :] 80 | embedding_length = self.max_embedding_length 81 | else: 82 | embedding_padded = torch.zeros( 83 | (self.max_embedding_length, embedding.shape[1])) 84 | embedding_padded[: embedding.shape[0], :] = embedding 85 | 86 | for i in range(len(self.datas[item]['frames'])): 87 | frame_names.append(self.datas[item]['frames'][i].split('/')[-1].split('.')[0]) 88 | frame = Image.open(self.datas[item]['frames'][i]).convert('RGB') 89 | frames.append(frame) 90 | w, h = frame.size 91 | 92 | sign = True 93 | if self.datas[item]['label'][i] != 'None': 94 | with h5py.File(self.datas[item]['label'][i], 'r') as file_annotation: 95 | if int(instance) not in list(file_annotation['instance']): 96 | annotation = Image.new('L', (w, h)) 97 | else: 98 | if len(file_annotation['reMask'].shape) != 3: 99 | annotation = file_annotation['reMask'][:] 100 | else: 101 | annotation = file_annotation['reMask'][np.where(file_annotation['instance'][:] == int(instance))][0] 102 | annotation = Image.fromarray(annotation.T) 103 | else: 104 | annotation = Image.new('L', (w, h)) 105 | sign = False 106 | annotations.append(annotation) 107 | is_annotated.append(sign) 108 | 109 | sample = {} 110 | sample['frames'] = frames 111 | sample['label'] = annotations 112 | sample = self.transformation(sample) 113 | sample['word_embedding'] = embedding_padded 114 | sample['embedding_length'] = embedding_length 115 | sample['is_annotated'] = is_annotated 116 | 
sample['video'] = self.datas[item]['video'] 117 | sample['name'] = frame_names 118 | sample['dataset'] = self.datas[item]['dataset'] 119 | sample['instance'] = instance 120 | 121 | return sample 122 | 123 | 124 | 125 | 126 | 127 | -------------------------------------------------------------------------------- /temporal_grounding/utils/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def generate_anchors(windows): 6 | widths = np.array(windows) 7 | center = 7.5 8 | start = center - 0.5 * (widths - 1) 9 | end = center + 0.5 * (widths - 1) 10 | return np.stack([start, end], -1) 11 | 12 | def generate_proposals(max_num_frames, windows): 13 | anchors = generate_anchors(windows) 14 | widths = (anchors[:, 1] - anchors[:, 0] + 1) # [num_anchors] 15 | centers = np.arange(0, max_num_frames) # [video_len] 16 | start = np.expand_dims(centers, 1) - 0.5 * (np.expand_dims(widths, 0) - 1) 17 | end = np.expand_dims(centers, 1) + 0.5 * (np.expand_dims(widths, 0) - 1) 18 | proposals = np.stack([start, end], -1) # [video_len, num_anchors, 2] 19 | return proposals 20 | 21 | def generate_scores(proposals, label, max_num_frames, thres_score, thres_adjmat): 22 | proposals = np.reshape(proposals, [-1, 2]) 23 | illegal = np.logical_or(proposals[:, 0] < 0, proposals[:, 1] >= max_num_frames) 24 | label1 = np.repeat(np.expand_dims(label, 0), proposals.shape[0], 0) 25 | IoUs = calculate_IoU_batch((proposals[:, 0], proposals[:, 1]), 26 | (label1[:, 0], label1[:, 1])) 27 | IoUs[illegal] = 0.0 # [video_len * num_anchors] 28 | max_IoU = np.max(IoUs) 29 | IoUs[IoUs < thres_score * max_IoU] = 0.0 30 | IoUs = IoUs / (max_IoU + 1e-4) 31 | adj_mat = IoUs.copy() 32 | adj_mat[adj_mat < thres_adjmat] = 0.0 # best 0.7 * max_IoU 33 | 34 | scores = IoUs.astype(np.float32) 35 | scores_mask = (1 - illegal).astype(np.uint8) 36 | return scores, scores_mask, adj_mat 37 | 38 | def calculate_IoU_batch(i0, i1): 39 | union = (np.min(np.stack([i0[0], i1[0]], 0), 0), np.max(np.stack([i0[1], i1[1]], 0), 0)) 40 | inter = (np.max(np.stack([i0[0], i1[0]], 0), 0), np.min(np.stack([i0[1], i1[1]], 0), 0)) 41 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 42 | iou[union[1] - union[0] < -1e-5] = 0 43 | iou[iou < 0] = 0.0 44 | return iou 45 | 46 | def calculate_IoU(i0, i1): 47 | union = (min(i0[0], i1[0]), max(i0[1], i1[1])) 48 | inter = (max(i0[0], i1[0]), min(i0[1], i1[1])) 49 | 50 | if union[1] - union[0] < -1e-5: 51 | return 0 52 | iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0] + 1e-10) 53 | return iou if iou >= 0.0 else 0.0 54 | 55 | 56 | def average_to_fixed_length(visual_input, num_sample_clips): 57 | num_clips = visual_input.shape[0] 58 | idxs = torch.arange(0, num_sample_clips+1, 1.0)/num_sample_clips*num_clips 59 | idxs = torch.min(torch.round(idxs).long(), torch.tensor(num_clips-1)) 60 | new_visual_input = [] 61 | for i in range(num_sample_clips): 62 | s_idx, e_idx = idxs[i].item(), idxs[i+1].item() 63 | if s_idx < e_idx: 64 | new_visual_input.append(torch.mean( 65 | visual_input[s_idx:e_idx], dim=0)) 66 | else: 67 | new_visual_input.append(visual_input[s_idx]) 68 | new_visual_input = torch.stack(new_visual_input, dim=0) 69 | return new_visual_input 70 | 71 | 72 | def nms_temporal(predict_score, predict_windows, overlap): 73 | pick = list() 74 | starts = predict_windows[:, 0] 75 | ends = predict_windows[:, 1] 76 | scores = predict_score 77 | assert len(starts) == len(scores) 78 | if len(starts) == 0: 79 | return pick 80 
| 81 | unions = ends - starts 82 | # sort and get index 83 | indexs = [x[0] for x in sorted(enumerate(scores), key=lambda x:x[1])] 84 | 85 | while len(indexs) > 0: 86 | i = indexs[-1] 87 | pick.append(i) 88 | 89 | lefts = [max(starts[i], starts[j]) for j in indexs[:-1]] 90 | rights = [min(ends[i], ends[j]) for j in indexs[:-1]] 91 | inters = [max(0.0, right-left) for left, right in zip(lefts, rights)] 92 | laps = [inters[u]/(unions[i] + unions[indexs[u]] - inters[u]) 93 | for u in range(len(indexs)-1)] 94 | indexs_new = [] 95 | for j in range(len(laps)): 96 | if laps[j] <= overlap: 97 | indexs_new.append(indexs[j]) 98 | indexs = indexs_new 99 | 100 | return pick 101 | 102 | 103 | def compute_IoU_recall_top_n(predict_windows, gt_windows, picks, top_n, IoU_thresh): 104 | 105 | correct = 0 106 | if top_n < len(picks): 107 | cur_picks = picks[0:top_n] 108 | else: 109 | cur_picks = picks 110 | for index in cur_picks: 111 | pred_start = predict_windows[index][0] 112 | pred_end = predict_windows[index][1] 113 | iou = calculate_IoU(gt_windows, (pred_start, pred_end)) 114 | if iou >= IoU_thresh: 115 | correct += 1 116 | break 117 | 118 | return correct 119 | 120 | 121 | def compute_IoU_recall(predict_score, predict_windows, gt_windows): 122 | 123 | IoU_threshs = [0.1, 0.3, 0.5, 0.7] 124 | top_n_list = [1, 5] 125 | topn_IoU_matric = np.zeros([2, 4], dtype=np.float32) 126 | 127 | for i, IoU_thresh in enumerate(IoU_threshs): 128 | picks = nms_temporal(predict_score, predict_windows, IoU_thresh-0.05) 129 | 130 | for j, top_n in enumerate(top_n_list): 131 | correct = compute_IoU_recall_top_n( 132 | predict_windows, gt_windows, picks, top_n, IoU_thresh) 133 | topn_IoU_matric[j, i] = correct 134 | 135 | return topn_IoU_matric 136 | 137 | 138 | class CountMeter(object): 139 | """Computes and stores the average and current value""" 140 | 141 | def __init__(self): 142 | self.reset() 143 | 144 | def reset(self): 145 | self.val = np.zeros([2, 4], dtype=np.float32) 146 | self.count = 0 147 | 148 | def update(self, val, n=1): 149 | self.val += val 150 | self.count += n 151 | 152 | 153 | class AverageMeter(object): 154 | """Computes and stores the average and current value""" 155 | 156 | def __init__(self): 157 | self.reset() 158 | 159 | def reset(self): 160 | self.val = 0 161 | self.avg = 0 162 | self.sum = 0 163 | self.count = 0 164 | 165 | def update(self, val, n=1): 166 | self.val = val 167 | self.sum += val * n 168 | self.count += n 169 | self.avg = self.sum / self.count 170 | -------------------------------------------------------------------------------- /referring_segmentation/model/module/attention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from einops import rearrange, repeat 5 | from torch import nn, einsum 6 | 7 | 8 | class GlobalTextPresentation(nn.Module): 9 | def __init__(self, text_dim): 10 | super(GlobalTextPresentation, self).__init__() 11 | self.W_txt = nn.Linear(text_dim, text_dim) 12 | 13 | def forward(self, fea_text, mask=None): 14 | fea_text = fea_text.permute(0, 2, 1) # B*L*C2 15 | weight_text = self.W_txt(fea_text) # B*L*C 16 | if mask is not None: 17 | mask = mask.permute(0, 2, 1) 18 | weight_text = weight_text.masked_fill(mask == 0, -1e9) 19 | weight_text = weight_text.softmax(dim=1) 20 | fea_text_global = fea_text * weight_text 21 | fea_text_global = fea_text_global.sum(dim=1, keepdim=True).permute(0, 2, 1).unsqueeze(-1) # B*C*1*1 22 | return fea_text_global 23 | 
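# Toy shape check for GlobalTextPresentation (sizes are only illustrative):
#   gtp = GlobalTextPresentation(text_dim=300)
#   fea_text = torch.randn(2, 300, 20)   # B*C*L word-level text features
#   mask = torch.ones(2, 1, 20)          # B*1*L, zeros mark padded words
#   out = gtp(fea_text, mask)            # B*C*1*1 -> torch.Size([2, 300, 1, 1])
# i.e. the word features are softmax-weighted along L and pooled into a single
# global sentence vector that can be broadcast over a spatial feature map.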
24 | 25 | class GlobalAttention(nn.Module): 26 | def __init__(self, video_feature_dim, text_dim, global_attention_dim): 27 | super(GlobalAttention, self).__init__() 28 | self.scale = global_attention_dim ** -0.5 29 | 30 | self.Q = nn.Linear(video_feature_dim+text_dim+8, global_attention_dim) 31 | self.K = nn.Linear(text_dim, global_attention_dim) 32 | self.V = nn.Linear(text_dim, global_attention_dim) 33 | 34 | def forward(self, fea_video, fea_text): 35 | """ 36 | :param fea_video: B*(C1+C2+8)*H*W 37 | :param fea_text: B*C2*1*1 38 | :param mask: B*1*L 39 | :return: 40 | """ 41 | B, C1, H, W = fea_video.shape 42 | B, C2, _, _ = fea_text.shape 43 | fea_video = fea_video.view(B, C1, -1).permute(0, 2, 1) 44 | fea_text = fea_text.view(B, C2, -1).permute(0, 2, 1) 45 | 46 | 47 | q = self.Q(fea_video) 48 | k = self.K(fea_text) 49 | v = self.V(fea_text) 50 | 51 | att = torch.matmul(q, k.permute(0, 2, 1)) * self.scale # B*HW*1 52 | att = att.softmax(-1) 53 | out = torch.matmul(att, v) # B*HW*C 54 | out = out.permute(0, 2, 1).view(B, -1, H, W) 55 | return out 56 | 57 | 58 | class LocalAttention(nn.Module): 59 | def __init__(self, video_feature_dim, text_dim, attention_dim): 60 | super(LocalAttention, self).__init__() 61 | self.scale = attention_dim ** -0.5 62 | 63 | self.Q = nn.Linear(video_feature_dim, attention_dim) 64 | self.K = nn.Linear(text_dim, attention_dim) 65 | self.V = nn.Linear(text_dim, attention_dim) 66 | 67 | def forward(self, fea_video, fea_text, mask): 68 | """ 69 | :param fea_video: B*C*T*H*W 70 | :param fea_text: B*C*L 71 | :param mask: B*HW*L 72 | :return: 73 | """ 74 | 75 | B, C, T, H, W = fea_video.shape 76 | fea_frames = fea_video.chunk(T, dim=2) 77 | fea_text = fea_text.permute(0, 2, 1) # B*L*C 78 | outs = [] 79 | for fea_frame in fea_frames: 80 | fea_frame = fea_frame.view(B, C, -1).permute(0, 2, 1) # B*HW*C 81 | 82 | q = self.Q(fea_frame) 83 | k = self.K(fea_text) 84 | v = self.V(fea_text) 85 | 86 | att = torch.matmul(q, k.permute(0, 2, 1)) * self.scale # B*HW*L 87 | if mask is not None: 88 | att = att.masked_fill(mask == 0, -1e9) 89 | att = att.softmax(-1) 90 | out = torch.matmul(att, v) # B*HW*C 91 | out = out.permute(0, 2, 1).view(B, C, H, W).unsqueeze(2) 92 | outs.append(out) 93 | outs = torch.cat(outs, dim=2) 94 | return outs 95 | 96 | 97 | class MuTan(nn.Module): 98 | def __init__(self, video_fea_dim, text_fea_dim, out_fea_dim, heads = 5): 99 | super(MuTan, self).__init__() 100 | 101 | self.heads = heads 102 | self.Wv = nn.ModuleList([nn.Conv2d(video_fea_dim+8, out_fea_dim, 1, 1) for i in range(heads)]) 103 | self.Wt = nn.ModuleList([nn.Conv2d(text_fea_dim, out_fea_dim, 1, 1) for i in range(heads)]) 104 | 105 | def forward(self, video_fea, text_fea, spatial): 106 | video_fea = torch.cat([video_fea, spatial], dim=1) 107 | fea_outs = [] 108 | for i in range(self.heads): 109 | fea_v = self.Wv[i](video_fea) 110 | fea_v = torch.tanh(fea_v) # B*C*H*W 111 | 112 | fea_t = self.Wt[i](text_fea) 113 | fea_t = torch.tanh(fea_t) # B*C*1*1 114 | 115 | fea_out = fea_v * fea_t 116 | fea_outs.append(fea_out.unsqueeze(-1)) 117 | fea_outs = torch.cat(fea_outs, dim=-1) 118 | fea_outs = torch.sum(fea_outs, dim=-1) 119 | mutan_fea = torch.tanh(fea_outs) 120 | mutan_fea = F.normalize(mutan_fea, dim=1) 121 | return mutan_fea 122 | 123 | class RelevanceFilter(nn.Module): 124 | def __init__(self, text_fea_dim, video_fea_dim, attention_dim, groups=8, kernelsize=(1, 1, 1)): 125 | super(RelevanceFilter, self).__init__() 126 | assert text_fea_dim % groups == 0 127 | assert attention_dim % groups == 
0 128 | self.groups = groups 129 | self.Wv = nn.Conv3d(video_fea_dim, 2 * attention_dim, 1, 1) 130 | 131 | self.Wt = nn.Linear(text_fea_dim, attention_dim * 132 | kernelsize[0] * kernelsize[1] * kernelsize[2]) 133 | self.kernel_size = kernelsize 134 | 135 | def forward(self, video_fea, text_fea): 136 | 137 | fea = self.Wv(video_fea) # B*C*T*H*W 138 | B, C, T, H, W = video_fea.shape 139 | k, v = fea.chunk(2, dim=1) 140 | kernel = self.Wt(text_fea) # B*L*(C*K*K) 141 | kernel = repeat(kernel, 'b l (g c t h w) -> (b g l) c t h w', 142 | t=self.kernel_size[0], h=self.kernel_size[1], w=self.kernel_size[2], g=self.groups) 143 | k = repeat(k, 'b c t h w -> n (b c) t h w', n=1) 144 | att = F.conv3d(k, kernel, padding=( 145 | self.kernel_size[0]//2, self.kernel_size[1]//2, self.kernel_size[2]//2), groups=B*self.groups) 146 | att = rearrange( 147 | att, 'n (b g c) t h w -> (n b) g c t h w', b=B, g=self.groups) 148 | active_map = att.mean(dim=1) 149 | v = rearrange(v, 'b (g c) t h w -> b g c t h w', g=self.groups) 150 | out = v * torch.sigmoid(att) 151 | out = rearrange(out, 'b g c t h w -> b (g c) t h w') 152 | maps = active_map.permute( 153 | 2, 0, 1, 3, 4) 154 | maps = [maps[i] for i in range(T)] 155 | return maps, out 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/criterion.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import torch.nn as nn 4 | from .anchor_utils import generate_proposals, generate_scores, generate_2d_gaussian 5 | from einops import repeat, rearrange 6 | from .utils import generate_anchor_scores, compute_temporal_reg_tar, segment_tiou 7 | from .utils import generate_2d_gaussian as generate_2d_gaussian_new 8 | 9 | 10 | class SetCriterion(nn.Module): 11 | def __init__(self, cfg): 12 | super().__init__() 13 | self.cfg = cfg 14 | self.temporal_reg_loss = nn.SmoothL1Loss() 15 | self.temporal_cls_loss = nn.BCELoss() 16 | self.spatial_hm_loss = nn.BCELoss() 17 | self.spatial_wh_loss = nn.SmoothL1Loss() 18 | 19 | def loss_spatial(self, outputs, targets, inter_idx): 20 | inter_idx = inter_idx[0] 21 | h, w = outputs['spatial_map'].shape[-2:] 22 | box_gt = [targets[i]['boxes'] for i in range(len(targets))] 23 | box_gt = torch.cat(box_gt, dim=0) # k*4, cxcywh 24 | 25 | size_gt = [targets[i]['size'] 26 | for i in range(len(targets))] # current input frame size 27 | size_gt = torch.stack(size_gt) # [\sigma t_i, 2] 28 | padded_size_gt = torch.max(size_gt, dim=0)[0] # [2] 29 | box_gt_unnormed = box_gt * \ 30 | torch.stack([size_gt[:, 1], size_gt[:, 0], 31 | size_gt[:, 1], size_gt[:, 0]], dim=-1) 32 | padded_box_gt = box_gt_unnormed / \ 33 | torch.stack([padded_size_gt[1], padded_size_gt[0], 34 | padded_size_gt[1], padded_size_gt[0]], dim=-1)[None] 35 | gaussian_gt = generate_2d_gaussian(padded_box_gt, w, h, delta=0.05)[ 36 | :, None] # k*1*h*w 37 | wh_gt = (padded_box_gt[:, 2:] * torch.as_tensor([w, h])[None, 38 | :].to(box_gt.device))[..., None, None].repeat(1, 1, h, w) 39 | 40 | pred_hm = outputs['spatial_map'] # k*1*h*w 41 | pred_wh = outputs['spatial_wh'] # k*2*h*w 42 | 43 | loss_hm = self.spatial_hm_loss(pred_hm, gaussian_gt) 44 | loss_wh = self.spatial_wh_loss(pred_wh*gaussian_gt, wh_gt*gaussian_gt) 45 | loss_map = 0 46 | for map in outputs['maps']: 47 | map = F.interpolate( 48 | map, (h, w), mode='bilinear', align_corners=True) 49 | loss_map += self.spatial_hm_loss(map, 
gaussian_gt) 50 | 51 | return { 52 | 'spatial_hm_loss': loss_hm, 53 | 'spatial_wh_loss': loss_wh, 54 | 'spatial_map_loss': loss_map 55 | }, gaussian_gt 56 | 57 | def loss_temporal(self, outputs, durations, inter_idx): 58 | device = outputs['spatial_map'].device 59 | seq_len = max(durations) 60 | b = len(durations) 61 | inter_idx = torch.as_tensor(inter_idx).float().to(device) # [b, 2] 62 | index = torch.as_tensor([i for i in range(seq_len)]).to(device)[ 63 | None].repeat(b, 1) # [b, t] 64 | inter_idx_expand = inter_idx[:, None].repeat( 65 | 1, seq_len, 1) # [b, t, 2] 66 | # [b, t], 1 for moments when action happens, otherwise 0 67 | action_gt = ((index >= inter_idx_expand[..., 0]) & ( 68 | index <= inter_idx_expand[..., 1])).float() 69 | 70 | # [b, t] "True" represent the padded moment 71 | time_mask = torch.ones(b, seq_len).bool().to(device) 72 | for i_dur, duration in enumerate(durations): 73 | time_mask[i_dur, :duration] = False 74 | if self.cfg.temporal_decoder_type == 'anchor': 75 | proposals = generate_proposals(seq_len, self.cfg.temporal_window_width)[ 76 | None].repeat(b, 1, 1).to(device) # [b, t*n_window, 2] 77 | score_gt, score_mask = generate_anchor_scores( 78 | proposals, inter_idx, seq_len, self.cfg.temporal_score_thres) 79 | time_mask_expanded = repeat( 80 | time_mask, 'b t -> b (t n)', n=len(self.cfg.temporal_window_width)) 81 | score_mask[time_mask_expanded] = True # [b, t*n_window] 82 | score_pos = (score_gt >= self.cfg.temporal_valid_thres).float() 83 | score_pos = score_pos.masked_fill(time_mask_expanded, 0.) 84 | reg_gt = inter_idx[:, None].repeat(1, proposals.shape[1], 1) 85 | refined_box = outputs['temporal_offset'] + \ 86 | proposals # [b, t*n_window, 2] 87 | loss_reg = self.temporal_reg_loss( 88 | refined_box*score_pos[..., None], reg_gt*score_pos[..., None]) 89 | loss_cls = self.temporal_cls_loss(outputs['temporal_score'].masked_fill(time_mask_expanded[..., None], 0.), 90 | score_gt.masked_fill(time_mask_expanded[..., None], 0.)) 91 | return { 92 | 'temporal_cls_loss': loss_cls, 93 | 'temporal_align_loss': loss_reg 94 | } 95 | 96 | elif self.cfg.temporal_decoder_type == 'regression': 97 | pred_start = index - outputs['temporal_reg'][:, :, 0] 98 | pred_end = index + outputs['temporal_reg'][:, :, 1] 99 | predictions = torch.stack([pred_start, pred_end], dim=-1) / seq_len 100 | predictions = torch.clamp(predictions, 0, 1) 101 | label_reg = compute_temporal_reg_tar(inter_idx, action_gt) 102 | label_iou = segment_tiou(predictions, inter_idx[:, None] / seq_len) 103 | iou_pos_ind = label_iou > 0.5 104 | pos_iou_target = label_iou[iou_pos_ind] 105 | pos_iou_pred = outputs['temporal_iou'][iou_pos_ind] 106 | loss_reg = self.temporal_reg_loss( 107 | outputs['temporal_reg'] * action_gt.unsqueeze(-1), label_reg) 108 | loss_score = self.temporal_cls_loss( 109 | outputs['temporal_score'], action_gt) 110 | if iou_pos_ind.sum().item() == 0: 111 | loss_iou = 0 112 | else: 113 | loss_iou = self.temporal_cls_loss( 114 | pos_iou_pred, pos_iou_target.detach()) 115 | return { 116 | 'temporal_score_loss': loss_score, 117 | 'temporal_reg_loss': loss_reg, 118 | 'temporal_iou_loss': loss_iou 119 | } 120 | else: 121 | raise NotImplementedError 122 | 123 | def forward(self, outputs, durations, inter_idx, targets): 124 | loss_dict = self.loss_temporal(outputs, durations, inter_idx) 125 | loss_dict_s, gaussian_gt = self.loss_spatial( 126 | outputs, targets, inter_idx) 127 | loss_dict.update(loss_dict_s) 128 | return loss_dict, gaussian_gt 129 | 
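The temporal branch above leans on the small helpers imported from models/utils.py (compute_temporal_reg_tar and segment_tiou). A minimal sanity check with made-up numbers, assuming it is run from the spatiotemporal_grounding directory so that the models package resolves:

import torch
from models.utils import compute_temporal_reg_tar, segment_tiou

# One video, 8 temporal positions, ground-truth moment spanning frames 2..5.
inter_idx = torch.tensor([[2., 5.]])        # [b, 2] start/end indices
action_gt = torch.zeros(1, 8)
action_gt[0, 2:6] = 1.0                     # [b, t] "action happens here" mask

# Per-frame regression targets: (distance past start, distance to end),
# zeroed outside the ground-truth span.
reg_tar = compute_temporal_reg_tar(inter_idx, action_gt)   # [1, 8, 2]
print(reg_tar[0, 3])                        # tensor([1., 2.]) for frame 3

# Temporal IoU of a candidate segment against the ground truth.
pred = torch.tensor([[[1., 4.]]])           # [b, k, 2]
print(segment_tiou(pred, inter_idx[:, None]))  # ~0.5 (intersection 2, union 4)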
-------------------------------------------------------------------------------- /referring_segmentation/model/backbone/mobilenet.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import torch.nn as nn 4 | import math 5 | # from modeling.sync_batchnorm.batchnorm import SynchronizedBatchNorm2d 6 | import torch.utils.model_zoo as model_zoo 7 | 8 | 9 | 10 | def conv_bn(inp, oup, stride, BatchNorm): 11 | return nn.Sequential( 12 | nn.Conv2d(inp, oup, 3, stride, 1, bias=False), 13 | BatchNorm(oup), 14 | nn.ReLU6(inplace=True) 15 | ) 16 | 17 | 18 | def fixed_padding(inputs, kernel_size, dilation): 19 | kernel_size_effective = kernel_size + (kernel_size - 1) * (dilation - 1) 20 | pad_total = kernel_size_effective - 1 21 | pad_beg = pad_total // 2 22 | pad_end = pad_total - pad_beg 23 | padded_inputs = F.pad(inputs, (pad_beg, pad_end, pad_beg, pad_end)) 24 | return padded_inputs 25 | 26 | 27 | class InvertedResidual(nn.Module): 28 | def __init__(self, inp, oup, stride, dilation, expand_ratio, BatchNorm): 29 | super(InvertedResidual, self).__init__() 30 | self.stride = stride 31 | assert stride in [1, 2] 32 | 33 | hidden_dim = round(inp * expand_ratio) 34 | self.use_res_connect = self.stride == 1 and inp == oup 35 | self.kernel_size = 3 36 | self.dilation = dilation 37 | 38 | if expand_ratio == 1: 39 | self.conv = nn.Sequential( 40 | # dw 41 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 0, dilation, groups=hidden_dim, bias=False), 42 | BatchNorm(hidden_dim), 43 | nn.ReLU6(inplace=True), 44 | # pw-linear 45 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, 1, 1, bias=False), 46 | BatchNorm(oup), 47 | ) 48 | else: 49 | self.conv = nn.Sequential( 50 | # pw 51 | nn.Conv2d(inp, hidden_dim, 1, 1, 0, 1, bias=False), 52 | BatchNorm(hidden_dim), 53 | nn.ReLU6(inplace=True), 54 | # dw 55 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 0, dilation, groups=hidden_dim, bias=False), 56 | BatchNorm(hidden_dim), 57 | nn.ReLU6(inplace=True), 58 | # pw-linear 59 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, 1, bias=False), 60 | BatchNorm(oup), 61 | ) 62 | 63 | def forward(self, x): 64 | x_pad = fixed_padding(x, self.kernel_size, dilation=self.dilation) 65 | if self.use_res_connect: 66 | x = x + self.conv(x_pad) 67 | else: 68 | x = self.conv(x_pad) 69 | return x 70 | 71 | 72 | class MobileNetV2(nn.Module): 73 | def __init__(self, inchannel, output_stride=8, BatchNorm=None, width_mult=1., pretrained=True): 74 | super(MobileNetV2, self).__init__() 75 | block = InvertedResidual 76 | input_channel = 32 77 | current_stride = 1 78 | rate = 1 79 | interverted_residual_setting = [ 80 | # t, c, n, s 81 | [1, 16, 1, 1], 82 | [6, 24, 2, 2], 83 | [6, 32, 3, 2], 84 | [6, 64, 4, 2], 85 | [6, 96, 3, 1], 86 | [6, 160, 3, 2], 87 | [6, 320, 1, 1], 88 | ] 89 | 90 | # building first layer 91 | input_channel = int(input_channel * width_mult) 92 | self.features = [conv_bn(inchannel, input_channel, 2, BatchNorm)] 93 | current_stride *= 2 94 | # building inverted residual blocks 95 | for t, c, n, s in interverted_residual_setting: 96 | if current_stride == output_stride: 97 | stride = 1 98 | dilation = rate 99 | rate *= s 100 | else: 101 | stride = s 102 | dilation = 1 103 | current_stride *= s 104 | output_channel = int(c * width_mult) 105 | for i in range(n): 106 | if i == 0: 107 | self.features.append(block(input_channel, output_channel, stride, dilation, t, BatchNorm)) 108 | else: 109 | self.features.append(block(input_channel, output_channel, 1, dilation, t, 
BatchNorm)) 110 | input_channel = output_channel 111 | self.features = nn.Sequential(*self.features) 112 | self._initialize_weights() 113 | 114 | if pretrained: 115 | self._load_pretrained_model() 116 | 117 | self.low_level_features = self.features[0:4] 118 | self.high_level_features = self.features[4:] 119 | 120 | def forward(self, x): 121 | low_level_feat = self.low_level_features(x) 122 | x = self.high_level_features(low_level_feat) 123 | return x, low_level_feat 124 | 125 | def _load_pretrained_model(self): 126 | pretrain_dict = model_zoo.load_url('http://jeff95.me/models/mobilenet_v2-6a65762b.pth') 127 | model_dict = {} 128 | state_dict = self.state_dict() 129 | for k, v in pretrain_dict.items(): 130 | if k in state_dict: 131 | model_dict[k] = v 132 | state_dict.update(model_dict) 133 | self.load_state_dict(state_dict) 134 | 135 | def _initialize_weights(self): 136 | for m in self.modules(): 137 | if isinstance(m, nn.Conv2d): 138 | # n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels 139 | # m.weight.data.normal_(0, math.sqrt(2. / n)) 140 | torch.nn.init.kaiming_normal_(m.weight) 141 | # elif isinstance(m, SynchronizedBatchNorm2d): 142 | # m.weight.data.fill_(1) 143 | # m.bias.data.zero_() 144 | elif isinstance(m, nn.BatchNorm2d): 145 | m.weight.data.fill_(1) 146 | m.bias.data.zero_() 147 | 148 | 149 | class Mobilenet_deeplab(nn.Module): 150 | def __init__(self, inchannel=3, os=16, pretrained=False): 151 | super(Mobilenet_deeplab, self).__init__() 152 | self.backbone = MobileNetV2(inchannel, os, BatchNorm=nn.BatchNorm2d, pretrained=False) 153 | if pretrained: 154 | self._load_pretrained_model() 155 | 156 | def _load_pretrained_model(self): 157 | root = '/media/HardDisk/wwk/video_text/codes/code1/model/pretrained/deeplab-mobilenet.pth.tar' 158 | pretrain_dict = torch.load(root) 159 | state_dict = {k: v for k, v in pretrain_dict['state_dict'].items() if k in self.state_dict().keys()} 160 | # model_dict = {} 161 | # state_dict = self.state_dict() 162 | # for k, v in pretrain_dict.items(): 163 | # if k in state_dict: 164 | # model_dict[k] = v 165 | # state_dict.update(model_dict) 166 | self.load_state_dict(state_dict) 167 | 168 | def forward(self, x): 169 | x, low_level_feat = self.backbone(x) 170 | return x, low_level_feat 171 | 172 | -------------------------------------------------------------------------------- /referring_segmentation/model/module/TCN.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from model.module.attention import LocalAttention, RelevanceFilter 5 | 6 | 7 | class TCN(nn.Module): 8 | def __init__(self, text_dim, inchannel, hidden_channel, outchannel, layers=8, padding_type='zero', with_local_attention=True, conv_type='3D', local_attention_type='relevance_filter', groups=8, norm_type='GroupNorm'): 9 | super(TCN, self).__init__() 10 | self.padding_type = padding_type 11 | self.with_local_attention = with_local_attention 12 | self.local_attention_type = local_attention_type 13 | self.conv_time = nn.ModuleList() 14 | self.conv_spatial = nn.ModuleList() 15 | self.conv_convert = nn.ModuleList() 16 | self.dilations = [] 17 | self.local_attention = nn.ModuleList() 18 | # self.global_txt_W = nn.ModuleList() 19 | for i in range(layers): 20 | # self.global_txt_W.append(nn.Linear(text_dim, hidden_channel)) 21 | dilation = torch.pow(torch.tensor(2), i) 22 | dilation = int(dilation) 23 | self.dilations.append(dilation) 24 | if with_local_attention: 25 | if 
local_attention_type == 'attention': 26 | self.local_attention.append(LocalAttention(inchannel, text_dim, inchannel)) 27 | else: 28 | self.local_attention.append(RelevanceFilter(text_dim, inchannel, inchannel, groups=groups)) 29 | else: 30 | self.local_attention.append(nn.Identity()) 31 | 32 | if conv_type == '3D': 33 | self.conv_spatial.append(nn.Identity()) 34 | if norm_type == "GroupNorm": 35 | self.conv_time.append( 36 | nn.Sequential( 37 | nn.Conv3d(inchannel, hidden_channel, (3, 3, 3), 1, (0, 1, 1), (dilation, 1, 1), bias=False), 38 | nn.GroupNorm(8, hidden_channel), 39 | nn.ReLU(inplace=True)) 40 | ) 41 | else: 42 | self.conv_time.append( 43 | nn.Sequential( 44 | nn.Conv3d(inchannel, hidden_channel, (3, 3, 3), 1, (0, 1, 1), (dilation, 1, 1), bias=False), 45 | nn.BatchNorm3d(hidden_channel), 46 | nn.ReLU(inplace=True)) 47 | ) 48 | 49 | else: 50 | if norm_type == "GroupNorm": 51 | self.conv_spatial.append( 52 | nn.Sequential( 53 | nn.Conv3d(inchannel, hidden_channel, (1, 3, 3), 1, (0, 1, 1), (1, 1, 1), bias=False), 54 | nn.GroupNorm(8, hidden_channel), 55 | nn.ReLU(inplace=True) 56 | ) 57 | ) 58 | self.conv_time.append( 59 | nn.Sequential( 60 | nn.Conv3d(hidden_channel, hidden_channel, (3, 1, 1), (1, 1, 1), (0, 0, 0), (dilation, 1, 1), bias=False), 61 | nn.GroupNorm(8, hidden_channel), 62 | nn.ReLU(inplace=True) 63 | ) 64 | ) 65 | else: 66 | self.conv_spatial.append( 67 | nn.Sequential( 68 | nn.Conv3d(inchannel, hidden_channel, (1, 3, 3), 1, (0, 1, 1), (1, 1, 1), bias=False), 69 | nn.BatchNorm3d(hidden_channel), 70 | nn.ReLU(inplace=True) 71 | ) 72 | ) 73 | self.conv_time.append( 74 | nn.Sequential( 75 | nn.Conv3d(hidden_channel, hidden_channel, (3, 1, 1), (1, 1, 1), (0, 0, 0), (dilation, 1, 1), bias=False), 76 | nn.BatchNorm3d(hidden_channel), 77 | nn.ReLU(inplace=True) 78 | ) 79 | ) 80 | if norm_type == "GroupNorm": 81 | self.conv_convert.append( 82 | nn.Sequential( 83 | nn.Conv3d(hidden_channel, outchannel, 1, 1, bias=False), 84 | nn.GroupNorm(8, outchannel) 85 | ) 86 | ) 87 | else: 88 | self.conv_convert.append( 89 | nn.Sequential( 90 | nn.Conv3d(hidden_channel, outchannel, 1, 1, bias=False), 91 | nn.BatchNorm3d(outchannel) 92 | ) 93 | ) 94 | self.__init_weight() 95 | 96 | def __init_weight(self): 97 | for m in self.modules(): 98 | if isinstance(m, nn.Conv3d): 99 | torch.nn.init.kaiming_normal_(m.weight) 100 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 101 | m.weight.data.fill_(1) 102 | m.bias.data.zero_() 103 | 104 | def forward(self, fea, fea_text, mask_local): 105 | fea_text = fea_text.permute(0, 2, 1) # B*L*C 106 | maps_layers = [] 107 | for i in range(len(self.conv_time)): 108 | res0 = fea 109 | 110 | if self.with_local_attention: 111 | if self.local_attention_type == 'attention': 112 | fea = self.local_attention[i](fea, fea_text, mask_local) 113 | else: 114 | maps, fea = self.local_attention[i](fea, fea_text) 115 | maps_layers.append(maps) 116 | fea = res0 + fea 117 | 118 | res1 = fea 119 | fea = self.conv_spatial[i](fea) 120 | 121 | if self.padding_type == 'circle': 122 | fea = circle_padding(self.dilations[i], fea) 123 | elif self.padding_type == 'zero': 124 | fea = F.pad(fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='constant', value=0) 125 | else: 126 | fea = F.pad(fea, (0, 0, 0, 0, self.dilations[i], self.dilations[i]), mode='circular') 127 | 128 | fea = self.conv_time[i](fea) # B*C*T 129 | fea = self.conv_convert[i](fea) 130 | fea = fea + res1 131 | return fea, maps_layers 132 | 133 | 134 | def circle_padding(padding, feature): 135 | 
length_times = feature.shape[2] 136 | index = list(range(0, length_times)) + list(range(length_times - 2, 0, -1)) 137 | total_num = 2 * padding + length_times 138 | num_c = padding // len(index) 139 | if num_c * len(index) < padding: 140 | num_c = num_c + 1 141 | expand_number = num_c * len(index) - padding 142 | index_f = [] 143 | for n in range(num_c): 144 | index = index + index + index 145 | for i in range(expand_number, expand_number + total_num): 146 | index_f.append(index[i]) 147 | 148 | feas = [] 149 | for idf in index_f: 150 | feas.append(feature[:, :, idf, :, :].unsqueeze(2)) 151 | feas = torch.cat(feas, dim=2) 152 | return feas 153 | -------------------------------------------------------------------------------- /spatiotemporal_grounding/util/metrics.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved 2 | """ 3 | Various utilities related to track and report metrics 4 | """ 5 | import datetime 6 | import time 7 | from collections import defaultdict, deque 8 | 9 | import torch 10 | import torch.distributed as dist 11 | 12 | from util.dist import is_dist_avail_and_initialized 13 | 14 | 15 | class SmoothedValue: 16 | """Track a series of values and provide access to smoothed values over a 17 | window or the global series average. 18 | """ 19 | 20 | def __init__(self, window_size=20, fmt=None): 21 | if fmt is None: 22 | fmt = "{median:.4f} ({global_avg:.4f})" 23 | self.deque = deque(maxlen=window_size) 24 | self.total = 0.0 25 | self.count = 0 26 | self.fmt = fmt 27 | 28 | def update(self, value, num=1): 29 | self.deque.append(value) 30 | self.count += num 31 | self.total += value * num 32 | 33 | def synchronize_between_processes(self): 34 | """ 35 | Distributed synchronization of the metric 36 | Warning: does not synchronize the deque! 
37 | """ 38 | if not is_dist_avail_and_initialized(): 39 | return 40 | t = torch.tensor([self.count, self.total], dtype=torch.float64, device="cuda") 41 | dist.barrier() 42 | dist.all_reduce(t) 43 | t = t.tolist() 44 | self.count = int(t[0]) 45 | self.total = t[1] 46 | 47 | @property 48 | def median(self): 49 | d = torch.tensor(list(self.deque)) 50 | return d.median().item() 51 | 52 | @property 53 | def avg(self): 54 | d = torch.tensor(list(self.deque), dtype=torch.float32) 55 | return d.mean().item() 56 | 57 | @property 58 | def global_avg(self): 59 | return self.total / self.count 60 | 61 | @property 62 | def max(self): 63 | return max(self.deque) 64 | 65 | @property 66 | def value(self): 67 | return self.deque[-1] 68 | 69 | def __str__(self): 70 | return self.fmt.format( 71 | median=self.median, 72 | avg=self.avg, 73 | global_avg=self.global_avg, 74 | max=self.max, 75 | value=self.value, 76 | ) 77 | 78 | 79 | class MetricLogger(object): 80 | def __init__(self, delimiter="\t"): 81 | self.meters = defaultdict(SmoothedValue) 82 | self.delimiter = delimiter 83 | 84 | def update(self, **kwargs): 85 | for k, v in kwargs.items(): 86 | if isinstance(v, torch.Tensor): 87 | v = v.item() 88 | assert isinstance(v, (float, int)) 89 | self.meters[k].update(v) 90 | 91 | def __getattr__(self, attr): 92 | if attr in self.meters: 93 | return self.meters[attr] 94 | if attr in self.__dict__: 95 | return self.__dict__[attr] 96 | raise AttributeError( 97 | "'{}' object has no attribute '{}'".format(type(self).__name__, attr) 98 | ) 99 | 100 | def __str__(self): 101 | loss_str = [] 102 | for name, meter in self.meters.items(): 103 | loss_str.append("{}: {}".format(name, str(meter))) 104 | return self.delimiter.join(loss_str) 105 | 106 | def synchronize_between_processes(self): 107 | for meter in self.meters.values(): 108 | meter.synchronize_between_processes() 109 | 110 | def add_meter(self, name, meter): 111 | self.meters[name] = meter 112 | 113 | def log_every(self, iterable, print_freq, header=None): 114 | i = 0 115 | if not header: 116 | header = "" 117 | start_time = time.time() 118 | end = time.time() 119 | iter_time = SmoothedValue(fmt="{avg:.4f}") 120 | data_time = SmoothedValue(fmt="{avg:.4f}") 121 | space_fmt = ":" + str(len(str(len(iterable)))) + "d" 122 | if torch.cuda.is_available(): 123 | log_msg = self.delimiter.join( 124 | [ 125 | header, 126 | "[{0" + space_fmt + "}/{1}]", 127 | "eta: {eta}", 128 | "{meters}", 129 | "time: {time}", 130 | "data: {data}", 131 | "max mem: {memory:.0f}", 132 | ] 133 | ) 134 | else: 135 | log_msg = self.delimiter.join( 136 | [ 137 | header, 138 | "[{0" + space_fmt + "}/{1}]", 139 | "eta: {eta}", 140 | "{meters}", 141 | "time: {time}", 142 | "data: {data}", 143 | ] 144 | ) 145 | MB = 1024.0 * 1024.0 146 | for obj in iterable: 147 | data_time.update(time.time() - end) 148 | yield obj 149 | iter_time.update(time.time() - end) 150 | if i % print_freq == 0 or i == len(iterable) - 1: 151 | eta_seconds = iter_time.global_avg * (len(iterable) - i) 152 | eta_string = str(datetime.timedelta(seconds=int(eta_seconds))) 153 | if torch.cuda.is_available(): 154 | print( 155 | log_msg.format( 156 | i, 157 | len(iterable), 158 | eta=eta_string, 159 | meters=str(self), 160 | time=str(iter_time), 161 | data=str(data_time), 162 | memory=torch.cuda.max_memory_allocated() / MB, 163 | ) 164 | ) 165 | else: 166 | print( 167 | log_msg.format( 168 | i, 169 | len(iterable), 170 | eta=eta_string, 171 | meters=str(self), 172 | time=str(iter_time), 173 | data=str(data_time), 174 | ) 175 | ) 
176 | i += 1 177 | end = time.time() 178 | total_time = time.time() - start_time 179 | total_time_str = str(datetime.timedelta(seconds=int(total_time))) 180 | print( 181 | "{} Total time: {} ({:.4f} s / it)".format( 182 | header, total_time_str, total_time / len(iterable) 183 | ) 184 | ) 185 | torch.cuda.reset_peak_memory_stats() 186 | 187 | 188 | @torch.no_grad() 189 | def accuracy(output, target, topk=(1,)): 190 | """Computes the precision@k for the specified values of k""" 191 | if target.numel() == 0: 192 | return [torch.zeros([], device=output.device)] 193 | maxk = max(topk) 194 | batch_size = target.size(0) 195 | 196 | _, pred = output.topk(maxk, 1, True, True) 197 | pred = pred.t() 198 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 199 | 200 | res = [] 201 | for k in topk: 202 | correct_k = correct[:k].view(-1).float().sum(0) 203 | res.append(correct_k.mul_(100.0 / batch_size)) 204 | return res 205 | -------------------------------------------------------------------------------- /referring_segmentation/utils/loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.autograd import Variable 4 | import numpy as np 5 | from math import exp 6 | 7 | 8 | def gaussian(window_size, sigma): 9 | gauss = torch.Tensor( 10 | [exp(-(x - window_size//2)**2/float(2*sigma**2)) for x in range(window_size)]) 11 | return gauss/gauss.sum() 12 | 13 | 14 | def create_window(window_size, channel): 15 | _1D_window = gaussian(window_size, 1.5).unsqueeze(1) 16 | _2D_window = _1D_window.mm( 17 | _1D_window.t()).float().unsqueeze(0).unsqueeze(0) 18 | window = Variable(_2D_window.expand( 19 | channel, 1, window_size, window_size).contiguous()) 20 | return window 21 | 22 | 23 | def _ssim(img1, img2, window, window_size, channel, size_average=True): 24 | mu1 = F.conv2d(img1, window, padding=window_size//2, groups=channel) 25 | mu2 = F.conv2d(img2, window, padding=window_size//2, groups=channel) 26 | 27 | mu1_sq = mu1.pow(2) 28 | mu2_sq = mu2.pow(2) 29 | mu1_mu2 = mu1*mu2 30 | 31 | sigma1_sq = F.conv2d( 32 | img1*img1, window, padding=window_size//2, groups=channel) - mu1_sq 33 | sigma2_sq = F.conv2d( 34 | img2*img2, window, padding=window_size//2, groups=channel) - mu2_sq 35 | sigma12 = F.conv2d(img1*img2, window, padding=window_size // 36 | 2, groups=channel) - mu1_mu2 37 | 38 | C1 = 0.01**2 39 | C2 = 0.03**2 40 | 41 | ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2)) / \ 42 | ((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2)) 43 | 44 | if size_average: 45 | return ssim_map.mean() 46 | else: 47 | return ssim_map.mean(1).mean(1).mean(1) 48 | 49 | 50 | class SSIM(torch.nn.Module): 51 | def __init__(self, window_size=11, size_average=True): 52 | super(SSIM, self).__init__() 53 | self.window_size = window_size 54 | self.size_average = size_average 55 | self.channel = 1 56 | self.window = create_window(window_size, self.channel) 57 | 58 | def forward(self, img1, img2): 59 | (_, channel, _, _) = img1.size() 60 | 61 | if channel == self.channel and self.window.data.type() == img1.data.type(): 62 | window = self.window 63 | else: 64 | window = create_window(self.window_size, channel) 65 | 66 | if img1.is_cuda: 67 | window = window.cuda(img1.get_device()) 68 | window = window.type_as(img1) 69 | 70 | self.window = window 71 | self.channel = channel 72 | 73 | return _ssim(img1, img2, window, self.window_size, channel, self.size_average) 74 | 75 | 76 | def _logssim(img1, img2, window, window_size, channel, 
size_average=True): 77 | mu1 = F.conv2d(img1, window, padding=window_size//2, groups=channel) 78 | mu2 = F.conv2d(img2, window, padding=window_size//2, groups=channel) 79 | 80 | mu1_sq = mu1.pow(2) 81 | mu2_sq = mu2.pow(2) 82 | mu1_mu2 = mu1*mu2 83 | 84 | sigma1_sq = F.conv2d( 85 | img1*img1, window, padding=window_size//2, groups=channel) - mu1_sq 86 | sigma2_sq = F.conv2d( 87 | img2*img2, window, padding=window_size//2, groups=channel) - mu2_sq 88 | sigma12 = F.conv2d(img1*img2, window, padding=window_size // 89 | 2, groups=channel) - mu1_mu2 90 | 91 | C1 = 0.01**2 92 | C2 = 0.03**2 93 | 94 | ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2)) / \ 95 | ((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2)) 96 | ssim_map = (ssim_map - torch.min(ssim_map)) / \ 97 | (torch.max(ssim_map)-torch.min(ssim_map)) 98 | ssim_map = -torch.log(ssim_map + 1e-8) 99 | 100 | if size_average: 101 | return ssim_map.mean() 102 | else: 103 | return ssim_map.mean(1).mean(1).mean(1) 104 | 105 | 106 | class LOGSSIM(torch.nn.Module): 107 | def __init__(self, window_size=11, size_average=True): 108 | super(LOGSSIM, self).__init__() 109 | self.window_size = window_size 110 | self.size_average = size_average 111 | self.channel = 1 112 | self.window = create_window(window_size, self.channel) 113 | 114 | def forward(self, img1, img2): 115 | (_, channel, _, _) = img1.size() 116 | 117 | if channel == self.channel and self.window.data.type() == img1.data.type(): 118 | window = self.window 119 | else: 120 | window = create_window(self.window_size, channel) 121 | 122 | if img1.is_cuda: 123 | window = window.cuda(img1.get_device()) 124 | window = window.type_as(img1) 125 | 126 | self.window = window 127 | self.channel = channel 128 | 129 | return _logssim(img1, img2, window, self.window_size, channel, self.size_average) 130 | 131 | 132 | def ssim(img1, img2, window_size=11, size_average=True): 133 | (_, channel, _, _) = img1.size() 134 | window = create_window(window_size, channel) 135 | 136 | if img1.is_cuda: 137 | window = window.cuda(img1.get_device()) 138 | window = window.type_as(img1) 139 | 140 | return _ssim(img1, img2, window, window_size, channel, size_average) 141 | 142 | 143 | def _iou(pred, target, size_average=True): 144 | b = pred.shape[0] 145 | IoU = 0.0 146 | for i in range(0, b): 147 | # compute the IoU of the foreground 148 | Iand1 = torch.sum(target[i, :, :, :] * pred[i, :, :, :]) 149 | Ior1 = torch.sum(target[i, :, :, :]) + \ 150 | torch.sum(pred[i, :, :, :]) - Iand1 151 | IoU1 = Iand1 / Ior1 152 | 153 | # IoU loss is (1-IoU1) 154 | IoU = IoU + (1 - IoU1) 155 | 156 | return IoU / b 157 | 158 | 159 | class IOU(torch.nn.Module): 160 | def __init__(self, size_average=True): 161 | super(IOU, self).__init__() 162 | self.size_average = size_average 163 | 164 | def forward(self, pred, target): 165 | return _iou(pred, target, self.size_average) 166 | 167 | def dice_loss(inputs, targets, num_masks): 168 | """ 169 | Compute the DICE loss, similar to generalized IOU for masks 170 | Args: 171 | inputs: A float tensor of arbitrary shape. 172 | The predictions for each example. 173 | targets: A float tensor with the same shape as inputs. Stores the binary 174 | classification label for each element in inputs 175 | (0 for the negative class and 1 for the positive class). 
176 | """ 177 | inputs = inputs.sigmoid() 178 | numerator = 2 * (inputs * targets).sum(1) 179 | denominator = inputs.sum(-1) + targets.sum(-1) 180 | loss = 1 - (numerator + 1) / (denominator + 1) 181 | return loss.sum() / num_masks 182 | 183 | 184 | def sigmoid_focal_loss(inputs, targets, num_masks, alpha: float = 0.25, gamma: float = 2): 185 | """ 186 | Loss used in RetinaNet for dense detection: https://arxiv.org/abs/1708.02002. 187 | Args: 188 | inputs: A float tensor of arbitrary shape. 189 | The predictions for each example. 190 | targets: A float tensor with the same shape as inputs. Stores the binary 191 | classification label for each element in inputs 192 | (0 for the negative class and 1 for the positive class). 193 | alpha: (optional) Weighting factor in range (0,1) to balance 194 | positive vs negative examples, or -1 for no weighting. Default = 0.25. 195 | gamma: Exponent of the modulating factor (1 - p_t) to 196 | balance easy vs hard examples. 197 | Returns: 198 | Loss tensor 199 | """ 200 | prob = inputs.sigmoid() 201 | ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none") 202 | p_t = prob * targets + (1 - prob) * (1 - targets) 203 | loss = ce_loss * ((1 - p_t) ** gamma) 204 | 205 | if alpha >= 0: 206 | alpha_t = alpha * targets + (1 - alpha) * (1 - targets) 207 | loss = alpha_t * loss 208 | 209 | return loss.mean(1).sum() / num_masks -------------------------------------------------------------------------------- /spatiotemporal_grounding/models/module/decoder.py: -------------------------------------------------------------------------------- 1 | from typing import Any, List, Tuple, Optional 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | from torch import Tensor 6 | 7 | 8 | class SpatialDecoderBlock(nn.Module): 9 | def __init__(self, in_dim: int = 256, hidden_dim: int = 64, scale_layer_num: int = 3, 10 | out_hm_dim: int = 1, out_reg_dim: int = 1, phase: str = '1D'): 11 | super().__init__() 12 | self.phase = phase 13 | if phase == '2D': 14 | conv = nn.Conv2d 15 | upsample_mode = 'bilinear' 16 | elif phase == '1D': 17 | conv = nn.Conv1d 18 | upsample_mode = 'linear' 19 | else: 20 | raise NotImplementedError 21 | 22 | up = [] 23 | for i, _ in enumerate(range(scale_layer_num)): 24 | if i == 0: 25 | up.append(conv(in_dim, hidden_dim, 3, 1, 1, bias=False),) 26 | else: 27 | up.append(conv(hidden_dim, hidden_dim, 3, 1, 1, bias=False)) 28 | up.append(nn.GroupNorm(4, hidden_dim)) 29 | up.append(nn.ReLU()) 30 | up.append(nn.Upsample(scale_factor=2, 31 | mode=upsample_mode, align_corners=True)) 32 | self.up = nn.Sequential(*up) 33 | self.hm = nn.Sequential( 34 | conv(hidden_dim, hidden_dim, 3, 1, 1, bias=False), 35 | nn.GroupNorm(4, hidden_dim), 36 | nn.ReLU(), 37 | conv(hidden_dim, out_hm_dim, 1) 38 | ) 39 | self.reg = nn.Sequential( 40 | conv(hidden_dim, hidden_dim, 3, 1, 1, bias=False), 41 | nn.GroupNorm(4, hidden_dim), 42 | nn.ReLU(), 43 | conv(hidden_dim, out_reg_dim, 1) 44 | ) 45 | self.__init_weight() 46 | 47 | def __init_weight(self): 48 | for m in self.modules(): 49 | if isinstance(m, nn.Conv2d): 50 | torch.nn.init.kaiming_normal_(m.weight) 51 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 52 | m.weight.data.fill_(1) 53 | m.bias.data.zero_() 54 | 55 | def forward(self, feature: Tensor) -> Tuple[Tensor, Tensor]: 56 | feature = self.up(feature) 57 | heatmap = self.hm(feature) 58 | regression = self.reg(feature) 59 | return heatmap, regression 60 | 61 | 62 | class SpatialDecoder2D(nn.Module): 63 | def __init__(self,
model_dim: int = 256, decoder_hidden_dim: int = 64, dilation: bool = False): 64 | super().__init__() 65 | self.decoder_type = 'spatial_2d' 66 | scale_layer_num = 2 if dilation else 3 67 | self.decoder = SpatialDecoderBlock( 68 | model_dim, decoder_hidden_dim, scale_layer_num, 1, 2, '2D') 69 | 70 | def forward(self, feature: Tensor, frame_mask: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]: 71 | """ 72 | Args: 73 | feature: [\sigma t_i, c, h, w] 74 | frame_mask: [\sigma t_i, h, w] 75 | Return: 76 | heatmap: [\sigma t_i, 1, h, w] 77 | regression: [\sigma ti, 2, h, w] 78 | """ 79 | heatmap, regression = self.decoder(feature) 80 | heatmap = heatmap.sigmoid() 81 | if frame_mask is not None: 82 | frame_mask = F.interpolate( 83 | frame_mask[:, None].float(), heatmap.shape[-2:], mode='nearest').bool() 84 | heatmap = heatmap.masked_fill(frame_mask, 0.) 85 | regression = regression.masked_fill(frame_mask, 0.) 86 | return { 87 | 'spatial_map': heatmap, 88 | 'spatial_wh': regression 89 | } 90 | 91 | 92 | class TemporalDecoderAnchor(nn.Module): 93 | def __init__(self, model_dim: int = 256, temporal_window_width: Optional[List] = None, dropout: float = 0.1): 94 | super().__init__() 95 | self.decoder_type = 'temporal_anchor' 96 | self.temporal_window_width = temporal_window_width 97 | self.reg_head = MLP(2*model_dim, model_dim, 98 | len(temporal_window_width) * 2, 2, dropout=dropout) 99 | self.cls_head = MLP(2*model_dim, model_dim, 100 | len(temporal_window_width), 2, dropout=dropout) 101 | 102 | def forward(self, feature_global: Tensor, feature_obj: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]: 103 | """ 104 | Args: 105 | feature_global: [b, c, t] 106 | feature_obj: [b, c, t] 107 | Return: 108 | offset: [b, t*n_window, 2] 109 | pred_score: [b, t*n_window] 110 | """ 111 | 112 | feature_flatten = torch.cat( 113 | [feature_obj, feature_global], dim=1).transpose(1, 2) # [b, t, 2*c] 114 | offset = self.reg_head(feature_flatten) # [b, t, 2*n_window] 115 | offset = offset.contiguous().view(-1, 116 | offset.shape[1] * len(self.temporal_window_width), 2) # [b, t*n_window, 2] 117 | pred_score = self.cls_head(feature_flatten) # [b, t, n_window] 118 | pred_score = torch.sigmoid(pred_score).contiguous().view( 119 | pred_score.size(0), -1) # [b, t*n_window] 120 | return { 121 | 'temporal_offset': offset, 122 | 'temporal_score': pred_score 123 | } 124 | 125 | 126 | class TemporalDecoderRegression(nn.Module): 127 | def __init__(self, model_dim: int = 256, dropout: float = 0.1): 128 | super().__init__() 129 | self.decoder_type = 'temporal_regression' 130 | self.score_head = MLP(2*model_dim, model_dim, 1, 2, dropout=dropout) 131 | self.iou_head = MLP(2*model_dim, model_dim, 1, 2, dropout=dropout) 132 | self.reg_head = MLP(2*model_dim, model_dim, 2, 2, dropout=dropout) 133 | 134 | def forward(self, feature_global: Tensor, feature_obj: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]: 135 | """ 136 | Args: 137 | feature_global: [b, c, t] 138 | feature_obj: [b, c, t] 139 | Return: 140 | pred_score: [b, t] 141 | pred_reg: [b, t, 2] 142 | pred_iou: [b, t] 143 | """ 144 | 145 | feature_flatten = torch.cat( 146 | [feature_obj, feature_global], dim=1).transpose(1, 2) # [b, t, 2*c] 147 | pred_score = self.score_head( 148 | feature_flatten).squeeze(-1).sigmoid() # [b, t] 149 | pred_reg = self.reg_head(feature_flatten) # [b, t, 2] 150 | pred_iou = self.iou_head( 151 | feature_flatten).squeeze(-1).sigmoid() # [b, t] 152 | return { 153 | 'temporal_score': pred_score, 154 | 'temporal_reg': pred_reg, 155 | 'temporal_iou': 
pred_iou, 156 | } 157 | 158 | 159 | class MLP(nn.Module): 160 | def __init__(self, input_dim: int = 256, hidden_dim: int = 256, output_dim: int = 256, 161 | num_layers: int = 2, normalization: bool = True, dropout: float = 0.1): 162 | super().__init__() 163 | self.num_layers = num_layers 164 | h = [hidden_dim] * (num_layers - 1) 165 | if normalization: 166 | self.layers = nn.ModuleList( 167 | nn.Sequential( 168 | nn.Linear(n, k), 169 | nn.LayerNorm(k), 170 | ) if idx < self.num_layers - 1 else nn.Linear(n, k) 171 | for idx, (n, k) in enumerate(zip([input_dim] + h, h + [output_dim])) 172 | ) 173 | else: 174 | self.layers = nn.ModuleList( 175 | nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]) 176 | ) 177 | self.dropout = dropout 178 | if dropout: 179 | self.dropout = nn.Dropout(dropout) 180 | 181 | def forward(self, x: Tensor) -> Tensor: 182 | for i, layer in enumerate(self.layers): 183 | x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x) 184 | if self.dropout and i < self.num_layers: 185 | x = self.dropout(x) 186 | return x 187 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SAW 2 | 3 | The official implementation of the paper "Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query" \[[Paper](https://ieeexplore.ieee.org/document/10043827)\] 4 | 5 | ![](./docs/net.png) 6 | 7 | We propose a unified framework that handles the whole video sequentially, with long-range and dense visual-linguistic interaction, in an end-to-end manner. Specifically, a lightweight relevance filtering based transformer (Ref-Transformer) is designed, which is composed of relevance filtering based attention and a temporally expanded MLP. The text-relevant spatial regions and temporal clips in the video can be efficiently highlighted through relevance filtering and then propagated across the whole video sequence with the temporally expanded MLP. The unified framework can be applied to various video-text action localization tasks, e.g., referring video segmentation, temporal sentence grounding, and spatiotemporal video grounding. 8 | 9 | ## Requirements 10 | 11 | * python 3.8 12 | 13 | * pytorch 1.9.1 14 | 15 | * torchtext 0.10.1 16 | 17 | 18 | ## Referring Video Segmentation 19 | 20 | Run `cd referring_segmentation` for the referring video segmentation task. 21 | 22 | ### 1. Dataset 23 | 24 | Download the A2D Sentences dataset and the J-HMDB Sentences dataset from [https://kgavrilyuk.github.io/publication/actor_action/](https://kgavrilyuk.github.io/publication/actor_action/) and convert the videos to RGB frames. 25 | 26 | For the A2D Sentences dataset, run `python pre_proc/video2imgs.py` to convert videos to RGB frames. The following directory structure is expected: 27 | 28 | ``` 29 | -a2d_sentences 30 | -Rename_Images 31 | -a2d_annotation_with_instances 32 | -videoset.csv 33 | -a2d_missed_videos.txt 34 | -a2d_annotation.txt 35 | -jhmdb_sentences 36 | -Rename_Images 37 | -puppet_mask 38 | -jhmdb_annotation.txt 39 | ``` 40 | 41 | Edit the item `datasets_root` in `json/config_$DATASET$.json` to point to your dataset path. 42 | 43 | Run `python pre_proc/generate_data_list.py` to generate the training and testing data splits. 44 | 45 | ### 2.
Backbone 46 | 47 | Download the pretrained DeepLabResNet from [https://github.com/VainF/DeepLabV3Plus-Pytorch](https://github.com/VainF/DeepLabV3Plus-Pytorch) and put it into `model/pretrained/`. 48 | 49 | ### 3. Training 50 | 51 | Only the A2D Sentences dataset is used for training. Run: 52 | 53 | ``` 54 | python main.py --json_file=json/config_a2d_sentences.json --mode=train 55 | ``` 56 | 57 | ### 4. Evaluation 58 | 59 | For the A2D Sentences dataset, run: 60 | 61 | ``` 62 | python main.py --json_file=json/config_a2d_sentences.json --mode=test 63 | ``` 64 | 65 | For the JHMDB Sentences dataset, run: 66 | 67 | ``` 68 | python main.py --json_file=json/config_jhmdb_sentences.json --mode=test 69 | ``` 70 | 71 | ## Temporal Sentence Grounding 72 | 73 | Run `cd temporal_grounding` for the temporal sentence grounding task. 74 | 75 | ### 1. Dataset 76 | 77 | * For the Charades-STA dataset, download the pre-extracted I3D features following [LGI4temporalgrounding](https://github.com/JonghwanMun/LGI4temporalgrounding) and the pre-extracted VGG features following [2D-TAN](https://github.com/microsoft/VideoX/tree/master/2D-TAN). 78 | 79 | * For the TACoS dataset, download the pre-extracted C3D features following [2D-TAN](https://github.com/microsoft/VideoX/tree/master/2D-TAN). 80 | 81 | * For the ActivityNet Captions dataset, download the pre-extracted C3D features from [http://activity-net.org/challenges/2016/download.html](http://activity-net.org/challenges/2016/download.html). 82 | 83 | ### 2. Training and Evaluation 84 | 85 | The config files can be found in `./json`, and the following model settings are supported: 86 | 87 | ``` 88 | -config_ActivityNet_C3D_anchor.json 89 | -config_ActivityNet_C3D_regression.json 90 | -config_Charades-STA_I3D_anchor.json 91 | -config_Charades-STA_I3D_regression.json 92 | -config_Charades-STA_VGG_anchor.json 93 | -config_Charades-STA_VGG_regression.json 94 | -config_TACoS_C3D_anchor.json 95 | -config_TACoS_C3D_regression.json 96 | ``` 97 | 98 | Set `"datasets_root"` in each config file to your feature path.
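For example, point `datasets_root` at your local feature directory (the path below is illustrative, and only this field is shown; the remaining entries of the config are left unchanged):

```
{
    "datasets_root": "/data/temporal_grounding/charades_sta_i3d_features"
}
```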
99 | 100 | To train on different datasets with different grounding heads, run 101 | 102 | ``` 103 | python main.py --json_file=$JSON_FILE_PATH$ --mode=train 104 | ``` 105 | 106 | For evaluation, run 107 | 108 | ``` 109 | python main.py --json_file=$JSON_FILE_PATH$ --mode=test --checkpoint=$CHECKPOINT_PATH$ 110 | ``` 111 | 112 | The pretrained models and their corresponding performance are shown below: 113 | 114 | | Datasets | Feature | Decoder | Checkpoints | 115 | |--------------|---------|------------|-------------| 116 | | Charades-STA | I3D | Regression | \[[Baidu](https://pan.baidu.com/s/1GQBkElQITd-exS1njNZrwQ) \| gj54 \] | 117 | | Charades-STA | I3D | Anchor | \[[Baidu](https://pan.baidu.com/s/1MXZqAEBLOzauR8cOLjo3QA) \| 5j3a \] | 118 | | Charades-STA | VGG | Regression | \[[Baidu](https://pan.baidu.com/s/1Yacke_tkaAELzMY_ePyIhw) \| 52xf \] | 119 | | Charades-STA | VGG | Anchor | \[[Baidu](https://pan.baidu.com/s/1PcIZ7QEWcYnfzne1dkMsng) \| rdmx \] | 120 | | ActivityNet | C3D | Regression | \[[Baidu](https://pan.baidu.com/s/1zlH64seHimscTOtNry-6Ag) \| 6sbh \] | 121 | | ActivityNet | C3D | Anchor | \[[Baidu](https://pan.baidu.com/s/1mi8M2wBUAqskWQQqHdmi2Q) \| ysr5 \] | 122 | | TACoS | C3D | Regression | \[[Baidu](https://pan.baidu.com/s/140m-9geYbktSRfP7Pa1rzA) \| iwx2 \] | 123 | | TACoS | C3D | Anchor | \[[Baidu](https://pan.baidu.com/s/1dzIIb4dKQY9t-oAF-N2sLw) \| 1ube \] | 124 | 125 | 126 | ## Spatiotemporal Video Grounding 127 | 128 | Run `cd spatiotemporal_grounding` for the spatiotemporal video grounding task. The code for spatiotemporal grounding is built on the [TubeDETR codebase](https://github.com/antoyang/TubeDETR). 129 | 130 | ### 1. Dataset 131 | 132 | We prepare the `HC-STVG` and `VidSTG` datasets following [TubeDETR](https://github.com/antoyang/TubeDETR). The annotation format of the VidSTG dataset has been optimized to reduce training memory usage. 133 | 134 | **videos** 135 | 136 | VidSTG dataset: Download VidOR videos from [the VidOR dataset providers](https://xdshang.github.io/docs/vidor.html). 137 | 138 | HC-STVG dataset: Download HC-STVG videos from [the HC-STVG dataset providers](https://github.com/tzhhhh123/HC-STVG). 139 | 140 | Edit the item `vidstg_vid_path` in `spatiotemporal_grounding/config/vidstg.json` and the item `hcstvg_vid_path` in `spatiotemporal_grounding/config/hcstvg.json` to point to your video path. 141 | 142 | **annotations** 143 | 144 | Download the preprocessed annotation files from \[[https://pan.baidu.com/s/1oiV9PmtRqRxxdxMvqrJj_w](https://pan.baidu.com/s/1oiV9PmtRqRxxdxMvqrJj_w), password: n6y4\]. Then put the downloaded `annotations` into `spatiotemporal_grounding`. 145 | 146 | ### 2.
Training and Evaluation 147 | 148 | To train on the HC-STVG dataset, run: 149 | 150 | ``` 151 | python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result 152 | ``` 153 | 154 | To train on the VidSTG dataset, run: 155 | 156 | ``` 157 | python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result 158 | ``` 159 | 160 | To evaluate on the HC-STVG dataset, run: 161 | 162 | ``` 163 | python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result --eval --resume=$CHECKPOINT_PATH$ 164 | ``` 165 | 166 | To evaluate on the VidSTG dataset, run: 167 | 168 | ``` 169 | python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result --eval --resume=$CHECKPOINT_PATH$ 170 | ``` 171 | 172 | ## Citation 173 | 174 | ``` 175 | @article{2023saw, 176 | title = {Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query}, 177 | author = {Yuting Su and Weikang Wang and Jing Liu and Shuang Ma and Xiaokang Yang}, 178 | journal = {IEEE Transactions on Image Processing}, 179 | year = {2023} 180 | } 181 | ``` --------------------------------------------------------------------------------