├── .gitignore ├── Experiments on ActivityNet, FCVID and Mini-Kinetics ├── README.md ├── basic_tools │ ├── __init__.py │ ├── checkpoint.py │ ├── logger.py │ └── utils.py ├── conf │ └── default.yaml ├── main_dist.py ├── models │ ├── gfv_net.py │ ├── mobilenet.py │ ├── ppo.py │ ├── resnet.py │ └── utils.py └── ops │ ├── dataset.py │ ├── dataset_config.py │ ├── transforms.py │ ├── utils.py │ └── video_jpg.py ├── Experiments on Something-Something V1&V2 ├── README.md ├── basic_tools │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-38.pyc │ │ ├── checkpoint.cpython-38.pyc │ │ ├── logger.cpython-38.pyc │ │ └── utils.cpython-38.pyc │ ├── checkpoint.py │ ├── logger.py │ └── utils.py ├── conf │ ├── evaluate.yaml │ ├── stage1.yaml │ ├── stage2.yaml │ └── stage3.yaml ├── evaluate.py ├── evaluate.sh ├── models │ ├── __pycache__ │ │ ├── ppo.cpython-38.pyc │ │ ├── ppo_continuous.cpython-38.pyc │ │ ├── resnet.cpython-38.pyc │ │ ├── tsn.cpython-38.pyc │ │ └── utils.cpython-38.pyc │ ├── constant.py │ ├── gfv_net.py │ ├── mobilenetv2.py │ ├── ppo.py │ ├── ppo_continuous.py │ ├── resnet.py │ ├── tsn.py │ └── utils.py ├── ops │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-38.pyc │ │ ├── temporal_shift.cpython-38.pyc │ │ ├── transforms.cpython-38.pyc │ │ └── utils.cpython-38.pyc │ ├── basic_ops.py │ ├── dataset.py │ ├── dataset_config.py │ ├── models_ada.py │ ├── my_logger.py │ ├── net_flops_table.py │ ├── sal_rank_loss.py │ ├── temporal_shift.py │ ├── transforms.py │ ├── utils.py │ └── video_jpg.py ├── stage1.py ├── stage2.py ├── stage3.py ├── train_stage1.sh ├── train_stage2.sh └── train_stage3.sh ├── README.md └── figure ├── actnet.png ├── intro.png ├── sthsth.png └── visualization.png /.gitignore: -------------------------------------------------------------------------------- 1 | outputs/ 2 | *.tar 3 | Experiments on ActivityNet, FCVID and Mini-Kinetics/.DS_Store 4 | .idea 5 | *.pyc 6 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/README.md: -------------------------------------------------------------------------------- 1 | # Experiments on ActivityNet, FCVID and Mini-Kinetics 2 | 3 | ## Requirements 4 | - Python 3.8 5 | - PyTorch 1.7.0 6 | - torchvision 0.8.0 7 | - [hydra](https://hydra.cc/docs/intro/) 1.1.0 8 | 9 | ## Datasets 10 | 1. Please get the train/test split files for each dataset from [Google Drive](https://drive.google.com/drive/folders/1L41U4mczsrnwiSx3KiY57BblrRE5fjnU?usp=sharing) and put them in `PATH_TO_DATASET`. 11 | 2. Download videos from the following links, or contact the corresponding authors for access. Save them to `PATH_TO_DATASET/videos`. 12 | - [ActivityNet-v1.3](http://activity-net.org/download.html) 13 | - [FCVID](https://drive.google.com/drive/folders/1cPSc3neTQwvtSPiVcjVZrj0RvXrKY5xj) 14 | - [Mini-Kinetics](https://deepmind.com/research/open-source/kinetics). Please download [Kinetics 400](https://storage.googleapis.com/deepmind-media/Datasets/kinetics400.tar.gz). For the Mini-Kinetics split used in our paper, you need the train/val split files from [AR-Net](https://github.com/mengyuest/AR-Net#dataset-preparation). 15 | 3. Extract frames with [ops/video_jpg.py](ops/video_jpg.py); the frames will be saved to `PATH_TO_DATASET/frames`. Minor modifications to the file paths are needed when extracting frames from the different datasets (see the example command below).
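For reference, a typical frame-extraction call using the arguments defined in [ops/video_jpg.py](ops/video_jpg.py) might look like the following (the paths and worker count are placeholders; adjust them to your setup):
```
python ops/video_jpg.py PATH_TO_DATASET/videos PATH_TO_DATASET/frames --num_workers 16 --parallel
```
The default `--prefix image_%05d.jpg` matches the frame naming expected by [ops/dataset_config.py](ops/dataset_config.py).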
16 | 17 | 18 | 19 | ## Pre-trained Models on ActivityNet 20 | 21 | Please download pre-trained weights and checkpoints from [Google Drive](https://drive.google.com/drive/folders/1v5UnucCr2CjmH41HEJePPI2WtDO8SSsp?usp=sharing). 22 | 23 | - globalcnn.pth.tar: pre-trained weights for the global CNN (MobileNet-v2). 24 | - localcnn.pth.tar: pre-trained weights for the local CNN (ResNet-50). 25 | - 128checkpoint.pth.tar: checkpoint of stage 1 with patch size 128x128. 26 | - 160checkpoint.pth.tar: checkpoint of stage 1 with patch size 160x160. 27 | - 192checkpoint.pth.tar: checkpoint of stage 1 with patch size 192x192. 28 | - 128s3_checkpoint.pth.tar: checkpoint to reproduce the result in the paper with patch size 128x128. 29 | - 160s3_checkpoint.pth.tar: checkpoint to reproduce the result in the paper with patch size 160x160. 30 | - 192s3_checkpoint.pth.tar: checkpoint to reproduce the result in the paper with patch size 192x192. 31 | 32 | ## Training 33 | 34 | - Here we take training the model with patch size 128x128 on the ActivityNet dataset as an example. 35 | - All logs and checkpoints will be saved in the directory `./outputs/YYYY-MM-DD/HH-MM-SS`. 36 | - Note that we store a set of default hyper-parameters in [conf/default.yaml](conf/default.yaml), which can be overridden from the command line. You can also use your own config files (see the sketch at the end of this section). 37 | - Before training, please initialize the Global CNN and Local CNN by fine-tuning the ImageNet pre-trained models in PyTorch using the following commands: 38 | 39 | for Global CNN: 40 | ``` 41 | CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=true 42 | ``` 43 | for Local CNN: 44 | ``` 45 | CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=false 46 | ``` 47 | 48 | - Training stage 1; pre-trained weights for the Global CNN and Local CNN are required: 49 | ``` 50 | CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=1 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrained_glancer=PATH_TO_CHECKPOINTS pretrained_focuser=PATH_TO_CHECKPOINTS 51 | ``` 52 | 53 | - Training stage 2; a stage-1 checkpoint is required: 54 | ``` 55 | CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=2 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false 56 | ``` 57 | 58 | - Training stage 3; a stage-2 checkpoint is required: 59 | ``` 60 | CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=3 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.005 epochs=10 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false 61 | ``` 62 |
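The commands above override individual entries of [conf/default.yaml](conf/default.yaml) with Hydra's `key=value` syntax. If you prefer keeping a whole experiment configuration in its own file, Hydra 1.1 can load an alternative config by name. A minimal sketch, assuming `main_dist.py` points Hydra at the `conf/` directory and `conf/my_exp.yaml` is a hypothetical copy of `default.yaml` with your edits:
```
CUDA_VISIBLE_DEVICES=0,1 python main_dist.py --config-name my_exp data_dir=PATH_TO_DATASET
```
Command-line overrides such as `data_dir=...` still apply on top of the selected config file.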
63 | ## Evaluate Pre-trained Models 64 | - Here we take evaluating the model with patch size 128x128 on ActivityNet as an example. 65 | ``` 66 | CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=3 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.005 epochs=10 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false evaluate=true 67 | ``` 68 | 69 | 70 | ## Acknowledgement 71 | We use the implementation of MobileNet-v2 and ResNet from the [PyTorch source code](https://pytorch.org/vision/stable/_modules/torchvision/models/mobilenetv2.html). We also borrow some code for dataset preparation from [AR-Net](https://github.com/mengyuest/AR-Net#dataset-preparation) and the PPO implementation from [here](https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py). 72 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/__init__.py: -------------------------------------------------------------------------------- 1 | import basic_tools.utils as utils 2 | import basic_tools.logger as logger 3 | import basic_tools.checkpoint as checkpoint 4 | 5 | import sys 6 | import os 7 | from omegaconf import DictConfig, OmegaConf 8 | 9 | def start(args): 10 | # checkpoint.init_checkpoint() 11 | 12 | # sys.stdout = logger.Logger("./log.log", mode="w") 13 | # sys.stderr = logger.Logger("./log.err", mode="w") 14 | 15 | cmd_line = " ".join(sys.argv) 16 | print(f"{cmd_line}") 17 | print(f"Working dir: {os.getcwd()}") 18 | 19 | print(OmegaConf.to_yaml(args)) 20 | return OmegaConf.to_yaml(args) 21 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/checkpoint.py: -------------------------------------------------------------------------------- 1 | import time 2 | import signal 3 | import os 4 | import sys 5 | import torch 6 | import socket 7 | 8 | 9 | ''' 10 | Usage: 11 | 12 | init_checkpoint() 13 | 14 | if exist_checkpoint(): 15 | any_object = load_checkpoint() 16 | 17 | save_checkpoint(any_object) 18 | ''' 19 | 20 | CHECKPOINT_filename = 'checkpoint.pth.tar' 21 | CHECKPOINT_tempfile = 'checkpoint.temp' 22 | SIGNAL_RECEIVED = False 23 | 24 | def SIGTERMHandler(a, b): 25 | print('received sigterm') 26 | pass 27 | 28 | 29 | def signalHandler(a, b): 30 | global SIGNAL_RECEIVED 31 | print('Signal received', a, time.time(), flush=True) 32 | SIGNAL_RECEIVED = True 33 | 34 | print("caught signal", a) 35 | print(socket.gethostname(), "USR1 signal caught.") 36 | # do other stuff to cleanup here 37 | print('requeuing job ' + os.environ['SLURM_JOB_ID']) 38 | os.system('scontrol requeue ' + os.environ['SLURM_JOB_ID']) 39 | sys.exit(-1) 40 | 41 | 42 | def init_checkpoint(): 43 | signal.signal(signal.SIGUSR1, signalHandler) 44 | signal.signal(signal.SIGTERM, SIGTERMHandler) 45 | print('Signal handler installed', flush=True) 46 | 47 | def save_checkpoint(state): 48 | global CHECKPOINT_filename, CHECKPOINT_tempfile 49 | torch.save(state, CHECKPOINT_tempfile) 50 | if os.path.isfile(CHECKPOINT_tempfile): 51 | os.rename(CHECKPOINT_tempfile, CHECKPOINT_filename) 52 | print("Checkpoint done") 53 | 54 | def save_checkpoint_if_signal(state): 55 | global SIGNAL_RECEIVED 56 | if 
SIGNAL_RECEIVED: 57 | save_checkpoint(state) 58 | 59 | def exist_checkpoint(): 60 | global CHECKPOINT_filename 61 | return os.path.isfile(CHECKPOINT_filename) 62 | 63 | def load_checkpoint(filename=None): 64 | global CHECKPOINT_filename 65 | if filename is None: 66 | filename = CHECKPOINT_filename 67 | 68 | # optionally resume from a checkpoint 69 | # if args.resume: 70 | #if os.path.isfile(args.resume): 71 | # To make the script simple to understand, we do resume whenever there is 72 | # a checkpoint file 73 | if os.path.isfile(filename): 74 | print(f"=> loading checkpoint {filename}") 75 | checkpoint = torch.load(filename) 76 | print(f"=> loaded checkpoint {filename}") 77 | return checkpoint 78 | else: 79 | raise RuntimeError(f"=> no checkpoint found at '{filename}'") 80 | 81 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/logger.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import logging 4 | 5 | class Logger: 6 | def __init__(self, path, mode='w'): 7 | assert mode in {'w', 'a'}, 'unknown mode for logger %s' % mode 8 | 9 | fh = logging.FileHandler(path, mode=mode) 10 | formatter = logging.Formatter('[%(asctime)s][%(name)s][%(levelname)s] - %(message)s') 11 | fh.setFormatter(formatter) 12 | # ch = logging.StreamHandler(sys.__stdout__) 13 | 14 | self.logger = logging.getLogger() 15 | self.logger.addHandler(fh) 16 | # self.logger.addHandler(ch) 17 | 18 | def write(self, message): 19 | if message == "\n": return 20 | # Remove \n at the end. 21 | self.logger.info(message.strip()) 22 | 23 | def flush(self): 24 | # for python 3 compatibility. 25 | pass 26 | 27 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import sys 3 | import random 4 | import numpy as np 5 | import os 6 | import subprocess 7 | 8 | from torch import optim 9 | 10 | def set_all_seeds(rand_seed): 11 | random.seed(rand_seed) 12 | np.random.seed(rand_seed) 13 | torch.manual_seed(rand_seed) 14 | torch.cuda.manual_seed(rand_seed) 15 | 16 | def to_cpu(x): 17 | if isinstance(x, dict): 18 | return { k : to_cpu(v) for k, v in x.items() } 19 | elif isinstance(x, list): 20 | return [ to_cpu(v) for v in x ] 21 | elif isinstance(x, torch.Tensor): 22 | return x.cpu() 23 | else: 24 | return x 25 | 26 | def model2numpy(model): 27 | return { k : v.cpu().numpy() for k, v in model.state_dict().items() } 28 | 29 | def activation2numpy(output): 30 | if isinstance(output, dict): 31 | return { k : activation2numpy(v) for k, v in output.items() } 32 | elif isinstance(output, list): 33 | return [ activation2numpy(v) for v in output ] 34 | elif isinstance(output, Variable): 35 | return output.data.cpu().numpy() 36 | 37 | def count_size(x): 38 | if isinstance(x, dict): 39 | return sum([ count_size(v) for k, v in x.items() ]) 40 | elif isinstance(x, list) or isinstance(x, tuple): 41 | return sum([ count_size(v) for v in x ]) 42 | elif isinstance(x, torch.Tensor): 43 | return x.nelement() * x.element_size() 44 | else: 45 | return sys.getsizeof(x) 46 | 47 | def mem2str(num_bytes): 48 | assert num_bytes >= 0 49 | if num_bytes >= 2 ** 30: # GB 50 | val = float(num_bytes) / (2 ** 30) 51 | result = "%.3f GB" % val 52 | elif num_bytes >= 2 ** 20: # MB 53 | val = float(num_bytes) / (2 
** 20) 54 | result = "%.3f MB" % val 55 | elif num_bytes >= 2 ** 10: # KB 56 | val = float(num_bytes) / (2 ** 10) 57 | result = "%.3f KB" % val 58 | else: 59 | result = "%d bytes" % num_bytes 60 | return result 61 | 62 | def get_mem_usage(): 63 | import psutil 64 | 65 | mem = psutil.virtual_memory() 66 | result = "" 67 | result += "available: %s\t" % (mem2str(mem.available)) 68 | result += "used: %s\t" % (mem2str(mem.used)) 69 | result += "free: %s\t" % (mem2str(mem.free)) 70 | # result += "active: %s\t" % (mem2str(mem.active)) 71 | # result += "inactive: %s\t" % (mem2str(mem.inactive)) 72 | # result += "buffers: %s\t" % (mem2str(mem.buffers)) 73 | # result += "cached: %s\t" % (mem2str(mem.cached)) 74 | # result += "shared: %s\t" % (mem2str(mem.shared)) 75 | # result += "slab: %s\t" % (mem2str(mem.slab)) 76 | return result 77 | 78 | def get_github_string(): 79 | _, output = subprocess.getstatusoutput("git -C ./ log --pretty=format:'%H' -n 1") 80 | ret, _ = subprocess.getstatusoutput("git -C ./ diff-index --quiet HEAD --") 81 | return f"Githash: {output}, unstaged: {ret}" 82 | 83 | 84 | def accumulate(all_y, y): 85 | if all_y is None: 86 | all_y = dict() 87 | for k, v in y.items(): 88 | if isinstance(v, list): 89 | all_y[k] = [ [vv] for vv in v ] 90 | else: 91 | all_y[k] = [v] 92 | else: 93 | for k, v in all_y.items(): 94 | if isinstance(y[k], list): 95 | for vv, yy in zip(v, y[k]): 96 | vv.append(yy) 97 | else: 98 | v.append(y[k]) 99 | 100 | return all_y 101 | 102 | def combine(all_y): 103 | output = dict() 104 | for k, v in all_y.items(): 105 | if isinstance(v[0], list): 106 | output[k] = [ torch.cat(vv) for vv in v ] 107 | else: 108 | output[k] = torch.cat(v) 109 | 110 | return output 111 | 112 | def concatOutput(loader, nets, condition=None): 113 | outputs = [None] * len(nets) 114 | 115 | use_cnn = nets[0].use_cnn 116 | 117 | with torch.no_grad(): 118 | for i, (x, _) in enumerate(loader): 119 | if not use_cnn: 120 | x = x.view(x.size(0), -1) 121 | x = x.cuda() 122 | 123 | outputs = [ accumulate(output, to_cpu(net(x))) for net, output in zip(nets, outputs) ] 124 | if condition is not None and not condition(i): 125 | break 126 | 127 | return [ combine(output) for output in outputs ] 128 | 129 | 130 | def adjust_learning_rate(args, optimizer, epoch): 131 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 132 | lrs = args.lr_steps.split('-') 133 | lr_steps = [int(lr) for lr in lrs] 134 | if args.lr_type == 'step': 135 | decay = 0.1 ** (sum(epoch >= np.array(lr_steps))) 136 | backbone_lr = args.backbone_lr * decay 137 | fc_lr = args.fc_lr * decay 138 | decay = args.weight_decay 139 | elif args.lr_type == 'cos': 140 | import math 141 | backbone_lr = 0.5 * args.backbone_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 142 | fc_lr = 0.5 * args.fc_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 143 | decay = args.weight_decay 144 | else: 145 | raise NotImplementedError 146 | 147 | if args.train_stage == 0: 148 | optimizer.param_groups[0]['lr'] = backbone_lr # Glancer 149 | optimizer.param_groups[1]['lr'] = backbone_lr # Focuser 150 | optimizer.param_groups[2]['lr'] = fc_lr # GRU 151 | elif args.train_stage == 1: 152 | optimizer.param_groups[0]['lr'] = backbone_lr # Focuser 153 | optimizer.param_groups[1]['lr'] = fc_lr # GRU 154 | elif args.train_stage == 2: 155 | pass 156 | elif args.train_stage == 3: 157 | optimizer.param_groups[0]['lr'] = fc_lr # GRU 158 | 159 | for param_group in optimizer.param_groups: 160 | # param_group['lr'] = lr 161 | 
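# Per-group learning rates were already assigned above according to args.train_stage
# (with lr_type=cos they follow 0.5 * base_lr * (1 + cos(pi * epoch / epochs)));
# the loop below only refreshes the weight decay for every parameter group.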
param_group['weight_decay'] = decay 162 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/conf/default.yaml: -------------------------------------------------------------------------------- 1 | dataset: actnet 2 | train_list: 3 | val_list: 4 | root_path: 5 | data_dir: 6 | resume: 7 | 8 | pretrained_glancer: 9 | pretrained_focuser: 10 | train_stage: 11 | 12 | pretrain_glancer: true 13 | arch: resnet 14 | num_segments: 16 15 | k: 3 16 | dropout: 0.5 17 | num_classes: 200 18 | evaluate: false 19 | eval_freq: 5 20 | 21 | dense_sample: false 22 | partial_fcvid_eval: false 23 | partial_ratio: 0.2 24 | ada_reso_skip: false 25 | reso_list: 224 26 | random_crop: false 27 | center_crop: false 28 | ada_crop_list: 29 | rescale_to: 224 30 | policy_input_offset: 0 31 | save_meta: false 32 | 33 | epochs: 50 34 | batch_size: 64 35 | backbone_lr: 0.01 36 | fc_lr: 0.005 37 | lr_type: cos # support step or cos 38 | lr_steps: 50-100 39 | momentum: 0.9 40 | weight_decay: 0.0001 41 | clip_grad: 20 42 | npb: true 43 | 44 | input_size: 224 45 | patch_size: 96 46 | glance_size: 224 47 | random_patch: false 48 | feature_map_channels: 1280 49 | action_dim: 49 50 | hidden_state_dim: 1024 #for policy network, focuser 51 | policy_conv: true 52 | hidden_dim: 1024 #for gru 53 | penalty: 0.5 54 | consensus: gru 55 | reward: random 56 | gamma: 0.7 57 | policy_lr: 0.0003 58 | with_glancer: true 59 | continuous: false 60 | 61 | seed: 1007 62 | gpus: 0 63 | gpu: 64 | workers: 16 65 | world_size: 1 66 | rank: 0 67 | dist_url: tcp://127.0.0.1:8888 68 | dist_backend: nccl 69 | multiprocessing_distributed: true 70 | distributed: 71 | amp: true 72 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/models/mobilenet.py: -------------------------------------------------------------------------------- 1 | from torch import nn 2 | from .utils import load_state_dict_from_url 3 | 4 | __all__ = ['MobileNetV2', 'mobilenet_v2'] 5 | 6 | 7 | model_urls = { 8 | 'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth', 9 | } 10 | 11 | 12 | def _make_divisible(v, divisor, min_value=None): 13 | """ 14 | This function is taken from the original tf repo. 15 | It ensures that all layers have a channel number that is divisible by 8 16 | It can be seen here: 17 | https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py 18 | :param v: 19 | :param divisor: 20 | :param min_value: 21 | :return: 22 | """ 23 | if min_value is None: 24 | min_value = divisor 25 | new_v = max(min_value, int(v + divisor / 2) // divisor * divisor) 26 | # Make sure that round down does not go down by more than 10%. 
27 | if new_v < 0.9 * v: 28 | new_v += divisor 29 | return new_v 30 | 31 | 32 | class ConvBNReLU(nn.Sequential): 33 | def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1): 34 | padding = (kernel_size - 1) // 2 35 | super(ConvBNReLU, self).__init__( 36 | nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False), 37 | nn.BatchNorm2d(out_planes), 38 | nn.ReLU6(inplace=True) 39 | ) 40 | 41 | 42 | class InvertedResidual(nn.Module): 43 | def __init__(self, inp, oup, stride, expand_ratio): 44 | super(InvertedResidual, self).__init__() 45 | self.stride = stride 46 | assert stride in [1, 2] 47 | 48 | hidden_dim = int(round(inp * expand_ratio)) 49 | self.use_res_connect = self.stride == 1 and inp == oup 50 | 51 | layers = [] 52 | if expand_ratio != 1: 53 | # pw 54 | layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1)) 55 | layers.extend([ 56 | # dw 57 | ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim), 58 | # pw-linear 59 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), 60 | nn.BatchNorm2d(oup), 61 | ]) 62 | self.conv = nn.Sequential(*layers) 63 | 64 | def forward(self, x): 65 | if self.use_res_connect: 66 | return x + self.conv(x) 67 | else: 68 | return self.conv(x) 69 | 70 | 71 | class MobileNetV2(nn.Module): 72 | def __init__(self, num_classes=1000, width_mult=1.0, inverted_residual_setting=None, round_nearest=8): 73 | """ 74 | MobileNet V2 main class 75 | 76 | Args: 77 | num_classes (int): Number of classes 78 | width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount 79 | inverted_residual_setting: Network structure 80 | round_nearest (int): Round the number of channels in each layer to be a multiple of this number 81 | Set to 1 to turn off rounding 82 | """ 83 | super(MobileNetV2, self).__init__() 84 | block = InvertedResidual 85 | input_channel = 32 86 | last_channel = 1280 87 | 88 | if inverted_residual_setting is None: 89 | inverted_residual_setting = [ 90 | # t, c, n, s 91 | [1, 16, 1, 1], 92 | [6, 24, 2, 2], 93 | [6, 32, 3, 2], 94 | [6, 64, 4, 2], 95 | [6, 96, 3, 1], 96 | [6, 160, 3, 2], 97 | [6, 320, 1, 1], 98 | ] 99 | 100 | # only check the first element, assuming user knows t,c,n,s are required 101 | if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4: 102 | raise ValueError("inverted_residual_setting should be non-empty " 103 | "or a 4-element list, got {}".format(inverted_residual_setting)) 104 | 105 | # building first layer 106 | input_channel = _make_divisible(input_channel * width_mult, round_nearest) 107 | self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest) 108 | features = [ConvBNReLU(3, input_channel, stride=2)] 109 | # building inverted residual blocks 110 | for t, c, n, s in inverted_residual_setting: 111 | output_channel = _make_divisible(c * width_mult, round_nearest) 112 | for i in range(n): 113 | stride = s if i == 0 else 1 114 | features.append(block(input_channel, output_channel, stride, expand_ratio=t)) 115 | input_channel = output_channel 116 | # building last several layers 117 | features.append(ConvBNReLU(input_channel, self.last_channel, kernel_size=1)) 118 | # make it nn.Sequential 119 | self.features = nn.Sequential(*features) 120 | 121 | # building classifier 122 | self.classifier = nn.Sequential( 123 | nn.Dropout(0.2), 124 | nn.Linear(self.last_channel, num_classes), 125 | ) 126 | 127 | # weight initialization 128 | for m in self.modules(): 129 | if isinstance(m, nn.Conv2d): 130 
| nn.init.kaiming_normal_(m.weight, mode='fan_out') 131 | if m.bias is not None: 132 | nn.init.zeros_(m.bias) 133 | elif isinstance(m, nn.BatchNorm2d): 134 | nn.init.ones_(m.weight) 135 | nn.init.zeros_(m.bias) 136 | elif isinstance(m, nn.Linear): 137 | nn.init.normal_(m.weight, 0, 0.01) 138 | nn.init.zeros_(m.bias) 139 | 140 | def forward(self, x): 141 | x = self.features(x) 142 | x = x.mean([2, 3]) 143 | x = self.classifier(x) 144 | return x 145 | 146 | def get_featmap(self, x): 147 | x = self.features(x) 148 | return x, x.mean([2, 3]) 149 | 150 | @property 151 | def feature_dim(self): 152 | return self.last_channel 153 | 154 | 155 | def mobilenet_v2(pretrained=False, progress=True, **kwargs): 156 | """ 157 | Constructs a MobileNetV2 architecture from 158 | `"MobileNetV2: Inverted Residuals and Linear Bottlenecks" `_. 159 | 160 | Args: 161 | pretrained (bool): If True, returns a model pre-trained on ImageNet 162 | progress (bool): If True, displays a progress bar of the download to stderr 163 | """ 164 | model = MobileNetV2(**kwargs) 165 | if pretrained: 166 | state_dict = load_state_dict_from_url(model_urls['mobilenet_v2'], 167 | progress=progress) 168 | model.load_state_dict(state_dict) 169 | return model 170 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/models/ppo.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import math 6 | from torch.distributions import Categorical 7 | 8 | 9 | class Memory: 10 | def __init__(self): 11 | self.actions = [] 12 | self.states = [] 13 | self.logprobs = [] 14 | self.rewards = [] 15 | self.is_terminals = [] 16 | self.hidden = [] 17 | 18 | def clear_memory(self): 19 | del self.actions[:] 20 | del self.states[:] 21 | del self.logprobs[:] 22 | del self.rewards[:] 23 | del self.is_terminals[:] 24 | del self.hidden[:] 25 | 26 | 27 | class ActorCritic(nn.Module): 28 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim=1024, policy_conv=True): 29 | super(ActorCritic, self).__init__() 30 | 31 | # encoder with convolution layer for MobileNetV3, EfficientNet and RegNet 32 | if policy_conv: 33 | self.state_encoder = nn.Sequential( 34 | nn.Conv2d(feature_dim, 32, kernel_size=1, stride=1, padding=0, bias=False), 35 | nn.ReLU(), 36 | nn.Flatten(), 37 | nn.Linear(int(state_dim * 32 / feature_dim), hidden_state_dim), 38 | nn.ReLU() 39 | ) 40 | # encoder with linear layer for ResNet and DenseNet 41 | else: 42 | self.state_encoder = nn.Sequential( 43 | nn.Linear(state_dim, 2048), 44 | nn.ReLU(), 45 | nn.Linear(2048, hidden_state_dim), 46 | nn.ReLU() 47 | ) 48 | 49 | self.gru = nn.GRU(hidden_state_dim, hidden_state_dim, batch_first=False) 50 | 51 | self.actor = nn.Sequential( 52 | nn.Linear(hidden_state_dim, action_dim), 53 | nn.Softmax(dim=-1)) 54 | 55 | self.critic = nn.Sequential( 56 | nn.Linear(hidden_state_dim, 1)) 57 | 58 | self.hidden_state_dim = hidden_state_dim 59 | self.action_dim = action_dim 60 | self.policy_conv = policy_conv 61 | self.feature_dim = feature_dim 62 | self.feature_ratio = int(math.sqrt(state_dim/feature_dim)) 63 | 64 | def forward(self): 65 | raise NotImplementedError 66 | 67 | def act(self, state_ini, memory, restart_batch=False, training=True): 68 | if restart_batch: 69 | del memory.hidden[:] 70 | memory.hidden.append(torch.zeros(1, state_ini.size(0), self.hidden_state_dim).cuda()) 71 | 72 | if 
not self.policy_conv: 73 | state = state_ini.flatten(1) 74 | else: 75 | state = state_ini 76 | 77 | # print(state.shape) 78 | state = self.state_encoder(state) 79 | 80 | state, hidden_output = self.gru(state.view(1, state.size(0), state.size(1)), memory.hidden[-1]) 81 | memory.hidden.append(hidden_output) 82 | 83 | state = state[0] 84 | action_probs = self.actor(state) 85 | dist = Categorical(action_probs) 86 | 87 | if training: 88 | action = dist.sample() 89 | action_logprob = dist.log_prob(action) 90 | memory.states.append(state_ini) 91 | memory.actions.append(action) 92 | memory.logprobs.append(action_logprob) 93 | else: 94 | action = action_probs.max(1)[1] 95 | 96 | return action 97 | 98 | def evaluate(self, state, action): 99 | seq_l = state.size(0) 100 | batch_size = state.size(1) 101 | 102 | if not self.policy_conv: 103 | state = state.flatten(2) 104 | state = state.view(seq_l * batch_size, state.size(2)) 105 | else: 106 | state = state.view(seq_l * batch_size, state.size(2), state.size(3), state.size(4)) 107 | 108 | state = self.state_encoder(state) 109 | state = state.view(seq_l, batch_size, -1) 110 | 111 | state, hidden = self.gru(state, torch.zeros(1, batch_size, state.size(2)).cuda()) 112 | state = state.view(seq_l * batch_size, -1) 113 | 114 | action_probs = self.actor(state) 115 | dist = Categorical(action_probs) 116 | action_logprobs = dist.log_prob(torch.squeeze(action.view(seq_l * batch_size, -1))).cuda() 117 | dist_entropy = dist.entropy().cuda() 118 | state_value = self.critic(state) 119 | 120 | return action_logprobs.view(seq_l, batch_size), \ 121 | state_value.view(seq_l, batch_size), \ 122 | dist_entropy.view(seq_l, batch_size) 123 | 124 | 125 | class PPO(nn.Module): 126 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv, gpu=0, 127 | lr=0.0003, betas=(0.9, 0.999), gamma=0.7, K_epochs=1, eps_clip=0.2): 128 | super(PPO, self).__init__() 129 | self.lr = lr 130 | self.betas = betas 131 | self.gamma = gamma 132 | self.eps_clip = eps_clip 133 | self.K_epochs = K_epochs 134 | 135 | self.policy = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 136 | 137 | self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas) 138 | 139 | self.policy_old = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 140 | self.policy_old.load_state_dict(self.policy.state_dict()) 141 | 142 | self.MseLoss = nn.MSELoss() 143 | 144 | def select_action(self, state, memory, restart_batch=False, training=True): 145 | return self.policy_old.act(state, memory, restart_batch, training) 146 | 147 | def update(self, memory): 148 | rewards = [] 149 | discounted_reward = 0 150 | 151 | for reward in reversed(memory.rewards): 152 | discounted_reward = reward + (self.gamma * discounted_reward) 153 | rewards.insert(0, discounted_reward) 154 | 155 | rewards = torch.cat(rewards, 0).cuda() 156 | 157 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5) 158 | 159 | old_states = torch.stack(memory.states, 0).cuda().detach() 160 | old_actions = torch.stack(memory.actions, 0).cuda().detach() 161 | old_logprobs = torch.stack(memory.logprobs, 0).cuda().detach() 162 | 163 | for _ in range(self.K_epochs): 164 | logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions) 165 | 166 | ratios = torch.exp(logprobs - old_logprobs.detach()) 167 | 168 | advantages = rewards - state_values.detach() 169 | surr1 = ratios * advantages 170 | surr2 = torch.clamp(ratios, 1 
- self.eps_clip, 1 + self.eps_clip) * advantages 171 | 172 | loss = -torch.min(surr1, surr2) + 0.5 * self.MseLoss(state_values, rewards) - 0.01 * dist_entropy 173 | 174 | self.optimizer.zero_grad() 175 | loss.mean().backward() 176 | self.optimizer.step() 177 | 178 | self.policy_old.load_state_dict(self.policy.state_dict()) -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/models/utils.py: -------------------------------------------------------------------------------- 1 | 2 | try: 3 | from torch.hub import load_state_dict_from_url 4 | except ImportError: 5 | from torch.utils.model_zoo import load_url as load_state_dict_from_url 6 | 7 | import torchvision 8 | import torch 9 | import numpy as np 10 | 11 | def prep_a_net(model_name, shall_pretrain): 12 | model = getattr(torchvision.models, model_name)(shall_pretrain) 13 | if "resnet" in model_name: 14 | model.last_layer_name = 'fc' 15 | elif "mobilenet_v2" in model_name: 16 | model.last_layer_name = 'classifier' 17 | return model 18 | 19 | def zero_pad(im, pad_size): 20 | """Performs zero padding (CHW format).""" 21 | pad_width = ((0, 0), (pad_size, pad_size), (pad_size, pad_size)) 22 | return np.pad(im, pad_width, mode="constant") 23 | 24 | def random_crop(im, size, pad_size=0): 25 | """Performs random crop (CHW format).""" 26 | if pad_size > 0: 27 | im = zero_pad(im=im, pad_size=pad_size) 28 | h, w = im.shape[1:] 29 | if size == h: 30 | return im 31 | y = np.random.randint(0, h - size) 32 | x = np.random.randint(0, w - size) 33 | im_crop = im[:, y : (y + size), x : (x + size)] 34 | assert im_crop.shape[1:] == (size, size) 35 | return im_crop 36 | 37 | def get_patch(images, action_sequence, patch_size): 38 | """Get small patch of the original image""" 39 | batch_size = images.size(0) 40 | image_size = images.size(2) 41 | 42 | patch_coordinate = torch.floor(action_sequence * (image_size - patch_size)).int() 43 | patches = [] 44 | for i in range(batch_size): 45 | per_patch = images[i, :, 46 | (patch_coordinate[i, 0].item()): ((patch_coordinate[i, 0] + patch_size).item()), 47 | (patch_coordinate[i, 1].item()): ((patch_coordinate[i, 1] + patch_size).item())] 48 | 49 | patches.append(per_patch.view(1, per_patch.size(0), per_patch.size(1), per_patch.size(2))) 50 | 51 | return torch.cat(patches, 0) -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/dataset.py: -------------------------------------------------------------------------------- 1 | import torch.utils.data as data 2 | import torch 3 | 4 | from PIL import Image 5 | import os 6 | import numpy as np 7 | from numpy.random import randint 8 | 9 | 10 | class VideoRecord(object): 11 | def __init__(self, row): 12 | self._data = row 13 | self._labels = torch.tensor([-1, -1, -1]) 14 | labels = sorted(list(set([int(x) for x in self._data[2:]]))) 15 | for i, l in enumerate(labels): 16 | self._labels[i] = l 17 | 18 | @property 19 | def path(self): 20 | return self._data[0] 21 | 22 | @property 23 | def num_frames(self): 24 | return int(self._data[1]) 25 | 26 | @property 27 | def label(self): 28 | if self._labels[-2] > -1: 29 | if self._labels[-1] > -1: 30 | return self._labels[torch.randperm(self._labels.shape[0])] 31 | else: 32 | if torch.rand(1) > 0.5: 33 | return self._labels[[0,1,2]] 34 | else: 35 | return self.label[[1,0,2]] 36 | else: 37 | return self._labels 38 | 39 | 40 | class TSNDataSet(data.Dataset): 41 | def __init__(self, 
root_path, list_file, 42 | num_segments=3, image_tmpl='img_{:05d}.jpg', transform=None, 43 | random_shift=True, test_mode=False, 44 | remove_missing=False, dense_sample=False, twice_sample=False, 45 | dataset=None, partial_fcvid_eval=False, partial_ratio=None, 46 | ada_reso_skip=False, reso_list=None, random_crop=False, center_crop=False, ada_crop_list=None, 47 | rescale_to=224, policy_input_offset=None, save_meta=False): 48 | 49 | self.root_path = root_path 50 | 51 | self.list_file = \ 52 | ".".join(list_file.split(".")[:-1]) + "." + list_file.split(".")[-1] # TODO 53 | self.num_segments = num_segments 54 | self.image_tmpl = image_tmpl 55 | self.transform = transform 56 | self.random_shift = random_shift 57 | self.test_mode = test_mode 58 | self.remove_missing = remove_missing 59 | self.dense_sample = dense_sample # using dense sample as I3D 60 | self.twice_sample = twice_sample # twice sample for more validation 61 | 62 | # TODO(yue) 63 | self.dataset = dataset 64 | self.partial_fcvid_eval = partial_fcvid_eval 65 | self.partial_ratio = partial_ratio 66 | self.ada_reso_skip = ada_reso_skip 67 | self.reso_list = reso_list 68 | self.random_crop = random_crop 69 | self.center_crop = center_crop 70 | self.ada_crop_list = ada_crop_list 71 | self.rescale_to = rescale_to 72 | self.policy_input_offset = policy_input_offset 73 | self.save_meta = save_meta 74 | 75 | if self.dense_sample: 76 | print('=> Using dense sample for the dataset...') 77 | if self.twice_sample: 78 | print('=> Using twice sample for the dataset...') 79 | 80 | self._parse_list() 81 | 82 | def _load_image(self, directory, idx): 83 | try: 84 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(idx))).convert('RGB')] 85 | except Exception: 86 | print('error loading image:', os.path.join(self.root_path, directory, self.image_tmpl.format(idx))) 87 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(1))).convert('RGB')] 88 | 89 | def _parse_list(self): 90 | # check the frame number is large >3: 91 | splitter = "," if self.dataset in ["actnet", "fcvid"] else " " 92 | if self.dataset == "kinetics": 93 | splitter = ";" 94 | tmp = [x.strip().split(splitter) for x in open(self.list_file)] 95 | 96 | if any(len(items) >= 3 for items in tmp) and self.dataset == "minik": 97 | tmp = [[splitter.join(x[:-2]), x[-2], x[-1]] for x in tmp] 98 | 99 | if self.dataset == "kinetics": 100 | tmp = [[x[0], x[-2], x[-1]] for x in tmp] 101 | 102 | if not self.test_mode or self.remove_missing: 103 | tmp = [item for item in tmp if int(item[1]) >= 3] 104 | 105 | if self.partial_fcvid_eval and self.dataset == "fcvid": 106 | tmp = tmp[:int(len(tmp) * self.partial_ratio)] 107 | 108 | self.video_list = [VideoRecord(item) for item in tmp] 109 | 110 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 111 | for v in self.video_list: 112 | v._data[1] = int(v._data[1]) / 2 113 | print('video number:%d' % (len(self.video_list))) 114 | 115 | def _sample_indices(self, record): 116 | """ 117 | :param record: VideoRecord 118 | :return: list 119 | """ 120 | if self.dense_sample: # i3d dense sample 121 | sample_pos = max(1, 1 + record.num_frames - 64) 122 | t_stride = 64 // self.num_segments 123 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 124 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments)] 125 | return np.array(offsets) + 1 126 | else: # normal sample 127 | average_duration = record.num_frames // self.num_segments 128 | if 
average_duration > 0: 129 | offsets = np.multiply(list(range(self.num_segments)), average_duration) + randint(average_duration, 130 | size=self.num_segments) 131 | elif record.num_frames > self.num_segments: 132 | offsets = np.sort(randint(record.num_frames, size=self.num_segments)) 133 | else: 134 | offsets = np.array( 135 | list(range(record.num_frames)) + [record.num_frames - 1] * (self.num_segments - record.num_frames)) 136 | return offsets + 1 137 | 138 | def _get_val_indices(self, record): 139 | if self.dense_sample: # i3d dense sample 140 | sample_pos = max(1, 1 + record.num_frames - 64) 141 | t_stride = 64 // self.num_segments 142 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 143 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments)] 144 | return np.array(offsets) + 1 145 | else: 146 | if record.num_frames > self.num_segments: 147 | tick = record.num_frames / float(self.num_segments) 148 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)]) 149 | else: 150 | offsets = np.array( 151 | list(range(record.num_frames)) + [record.num_frames - 1] * (self.num_segments - record.num_frames)) 152 | return offsets + 1 153 | 154 | def _get_test_indices(self, record): 155 | if self.dense_sample: 156 | sample_pos = max(1, 1 + record.num_frames - 64) 157 | t_stride = 64 // self.num_segments 158 | start_list = np.linspace(0, sample_pos - 1, num=10, dtype=int) 159 | offsets = [] 160 | for start_idx in start_list.tolist(): 161 | offsets += [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments)] 162 | return np.array(offsets) + 1 163 | elif self.twice_sample: 164 | tick = record.num_frames / float(self.num_segments) 165 | 166 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)] + 167 | [int(tick * x) for x in range(self.num_segments)]) 168 | 169 | return offsets + 1 170 | else: 171 | tick = record.num_frames / float(self.num_segments) 172 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)]) 173 | return offsets + 1 174 | 175 | def __getitem__(self, index): 176 | record = self.video_list[index] 177 | # check this is a legit video folder 178 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 179 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 180 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 181 | else: 182 | file_name = self.image_tmpl.format(1) 183 | full_path = os.path.join(self.root_path, record.path, file_name) 184 | 185 | err_cnt = 0 186 | while not os.path.exists(full_path): 187 | err_cnt += 1 188 | if err_cnt > 3: 189 | exit("Sth wrong with the dataloader to get items. Check your data path. 
Exit...") 190 | print('################## Not Found:', os.path.join(self.root_path, record.path, file_name)) 191 | index = np.random.randint(len(self.video_list)) 192 | record = self.video_list[index] 193 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 194 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 195 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 196 | else: 197 | file_name = self.image_tmpl.format(1) 198 | full_path = os.path.join(self.root_path, record.path, file_name) 199 | 200 | if not self.test_mode: 201 | segment_indices = self._sample_indices(record) if self.random_shift else self._get_val_indices(record) 202 | else: 203 | segment_indices = self._get_test_indices(record) 204 | return self.get(record, segment_indices) 205 | 206 | def get(self, record, indices): 207 | 208 | images = list() 209 | for seg_ind in indices: 210 | images.extend(self._load_image(record.path, int(seg_ind))) 211 | 212 | process_data = self.transform(images) 213 | if self.ada_reso_skip: 214 | return_items = [process_data] 215 | if self.random_crop: 216 | rescaled = [self.random_crop_proc(process_data, (x, x)) for x in self.reso_list[1:]] 217 | elif self.center_crop: 218 | rescaled = [self.center_crop_proc(process_data, (x, x)) for x in self.reso_list[1:]] 219 | else: 220 | rescaled = [self.rescale_proc(process_data, (x, x)) for x in self.reso_list[1:]] 221 | return_items = return_items + rescaled 222 | if self.save_meta: 223 | return_items = return_items + [record.path] + [indices] # [torch.tensor(indices)] 224 | return_items = return_items + [record.label] 225 | 226 | return tuple(return_items) 227 | else: 228 | if self.rescale_to == 224: 229 | rescaled = process_data 230 | else: 231 | x = self.rescale_to 232 | if self.random_crop: 233 | rescaled = self.random_crop_proc(process_data, (x, x)) 234 | elif self.center_crop: 235 | rescaled = self.center_crop_proc(process_data, (x, x)) 236 | else: 237 | rescaled = self.rescale_proc(process_data, (x, x)) 238 | 239 | return rescaled, record.label 240 | 241 | # TODO(yue) 242 | # (NC, H, W)->(NC, H', W') 243 | def rescale_proc(self, input_data, size): 244 | return torch.nn.functional.interpolate(input_data.unsqueeze(1), size=size, mode='nearest').squeeze(1) 245 | 246 | def center_crop_proc(self, input_data, size): 247 | h = input_data.shape[1] // 2 248 | w = input_data.shape[2] // 2 249 | return input_data[:, h - size[0] // 2:h + size[0] // 2, w - size[1] // 2:w + size[1] // 2] 250 | 251 | def random_crop_proc(self, input_data, size): 252 | H = input_data.shape[1] 253 | W = input_data.shape[2] 254 | input_data_nchw = input_data.view(-1, 3, H, W) 255 | batchsize = input_data_nchw.shape[0] 256 | return_list = [] 257 | hs0 = np.random.randint(0, H - size[0], batchsize) 258 | ws0 = np.random.randint(0, W - size[1], batchsize) 259 | for i in range(batchsize): 260 | return_list.append(input_data_nchw[i, :, hs0[i]:hs0[i] + size[0], ws0[i]:ws0[i] + size[1]]) 261 | return torch.stack(return_list).view(batchsize * 3, size[0], size[1]) 262 | 263 | def __len__(self): 264 | return len(self.video_list) -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/dataset_config.py: -------------------------------------------------------------------------------- 1 | from os.path import join as ospj 2 | 3 | def return_actnet(data_dir): 4 | filename_categories = ospj(data_dir, 'classInd.txt') 5 | root_data = data_dir + "/frames" 6 | 
filename_imglist_train = ospj(data_dir, 'actnet_train_split.txt') 7 | filename_imglist_val = ospj(data_dir, 'actnet_val_split.txt') 8 | prefix = 'image_{:05d}.jpg' 9 | 10 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 11 | 12 | 13 | def return_fcvid(data_dir): 14 | filename_categories = ospj(data_dir, 'classInd.txt') 15 | root_data = data_dir + "/frames" 16 | filename_imglist_train = ospj(data_dir, 'fcvid_train_split.txt') 17 | filename_imglist_val = ospj(data_dir, 'fcvid_val_split.txt') 18 | prefix = 'image_{:05d}.jpg' 19 | 20 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 21 | 22 | 23 | def return_minik(data_dir): 24 | filename_categories = ospj(data_dir, 'minik_classInd.txt') 25 | root_data = data_dir + "/frames" 26 | filename_imglist_train = ospj(data_dir, 'mini_train_videofolder.txt') 27 | filename_imglist_val = ospj(data_dir, 'mini_val_videofolder.txt') 28 | prefix = 'image_{:05d}.jpg' 29 | 30 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 31 | 32 | 33 | def return_dataset(dataset, data_dir): 34 | dict_single = {'actnet': return_actnet, 'fcvid': return_fcvid, 'minik': return_minik} 35 | if dataset in dict_single: 36 | file_categories, file_imglist_train, file_imglist_val, root_data, prefix = dict_single[dataset](data_dir) 37 | else: 38 | raise ValueError('Unknown dataset ' + dataset) 39 | 40 | if isinstance(file_categories, str): 41 | with open(file_categories) as f: 42 | lines = f.readlines() 43 | categories = [item.rstrip() for item in lines] 44 | else: # number of categories 45 | categories = [None] * file_categories 46 | n_class = len(categories) 47 | print('{}: {} classes'.format(dataset, n_class)) 48 | return n_class, file_imglist_train, file_imglist_val, root_data, prefix 49 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/transforms.py: -------------------------------------------------------------------------------- 1 | import torchvision 2 | import random 3 | from PIL import Image, ImageOps 4 | import numpy as np 5 | import numbers 6 | import math 7 | import torch 8 | 9 | 10 | class GroupRandomCrop(object): 11 | def __init__(self, size): 12 | if isinstance(size, numbers.Number): 13 | self.size = (int(size), int(size)) 14 | else: 15 | self.size = size 16 | 17 | def __call__(self, img_group): 18 | 19 | w, h = img_group[0].size 20 | th, tw = self.size 21 | 22 | out_images = list() 23 | 24 | x1 = random.randint(0, w - tw) 25 | y1 = random.randint(0, h - th) 26 | 27 | for img in img_group: 28 | assert (img.size[0] == w and img.size[1] == h) 29 | if w == tw and h == th: 30 | out_images.append(img) 31 | else: 32 | out_images.append(img.crop((x1, y1, x1 + tw, y1 + th))) 33 | 34 | return out_images 35 | 36 | 37 | class GroupCenterCrop(object): 38 | def __init__(self, size): 39 | self.worker = torchvision.transforms.CenterCrop(size) 40 | 41 | def __call__(self, img_group): 42 | return [self.worker(img) for img in img_group] 43 | 44 | 45 | class GroupRandomHorizontalFlip(object): 46 | """Randomly horizontally flips the given PIL.Image with a probability of 0.5 47 | """ 48 | 49 | def __init__(self, is_flow=False): 50 | self.is_flow = is_flow 51 | 52 | def __call__(self, img_group, is_flow=False): 53 | v = random.random() 54 | if v < 0.5: 55 | ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group] 56 | if self.is_flow: 57 | for i in range(0, 
len(ret), 2): 58 | ret[i] = ImageOps.invert(ret[i]) # invert flow pixel values when flipping 59 | return ret 60 | else: 61 | return img_group 62 | 63 | 64 | class GroupNormalize(object): 65 | def __init__(self, mean, std): 66 | self.mean = mean 67 | self.std = std 68 | 69 | def __call__(self, tensor): 70 | rep_mean = self.mean * (tensor.size()[0] // len(self.mean)) 71 | rep_std = self.std * (tensor.size()[0] // len(self.std)) 72 | 73 | # TODO: make efficient 74 | for t, m, s in zip(tensor, rep_mean, rep_std): 75 | t.sub_(m).div_(s) 76 | 77 | return tensor 78 | 79 | 80 | class GroupScale(object): 81 | """ Rescales the input PIL.Image to the given 'size'. 82 | 'size' will be the size of the smaller edge. 83 | For example, if height > width, then image will be 84 | rescaled to (size * height / width, size) 85 | size: size of the smaller edge 86 | interpolation: Default: PIL.Image.BILINEAR 87 | """ 88 | 89 | def __init__(self, size, interpolation=Image.BILINEAR): 90 | self.worker = torchvision.transforms.Resize(size, interpolation) 91 | 92 | def __call__(self, img_group): 93 | return [self.worker(img) for img in img_group] 94 | 95 | 96 | class GroupOverSample(object): 97 | def __init__(self, crop_size, scale_size=None, flip=True): 98 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 99 | 100 | if scale_size is not None: 101 | self.scale_worker = GroupScale(scale_size) 102 | else: 103 | self.scale_worker = None 104 | self.flip = flip 105 | 106 | def __call__(self, img_group): 107 | 108 | if self.scale_worker is not None: 109 | img_group = self.scale_worker(img_group) 110 | 111 | image_w, image_h = img_group[0].size 112 | crop_w, crop_h = self.crop_size 113 | 114 | offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h) 115 | oversample_group = list() 116 | for o_w, o_h in offsets: 117 | normal_group = list() 118 | flip_group = list() 119 | for i, img in enumerate(img_group): 120 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 121 | normal_group.append(crop) 122 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 123 | 124 | if img.mode == 'L' and i % 2 == 0: 125 | flip_group.append(ImageOps.invert(flip_crop)) 126 | else: 127 | flip_group.append(flip_crop) 128 | 129 | oversample_group.extend(normal_group) 130 | if self.flip: 131 | oversample_group.extend(flip_group) 132 | return oversample_group 133 | 134 | 135 | class GroupFullResSample(object): 136 | def __init__(self, crop_size, scale_size=None, flip=True): 137 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 138 | 139 | if scale_size is not None: 140 | self.scale_worker = GroupScale(scale_size) 141 | else: 142 | self.scale_worker = None 143 | self.flip = flip 144 | 145 | def __call__(self, img_group): 146 | 147 | if self.scale_worker is not None: 148 | img_group = self.scale_worker(img_group) 149 | 150 | image_w, image_h = img_group[0].size 151 | crop_w, crop_h = self.crop_size 152 | 153 | w_step = (image_w - crop_w) // 4 154 | h_step = (image_h - crop_h) // 4 155 | 156 | offsets = list() 157 | offsets.append((0 * w_step, 2 * h_step)) # left 158 | offsets.append((4 * w_step, 2 * h_step)) # right 159 | offsets.append((2 * w_step, 2 * h_step)) # center 160 | 161 | oversample_group = list() 162 | for o_w, o_h in offsets: 163 | normal_group = list() 164 | flip_group = list() 165 | for i, img in enumerate(img_group): 166 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 167 | normal_group.append(crop) 168 | if 
self.flip: 169 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 170 | 171 | if img.mode == 'L' and i % 2 == 0: 172 | flip_group.append(ImageOps.invert(flip_crop)) 173 | else: 174 | flip_group.append(flip_crop) 175 | 176 | oversample_group.extend(normal_group) 177 | oversample_group.extend(flip_group) 178 | return oversample_group 179 | 180 | 181 | class GroupMultiScaleCrop(object): 182 | 183 | def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True): 184 | self.scales = scales if scales is not None else [1, .875, .75, .66] 185 | self.max_distort = max_distort 186 | self.fix_crop = fix_crop 187 | self.more_fix_crop = more_fix_crop 188 | self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size] 189 | self.interpolation = Image.BILINEAR 190 | 191 | def __call__(self, img_group): 192 | 193 | im_size = img_group[0].size 194 | 195 | crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size) 196 | crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group] 197 | ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation) 198 | for img in crop_img_group] 199 | return ret_img_group 200 | 201 | def _sample_crop_size(self, im_size): 202 | image_w, image_h = im_size[0], im_size[1] 203 | 204 | # find a crop size 205 | base_size = min(image_w, image_h) 206 | crop_sizes = [int(base_size * x) for x in self.scales] 207 | crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes] 208 | crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes] 209 | 210 | pairs = [] 211 | for i, h in enumerate(crop_h): 212 | for j, w in enumerate(crop_w): 213 | if abs(i - j) <= self.max_distort: 214 | pairs.append((w, h)) 215 | 216 | crop_pair = random.choice(pairs) 217 | if not self.fix_crop: 218 | w_offset = random.randint(0, image_w - crop_pair[0]) 219 | h_offset = random.randint(0, image_h - crop_pair[1]) 220 | else: 221 | w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1]) 222 | 223 | return crop_pair[0], crop_pair[1], w_offset, h_offset 224 | 225 | def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h): 226 | offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h) 227 | return random.choice(offsets) 228 | 229 | @staticmethod 230 | def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h): 231 | w_step = (image_w - crop_w) // 4 232 | h_step = (image_h - crop_h) // 4 233 | 234 | ret = list() 235 | ret.append((0, 0)) # upper left 236 | ret.append((4 * w_step, 0)) # upper right 237 | ret.append((0, 4 * h_step)) # lower left 238 | ret.append((4 * w_step, 4 * h_step)) # lower right 239 | ret.append((2 * w_step, 2 * h_step)) # center 240 | 241 | if more_fix_crop: 242 | ret.append((0, 2 * h_step)) # center left 243 | ret.append((4 * w_step, 2 * h_step)) # center right 244 | ret.append((2 * w_step, 4 * h_step)) # lower center 245 | ret.append((2 * w_step, 0 * h_step)) # upper center 246 | 247 | ret.append((1 * w_step, 1 * h_step)) # upper left quarter 248 | ret.append((3 * w_step, 1 * h_step)) # upper right quarter 249 | ret.append((1 * w_step, 3 * h_step)) # lower left quarter 250 | ret.append((3 * w_step, 3 * h_step)) # lower righ quarter 251 | 252 | return ret 253 | 254 | 255 | class GroupRandomSizedCrop(object): 256 | """Random crop the given PIL.Image to a random size of (0.08 to 1.0) of the original 
size 257 | and and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio 258 | This is popularly used to train the Inception networks 259 | size: size of the smaller edge 260 | interpolation: Default: PIL.Image.BILINEAR 261 | """ 262 | 263 | def __init__(self, size, interpolation=Image.BILINEAR): 264 | self.size = size 265 | self.interpolation = interpolation 266 | 267 | def __call__(self, img_group): 268 | for attempt in range(10): 269 | area = img_group[0].size[0] * img_group[0].size[1] 270 | target_area = random.uniform(0.08, 1.0) * area 271 | aspect_ratio = random.uniform(3. / 4, 4. / 3) 272 | 273 | w = int(round(math.sqrt(target_area * aspect_ratio))) 274 | h = int(round(math.sqrt(target_area / aspect_ratio))) 275 | 276 | if random.random() < 0.5: 277 | w, h = h, w 278 | 279 | if w <= img_group[0].size[0] and h <= img_group[0].size[1]: 280 | x1 = random.randint(0, img_group[0].size[0] - w) 281 | y1 = random.randint(0, img_group[0].size[1] - h) 282 | found = True 283 | break 284 | else: 285 | found = False 286 | x1 = 0 287 | y1 = 0 288 | 289 | if found: 290 | out_group = list() 291 | for img in img_group: 292 | img = img.crop((x1, y1, x1 + w, y1 + h)) 293 | assert (img.size == (w, h)) 294 | out_group.append(img.resize((self.size, self.size), self.interpolation)) 295 | return out_group 296 | else: 297 | # Fallback 298 | scale = GroupScale(self.size, interpolation=self.interpolation) 299 | crop = GroupRandomCrop(self.size) 300 | return crop(scale(img_group)) 301 | 302 | 303 | class Stack(object): 304 | 305 | def __init__(self, roll=False): 306 | self.roll = roll 307 | 308 | def __call__(self, img_group): 309 | if img_group[0].mode == 'L': 310 | return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2) 311 | elif img_group[0].mode == 'RGB': 312 | if self.roll: 313 | return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2) 314 | else: 315 | return np.concatenate(img_group, axis=2) 316 | 317 | 318 | class ToTorchFormatTensor(object): 319 | """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255] 320 | to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] """ 321 | 322 | def __init__(self, div=True): 323 | self.div = div 324 | 325 | def __call__(self, pic): 326 | if isinstance(pic, np.ndarray): 327 | # handle numpy array 328 | img = torch.from_numpy(pic).permute(2, 0, 1).contiguous() 329 | else: 330 | # handle PIL Image 331 | img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes())) 332 | img = img.view(pic.size[1], pic.size[0], len(pic.mode)) 333 | # put it from HWC to CHW format 334 | # yikes, this transpose takes 80% of the loading time/CPU 335 | img = img.transpose(0, 1).transpose(0, 2).contiguous() 336 | return img.float().div(255) if self.div else img.float() 337 | 338 | 339 | class IdentityTransform(object): 340 | 341 | def __call__(self, data): 342 | return data 343 | 344 | 345 | if __name__ == "__main__": 346 | trans = torchvision.transforms.Compose([ 347 | GroupScale(256), 348 | GroupRandomCrop(224), 349 | Stack(), 350 | ToTorchFormatTensor(), 351 | GroupNormalize( 352 | mean=[.485, .456, .406], 353 | std=[.229, .224, .225] 354 | )] 355 | ) 356 | 357 | im = Image.open('../tensorflow-model-zoo.torch/lena_299.png') 358 | 359 | color_group = [im] * 3 360 | rst = trans(color_group) 361 | 362 | gray_group = [im.convert('L')] * 9 363 | gray_rst = trans(gray_group) 364 | 365 | trans2 = torchvision.transforms.Compose([ 366 | GroupRandomSizedCrop(256), 367 | Stack(), 368 | ToTorchFormatTensor(), 
369 | GroupNormalize( 370 | mean=[.485, .456, .406], 371 | std=[.229, .224, .225]) 372 | ]) 373 | print(trans2(color_group)) 374 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn.functional as F 4 | from PIL import Image, ImageDraw, ImageFont 5 | 6 | 7 | def softmax(scores): 8 | es = np.exp(scores - scores.max(axis=-1)[..., None]) 9 | return es / es.sum(axis=-1)[..., None] 10 | 11 | class AverageMeter(object): 12 | """Computes and stores the average and current value""" 13 | def __init__(self, name, fmt=':f'): 14 | self.name = name 15 | self.fmt = fmt 16 | self.reset() 17 | 18 | def reset(self): 19 | self.val = 0 20 | self.avg = 0 21 | self.sum = 0 22 | self.count = 0 23 | 24 | def update(self, val, n=1): 25 | self.val = val 26 | self.sum += val * n 27 | self.count += n 28 | self.avg = self.sum / self.count 29 | 30 | def __str__(self): 31 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 32 | return fmtstr.format(**self.__dict__) 33 | 34 | 35 | def accuracy(output, target, topk=(1,)): 36 | """Computes the accuracy over the k top predictions for the specified values of k""" 37 | with torch.no_grad(): 38 | maxk = max(topk) 39 | batch_size = target.size(0) 40 | 41 | _, pred = output.topk(maxk, 1, True, True) 42 | pred = pred.t() 43 | correct = pred.eq(target.reshape(1, -1).expand_as(pred)) 44 | 45 | res = [] 46 | for k in topk: 47 | correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True) 48 | res.append(correct_k.mul_(100.0 / batch_size)) 49 | return res 50 | 51 | def get_multi_hot(test_y, classes, assumes_starts_zero=True): 52 | bs = test_y.shape[0] 53 | label_cnt = 0 54 | 55 | # TODO ranking labels: (-1,-1,4,5,3,7)->(4,4,2,1,0,3) 56 | if not assumes_starts_zero: 57 | for label_val in torch.unique(test_y): 58 | if label_val >= 0: 59 | test_y[test_y == label_val] = label_cnt 60 | label_cnt += 1 61 | 62 | gt = torch.zeros(bs, classes + 1) # TODO(yue) +1 for -1 in multi-label case 63 | for i in range(test_y.shape[1]): 64 | gt[torch.LongTensor(range(bs)), test_y[:, i]] = 1 # TODO(yue) see? 
65 | 66 | return gt[:, :classes] 67 | 68 | def cal_map(output, old_test_y): 69 | batch_size = output.size(0) 70 | num_classes = output.size(1) 71 | ap = torch.zeros(num_classes) 72 | test_y = old_test_y.clone() 73 | 74 | gt = get_multi_hot(test_y, num_classes, False) 75 | 76 | probs = F.softmax(output, dim=1) 77 | 78 | rg = torch.range(1, batch_size).float() 79 | for k in range(num_classes): 80 | scores = probs[:, k] 81 | targets = gt[:, k] 82 | _, sortind = torch.sort(scores, 0, True) 83 | truth = targets[sortind] 84 | tp = truth.float().cumsum(0) 85 | precision = tp.div(rg) 86 | 87 | ap[k] = precision[truth.byte()].sum() / max(float(truth.sum()), 1) 88 | return ap.mean() * 100, ap * 100 89 | 90 | def cal_reward(confidence, confidence_last, patch_size_list, penalty=0.5): 91 | reward = confidence - confidence_last 92 | reward = reward - penalty*(patch_size_list/100.)**2 93 | return reward 94 | 95 | class ProgressMeter(object): 96 | def __init__(self, num_batches, *meters, prefix=""): 97 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 98 | self.meters = meters 99 | self.prefix = prefix 100 | 101 | def print(self, batch): 102 | entries = [self.prefix + self.batch_fmtstr.format(batch)] 103 | entries += [str(meter) for meter in self.meters] 104 | out = '\t'.join(entries) 105 | print(out) 106 | return out + '\n' 107 | 108 | def _get_batch_fmtstr(self, num_batches): 109 | num_digits = len(str(num_batches // 1)) 110 | fmt = '{:' + str(num_digits) + 'd}' 111 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 112 | 113 | class Recorder: 114 | def __init__(self, larger_is_better=True): 115 | self.history = [] 116 | self.larger_is_better = larger_is_better 117 | self.best_at = None 118 | self.best_val = None 119 | 120 | def is_better_than(self, x, y): 121 | if self.larger_is_better: 122 | return x > y 123 | else: 124 | return x < y 125 | 126 | def update(self, val): 127 | self.history.append(val) 128 | if len(self.history) == 1 or self.is_better_than(val, self.best_val): 129 | self.best_val = val 130 | self.best_at = len(self.history) - 1 131 | 132 | def is_current_best(self): 133 | return self.best_at == len(self.history) - 1 -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/video_jpg.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import os 3 | import time 4 | import subprocess 5 | from tqdm import tqdm 6 | import argparse 7 | from multiprocessing import Pool 8 | 9 | parser = argparse.ArgumentParser(description="Dataset processor: Video->Frames") 10 | parser.add_argument("dir_path", type=str, help="original dataset path") 11 | parser.add_argument("dst_dir_path", type=str, help="dest path to save the frames") 12 | parser.add_argument("--prefix", type=str, default="image_%05d.jpg", help="output image type") 13 | parser.add_argument("--accepted_formats", type=str, default=[".mp4", ".mkv", ".webm"], nargs="+", 14 | help="list of input video formats") 15 | parser.add_argument("--begin", type=int, default=0) 16 | parser.add_argument("--end", type=int, default=666666666) 17 | parser.add_argument("--file_list", type=str, default="") 18 | parser.add_argument("--frame_rate", type=int, default=-1) 19 | parser.add_argument("--num_workers", type=int, default=16) 20 | parser.add_argument("--dry_run", action="store_true") 21 | parser.add_argument("--parallel", action="store_true") 22 | args = parser.parse_args() 23 | 24 
| 25 | def par_job(command): 26 | if args.dry_run: 27 | print(command) 28 | else: 29 | subprocess.call(command, shell=True) 30 | 31 | 32 | if __name__ == "__main__": 33 | t0 = time.time() 34 | dir_path = args.dir_path 35 | dst_dir_path = args.dst_dir_path 36 | 37 | if args.file_list == "": 38 | file_names = sorted(os.listdir(dir_path)) 39 | else: 40 | file_names = [x.strip() for x in open(args.file_list).readlines()] 41 | del_list = [] 42 | for i, file_name in enumerate(file_names): 43 | if not any([x in file_name for x in args.accepted_formats]): 44 | del_list.append(i) 45 | file_names = [x for i, x in enumerate(file_names) if i not in del_list] 46 | file_names = file_names[args.begin:args.end + 1] 47 | print("%d videos to handle (after %d being removed)" % (len(file_names), len(del_list))) 48 | cmd_list = [] 49 | for file_name in tqdm(file_names): 50 | 51 | name, ext = os.path.splitext(file_name) 52 | dst_directory_path = os.path.join(dst_dir_path, name) 53 | 54 | video_file_path = os.path.join(dir_path, file_name) 55 | if not os.path.exists(dst_directory_path): 56 | os.makedirs(dst_directory_path, exist_ok=True) 57 | 58 | if args.frame_rate > 0: 59 | frame_rate_str = "-r %d" % args.frame_rate 60 | else: 61 | frame_rate_str = "" 62 | cmd = 'ffmpeg -nostats -loglevel 0 -i {} -vf scale=-1:360 {} {}/{}'.format(video_file_path, frame_rate_str, 63 | dst_directory_path, args.prefix) 64 | if not args.parallel: 65 | if args.dry_run: 66 | print(cmd) 67 | else: 68 | subprocess.call(cmd, shell=True) 69 | cmd_list.append(cmd) 70 | 71 | if args.parallel: 72 | with Pool(processes=args.num_workers) as pool: 73 | with tqdm(total=len(cmd_list)) as pbar: 74 | for _ in tqdm(pool.imap_unordered(par_job, cmd_list)): 75 | pbar.update() 76 | t1 = time.time() 77 | print("Finished in %.4f seconds" % (t1 - t0)) 78 | os.system("stty sane") 79 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/README.md: -------------------------------------------------------------------------------- 1 | # Experiments on Something-Something V1&V2 2 | 3 | ## Requirements 4 | - python 3.8 5 | - pytorch 1.8.0 6 | - torchvision 0.8.0 7 | - [hydra](https://hydra.cc/docs/intro/) 1.1.0 8 | 9 | ## Datasets 10 | Please follow the instruction of [TSM](https://github.com/mit-han-lab/temporal-shift-module#data-preparation) to prepare the Something-Something V1/V2 dataset. 11 | 12 | ## Pre-trained Models on Something-Something-V1 (V2) 13 | 14 | Please download pre-trained weights and checkpoints from [Google Drive](https://drive.google.com/drive/folders/1QgIjU6FLT3RZbAGAVutgOPuOOOtBPpFb?usp=sharing). 15 | 16 | - Something-Something-V1 (V2) 17 | - mobilenetv2_segment8.pth.tar: pre-trained weights for global CNN (MobileNet-v2). 18 | - resnet50_segment12.pth.tar: pre-trained weights for local CNN (ResNet-50). 19 | - 144x144.pth.tar: checkpoint to reproduce the result in paper with patch size 144x144. 20 | - 160x160.pth.tar: checkpoint to reproduce the result in paper with patch size 160x160. 21 | - 176x176.pth.tar: checkpoint to reproduce the result in paper with patch size 176x176. 22 | 23 | ## Training 24 | 25 | - Here we take training the model with patch size 144x144 on Something-Something-V1 dataset for example. 26 | - All logs and checkpoints will be saved in the directory: `./outputs/YYYY-MM-DD/HH-MM-SS` 27 | - Note that we store a set of default hyper-parameters for each stage in [conf directory](conf) which can be overrided through command line. 
You can also use your own config files (see the command-line override example at the end of this README). 28 | 29 | - Before training, please initialize the Global CNN and Local CNN by fine-tuning the ImageNet pre-trained models in PyTorch with the following commands: 30 | 31 | For the Global CNN, please use the [TSM code](https://github.com/mit-han-lab/temporal-shift-module#data-preparation) with the following command: 32 | ``` 33 | python main.py something RGB \ 34 | --arch mobilenetv2 --num_segments 8 \ 35 | --gd 20 --lr 0.01 --lr_steps 20 40 --epochs 50 \ 36 | --batch-size 64 -j 16 --dropout 0.5 --consensus_type=avg --eval-freq=1 \ 37 | --shift --shift_div=8 --shift_place=blockres --npb 38 | ``` 39 | For the Local CNN, please use the [TSM code](https://github.com/mit-han-lab/temporal-shift-module#data-preparation) with the following command: 40 | ``` 41 | python main.py something RGB \ 42 | --arch resnet50 --num_segments 12 \ 43 | --gd 20 --lr 0.01 --lr_steps 20 40 --epochs 50 \ 44 | --batch-size 64 -j 16 --dropout 0.5 --consensus_type=avg --eval-freq=1 \ 45 | --shift --shift_div=8 --shift_place=blockres --npb 46 | ``` 47 | 48 | - Training stage 1: we provide the command in train_stage1.sh. The pre-trained weights for the Global CNN and Local CNN are required, so first set both the pretrained_glancer and pretrained_focuser arguments correctly in train_stage1.sh, then run it: 49 | ``` 50 | bash train_stage1.sh 51 | ``` 52 | 53 | - Training stage 2: we provide the command in train_stage2.sh. A stage-1 checkpoint is required, so set the pretrained argument correctly in train_stage2.sh, then run it: 54 | ``` 55 | bash train_stage2.sh 56 | ``` 57 | 58 | - Training stage 3: we provide the command in train_stage3.sh. A stage-2 checkpoint is required, so set the pretrained_s2 argument correctly in train_stage3.sh, then run it: 59 | ``` 60 | bash train_stage3.sh 61 | ``` 62 | 63 | 64 | ## Evaluate Pre-trained Models 65 | - Here we take evaluating the model with patch size 144x144 on the Something-Something-V1 dataset as an example. 66 | 67 | We provide the command in evaluate.sh. The pre-trained weights are required, so first set both the resume and patch_size arguments correctly in evaluate.sh, then run it: 68 | 69 | ``` 70 | bash evaluate.sh 71 | ``` 72 | 73 | ## Acknowledgement 74 | We use the official implementation of [temporal-shift-module](https://github.com/mit-han-lab/temporal-shift-module) and the PPO implementation from [here](https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py). 
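## Overriding Configs from the Command Line

The stage entry scripts read their defaults from the YAML files in the [conf directory](conf) via Hydra, so individual hyper-parameters can be overridden as `key=value` arguments on the command line, exactly as evaluate.sh does for conf/evaluate.yaml. The command below is a minimal sketch rather than a reference recipe: it assumes that train_stage1.sh simply launches stage1.py (which loads conf/stage1.yaml), and PATH_TO_DATASET / PATH_TO_*_CKPT are placeholders you need to fill in.

```
# Minimal sketch (assumed entry point: stage1.py with conf/stage1.yaml).
# Every key shown here is defined in conf/stage1.yaml; paths are placeholders.
CUDA_VISIBLE_DEVICES=0,1 python stage1.py \
    dataset=somethingv1 \
    data_dir=PATH_TO_DATASET \
    patch_size=160 \
    batch_size=64 \
    epochs=50 \
    random_patch=true \
    pretrained_glancer=PATH_TO_GLANCER_CKPT \
    pretrained_focuser=PATH_TO_FOCUSER_CKPT
```

Any key that is not overridden keeps the value stored in the corresponding YAML file.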
75 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__init__.py: -------------------------------------------------------------------------------- 1 | import basic_tools.utils as utils 2 | import basic_tools.logger as logger 3 | import basic_tools.checkpoint as checkpoint 4 | 5 | import sys 6 | import os 7 | 8 | def start(args): 9 | cmd_line = " ".join(sys.argv) 10 | print(f"{cmd_line}") 11 | print(f"Working dir: {os.getcwd()}") 12 | utils.set_all_seeds(args.seed) 13 | 14 | print(args) 15 | return args 16 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/checkpoint.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/checkpoint.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/logger.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/logger.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/checkpoint.py: -------------------------------------------------------------------------------- 1 | import time 2 | import signal 3 | import os 4 | import sys 5 | import torch 6 | import socket 7 | 8 | 9 | ''' 10 | Usage: 11 | 12 | init_checkpoint() 13 | 14 | if exist_checkpoint(): 15 | any_object = load_checkpoint() 16 | 17 | save_checkpoint(any_object) 18 | ''' 19 | 20 | CHECKPOINT_filename = 'checkpoint.pth.tar' 21 | CHECKPOINT_tempfile = 'checkpoint.temp' 22 | SIGNAL_RECEIVED = False 23 | 24 | def SIGTERMHandler(a, b): 25 | print('received sigterm') 26 | pass 27 | 28 | 29 | def signalHandler(a, b): 30 | global SIGNAL_RECEIVED 31 | print('Signal received', a, time.time(), flush=True) 32 | SIGNAL_RECEIVED = True 33 | 34 | print("caught signal", a) 35 | print(socket.gethostname(), "USR1 signal caught.") 36 | # do other stuff to cleanup here 37 | print('requeuing job ' + os.environ['SLURM_JOB_ID']) 38 | os.system('scontrol requeue ' + os.environ['SLURM_JOB_ID']) 39 | sys.exit(-1) 40 | 41 | 42 | def init_checkpoint(): 43 | 
signal.signal(signal.SIGUSR1, signalHandler) 44 | signal.signal(signal.SIGTERM, SIGTERMHandler) 45 | print('Signal handler installed', flush=True) 46 | 47 | def save_checkpoint(state): 48 | global CHECKPOINT_filename, CHECKPOINT_tempfile 49 | torch.save(state, CHECKPOINT_tempfile) 50 | if os.path.isfile(CHECKPOINT_tempfile): 51 | os.rename(CHECKPOINT_tempfile, CHECKPOINT_filename) 52 | print("Checkpoint done") 53 | 54 | def save_checkpoint_if_signal(state): 55 | global SIGNAL_RECEIVED 56 | if SIGNAL_RECEIVED: 57 | save_checkpoint(state) 58 | 59 | def exist_checkpoint(): 60 | global CHECKPOINT_filename 61 | return os.path.isfile(CHECKPOINT_filename) 62 | 63 | def load_checkpoint(filename=None): 64 | global CHECKPOINT_filename 65 | if filename is None: 66 | filename = CHECKPOINT_filename 67 | 68 | # optionally resume from a checkpoint 69 | # if args.resume: 70 | #if os.path.isfile(args.resume): 71 | # To make the script simple to understand, we do resume whenever there is 72 | # a checkpoint file 73 | if os.path.isfile(filename): 74 | print(f"=> loading checkpoint {filename}") 75 | checkpoint = torch.load(filename) 76 | print(f"=> loaded checkpoint {filename}") 77 | return checkpoint 78 | else: 79 | raise RuntimeError(f"=> no checkpoint found at '{filename}'") 80 | 81 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/logger.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import logging 4 | 5 | class Logger: 6 | def __init__(self, path, mode='w'): 7 | assert mode in {'w', 'a'}, 'unknown mode for logger %s' % mode 8 | 9 | fh = logging.FileHandler(path, mode=mode) 10 | formatter = logging.Formatter('[%(asctime)s][%(name)s][%(levelname)s] - %(message)s') 11 | fh.setFormatter(formatter) 12 | # ch = logging.StreamHandler(sys.__stdout__) 13 | 14 | self.logger = logging.getLogger() 15 | self.logger.addHandler(fh) 16 | # self.logger.addHandler(ch) 17 | 18 | def write(self, message): 19 | if message == "\n": return 20 | # Remove \n at the end. 21 | self.logger.info(message.strip()) 22 | 23 | def flush(self): 24 | # for python 3 compatibility. 
25 | pass 26 | 27 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import sys 3 | import random 4 | import numpy as np 5 | import os 6 | import subprocess 7 | 8 | from torch import optim 9 | 10 | def set_all_seeds(rand_seed): 11 | random.seed(rand_seed) 12 | np.random.seed(rand_seed) 13 | torch.manual_seed(rand_seed) 14 | torch.cuda.manual_seed(rand_seed) 15 | 16 | def to_cpu(x): 17 | if isinstance(x, dict): 18 | return { k : to_cpu(v) for k, v in x.items() } 19 | elif isinstance(x, list): 20 | return [ to_cpu(v) for v in x ] 21 | elif isinstance(x, torch.Tensor): 22 | return x.cpu() 23 | else: 24 | return x 25 | 26 | def model2numpy(model): 27 | return { k : v.cpu().numpy() for k, v in model.state_dict().items() } 28 | 29 | def activation2numpy(output): 30 | if isinstance(output, dict): 31 | return { k : activation2numpy(v) for k, v in output.items() } 32 | elif isinstance(output, list): 33 | return [ activation2numpy(v) for v in output ] 34 | elif isinstance(output, Variable): 35 | return output.data.cpu().numpy() 36 | 37 | def count_size(x): 38 | if isinstance(x, dict): 39 | return sum([ count_size(v) for k, v in x.items() ]) 40 | elif isinstance(x, list) or isinstance(x, tuple): 41 | return sum([ count_size(v) for v in x ]) 42 | elif isinstance(x, torch.Tensor): 43 | return x.nelement() * x.element_size() 44 | else: 45 | return sys.getsizeof(x) 46 | 47 | def mem2str(num_bytes): 48 | assert num_bytes >= 0 49 | if num_bytes >= 2 ** 30: # GB 50 | val = float(num_bytes) / (2 ** 30) 51 | result = "%.3f GB" % val 52 | elif num_bytes >= 2 ** 20: # MB 53 | val = float(num_bytes) / (2 ** 20) 54 | result = "%.3f MB" % val 55 | elif num_bytes >= 2 ** 10: # KB 56 | val = float(num_bytes) / (2 ** 10) 57 | result = "%.3f KB" % val 58 | else: 59 | result = "%d bytes" % num_bytes 60 | return result 61 | 62 | def get_mem_usage(): 63 | import psutil 64 | 65 | mem = psutil.virtual_memory() 66 | result = "" 67 | result += "available: %s\t" % (mem2str(mem.available)) 68 | result += "used: %s\t" % (mem2str(mem.used)) 69 | result += "free: %s\t" % (mem2str(mem.free)) 70 | # result += "active: %s\t" % (mem2str(mem.active)) 71 | # result += "inactive: %s\t" % (mem2str(mem.inactive)) 72 | # result += "buffers: %s\t" % (mem2str(mem.buffers)) 73 | # result += "cached: %s\t" % (mem2str(mem.cached)) 74 | # result += "shared: %s\t" % (mem2str(mem.shared)) 75 | # result += "slab: %s\t" % (mem2str(mem.slab)) 76 | return result 77 | 78 | def get_github_string(): 79 | _, output = subprocess.getstatusoutput("git -C ./ log --pretty=format:'%H' -n 1") 80 | ret, _ = subprocess.getstatusoutput("git -C ./ diff-index --quiet HEAD --") 81 | return f"Githash: {output}, unstaged: {ret}" 82 | 83 | 84 | def accumulate(all_y, y): 85 | if all_y is None: 86 | all_y = dict() 87 | for k, v in y.items(): 88 | if isinstance(v, list): 89 | all_y[k] = [ [vv] for vv in v ] 90 | else: 91 | all_y[k] = [v] 92 | else: 93 | for k, v in all_y.items(): 94 | if isinstance(y[k], list): 95 | for vv, yy in zip(v, y[k]): 96 | vv.append(yy) 97 | else: 98 | v.append(y[k]) 99 | 100 | return all_y 101 | 102 | def combine(all_y): 103 | output = dict() 104 | for k, v in all_y.items(): 105 | if isinstance(v[0], list): 106 | output[k] = [ torch.cat(vv) for vv in v ] 107 | else: 108 | output[k] = torch.cat(v) 109 | 110 | return output 111 | 112 | def 
concatOutput(loader, nets, condition=None): 113 | outputs = [None] * len(nets) 114 | 115 | use_cnn = nets[0].use_cnn 116 | 117 | with torch.no_grad(): 118 | for i, (x, _) in enumerate(loader): 119 | if not use_cnn: 120 | x = x.view(x.size(0), -1) 121 | x = x.cuda() 122 | 123 | outputs = [ accumulate(output, to_cpu(net(x))) for net, output in zip(nets, outputs) ] 124 | if condition is not None and not condition(i): 125 | break 126 | 127 | return [ combine(output) for output in outputs ] 128 | 129 | 130 | def adjust_learning_rate(args, optimizer, epoch): 131 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 132 | lrs = args.lr_steps.split('-') 133 | lr_steps = [int(lr) for lr in lrs] 134 | if args.lr_type == 'step': 135 | decay = 0.1 ** (sum(epoch >= np.array(lr_steps))) 136 | backbone_lr = args.backbone_lr * decay 137 | fc_lr = args.fc_lr * decay 138 | decay = args.weight_decay 139 | elif args.lr_type == 'cos': 140 | import math 141 | backbone_lr = 0.5 * args.backbone_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 142 | fc_lr = 0.5 * args.fc_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 143 | decay = args.weight_decay 144 | else: 145 | raise NotImplementedError 146 | 147 | optimizer.param_groups[0]['lr'] = backbone_lr # Glancer 148 | # optimizer.param_groups[1]['lr'] = backbone_lr # Focuser 149 | optimizer.param_groups[1]['lr'] = fc_lr # LSTM 150 | for param_group in optimizer.param_groups: 151 | # param_group['lr'] = lr 152 | param_group['weight_decay'] = decay 153 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/evaluate.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained: 13 | pretrained_glancer: 14 | pretrained_focuser: 15 | 16 | train_stage: 2 17 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 18 | pretrain_glancer: true 19 | arch: resnet 20 | k: 3 21 | dropout: 0.5 22 | num_classes: 200 23 | evaluate: false 24 | eval_freq: 5 25 | print_freq: 100 26 | 27 | # tsn params 28 | video_div: 1 29 | num_segments_glancer: 8 30 | num_segments_focuser: 12 31 | modality: RGB 32 | base_model: resnet50 33 | partial_bn: false 34 | pretrain: imagenet 35 | is_shift: true 36 | shift_div: 8 37 | shift_place: blockres 38 | fc_lr5: false 39 | temporal_pool: false 40 | non_local: false 41 | 42 | dense_sample: false 43 | partial_fcvid_eval: false 44 | partial_ratio: 0.2 45 | ada_reso_skip: false #TODO: 46 | reso_list: 224 47 | random_crop: false 48 | center_crop: false 49 | ada_crop_list: 50 | rescale_to: 224 51 | policy_input_offset: 0 52 | save_meta: false 53 | 54 | epochs: 50 55 | batch_size: 64 56 | backbone_lr: 0.01 57 | fc_lr: 0.005 58 | policy_lr: 0.0003 59 | lr_type: step 60 | lr_steps: 50-100 61 | momentum: 0.9 62 | weight_decay: 0.0001 63 | clip_grad: 20 64 | npb: true 65 | 66 | patch_size: 96 67 | glance_size: 224 68 | random_patch: false 69 | feature_map_channels: 1280 70 | action_dim: 25 71 | hidden_state_dim: 1024 #for policy network, focuser 72 | policy_conv: true 73 | hidden_dim: 1024 #for LSTM classifier 74 | penalty: 0.5 75 | consensus: lstm 76 | ppo_continuous: false 77 | reward: 1 78 | # random: contrast to random patching 79 | # padding: contrast to zeros padding 80 | # prev: contrast to 
previous time step 81 | dropout_lstm: false 82 | gamma: 0.7 #for ppo 83 | with_glancer: true 84 | action_std: 0.1 85 | actorcritic_with_bn: true 86 | 87 | seed: 1007 88 | gpus: 0 89 | gpu: 90 | workers: 16 91 | world_size: 1 92 | rank: 0 93 | dist_url: tcp://127.0.0.1:8822 94 | dist_backend: nccl 95 | multiprocessing_distributed: false 96 | distributed: 97 | amp: false 98 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/stage1.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained_glancer: 13 | pretrained_focuser: 14 | load_pretrained_focuser_fc: false 15 | # /home/nzl17/data/gfvideo/temporal-shift-module/pretrained/something-something-v1/TSM_something_RGB_resnet50_shift8_blockres_avg_segment8_e45_remove_module.pth 16 | 17 | train_stage: 1 18 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 19 | pretrain_glancer: true 20 | arch: resnet 21 | k: 3 22 | dropout: 0.5 23 | num_classes: 174 24 | evaluate: false 25 | eval_freq: 5 26 | start_eval: 40 27 | print_freq: 100 28 | 29 | # tsn params 30 | video_div: 1 31 | num_segments_glancer: 8 32 | num_segments_focuser: 12 33 | modality: RGB 34 | base_model: resnet50 35 | partial_bn: false 36 | pretrain: imagenet 37 | is_shift: true 38 | shift_div: 8 39 | shift_place: blockres 40 | fc_lr5: false 41 | temporal_pool: false 42 | non_local: false 43 | 44 | dense_sample: false 45 | partial_fcvid_eval: false 46 | partial_ratio: 0.2 47 | ada_reso_skip: false #TODO: 48 | reso_list: 224 49 | random_crop: false 50 | center_crop: false 51 | ada_crop_list: 52 | rescale_to: 224 53 | policy_input_offset: 0 54 | save_meta: false 55 | 56 | epochs: 50 57 | batch_size: 64 58 | backbone_lr: 0.01 # 0.001 59 | fc_lr: 0.01 60 | policy_lr: 0.0003 61 | lr_type: cos 62 | lr_steps: 20-40 63 | momentum: 0.9 64 | weight_decay: 0.0001 65 | clip_grad: 20 66 | npb: true 67 | 68 | patch_size: 144 #variable 69 | train_with_larger_patch_size: 0 70 | glance_size: 224 71 | glance_fewer_frame: false 72 | random_patch: true 73 | feature_map_channels: 1280 74 | action_dim: 25 75 | hidden_state_dim: 1024 #for policy network, focuser 76 | policy_conv: true 77 | hidden_dim: 1024 #for LSTM classifier 78 | penalty: 0.5 79 | consensus: lstm 80 | ppo_continuous: false 81 | dropout_lstm: false 82 | gamma: 0.7 #for ppo 83 | with_glancer: true 84 | action_std: 0.1 85 | actorcritic_with_bn: true 86 | 87 | seed: 1007 88 | gpus: 0 89 | gpu: 90 | workers: 8 91 | world_size: 1 92 | rank: 0 93 | dist_url: tcp://127.0.0.1:8822 94 | dist_backend: nccl 95 | multiprocessing_distributed: true 96 | distributed: 97 | amp: true 98 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/stage2.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained: 13 | pretrained_glancer: 14 | pretrained_focuser: 15 | 16 | train_stage: 2 17 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 18 | pretrain_glancer: true 19 | arch: 
resnet 20 | k: 3 21 | dropout: 0.5 22 | num_classes: 174 23 | evaluate: false 24 | eval_freq: 5 25 | print_freq: 100 26 | 27 | # tsn params 28 | video_div: 1 29 | num_segments_glancer: 8 30 | num_segments_focuser: 12 31 | modality: RGB 32 | base_model: resnet50 33 | partial_bn: false 34 | pretrain: imagenet 35 | is_shift: true 36 | shift_div: 8 37 | shift_place: blockres 38 | fc_lr5: false 39 | temporal_pool: false 40 | non_local: false 41 | 42 | dense_sample: false 43 | partial_fcvid_eval: false 44 | partial_ratio: 0.2 45 | ada_reso_skip: false #TODO: 46 | reso_list: 224 47 | random_crop: false 48 | center_crop: false 49 | ada_crop_list: 50 | rescale_to: 224 51 | policy_input_offset: 0 52 | save_meta: false 53 | 54 | epochs: 50 55 | batch_size: 64 56 | backbone_lr: 0.01 57 | fc_lr: 0.005 58 | policy_lr: 0.0003 59 | lr_type: cos 60 | lr_steps: 50-100 61 | momentum: 0.9 62 | weight_decay: 0.0001 63 | clip_grad: 20 64 | npb: true 65 | 66 | patch_size: 144 67 | glance_size: 224 68 | random_patch: false 69 | feature_map_channels: 1280 70 | action_dim: 25 71 | hidden_state_dim: 1024 #for policy network, focuser 72 | policy_conv: true 73 | hidden_dim: 1024 #for LSTM classifier 74 | penalty: 0.5 75 | consensus: lstm 76 | ppo_continuous: True 77 | dropout_lstm: false 78 | gamma: 0.7 #for ppo 79 | with_glancer: true 80 | action_std: 0.1 81 | actorcritic_with_bn: true 82 | 83 | seed: 1007 84 | gpus: 0 85 | gpu: 86 | workers: 16 87 | world_size: 1 88 | rank: 0 89 | dist_url: tcp://127.0.0.1:8822 90 | dist_backend: nccl 91 | multiprocessing_distributed: false 92 | distributed: 93 | amp: false 94 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/stage3.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained_s2: 13 | load_pretrained_s2_fc: true 14 | pretrained_glancer: 15 | pretrained_focuser: 16 | 17 | train_stage: 3 18 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 19 | pretrain_glancer: true 20 | arch: resnet 21 | k: 3 22 | dropout: 0.5 23 | num_classes: 174 24 | evaluate: false 25 | eval_freq: 5 26 | start_eval: 0 27 | print_freq: 100 28 | 29 | # tsn params 30 | video_div: 1 31 | num_segments_glancer: 8 32 | num_segments_focuser: 12 33 | modality: RGB 34 | base_model: resnet50 35 | partial_bn: false 36 | pretrain: imagenet 37 | is_shift: true 38 | shift_div: 8 39 | shift_place: blockres 40 | fc_lr5: false 41 | temporal_pool: false 42 | non_local: false 43 | 44 | dense_sample: false 45 | partial_fcvid_eval: false 46 | partial_ratio: 0.2 47 | ada_reso_skip: false #TODO: 48 | reso_list: 224 49 | random_crop: false 50 | center_crop: false 51 | ada_crop_list: 52 | rescale_to: 224 53 | policy_input_offset: 0 54 | save_meta: false 55 | 56 | epochs: 50 57 | batch_size: 64 58 | backbone_lr: 0.01 59 | fc_lr: 0.005 60 | policy_lr: 0.0003 61 | lr_type: step 62 | lr_steps: 50-100 63 | momentum: 0.9 64 | weight_decay: 0.0001 65 | clip_grad: 20 66 | npb: true 67 | 68 | patch_size: 160 69 | glance_size: 224 70 | random_patch: false 71 | feature_map_channels: 1280 72 | action_dim: 25 73 | hidden_state_dim: 1024 #for policy network, focuser 74 | policy_conv: true 75 | hidden_dim: 1024 #for LSTM classifier 76 | penalty: 0.5 77 | consensus: lstm 78 | 
ppo_continuous: true 79 | dropout_lstm: false 80 | gamma: 0.7 #for ppo 81 | with_glancer: true 82 | action_std: 0.1 83 | actorcritic_with_bn: true 84 | 85 | seed: 1007 86 | gpus: 0 87 | gpu: 88 | workers: 16 89 | world_size: 1 90 | rank: 0 91 | dist_url: tcp://127.0.0.1:8822 92 | dist_backend: nccl 93 | multiprocessing_distributed: true 94 | distributed: 95 | amp: true 96 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/evaluate.py: -------------------------------------------------------------------------------- 1 | import torch 2 | torch.multiprocessing.set_sharing_strategy('file_system') 3 | import torch.nn.parallel 4 | import torch.optim 5 | import torch.nn.functional as F 6 | 7 | from ops.dataset import TSNDataSet 8 | from ops.transforms import * 9 | from ops import dataset_config 10 | from ops.utils import AverageMeter, accuracy, ProgressMeter 11 | from models.gfv_net import GFV 12 | 13 | import os 14 | import time 15 | import hydra 16 | import basic_tools 17 | from collections import OrderedDict 18 | 19 | 20 | def parse_gpus(gpus): 21 | if type(gpus) is int: 22 | return [gpus] 23 | gpu_list = gpus.split('-') 24 | return [int(g) for g in gpu_list] 25 | 26 | 27 | @hydra.main(config_path="conf", config_name="evaluate.yaml") 28 | def main(args): 29 | config_yaml = basic_tools.start(args) 30 | with open('training.log', 'a+') as f_handler: 31 | f_handler.writelines(config_yaml) 32 | 33 | best_acc1 = 0 34 | num_class, args.train_list, args.val_list, args.root_path, prefix = \ 35 | dataset_config.return_dataset(args.dataset, modality='RGB', root_dataset=args.data_dir) 36 | args.num_classes = num_class 37 | 38 | model = GFV(args).cuda() 39 | 40 | if args.pretrained_glancer: 41 | pretrained_ckpt = torch.load(os.path.expanduser(args.pretrained_glancer), map_location='cpu') 42 | 43 | new_state_dict = OrderedDict() 44 | for k, v in pretrained_ckpt['state_dict'].items(): 45 | if k[:18] == 'module.base_model.': 46 | name = k[18:] # remove `module.` 47 | new_state_dict[name] = v 48 | elif k[:14] == 'module.new_fc.': 49 | name = 'classifier.' + k[14:] # replace `module.new_fc` with 'classifier' 50 | new_state_dict[name] = v 51 | else: 52 | new_state_dict[k] = v 53 | 54 | model.glancer.net.load_state_dict(new_state_dict, strict=True) 55 | 56 | print('Load Pretrained Glancer from {}!'.format(args.pretrained_glancer)) 57 | with open('training.log', 'a+') as f_handler: 58 | f_handler.writelines('Load Pretrained Glancer from {}!'.format(args.pretrained_glancer)) 59 | 60 | if args.pretrained_focuser: 61 | pretrained_ckpt = torch.load(os.path.expanduser(args.pretrained_focuser), map_location='cpu') 62 | 63 | new_state_dict = OrderedDict() 64 | new_fc_state_ditc = OrderedDict() 65 | for k, v in pretrained_ckpt['state_dict'].items(): 66 | print('Load ckpt param: {}'.format(k)) 67 | if k[:7] == 'module.' and 'new_fc' not in k: 68 | name = k[7:] # remove `module.` 69 | new_state_dict[name] = v 70 | elif 'module.new_fc.' 
in k: 71 | name = k[14:] # remove `module.` 72 | new_fc_state_ditc[name] = v 73 | else: 74 | new_state_dict[k] = v 75 | 76 | model.classifier.load_state_dict(new_fc_state_ditc, strict=True) 77 | model.focuser.net.load_state_dict(new_state_dict, strict=False) 78 | 79 | print('Load Pretrained Focuser from {}!'.format(args.pretrained_focuser)) 80 | with open('training.log', 'a+') as f_handler: 81 | f_handler.writelines('Load Pretrained Focuser from {}!'.format(args.pretrained_focuser)) 82 | 83 | model.focuser.net.base_model = torch.nn.Sequential(*list(model.focuser.net.base_model.children())[:-1]) 84 | print(model) 85 | print(model.focuser.policy.policy) 86 | with open('training.log', 'a+') as f_handler: 87 | f_handler.writelines('model: {}'.format(model)) 88 | f_handler.writelines('policy net: {}'.format(model.focuser.policy.policy)) 89 | 90 | scale_size = model.scale_size 91 | crop_size = model.crop_size 92 | input_mean = model.input_mean 93 | input_std = model.input_std 94 | 95 | # data loading code 96 | normalize = GroupNormalize(input_mean, input_std) 97 | 98 | val_loader = torch.utils.data.DataLoader( 99 | TSNDataSet(args.root_path, args.val_list, 100 | num_segments_glancer=args.num_segments_glancer, 101 | num_segments_focuser=args.num_segments_focuser, 102 | new_length=1, 103 | modality='RGB', 104 | image_tmpl=prefix, 105 | random_shift=False, 106 | transform=torchvision.transforms.Compose([ 107 | GroupScale(int(scale_size)), 108 | GroupCenterCrop(crop_size), 109 | Stack(roll=False), 110 | ToTorchFormatTensor(div=True), 111 | normalize, 112 | ]), dense_sample=args.dense_sample), 113 | batch_size=args.batch_size, shuffle=False, 114 | num_workers=args.workers, pin_memory=False) 115 | 116 | criterion = torch.nn.CrossEntropyLoss().cuda() 117 | 118 | if args.pretrained: 119 | pretrained_ckpt = torch.load(os.path.expanduser(args.pretrained)) 120 | 121 | start_epoch = pretrained_ckpt['epoch'] 122 | print('Load pretrained ckpt from: {}'.format(os.path.expanduser(args.pretrained))) 123 | print('Load pretrained ckpt from epoch: {}'.format(start_epoch)) 124 | 125 | model.glancer.load_state_dict(pretrained_ckpt['glancer'], strict=True) 126 | model.focuser.load_state_dict(pretrained_ckpt['focuser'], strict=True) 127 | model.classifier.load_state_dict(pretrained_ckpt['fc'], strict=True) 128 | 129 | ckpt_acc1 = pretrained_ckpt['best_acc'] 130 | print('best ckpt_acc1 for ckpt: {}'.format(ckpt_acc1)) 131 | with open('training.log', 'a+') as f_handler: 132 | f_handler.writelines('Load pretrained ckpt from: {}'.format(os.path.expanduser(args.pretrained))) 133 | f_handler.writelines('Load pretrained ckpt from epoch: {}'.format(start_epoch)) 134 | f_handler.writelines('best ckpt_acc1 for ckpt: {}'.format(ckpt_acc1)) 135 | 136 | if args.resume: 137 | resume_ckpt = torch.load(os.path.expanduser(args.resume)) 138 | 139 | start_epoch = resume_ckpt['epoch'] 140 | print('resume from epoch: {}'.format(start_epoch)) 141 | 142 | model.glancer.load_state_dict(resume_ckpt['glancer'], strict=True) 143 | model.focuser.load_state_dict(resume_ckpt['focuser'], strict=True) 144 | model.classifier.load_state_dict(resume_ckpt['fc'], strict=True) 145 | model.focuser.policy.policy.load_state_dict(resume_ckpt['policy']) 146 | model.focuser.policy.policy_old.load_state_dict(resume_ckpt['policy']) 147 | 148 | best_acc1 = resume_ckpt['best_acc'] 149 | print('best acc1 for ckpt: {}'.format(best_acc1)) 150 | with open('training.log', 'a+') as f_handler: 151 | f_handler.writelines('Resume from: 
{}'.format(os.path.expanduser(args.resume))) 152 | f_handler.writelines('Resume from epoch: {}'.format(start_epoch)) 153 | f_handler.writelines('best_acc1 for resume: {}'.format(best_acc1)) 154 | else: 155 | start_epoch = 0 156 | 157 | if args.evaluate: 158 | acc1, val_logs = validate(val_loader, model, criterion, args) 159 | with open('training.log', 'a+') as f_handler: 160 | f_handler.writelines(val_logs) 161 | print('Best Acc@1 = {}'.format(acc1)) 162 | return 163 | 164 | 165 | def validate(val_loader, model, criterion, args): 166 | batch_time = AverageMeter('Time', ':6.3f') 167 | losses = AverageMeter('Loss', ':.4e') 168 | top1 = AverageMeter('Acc@1', ':6.2f') 169 | top5 = AverageMeter('Acc@5', ':6.2f') 170 | reward_list = [AverageMeter('Rew', ':6.5f') for _ in range(args.video_div)] 171 | progress = ProgressMeter(len(val_loader), batch_time, losses, top1, top5, prefix='Test: ') 172 | 173 | logs = [] 174 | # switch to evaluate mode 175 | model.eval() 176 | model.focuser.policy.policy.eval() 177 | model.focuser.policy.policy_old.eval() 178 | 179 | all_targets = [] 180 | with torch.no_grad(): 181 | end = time.time() 182 | for i, (glancer_images, focuser_images, target) in enumerate(val_loader): 183 | _b = target.shape[0] 184 | all_targets.append(target) 185 | glancer_images = glancer_images.cuda() 186 | focuser_images = focuser_images.cuda() 187 | target = target.cuda() 188 | glancer_images = torch.nn.functional.interpolate(glancer_images, (args.glance_size, args.glance_size)) 189 | glancer_images = glancer_images.cuda() 190 | 191 | # compute output 192 | focuser_images = focuser_images.view(_b, args.num_segments_focuser, 3, model.input_size, model.input_size) 193 | # MDP Focusing 194 | with torch.no_grad(): 195 | global_feat_map, global_feat_logit = model.glance( 196 | glancer_images) # feat_map (B, T, C, H, W) feat_vec (B, T, _) 197 | 198 | for focus_time_step in range(args.video_div): 199 | pred, baseline_logit, local_patch = model.action_stage2( 200 | focuser_images, global_feat_map, global_feat_logit, focus_time_step, args, 201 | prev_local_patch=None if focus_time_step == 0 else local_patch, training=False) 202 | 203 | loss = criterion(pred, target) 204 | confidence = torch.gather(F.softmax(pred.detach(), 1), dim=1, index=target.view(-1, 1)).view(1, -1) 205 | 206 | bsl_confidence = torch.gather(F.softmax(baseline_logit.detach(), 1), dim=1, 207 | index=target.view(-1, 1)).view(1, -1) 208 | reward = confidence - bsl_confidence 209 | 210 | reward_list[focus_time_step].update(reward.data.mean().item(), glancer_images.size(0)) 211 | 212 | # Update evaluation metrics 213 | acc1, acc5 = accuracy(pred, target, topk=(1, 5)) 214 | losses.update(loss.item(), glancer_images.size(0)) 215 | top1.update(acc1[0], glancer_images.size(0)) 216 | top5.update(acc5[0], glancer_images.size(0)) 217 | 218 | batch_time.update(time.time() - end) 219 | end = time.time() 220 | _reward = [reward.avg for reward in reward_list] 221 | print('reward of each step: {}'.format(_reward)) 222 | 223 | logs.append(progress.print(i)) 224 | logs.append(' '.join(map(str, _reward)) + '\n') 225 | 226 | return top1.avg, logs 227 | 228 | 229 | if __name__ == "__main__": 230 | main() 231 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/evaluate.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python evaluate.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=2 \ 5 | 
batch_size=32 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | video_div=1 \ 9 | workers=4 \ 10 | policy_lr=0.0003 \ 11 | epochs=50 \ 12 | eval_freq=1 \ 13 | random_patch=False \ 14 | glance_size=224 \ 15 | patch_size=144 \ 16 | gamma=0.7 \ 17 | with_glancer=True \ 18 | reward=2 \ 19 | ppo_continuous=True \ 20 | action_std=0.25 \ 21 | actorcritic_with_bn=True \ 22 | evaluate=True \ 23 | resume=PATH_TO_PRETRAINED_MODEL # load the pretrained model -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/ppo.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/ppo.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/ppo_continuous.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/ppo_continuous.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/resnet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/resnet.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/tsn.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/tsn.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/constant.py: -------------------------------------------------------------------------------- 1 | # Some definition of constants 2 | 3 | patch_sizes = [64, 96, 128, 160, 192, 224, 0] 4 | # patch_sizes = [96] 5 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/mobilenetv2.py: -------------------------------------------------------------------------------- 1 | # Code adapted from https://github.com/tonylins/pytorch-mobilenet-v2 2 | 3 | import torch.nn as nn 4 | import math 5 | 6 | 7 | def conv_bn(inp, oup, stride): 8 | return nn.Sequential( 9 | nn.Conv2d(inp, oup, 3, stride, 1, bias=False), 10 | nn.BatchNorm2d(oup), 11 | nn.ReLU6(inplace=True) 12 | ) 13 | 14 | 15 | def conv_1x1_bn(inp, oup): 16 | return 
nn.Sequential( 17 | nn.Conv2d(inp, oup, 1, 1, 0, bias=False), 18 | nn.BatchNorm2d(oup), 19 | nn.ReLU6(inplace=True) 20 | ) 21 | 22 | 23 | def make_divisible(x, divisible_by=8): 24 | import numpy as np 25 | return int(np.ceil(x * 1. / divisible_by) * divisible_by) 26 | 27 | 28 | class InvertedResidual(nn.Module): 29 | def __init__(self, inp, oup, stride, expand_ratio): 30 | super(InvertedResidual, self).__init__() 31 | self.stride = stride 32 | assert stride in [1, 2] 33 | 34 | hidden_dim = int(inp * expand_ratio) 35 | self.use_res_connect = self.stride == 1 and inp == oup 36 | 37 | if expand_ratio == 1: 38 | self.conv = nn.Sequential( 39 | # dw 40 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), 41 | nn.BatchNorm2d(hidden_dim), 42 | nn.ReLU6(inplace=True), 43 | # pw-linear 44 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), 45 | nn.BatchNorm2d(oup), 46 | ) 47 | else: 48 | self.conv = nn.Sequential( 49 | # pw 50 | nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False), 51 | nn.BatchNorm2d(hidden_dim), 52 | nn.ReLU6(inplace=True), 53 | # dw 54 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), 55 | nn.BatchNorm2d(hidden_dim), 56 | nn.ReLU6(inplace=True), 57 | # pw-linear 58 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), 59 | nn.BatchNorm2d(oup), 60 | ) 61 | 62 | def forward(self, x): 63 | if self.use_res_connect: 64 | return x + self.conv(x) 65 | else: 66 | return self.conv(x) 67 | 68 | 69 | class MobileNetV2(nn.Module): 70 | def __init__(self, n_class=1000, input_size=224, width_mult=1.): 71 | super(MobileNetV2, self).__init__() 72 | block = InvertedResidual 73 | input_channel = 32 74 | last_channel = 1280 75 | interverted_residual_setting = [ 76 | # t, c, n, s 77 | [1, 16, 1, 1], 78 | [6, 24, 2, 2], 79 | [6, 32, 3, 2], 80 | [6, 64, 4, 2], 81 | [6, 96, 3, 1], 82 | [6, 160, 3, 2], 83 | [6, 320, 1, 1], 84 | ] 85 | 86 | # building first layer 87 | assert input_size % 32 == 0 88 | # input_channel = make_divisible(input_channel * width_mult) # first channel is always 32! 89 | self.last_channel = make_divisible(last_channel * width_mult) if width_mult > 1.0 else last_channel 90 | self.features = [conv_bn(3, input_channel, 2)] 91 | # building inverted residual blocks 92 | for t, c, n, s in interverted_residual_setting: 93 | output_channel = make_divisible(c * width_mult) if t > 1 else c 94 | for i in range(n): 95 | if i == 0: 96 | self.features.append(block(input_channel, output_channel, s, expand_ratio=t)) 97 | else: 98 | self.features.append(block(input_channel, output_channel, 1, expand_ratio=t)) 99 | input_channel = output_channel 100 | # building last several layers 101 | self.features.append(conv_1x1_bn(input_channel, self.last_channel)) 102 | # make it nn.Sequential 103 | self.features = nn.Sequential(*self.features) 104 | 105 | # building classifier 106 | self.classifier = nn.Linear(self.last_channel, n_class) 107 | 108 | self._initialize_weights() 109 | 110 | def forward(self, x): 111 | x = self.features(x) 112 | x = x.mean(3).mean(2) 113 | x = self.classifier(x) 114 | return x 115 | 116 | def get_featmap(self, x): 117 | # x = self.features(x) 118 | # return x, x.mean([2, 3]) 119 | x = self.features(x) 120 | logit = self.classifier(x.mean(3).mean(2)) 121 | return x, logit 122 | 123 | def _initialize_weights(self): 124 | for m in self.modules(): 125 | if isinstance(m, nn.Conv2d): 126 | n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels 127 | m.weight.data.normal_(0, math.sqrt(2. 
/ n)) 128 | if m.bias is not None: 129 | m.bias.data.zero_() 130 | elif isinstance(m, nn.BatchNorm2d): 131 | m.weight.data.fill_(1) 132 | m.bias.data.zero_() 133 | elif isinstance(m, nn.Linear): 134 | n = m.weight.size(1) 135 | m.weight.data.normal_(0, 0.01) 136 | m.bias.data.zero_() 137 | 138 | @property 139 | def feature_dim(self): 140 | return self.last_channel 141 | 142 | 143 | def mobilenet_v2(n_class, pretrained=True): 144 | model = MobileNetV2(n_class=n_class, width_mult=1) 145 | 146 | if pretrained: 147 | try: 148 | from torch.hub import load_state_dict_from_url 149 | except ImportError: 150 | from torch.utils.model_zoo import load_url as load_state_dict_from_url 151 | state_dict = load_state_dict_from_url( 152 | 'https://www.dropbox.com/s/47tyzpofuuyyv1b/mobilenetv2_1.0-f2a8633.pth.tar?dl=1', progress=True) 153 | model.load_state_dict(state_dict) 154 | print('Loaded pretrained weight!') 155 | return model 156 | 157 | 158 | if __name__ == '__main__': 159 | net = mobilenet_v2(True) 160 | 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/ppo.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import math 6 | from torch.distributions import Categorical 7 | 8 | 9 | class Memory: 10 | def __init__(self): 11 | self.actions = [] 12 | self.states = [] 13 | self.logprobs = [] 14 | self.rewards = [] 15 | self.is_terminals = [] 16 | self.hidden = [] 17 | 18 | def clear_memory(self): 19 | del self.actions[:] 20 | del self.states[:] 21 | del self.logprobs[:] 22 | del self.rewards[:] 23 | del self.is_terminals[:] 24 | del self.hidden[:] 25 | 26 | 27 | class ActorCritic(nn.Module): 28 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim=1024, policy_conv=True): 29 | super(ActorCritic, self).__init__() 30 | 31 | # encoder with convolution layer for MobileNetV3, EfficientNet and RegNet 32 | if policy_conv: 33 | self.state_encoder = nn.Sequential( 34 | # nn.Conv2d(feature_dim, 32, kernel_size=1, stride=1, padding=0, bias=False), 35 | nn.Conv2d(feature_dim, 64, kernel_size=1, stride=1, padding=0, bias=False), 36 | nn.BatchNorm2d(64), 37 | nn.ReLU(), 38 | nn.Flatten(), 39 | # nn.Linear(int(state_dim * 32 / feature_dim), hidden_state_dim), 40 | nn.Linear(int(state_dim * 64 / feature_dim), hidden_state_dim), 41 | nn.BatchNorm1d(hidden_state_dim), 42 | nn.ReLU() 43 | ) 44 | 45 | # encoder with linear layer for ResNet and DenseNet 46 | else: 47 | self.state_encoder = nn.Sequential( 48 | nn.Linear(state_dim, 2048), 49 | nn.ReLU(), 50 | nn.Linear(2048, hidden_state_dim), 51 | nn.ReLU() 52 | ) 53 | 54 | self.gru = nn.GRU(hidden_state_dim, hidden_state_dim, batch_first=False) 55 | 56 | self.actor = nn.Sequential( 57 | nn.Linear(hidden_state_dim, action_dim), 58 | nn.Softmax(dim=-1)) 59 | 60 | self.critic = nn.Sequential( 61 | nn.Linear(hidden_state_dim, 1)) 62 | 63 | self.hidden_state_dim = hidden_state_dim 64 | self.action_dim = action_dim 65 | self.policy_conv = policy_conv 66 | self.feature_dim = feature_dim 67 | self.feature_ratio = int(math.sqrt(state_dim/feature_dim)) 68 | 69 | def forward(self): 70 | raise NotImplementedError 71 | 72 | def act(self, state_ini, memory, restart_batch=False, training=True): 73 | if restart_batch: 74 | del memory.hidden[:] 75 | memory.hidden.append(torch.zeros(1, state_ini.size(0), 
self.hidden_state_dim).cuda()) 76 | 77 | if not self.policy_conv: 78 | state = state_ini.flatten(1) 79 | else: 80 | state = state_ini 81 | 82 | state = self.state_encoder(state) 83 | 84 | state, hidden_output = self.gru(state.view(1, state.size(0), state.size(1)), memory.hidden[-1]) 85 | memory.hidden.append(hidden_output) 86 | 87 | state = state[0] 88 | action_probs = self.actor(state) 89 | dist = Categorical(action_probs) 90 | 91 | if training: 92 | action = dist.sample() 93 | action_logprob = dist.log_prob(action) 94 | memory.states.append(state_ini) 95 | memory.actions.append(action) 96 | memory.logprobs.append(action_logprob) 97 | else: 98 | action = action_probs.max(1)[1] 99 | 100 | return action 101 | 102 | def evaluate(self, state, action): 103 | seq_l = state.size(0) 104 | batch_size = state.size(1) 105 | 106 | if not self.policy_conv: 107 | state = state.flatten(2) 108 | state = state.view(seq_l * batch_size, state.size(2)) 109 | else: 110 | state = state.view(seq_l * batch_size, state.size(2), state.size(3), state.size(4)) 111 | 112 | state = self.state_encoder(state) 113 | state = state.view(seq_l, batch_size, -1) 114 | 115 | state, hidden = self.gru(state, torch.zeros(1, batch_size, state.size(2)).cuda()) 116 | state = state.view(seq_l * batch_size, -1) 117 | 118 | action_probs = self.actor(state) 119 | dist = Categorical(action_probs) 120 | action_logprobs = dist.log_prob(torch.squeeze(action.view(seq_l * batch_size, -1))).cuda() 121 | dist_entropy = dist.entropy().cuda() 122 | state_value = self.critic(state) 123 | 124 | return action_logprobs.view(seq_l, batch_size), \ 125 | state_value.view(seq_l, batch_size), \ 126 | dist_entropy.view(seq_l, batch_size) 127 | 128 | 129 | class PPO: 130 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv, gpu=0, 131 | lr=0.0003, betas=(0.9, 0.999), gamma=0.7, K_epochs=1, eps_clip=0.2): 132 | self.lr = lr 133 | self.betas = betas 134 | self.gamma = gamma 135 | self.eps_clip = eps_clip 136 | self.K_epochs = K_epochs 137 | 138 | self.policy = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 139 | 140 | self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas) 141 | 142 | self.policy_old = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 143 | self.policy_old.load_state_dict(self.policy.state_dict()) 144 | 145 | self.MseLoss = nn.MSELoss() 146 | 147 | def select_action(self, state, memory, restart_batch=False, training=True): 148 | return self.policy_old.act(state, memory, restart_batch, training) 149 | 150 | def update(self, memory): 151 | rewards = [] 152 | discounted_reward = 0 153 | 154 | for reward in reversed(memory.rewards): 155 | discounted_reward = reward + (self.gamma * discounted_reward) 156 | rewards.insert(0, discounted_reward) 157 | 158 | rewards = torch.cat(rewards, 0).cuda() 159 | 160 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5) 161 | 162 | old_states = torch.stack(memory.states, 0).cuda().detach() 163 | old_actions = torch.stack(memory.actions, 0).cuda().detach() 164 | old_logprobs = torch.stack(memory.logprobs, 0).cuda().detach() 165 | 166 | for _ in range(self.K_epochs): 167 | logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions) 168 | 169 | ratios = torch.exp(logprobs - old_logprobs.detach()) 170 | 171 | advantages = rewards - state_values.detach() 172 | surr1 = ratios * advantages 173 | surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + 
self.eps_clip) * advantages 174 | 175 | loss = -torch.min(surr1, surr2) + 0.5 * self.MseLoss(state_values, rewards) - 0.01 * dist_entropy 176 | 177 | self.optimizer.zero_grad() 178 | loss.mean().backward() 179 | self.optimizer.step() 180 | 181 | self.policy_old.load_state_dict(self.policy.state_dict()) -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/ppo_continuous.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import math 6 | 7 | 8 | class Memory: 9 | def __init__(self): 10 | self.actions = [] 11 | self.states = [] 12 | self.logprobs = [] 13 | self.rewards = [] 14 | self.is_terminals = [] 15 | self.hidden = [] 16 | 17 | def clear_memory(self): 18 | del self.actions[:] 19 | del self.states[:] 20 | del self.logprobs[:] 21 | del self.rewards[:] 22 | del self.is_terminals[:] 23 | del self.hidden[:] 24 | 25 | 26 | class ActorCritic(nn.Module): 27 | def __init__(self, feature_dim, state_dim, hidden_state_dim=1024, policy_conv=True, action_std=0.1, with_bn=False): 28 | super(ActorCritic, self).__init__() 29 | 30 | # encoder with convolution layer for MobileNetV3, EfficientNet and RegNet 31 | if policy_conv: 32 | if with_bn: 33 | self.state_encoder = nn.Sequential( 34 | nn.Conv2d(feature_dim, 64, kernel_size=1, stride=1, padding=0, bias=False), 35 | nn.BatchNorm2d(64), 36 | nn.ReLU(), 37 | nn.Flatten(), 38 | nn.Linear(int(state_dim * 64 / feature_dim), hidden_state_dim), 39 | nn.BatchNorm1d(hidden_state_dim), 40 | nn.ReLU() 41 | ) 42 | else: 43 | self.state_encoder = nn.Sequential( 44 | nn.Conv2d(feature_dim, 64, kernel_size=1, stride=1, padding=0, bias=False), 45 | nn.ReLU(), 46 | nn.Flatten(), 47 | nn.Linear(int(state_dim * 64 / feature_dim), hidden_state_dim), 48 | nn.ReLU() 49 | ) 50 | # encoder with linear layer for ResNet and DenseNet 51 | else: 52 | self.state_encoder = nn.Sequential( 53 | nn.Linear(state_dim, 2048), 54 | nn.ReLU(), 55 | nn.Linear(2048, hidden_state_dim), 56 | nn.ReLU() 57 | ) 58 | 59 | self.gru = nn.GRU(hidden_state_dim, hidden_state_dim, batch_first=False) 60 | 61 | self.actor = nn.Sequential( 62 | nn.Linear(hidden_state_dim, 2), 63 | nn.Sigmoid()) 64 | 65 | self.critic = nn.Sequential( 66 | nn.Linear(hidden_state_dim, 1)) 67 | 68 | self.action_var = torch.full((2,), action_std).cuda() 69 | 70 | self.hidden_state_dim = hidden_state_dim 71 | self.policy_conv = policy_conv 72 | self.feature_dim = feature_dim 73 | self.feature_ratio = int(math.sqrt(state_dim/feature_dim)) 74 | 75 | def forward(self): 76 | raise NotImplementedError 77 | 78 | def act(self, state_ini, memory, restart_batch=False, training=False): 79 | if restart_batch: 80 | del memory.hidden[:] 81 | memory.hidden.append(torch.zeros(1, state_ini.size(0), self.hidden_state_dim).cuda()) 82 | 83 | if not self.policy_conv: 84 | state = state_ini.flatten(1) 85 | else: 86 | state = state_ini 87 | 88 | state = self.state_encoder(state) 89 | 90 | state, hidden_output = self.gru(state.view(1, state.size(0), state.size(1)), memory.hidden[-1]) 91 | memory.hidden.append(hidden_output) 92 | 93 | state = state[0] 94 | action_mean = self.actor(state) 95 | 96 | cov_mat = torch.diag(self.action_var).cuda() 97 | dist = torch.distributions.multivariate_normal.MultivariateNormal(action_mean, scale_tril=cov_mat) 98 | action = dist.sample().cuda() 99 | if training: 100 | action = F.relu(action) 101 | action = 1 
- F.relu(1 - action) 102 | action_logprob = dist.log_prob(action).cuda() 103 | memory.states.append(state_ini) 104 | memory.actions.append(action) 105 | memory.logprobs.append(action_logprob) 106 | else: 107 | action = action_mean 108 | 109 | return action.detach() 110 | 111 | def evaluate(self, state, action): 112 | seq_l = state.size(0) 113 | batch_size = state.size(1) 114 | 115 | if not self.policy_conv: 116 | state = state.flatten(2) 117 | state = state.view(seq_l * batch_size, state.size(2)) 118 | else: 119 | state = state.view(seq_l * batch_size, state.size(2), state.size(3), state.size(4)) 120 | 121 | state = self.state_encoder(state) 122 | state = state.view(seq_l, batch_size, -1) 123 | 124 | state, hidden = self.gru(state, torch.zeros(1, batch_size, state.size(2)).cuda()) 125 | state = state.view(seq_l * batch_size, -1) 126 | 127 | action_mean = self.actor(state) 128 | 129 | cov_mat = torch.diag(self.action_var).cuda() 130 | 131 | dist = torch.distributions.multivariate_normal.MultivariateNormal(action_mean, scale_tril=cov_mat) 132 | 133 | action_logprobs = dist.log_prob(torch.squeeze(action.view(seq_l * batch_size, -1))).cuda() 134 | dist_entropy = dist.entropy().cuda() 135 | state_value = self.critic(state) 136 | 137 | return action_logprobs.view(seq_l, batch_size), \ 138 | state_value.view(seq_l, batch_size), \ 139 | dist_entropy.view(seq_l, batch_size) 140 | 141 | 142 | class PPO_Continuous: 143 | def __init__(self, feature_dim, state_dim, hidden_state_dim, policy_conv, gpu=0, action_std=0.1, 144 | lr=0.0003, betas=(0.9, 0.999), gamma=0.7, K_epochs=1, eps_clip=0.2, with_bn=False): 145 | self.lr = lr 146 | self.betas = betas 147 | self.gamma = gamma 148 | self.eps_clip = eps_clip 149 | self.K_epochs = K_epochs 150 | 151 | self.policy = ActorCritic(feature_dim, state_dim, hidden_state_dim, policy_conv, action_std, 152 | with_bn=with_bn).cuda(gpu) 153 | 154 | self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas) 155 | 156 | self.policy_old = ActorCritic(feature_dim, state_dim, hidden_state_dim, policy_conv, action_std, 157 | with_bn=with_bn).cuda(gpu) 158 | self.policy_old.load_state_dict(self.policy.state_dict()) 159 | 160 | self.MseLoss = nn.MSELoss() 161 | 162 | def select_action(self, state, memory, restart_batch=False, training=True): 163 | return self.policy_old.act(state, memory, restart_batch, training) 164 | 165 | def update(self, memory): 166 | rewards = [] 167 | discounted_reward = 0 168 | 169 | for reward in reversed(memory.rewards): 170 | discounted_reward = reward + (self.gamma * discounted_reward) 171 | rewards.insert(0, discounted_reward) 172 | 173 | rewards = torch.cat(rewards, 0).cuda() 174 | 175 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5) 176 | 177 | old_states = torch.stack(memory.states, 0).cuda().detach() 178 | old_actions = torch.stack(memory.actions, 0).cuda().detach() 179 | old_logprobs = torch.stack(memory.logprobs, 0).cuda().detach() 180 | 181 | for _ in range(self.K_epochs): 182 | logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions) 183 | 184 | ratios = torch.exp(logprobs - old_logprobs.detach()) 185 | 186 | advantages = rewards - state_values.detach() 187 | surr1 = ratios * advantages 188 | surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages 189 | 190 | loss = -torch.min(surr1, surr2) + 0.5 * self.MseLoss(state_values, rewards) - 0.01 * dist_entropy 191 | 192 | self.optimizer.zero_grad() 193 | loss.mean().backward() 194 | self.optimizer.step() 195 
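        # After the K_epochs optimization passes, the updated weights are copied into the
        # frozen behaviour policy (policy_old) below, so that the next rollout is sampled
        # with the latest parameters and its stored log-probs are computed from them.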
| 196 | self.policy_old.load_state_dict(self.policy.state_dict()) -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/resnet.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from .utils import load_state_dict_from_url 4 | 5 | __all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101', 6 | 'resnet152', 'resnext50_32x4d', 'resnext101_32x8d', 7 | 'wide_resnet50_2', 'wide_resnet101_2'] 8 | 9 | 10 | model_urls = { 11 | 'resnet18': 'https://download.pytorch.org/models/resnet18-5c106cde.pth', 12 | 'resnet34': 'https://download.pytorch.org/models/resnet34-333f7ec4.pth', 13 | 'resnet50': 'https://download.pytorch.org/models/resnet50-19c8e357.pth', 14 | 'resnet101': 'https://download.pytorch.org/models/resnet101-5d3b4d8f.pth', 15 | 'resnet152': 'https://download.pytorch.org/models/resnet152-b121ed2d.pth', 16 | 'resnext50_32x4d': 'https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth', 17 | 'resnext101_32x8d': 'https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth', 18 | 'wide_resnet50_2': 'https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth', 19 | 'wide_resnet101_2': 'https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth', 20 | } 21 | 22 | 23 | def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1): 24 | """3x3 convolution with padding""" 25 | return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, 26 | padding=dilation, groups=groups, bias=False, dilation=dilation) 27 | 28 | 29 | def conv1x1(in_planes, out_planes, stride=1): 30 | """1x1 convolution""" 31 | return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False) 32 | 33 | 34 | class BasicBlock(nn.Module): 35 | expansion = 1 36 | 37 | def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1, 38 | base_width=64, dilation=1, norm_layer=None): 39 | super(BasicBlock, self).__init__() 40 | if norm_layer is None: 41 | norm_layer = nn.BatchNorm2d 42 | if groups != 1 or base_width != 64: 43 | raise ValueError('BasicBlock only supports groups=1 and base_width=64') 44 | if dilation > 1: 45 | raise NotImplementedError("Dilation > 1 not supported in BasicBlock") 46 | # Both self.conv1 and self.downsample layers downsample the input when stride != 1 47 | self.conv1 = conv3x3(inplanes, planes, stride) 48 | self.bn1 = norm_layer(planes) 49 | self.relu = nn.ReLU(inplace=True) 50 | self.conv2 = conv3x3(planes, planes) 51 | self.bn2 = norm_layer(planes) 52 | self.downsample = downsample 53 | self.stride = stride 54 | 55 | def forward(self, x): 56 | identity = x 57 | 58 | out = self.conv1(x) 59 | out = self.bn1(out) 60 | out = self.relu(out) 61 | 62 | out = self.conv2(out) 63 | out = self.bn2(out) 64 | 65 | if self.downsample is not None: 66 | identity = self.downsample(x) 67 | 68 | out += identity 69 | out = self.relu(out) 70 | 71 | return out 72 | 73 | 74 | class Bottleneck(nn.Module): 75 | expansion = 4 76 | 77 | def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1, 78 | base_width=64, dilation=1, norm_layer=None): 79 | super(Bottleneck, self).__init__() 80 | if norm_layer is None: 81 | norm_layer = nn.BatchNorm2d 82 | width = int(planes * (base_width / 64.)) * groups 83 | # Both self.conv2 and self.downsample layers downsample the input when stride != 1 84 | self.conv1 = conv1x1(inplanes, width) 85 | self.bn1 = norm_layer(width) 86 | self.conv2 = conv3x3(width, 
width, stride, groups, dilation) 87 | self.bn2 = norm_layer(width) 88 | self.conv3 = conv1x1(width, planes * self.expansion) 89 | self.bn3 = norm_layer(planes * self.expansion) 90 | self.relu = nn.ReLU(inplace=True) 91 | self.downsample = downsample 92 | self.stride = stride 93 | 94 | def forward(self, x): 95 | identity = x 96 | 97 | out = self.conv1(x) 98 | out = self.bn1(out) 99 | out = self.relu(out) 100 | 101 | out = self.conv2(out) 102 | out = self.bn2(out) 103 | out = self.relu(out) 104 | 105 | out = self.conv3(out) 106 | out = self.bn3(out) 107 | 108 | if self.downsample is not None: 109 | identity = self.downsample(x) 110 | 111 | out += identity 112 | out = self.relu(out) 113 | 114 | return out 115 | 116 | 117 | class ResNet(nn.Module): 118 | 119 | def __init__(self, block, layers, num_classes=1000, zero_init_residual=False, 120 | groups=1, width_per_group=64, replace_stride_with_dilation=None, 121 | norm_layer=None): 122 | super(ResNet, self).__init__() 123 | if norm_layer is None: 124 | norm_layer = nn.BatchNorm2d 125 | self._norm_layer = norm_layer 126 | 127 | self.inplanes = 64 128 | self.dilation = 1 129 | if replace_stride_with_dilation is None: 130 | # each element in the tuple indicates if we should replace 131 | # the 2x2 stride with a dilated convolution instead 132 | replace_stride_with_dilation = [False, False, False] 133 | if len(replace_stride_with_dilation) != 3: 134 | raise ValueError("replace_stride_with_dilation should be None " 135 | "or a 3-element tuple, got {}".format(replace_stride_with_dilation)) 136 | self.groups = groups 137 | self.base_width = width_per_group 138 | self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, 139 | bias=False) 140 | self.bn1 = norm_layer(self.inplanes) 141 | self.relu = nn.ReLU(inplace=True) 142 | self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) 143 | self.layer1 = self._make_layer(block, 64, layers[0]) 144 | self.layer2 = self._make_layer(block, 128, layers[1], stride=2, 145 | dilate=replace_stride_with_dilation[0]) 146 | self.layer3 = self._make_layer(block, 256, layers[2], stride=2, 147 | dilate=replace_stride_with_dilation[1]) 148 | self.layer4 = self._make_layer(block, 512, layers[3], stride=2, 149 | dilate=replace_stride_with_dilation[2]) 150 | self.avgpool = nn.AdaptiveAvgPool2d((1, 1)) 151 | self.fc = nn.Linear(512 * block.expansion, num_classes) 152 | 153 | for m in self.modules(): 154 | if isinstance(m, nn.Conv2d): 155 | nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') 156 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 157 | nn.init.constant_(m.weight, 1) 158 | nn.init.constant_(m.bias, 0) 159 | 160 | # Zero-initialize the last BN in each residual branch, 161 | # so that the residual branch starts with zeros, and each residual block behaves like an identity. 
162 | # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677 163 | if zero_init_residual: 164 | for m in self.modules(): 165 | if isinstance(m, Bottleneck): 166 | nn.init.constant_(m.bn3.weight, 0) 167 | elif isinstance(m, BasicBlock): 168 | nn.init.constant_(m.bn2.weight, 0) 169 | 170 | def _make_layer(self, block, planes, blocks, stride=1, dilate=False): 171 | norm_layer = self._norm_layer 172 | downsample = None 173 | previous_dilation = self.dilation 174 | if dilate: 175 | self.dilation *= stride 176 | stride = 1 177 | if stride != 1 or self.inplanes != planes * block.expansion: 178 | downsample = nn.Sequential( 179 | conv1x1(self.inplanes, planes * block.expansion, stride), 180 | norm_layer(planes * block.expansion), 181 | ) 182 | 183 | layers = [] 184 | layers.append(block(self.inplanes, planes, stride, downsample, self.groups, 185 | self.base_width, previous_dilation, norm_layer)) 186 | self.inplanes = planes * block.expansion 187 | for _ in range(1, blocks): 188 | layers.append(block(self.inplanes, planes, groups=self.groups, 189 | base_width=self.base_width, dilation=self.dilation, 190 | norm_layer=norm_layer)) 191 | 192 | return nn.Sequential(*layers) 193 | 194 | def forward(self, x): 195 | x = self.conv1(x) 196 | x = self.bn1(x) 197 | x = self.relu(x) 198 | x = self.maxpool(x) 199 | 200 | x = self.layer1(x) 201 | x = self.layer2(x) 202 | x = self.layer3(x) 203 | x = self.layer4(x) 204 | 205 | x = self.avgpool(x) 206 | x = torch.flatten(x, 1) 207 | x = self.fc(x) 208 | 209 | return x 210 | 211 | def get_featmap(self, x, pooled=True): 212 | x = self.conv1(x) 213 | x = self.bn1(x) 214 | x = self.relu(x) 215 | x = self.maxpool(x) 216 | 217 | x = self.layer1(x) 218 | x = self.layer2(x) 219 | x = self.layer3(x) 220 | x = self.layer4(x) 221 | 222 | if pooled: 223 | return self.avgpool(x) 224 | else: 225 | return x 226 | 227 | def get_featvec(self, x): 228 | x = self.conv1(x) 229 | x = self.bn1(x) 230 | x = self.relu(x) 231 | x = self.maxpool(x) 232 | 233 | x = self.layer1(x) 234 | x = self.layer2(x) 235 | x = self.layer3(x) 236 | x = self.layer4(x) 237 | 238 | x = self.avgpool(x) 239 | featvec = torch.flatten(x, 1) 240 | return featvec 241 | 242 | @property 243 | def feature_dim(self): 244 | return self.fc.weight.shape[-1] 245 | 246 | 247 | def _resnet(arch, block, layers, pretrained, progress, **kwargs): 248 | model = ResNet(block, layers, **kwargs) 249 | if pretrained: 250 | state_dict = load_state_dict_from_url(model_urls[arch], 251 | progress=progress) 252 | model.load_state_dict(state_dict) 253 | return model 254 | 255 | 256 | def resnet18(pretrained=False, progress=True, **kwargs): 257 | r"""ResNet-18 model from 258 | `"Deep Residual Learning for Image Recognition" `_ 259 | 260 | Args: 261 | pretrained (bool): If True, returns a model pre-trained on ImageNet 262 | progress (bool): If True, displays a progress bar of the download to stderr 263 | """ 264 | return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress, 265 | **kwargs) 266 | 267 | 268 | def resnet34(pretrained=False, progress=True, **kwargs): 269 | r"""ResNet-34 model from 270 | `"Deep Residual Learning for Image Recognition" `_ 271 | 272 | Args: 273 | pretrained (bool): If True, returns a model pre-trained on ImageNet 274 | progress (bool): If True, displays a progress bar of the download to stderr 275 | """ 276 | return _resnet('resnet34', BasicBlock, [3, 4, 6, 3], pretrained, progress, 277 | **kwargs) 278 | 279 | 280 | def resnet50(pretrained=False, progress=True, 
**kwargs): 281 | r"""ResNet-50 model from 282 | `"Deep Residual Learning for Image Recognition" `_ 283 | 284 | Args: 285 | pretrained (bool): If True, returns a model pre-trained on ImageNet 286 | progress (bool): If True, displays a progress bar of the download to stderr 287 | """ 288 | return _resnet('resnet50', Bottleneck, [3, 4, 6, 3], pretrained, progress, 289 | **kwargs) 290 | 291 | 292 | def resnet101(pretrained=False, progress=True, **kwargs): 293 | r"""ResNet-101 model from 294 | `"Deep Residual Learning for Image Recognition" `_ 295 | 296 | Args: 297 | pretrained (bool): If True, returns a model pre-trained on ImageNet 298 | progress (bool): If True, displays a progress bar of the download to stderr 299 | """ 300 | return _resnet('resnet101', Bottleneck, [3, 4, 23, 3], pretrained, progress, 301 | **kwargs) 302 | 303 | 304 | def resnet152(pretrained=False, progress=True, **kwargs): 305 | r"""ResNet-152 model from 306 | `"Deep Residual Learning for Image Recognition" `_ 307 | 308 | Args: 309 | pretrained (bool): If True, returns a model pre-trained on ImageNet 310 | progress (bool): If True, displays a progress bar of the download to stderr 311 | """ 312 | return _resnet('resnet152', Bottleneck, [3, 8, 36, 3], pretrained, progress, 313 | **kwargs) 314 | 315 | 316 | def resnext50_32x4d(pretrained=False, progress=True, **kwargs): 317 | r"""ResNeXt-50 32x4d model from 318 | `"Aggregated Residual Transformation for Deep Neural Networks" `_ 319 | 320 | Args: 321 | pretrained (bool): If True, returns a model pre-trained on ImageNet 322 | progress (bool): If True, displays a progress bar of the download to stderr 323 | """ 324 | kwargs['groups'] = 32 325 | kwargs['width_per_group'] = 4 326 | return _resnet('resnext50_32x4d', Bottleneck, [3, 4, 6, 3], 327 | pretrained, progress, **kwargs) 328 | 329 | 330 | def resnext101_32x8d(pretrained=False, progress=True, **kwargs): 331 | r"""ResNeXt-101 32x8d model from 332 | `"Aggregated Residual Transformation for Deep Neural Networks" `_ 333 | 334 | Args: 335 | pretrained (bool): If True, returns a model pre-trained on ImageNet 336 | progress (bool): If True, displays a progress bar of the download to stderr 337 | """ 338 | kwargs['groups'] = 32 339 | kwargs['width_per_group'] = 8 340 | return _resnet('resnext101_32x8d', Bottleneck, [3, 4, 23, 3], 341 | pretrained, progress, **kwargs) 342 | 343 | 344 | def wide_resnet50_2(pretrained=False, progress=True, **kwargs): 345 | r"""Wide ResNet-50-2 model from 346 | `"Wide Residual Networks" `_ 347 | 348 | The model is the same as ResNet except for the bottleneck number of channels 349 | which is twice larger in every block. The number of channels in outer 1x1 350 | convolutions is the same, e.g. last block in ResNet-50 has 2048-512-2048 351 | channels, and in Wide ResNet-50-2 has 2048-1024-2048. 352 | 353 | Args: 354 | pretrained (bool): If True, returns a model pre-trained on ImageNet 355 | progress (bool): If True, displays a progress bar of the download to stderr 356 | """ 357 | kwargs['width_per_group'] = 64 * 2 358 | return _resnet('wide_resnet50_2', Bottleneck, [3, 4, 6, 3], 359 | pretrained, progress, **kwargs) 360 | 361 | 362 | def wide_resnet101_2(pretrained=False, progress=True, **kwargs): 363 | r"""Wide ResNet-101-2 model from 364 | `"Wide Residual Networks" `_ 365 | 366 | The model is the same as ResNet except for the bottleneck number of channels 367 | which is twice larger in every block. The number of channels in outer 1x1 368 | convolutions is the same, e.g. 
last block in ResNet-50 has 2048-512-2048 369 | channels, and in Wide ResNet-50-2 has 2048-1024-2048. 370 | 371 | Args: 372 | pretrained (bool): If True, returns a model pre-trained on ImageNet 373 | progress (bool): If True, displays a progress bar of the download to stderr 374 | """ 375 | kwargs['width_per_group'] = 64 * 2 376 | return _resnet('wide_resnet101_2', Bottleneck, [3, 4, 23, 3], 377 | pretrained, progress, **kwargs) 378 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/tsn.py: -------------------------------------------------------------------------------- 1 | # Code for "TSM: Temporal Shift Module for Efficient Video Understanding" 2 | # arXiv:1811.08383 3 | # Ji Lin*, Chuang Gan, Song Han 4 | # {jilin, songhan}@mit.edu, ganchuang@csail.mit.edu 5 | 6 | 7 | from ops.transforms import * 8 | from torch import nn 9 | 10 | 11 | class TSN(nn.Module): 12 | def __init__(self, 13 | # num_class, 14 | num_segments, 15 | modality='RGB', 16 | base_model='resnet50', 17 | new_length=None, 18 | # consensus_type='avg', 19 | # before_softmax=True, 20 | # dropout=0.8, 21 | # img_feature_dim=256, 22 | crop_num=1, 23 | partial_bn=True, 24 | print_spec=True, 25 | pretrain='imagenet', 26 | is_shift=False, 27 | shift_div=8, 28 | shift_place='blockres', 29 | fc_lr5=False, 30 | temporal_pool=False, 31 | non_local=False): 32 | super(TSN, self).__init__() 33 | self.modality = modality 34 | self.num_segments = num_segments 35 | self.reshape = False 36 | self.crop_num = crop_num 37 | self.pretrain = pretrain 38 | # self.before_softma 39 | # self.dropout = dropout 40 | # self.consensus_type = consensus_type 41 | # self.img_feature_dim = img_feature_dim # the dimension of the CNN feature to represent each frame 42 | 43 | self.is_shift = is_shift 44 | self.shift_div = shift_div 45 | self.shift_place = shift_place 46 | self.base_model_name = base_model 47 | self.fc_lr5 = fc_lr5 48 | self.temporal_pool = temporal_pool 49 | self.non_local = non_local 50 | 51 | # if not before_softmax and consensus_type != 'avg': 52 | # raise ValueError("Only avg consensus can be used after Softmax") 53 | 54 | if new_length is None: 55 | self.new_length = 1 if modality == "RGB" else 5 56 | else: 57 | self.new_length = new_length 58 | if print_spec: 59 | print((""" 60 | Initializing TSN with base model: {}. 61 | TSN Configurations: 62 | input_modality: {} 63 | num_segments: {} 64 | new_length: {} 65 | """.format(base_model, self.modality, self.num_segments, self.new_length))) 66 | 67 | self._prepare_base_model(base_model) 68 | 69 | # feature_dim = self._prepare_tsn(num_class) 70 | 71 | # if self.modality == 'Flow': 72 | # print("Converting the ImageNet model to a flow init model") 73 | # self.base_model = self._construct_flow_model(self.base_model) 74 | # print("Done. Flow model ready...") 75 | # elif self.modality == 'RGBDiff': 76 | # print("Converting the ImageNet model to RGB+Diff init model") 77 | # self.base_model = self._construct_diff_model(self.base_model) 78 | # print("Done. 
RGBDiff model ready.") 79 | assert self.modality == 'RGB' 80 | 81 | # self.consensus = ConsensusModule(consensus_type) 82 | 83 | # if not self.before_softmax: 84 | # self.softmax = nn.Softmax() 85 | 86 | self._enable_pbn = partial_bn 87 | if partial_bn: 88 | self.partialBN(True) 89 | 90 | # def _prepare_tsn(self, num_class): 91 | # feature_dim = getattr(self.base_model, self.base_model.last_layer_name).in_features 92 | # if self.dropout == 0: 93 | # setattr(self.base_model, self.base_model.last_layer_name, nn.Linear(feature_dim, num_class)) 94 | # self.new_fc = None 95 | # else: 96 | # setattr(self.base_model, self.base_model.last_layer_name, nn.Dropout(p=self.dropout)) 97 | # self.new_fc = nn.Linear(feature_dim, num_class) 98 | # 99 | # std = 0.001 100 | # if self.new_fc is None: 101 | # normal_(getattr(self.base_model, self.base_model.last_layer_name).weight, 0, std) 102 | # constant_(getattr(self.base_model, self.base_model.last_layer_name).bias, 0) 103 | # else: 104 | # if hasattr(self.new_fc, 'weight'): 105 | # normal_(self.new_fc.weight, 0, std) 106 | # constant_(self.new_fc.bias, 0) 107 | # return feature_dim 108 | 109 | def _prepare_base_model(self, base_model): 110 | print('=> base model: {}'.format(base_model)) 111 | 112 | if 'resnet' in base_model: 113 | # self.base_model = getattr(torchvision.models, base_model)(True if self.pretrain == 'imagenet' else False) 114 | self.base_model = getattr(torchvision.models, base_model)(False) 115 | # self.base_model = resnet50(pretrained=True) 116 | if self.is_shift: 117 | print('Adding temporal shift...') 118 | from ops.temporal_shift import make_temporal_shift 119 | make_temporal_shift(self.base_model, self.num_segments, 120 | n_div=self.shift_div, place=self.shift_place, temporal_pool=self.temporal_pool) 121 | 122 | if self.non_local: 123 | raise NotImplementedError 124 | # print('Adding non-local module...') 125 | # from ops.non_local import make_non_local 126 | # make_non_local(self.base_model, self.num_segments) 127 | 128 | # self.base_model.last_layer_name = 'fc' 129 | # self.input_size = 224 130 | # self.input_mean = [0.485, 0.456, 0.406] 131 | # self.input_std = [0.229, 0.224, 0.225] 132 | 133 | self.base_model.avgpool = nn.AdaptiveAvgPool2d(1) 134 | # self.base_model = torch.nn.Sequential(*list(self.base_model.children())[:-1]) # remove the last fc layer 135 | 136 | # if self.modality == 'Flow': 137 | # self.input_mean = [0.5] 138 | # self.input_std = [np.mean(self.input_std)] 139 | # elif self.modality == 'RGBDiff': 140 | # self.input_mean = [0.485, 0.456, 0.406] + [0] * 3 * self.new_length 141 | # self.input_std = self.input_std + [np.mean(self.input_std) * 2] * 3 * self.new_length 142 | assert self.modality == 'RGB' 143 | else: 144 | raise ValueError('Unknown base model: {}'.format(base_model)) 145 | 146 | def train(self, mode=True): 147 | """ 148 | Override the default train() to freeze the BN parameters 149 | :return: 150 | """ 151 | super(TSN, self).train(mode) 152 | count = 0 153 | if self._enable_pbn and mode: 154 | print("Freezing BatchNorm2D except the first one.") 155 | for m in self.base_model.modules(): 156 | if isinstance(m, nn.BatchNorm2d): 157 | count += 1 158 | if count >= (2 if self._enable_pbn else 1): 159 | m.eval() 160 | # shutdown update in frozen mode 161 | m.weight.requires_grad = False 162 | m.bias.requires_grad = False 163 | 164 | def partialBN(self, enable): 165 | self._enable_pbn = enable 166 | 167 | def get_optim_policies(self): 168 | first_conv_weight = [] 169 | first_conv_bias = [] 170 | normal_weight 
= [] 171 | normal_bias = [] 172 | bn = [] 173 | 174 | conv_cnt = 0 175 | bn_cnt = 0 176 | for m in self.modules(): 177 | if isinstance(m, torch.nn.Conv2d) or isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv3d): 178 | ps = list(m.parameters()) 179 | conv_cnt += 1 180 | if conv_cnt == 1: 181 | first_conv_weight.append(ps[0]) 182 | if len(ps) == 2: 183 | first_conv_bias.append(ps[1]) 184 | else: 185 | normal_weight.append(ps[0]) 186 | if len(ps) == 2: 187 | normal_bias.append(ps[1]) 188 | elif isinstance(m, torch.nn.BatchNorm2d): 189 | bn_cnt += 1 190 | # later BN's are frozen 191 | if not self._enable_pbn or bn_cnt == 1: 192 | bn.extend(list(m.parameters())) 193 | elif isinstance(m, torch.nn.BatchNorm3d): 194 | bn_cnt += 1 195 | # later BN's are frozen 196 | if not self._enable_pbn or bn_cnt == 1: 197 | bn.extend(list(m.parameters())) 198 | elif len(m._modules) == 0: 199 | if len(list(m.parameters())) > 0: 200 | raise ValueError("New atomic module type: {}. Need to give it a learning policy".format(type(m))) 201 | 202 | return [ 203 | {'params': first_conv_weight, 'lr_mult': 5 if self.modality == 'Flow' else 1, 'decay_mult': 1, 204 | 'name': "first_conv_weight"}, 205 | {'params': first_conv_bias, 'lr_mult': 10 if self.modality == 'Flow' else 2, 'decay_mult': 0, 206 | 'name': "first_conv_bias"}, 207 | {'params': normal_weight, 'lr_mult': 1, 'decay_mult': 1, 208 | 'name': "normal_weight"}, 209 | {'params': normal_bias, 'lr_mult': 2, 'decay_mult': 0, 210 | 'name': "normal_bias"}, 211 | {'params': bn, 'lr_mult': 1, 'decay_mult': 0, 212 | 'name': "BN scale/shift"}, 213 | ] 214 | 215 | def forward(self, input, no_reshape=False): 216 | if not no_reshape: 217 | sample_len = (3 if self.modality == "RGB" else 2) * self.new_length # self.new_length = 1 if modality == "RGB" else 5 218 | 219 | if self.modality == 'RGBDiff': 220 | sample_len = 3 * self.new_length 221 | input = self._get_diff(input) 222 | 223 | base_out = self.base_model(input.view((-1, sample_len) + input.size()[-2:])) # original input size: n, t*c, h, w; after reshape n*t, c, h, w 224 | else: 225 | base_out = self.base_model(input) 226 | 227 | # if self.dropout > 0: 228 | # base_out = self.new_fc(base_out) 229 | # 230 | # if not self.before_softmax: 231 | # base_out = self.softmax(base_out) 232 | 233 | if self.reshape: # self.reshape=True 234 | if self.is_shift and self.temporal_pool: 235 | base_out = base_out.view((-1, self.num_segments // 2) + base_out.size()[1:]) 236 | else: 237 | base_out = base_out.view((-1, self.num_segments) + base_out.size()[1:]) # base_out size: n, t, c, h, w 238 | # output = self.consensus(base_out) 239 | # return output.squeeze(1) 240 | 241 | return base_out.squeeze() 242 | 243 | @property 244 | def feature_dim(self): 245 | return 2048 -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/utils.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Description: 3 | Author: Zhaoxi Chen 4 | Github: https://github.com/FrozenBurning 5 | Date: 2020-12-14 13:54:36 6 | LastEditors: Zhaoxi Chen 7 | LastEditTime: 2020-12-14 15:02:20 8 | ''' 9 | try: 10 | from torch.hub import load_state_dict_from_url 11 | except ImportError: 12 | from torch.utils.model_zoo import load_url as load_state_dict_from_url 13 | 14 | import torchvision 15 | import torch 16 | import numpy as np 17 | 18 | def prep_a_net(model_name, shall_pretrain): 19 | model = getattr(torchvision.models, model_name)(shall_pretrain) 
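    # `shall_pretrain` is forwarded positionally as torchvision's `pretrained` flag,
    # so ImageNet weights are downloaded only when it is True.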
20 | if "resnet" in model_name: 21 | model.last_layer_name = 'fc' 22 | elif "mobilenet_v2" in model_name: 23 | model.last_layer_name = 'classifier' 24 | return model 25 | 26 | def zero_pad(im, pad_size): 27 | """Performs zero padding (CHW format).""" 28 | pad_width = ((0, 0), (pad_size, pad_size), (pad_size, pad_size)) 29 | return np.pad(im, pad_width, mode="constant") 30 | 31 | def random_crop(im, size, pad_size=0): 32 | """Performs random crop (CHW format).""" 33 | if pad_size > 0: 34 | im = zero_pad(im=im, pad_size=pad_size) 35 | h, w = im.shape[1:] 36 | if size == h: 37 | return im 38 | y = np.random.randint(0, h - size) 39 | x = np.random.randint(0, w - size) 40 | im_crop = im[:, y : (y + size), x : (x + size)] 41 | assert im_crop.shape[1:] == (size, size) 42 | return im_crop 43 | 44 | def get_patch(images, action_sequence, patch_size): 45 | """Get small patch of the original image""" 46 | batch_size = images.size(0) 47 | image_size = images.size(2) 48 | 49 | patch_coordinate = torch.floor(action_sequence * (image_size - patch_size)).int() 50 | patches = [] 51 | for i in range(batch_size): 52 | per_patch = images[i, :, 53 | (patch_coordinate[i, 0].item()): ((patch_coordinate[i, 0] + patch_size).item()), 54 | (patch_coordinate[i, 1].item()): ((patch_coordinate[i, 1] + patch_size).item())] 55 | 56 | patches.append(per_patch.view(1, per_patch.size(0), per_patch.size(1), per_patch.size(2))) 57 | 58 | return torch.cat(patches, 0) 59 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__init__.py: -------------------------------------------------------------------------------- 1 | from ops.basic_ops import * 2 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/temporal_shift.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/temporal_shift.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/transforms.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/transforms.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something 
V1&V2/ops/basic_ops.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | class Identity(torch.nn.Module): 5 | def forward(self, input): 6 | return input 7 | 8 | 9 | class SegmentConsensus(torch.nn.Module): 10 | 11 | def __init__(self, consensus_type, dim=1): 12 | super(SegmentConsensus, self).__init__() 13 | self.consensus_type = consensus_type 14 | self.dim = dim 15 | self.shape = None 16 | 17 | def forward(self, input_tensor): 18 | self.shape = input_tensor.size() 19 | if self.consensus_type == 'avg': 20 | output = input_tensor.mean(dim=self.dim, keepdim=True) 21 | elif self.consensus_type == 'identity': 22 | output = input_tensor 23 | else: 24 | output = None 25 | 26 | return output 27 | 28 | 29 | class ConsensusModule(torch.nn.Module): 30 | 31 | def __init__(self, consensus_type, dim=1): 32 | super(ConsensusModule, self).__init__() 33 | self.consensus_type = consensus_type if consensus_type != 'rnn' else 'identity' 34 | self.dim = dim 35 | 36 | def forward(self, input): 37 | return SegmentConsensus(self.consensus_type, self.dim)(input) 38 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/dataset.py: -------------------------------------------------------------------------------- 1 | # Code for "TSM: Temporal Shift Module for Efficient Video Understanding" 2 | # arXiv:1811.08383 3 | # Ji Lin*, Chuang Gan, Song Han 4 | # {jilin, songhan}@mit.edu, ganchuang@csail.mit.edu 5 | 6 | import torch.utils.data as data 7 | 8 | from PIL import Image 9 | import os 10 | import numpy as np 11 | from numpy.random import randint 12 | 13 | 14 | class VideoRecord(object): 15 | def __init__(self, row): 16 | self._data = row 17 | 18 | @property 19 | def path(self): 20 | return self._data[0] 21 | 22 | @property 23 | def num_frames(self): 24 | return int(self._data[1]) 25 | 26 | @property 27 | def label(self): 28 | return int(self._data[2]) 29 | 30 | 31 | class TSNDataSet(data.Dataset): 32 | def __init__(self, root_path, list_file, num_segments_glancer=8, 33 | num_segments_focuser=8, new_length=1, modality='RGB', 34 | image_tmpl='img_{:05d}.jpg', transform=None, 35 | random_shift=True, test_mode=False, 36 | remove_missing=False, dense_sample=False, twice_sample=False): 37 | 38 | self.root_path = root_path 39 | self.list_file = list_file 40 | self.num_segments_glancer = num_segments_glancer 41 | self.num_segments_focuser = num_segments_focuser 42 | self.new_length = new_length 43 | self.modality = modality 44 | self.image_tmpl = image_tmpl 45 | self.transform = transform 46 | self.random_shift = random_shift 47 | self.test_mode = test_mode 48 | # print('self.test_mode:', self.test_mode) 49 | self.remove_missing = remove_missing 50 | self.dense_sample = dense_sample # using dense sample as I3D 51 | self.twice_sample = twice_sample # twice sample for more validation 52 | if self.dense_sample: 53 | print('=> Using dense sample for the dataset...') 54 | if self.twice_sample: 55 | print('=> Using twice sample for the dataset...') 56 | 57 | if self.modality == 'RGBDiff': 58 | self.new_length += 1 # Diff needs one more image to calculate diff 59 | 60 | self._parse_list() 61 | 62 | def _load_image(self, directory, idx): 63 | if self.modality == 'RGB' or self.modality == 'RGBDiff': 64 | try: 65 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(idx))).convert('RGB')] 66 | except Exception: 67 | print('error loading image:', 
os.path.join(self.root_path, directory, self.image_tmpl.format(idx))) 68 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(1))).convert('RGB')] 69 | elif self.modality == 'Flow': 70 | if self.image_tmpl == 'flow_{}_{:05d}.jpg': # ucf 71 | x_img = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format('x', idx))).convert( 72 | 'L') 73 | y_img = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format('y', idx))).convert( 74 | 'L') 75 | elif self.image_tmpl == '{:06d}-{}_{:05d}.jpg': # something v1 flow 76 | x_img = Image.open(os.path.join(self.root_path, '{:06d}'.format(int(directory)), self.image_tmpl. 77 | format(int(directory), 'x', idx))).convert('L') 78 | y_img = Image.open(os.path.join(self.root_path, '{:06d}'.format(int(directory)), self.image_tmpl. 79 | format(int(directory), 'y', idx))).convert('L') 80 | else: 81 | try: 82 | # idx_skip = 1 + (idx-1)*5 83 | flow = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(idx))).convert( 84 | 'RGB') 85 | except Exception: 86 | print('error loading flow file:', 87 | os.path.join(self.root_path, directory, self.image_tmpl.format(idx))) 88 | flow = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(1))).convert('RGB') 89 | # the input flow file is RGB image with (flow_x, flow_y, blank) for each channel 90 | flow_x, flow_y, _ = flow.split() 91 | x_img = flow_x.convert('L') 92 | y_img = flow_y.convert('L') 93 | 94 | return [x_img, y_img] 95 | 96 | def _parse_list(self): 97 | # check the frame number is large >3: 98 | tmp = [x.strip().split(' ') for x in open(self.list_file)] 99 | if not self.test_mode or self.remove_missing: 100 | tmp = [item for item in tmp if int(item[1]) >= 3] 101 | self.video_list = [VideoRecord(item) for item in tmp] 102 | 103 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 104 | for v in self.video_list: 105 | v._data[1] = int(v._data[1]) / 2 106 | print('video number:%d' % (len(self.video_list))) 107 | 108 | def _sample_indices(self, record): 109 | """ 110 | 111 | :param record: VideoRecord 112 | :return: list 113 | """ 114 | ### For glancer 115 | if self.dense_sample: # i3d dense sample 116 | sample_pos = max(1, 1 + record.num_frames - 64) 117 | t_stride = 64 // self.num_segments_glancer 118 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 119 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_glancer)] 120 | offsets_glancer = np.array(offsets) + 1 121 | else: # normal sample 122 | # print('normal sample') 123 | average_duration = (record.num_frames - self.new_length + 1) // self.num_segments_glancer # RGB self.new_length=1 124 | if average_duration > 0: 125 | offsets = np.multiply(list(range(self.num_segments_glancer)), average_duration) + randint(average_duration, 126 | size=self.num_segments_glancer) 127 | elif record.num_frames > self.num_segments_glancer: 128 | offsets = np.sort(randint(record.num_frames - self.new_length + 1, size=self.num_segments_glancer)) 129 | else: 130 | offsets = np.zeros((self.num_segments_glancer,)) 131 | offsets_glancer = offsets + 1 132 | 133 | ### For focuser 134 | if self.dense_sample: # i3d dense sample 135 | sample_pos = max(1, 1 + record.num_frames - 64) 136 | t_stride = 64 // self.num_segments_focuser 137 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 138 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_focuser)] 139 | 
offsets_focuser = np.array(offsets) + 1 140 | else: # normal sample 141 | # print('normal sample') 142 | average_duration = (record.num_frames - self.new_length + 1) // self.num_segments_focuser # RGB self.new_length=1 143 | if average_duration > 0: 144 | offsets = np.multiply(list(range(self.num_segments_focuser)), average_duration) + randint(average_duration, 145 | size=self.num_segments_focuser) 146 | elif record.num_frames > self.num_segments_focuser: 147 | offsets = np.sort(randint(record.num_frames - self.new_length + 1, size=self.num_segments_focuser)) 148 | else: 149 | offsets = np.zeros((self.num_segments_focuser,)) 150 | offsets_focuser = offsets + 1 151 | 152 | return offsets_glancer, offsets_focuser 153 | 154 | def _get_val_indices(self, record): 155 | 156 | ### For glancer 157 | if self.dense_sample: # i3d dense sample 158 | sample_pos = max(1, 1 + record.num_frames - 64) 159 | t_stride = 64 // self.num_segments_glancer 160 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 161 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_glancer)] 162 | offsets_glancer = np.array(offsets) + 1 163 | else: 164 | if record.num_frames > self.num_segments_glancer + self.new_length - 1: 165 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_glancer) 166 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_glancer)]) 167 | else: 168 | offsets = np.zeros((self.num_segments_glancer,)) 169 | offsets_glancer = offsets + 1 170 | 171 | ### For focuser 172 | if self.dense_sample: # i3d dense sample 173 | sample_pos = max(1, 1 + record.num_frames - 64) 174 | t_stride = 64 // self.num_segments_focuser 175 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 176 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_focuser)] 177 | offsets_focuser = np.array(offsets) + 1 178 | else: 179 | if record.num_frames > self.num_segments_focuser + self.new_length - 1: 180 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_focuser) 181 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_focuser)]) 182 | else: 183 | offsets = np.zeros((self.num_segments_focuser,)) 184 | offsets_focuser = offsets + 1 185 | 186 | return offsets_glancer, offsets_focuser 187 | 188 | def _get_test_indices(self, record): 189 | 190 | ### For glancer 191 | if self.dense_sample: 192 | sample_pos = max(1, 1 + record.num_frames - 64) 193 | t_stride = 64 // self.num_segments_glancer 194 | start_list = np.linspace(0, sample_pos - 1, num=10, dtype=int) 195 | offsets = [] 196 | for start_idx in start_list.tolist(): 197 | offsets += [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_glancer)] 198 | offsets_glancer = np.array(offsets) + 1 199 | elif self.twice_sample: 200 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_glancer) 201 | 202 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_glancer)] + 203 | [int(tick * x) for x in range(self.num_segments_glancer)]) 204 | 205 | offsets_glancer = offsets + 1 206 | else: 207 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_glancer) 208 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_glancer)]) 209 | offsets_glancer = offsets + 1 210 | 211 | ### For focuser 212 | if self.dense_sample: 213 | sample_pos = 
max(1, 1 + record.num_frames - 64) 214 | t_stride = 64 // self.num_segments_focuser 215 | start_list = np.linspace(0, sample_pos - 1, num=10, dtype=int) 216 | offsets = [] 217 | for start_idx in start_list.tolist(): 218 | offsets += [(idx * t_stride + start_idx) % record.num_frames for idx in 219 | range(self.num_segments_focuser)] 220 | offsets_focuser = np.array(offsets) + 1 221 | elif self.twice_sample: 222 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_focuser) 223 | 224 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_focuser)] + 225 | [int(tick * x) for x in range(self.num_segments_focuser)]) 226 | 227 | offsets_focuser = offsets + 1 228 | else: 229 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_focuser) 230 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_focuser)]) 231 | offsets_focuser = offsets + 1 232 | 233 | return offsets_glancer, offsets_focuser 234 | 235 | def __getitem__(self, index): 236 | # print('get item') 237 | record = self.video_list[index] 238 | # check this is a legit video folder 239 | 240 | if self.image_tmpl == 'flow_{}_{:05d}.jpg': 241 | file_name = self.image_tmpl.format('x', 1) 242 | full_path = os.path.join(self.root_path, record.path, file_name) 243 | elif self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 244 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 245 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 246 | else: 247 | file_name = self.image_tmpl.format(1) 248 | full_path = os.path.join(self.root_path, record.path, file_name) 249 | 250 | while not os.path.exists(full_path): 251 | print('################## Not Found:', os.path.join(self.root_path, record.path, file_name)) 252 | index = np.random.randint(len(self.video_list)) 253 | record = self.video_list[index] 254 | if self.image_tmpl == 'flow_{}_{:05d}.jpg': 255 | file_name = self.image_tmpl.format('x', 1) 256 | full_path = os.path.join(self.root_path, record.path, file_name) 257 | elif self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 258 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 259 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 260 | else: 261 | file_name = self.image_tmpl.format(1) 262 | full_path = os.path.join(self.root_path, record.path, file_name) 263 | 264 | # print('record:', record) 265 | if not self.test_mode: 266 | # print('test model False') 267 | segment_indices_glancer, segment_indices_focuser = self._sample_indices(record) \ 268 | if self.random_shift else self._get_val_indices(record) 269 | else: 270 | # print('test model True') 271 | segment_indices_glancer, segment_indices_focuser = self._get_test_indices(record) 272 | 273 | return self.get(record, segment_indices_glancer, segment_indices_focuser) 274 | 275 | def get(self, record, indices_glancer, indices_focuser): 276 | 277 | images_glancer = list() 278 | for seg_ind in indices_glancer: 279 | p = int(seg_ind) 280 | for i in range(self.new_length): 281 | seg_imgs = self._load_image(record.path, p) 282 | images_glancer.extend(seg_imgs) 283 | if p < record.num_frames: 284 | p += 1 285 | 286 | process_data_glancer = self.transform(images_glancer) 287 | 288 | images_focuser = list() 289 | for seg_ind in indices_focuser: 290 | p = int(seg_ind) 291 | for i in range(self.new_length): 292 | seg_imgs = self._load_image(record.path, p) 293 | images_focuser.extend(seg_imgs) 294 | if p < record.num_frames: 295 | p 
+= 1 296 | 297 | process_data_focuser = self.transform(images_focuser) 298 | return process_data_glancer, process_data_focuser, record.label 299 | 300 | def __len__(self): 301 | return len(self.video_list) 302 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/dataset_config.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | 4 | def return_somethingv1(modality, root_dataset): 5 | filename_categories = 'something-something-v1/category.txt' 6 | if modality == 'RGB': 7 | root_data = os.path.join(root_dataset, 'something-something-v1/20bn-something-something-v1') 8 | filename_imglist_train = 'something-something-v1/train_videofolder.txt' 9 | filename_imglist_val = 'something-something-v1/val_videofolder.txt' 10 | prefix = '{:05d}.jpg' 11 | elif modality == 'Flow': 12 | root_data = os.path.join(root_dataset, 'something-something-v1/20bn-something-something-v1-flow') 13 | filename_imglist_train = 'something-something-v1/train_videofolder_flow.txt' 14 | filename_imglist_val = 'something-something-v1/val_videofolder_flow.txt' 15 | prefix = '{:06d}-{}_{:05d}.jpg' 16 | else: 17 | print('no such modality:'+modality) 18 | raise NotImplementedError 19 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 20 | 21 | 22 | def return_somethingv2(modality, root_dataset): 23 | filename_categories = 'something-something-v2/category.txt' 24 | if modality == 'RGB': 25 | root_data = os.path.join(root_dataset, 'something-something-v2/20bn-something-something-v2-frames') 26 | filename_imglist_train = 'something-something-v2/train_videofolder.txt' 27 | filename_imglist_val = 'something-something-v2/val_videofolder.txt' 28 | prefix = '{:06d}.jpg' 29 | elif modality == 'Flow': 30 | root_data = os.path.join(root_dataset, 'something-something-v2/20bn-something-something-v2-flow') 31 | filename_imglist_train = 'something-something-v2/train_videofolder_flow.txt' 32 | filename_imglist_val = 'something-something-v2/val_videofolder_flow.txt' 33 | prefix = '{:06d}.jpg' 34 | else: 35 | raise NotImplementedError('no such modality:'+modality) 36 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 37 | 38 | 39 | def return_dataset(dataset, modality, root_dataset): 40 | dict_single = {'somethingv1': return_somethingv1, 'somethingv2': return_somethingv2} 41 | if dataset in dict_single: 42 | file_categories, file_imglist_train, file_imglist_val, root_data, prefix = dict_single[dataset](modality, root_dataset) 43 | else: 44 | raise ValueError('Unknown dataset '+dataset) 45 | 46 | file_imglist_train = os.path.join(root_dataset, file_imglist_train) 47 | file_imglist_val = os.path.join(root_dataset, file_imglist_val) 48 | if isinstance(file_categories, str): 49 | file_categories = os.path.join(root_dataset, file_categories) 50 | with open(file_categories) as f: 51 | lines = f.readlines() 52 | categories = [item.rstrip() for item in lines] 53 | else: # number of categories 54 | categories = [None] * file_categories 55 | n_class = len(categories) 56 | print('{}: {} classes'.format(dataset, n_class)) 57 | return n_class, file_imglist_train, file_imglist_val, root_data, prefix -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/my_logger.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import time 3 
| from datetime import datetime 4 | 5 | 6 | # TODO(Yue) Overrides the default logger 7 | class Logger(object): 8 | def __init__(self, log_prefix=""): 9 | self._terminal = sys.stdout 10 | self._timestr = datetime.fromtimestamp(time.time()).strftime("%m%d-%H%M%S") 11 | self._log_path = None 12 | self._log_dir_name = None 13 | self._log_file_name = None 14 | self._history_records = [" ".join(["python"] + sys.argv + ["\n"])] # TODO(yue) remember the CLI input 15 | self._write_mode = "bear_in_mind" 16 | self._prefix = log_prefix 17 | 18 | # TODO(yue) buffered write() and create_log() help when we don't want to save logs too early because of some early-check process: 19 | # we just bear them in mind, and when we really need to write them down, we do that 20 | # without Logger: terminal 21 | # bear_in_mind: terminal->RAM 22 | # take_notes: RAM->FILE 23 | # normal: terminal->FILE 24 | def create_log(self, log_path, test_mode, t, bs, k): 25 | self._log_dir_name = log_path 26 | if test_mode: 27 | self._log_file_name = "test-%s-t%02d-bz%02d-k%02d.txt" % (self._timestr, t, bs, k) 28 | else: 29 | self._log_file_name = "log-%s.txt" % self._timestr 30 | self._log_path = log_path + "/" + self._log_file_name 31 | self.log = open(self._log_path, "a", 1) 32 | self._write_mode = "take_notes" 33 | for record in self._history_records: 34 | self.write(record) 35 | self._history_records = [] 36 | self._write_mode = "normal" 37 | 38 | def write(self, message): 39 | if self._write_mode in ["bear_in_mind", "normal"]: 40 | self._terminal.write(message) 41 | if self._write_mode in ["take_notes", "normal"]: 42 | self.log.write(message.replace("\033[0m", ""). \ 43 | replace("\033[95m", "").replace("\033[94m", "").replace("\033[93m", "").replace("\033[92m", 44 | "").replace( 45 | "\033[91m", "")) 46 | else: 47 | self._history_records.append(message) 48 | 49 | def flush(self): 50 | pass 51 | 52 | def close_log(self): 53 | a = 1 54 | self.log.close() 55 | return sys.stdout 56 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/net_flops_table.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | sys.path.insert(0, "../") 4 | 5 | import torch 6 | import torchvision 7 | from torch import nn 8 | from thop import profile 9 | 10 | feat_dim_dict = { 11 | "resnet18": 512, 12 | "resnet34": 512, 13 | "resnet50": 2048, 14 | "resnet101": 2048, 15 | "mobilenet_v2": 1280, 16 | "efficientnet-b0": 1280, 17 | "efficientnet-b1": 1280, 18 | "efficientnet-b2": 1408, 19 | "efficientnet-b3": 1536, 20 | "efficientnet-b4": 1792, 21 | "efficientnet-b5": 2048, 22 | } 23 | 24 | prior_dict = { 25 | "efficientnet-b0": (0.39, 5.3), 26 | "efficientnet-b1": (0.70, 7.8), 27 | "efficientnet-b2": (1.00, 9.2), 28 | "efficientnet-b3": (1.80, 12), 29 | "efficientnet-b4": (4.20, 19), 30 | "efficientnet-b5": (9.90, 30), 31 | } 32 | 33 | 34 | def get_gflops_params(model_name, resolution, num_classes, seg_len=-1, pretrained=True): 35 | if model_name in prior_dict: 36 | gflops, params = prior_dict[model_name] 37 | gflops = gflops / 224 / 224 * resolution * resolution 38 | return gflops, params 39 | 40 | if "resnet" in model_name: 41 | model = getattr(torchvision.models, model_name)(pretrained) 42 | last_layer = "fc" 43 | elif model_name == "mobilenet_v2": 44 | model = getattr(torchvision.models, model_name)(pretrained) 45 | last_layer = "classifier" 46 | else: 47 | exit("I don't know what is %s" % model_name) 48 | feat_dim = feat_dim_dict[model_name] 49 | 50 | 
setattr(model, last_layer, nn.Linear(feat_dim, num_classes)) 51 | 52 | if seg_len == -1: 53 | dummy_data = torch.randn(1, 3, resolution, resolution) 54 | else: 55 | dummy_data = torch.randn(1, 3, seg_len, resolution, resolution) 56 | 57 | hooks = {} 58 | flops, params = profile(model, inputs=(dummy_data,), custom_ops=hooks) 59 | gflops = flops / 1e9 60 | params = params / 1e6 61 | 62 | return gflops, params 63 | 64 | 65 | if __name__ == "__main__": 66 | str_list = [] 67 | for s in str_list: 68 | print(s) 69 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/sal_rank_loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | def cal_sal_rank_loss(real_pred, lite_pred, target, margin=0): 6 | B, T, K = real_pred.shape 7 | 8 | # TODO(shape) B * T 9 | b_idx = [[x] * T for x in range(B)] 10 | t_idx = [list(range(T)) for _ in range(B)] 11 | k_idx = [[tgt] * T for tgt in target[:, 0].cpu().numpy()] 12 | 13 | # TODO(shape) B * T 14 | real_cfd = real_pred[b_idx, t_idx, k_idx] 15 | lite_cfd = lite_pred[b_idx, t_idx, k_idx] 16 | 17 | x, y = torch.triu_indices(T - 1, T - 1) + torch.tensor([[0], [1]]) 18 | 19 | # TODO(shape) B * (T*(T-1)/2) 20 | real_cfd_x = real_cfd[:, x] 21 | real_cfd_y = real_cfd[:, y] 22 | lite_cfd_x = lite_cfd[:, x] 23 | lite_cfd_y = lite_cfd[:, y] 24 | 25 | psu_label = (real_cfd_x > real_cfd_y).double() * 2 - 1 26 | 27 | return F.margin_ranking_loss(lite_cfd_x, lite_cfd_y, psu_label, margin=margin, reduction="mean") 28 | 29 | 30 | if __name__ == "__main__": 31 | # B=2, T=3, K=4 32 | a = torch.tensor([[[0.1, 0.2, 0.3, 0.4], [0.5, 0.2, 0.1, 0.2], [0.3, 0.3, 0.3, 0.1]], 33 | [[0.3, 0.1, 0.1, 0.1], [0.2, 0.2, 0.5, 0.3], [0.4, 0.4, 0.3, 0.1]]]) 34 | b = torch.tensor([[[0.0, 0.0, 0.0, 0.3], [0.0, 0.0, 0.0, 0.2], [0.0, 0.0, 0.0, 0.3]], 35 | [[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.3, 0.0], [0.0, 0.0, 0.2, 0.0]]]) 36 | target = torch.tensor([3, 2]) 37 | margin = 0 38 | 39 | print("Expect: 0.0000, reality:", cal_sal_rank_loss(a, a, target, 0)) # TODO(yue) expect to become 0 40 | print("Expect: 0.0167, reality:", cal_sal_rank_loss(a, b, target, 0)) # TODO(yue) expect to become 0.1/6=0.0167 41 | print("Expect: 0.9333, reality:", cal_sal_rank_loss(a, b, target, 1)) # TODO(yue) expect to become 5.6/6=0.9333 42 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/temporal_shift.py: -------------------------------------------------------------------------------- 1 | # Code for "TSM: Temporal Shift Module for Efficient Video Understanding" 2 | # arXiv:1811.08383 3 | # Ji Lin*, Chuang Gan, Song Han 4 | # {jilin, songhan}@mit.edu, ganchuang@csail.mit.edu 5 | 6 | import torch 7 | import torch.nn as nn 8 | import torch.nn.functional as F 9 | 10 | from models.resnet import resnet50 11 | 12 | 13 | class TemporalShift(nn.Module): 14 | def __init__(self, net, n_segment=3, n_div=8, inplace=False): 15 | super(TemporalShift, self).__init__() 16 | self.net = net 17 | self.n_segment = n_segment 18 | self.fold_div = n_div 19 | self.inplace = inplace 20 | if inplace: 21 | print('=> Using in-place shift...') 22 | print('=> Using fold div: {}'.format(self.fold_div)) 23 | 24 | def forward(self, x): 25 | x = self.shift(x, self.n_segment, fold_div=self.fold_div, inplace=self.inplace) 26 | return self.net(x) 27 | 28 | @staticmethod 29 | def shift(x, n_segment, fold_div=3, 
inplace=False): 30 | nt, c, h, w = x.size() 31 | n_batch = nt // n_segment 32 | x = x.view(n_batch, n_segment, c, h, w) 33 | 34 | fold = c // fold_div 35 | if inplace: 36 | # Due to some out of order error when performing parallel computing. 37 | # May need to write a CUDA kernel. 38 | raise NotImplementedError 39 | # out = InplaceShift.apply(x, fold) 40 | else: 41 | out = torch.zeros_like(x) 42 | out[:, :-1, :fold] = x[:, 1:, :fold] # shift left 43 | out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold] # shift right 44 | out[:, :, 2 * fold:] = x[:, :, 2 * fold:] # not shift 45 | 46 | return out.view(nt, c, h, w) 47 | 48 | 49 | class InplaceShift(torch.autograd.Function): 50 | # Special thanks to @raoyongming for the help to this function 51 | @staticmethod 52 | def forward(ctx, input, fold): 53 | # not support higher order gradient 54 | # input = input.detach_() 55 | ctx.fold_ = fold 56 | n, t, c, h, w = input.size() 57 | buffer = input.data.new(n, t, fold, h, w).zero_() 58 | buffer[:, :-1] = input.data[:, 1:, :fold] 59 | input.data[:, :, :fold] = buffer 60 | buffer.zero_() 61 | buffer[:, 1:] = input.data[:, :-1, fold: 2 * fold] 62 | input.data[:, :, fold: 2 * fold] = buffer 63 | return input 64 | 65 | @staticmethod 66 | def backward(ctx, grad_output): 67 | # grad_output = grad_output.detach_() 68 | fold = ctx.fold_ 69 | n, t, c, h, w = grad_output.size() 70 | buffer = grad_output.data.new(n, t, fold, h, w).zero_() 71 | buffer[:, 1:] = grad_output.data[:, :-1, :fold] 72 | grad_output.data[:, :, :fold] = buffer 73 | buffer.zero_() 74 | buffer[:, :-1] = grad_output.data[:, 1:, fold: 2 * fold] 75 | grad_output.data[:, :, fold: 2 * fold] = buffer 76 | return grad_output, None 77 | 78 | 79 | class TemporalPool(nn.Module): 80 | def __init__(self, net, n_segment): 81 | super(TemporalPool, self).__init__() 82 | self.net = net 83 | self.n_segment = n_segment 84 | 85 | def forward(self, x): 86 | x = self.temporal_pool(x, n_segment=self.n_segment) 87 | return self.net(x) 88 | 89 | @staticmethod 90 | def temporal_pool(x, n_segment): 91 | nt, c, h, w = x.size() 92 | n_batch = nt // n_segment 93 | x = x.view(n_batch, n_segment, c, h, w).transpose(1, 2) # n, c, t, h, w 94 | x = F.max_pool3d(x, kernel_size=(3, 1, 1), stride=(2, 1, 1), padding=(1, 0, 0)) 95 | x = x.transpose(1, 2).contiguous().view(nt // 2, c, h, w) 96 | return x 97 | 98 | 99 | def make_temporal_shift(net, n_segment, n_div=8, place='blockres', temporal_pool=False): 100 | if temporal_pool: 101 | n_segment_list = [n_segment, n_segment // 2, n_segment // 2, n_segment // 2] 102 | else: 103 | n_segment_list = [n_segment] * 4 104 | assert n_segment_list[-1] > 0 105 | print('=> n_segment per stage: {}'.format(n_segment_list)) 106 | 107 | import torchvision 108 | # if isinstance(net, torchvision.models.ResNet): 109 | if isinstance(net, torch.nn.Module): 110 | if place == 'block': 111 | def make_block_temporal(stage, this_segment): 112 | blocks = list(stage.children()) 113 | print('=> Processing stage with {} blocks'.format(len(blocks))) 114 | for i, b in enumerate(blocks): 115 | blocks[i] = TemporalShift(b, n_segment=this_segment, n_div=n_div) 116 | return nn.Sequential(*(blocks)) 117 | 118 | net.layer1 = make_block_temporal(net.layer1, n_segment_list[0]) 119 | net.layer2 = make_block_temporal(net.layer2, n_segment_list[1]) 120 | net.layer3 = make_block_temporal(net.layer3, n_segment_list[2]) 121 | net.layer4 = make_block_temporal(net.layer4, n_segment_list[3]) 122 | 123 | elif 'blockres' in place: 124 | n_round = 1 125 | if 
len(list(net.layer3.children())) >= 23: 126 | n_round = 2 127 | print('=> Using n_round {} to insert temporal shift'.format(n_round)) 128 | 129 | def make_block_temporal(stage, this_segment): 130 | blocks = list(stage.children()) 131 | print('=> Processing stage with {} blocks residual'.format(len(blocks))) 132 | for i, b in enumerate(blocks): 133 | if i % n_round == 0: 134 | blocks[i].conv1 = TemporalShift(b.conv1, n_segment=this_segment, n_div=n_div) 135 | return nn.Sequential(*blocks) 136 | 137 | net.layer1 = make_block_temporal(net.layer1, n_segment_list[0]) 138 | net.layer2 = make_block_temporal(net.layer2, n_segment_list[1]) 139 | net.layer3 = make_block_temporal(net.layer3, n_segment_list[2]) 140 | net.layer4 = make_block_temporal(net.layer4, n_segment_list[3]) 141 | else: 142 | raise NotImplementedError(place) 143 | 144 | 145 | def make_temporal_pool(net, n_segment): 146 | import torchvision 147 | if isinstance(net, torchvision.models.ResNet): 148 | print('=> Injecting nonlocal pooling') 149 | net.layer2 = TemporalPool(net.layer2, n_segment) 150 | else: 151 | raise NotImplementedError 152 | 153 | 154 | if __name__ == '__main__': 155 | # test inplace shift v.s. vanilla shift 156 | tsm1 = TemporalShift(nn.Sequential(), n_segment=8, n_div=8, inplace=False) 157 | tsm2 = TemporalShift(nn.Sequential(), n_segment=8, n_div=8, inplace=True) 158 | 159 | print('=> Testing CPU...') 160 | # test forward 161 | with torch.no_grad(): 162 | for i in range(10): 163 | x = torch.rand(2 * 8, 3, 224, 224) 164 | y1 = tsm1(x) 165 | y2 = tsm2(x) 166 | assert torch.norm(y1 - y2).item() < 1e-5 167 | 168 | # test backward 169 | with torch.enable_grad(): 170 | for i in range(10): 171 | x1 = torch.rand(2 * 8, 3, 224, 224) 172 | x1.requires_grad_() 173 | x2 = x1.clone() 174 | y1 = tsm1(x1) 175 | y2 = tsm2(x2) 176 | grad1 = torch.autograd.grad((y1 ** 2).mean(), [x1])[0] 177 | grad2 = torch.autograd.grad((y2 ** 2).mean(), [x2])[0] 178 | assert torch.norm(grad1 - grad2).item() < 1e-5 179 | 180 | print('=> Testing GPU...') 181 | tsm1.cuda() 182 | tsm2.cuda() 183 | # test forward 184 | with torch.no_grad(): 185 | for i in range(10): 186 | x = torch.rand(2 * 8, 3, 224, 224).cuda() 187 | y1 = tsm1(x) 188 | y2 = tsm2(x) 189 | assert torch.norm(y1 - y2).item() < 1e-5 190 | 191 | # test backward 192 | with torch.enable_grad(): 193 | for i in range(10): 194 | x1 = torch.rand(2 * 8, 3, 224, 224).cuda() 195 | x1.requires_grad_() 196 | x2 = x1.clone() 197 | y1 = tsm1(x1) 198 | y2 = tsm2(x2) 199 | grad1 = torch.autograd.grad((y1 ** 2).mean(), [x1])[0] 200 | grad2 = torch.autograd.grad((y2 ** 2).mean(), [x2])[0] 201 | assert torch.norm(grad1 - grad2).item() < 1e-5 202 | print('Test passed.') 203 | 204 | 205 | 206 | 207 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/transforms.py: -------------------------------------------------------------------------------- 1 | import torchvision 2 | import random 3 | from PIL import Image, ImageOps 4 | import numpy as np 5 | import numbers 6 | import math 7 | import torch 8 | 9 | 10 | class GroupRandomCrop(object): 11 | def __init__(self, size): 12 | if isinstance(size, numbers.Number): 13 | self.size = (int(size), int(size)) 14 | else: 15 | self.size = size 16 | 17 | def __call__(self, img_group): 18 | 19 | w, h = img_group[0].size 20 | th, tw = self.size 21 | 22 | out_images = list() 23 | 24 | x1 = random.randint(0, w - tw) 25 | y1 = random.randint(0, h - th) 26 | 27 | for img in img_group: 28 | assert 
(img.size[0] == w and img.size[1] == h) 29 | if w == tw and h == th: 30 | out_images.append(img) 31 | else: 32 | out_images.append(img.crop((x1, y1, x1 + tw, y1 + th))) 33 | 34 | return out_images 35 | 36 | 37 | class GroupCenterCrop(object): 38 | def __init__(self, size): 39 | self.worker = torchvision.transforms.CenterCrop(size) 40 | 41 | def __call__(self, img_group): 42 | return [self.worker(img) for img in img_group] 43 | 44 | 45 | class GroupRandomHorizontalFlip(object): 46 | """Randomly horizontally flips the given PIL.Image with a probability of 0.5 47 | """ 48 | 49 | def __init__(self, is_flow=False): 50 | self.is_flow = is_flow 51 | 52 | def __call__(self, img_group, is_flow=False): 53 | v = random.random() 54 | if v < 0.5: 55 | ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group] 56 | if self.is_flow: 57 | for i in range(0, len(ret), 2): 58 | ret[i] = ImageOps.invert(ret[i]) # invert flow pixel values when flipping 59 | return ret 60 | else: 61 | return img_group 62 | 63 | 64 | class GroupNormalize(object): 65 | def __init__(self, mean, std): 66 | self.mean = mean 67 | self.std = std 68 | 69 | def __call__(self, tensor): 70 | rep_mean = self.mean * (tensor.size()[0] // len(self.mean)) 71 | rep_std = self.std * (tensor.size()[0] // len(self.std)) 72 | 73 | # TODO: make efficient 74 | for t, m, s in zip(tensor, rep_mean, rep_std): 75 | t.sub_(m).div_(s) 76 | 77 | return tensor 78 | 79 | 80 | class GroupScale(object): 81 | """ Rescales the input PIL.Image to the given 'size'. 82 | 'size' will be the size of the smaller edge. 83 | For example, if height > width, then image will be 84 | rescaled to (size * height / width, size) 85 | size: size of the smaller edge 86 | interpolation: Default: PIL.Image.BILINEAR 87 | """ 88 | 89 | def __init__(self, size, interpolation=Image.BILINEAR): 90 | self.worker = torchvision.transforms.Resize(size, interpolation) 91 | 92 | def __call__(self, img_group): 93 | return [self.worker(img) for img in img_group] 94 | 95 | 96 | class GroupOverSample(object): 97 | def __init__(self, crop_size, scale_size=None, flip=True): 98 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 99 | 100 | if scale_size is not None: 101 | self.scale_worker = GroupScale(scale_size) 102 | else: 103 | self.scale_worker = None 104 | self.flip = flip 105 | 106 | def __call__(self, img_group): 107 | 108 | if self.scale_worker is not None: 109 | img_group = self.scale_worker(img_group) 110 | 111 | image_w, image_h = img_group[0].size 112 | crop_w, crop_h = self.crop_size 113 | 114 | offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h) 115 | oversample_group = list() 116 | for o_w, o_h in offsets: 117 | normal_group = list() 118 | flip_group = list() 119 | for i, img in enumerate(img_group): 120 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 121 | normal_group.append(crop) 122 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 123 | 124 | if img.mode == 'L' and i % 2 == 0: 125 | flip_group.append(ImageOps.invert(flip_crop)) 126 | else: 127 | flip_group.append(flip_crop) 128 | 129 | oversample_group.extend(normal_group) 130 | if self.flip: 131 | oversample_group.extend(flip_group) 132 | return oversample_group 133 | 134 | 135 | class GroupFullResSample(object): 136 | def __init__(self, crop_size, scale_size=None, flip=True): 137 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 138 | 139 | if scale_size is not None: 140 | self.scale_worker 
= GroupScale(scale_size) 141 | else: 142 | self.scale_worker = None 143 | self.flip = flip 144 | 145 | def __call__(self, img_group): 146 | 147 | if self.scale_worker is not None: 148 | img_group = self.scale_worker(img_group) 149 | 150 | image_w, image_h = img_group[0].size 151 | crop_w, crop_h = self.crop_size 152 | 153 | w_step = (image_w - crop_w) // 4 154 | h_step = (image_h - crop_h) // 4 155 | 156 | offsets = list() 157 | offsets.append((0 * w_step, 2 * h_step)) # left 158 | offsets.append((4 * w_step, 2 * h_step)) # right 159 | offsets.append((2 * w_step, 2 * h_step)) # center 160 | 161 | oversample_group = list() 162 | for o_w, o_h in offsets: 163 | normal_group = list() 164 | flip_group = list() 165 | for i, img in enumerate(img_group): 166 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 167 | normal_group.append(crop) 168 | if self.flip: 169 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 170 | 171 | if img.mode == 'L' and i % 2 == 0: 172 | flip_group.append(ImageOps.invert(flip_crop)) 173 | else: 174 | flip_group.append(flip_crop) 175 | 176 | oversample_group.extend(normal_group) 177 | oversample_group.extend(flip_group) 178 | return oversample_group 179 | 180 | 181 | class GroupMultiScaleCrop(object): 182 | 183 | def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True): 184 | self.scales = scales if scales is not None else [1, .875, .75, .66] 185 | self.max_distort = max_distort 186 | self.fix_crop = fix_crop 187 | self.more_fix_crop = more_fix_crop 188 | self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size] 189 | self.interpolation = Image.BILINEAR 190 | 191 | def __call__(self, img_group): 192 | 193 | im_size = img_group[0].size 194 | 195 | crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size) 196 | crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group] 197 | ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation) 198 | for img in crop_img_group] 199 | return ret_img_group 200 | 201 | def _sample_crop_size(self, im_size): 202 | image_w, image_h = im_size[0], im_size[1] 203 | 204 | # find a crop size 205 | base_size = min(image_w, image_h) 206 | crop_sizes = [int(base_size * x) for x in self.scales] 207 | crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes] 208 | crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes] 209 | 210 | pairs = [] 211 | for i, h in enumerate(crop_h): 212 | for j, w in enumerate(crop_w): 213 | if abs(i - j) <= self.max_distort: 214 | pairs.append((w, h)) 215 | 216 | crop_pair = random.choice(pairs) 217 | if not self.fix_crop: 218 | w_offset = random.randint(0, image_w - crop_pair[0]) 219 | h_offset = random.randint(0, image_h - crop_pair[1]) 220 | else: 221 | w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1]) 222 | 223 | return crop_pair[0], crop_pair[1], w_offset, h_offset 224 | 225 | def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h): 226 | offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h) 227 | return random.choice(offsets) 228 | 229 | @staticmethod 230 | def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h): 231 | w_step = (image_w - crop_w) // 4 232 | h_step = (image_h - crop_h) // 4 233 | 234 | ret = list() 235 | ret.append((0, 0)) # upper left 236 | ret.append((4 * w_step, 
0)) # upper right 237 | ret.append((0, 4 * h_step)) # lower left 238 | ret.append((4 * w_step, 4 * h_step)) # lower right 239 | ret.append((2 * w_step, 2 * h_step)) # center 240 | 241 | if more_fix_crop: 242 | ret.append((0, 2 * h_step)) # center left 243 | ret.append((4 * w_step, 2 * h_step)) # center right 244 | ret.append((2 * w_step, 4 * h_step)) # lower center 245 | ret.append((2 * w_step, 0 * h_step)) # upper center 246 | 247 | ret.append((1 * w_step, 1 * h_step)) # upper left quarter 248 | ret.append((3 * w_step, 1 * h_step)) # upper right quarter 249 | ret.append((1 * w_step, 3 * h_step)) # lower left quarter 250 | ret.append((3 * w_step, 3 * h_step)) # lower righ quarter 251 | 252 | return ret 253 | 254 | 255 | class GroupRandomSizedCrop(object): 256 | """Random crop the given PIL.Image to a random size of (0.08 to 1.0) of the original size 257 | and and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio 258 | This is popularly used to train the Inception networks 259 | size: size of the smaller edge 260 | interpolation: Default: PIL.Image.BILINEAR 261 | """ 262 | 263 | def __init__(self, size, interpolation=Image.BILINEAR): 264 | self.size = size 265 | self.interpolation = interpolation 266 | 267 | def __call__(self, img_group): 268 | for attempt in range(10): 269 | area = img_group[0].size[0] * img_group[0].size[1] 270 | target_area = random.uniform(0.08, 1.0) * area 271 | aspect_ratio = random.uniform(3. / 4, 4. / 3) 272 | 273 | w = int(round(math.sqrt(target_area * aspect_ratio))) 274 | h = int(round(math.sqrt(target_area / aspect_ratio))) 275 | 276 | if random.random() < 0.5: 277 | w, h = h, w 278 | 279 | if w <= img_group[0].size[0] and h <= img_group[0].size[1]: 280 | x1 = random.randint(0, img_group[0].size[0] - w) 281 | y1 = random.randint(0, img_group[0].size[1] - h) 282 | found = True 283 | break 284 | else: 285 | found = False 286 | x1 = 0 287 | y1 = 0 288 | 289 | if found: 290 | out_group = list() 291 | for img in img_group: 292 | img = img.crop((x1, y1, x1 + w, y1 + h)) 293 | assert (img.size == (w, h)) 294 | out_group.append(img.resize((self.size, self.size), self.interpolation)) 295 | return out_group 296 | else: 297 | # Fallback 298 | scale = GroupScale(self.size, interpolation=self.interpolation) 299 | crop = GroupRandomCrop(self.size) 300 | return crop(scale(img_group)) 301 | 302 | 303 | class Stack(object): 304 | 305 | def __init__(self, roll=False): 306 | self.roll = roll 307 | 308 | def __call__(self, img_group): 309 | if img_group[0].mode == 'L': 310 | return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2) 311 | elif img_group[0].mode == 'RGB': 312 | if self.roll: 313 | return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2) 314 | else: 315 | return np.concatenate(img_group, axis=2) 316 | 317 | 318 | class ToTorchFormatTensor(object): 319 | """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255] 320 | to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] """ 321 | 322 | def __init__(self, div=True): 323 | self.div = div 324 | 325 | def __call__(self, pic): 326 | if isinstance(pic, np.ndarray): 327 | # handle numpy array 328 | img = torch.from_numpy(pic).permute(2, 0, 1).contiguous() 329 | else: 330 | # handle PIL Image 331 | img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes())) 332 | img = img.view(pic.size[1], pic.size[0], len(pic.mode)) 333 | # put it from HWC to CHW format 334 | # yikes, this transpose takes 80% of the loading time/CPU 335 | img = 
img.transpose(0, 1).transpose(0, 2).contiguous() 336 | return img.float().div(255) if self.div else img.float() 337 | 338 | 339 | class IdentityTransform(object): 340 | 341 | def __call__(self, data): 342 | return data 343 | 344 | 345 | if __name__ == "__main__": 346 | trans = torchvision.transforms.Compose([ 347 | GroupScale(256), 348 | GroupRandomCrop(224), 349 | Stack(), 350 | ToTorchFormatTensor(), 351 | GroupNormalize( 352 | mean=[.485, .456, .406], 353 | std=[.229, .224, .225] 354 | )] 355 | ) 356 | 357 | im = Image.open('../tensorflow-model-zoo.torch/lena_299.png') 358 | 359 | color_group = [im] * 3 360 | rst = trans(color_group) 361 | 362 | gray_group = [im.convert('L')] * 9 363 | gray_rst = trans(gray_group) 364 | 365 | trans2 = torchvision.transforms.Compose([ 366 | GroupRandomSizedCrop(256), 367 | Stack(), 368 | ToTorchFormatTensor(), 369 | GroupNormalize( 370 | mean=[.485, .456, .406], 371 | std=[.229, .224, .225]) 372 | ]) 373 | print(trans2(color_group)) 374 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/video_jpg.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import os 3 | import time 4 | import subprocess 5 | from tqdm import tqdm 6 | import argparse 7 | from multiprocessing import Pool 8 | 9 | parser = argparse.ArgumentParser(description="Dataset processor: Video->Frames") 10 | parser.add_argument("dir_path", type=str, help="original dataset path") 11 | parser.add_argument("dst_dir_path", type=str, help="dest path to save the frames") 12 | parser.add_argument("--prefix", type=str, default="image_%05d.jpg", help="output image type") 13 | parser.add_argument("--accepted_formats", type=str, default=[".mp4", ".mkv", ".webm"], nargs="+", 14 | help="list of input video formats") 15 | parser.add_argument("--begin", type=int, default=0) 16 | parser.add_argument("--end", type=int, default=666666666) 17 | parser.add_argument("--file_list", type=str, default="") 18 | parser.add_argument("--frame_rate", type=int, default=-1) 19 | parser.add_argument("--num_workers", type=int, default=16) 20 | parser.add_argument("--dry_run", action="store_true") 21 | parser.add_argument("--parallel", action="store_true") 22 | args = parser.parse_args() 23 | 24 | 25 | def par_job(command): 26 | if args.dry_run: 27 | print(command) 28 | else: 29 | subprocess.call(command, shell=True) 30 | 31 | 32 | if __name__ == "__main__": 33 | t0 = time.time() 34 | dir_path = args.dir_path 35 | dst_dir_path = args.dst_dir_path 36 | 37 | if args.file_list == "": 38 | file_names = sorted(os.listdir(dir_path)) 39 | else: 40 | file_names = [x.strip() for x in open(args.file_list).readlines()] 41 | del_list = [] 42 | for i, file_name in enumerate(file_names): 43 | if not any([x in file_name for x in args.accepted_formats]): 44 | del_list.append(i) 45 | file_names = [x for i, x in enumerate(file_names) if i not in del_list] 46 | file_names = file_names[args.begin:args.end + 1] 47 | print("%d videos to handle (after %d being removed)" % (len(file_names), len(del_list))) 48 | cmd_list = [] 49 | for file_name in tqdm(file_names): 50 | 51 | name, ext = os.path.splitext(file_name) 52 | dst_directory_path = os.path.join(dst_dir_path, name) 53 | 54 | video_file_path = os.path.join(dir_path, file_name) 55 | if not os.path.exists(dst_directory_path): 56 | os.makedirs(dst_directory_path, exist_ok=True) 57 | 58 | if args.frame_rate > 0: 59 | frame_rate_str = 
"-r %d" % args.frame_rate 60 | else: 61 | frame_rate_str = "" 62 | cmd = 'ffmpeg -nostats -loglevel 0 -i {} -vf scale=-1:360 {} {}/{}'.format(video_file_path, frame_rate_str, 63 | dst_directory_path, args.prefix) 64 | if not args.parallel: 65 | if args.dry_run: 66 | print(cmd) 67 | else: 68 | subprocess.call(cmd, shell=True) 69 | cmd_list.append(cmd) 70 | 71 | if args.parallel: 72 | with Pool(processes=args.num_workers) as pool: 73 | with tqdm(total=len(cmd_list)) as pbar: 74 | for _ in tqdm(pool.imap_unordered(par_job, cmd_list)): 75 | pbar.update() 76 | t1 = time.time() 77 | print("Finished in %.4f seconds" % (t1 - t0)) 78 | os.system("stty sane") 79 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/train_stage1.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0,1,2,3 python stage1.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=1 \ 5 | batch_size=64 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | glance_size=224 \ 9 | patch_size=144 \ 10 | random_patch=True \ 11 | epochs=10 \ 12 | backbone_lr=0.00001 \ 13 | fc_lr=0.01 \ 14 | lr_type=cos \ 15 | dropout=0.5 \ 16 | load_pretrained_focuser_fc=False \ 17 | dist_url=tcp://127.0.0.1:8816 \ 18 | eval_freq=1 \ 19 | start_eval=0 \ 20 | print_freq=25 \ 21 | workers=8 \ 22 | pretrained_glancer=PATH_TO_PRETRAINED_GLANCER \ 23 | pretrained_focuser=PATH_TO_PRETRAINED_FOCUSER # load the pretrained model 24 | 25 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/train_stage2.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python stage2.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=2 \ 5 | batch_size=64 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | glance_size=224 \ 9 | patch_size=144 \ 10 | random_patch=False \ 11 | epochs=50 \ 12 | policy_lr=0.0003 \ 13 | gamma=0.7 \ 14 | with_glancer=True \ 15 | ppo_continuous=True \ 16 | action_std=0.25 \ 17 | actorcritic_with_bn=True \ 18 | workers=8 \ 19 | eval_freq=1 \ 20 | pretrained=PATH_TO_STAGE1_PRETRAINED_MODEL # load the stage1 pretrained model 21 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/train_stage3.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=4,5,6,7 python stage3.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=3 \ 5 | batch_size=32 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | glance_size=224 \ 9 | patch_size=144 \ 10 | random_patch=False \ 11 | epochs=10 \ 12 | backbone_lr=0. 
\ 13 | fc_lr=0.0005 \ 14 | lr_type=cos \ 15 | workers=8 \ 16 | dropout=0 \ 17 | ppo_continuous=True \ 18 | action_std=0.25 \ 19 | actorcritic_with_bn=True \ 20 | with_glancer=True \ 21 | load_pretrained_s2_fc=True \ 22 | dist_url=tcp://127.0.0.1:8815 \ 23 | eval_freq=1 \ 24 | start_eval=0 \ 25 | print_freq=25 \ 26 | amp=False \ 27 | multiprocessing_distributed=False \ 28 | pretrained_s2=PATH_TO_STAGE2_PRETRAINED_MODEL # load the stage2 pretrained model 29 | 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AdaFocus (ICCV-2021 Oral) 2 | 3 | This repo contains the official code and pre-trained models for AdaFocus. 4 | 5 | - [Adaptive Focus for Efficient Video Recognition](http://arxiv.org/abs/2105.03245) 6 | 7 | **Update: The latest version of the AdaFocus series, [Uni-AdaFocus (TPAMI'24)](https://github.com/LeapLabTHU/Uni-AdaFocus) has been released!** This repository is no longer maintained. 8 | 9 | ## Reference 10 | If you find our code or paper useful for your research, please cite: 11 | ``` 12 | @InProceedings{wang2021adafocus, 13 | author = {Wang, Yulin and Chen, Zhaoxi and Jiang, Haojun and Song, Shiji and Han, Yizeng and Huang, Gao}, 14 | title = {Adaptive Focus for Efficient Video Recognition}, 15 | booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, 16 | month = {October}, 17 | year = {2021} 18 | } 19 | 20 | @InProceedings{wang2022adafocusv2, 21 | author = {Wang, Yulin and Yue, Yang and Lin, Yuanze and Jiang, Haojun and Lai, Zihang and Kulikov, Victor and Orlov, Nikita and Shi, Humphrey and Huang, Gao}, 22 | title = {AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition}, 23 | booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 24 | year = {2022} 25 | } 26 | ``` 27 | 28 | ## Introduction 29 | 30 | In this paper, we explore the spatial redundancy in video recognition with the aim to improve the computational efficiency. It is observed that the most informative region in each frame of a video is usually a small image patch, which shifts smoothly across frames. Therefore, we model the patch localization problem as a sequential decision task, and propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus). In specific, a light-weighted ConvNet is first adopted to quickly process the full video sequence, whose features are used by a recurrent policy network to localize the most task-relevant regions. Then the selected patches are inferred by a high-capacity network for the final prediction. During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices. In addition, we demonstrate that the proposed method can be easily extended by further considering the temporal redundancy, e.g., dynamically skipping less valuable frames. Extensive experiments on five benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1\&V2, demonstrate that our method is significantly more efficient than the competitive baselines. 31 | 32 | 33 |

[figure: ./figure/intro.png] 34 | 35 | 36 | 37 | 38 | ## Result 39 | 40 | - ActivityNet 41 | 42 | [figure: ./figure/actnet.png] 43 | 44 | 45 | 46 | 47 | - Something-Something V1&V2 48 | 49 | [figure: ./figure/sthsth.png] 50 | 51 | 52 | 53 | 54 | - Visualization 55 | 56 | [figure: ./figure/visualization.png] 57 | 58 | 
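As a reading aid, the glance-and-focus pipeline described in the Introduction can be summarized by the minimal sketch below. This is illustrative pseudocode under assumed interfaces: the `glancer`, `policy_rnn`, `focuser`, and `classifier` callables, their signatures, and the patch-cropping logic are hypothetical, not the implementation in this repo (the real models and training code live in `models/gfv_net.py` and `models/ppo.py` of each experiment folder).

```python
import torch
import torch.nn.functional as F


def adafocus_inference(frames, glancer, policy_rnn, focuser, classifier, patch_size=128):
    """Illustrative glance-and-focus loop (hypothetical API, not the repo's gfv_net.py).

    frames:     (T, 3, H, W) video frames, with H and W >= patch_size
    glancer:    lightweight CNN run on downscaled full frames
    policy_rnn: recurrent policy returning a patch location in [0, 1]^2 per frame
    focuser:    high-capacity CNN applied only to the cropped patches
    classifier: aggregates the per-frame patch features into a prediction
    """
    T, _, H, W = frames.shape
    hidden = None
    patch_feats = []
    for t in range(T):
        # 1) cheap global glance at low resolution
        glance = F.interpolate(frames[t:t + 1], size=(96, 96))
        global_feat = glancer(glance)                     # (1, D)
        # 2) recurrent policy picks where to look next (assumed output shape (1, 2) in [0, 1])
        action, hidden = policy_rnn(global_feat, hidden)
        cy = int(action[0, 0] * (H - patch_size))
        cx = int(action[0, 1] * (W - patch_size))
        patch = frames[t:t + 1, :, cy:cy + patch_size, cx:cx + patch_size]
        # 3) expensive local look at the selected patch only
        patch_feats.append(focuser(patch))
    # 4) temporal aggregation of the per-frame patch features
    return classifier(torch.stack(patch_feats, dim=1))    # (1, num_classes)
```

In the actual method these components are trained in the three stages driven by the shell scripts above, with the patch-selection policy optimized by PPO.
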
59 | 60 | ## Get Started 61 | 62 | Please go to the folder [Experiments on ActivityNet, FCVID and Mini-Kinetics](Experiments%20on%20ActivityNet,%20FCVID%20and%20Mini-Kinetics/) and [Experiments on Something-Something V1&V2](Experiments%20on%20Something-Something%20V1&V2) for specific docs. 63 | 64 | 65 | ## Contact 66 | If you have any question, feel free to contact the authors or raise an issue. 67 | Yulin Wang: wang-yl19@mails.tsinghua.edu.cn. 68 | -------------------------------------------------------------------------------- /figure/actnet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/actnet.png -------------------------------------------------------------------------------- /figure/intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/intro.png -------------------------------------------------------------------------------- /figure/sthsth.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/sthsth.png -------------------------------------------------------------------------------- /figure/visualization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/visualization.png --------------------------------------------------------------------------------