├── .gitignore ├── Experiments on ActivityNet, FCVID and Mini-Kinetics ├── README.md ├── basic_tools │ ├── __init__.py │ ├── checkpoint.py │ ├── logger.py │ └── utils.py ├── conf │ └── default.yaml ├── main_dist.py ├── models │ ├── gfv_net.py │ ├── mobilenet.py │ ├── ppo.py │ ├── resnet.py │ └── utils.py └── ops │ ├── dataset.py │ ├── dataset_config.py │ ├── transforms.py │ ├── utils.py │ └── video_jpg.py ├── Experiments on Something-Something V1&V2 ├── README.md ├── basic_tools │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-38.pyc │ │ ├── checkpoint.cpython-38.pyc │ │ ├── logger.cpython-38.pyc │ │ └── utils.cpython-38.pyc │ ├── checkpoint.py │ ├── logger.py │ └── utils.py ├── conf │ ├── evaluate.yaml │ ├── stage1.yaml │ ├── stage2.yaml │ └── stage3.yaml ├── evaluate.py ├── evaluate.sh ├── models │ ├── __pycache__ │ │ ├── ppo.cpython-38.pyc │ │ ├── ppo_continuous.cpython-38.pyc │ │ ├── resnet.cpython-38.pyc │ │ ├── tsn.cpython-38.pyc │ │ └── utils.cpython-38.pyc │ ├── constant.py │ ├── gfv_net.py │ ├── mobilenetv2.py │ ├── ppo.py │ ├── ppo_continuous.py │ ├── resnet.py │ ├── tsn.py │ └── utils.py ├── ops │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-38.pyc │ │ ├── temporal_shift.cpython-38.pyc │ │ ├── transforms.cpython-38.pyc │ │ └── utils.cpython-38.pyc │ ├── basic_ops.py │ ├── dataset.py │ ├── dataset_config.py │ ├── models_ada.py │ ├── my_logger.py │ ├── net_flops_table.py │ ├── sal_rank_loss.py │ ├── temporal_shift.py │ ├── transforms.py │ ├── utils.py │ └── video_jpg.py ├── stage1.py ├── stage2.py ├── stage3.py ├── train_stage1.sh ├── train_stage2.sh └── train_stage3.sh ├── README.md └── figure ├── actnet.png ├── intro.png ├── sthsth.png └── visualization.png /.gitignore: -------------------------------------------------------------------------------- 1 | outputs/ 2 | *.tar 3 | Experiments on ActivityNet, FCVID and Mini-Kinetics/.DS_Store 4 | .idea 5 | *.pyc 6 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/README.md: -------------------------------------------------------------------------------- 1 | # Experiments on ActivityNet, FCVID and Mini-Kinetics 2 | 3 | ## Requirements 4 | - Python 3.8 5 | - PyTorch 1.7.0 6 | - torchvision 0.8.0 7 | - [hydra](https://hydra.cc/docs/intro/) 1.1.0 8 | 9 | ## Datasets 10 | 1. Please get the train/test split files for each dataset from [Google Drive](https://drive.google.com/drive/folders/1L41U4mczsrnwiSx3KiY57BblrRE5fjnU?usp=sharing) and put them in `PATH_TO_DATASET`. 11 | 2. Download videos from the following links, or contact the corresponding authors for access. Save them to `PATH_TO_DATASET/videos`. 12 | - [ActivityNet-v1.3](http://activity-net.org/download.html) 13 | - [FCVID](https://drive.google.com/drive/folders/1cPSc3neTQwvtSPiVcjVZrj0RvXrKY5xj) 14 | - [Mini-Kinetics](https://deepmind.com/research/open-source/kinetics). Please download [Kinetics 400](https://storage.googleapis.com/deepmind-media/Datasets/kinetics400.tar.gz). For the Mini-Kinetics split used in our paper, you need the train/val split files from [AR-Net](https://github.com/mengyuest/AR-Net#dataset-preparation). 15 | 3. Extract frames with [ops/video_jpg.py](ops/video_jpg.py); the frames will be saved to `PATH_TO_DATASET/frames`. Minor modifications to the file paths are needed when extracting frames from the different datasets (see the example command below).
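For reference, a typical frame-extraction call using the arguments defined in [ops/video_jpg.py](ops/video_jpg.py) might look like the following (the paths and worker count are placeholders; adjust them to your setup):
```
python ops/video_jpg.py PATH_TO_DATASET/videos PATH_TO_DATASET/frames --num_workers 16 --parallel
```
The default `--prefix image_%05d.jpg` matches the frame naming expected by [ops/dataset_config.py](ops/dataset_config.py).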
16 | 17 | 18 | 19 | ## Pre-trained Models on ActivityNet 20 | 21 | Please download pre-trained weights and checkpoints from [Google Drive](https://drive.google.com/drive/folders/1v5UnucCr2CjmH41HEJePPI2WtDO8SSsp?usp=sharing). 22 | 23 | - globalcnn.pth.tar: pre-trained weights for the global CNN (MobileNet-v2). 24 | - localcnn.pth.tar: pre-trained weights for the local CNN (ResNet-50). 25 | - 128checkpoint.pth.tar: checkpoint of stage 1 with patch size 128x128. 26 | - 160checkpoint.pth.tar: checkpoint of stage 1 with patch size 160x160. 27 | - 192checkpoint.pth.tar: checkpoint of stage 1 with patch size 192x192. 28 | - 128s3_checkpoint.pth.tar: checkpoint to reproduce the result in the paper with patch size 128x128. 29 | - 160s3_checkpoint.pth.tar: checkpoint to reproduce the result in the paper with patch size 160x160. 30 | - 192s3_checkpoint.pth.tar: checkpoint to reproduce the result in the paper with patch size 192x192. 31 | 32 | ## Training 33 | 34 | - Here we take training the model with patch size 128x128 on the ActivityNet dataset as an example. 35 | - All logs and checkpoints will be saved in the directory `./outputs/YYYY-MM-DD/HH-MM-SS`. 36 | - Note that we store a set of default hyper-parameters in [conf/default.yaml](conf/default.yaml), which can be overridden from the command line. You can also use your own config files (see the sketch at the end of this section). 37 | - Before training, please initialize the Global CNN and Local CNN by fine-tuning the ImageNet pre-trained models in PyTorch using the following commands: 38 | 39 | for Global CNN: 40 | ``` 41 | CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=true 42 | ``` 43 | for Local CNN: 44 | ``` 45 | CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=false 46 | ``` 47 | 48 | - Training stage 1; pre-trained weights for the Global CNN and Local CNN are required: 49 | ``` 50 | CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=1 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrained_glancer=PATH_TO_CHECKPOINTS pretrained_focuser=PATH_TO_CHECKPOINTS 51 | ``` 52 | 53 | - Training stage 2; a stage-1 checkpoint is required: 54 | ``` 55 | CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=2 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false 56 | ``` 57 | 58 | - Training stage 3; a stage-2 checkpoint is required: 59 | ``` 60 | CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=3 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.005 epochs=10 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false 61 | ``` 62 |
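The commands above override individual entries of [conf/default.yaml](conf/default.yaml) with Hydra's `key=value` syntax. If you prefer keeping a whole experiment configuration in its own file, Hydra 1.1 can load an alternative config by name. A minimal sketch, assuming `main_dist.py` points Hydra at the `conf/` directory and `conf/my_exp.yaml` is a hypothetical copy of `default.yaml` with your edits:
```
CUDA_VISIBLE_DEVICES=0,1 python main_dist.py --config-name my_exp data_dir=PATH_TO_DATASET
```
Command-line overrides such as `data_dir=...` still apply on top of the selected config file.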
63 | ## Evaluate Pre-trained Models 64 | - Here we take evaluating the model with patch size 128x128 on ActivityNet as an example. 65 | ``` 66 | CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=3 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.005 epochs=10 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false evaluate=true 67 | ``` 68 | 69 | 70 | ## Acknowledgement 71 | We use the implementation of MobileNet-v2 and ResNet from the [PyTorch source code](https://pytorch.org/vision/stable/_modules/torchvision/models/mobilenetv2.html). We also borrow some code for dataset preparation from [AR-Net](https://github.com/mengyuest/AR-Net#dataset-preparation) and the PPO implementation from [here](https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py). 72 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/__init__.py: -------------------------------------------------------------------------------- 1 | import basic_tools.utils as utils 2 | import basic_tools.logger as logger 3 | import basic_tools.checkpoint as checkpoint 4 | 5 | import sys 6 | import os 7 | from omegaconf import DictConfig, OmegaConf 8 | 9 | def start(args): 10 | # checkpoint.init_checkpoint() 11 | 12 | # sys.stdout = logger.Logger("./log.log", mode="w") 13 | # sys.stderr = logger.Logger("./log.err", mode="w") 14 | 15 | cmd_line = " ".join(sys.argv) 16 | print(f"{cmd_line}") 17 | print(f"Working dir: {os.getcwd()}") 18 | 19 | print(OmegaConf.to_yaml(args)) 20 | return OmegaConf.to_yaml(args) 21 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/checkpoint.py: -------------------------------------------------------------------------------- 1 | import time 2 | import signal 3 | import os 4 | import sys 5 | import torch 6 | import socket 7 | 8 | 9 | ''' 10 | Usage: 11 | 12 | init_checkpoint() 13 | 14 | if exist_checkpoint(): 15 | any_object = load_checkpoint() 16 | 17 | save_checkpoint(any_object) 18 | ''' 19 | 20 | CHECKPOINT_filename = 'checkpoint.pth.tar' 21 | CHECKPOINT_tempfile = 'checkpoint.temp' 22 | SIGNAL_RECEIVED = False 23 | 24 | def SIGTERMHandler(a, b): 25 | print('received sigterm') 26 | pass 27 | 28 | 29 | def signalHandler(a, b): 30 | global SIGNAL_RECEIVED 31 | print('Signal received', a, time.time(), flush=True) 32 | SIGNAL_RECEIVED = True 33 | 34 | print("caught signal", a) 35 | print(socket.gethostname(), "USR1 signal caught.") 36 | # do other stuff to cleanup here 37 | print('requeuing job ' + os.environ['SLURM_JOB_ID']) 38 | os.system('scontrol requeue ' + os.environ['SLURM_JOB_ID']) 39 | sys.exit(-1) 40 | 41 | 42 | def init_checkpoint(): 43 | signal.signal(signal.SIGUSR1, signalHandler) 44 | signal.signal(signal.SIGTERM, SIGTERMHandler) 45 | print('Signal handler installed', flush=True) 46 | 47 | def save_checkpoint(state): 48 | global CHECKPOINT_filename, CHECKPOINT_tempfile 49 | torch.save(state, CHECKPOINT_tempfile) 50 | if os.path.isfile(CHECKPOINT_tempfile): 51 | os.rename(CHECKPOINT_tempfile, CHECKPOINT_filename) 52 | print("Checkpoint done") 53 | 54 | def save_checkpoint_if_signal(state): 55 | global SIGNAL_RECEIVED 56 | if 
SIGNAL_RECEIVED: 57 | save_checkpoint(state) 58 | 59 | def exist_checkpoint(): 60 | global CHECKPOINT_filename 61 | return os.path.isfile(CHECKPOINT_filename) 62 | 63 | def load_checkpoint(filename=None): 64 | global CHECKPOINT_filename 65 | if filename is None: 66 | filename = CHECKPOINT_filename 67 | 68 | # optionally resume from a checkpoint 69 | # if args.resume: 70 | #if os.path.isfile(args.resume): 71 | # To make the script simple to understand, we do resume whenever there is 72 | # a checkpoint file 73 | if os.path.isfile(filename): 74 | print(f"=> loading checkpoint {filename}") 75 | checkpoint = torch.load(filename) 76 | print(f"=> loaded checkpoint {filename}") 77 | return checkpoint 78 | else: 79 | raise RuntimeError(f"=> no checkpoint found at '{filename}'") 80 | 81 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/logger.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import logging 4 | 5 | class Logger: 6 | def __init__(self, path, mode='w'): 7 | assert mode in {'w', 'a'}, 'unknown mode for logger %s' % mode 8 | 9 | fh = logging.FileHandler(path, mode=mode) 10 | formatter = logging.Formatter('[%(asctime)s][%(name)s][%(levelname)s] - %(message)s') 11 | fh.setFormatter(formatter) 12 | # ch = logging.StreamHandler(sys.__stdout__) 13 | 14 | self.logger = logging.getLogger() 15 | self.logger.addHandler(fh) 16 | # self.logger.addHandler(ch) 17 | 18 | def write(self, message): 19 | if message == "\n": return 20 | # Remove \n at the end. 21 | self.logger.info(message.strip()) 22 | 23 | def flush(self): 24 | # for python 3 compatibility. 25 | pass 26 | 27 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/basic_tools/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import sys 3 | import random 4 | import numpy as np 5 | import os 6 | import subprocess 7 | 8 | from torch import optim 9 | 10 | def set_all_seeds(rand_seed): 11 | random.seed(rand_seed) 12 | np.random.seed(rand_seed) 13 | torch.manual_seed(rand_seed) 14 | torch.cuda.manual_seed(rand_seed) 15 | 16 | def to_cpu(x): 17 | if isinstance(x, dict): 18 | return { k : to_cpu(v) for k, v in x.items() } 19 | elif isinstance(x, list): 20 | return [ to_cpu(v) for v in x ] 21 | elif isinstance(x, torch.Tensor): 22 | return x.cpu() 23 | else: 24 | return x 25 | 26 | def model2numpy(model): 27 | return { k : v.cpu().numpy() for k, v in model.state_dict().items() } 28 | 29 | def activation2numpy(output): 30 | if isinstance(output, dict): 31 | return { k : activation2numpy(v) for k, v in output.items() } 32 | elif isinstance(output, list): 33 | return [ activation2numpy(v) for v in output ] 34 | elif isinstance(output, Variable): 35 | return output.data.cpu().numpy() 36 | 37 | def count_size(x): 38 | if isinstance(x, dict): 39 | return sum([ count_size(v) for k, v in x.items() ]) 40 | elif isinstance(x, list) or isinstance(x, tuple): 41 | return sum([ count_size(v) for v in x ]) 42 | elif isinstance(x, torch.Tensor): 43 | return x.nelement() * x.element_size() 44 | else: 45 | return sys.getsizeof(x) 46 | 47 | def mem2str(num_bytes): 48 | assert num_bytes >= 0 49 | if num_bytes >= 2 ** 30: # GB 50 | val = float(num_bytes) / (2 ** 30) 51 | result = "%.3f GB" % val 52 | elif num_bytes >= 2 ** 20: # MB 53 | val = float(num_bytes) / (2 
** 20) 54 | result = "%.3f MB" % val 55 | elif num_bytes >= 2 ** 10: # KB 56 | val = float(num_bytes) / (2 ** 10) 57 | result = "%.3f KB" % val 58 | else: 59 | result = "%d bytes" % num_bytes 60 | return result 61 | 62 | def get_mem_usage(): 63 | import psutil 64 | 65 | mem = psutil.virtual_memory() 66 | result = "" 67 | result += "available: %s\t" % (mem2str(mem.available)) 68 | result += "used: %s\t" % (mem2str(mem.used)) 69 | result += "free: %s\t" % (mem2str(mem.free)) 70 | # result += "active: %s\t" % (mem2str(mem.active)) 71 | # result += "inactive: %s\t" % (mem2str(mem.inactive)) 72 | # result += "buffers: %s\t" % (mem2str(mem.buffers)) 73 | # result += "cached: %s\t" % (mem2str(mem.cached)) 74 | # result += "shared: %s\t" % (mem2str(mem.shared)) 75 | # result += "slab: %s\t" % (mem2str(mem.slab)) 76 | return result 77 | 78 | def get_github_string(): 79 | _, output = subprocess.getstatusoutput("git -C ./ log --pretty=format:'%H' -n 1") 80 | ret, _ = subprocess.getstatusoutput("git -C ./ diff-index --quiet HEAD --") 81 | return f"Githash: {output}, unstaged: {ret}" 82 | 83 | 84 | def accumulate(all_y, y): 85 | if all_y is None: 86 | all_y = dict() 87 | for k, v in y.items(): 88 | if isinstance(v, list): 89 | all_y[k] = [ [vv] for vv in v ] 90 | else: 91 | all_y[k] = [v] 92 | else: 93 | for k, v in all_y.items(): 94 | if isinstance(y[k], list): 95 | for vv, yy in zip(v, y[k]): 96 | vv.append(yy) 97 | else: 98 | v.append(y[k]) 99 | 100 | return all_y 101 | 102 | def combine(all_y): 103 | output = dict() 104 | for k, v in all_y.items(): 105 | if isinstance(v[0], list): 106 | output[k] = [ torch.cat(vv) for vv in v ] 107 | else: 108 | output[k] = torch.cat(v) 109 | 110 | return output 111 | 112 | def concatOutput(loader, nets, condition=None): 113 | outputs = [None] * len(nets) 114 | 115 | use_cnn = nets[0].use_cnn 116 | 117 | with torch.no_grad(): 118 | for i, (x, _) in enumerate(loader): 119 | if not use_cnn: 120 | x = x.view(x.size(0), -1) 121 | x = x.cuda() 122 | 123 | outputs = [ accumulate(output, to_cpu(net(x))) for net, output in zip(nets, outputs) ] 124 | if condition is not None and not condition(i): 125 | break 126 | 127 | return [ combine(output) for output in outputs ] 128 | 129 | 130 | def adjust_learning_rate(args, optimizer, epoch): 131 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 132 | lrs = args.lr_steps.split('-') 133 | lr_steps = [int(lr) for lr in lrs] 134 | if args.lr_type == 'step': 135 | decay = 0.1 ** (sum(epoch >= np.array(lr_steps))) 136 | backbone_lr = args.backbone_lr * decay 137 | fc_lr = args.fc_lr * decay 138 | decay = args.weight_decay 139 | elif args.lr_type == 'cos': 140 | import math 141 | backbone_lr = 0.5 * args.backbone_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 142 | fc_lr = 0.5 * args.fc_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 143 | decay = args.weight_decay 144 | else: 145 | raise NotImplementedError 146 | 147 | if args.train_stage == 0: 148 | optimizer.param_groups[0]['lr'] = backbone_lr # Glancer 149 | optimizer.param_groups[1]['lr'] = backbone_lr # Focuser 150 | optimizer.param_groups[2]['lr'] = fc_lr # GRU 151 | elif args.train_stage == 1: 152 | optimizer.param_groups[0]['lr'] = backbone_lr # Focuser 153 | optimizer.param_groups[1]['lr'] = fc_lr # GRU 154 | elif args.train_stage == 2: 155 | pass 156 | elif args.train_stage == 3: 157 | optimizer.param_groups[0]['lr'] = fc_lr # GRU 158 | 159 | for param_group in optimizer.param_groups: 160 | # param_group['lr'] = lr 161 | 
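# Per-group learning rates were already assigned above according to args.train_stage
# (with lr_type=cos they follow 0.5 * base_lr * (1 + cos(pi * epoch / epochs)));
# the loop below only refreshes the weight decay for every parameter group.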
param_group['weight_decay'] = decay 162 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/conf/default.yaml: -------------------------------------------------------------------------------- 1 | dataset: actnet 2 | train_list: 3 | val_list: 4 | root_path: 5 | data_dir: 6 | resume: 7 | 8 | pretrained_glancer: 9 | pretrained_focuser: 10 | train_stage: 11 | 12 | pretrain_glancer: true 13 | arch: resnet 14 | num_segments: 16 15 | k: 3 16 | dropout: 0.5 17 | num_classes: 200 18 | evaluate: false 19 | eval_freq: 5 20 | 21 | dense_sample: false 22 | partial_fcvid_eval: false 23 | partial_ratio: 0.2 24 | ada_reso_skip: false 25 | reso_list: 224 26 | random_crop: false 27 | center_crop: false 28 | ada_crop_list: 29 | rescale_to: 224 30 | policy_input_offset: 0 31 | save_meta: false 32 | 33 | epochs: 50 34 | batch_size: 64 35 | backbone_lr: 0.01 36 | fc_lr: 0.005 37 | lr_type: cos # support step or cos 38 | lr_steps: 50-100 39 | momentum: 0.9 40 | weight_decay: 0.0001 41 | clip_grad: 20 42 | npb: true 43 | 44 | input_size: 224 45 | patch_size: 96 46 | glance_size: 224 47 | random_patch: false 48 | feature_map_channels: 1280 49 | action_dim: 49 50 | hidden_state_dim: 1024 #for policy network, focuser 51 | policy_conv: true 52 | hidden_dim: 1024 #for gru 53 | penalty: 0.5 54 | consensus: gru 55 | reward: random 56 | gamma: 0.7 57 | policy_lr: 0.0003 58 | with_glancer: true 59 | continuous: false 60 | 61 | seed: 1007 62 | gpus: 0 63 | gpu: 64 | workers: 16 65 | world_size: 1 66 | rank: 0 67 | dist_url: tcp://127.0.0.1:8888 68 | dist_backend: nccl 69 | multiprocessing_distributed: true 70 | distributed: 71 | amp: true 72 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/models/mobilenet.py: -------------------------------------------------------------------------------- 1 | from torch import nn 2 | from .utils import load_state_dict_from_url 3 | 4 | __all__ = ['MobileNetV2', 'mobilenet_v2'] 5 | 6 | 7 | model_urls = { 8 | 'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth', 9 | } 10 | 11 | 12 | def _make_divisible(v, divisor, min_value=None): 13 | """ 14 | This function is taken from the original tf repo. 15 | It ensures that all layers have a channel number that is divisible by 8 16 | It can be seen here: 17 | https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py 18 | :param v: 19 | :param divisor: 20 | :param min_value: 21 | :return: 22 | """ 23 | if min_value is None: 24 | min_value = divisor 25 | new_v = max(min_value, int(v + divisor / 2) // divisor * divisor) 26 | # Make sure that round down does not go down by more than 10%. 
27 | if new_v < 0.9 * v: 28 | new_v += divisor 29 | return new_v 30 | 31 | 32 | class ConvBNReLU(nn.Sequential): 33 | def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1): 34 | padding = (kernel_size - 1) // 2 35 | super(ConvBNReLU, self).__init__( 36 | nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False), 37 | nn.BatchNorm2d(out_planes), 38 | nn.ReLU6(inplace=True) 39 | ) 40 | 41 | 42 | class InvertedResidual(nn.Module): 43 | def __init__(self, inp, oup, stride, expand_ratio): 44 | super(InvertedResidual, self).__init__() 45 | self.stride = stride 46 | assert stride in [1, 2] 47 | 48 | hidden_dim = int(round(inp * expand_ratio)) 49 | self.use_res_connect = self.stride == 1 and inp == oup 50 | 51 | layers = [] 52 | if expand_ratio != 1: 53 | # pw 54 | layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1)) 55 | layers.extend([ 56 | # dw 57 | ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim), 58 | # pw-linear 59 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), 60 | nn.BatchNorm2d(oup), 61 | ]) 62 | self.conv = nn.Sequential(*layers) 63 | 64 | def forward(self, x): 65 | if self.use_res_connect: 66 | return x + self.conv(x) 67 | else: 68 | return self.conv(x) 69 | 70 | 71 | class MobileNetV2(nn.Module): 72 | def __init__(self, num_classes=1000, width_mult=1.0, inverted_residual_setting=None, round_nearest=8): 73 | """ 74 | MobileNet V2 main class 75 | 76 | Args: 77 | num_classes (int): Number of classes 78 | width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount 79 | inverted_residual_setting: Network structure 80 | round_nearest (int): Round the number of channels in each layer to be a multiple of this number 81 | Set to 1 to turn off rounding 82 | """ 83 | super(MobileNetV2, self).__init__() 84 | block = InvertedResidual 85 | input_channel = 32 86 | last_channel = 1280 87 | 88 | if inverted_residual_setting is None: 89 | inverted_residual_setting = [ 90 | # t, c, n, s 91 | [1, 16, 1, 1], 92 | [6, 24, 2, 2], 93 | [6, 32, 3, 2], 94 | [6, 64, 4, 2], 95 | [6, 96, 3, 1], 96 | [6, 160, 3, 2], 97 | [6, 320, 1, 1], 98 | ] 99 | 100 | # only check the first element, assuming user knows t,c,n,s are required 101 | if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4: 102 | raise ValueError("inverted_residual_setting should be non-empty " 103 | "or a 4-element list, got {}".format(inverted_residual_setting)) 104 | 105 | # building first layer 106 | input_channel = _make_divisible(input_channel * width_mult, round_nearest) 107 | self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest) 108 | features = [ConvBNReLU(3, input_channel, stride=2)] 109 | # building inverted residual blocks 110 | for t, c, n, s in inverted_residual_setting: 111 | output_channel = _make_divisible(c * width_mult, round_nearest) 112 | for i in range(n): 113 | stride = s if i == 0 else 1 114 | features.append(block(input_channel, output_channel, stride, expand_ratio=t)) 115 | input_channel = output_channel 116 | # building last several layers 117 | features.append(ConvBNReLU(input_channel, self.last_channel, kernel_size=1)) 118 | # make it nn.Sequential 119 | self.features = nn.Sequential(*features) 120 | 121 | # building classifier 122 | self.classifier = nn.Sequential( 123 | nn.Dropout(0.2), 124 | nn.Linear(self.last_channel, num_classes), 125 | ) 126 | 127 | # weight initialization 128 | for m in self.modules(): 129 | if isinstance(m, nn.Conv2d): 130 
| nn.init.kaiming_normal_(m.weight, mode='fan_out') 131 | if m.bias is not None: 132 | nn.init.zeros_(m.bias) 133 | elif isinstance(m, nn.BatchNorm2d): 134 | nn.init.ones_(m.weight) 135 | nn.init.zeros_(m.bias) 136 | elif isinstance(m, nn.Linear): 137 | nn.init.normal_(m.weight, 0, 0.01) 138 | nn.init.zeros_(m.bias) 139 | 140 | def forward(self, x): 141 | x = self.features(x) 142 | x = x.mean([2, 3]) 143 | x = self.classifier(x) 144 | return x 145 | 146 | def get_featmap(self, x): 147 | x = self.features(x) 148 | return x, x.mean([2, 3]) 149 | 150 | @property 151 | def feature_dim(self): 152 | return self.last_channel 153 | 154 | 155 | def mobilenet_v2(pretrained=False, progress=True, **kwargs): 156 | """ 157 | Constructs a MobileNetV2 architecture from 158 | `"MobileNetV2: Inverted Residuals and Linear Bottlenecks" `_. 159 | 160 | Args: 161 | pretrained (bool): If True, returns a model pre-trained on ImageNet 162 | progress (bool): If True, displays a progress bar of the download to stderr 163 | """ 164 | model = MobileNetV2(**kwargs) 165 | if pretrained: 166 | state_dict = load_state_dict_from_url(model_urls['mobilenet_v2'], 167 | progress=progress) 168 | model.load_state_dict(state_dict) 169 | return model 170 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/models/ppo.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import math 6 | from torch.distributions import Categorical 7 | 8 | 9 | class Memory: 10 | def __init__(self): 11 | self.actions = [] 12 | self.states = [] 13 | self.logprobs = [] 14 | self.rewards = [] 15 | self.is_terminals = [] 16 | self.hidden = [] 17 | 18 | def clear_memory(self): 19 | del self.actions[:] 20 | del self.states[:] 21 | del self.logprobs[:] 22 | del self.rewards[:] 23 | del self.is_terminals[:] 24 | del self.hidden[:] 25 | 26 | 27 | class ActorCritic(nn.Module): 28 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim=1024, policy_conv=True): 29 | super(ActorCritic, self).__init__() 30 | 31 | # encoder with convolution layer for MobileNetV3, EfficientNet and RegNet 32 | if policy_conv: 33 | self.state_encoder = nn.Sequential( 34 | nn.Conv2d(feature_dim, 32, kernel_size=1, stride=1, padding=0, bias=False), 35 | nn.ReLU(), 36 | nn.Flatten(), 37 | nn.Linear(int(state_dim * 32 / feature_dim), hidden_state_dim), 38 | nn.ReLU() 39 | ) 40 | # encoder with linear layer for ResNet and DenseNet 41 | else: 42 | self.state_encoder = nn.Sequential( 43 | nn.Linear(state_dim, 2048), 44 | nn.ReLU(), 45 | nn.Linear(2048, hidden_state_dim), 46 | nn.ReLU() 47 | ) 48 | 49 | self.gru = nn.GRU(hidden_state_dim, hidden_state_dim, batch_first=False) 50 | 51 | self.actor = nn.Sequential( 52 | nn.Linear(hidden_state_dim, action_dim), 53 | nn.Softmax(dim=-1)) 54 | 55 | self.critic = nn.Sequential( 56 | nn.Linear(hidden_state_dim, 1)) 57 | 58 | self.hidden_state_dim = hidden_state_dim 59 | self.action_dim = action_dim 60 | self.policy_conv = policy_conv 61 | self.feature_dim = feature_dim 62 | self.feature_ratio = int(math.sqrt(state_dim/feature_dim)) 63 | 64 | def forward(self): 65 | raise NotImplementedError 66 | 67 | def act(self, state_ini, memory, restart_batch=False, training=True): 68 | if restart_batch: 69 | del memory.hidden[:] 70 | memory.hidden.append(torch.zeros(1, state_ini.size(0), self.hidden_state_dim).cuda()) 71 | 72 | if 
not self.policy_conv: 73 | state = state_ini.flatten(1) 74 | else: 75 | state = state_ini 76 | 77 | # print(state.shape) 78 | state = self.state_encoder(state) 79 | 80 | state, hidden_output = self.gru(state.view(1, state.size(0), state.size(1)), memory.hidden[-1]) 81 | memory.hidden.append(hidden_output) 82 | 83 | state = state[0] 84 | action_probs = self.actor(state) 85 | dist = Categorical(action_probs) 86 | 87 | if training: 88 | action = dist.sample() 89 | action_logprob = dist.log_prob(action) 90 | memory.states.append(state_ini) 91 | memory.actions.append(action) 92 | memory.logprobs.append(action_logprob) 93 | else: 94 | action = action_probs.max(1)[1] 95 | 96 | return action 97 | 98 | def evaluate(self, state, action): 99 | seq_l = state.size(0) 100 | batch_size = state.size(1) 101 | 102 | if not self.policy_conv: 103 | state = state.flatten(2) 104 | state = state.view(seq_l * batch_size, state.size(2)) 105 | else: 106 | state = state.view(seq_l * batch_size, state.size(2), state.size(3), state.size(4)) 107 | 108 | state = self.state_encoder(state) 109 | state = state.view(seq_l, batch_size, -1) 110 | 111 | state, hidden = self.gru(state, torch.zeros(1, batch_size, state.size(2)).cuda()) 112 | state = state.view(seq_l * batch_size, -1) 113 | 114 | action_probs = self.actor(state) 115 | dist = Categorical(action_probs) 116 | action_logprobs = dist.log_prob(torch.squeeze(action.view(seq_l * batch_size, -1))).cuda() 117 | dist_entropy = dist.entropy().cuda() 118 | state_value = self.critic(state) 119 | 120 | return action_logprobs.view(seq_l, batch_size), \ 121 | state_value.view(seq_l, batch_size), \ 122 | dist_entropy.view(seq_l, batch_size) 123 | 124 | 125 | class PPO(nn.Module): 126 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv, gpu=0, 127 | lr=0.0003, betas=(0.9, 0.999), gamma=0.7, K_epochs=1, eps_clip=0.2): 128 | super(PPO, self).__init__() 129 | self.lr = lr 130 | self.betas = betas 131 | self.gamma = gamma 132 | self.eps_clip = eps_clip 133 | self.K_epochs = K_epochs 134 | 135 | self.policy = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 136 | 137 | self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas) 138 | 139 | self.policy_old = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 140 | self.policy_old.load_state_dict(self.policy.state_dict()) 141 | 142 | self.MseLoss = nn.MSELoss() 143 | 144 | def select_action(self, state, memory, restart_batch=False, training=True): 145 | return self.policy_old.act(state, memory, restart_batch, training) 146 | 147 | def update(self, memory): 148 | rewards = [] 149 | discounted_reward = 0 150 | 151 | for reward in reversed(memory.rewards): 152 | discounted_reward = reward + (self.gamma * discounted_reward) 153 | rewards.insert(0, discounted_reward) 154 | 155 | rewards = torch.cat(rewards, 0).cuda() 156 | 157 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5) 158 | 159 | old_states = torch.stack(memory.states, 0).cuda().detach() 160 | old_actions = torch.stack(memory.actions, 0).cuda().detach() 161 | old_logprobs = torch.stack(memory.logprobs, 0).cuda().detach() 162 | 163 | for _ in range(self.K_epochs): 164 | logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions) 165 | 166 | ratios = torch.exp(logprobs - old_logprobs.detach()) 167 | 168 | advantages = rewards - state_values.detach() 169 | surr1 = ratios * advantages 170 | surr2 = torch.clamp(ratios, 1 
- self.eps_clip, 1 + self.eps_clip) * advantages 171 | 172 | loss = -torch.min(surr1, surr2) + 0.5 * self.MseLoss(state_values, rewards) - 0.01 * dist_entropy 173 | 174 | self.optimizer.zero_grad() 175 | loss.mean().backward() 176 | self.optimizer.step() 177 | 178 | self.policy_old.load_state_dict(self.policy.state_dict()) -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/models/utils.py: -------------------------------------------------------------------------------- 1 | 2 | try: 3 | from torch.hub import load_state_dict_from_url 4 | except ImportError: 5 | from torch.utils.model_zoo import load_url as load_state_dict_from_url 6 | 7 | import torchvision 8 | import torch 9 | import numpy as np 10 | 11 | def prep_a_net(model_name, shall_pretrain): 12 | model = getattr(torchvision.models, model_name)(shall_pretrain) 13 | if "resnet" in model_name: 14 | model.last_layer_name = 'fc' 15 | elif "mobilenet_v2" in model_name: 16 | model.last_layer_name = 'classifier' 17 | return model 18 | 19 | def zero_pad(im, pad_size): 20 | """Performs zero padding (CHW format).""" 21 | pad_width = ((0, 0), (pad_size, pad_size), (pad_size, pad_size)) 22 | return np.pad(im, pad_width, mode="constant") 23 | 24 | def random_crop(im, size, pad_size=0): 25 | """Performs random crop (CHW format).""" 26 | if pad_size > 0: 27 | im = zero_pad(im=im, pad_size=pad_size) 28 | h, w = im.shape[1:] 29 | if size == h: 30 | return im 31 | y = np.random.randint(0, h - size) 32 | x = np.random.randint(0, w - size) 33 | im_crop = im[:, y : (y + size), x : (x + size)] 34 | assert im_crop.shape[1:] == (size, size) 35 | return im_crop 36 | 37 | def get_patch(images, action_sequence, patch_size): 38 | """Get small patch of the original image""" 39 | batch_size = images.size(0) 40 | image_size = images.size(2) 41 | 42 | patch_coordinate = torch.floor(action_sequence * (image_size - patch_size)).int() 43 | patches = [] 44 | for i in range(batch_size): 45 | per_patch = images[i, :, 46 | (patch_coordinate[i, 0].item()): ((patch_coordinate[i, 0] + patch_size).item()), 47 | (patch_coordinate[i, 1].item()): ((patch_coordinate[i, 1] + patch_size).item())] 48 | 49 | patches.append(per_patch.view(1, per_patch.size(0), per_patch.size(1), per_patch.size(2))) 50 | 51 | return torch.cat(patches, 0) -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/dataset.py: -------------------------------------------------------------------------------- 1 | import torch.utils.data as data 2 | import torch 3 | 4 | from PIL import Image 5 | import os 6 | import numpy as np 7 | from numpy.random import randint 8 | 9 | 10 | class VideoRecord(object): 11 | def __init__(self, row): 12 | self._data = row 13 | self._labels = torch.tensor([-1, -1, -1]) 14 | labels = sorted(list(set([int(x) for x in self._data[2:]]))) 15 | for i, l in enumerate(labels): 16 | self._labels[i] = l 17 | 18 | @property 19 | def path(self): 20 | return self._data[0] 21 | 22 | @property 23 | def num_frames(self): 24 | return int(self._data[1]) 25 | 26 | @property 27 | def label(self): 28 | if self._labels[-2] > -1: 29 | if self._labels[-1] > -1: 30 | return self._labels[torch.randperm(self._labels.shape[0])] 31 | else: 32 | if torch.rand(1) > 0.5: 33 | return self._labels[[0,1,2]] 34 | else: 35 | return self.label[[1,0,2]] 36 | else: 37 | return self._labels 38 | 39 | 40 | class TSNDataSet(data.Dataset): 41 | def __init__(self, 
root_path, list_file, 42 | num_segments=3, image_tmpl='img_{:05d}.jpg', transform=None, 43 | random_shift=True, test_mode=False, 44 | remove_missing=False, dense_sample=False, twice_sample=False, 45 | dataset=None, partial_fcvid_eval=False, partial_ratio=None, 46 | ada_reso_skip=False, reso_list=None, random_crop=False, center_crop=False, ada_crop_list=None, 47 | rescale_to=224, policy_input_offset=None, save_meta=False): 48 | 49 | self.root_path = root_path 50 | 51 | self.list_file = \ 52 | ".".join(list_file.split(".")[:-1]) + "." + list_file.split(".")[-1] # TODO 53 | self.num_segments = num_segments 54 | self.image_tmpl = image_tmpl 55 | self.transform = transform 56 | self.random_shift = random_shift 57 | self.test_mode = test_mode 58 | self.remove_missing = remove_missing 59 | self.dense_sample = dense_sample # using dense sample as I3D 60 | self.twice_sample = twice_sample # twice sample for more validation 61 | 62 | # TODO(yue) 63 | self.dataset = dataset 64 | self.partial_fcvid_eval = partial_fcvid_eval 65 | self.partial_ratio = partial_ratio 66 | self.ada_reso_skip = ada_reso_skip 67 | self.reso_list = reso_list 68 | self.random_crop = random_crop 69 | self.center_crop = center_crop 70 | self.ada_crop_list = ada_crop_list 71 | self.rescale_to = rescale_to 72 | self.policy_input_offset = policy_input_offset 73 | self.save_meta = save_meta 74 | 75 | if self.dense_sample: 76 | print('=> Using dense sample for the dataset...') 77 | if self.twice_sample: 78 | print('=> Using twice sample for the dataset...') 79 | 80 | self._parse_list() 81 | 82 | def _load_image(self, directory, idx): 83 | try: 84 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(idx))).convert('RGB')] 85 | except Exception: 86 | print('error loading image:', os.path.join(self.root_path, directory, self.image_tmpl.format(idx))) 87 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(1))).convert('RGB')] 88 | 89 | def _parse_list(self): 90 | # check the frame number is large >3: 91 | splitter = "," if self.dataset in ["actnet", "fcvid"] else " " 92 | if self.dataset == "kinetics": 93 | splitter = ";" 94 | tmp = [x.strip().split(splitter) for x in open(self.list_file)] 95 | 96 | if any(len(items) >= 3 for items in tmp) and self.dataset == "minik": 97 | tmp = [[splitter.join(x[:-2]), x[-2], x[-1]] for x in tmp] 98 | 99 | if self.dataset == "kinetics": 100 | tmp = [[x[0], x[-2], x[-1]] for x in tmp] 101 | 102 | if not self.test_mode or self.remove_missing: 103 | tmp = [item for item in tmp if int(item[1]) >= 3] 104 | 105 | if self.partial_fcvid_eval and self.dataset == "fcvid": 106 | tmp = tmp[:int(len(tmp) * self.partial_ratio)] 107 | 108 | self.video_list = [VideoRecord(item) for item in tmp] 109 | 110 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 111 | for v in self.video_list: 112 | v._data[1] = int(v._data[1]) / 2 113 | print('video number:%d' % (len(self.video_list))) 114 | 115 | def _sample_indices(self, record): 116 | """ 117 | :param record: VideoRecord 118 | :return: list 119 | """ 120 | if self.dense_sample: # i3d dense sample 121 | sample_pos = max(1, 1 + record.num_frames - 64) 122 | t_stride = 64 // self.num_segments 123 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 124 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments)] 125 | return np.array(offsets) + 1 126 | else: # normal sample 127 | average_duration = record.num_frames // self.num_segments 128 | if 
average_duration > 0: 129 | offsets = np.multiply(list(range(self.num_segments)), average_duration) + randint(average_duration, 130 | size=self.num_segments) 131 | elif record.num_frames > self.num_segments: 132 | offsets = np.sort(randint(record.num_frames, size=self.num_segments)) 133 | else: 134 | offsets = np.array( 135 | list(range(record.num_frames)) + [record.num_frames - 1] * (self.num_segments - record.num_frames)) 136 | return offsets + 1 137 | 138 | def _get_val_indices(self, record): 139 | if self.dense_sample: # i3d dense sample 140 | sample_pos = max(1, 1 + record.num_frames - 64) 141 | t_stride = 64 // self.num_segments 142 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 143 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments)] 144 | return np.array(offsets) + 1 145 | else: 146 | if record.num_frames > self.num_segments: 147 | tick = record.num_frames / float(self.num_segments) 148 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)]) 149 | else: 150 | offsets = np.array( 151 | list(range(record.num_frames)) + [record.num_frames - 1] * (self.num_segments - record.num_frames)) 152 | return offsets + 1 153 | 154 | def _get_test_indices(self, record): 155 | if self.dense_sample: 156 | sample_pos = max(1, 1 + record.num_frames - 64) 157 | t_stride = 64 // self.num_segments 158 | start_list = np.linspace(0, sample_pos - 1, num=10, dtype=int) 159 | offsets = [] 160 | for start_idx in start_list.tolist(): 161 | offsets += [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments)] 162 | return np.array(offsets) + 1 163 | elif self.twice_sample: 164 | tick = record.num_frames / float(self.num_segments) 165 | 166 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)] + 167 | [int(tick * x) for x in range(self.num_segments)]) 168 | 169 | return offsets + 1 170 | else: 171 | tick = record.num_frames / float(self.num_segments) 172 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)]) 173 | return offsets + 1 174 | 175 | def __getitem__(self, index): 176 | record = self.video_list[index] 177 | # check this is a legit video folder 178 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 179 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 180 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 181 | else: 182 | file_name = self.image_tmpl.format(1) 183 | full_path = os.path.join(self.root_path, record.path, file_name) 184 | 185 | err_cnt = 0 186 | while not os.path.exists(full_path): 187 | err_cnt += 1 188 | if err_cnt > 3: 189 | exit("Sth wrong with the dataloader to get items. Check your data path. 
Exit...") 190 | print('################## Not Found:', os.path.join(self.root_path, record.path, file_name)) 191 | index = np.random.randint(len(self.video_list)) 192 | record = self.video_list[index] 193 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 194 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 195 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 196 | else: 197 | file_name = self.image_tmpl.format(1) 198 | full_path = os.path.join(self.root_path, record.path, file_name) 199 | 200 | if not self.test_mode: 201 | segment_indices = self._sample_indices(record) if self.random_shift else self._get_val_indices(record) 202 | else: 203 | segment_indices = self._get_test_indices(record) 204 | return self.get(record, segment_indices) 205 | 206 | def get(self, record, indices): 207 | 208 | images = list() 209 | for seg_ind in indices: 210 | images.extend(self._load_image(record.path, int(seg_ind))) 211 | 212 | process_data = self.transform(images) 213 | if self.ada_reso_skip: 214 | return_items = [process_data] 215 | if self.random_crop: 216 | rescaled = [self.random_crop_proc(process_data, (x, x)) for x in self.reso_list[1:]] 217 | elif self.center_crop: 218 | rescaled = [self.center_crop_proc(process_data, (x, x)) for x in self.reso_list[1:]] 219 | else: 220 | rescaled = [self.rescale_proc(process_data, (x, x)) for x in self.reso_list[1:]] 221 | return_items = return_items + rescaled 222 | if self.save_meta: 223 | return_items = return_items + [record.path] + [indices] # [torch.tensor(indices)] 224 | return_items = return_items + [record.label] 225 | 226 | return tuple(return_items) 227 | else: 228 | if self.rescale_to == 224: 229 | rescaled = process_data 230 | else: 231 | x = self.rescale_to 232 | if self.random_crop: 233 | rescaled = self.random_crop_proc(process_data, (x, x)) 234 | elif self.center_crop: 235 | rescaled = self.center_crop_proc(process_data, (x, x)) 236 | else: 237 | rescaled = self.rescale_proc(process_data, (x, x)) 238 | 239 | return rescaled, record.label 240 | 241 | # TODO(yue) 242 | # (NC, H, W)->(NC, H', W') 243 | def rescale_proc(self, input_data, size): 244 | return torch.nn.functional.interpolate(input_data.unsqueeze(1), size=size, mode='nearest').squeeze(1) 245 | 246 | def center_crop_proc(self, input_data, size): 247 | h = input_data.shape[1] // 2 248 | w = input_data.shape[2] // 2 249 | return input_data[:, h - size[0] // 2:h + size[0] // 2, w - size[1] // 2:w + size[1] // 2] 250 | 251 | def random_crop_proc(self, input_data, size): 252 | H = input_data.shape[1] 253 | W = input_data.shape[2] 254 | input_data_nchw = input_data.view(-1, 3, H, W) 255 | batchsize = input_data_nchw.shape[0] 256 | return_list = [] 257 | hs0 = np.random.randint(0, H - size[0], batchsize) 258 | ws0 = np.random.randint(0, W - size[1], batchsize) 259 | for i in range(batchsize): 260 | return_list.append(input_data_nchw[i, :, hs0[i]:hs0[i] + size[0], ws0[i]:ws0[i] + size[1]]) 261 | return torch.stack(return_list).view(batchsize * 3, size[0], size[1]) 262 | 263 | def __len__(self): 264 | return len(self.video_list) -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/dataset_config.py: -------------------------------------------------------------------------------- 1 | from os.path import join as ospj 2 | 3 | def return_actnet(data_dir): 4 | filename_categories = ospj(data_dir, 'classInd.txt') 5 | root_data = data_dir + "/frames" 6 | 
filename_imglist_train = ospj(data_dir, 'actnet_train_split.txt') 7 | filename_imglist_val = ospj(data_dir, 'actnet_val_split.txt') 8 | prefix = 'image_{:05d}.jpg' 9 | 10 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 11 | 12 | 13 | def return_fcvid(data_dir): 14 | filename_categories = ospj(data_dir, 'classInd.txt') 15 | root_data = data_dir + "/frames" 16 | filename_imglist_train = ospj(data_dir, 'fcvid_train_split.txt') 17 | filename_imglist_val = ospj(data_dir, 'fcvid_val_split.txt') 18 | prefix = 'image_{:05d}.jpg' 19 | 20 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 21 | 22 | 23 | def return_minik(data_dir): 24 | filename_categories = ospj(data_dir, 'minik_classInd.txt') 25 | root_data = data_dir + "/frames" 26 | filename_imglist_train = ospj(data_dir, 'mini_train_videofolder.txt') 27 | filename_imglist_val = ospj(data_dir, 'mini_val_videofolder.txt') 28 | prefix = 'image_{:05d}.jpg' 29 | 30 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 31 | 32 | 33 | def return_dataset(dataset, data_dir): 34 | dict_single = {'actnet': return_actnet, 'fcvid': return_fcvid, 'minik': return_minik} 35 | if dataset in dict_single: 36 | file_categories, file_imglist_train, file_imglist_val, root_data, prefix = dict_single[dataset](data_dir) 37 | else: 38 | raise ValueError('Unknown dataset ' + dataset) 39 | 40 | if isinstance(file_categories, str): 41 | with open(file_categories) as f: 42 | lines = f.readlines() 43 | categories = [item.rstrip() for item in lines] 44 | else: # number of categories 45 | categories = [None] * file_categories 46 | n_class = len(categories) 47 | print('{}: {} classes'.format(dataset, n_class)) 48 | return n_class, file_imglist_train, file_imglist_val, root_data, prefix 49 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/transforms.py: -------------------------------------------------------------------------------- 1 | import torchvision 2 | import random 3 | from PIL import Image, ImageOps 4 | import numpy as np 5 | import numbers 6 | import math 7 | import torch 8 | 9 | 10 | class GroupRandomCrop(object): 11 | def __init__(self, size): 12 | if isinstance(size, numbers.Number): 13 | self.size = (int(size), int(size)) 14 | else: 15 | self.size = size 16 | 17 | def __call__(self, img_group): 18 | 19 | w, h = img_group[0].size 20 | th, tw = self.size 21 | 22 | out_images = list() 23 | 24 | x1 = random.randint(0, w - tw) 25 | y1 = random.randint(0, h - th) 26 | 27 | for img in img_group: 28 | assert (img.size[0] == w and img.size[1] == h) 29 | if w == tw and h == th: 30 | out_images.append(img) 31 | else: 32 | out_images.append(img.crop((x1, y1, x1 + tw, y1 + th))) 33 | 34 | return out_images 35 | 36 | 37 | class GroupCenterCrop(object): 38 | def __init__(self, size): 39 | self.worker = torchvision.transforms.CenterCrop(size) 40 | 41 | def __call__(self, img_group): 42 | return [self.worker(img) for img in img_group] 43 | 44 | 45 | class GroupRandomHorizontalFlip(object): 46 | """Randomly horizontally flips the given PIL.Image with a probability of 0.5 47 | """ 48 | 49 | def __init__(self, is_flow=False): 50 | self.is_flow = is_flow 51 | 52 | def __call__(self, img_group, is_flow=False): 53 | v = random.random() 54 | if v < 0.5: 55 | ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group] 56 | if self.is_flow: 57 | for i in range(0, 
len(ret), 2): 58 | ret[i] = ImageOps.invert(ret[i]) # invert flow pixel values when flipping 59 | return ret 60 | else: 61 | return img_group 62 | 63 | 64 | class GroupNormalize(object): 65 | def __init__(self, mean, std): 66 | self.mean = mean 67 | self.std = std 68 | 69 | def __call__(self, tensor): 70 | rep_mean = self.mean * (tensor.size()[0] // len(self.mean)) 71 | rep_std = self.std * (tensor.size()[0] // len(self.std)) 72 | 73 | # TODO: make efficient 74 | for t, m, s in zip(tensor, rep_mean, rep_std): 75 | t.sub_(m).div_(s) 76 | 77 | return tensor 78 | 79 | 80 | class GroupScale(object): 81 | """ Rescales the input PIL.Image to the given 'size'. 82 | 'size' will be the size of the smaller edge. 83 | For example, if height > width, then image will be 84 | rescaled to (size * height / width, size) 85 | size: size of the smaller edge 86 | interpolation: Default: PIL.Image.BILINEAR 87 | """ 88 | 89 | def __init__(self, size, interpolation=Image.BILINEAR): 90 | self.worker = torchvision.transforms.Resize(size, interpolation) 91 | 92 | def __call__(self, img_group): 93 | return [self.worker(img) for img in img_group] 94 | 95 | 96 | class GroupOverSample(object): 97 | def __init__(self, crop_size, scale_size=None, flip=True): 98 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 99 | 100 | if scale_size is not None: 101 | self.scale_worker = GroupScale(scale_size) 102 | else: 103 | self.scale_worker = None 104 | self.flip = flip 105 | 106 | def __call__(self, img_group): 107 | 108 | if self.scale_worker is not None: 109 | img_group = self.scale_worker(img_group) 110 | 111 | image_w, image_h = img_group[0].size 112 | crop_w, crop_h = self.crop_size 113 | 114 | offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h) 115 | oversample_group = list() 116 | for o_w, o_h in offsets: 117 | normal_group = list() 118 | flip_group = list() 119 | for i, img in enumerate(img_group): 120 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 121 | normal_group.append(crop) 122 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 123 | 124 | if img.mode == 'L' and i % 2 == 0: 125 | flip_group.append(ImageOps.invert(flip_crop)) 126 | else: 127 | flip_group.append(flip_crop) 128 | 129 | oversample_group.extend(normal_group) 130 | if self.flip: 131 | oversample_group.extend(flip_group) 132 | return oversample_group 133 | 134 | 135 | class GroupFullResSample(object): 136 | def __init__(self, crop_size, scale_size=None, flip=True): 137 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 138 | 139 | if scale_size is not None: 140 | self.scale_worker = GroupScale(scale_size) 141 | else: 142 | self.scale_worker = None 143 | self.flip = flip 144 | 145 | def __call__(self, img_group): 146 | 147 | if self.scale_worker is not None: 148 | img_group = self.scale_worker(img_group) 149 | 150 | image_w, image_h = img_group[0].size 151 | crop_w, crop_h = self.crop_size 152 | 153 | w_step = (image_w - crop_w) // 4 154 | h_step = (image_h - crop_h) // 4 155 | 156 | offsets = list() 157 | offsets.append((0 * w_step, 2 * h_step)) # left 158 | offsets.append((4 * w_step, 2 * h_step)) # right 159 | offsets.append((2 * w_step, 2 * h_step)) # center 160 | 161 | oversample_group = list() 162 | for o_w, o_h in offsets: 163 | normal_group = list() 164 | flip_group = list() 165 | for i, img in enumerate(img_group): 166 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 167 | normal_group.append(crop) 168 | if 
self.flip: 169 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 170 | 171 | if img.mode == 'L' and i % 2 == 0: 172 | flip_group.append(ImageOps.invert(flip_crop)) 173 | else: 174 | flip_group.append(flip_crop) 175 | 176 | oversample_group.extend(normal_group) 177 | oversample_group.extend(flip_group) 178 | return oversample_group 179 | 180 | 181 | class GroupMultiScaleCrop(object): 182 | 183 | def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True): 184 | self.scales = scales if scales is not None else [1, .875, .75, .66] 185 | self.max_distort = max_distort 186 | self.fix_crop = fix_crop 187 | self.more_fix_crop = more_fix_crop 188 | self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size] 189 | self.interpolation = Image.BILINEAR 190 | 191 | def __call__(self, img_group): 192 | 193 | im_size = img_group[0].size 194 | 195 | crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size) 196 | crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group] 197 | ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation) 198 | for img in crop_img_group] 199 | return ret_img_group 200 | 201 | def _sample_crop_size(self, im_size): 202 | image_w, image_h = im_size[0], im_size[1] 203 | 204 | # find a crop size 205 | base_size = min(image_w, image_h) 206 | crop_sizes = [int(base_size * x) for x in self.scales] 207 | crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes] 208 | crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes] 209 | 210 | pairs = [] 211 | for i, h in enumerate(crop_h): 212 | for j, w in enumerate(crop_w): 213 | if abs(i - j) <= self.max_distort: 214 | pairs.append((w, h)) 215 | 216 | crop_pair = random.choice(pairs) 217 | if not self.fix_crop: 218 | w_offset = random.randint(0, image_w - crop_pair[0]) 219 | h_offset = random.randint(0, image_h - crop_pair[1]) 220 | else: 221 | w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1]) 222 | 223 | return crop_pair[0], crop_pair[1], w_offset, h_offset 224 | 225 | def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h): 226 | offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h) 227 | return random.choice(offsets) 228 | 229 | @staticmethod 230 | def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h): 231 | w_step = (image_w - crop_w) // 4 232 | h_step = (image_h - crop_h) // 4 233 | 234 | ret = list() 235 | ret.append((0, 0)) # upper left 236 | ret.append((4 * w_step, 0)) # upper right 237 | ret.append((0, 4 * h_step)) # lower left 238 | ret.append((4 * w_step, 4 * h_step)) # lower right 239 | ret.append((2 * w_step, 2 * h_step)) # center 240 | 241 | if more_fix_crop: 242 | ret.append((0, 2 * h_step)) # center left 243 | ret.append((4 * w_step, 2 * h_step)) # center right 244 | ret.append((2 * w_step, 4 * h_step)) # lower center 245 | ret.append((2 * w_step, 0 * h_step)) # upper center 246 | 247 | ret.append((1 * w_step, 1 * h_step)) # upper left quarter 248 | ret.append((3 * w_step, 1 * h_step)) # upper right quarter 249 | ret.append((1 * w_step, 3 * h_step)) # lower left quarter 250 | ret.append((3 * w_step, 3 * h_step)) # lower righ quarter 251 | 252 | return ret 253 | 254 | 255 | class GroupRandomSizedCrop(object): 256 | """Random crop the given PIL.Image to a random size of (0.08 to 1.0) of the original 
size 257 | and and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio 258 | This is popularly used to train the Inception networks 259 | size: size of the smaller edge 260 | interpolation: Default: PIL.Image.BILINEAR 261 | """ 262 | 263 | def __init__(self, size, interpolation=Image.BILINEAR): 264 | self.size = size 265 | self.interpolation = interpolation 266 | 267 | def __call__(self, img_group): 268 | for attempt in range(10): 269 | area = img_group[0].size[0] * img_group[0].size[1] 270 | target_area = random.uniform(0.08, 1.0) * area 271 | aspect_ratio = random.uniform(3. / 4, 4. / 3) 272 | 273 | w = int(round(math.sqrt(target_area * aspect_ratio))) 274 | h = int(round(math.sqrt(target_area / aspect_ratio))) 275 | 276 | if random.random() < 0.5: 277 | w, h = h, w 278 | 279 | if w <= img_group[0].size[0] and h <= img_group[0].size[1]: 280 | x1 = random.randint(0, img_group[0].size[0] - w) 281 | y1 = random.randint(0, img_group[0].size[1] - h) 282 | found = True 283 | break 284 | else: 285 | found = False 286 | x1 = 0 287 | y1 = 0 288 | 289 | if found: 290 | out_group = list() 291 | for img in img_group: 292 | img = img.crop((x1, y1, x1 + w, y1 + h)) 293 | assert (img.size == (w, h)) 294 | out_group.append(img.resize((self.size, self.size), self.interpolation)) 295 | return out_group 296 | else: 297 | # Fallback 298 | scale = GroupScale(self.size, interpolation=self.interpolation) 299 | crop = GroupRandomCrop(self.size) 300 | return crop(scale(img_group)) 301 | 302 | 303 | class Stack(object): 304 | 305 | def __init__(self, roll=False): 306 | self.roll = roll 307 | 308 | def __call__(self, img_group): 309 | if img_group[0].mode == 'L': 310 | return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2) 311 | elif img_group[0].mode == 'RGB': 312 | if self.roll: 313 | return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2) 314 | else: 315 | return np.concatenate(img_group, axis=2) 316 | 317 | 318 | class ToTorchFormatTensor(object): 319 | """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255] 320 | to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] """ 321 | 322 | def __init__(self, div=True): 323 | self.div = div 324 | 325 | def __call__(self, pic): 326 | if isinstance(pic, np.ndarray): 327 | # handle numpy array 328 | img = torch.from_numpy(pic).permute(2, 0, 1).contiguous() 329 | else: 330 | # handle PIL Image 331 | img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes())) 332 | img = img.view(pic.size[1], pic.size[0], len(pic.mode)) 333 | # put it from HWC to CHW format 334 | # yikes, this transpose takes 80% of the loading time/CPU 335 | img = img.transpose(0, 1).transpose(0, 2).contiguous() 336 | return img.float().div(255) if self.div else img.float() 337 | 338 | 339 | class IdentityTransform(object): 340 | 341 | def __call__(self, data): 342 | return data 343 | 344 | 345 | if __name__ == "__main__": 346 | trans = torchvision.transforms.Compose([ 347 | GroupScale(256), 348 | GroupRandomCrop(224), 349 | Stack(), 350 | ToTorchFormatTensor(), 351 | GroupNormalize( 352 | mean=[.485, .456, .406], 353 | std=[.229, .224, .225] 354 | )] 355 | ) 356 | 357 | im = Image.open('../tensorflow-model-zoo.torch/lena_299.png') 358 | 359 | color_group = [im] * 3 360 | rst = trans(color_group) 361 | 362 | gray_group = [im.convert('L')] * 9 363 | gray_rst = trans(gray_group) 364 | 365 | trans2 = torchvision.transforms.Compose([ 366 | GroupRandomSizedCrop(256), 367 | Stack(), 368 | ToTorchFormatTensor(), 
369 | GroupNormalize( 370 | mean=[.485, .456, .406], 371 | std=[.229, .224, .225]) 372 | ]) 373 | print(trans2(color_group)) 374 | -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn.functional as F 4 | from PIL import Image, ImageDraw, ImageFont 5 | 6 | 7 | def softmax(scores): 8 | es = np.exp(scores - scores.max(axis=-1)[..., None]) 9 | return es / es.sum(axis=-1)[..., None] 10 | 11 | class AverageMeter(object): 12 | """Computes and stores the average and current value""" 13 | def __init__(self, name, fmt=':f'): 14 | self.name = name 15 | self.fmt = fmt 16 | self.reset() 17 | 18 | def reset(self): 19 | self.val = 0 20 | self.avg = 0 21 | self.sum = 0 22 | self.count = 0 23 | 24 | def update(self, val, n=1): 25 | self.val = val 26 | self.sum += val * n 27 | self.count += n 28 | self.avg = self.sum / self.count 29 | 30 | def __str__(self): 31 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 32 | return fmtstr.format(**self.__dict__) 33 | 34 | 35 | def accuracy(output, target, topk=(1,)): 36 | """Computes the accuracy over the k top predictions for the specified values of k""" 37 | with torch.no_grad(): 38 | maxk = max(topk) 39 | batch_size = target.size(0) 40 | 41 | _, pred = output.topk(maxk, 1, True, True) 42 | pred = pred.t() 43 | correct = pred.eq(target.reshape(1, -1).expand_as(pred)) 44 | 45 | res = [] 46 | for k in topk: 47 | correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True) 48 | res.append(correct_k.mul_(100.0 / batch_size)) 49 | return res 50 | 51 | def get_multi_hot(test_y, classes, assumes_starts_zero=True): 52 | bs = test_y.shape[0] 53 | label_cnt = 0 54 | 55 | # TODO ranking labels: (-1,-1,4,5,3,7)->(4,4,2,1,0,3) 56 | if not assumes_starts_zero: 57 | for label_val in torch.unique(test_y): 58 | if label_val >= 0: 59 | test_y[test_y == label_val] = label_cnt 60 | label_cnt += 1 61 | 62 | gt = torch.zeros(bs, classes + 1) # TODO(yue) +1 for -1 in multi-label case 63 | for i in range(test_y.shape[1]): 64 | gt[torch.LongTensor(range(bs)), test_y[:, i]] = 1 # TODO(yue) see? 
65 | 66 | return gt[:, :classes] 67 | 68 | def cal_map(output, old_test_y): 69 | batch_size = output.size(0) 70 | num_classes = output.size(1) 71 | ap = torch.zeros(num_classes) 72 | test_y = old_test_y.clone() 73 | 74 | gt = get_multi_hot(test_y, num_classes, False) 75 | 76 | probs = F.softmax(output, dim=1) 77 | 78 | rg = torch.range(1, batch_size).float() 79 | for k in range(num_classes): 80 | scores = probs[:, k] 81 | targets = gt[:, k] 82 | _, sortind = torch.sort(scores, 0, True) 83 | truth = targets[sortind] 84 | tp = truth.float().cumsum(0) 85 | precision = tp.div(rg) 86 | 87 | ap[k] = precision[truth.byte()].sum() / max(float(truth.sum()), 1) 88 | return ap.mean() * 100, ap * 100 89 | 90 | def cal_reward(confidence, confidence_last, patch_size_list, penalty=0.5): 91 | reward = confidence - confidence_last 92 | reward = reward - penalty*(patch_size_list/100.)**2 93 | return reward 94 | 95 | class ProgressMeter(object): 96 | def __init__(self, num_batches, *meters, prefix=""): 97 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 98 | self.meters = meters 99 | self.prefix = prefix 100 | 101 | def print(self, batch): 102 | entries = [self.prefix + self.batch_fmtstr.format(batch)] 103 | entries += [str(meter) for meter in self.meters] 104 | out = '\t'.join(entries) 105 | print(out) 106 | return out + '\n' 107 | 108 | def _get_batch_fmtstr(self, num_batches): 109 | num_digits = len(str(num_batches // 1)) 110 | fmt = '{:' + str(num_digits) + 'd}' 111 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 112 | 113 | class Recorder: 114 | def __init__(self, larger_is_better=True): 115 | self.history = [] 116 | self.larger_is_better = larger_is_better 117 | self.best_at = None 118 | self.best_val = None 119 | 120 | def is_better_than(self, x, y): 121 | if self.larger_is_better: 122 | return x > y 123 | else: 124 | return x < y 125 | 126 | def update(self, val): 127 | self.history.append(val) 128 | if len(self.history) == 1 or self.is_better_than(val, self.best_val): 129 | self.best_val = val 130 | self.best_at = len(self.history) - 1 131 | 132 | def is_current_best(self): 133 | return self.best_at == len(self.history) - 1 -------------------------------------------------------------------------------- /Experiments on ActivityNet, FCVID and Mini-Kinetics/ops/video_jpg.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import os 3 | import time 4 | import subprocess 5 | from tqdm import tqdm 6 | import argparse 7 | from multiprocessing import Pool 8 | 9 | parser = argparse.ArgumentParser(description="Dataset processor: Video->Frames") 10 | parser.add_argument("dir_path", type=str, help="original dataset path") 11 | parser.add_argument("dst_dir_path", type=str, help="dest path to save the frames") 12 | parser.add_argument("--prefix", type=str, default="image_%05d.jpg", help="output image type") 13 | parser.add_argument("--accepted_formats", type=str, default=[".mp4", ".mkv", ".webm"], nargs="+", 14 | help="list of input video formats") 15 | parser.add_argument("--begin", type=int, default=0) 16 | parser.add_argument("--end", type=int, default=666666666) 17 | parser.add_argument("--file_list", type=str, default="") 18 | parser.add_argument("--frame_rate", type=int, default=-1) 19 | parser.add_argument("--num_workers", type=int, default=16) 20 | parser.add_argument("--dry_run", action="store_true") 21 | parser.add_argument("--parallel", action="store_true") 22 | args = parser.parse_args() 23 | 24 
| 25 | def par_job(command): 26 | if args.dry_run: 27 | print(command) 28 | else: 29 | subprocess.call(command, shell=True) 30 | 31 | 32 | if __name__ == "__main__": 33 | t0 = time.time() 34 | dir_path = args.dir_path 35 | dst_dir_path = args.dst_dir_path 36 | 37 | if args.file_list == "": 38 | file_names = sorted(os.listdir(dir_path)) 39 | else: 40 | file_names = [x.strip() for x in open(args.file_list).readlines()] 41 | del_list = [] 42 | for i, file_name in enumerate(file_names): 43 | if not any([x in file_name for x in args.accepted_formats]): 44 | del_list.append(i) 45 | file_names = [x for i, x in enumerate(file_names) if i not in del_list] 46 | file_names = file_names[args.begin:args.end + 1] 47 | print("%d videos to handle (after %d being removed)" % (len(file_names), len(del_list))) 48 | cmd_list = [] 49 | for file_name in tqdm(file_names): 50 | 51 | name, ext = os.path.splitext(file_name) 52 | dst_directory_path = os.path.join(dst_dir_path, name) 53 | 54 | video_file_path = os.path.join(dir_path, file_name) 55 | if not os.path.exists(dst_directory_path): 56 | os.makedirs(dst_directory_path, exist_ok=True) 57 | 58 | if args.frame_rate > 0: 59 | frame_rate_str = "-r %d" % args.frame_rate 60 | else: 61 | frame_rate_str = "" 62 | cmd = 'ffmpeg -nostats -loglevel 0 -i {} -vf scale=-1:360 {} {}/{}'.format(video_file_path, frame_rate_str, 63 | dst_directory_path, args.prefix) 64 | if not args.parallel: 65 | if args.dry_run: 66 | print(cmd) 67 | else: 68 | subprocess.call(cmd, shell=True) 69 | cmd_list.append(cmd) 70 | 71 | if args.parallel: 72 | with Pool(processes=args.num_workers) as pool: 73 | with tqdm(total=len(cmd_list)) as pbar: 74 | for _ in tqdm(pool.imap_unordered(par_job, cmd_list)): 75 | pbar.update() 76 | t1 = time.time() 77 | print("Finished in %.4f seconds" % (t1 - t0)) 78 | os.system("stty sane") 79 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/README.md: -------------------------------------------------------------------------------- 1 | # Experiments on Something-Something V1&V2 2 | 3 | ## Requirements 4 | - python 3.8 5 | - pytorch 1.8.0 6 | - torchvision 0.8.0 7 | - [hydra](https://hydra.cc/docs/intro/) 1.1.0 8 | 9 | ## Datasets 10 | Please follow the instruction of [TSM](https://github.com/mit-han-lab/temporal-shift-module#data-preparation) to prepare the Something-Something V1/V2 dataset. 11 | 12 | ## Pre-trained Models on Something-Something-V1 (V2) 13 | 14 | Please download pre-trained weights and checkpoints from [Google Drive](https://drive.google.com/drive/folders/1QgIjU6FLT3RZbAGAVutgOPuOOOtBPpFb?usp=sharing). 15 | 16 | - Something-Something-V1 (V2) 17 | - mobilenetv2_segment8.pth.tar: pre-trained weights for global CNN (MobileNet-v2). 18 | - resnet50_segment12.pth.tar: pre-trained weights for local CNN (ResNet-50). 19 | - 144x144.pth.tar: checkpoint to reproduce the result in paper with patch size 144x144. 20 | - 160x160.pth.tar: checkpoint to reproduce the result in paper with patch size 160x160. 21 | - 176x176.pth.tar: checkpoint to reproduce the result in paper with patch size 176x176. 22 | 23 | ## Training 24 | 25 | - Here we take training the model with patch size 144x144 on Something-Something-V1 dataset for example. 26 | - All logs and checkpoints will be saved in the directory: `./outputs/YYYY-MM-DD/HH-MM-SS` 27 | - Note that we store a set of default hyper-parameters for each stage in [conf directory](conf) which can be overrided through command line. 
You can also use your own config files (see the command-line override example at the end of this README). 28 | 29 | - Before training, please initialize the Global CNN and Local CNN by fine-tuning the ImageNet pre-trained models in PyTorch with the following commands: 30 | 31 | For the Global CNN, please use the [TSM code](https://github.com/mit-han-lab/temporal-shift-module#data-preparation) with the following command: 32 | ``` 33 | python main.py something RGB \ 34 | --arch mobilenetv2 --num_segments 8 \ 35 | --gd 20 --lr 0.01 --lr_steps 20 40 --epochs 50 \ 36 | --batch-size 64 -j 16 --dropout 0.5 --consensus_type=avg --eval-freq=1 \ 37 | --shift --shift_div=8 --shift_place=blockres --npb 38 | ``` 39 | For the Local CNN, please use the [TSM code](https://github.com/mit-han-lab/temporal-shift-module#data-preparation) with the following command: 40 | ``` 41 | python main.py something RGB \ 42 | --arch resnet50 --num_segments 12 \ 43 | --gd 20 --lr 0.01 --lr_steps 20 40 --epochs 50 \ 44 | --batch-size 64 -j 16 --dropout 0.5 --consensus_type=avg --eval-freq=1 \ 45 | --shift --shift_div=8 --shift_place=blockres --npb 46 | ``` 47 | 48 | - Training stage 1: we provide the command in train_stage1.sh. The pre-trained weights for the Global CNN and Local CNN are required, so first set both the pretrained_glancer and pretrained_focuser arguments correctly in train_stage1.sh, then run it: 49 | ``` 50 | bash train_stage1.sh 51 | ``` 52 | 53 | - Training stage 2: we provide the command in train_stage2.sh. A stage-1 checkpoint is required, so set the pretrained argument correctly in train_stage2.sh, then run it: 54 | ``` 55 | bash train_stage2.sh 56 | ``` 57 | 58 | - Training stage 3: we provide the command in train_stage3.sh. A stage-2 checkpoint is required, so set the pretrained_s2 argument correctly in train_stage3.sh, then run it: 59 | ``` 60 | bash train_stage3.sh 61 | ``` 62 | 63 | 64 | ## Evaluate Pre-trained Models 65 | - Here we take evaluating the model with patch size 144x144 on the Something-Something-V1 dataset as an example. 66 | 67 | We provide the command in evaluate.sh. The pre-trained weights are required, so first set both the resume and patch_size arguments correctly in evaluate.sh, then run it: 68 | 69 | ``` 70 | bash evaluate.sh 71 | ``` 72 | 73 | ## Acknowledgement 74 | We use the official implementation of [temporal-shift-module](https://github.com/mit-han-lab/temporal-shift-module) and the PPO implementation from [here](https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py). 
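## Overriding Configs from the Command Line

The stage entry scripts read their defaults from the YAML files in the [conf directory](conf) via Hydra, so individual hyper-parameters can be overridden as `key=value` arguments on the command line, exactly as evaluate.sh does for conf/evaluate.yaml. The command below is a minimal sketch rather than a reference recipe: it assumes that train_stage1.sh simply launches stage1.py (which loads conf/stage1.yaml), and PATH_TO_DATASET / PATH_TO_*_CKPT are placeholders you need to fill in.

```
# Minimal sketch (assumed entry point: stage1.py with conf/stage1.yaml).
# Every key shown here is defined in conf/stage1.yaml; paths are placeholders.
CUDA_VISIBLE_DEVICES=0,1 python stage1.py \
    dataset=somethingv1 \
    data_dir=PATH_TO_DATASET \
    patch_size=160 \
    batch_size=64 \
    epochs=50 \
    random_patch=true \
    pretrained_glancer=PATH_TO_GLANCER_CKPT \
    pretrained_focuser=PATH_TO_FOCUSER_CKPT
```

Any key that is not overridden keeps the value stored in the corresponding YAML file.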
75 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__init__.py: -------------------------------------------------------------------------------- 1 | import basic_tools.utils as utils 2 | import basic_tools.logger as logger 3 | import basic_tools.checkpoint as checkpoint 4 | 5 | import sys 6 | import os 7 | 8 | def start(args): 9 | cmd_line = " ".join(sys.argv) 10 | print(f"{cmd_line}") 11 | print(f"Working dir: {os.getcwd()}") 12 | utils.set_all_seeds(args.seed) 13 | 14 | print(args) 15 | return args 16 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/checkpoint.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/checkpoint.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/logger.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/logger.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/basic_tools/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/checkpoint.py: -------------------------------------------------------------------------------- 1 | import time 2 | import signal 3 | import os 4 | import sys 5 | import torch 6 | import socket 7 | 8 | 9 | ''' 10 | Usage: 11 | 12 | init_checkpoint() 13 | 14 | if exist_checkpoint(): 15 | any_object = load_checkpoint() 16 | 17 | save_checkpoint(any_object) 18 | ''' 19 | 20 | CHECKPOINT_filename = 'checkpoint.pth.tar' 21 | CHECKPOINT_tempfile = 'checkpoint.temp' 22 | SIGNAL_RECEIVED = False 23 | 24 | def SIGTERMHandler(a, b): 25 | print('received sigterm') 26 | pass 27 | 28 | 29 | def signalHandler(a, b): 30 | global SIGNAL_RECEIVED 31 | print('Signal received', a, time.time(), flush=True) 32 | SIGNAL_RECEIVED = True 33 | 34 | print("caught signal", a) 35 | print(socket.gethostname(), "USR1 signal caught.") 36 | # do other stuff to cleanup here 37 | print('requeuing job ' + os.environ['SLURM_JOB_ID']) 38 | os.system('scontrol requeue ' + os.environ['SLURM_JOB_ID']) 39 | sys.exit(-1) 40 | 41 | 42 | def init_checkpoint(): 43 | 
signal.signal(signal.SIGUSR1, signalHandler) 44 | signal.signal(signal.SIGTERM, SIGTERMHandler) 45 | print('Signal handler installed', flush=True) 46 | 47 | def save_checkpoint(state): 48 | global CHECKPOINT_filename, CHECKPOINT_tempfile 49 | torch.save(state, CHECKPOINT_tempfile) 50 | if os.path.isfile(CHECKPOINT_tempfile): 51 | os.rename(CHECKPOINT_tempfile, CHECKPOINT_filename) 52 | print("Checkpoint done") 53 | 54 | def save_checkpoint_if_signal(state): 55 | global SIGNAL_RECEIVED 56 | if SIGNAL_RECEIVED: 57 | save_checkpoint(state) 58 | 59 | def exist_checkpoint(): 60 | global CHECKPOINT_filename 61 | return os.path.isfile(CHECKPOINT_filename) 62 | 63 | def load_checkpoint(filename=None): 64 | global CHECKPOINT_filename 65 | if filename is None: 66 | filename = CHECKPOINT_filename 67 | 68 | # optionally resume from a checkpoint 69 | # if args.resume: 70 | #if os.path.isfile(args.resume): 71 | # To make the script simple to understand, we do resume whenever there is 72 | # a checkpoint file 73 | if os.path.isfile(filename): 74 | print(f"=> loading checkpoint {filename}") 75 | checkpoint = torch.load(filename) 76 | print(f"=> loaded checkpoint {filename}") 77 | return checkpoint 78 | else: 79 | raise RuntimeError(f"=> no checkpoint found at '{filename}'") 80 | 81 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/logger.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import logging 4 | 5 | class Logger: 6 | def __init__(self, path, mode='w'): 7 | assert mode in {'w', 'a'}, 'unknown mode for logger %s' % mode 8 | 9 | fh = logging.FileHandler(path, mode=mode) 10 | formatter = logging.Formatter('[%(asctime)s][%(name)s][%(levelname)s] - %(message)s') 11 | fh.setFormatter(formatter) 12 | # ch = logging.StreamHandler(sys.__stdout__) 13 | 14 | self.logger = logging.getLogger() 15 | self.logger.addHandler(fh) 16 | # self.logger.addHandler(ch) 17 | 18 | def write(self, message): 19 | if message == "\n": return 20 | # Remove \n at the end. 21 | self.logger.info(message.strip()) 22 | 23 | def flush(self): 24 | # for python 3 compatibility. 
25 | pass 26 | 27 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/basic_tools/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import sys 3 | import random 4 | import numpy as np 5 | import os 6 | import subprocess 7 | 8 | from torch import optim 9 | 10 | def set_all_seeds(rand_seed): 11 | random.seed(rand_seed) 12 | np.random.seed(rand_seed) 13 | torch.manual_seed(rand_seed) 14 | torch.cuda.manual_seed(rand_seed) 15 | 16 | def to_cpu(x): 17 | if isinstance(x, dict): 18 | return { k : to_cpu(v) for k, v in x.items() } 19 | elif isinstance(x, list): 20 | return [ to_cpu(v) for v in x ] 21 | elif isinstance(x, torch.Tensor): 22 | return x.cpu() 23 | else: 24 | return x 25 | 26 | def model2numpy(model): 27 | return { k : v.cpu().numpy() for k, v in model.state_dict().items() } 28 | 29 | def activation2numpy(output): 30 | if isinstance(output, dict): 31 | return { k : activation2numpy(v) for k, v in output.items() } 32 | elif isinstance(output, list): 33 | return [ activation2numpy(v) for v in output ] 34 | elif isinstance(output, Variable): 35 | return output.data.cpu().numpy() 36 | 37 | def count_size(x): 38 | if isinstance(x, dict): 39 | return sum([ count_size(v) for k, v in x.items() ]) 40 | elif isinstance(x, list) or isinstance(x, tuple): 41 | return sum([ count_size(v) for v in x ]) 42 | elif isinstance(x, torch.Tensor): 43 | return x.nelement() * x.element_size() 44 | else: 45 | return sys.getsizeof(x) 46 | 47 | def mem2str(num_bytes): 48 | assert num_bytes >= 0 49 | if num_bytes >= 2 ** 30: # GB 50 | val = float(num_bytes) / (2 ** 30) 51 | result = "%.3f GB" % val 52 | elif num_bytes >= 2 ** 20: # MB 53 | val = float(num_bytes) / (2 ** 20) 54 | result = "%.3f MB" % val 55 | elif num_bytes >= 2 ** 10: # KB 56 | val = float(num_bytes) / (2 ** 10) 57 | result = "%.3f KB" % val 58 | else: 59 | result = "%d bytes" % num_bytes 60 | return result 61 | 62 | def get_mem_usage(): 63 | import psutil 64 | 65 | mem = psutil.virtual_memory() 66 | result = "" 67 | result += "available: %s\t" % (mem2str(mem.available)) 68 | result += "used: %s\t" % (mem2str(mem.used)) 69 | result += "free: %s\t" % (mem2str(mem.free)) 70 | # result += "active: %s\t" % (mem2str(mem.active)) 71 | # result += "inactive: %s\t" % (mem2str(mem.inactive)) 72 | # result += "buffers: %s\t" % (mem2str(mem.buffers)) 73 | # result += "cached: %s\t" % (mem2str(mem.cached)) 74 | # result += "shared: %s\t" % (mem2str(mem.shared)) 75 | # result += "slab: %s\t" % (mem2str(mem.slab)) 76 | return result 77 | 78 | def get_github_string(): 79 | _, output = subprocess.getstatusoutput("git -C ./ log --pretty=format:'%H' -n 1") 80 | ret, _ = subprocess.getstatusoutput("git -C ./ diff-index --quiet HEAD --") 81 | return f"Githash: {output}, unstaged: {ret}" 82 | 83 | 84 | def accumulate(all_y, y): 85 | if all_y is None: 86 | all_y = dict() 87 | for k, v in y.items(): 88 | if isinstance(v, list): 89 | all_y[k] = [ [vv] for vv in v ] 90 | else: 91 | all_y[k] = [v] 92 | else: 93 | for k, v in all_y.items(): 94 | if isinstance(y[k], list): 95 | for vv, yy in zip(v, y[k]): 96 | vv.append(yy) 97 | else: 98 | v.append(y[k]) 99 | 100 | return all_y 101 | 102 | def combine(all_y): 103 | output = dict() 104 | for k, v in all_y.items(): 105 | if isinstance(v[0], list): 106 | output[k] = [ torch.cat(vv) for vv in v ] 107 | else: 108 | output[k] = torch.cat(v) 109 | 110 | return output 111 | 112 | def 
concatOutput(loader, nets, condition=None): 113 | outputs = [None] * len(nets) 114 | 115 | use_cnn = nets[0].use_cnn 116 | 117 | with torch.no_grad(): 118 | for i, (x, _) in enumerate(loader): 119 | if not use_cnn: 120 | x = x.view(x.size(0), -1) 121 | x = x.cuda() 122 | 123 | outputs = [ accumulate(output, to_cpu(net(x))) for net, output in zip(nets, outputs) ] 124 | if condition is not None and not condition(i): 125 | break 126 | 127 | return [ combine(output) for output in outputs ] 128 | 129 | 130 | def adjust_learning_rate(args, optimizer, epoch): 131 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 132 | lrs = args.lr_steps.split('-') 133 | lr_steps = [int(lr) for lr in lrs] 134 | if args.lr_type == 'step': 135 | decay = 0.1 ** (sum(epoch >= np.array(lr_steps))) 136 | backbone_lr = args.backbone_lr * decay 137 | fc_lr = args.fc_lr * decay 138 | decay = args.weight_decay 139 | elif args.lr_type == 'cos': 140 | import math 141 | backbone_lr = 0.5 * args.backbone_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 142 | fc_lr = 0.5 * args.fc_lr * (1 + math.cos(math.pi * epoch / args.epochs)) 143 | decay = args.weight_decay 144 | else: 145 | raise NotImplementedError 146 | 147 | optimizer.param_groups[0]['lr'] = backbone_lr # Glancer 148 | # optimizer.param_groups[1]['lr'] = backbone_lr # Focuser 149 | optimizer.param_groups[1]['lr'] = fc_lr # LSTM 150 | for param_group in optimizer.param_groups: 151 | # param_group['lr'] = lr 152 | param_group['weight_decay'] = decay 153 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/evaluate.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained: 13 | pretrained_glancer: 14 | pretrained_focuser: 15 | 16 | train_stage: 2 17 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 18 | pretrain_glancer: true 19 | arch: resnet 20 | k: 3 21 | dropout: 0.5 22 | num_classes: 200 23 | evaluate: false 24 | eval_freq: 5 25 | print_freq: 100 26 | 27 | # tsn params 28 | video_div: 1 29 | num_segments_glancer: 8 30 | num_segments_focuser: 12 31 | modality: RGB 32 | base_model: resnet50 33 | partial_bn: false 34 | pretrain: imagenet 35 | is_shift: true 36 | shift_div: 8 37 | shift_place: blockres 38 | fc_lr5: false 39 | temporal_pool: false 40 | non_local: false 41 | 42 | dense_sample: false 43 | partial_fcvid_eval: false 44 | partial_ratio: 0.2 45 | ada_reso_skip: false #TODO: 46 | reso_list: 224 47 | random_crop: false 48 | center_crop: false 49 | ada_crop_list: 50 | rescale_to: 224 51 | policy_input_offset: 0 52 | save_meta: false 53 | 54 | epochs: 50 55 | batch_size: 64 56 | backbone_lr: 0.01 57 | fc_lr: 0.005 58 | policy_lr: 0.0003 59 | lr_type: step 60 | lr_steps: 50-100 61 | momentum: 0.9 62 | weight_decay: 0.0001 63 | clip_grad: 20 64 | npb: true 65 | 66 | patch_size: 96 67 | glance_size: 224 68 | random_patch: false 69 | feature_map_channels: 1280 70 | action_dim: 25 71 | hidden_state_dim: 1024 #for policy network, focuser 72 | policy_conv: true 73 | hidden_dim: 1024 #for LSTM classifier 74 | penalty: 0.5 75 | consensus: lstm 76 | ppo_continuous: false 77 | reward: 1 78 | # random: contrast to random patching 79 | # padding: contrast to zeros padding 80 | # prev: contrast to 
previous time step 81 | dropout_lstm: false 82 | gamma: 0.7 #for ppo 83 | with_glancer: true 84 | action_std: 0.1 85 | actorcritic_with_bn: true 86 | 87 | seed: 1007 88 | gpus: 0 89 | gpu: 90 | workers: 16 91 | world_size: 1 92 | rank: 0 93 | dist_url: tcp://127.0.0.1:8822 94 | dist_backend: nccl 95 | multiprocessing_distributed: false 96 | distributed: 97 | amp: false 98 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/stage1.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained_glancer: 13 | pretrained_focuser: 14 | load_pretrained_focuser_fc: false 15 | # /home/nzl17/data/gfvideo/temporal-shift-module/pretrained/something-something-v1/TSM_something_RGB_resnet50_shift8_blockres_avg_segment8_e45_remove_module.pth 16 | 17 | train_stage: 1 18 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 19 | pretrain_glancer: true 20 | arch: resnet 21 | k: 3 22 | dropout: 0.5 23 | num_classes: 174 24 | evaluate: false 25 | eval_freq: 5 26 | start_eval: 40 27 | print_freq: 100 28 | 29 | # tsn params 30 | video_div: 1 31 | num_segments_glancer: 8 32 | num_segments_focuser: 12 33 | modality: RGB 34 | base_model: resnet50 35 | partial_bn: false 36 | pretrain: imagenet 37 | is_shift: true 38 | shift_div: 8 39 | shift_place: blockres 40 | fc_lr5: false 41 | temporal_pool: false 42 | non_local: false 43 | 44 | dense_sample: false 45 | partial_fcvid_eval: false 46 | partial_ratio: 0.2 47 | ada_reso_skip: false #TODO: 48 | reso_list: 224 49 | random_crop: false 50 | center_crop: false 51 | ada_crop_list: 52 | rescale_to: 224 53 | policy_input_offset: 0 54 | save_meta: false 55 | 56 | epochs: 50 57 | batch_size: 64 58 | backbone_lr: 0.01 # 0.001 59 | fc_lr: 0.01 60 | policy_lr: 0.0003 61 | lr_type: cos 62 | lr_steps: 20-40 63 | momentum: 0.9 64 | weight_decay: 0.0001 65 | clip_grad: 20 66 | npb: true 67 | 68 | patch_size: 144 #variable 69 | train_with_larger_patch_size: 0 70 | glance_size: 224 71 | glance_fewer_frame: false 72 | random_patch: true 73 | feature_map_channels: 1280 74 | action_dim: 25 75 | hidden_state_dim: 1024 #for policy network, focuser 76 | policy_conv: true 77 | hidden_dim: 1024 #for LSTM classifier 78 | penalty: 0.5 79 | consensus: lstm 80 | ppo_continuous: false 81 | dropout_lstm: false 82 | gamma: 0.7 #for ppo 83 | with_glancer: true 84 | action_std: 0.1 85 | actorcritic_with_bn: true 86 | 87 | seed: 1007 88 | gpus: 0 89 | gpu: 90 | workers: 8 91 | world_size: 1 92 | rank: 0 93 | dist_url: tcp://127.0.0.1:8822 94 | dist_backend: nccl 95 | multiprocessing_distributed: true 96 | distributed: 97 | amp: true 98 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/stage2.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained: 13 | pretrained_glancer: 14 | pretrained_focuser: 15 | 16 | train_stage: 2 17 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 18 | pretrain_glancer: true 19 | arch: 
resnet 20 | k: 3 21 | dropout: 0.5 22 | num_classes: 174 23 | evaluate: false 24 | eval_freq: 5 25 | print_freq: 100 26 | 27 | # tsn params 28 | video_div: 1 29 | num_segments_glancer: 8 30 | num_segments_focuser: 12 31 | modality: RGB 32 | base_model: resnet50 33 | partial_bn: false 34 | pretrain: imagenet 35 | is_shift: true 36 | shift_div: 8 37 | shift_place: blockres 38 | fc_lr5: false 39 | temporal_pool: false 40 | non_local: false 41 | 42 | dense_sample: false 43 | partial_fcvid_eval: false 44 | partial_ratio: 0.2 45 | ada_reso_skip: false #TODO: 46 | reso_list: 224 47 | random_crop: false 48 | center_crop: false 49 | ada_crop_list: 50 | rescale_to: 224 51 | policy_input_offset: 0 52 | save_meta: false 53 | 54 | epochs: 50 55 | batch_size: 64 56 | backbone_lr: 0.01 57 | fc_lr: 0.005 58 | policy_lr: 0.0003 59 | lr_type: cos 60 | lr_steps: 50-100 61 | momentum: 0.9 62 | weight_decay: 0.0001 63 | clip_grad: 20 64 | npb: true 65 | 66 | patch_size: 144 67 | glance_size: 224 68 | random_patch: false 69 | feature_map_channels: 1280 70 | action_dim: 25 71 | hidden_state_dim: 1024 #for policy network, focuser 72 | policy_conv: true 73 | hidden_dim: 1024 #for LSTM classifier 74 | penalty: 0.5 75 | consensus: lstm 76 | ppo_continuous: True 77 | dropout_lstm: false 78 | gamma: 0.7 #for ppo 79 | with_glancer: true 80 | action_std: 0.1 81 | actorcritic_with_bn: true 82 | 83 | seed: 1007 84 | gpus: 0 85 | gpu: 86 | workers: 16 87 | world_size: 1 88 | rank: 0 89 | dist_url: tcp://127.0.0.1:8822 90 | dist_backend: nccl 91 | multiprocessing_distributed: false 92 | distributed: 93 | amp: false 94 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/conf/stage3.yaml: -------------------------------------------------------------------------------- 1 | hydra: 2 | run: 3 | dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} 4 | 5 | dataset: somethingv1 6 | train_list: 7 | val_list: 8 | root_path: 9 | data_dir: PATH_TO_DATASET 10 | resume: 11 | 12 | pretrained_s2: 13 | load_pretrained_s2_fc: true 14 | pretrained_glancer: 15 | pretrained_focuser: 16 | 17 | train_stage: 3 18 | # 1-pretrain, 2-train policy, 3-pretrain backbone on high-resolution video, 4-finetune 19 | pretrain_glancer: true 20 | arch: resnet 21 | k: 3 22 | dropout: 0.5 23 | num_classes: 174 24 | evaluate: false 25 | eval_freq: 5 26 | start_eval: 0 27 | print_freq: 100 28 | 29 | # tsn params 30 | video_div: 1 31 | num_segments_glancer: 8 32 | num_segments_focuser: 12 33 | modality: RGB 34 | base_model: resnet50 35 | partial_bn: false 36 | pretrain: imagenet 37 | is_shift: true 38 | shift_div: 8 39 | shift_place: blockres 40 | fc_lr5: false 41 | temporal_pool: false 42 | non_local: false 43 | 44 | dense_sample: false 45 | partial_fcvid_eval: false 46 | partial_ratio: 0.2 47 | ada_reso_skip: false #TODO: 48 | reso_list: 224 49 | random_crop: false 50 | center_crop: false 51 | ada_crop_list: 52 | rescale_to: 224 53 | policy_input_offset: 0 54 | save_meta: false 55 | 56 | epochs: 50 57 | batch_size: 64 58 | backbone_lr: 0.01 59 | fc_lr: 0.005 60 | policy_lr: 0.0003 61 | lr_type: step 62 | lr_steps: 50-100 63 | momentum: 0.9 64 | weight_decay: 0.0001 65 | clip_grad: 20 66 | npb: true 67 | 68 | patch_size: 160 69 | glance_size: 224 70 | random_patch: false 71 | feature_map_channels: 1280 72 | action_dim: 25 73 | hidden_state_dim: 1024 #for policy network, focuser 74 | policy_conv: true 75 | hidden_dim: 1024 #for LSTM classifier 76 | penalty: 0.5 77 | consensus: lstm 78 | 
ppo_continuous: true 79 | dropout_lstm: false 80 | gamma: 0.7 #for ppo 81 | with_glancer: true 82 | action_std: 0.1 83 | actorcritic_with_bn: true 84 | 85 | seed: 1007 86 | gpus: 0 87 | gpu: 88 | workers: 16 89 | world_size: 1 90 | rank: 0 91 | dist_url: tcp://127.0.0.1:8822 92 | dist_backend: nccl 93 | multiprocessing_distributed: true 94 | distributed: 95 | amp: true 96 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/evaluate.py: -------------------------------------------------------------------------------- 1 | import torch 2 | torch.multiprocessing.set_sharing_strategy('file_system') 3 | import torch.nn.parallel 4 | import torch.optim 5 | import torch.nn.functional as F 6 | 7 | from ops.dataset import TSNDataSet 8 | from ops.transforms import * 9 | from ops import dataset_config 10 | from ops.utils import AverageMeter, accuracy, ProgressMeter 11 | from models.gfv_net import GFV 12 | 13 | import os 14 | import time 15 | import hydra 16 | import basic_tools 17 | from collections import OrderedDict 18 | 19 | 20 | def parse_gpus(gpus): 21 | if type(gpus) is int: 22 | return [gpus] 23 | gpu_list = gpus.split('-') 24 | return [int(g) for g in gpu_list] 25 | 26 | 27 | @hydra.main(config_path="conf", config_name="evaluate.yaml") 28 | def main(args): 29 | config_yaml = basic_tools.start(args) 30 | with open('training.log', 'a+') as f_handler: 31 | f_handler.writelines(config_yaml) 32 | 33 | best_acc1 = 0 34 | num_class, args.train_list, args.val_list, args.root_path, prefix = \ 35 | dataset_config.return_dataset(args.dataset, modality='RGB', root_dataset=args.data_dir) 36 | args.num_classes = num_class 37 | 38 | model = GFV(args).cuda() 39 | 40 | if args.pretrained_glancer: 41 | pretrained_ckpt = torch.load(os.path.expanduser(args.pretrained_glancer), map_location='cpu') 42 | 43 | new_state_dict = OrderedDict() 44 | for k, v in pretrained_ckpt['state_dict'].items(): 45 | if k[:18] == 'module.base_model.': 46 | name = k[18:] # remove `module.` 47 | new_state_dict[name] = v 48 | elif k[:14] == 'module.new_fc.': 49 | name = 'classifier.' + k[14:] # replace `module.new_fc` with 'classifier' 50 | new_state_dict[name] = v 51 | else: 52 | new_state_dict[k] = v 53 | 54 | model.glancer.net.load_state_dict(new_state_dict, strict=True) 55 | 56 | print('Load Pretrained Glancer from {}!'.format(args.pretrained_glancer)) 57 | with open('training.log', 'a+') as f_handler: 58 | f_handler.writelines('Load Pretrained Glancer from {}!'.format(args.pretrained_glancer)) 59 | 60 | if args.pretrained_focuser: 61 | pretrained_ckpt = torch.load(os.path.expanduser(args.pretrained_focuser), map_location='cpu') 62 | 63 | new_state_dict = OrderedDict() 64 | new_fc_state_ditc = OrderedDict() 65 | for k, v in pretrained_ckpt['state_dict'].items(): 66 | print('Load ckpt param: {}'.format(k)) 67 | if k[:7] == 'module.' and 'new_fc' not in k: 68 | name = k[7:] # remove `module.` 69 | new_state_dict[name] = v 70 | elif 'module.new_fc.' 
in k: 71 | name = k[14:] # remove `module.` 72 | new_fc_state_ditc[name] = v 73 | else: 74 | new_state_dict[k] = v 75 | 76 | model.classifier.load_state_dict(new_fc_state_ditc, strict=True) 77 | model.focuser.net.load_state_dict(new_state_dict, strict=False) 78 | 79 | print('Load Pretrained Focuser from {}!'.format(args.pretrained_focuser)) 80 | with open('training.log', 'a+') as f_handler: 81 | f_handler.writelines('Load Pretrained Focuser from {}!'.format(args.pretrained_focuser)) 82 | 83 | model.focuser.net.base_model = torch.nn.Sequential(*list(model.focuser.net.base_model.children())[:-1]) 84 | print(model) 85 | print(model.focuser.policy.policy) 86 | with open('training.log', 'a+') as f_handler: 87 | f_handler.writelines('model: {}'.format(model)) 88 | f_handler.writelines('policy net: {}'.format(model.focuser.policy.policy)) 89 | 90 | scale_size = model.scale_size 91 | crop_size = model.crop_size 92 | input_mean = model.input_mean 93 | input_std = model.input_std 94 | 95 | # data loading code 96 | normalize = GroupNormalize(input_mean, input_std) 97 | 98 | val_loader = torch.utils.data.DataLoader( 99 | TSNDataSet(args.root_path, args.val_list, 100 | num_segments_glancer=args.num_segments_glancer, 101 | num_segments_focuser=args.num_segments_focuser, 102 | new_length=1, 103 | modality='RGB', 104 | image_tmpl=prefix, 105 | random_shift=False, 106 | transform=torchvision.transforms.Compose([ 107 | GroupScale(int(scale_size)), 108 | GroupCenterCrop(crop_size), 109 | Stack(roll=False), 110 | ToTorchFormatTensor(div=True), 111 | normalize, 112 | ]), dense_sample=args.dense_sample), 113 | batch_size=args.batch_size, shuffle=False, 114 | num_workers=args.workers, pin_memory=False) 115 | 116 | criterion = torch.nn.CrossEntropyLoss().cuda() 117 | 118 | if args.pretrained: 119 | pretrained_ckpt = torch.load(os.path.expanduser(args.pretrained)) 120 | 121 | start_epoch = pretrained_ckpt['epoch'] 122 | print('Load pretrained ckpt from: {}'.format(os.path.expanduser(args.pretrained))) 123 | print('Load pretrained ckpt from epoch: {}'.format(start_epoch)) 124 | 125 | model.glancer.load_state_dict(pretrained_ckpt['glancer'], strict=True) 126 | model.focuser.load_state_dict(pretrained_ckpt['focuser'], strict=True) 127 | model.classifier.load_state_dict(pretrained_ckpt['fc'], strict=True) 128 | 129 | ckpt_acc1 = pretrained_ckpt['best_acc'] 130 | print('best ckpt_acc1 for ckpt: {}'.format(ckpt_acc1)) 131 | with open('training.log', 'a+') as f_handler: 132 | f_handler.writelines('Load pretrained ckpt from: {}'.format(os.path.expanduser(args.pretrained))) 133 | f_handler.writelines('Load pretrained ckpt from epoch: {}'.format(start_epoch)) 134 | f_handler.writelines('best ckpt_acc1 for ckpt: {}'.format(ckpt_acc1)) 135 | 136 | if args.resume: 137 | resume_ckpt = torch.load(os.path.expanduser(args.resume)) 138 | 139 | start_epoch = resume_ckpt['epoch'] 140 | print('resume from epoch: {}'.format(start_epoch)) 141 | 142 | model.glancer.load_state_dict(resume_ckpt['glancer'], strict=True) 143 | model.focuser.load_state_dict(resume_ckpt['focuser'], strict=True) 144 | model.classifier.load_state_dict(resume_ckpt['fc'], strict=True) 145 | model.focuser.policy.policy.load_state_dict(resume_ckpt['policy']) 146 | model.focuser.policy.policy_old.load_state_dict(resume_ckpt['policy']) 147 | 148 | best_acc1 = resume_ckpt['best_acc'] 149 | print('best acc1 for ckpt: {}'.format(best_acc1)) 150 | with open('training.log', 'a+') as f_handler: 151 | f_handler.writelines('Resume from: 
{}'.format(os.path.expanduser(args.resume))) 152 | f_handler.writelines('Resume from epoch: {}'.format(start_epoch)) 153 | f_handler.writelines('best_acc1 for resume: {}'.format(best_acc1)) 154 | else: 155 | start_epoch = 0 156 | 157 | if args.evaluate: 158 | acc1, val_logs = validate(val_loader, model, criterion, args) 159 | with open('training.log', 'a+') as f_handler: 160 | f_handler.writelines(val_logs) 161 | print('Best Acc@1 = {}'.format(acc1)) 162 | return 163 | 164 | 165 | def validate(val_loader, model, criterion, args): 166 | batch_time = AverageMeter('Time', ':6.3f') 167 | losses = AverageMeter('Loss', ':.4e') 168 | top1 = AverageMeter('Acc@1', ':6.2f') 169 | top5 = AverageMeter('Acc@5', ':6.2f') 170 | reward_list = [AverageMeter('Rew', ':6.5f') for _ in range(args.video_div)] 171 | progress = ProgressMeter(len(val_loader), batch_time, losses, top1, top5, prefix='Test: ') 172 | 173 | logs = [] 174 | # switch to evaluate mode 175 | model.eval() 176 | model.focuser.policy.policy.eval() 177 | model.focuser.policy.policy_old.eval() 178 | 179 | all_targets = [] 180 | with torch.no_grad(): 181 | end = time.time() 182 | for i, (glancer_images, focuser_images, target) in enumerate(val_loader): 183 | _b = target.shape[0] 184 | all_targets.append(target) 185 | glancer_images = glancer_images.cuda() 186 | focuser_images = focuser_images.cuda() 187 | target = target.cuda() 188 | glancer_images = torch.nn.functional.interpolate(glancer_images, (args.glance_size, args.glance_size)) 189 | glancer_images = glancer_images.cuda() 190 | 191 | # compute output 192 | focuser_images = focuser_images.view(_b, args.num_segments_focuser, 3, model.input_size, model.input_size) 193 | # MDP Focusing 194 | with torch.no_grad(): 195 | global_feat_map, global_feat_logit = model.glance( 196 | glancer_images) # feat_map (B, T, C, H, W) feat_vec (B, T, _) 197 | 198 | for focus_time_step in range(args.video_div): 199 | pred, baseline_logit, local_patch = model.action_stage2( 200 | focuser_images, global_feat_map, global_feat_logit, focus_time_step, args, 201 | prev_local_patch=None if focus_time_step == 0 else local_patch, training=False) 202 | 203 | loss = criterion(pred, target) 204 | confidence = torch.gather(F.softmax(pred.detach(), 1), dim=1, index=target.view(-1, 1)).view(1, -1) 205 | 206 | bsl_confidence = torch.gather(F.softmax(baseline_logit.detach(), 1), dim=1, 207 | index=target.view(-1, 1)).view(1, -1) 208 | reward = confidence - bsl_confidence 209 | 210 | reward_list[focus_time_step].update(reward.data.mean().item(), glancer_images.size(0)) 211 | 212 | # Update evaluation metrics 213 | acc1, acc5 = accuracy(pred, target, topk=(1, 5)) 214 | losses.update(loss.item(), glancer_images.size(0)) 215 | top1.update(acc1[0], glancer_images.size(0)) 216 | top5.update(acc5[0], glancer_images.size(0)) 217 | 218 | batch_time.update(time.time() - end) 219 | end = time.time() 220 | _reward = [reward.avg for reward in reward_list] 221 | print('reward of each step: {}'.format(_reward)) 222 | 223 | logs.append(progress.print(i)) 224 | logs.append(' '.join(map(str, _reward)) + '\n') 225 | 226 | return top1.avg, logs 227 | 228 | 229 | if __name__ == "__main__": 230 | main() 231 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/evaluate.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python evaluate.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=2 \ 5 | 
batch_size=32 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | video_div=1 \ 9 | workers=4 \ 10 | policy_lr=0.0003 \ 11 | epochs=50 \ 12 | eval_freq=1 \ 13 | random_patch=False \ 14 | glance_size=224 \ 15 | patch_size=144 \ 16 | gamma=0.7 \ 17 | with_glancer=True \ 18 | reward=2 \ 19 | ppo_continuous=True \ 20 | action_std=0.25 \ 21 | actorcritic_with_bn=True \ 22 | evaluate=True \ 23 | resume=PATH_TO_PRETRAINED_MODEL # load the pretrained model -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/ppo.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/ppo.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/ppo_continuous.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/ppo_continuous.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/resnet.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/resnet.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/tsn.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/tsn.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/models/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/constant.py: -------------------------------------------------------------------------------- 1 | # Some definition of constants 2 | 3 | patch_sizes = [64, 96, 128, 160, 192, 224, 0] 4 | # patch_sizes = [96] 5 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/mobilenetv2.py: -------------------------------------------------------------------------------- 1 | # Code adapted from https://github.com/tonylins/pytorch-mobilenet-v2 2 | 3 | import torch.nn as nn 4 | import math 5 | 6 | 7 | def conv_bn(inp, oup, stride): 8 | return nn.Sequential( 9 | nn.Conv2d(inp, oup, 3, stride, 1, bias=False), 10 | nn.BatchNorm2d(oup), 11 | nn.ReLU6(inplace=True) 12 | ) 13 | 14 | 15 | def conv_1x1_bn(inp, oup): 16 | return 
nn.Sequential( 17 | nn.Conv2d(inp, oup, 1, 1, 0, bias=False), 18 | nn.BatchNorm2d(oup), 19 | nn.ReLU6(inplace=True) 20 | ) 21 | 22 | 23 | def make_divisible(x, divisible_by=8): 24 | import numpy as np 25 | return int(np.ceil(x * 1. / divisible_by) * divisible_by) 26 | 27 | 28 | class InvertedResidual(nn.Module): 29 | def __init__(self, inp, oup, stride, expand_ratio): 30 | super(InvertedResidual, self).__init__() 31 | self.stride = stride 32 | assert stride in [1, 2] 33 | 34 | hidden_dim = int(inp * expand_ratio) 35 | self.use_res_connect = self.stride == 1 and inp == oup 36 | 37 | if expand_ratio == 1: 38 | self.conv = nn.Sequential( 39 | # dw 40 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), 41 | nn.BatchNorm2d(hidden_dim), 42 | nn.ReLU6(inplace=True), 43 | # pw-linear 44 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), 45 | nn.BatchNorm2d(oup), 46 | ) 47 | else: 48 | self.conv = nn.Sequential( 49 | # pw 50 | nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False), 51 | nn.BatchNorm2d(hidden_dim), 52 | nn.ReLU6(inplace=True), 53 | # dw 54 | nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), 55 | nn.BatchNorm2d(hidden_dim), 56 | nn.ReLU6(inplace=True), 57 | # pw-linear 58 | nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), 59 | nn.BatchNorm2d(oup), 60 | ) 61 | 62 | def forward(self, x): 63 | if self.use_res_connect: 64 | return x + self.conv(x) 65 | else: 66 | return self.conv(x) 67 | 68 | 69 | class MobileNetV2(nn.Module): 70 | def __init__(self, n_class=1000, input_size=224, width_mult=1.): 71 | super(MobileNetV2, self).__init__() 72 | block = InvertedResidual 73 | input_channel = 32 74 | last_channel = 1280 75 | interverted_residual_setting = [ 76 | # t, c, n, s 77 | [1, 16, 1, 1], 78 | [6, 24, 2, 2], 79 | [6, 32, 3, 2], 80 | [6, 64, 4, 2], 81 | [6, 96, 3, 1], 82 | [6, 160, 3, 2], 83 | [6, 320, 1, 1], 84 | ] 85 | 86 | # building first layer 87 | assert input_size % 32 == 0 88 | # input_channel = make_divisible(input_channel * width_mult) # first channel is always 32! 89 | self.last_channel = make_divisible(last_channel * width_mult) if width_mult > 1.0 else last_channel 90 | self.features = [conv_bn(3, input_channel, 2)] 91 | # building inverted residual blocks 92 | for t, c, n, s in interverted_residual_setting: 93 | output_channel = make_divisible(c * width_mult) if t > 1 else c 94 | for i in range(n): 95 | if i == 0: 96 | self.features.append(block(input_channel, output_channel, s, expand_ratio=t)) 97 | else: 98 | self.features.append(block(input_channel, output_channel, 1, expand_ratio=t)) 99 | input_channel = output_channel 100 | # building last several layers 101 | self.features.append(conv_1x1_bn(input_channel, self.last_channel)) 102 | # make it nn.Sequential 103 | self.features = nn.Sequential(*self.features) 104 | 105 | # building classifier 106 | self.classifier = nn.Linear(self.last_channel, n_class) 107 | 108 | self._initialize_weights() 109 | 110 | def forward(self, x): 111 | x = self.features(x) 112 | x = x.mean(3).mean(2) 113 | x = self.classifier(x) 114 | return x 115 | 116 | def get_featmap(self, x): 117 | # x = self.features(x) 118 | # return x, x.mean([2, 3]) 119 | x = self.features(x) 120 | logit = self.classifier(x.mean(3).mean(2)) 121 | return x, logit 122 | 123 | def _initialize_weights(self): 124 | for m in self.modules(): 125 | if isinstance(m, nn.Conv2d): 126 | n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels 127 | m.weight.data.normal_(0, math.sqrt(2. 
/ n)) 128 | if m.bias is not None: 129 | m.bias.data.zero_() 130 | elif isinstance(m, nn.BatchNorm2d): 131 | m.weight.data.fill_(1) 132 | m.bias.data.zero_() 133 | elif isinstance(m, nn.Linear): 134 | n = m.weight.size(1) 135 | m.weight.data.normal_(0, 0.01) 136 | m.bias.data.zero_() 137 | 138 | @property 139 | def feature_dim(self): 140 | return self.last_channel 141 | 142 | 143 | def mobilenet_v2(n_class, pretrained=True): 144 | model = MobileNetV2(n_class=n_class, width_mult=1) 145 | 146 | if pretrained: 147 | try: 148 | from torch.hub import load_state_dict_from_url 149 | except ImportError: 150 | from torch.utils.model_zoo import load_url as load_state_dict_from_url 151 | state_dict = load_state_dict_from_url( 152 | 'https://www.dropbox.com/s/47tyzpofuuyyv1b/mobilenetv2_1.0-f2a8633.pth.tar?dl=1', progress=True) 153 | model.load_state_dict(state_dict) 154 | print('Loaded pretrained weight!') 155 | return model 156 | 157 | 158 | if __name__ == '__main__': 159 | net = mobilenet_v2(True) 160 | 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/ppo.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import math 6 | from torch.distributions import Categorical 7 | 8 | 9 | class Memory: 10 | def __init__(self): 11 | self.actions = [] 12 | self.states = [] 13 | self.logprobs = [] 14 | self.rewards = [] 15 | self.is_terminals = [] 16 | self.hidden = [] 17 | 18 | def clear_memory(self): 19 | del self.actions[:] 20 | del self.states[:] 21 | del self.logprobs[:] 22 | del self.rewards[:] 23 | del self.is_terminals[:] 24 | del self.hidden[:] 25 | 26 | 27 | class ActorCritic(nn.Module): 28 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim=1024, policy_conv=True): 29 | super(ActorCritic, self).__init__() 30 | 31 | # encoder with convolution layer for MobileNetV3, EfficientNet and RegNet 32 | if policy_conv: 33 | self.state_encoder = nn.Sequential( 34 | # nn.Conv2d(feature_dim, 32, kernel_size=1, stride=1, padding=0, bias=False), 35 | nn.Conv2d(feature_dim, 64, kernel_size=1, stride=1, padding=0, bias=False), 36 | nn.BatchNorm2d(64), 37 | nn.ReLU(), 38 | nn.Flatten(), 39 | # nn.Linear(int(state_dim * 32 / feature_dim), hidden_state_dim), 40 | nn.Linear(int(state_dim * 64 / feature_dim), hidden_state_dim), 41 | nn.BatchNorm1d(hidden_state_dim), 42 | nn.ReLU() 43 | ) 44 | 45 | # encoder with linear layer for ResNet and DenseNet 46 | else: 47 | self.state_encoder = nn.Sequential( 48 | nn.Linear(state_dim, 2048), 49 | nn.ReLU(), 50 | nn.Linear(2048, hidden_state_dim), 51 | nn.ReLU() 52 | ) 53 | 54 | self.gru = nn.GRU(hidden_state_dim, hidden_state_dim, batch_first=False) 55 | 56 | self.actor = nn.Sequential( 57 | nn.Linear(hidden_state_dim, action_dim), 58 | nn.Softmax(dim=-1)) 59 | 60 | self.critic = nn.Sequential( 61 | nn.Linear(hidden_state_dim, 1)) 62 | 63 | self.hidden_state_dim = hidden_state_dim 64 | self.action_dim = action_dim 65 | self.policy_conv = policy_conv 66 | self.feature_dim = feature_dim 67 | self.feature_ratio = int(math.sqrt(state_dim/feature_dim)) 68 | 69 | def forward(self): 70 | raise NotImplementedError 71 | 72 | def act(self, state_ini, memory, restart_batch=False, training=True): 73 | if restart_batch: 74 | del memory.hidden[:] 75 | memory.hidden.append(torch.zeros(1, state_ini.size(0), 
self.hidden_state_dim).cuda()) 76 | 77 | if not self.policy_conv: 78 | state = state_ini.flatten(1) 79 | else: 80 | state = state_ini 81 | 82 | state = self.state_encoder(state) 83 | 84 | state, hidden_output = self.gru(state.view(1, state.size(0), state.size(1)), memory.hidden[-1]) 85 | memory.hidden.append(hidden_output) 86 | 87 | state = state[0] 88 | action_probs = self.actor(state) 89 | dist = Categorical(action_probs) 90 | 91 | if training: 92 | action = dist.sample() 93 | action_logprob = dist.log_prob(action) 94 | memory.states.append(state_ini) 95 | memory.actions.append(action) 96 | memory.logprobs.append(action_logprob) 97 | else: 98 | action = action_probs.max(1)[1] 99 | 100 | return action 101 | 102 | def evaluate(self, state, action): 103 | seq_l = state.size(0) 104 | batch_size = state.size(1) 105 | 106 | if not self.policy_conv: 107 | state = state.flatten(2) 108 | state = state.view(seq_l * batch_size, state.size(2)) 109 | else: 110 | state = state.view(seq_l * batch_size, state.size(2), state.size(3), state.size(4)) 111 | 112 | state = self.state_encoder(state) 113 | state = state.view(seq_l, batch_size, -1) 114 | 115 | state, hidden = self.gru(state, torch.zeros(1, batch_size, state.size(2)).cuda()) 116 | state = state.view(seq_l * batch_size, -1) 117 | 118 | action_probs = self.actor(state) 119 | dist = Categorical(action_probs) 120 | action_logprobs = dist.log_prob(torch.squeeze(action.view(seq_l * batch_size, -1))).cuda() 121 | dist_entropy = dist.entropy().cuda() 122 | state_value = self.critic(state) 123 | 124 | return action_logprobs.view(seq_l, batch_size), \ 125 | state_value.view(seq_l, batch_size), \ 126 | dist_entropy.view(seq_l, batch_size) 127 | 128 | 129 | class PPO: 130 | def __init__(self, feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv, gpu=0, 131 | lr=0.0003, betas=(0.9, 0.999), gamma=0.7, K_epochs=1, eps_clip=0.2): 132 | self.lr = lr 133 | self.betas = betas 134 | self.gamma = gamma 135 | self.eps_clip = eps_clip 136 | self.K_epochs = K_epochs 137 | 138 | self.policy = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 139 | 140 | self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas) 141 | 142 | self.policy_old = ActorCritic(feature_dim, state_dim, action_dim, hidden_state_dim, policy_conv).cuda(gpu) 143 | self.policy_old.load_state_dict(self.policy.state_dict()) 144 | 145 | self.MseLoss = nn.MSELoss() 146 | 147 | def select_action(self, state, memory, restart_batch=False, training=True): 148 | return self.policy_old.act(state, memory, restart_batch, training) 149 | 150 | def update(self, memory): 151 | rewards = [] 152 | discounted_reward = 0 153 | 154 | for reward in reversed(memory.rewards): 155 | discounted_reward = reward + (self.gamma * discounted_reward) 156 | rewards.insert(0, discounted_reward) 157 | 158 | rewards = torch.cat(rewards, 0).cuda() 159 | 160 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5) 161 | 162 | old_states = torch.stack(memory.states, 0).cuda().detach() 163 | old_actions = torch.stack(memory.actions, 0).cuda().detach() 164 | old_logprobs = torch.stack(memory.logprobs, 0).cuda().detach() 165 | 166 | for _ in range(self.K_epochs): 167 | logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions) 168 | 169 | ratios = torch.exp(logprobs - old_logprobs.detach()) 170 | 171 | advantages = rewards - state_values.detach() 172 | surr1 = ratios * advantages 173 | surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + 
self.eps_clip) * advantages 174 | 175 | loss = -torch.min(surr1, surr2) + 0.5 * self.MseLoss(state_values, rewards) - 0.01 * dist_entropy 176 | 177 | self.optimizer.zero_grad() 178 | loss.mean().backward() 179 | self.optimizer.step() 180 | 181 | self.policy_old.load_state_dict(self.policy.state_dict()) -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/ppo_continuous.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import math 6 | 7 | 8 | class Memory: 9 | def __init__(self): 10 | self.actions = [] 11 | self.states = [] 12 | self.logprobs = [] 13 | self.rewards = [] 14 | self.is_terminals = [] 15 | self.hidden = [] 16 | 17 | def clear_memory(self): 18 | del self.actions[:] 19 | del self.states[:] 20 | del self.logprobs[:] 21 | del self.rewards[:] 22 | del self.is_terminals[:] 23 | del self.hidden[:] 24 | 25 | 26 | class ActorCritic(nn.Module): 27 | def __init__(self, feature_dim, state_dim, hidden_state_dim=1024, policy_conv=True, action_std=0.1, with_bn=False): 28 | super(ActorCritic, self).__init__() 29 | 30 | # encoder with convolution layer for MobileNetV3, EfficientNet and RegNet 31 | if policy_conv: 32 | if with_bn: 33 | self.state_encoder = nn.Sequential( 34 | nn.Conv2d(feature_dim, 64, kernel_size=1, stride=1, padding=0, bias=False), 35 | nn.BatchNorm2d(64), 36 | nn.ReLU(), 37 | nn.Flatten(), 38 | nn.Linear(int(state_dim * 64 / feature_dim), hidden_state_dim), 39 | nn.BatchNorm1d(hidden_state_dim), 40 | nn.ReLU() 41 | ) 42 | else: 43 | self.state_encoder = nn.Sequential( 44 | nn.Conv2d(feature_dim, 64, kernel_size=1, stride=1, padding=0, bias=False), 45 | nn.ReLU(), 46 | nn.Flatten(), 47 | nn.Linear(int(state_dim * 64 / feature_dim), hidden_state_dim), 48 | nn.ReLU() 49 | ) 50 | # encoder with linear layer for ResNet and DenseNet 51 | else: 52 | self.state_encoder = nn.Sequential( 53 | nn.Linear(state_dim, 2048), 54 | nn.ReLU(), 55 | nn.Linear(2048, hidden_state_dim), 56 | nn.ReLU() 57 | ) 58 | 59 | self.gru = nn.GRU(hidden_state_dim, hidden_state_dim, batch_first=False) 60 | 61 | self.actor = nn.Sequential( 62 | nn.Linear(hidden_state_dim, 2), 63 | nn.Sigmoid()) 64 | 65 | self.critic = nn.Sequential( 66 | nn.Linear(hidden_state_dim, 1)) 67 | 68 | self.action_var = torch.full((2,), action_std).cuda() 69 | 70 | self.hidden_state_dim = hidden_state_dim 71 | self.policy_conv = policy_conv 72 | self.feature_dim = feature_dim 73 | self.feature_ratio = int(math.sqrt(state_dim/feature_dim)) 74 | 75 | def forward(self): 76 | raise NotImplementedError 77 | 78 | def act(self, state_ini, memory, restart_batch=False, training=False): 79 | if restart_batch: 80 | del memory.hidden[:] 81 | memory.hidden.append(torch.zeros(1, state_ini.size(0), self.hidden_state_dim).cuda()) 82 | 83 | if not self.policy_conv: 84 | state = state_ini.flatten(1) 85 | else: 86 | state = state_ini 87 | 88 | state = self.state_encoder(state) 89 | 90 | state, hidden_output = self.gru(state.view(1, state.size(0), state.size(1)), memory.hidden[-1]) 91 | memory.hidden.append(hidden_output) 92 | 93 | state = state[0] 94 | action_mean = self.actor(state) 95 | 96 | cov_mat = torch.diag(self.action_var).cuda() 97 | dist = torch.distributions.multivariate_normal.MultivariateNormal(action_mean, scale_tril=cov_mat) 98 | action = dist.sample().cuda() 99 | if training: 100 | action = F.relu(action) 101 | action = 1 
- F.relu(1 - action) 102 | action_logprob = dist.log_prob(action).cuda() 103 | memory.states.append(state_ini) 104 | memory.actions.append(action) 105 | memory.logprobs.append(action_logprob) 106 | else: 107 | action = action_mean 108 | 109 | return action.detach() 110 | 111 | def evaluate(self, state, action): 112 | seq_l = state.size(0) 113 | batch_size = state.size(1) 114 | 115 | if not self.policy_conv: 116 | state = state.flatten(2) 117 | state = state.view(seq_l * batch_size, state.size(2)) 118 | else: 119 | state = state.view(seq_l * batch_size, state.size(2), state.size(3), state.size(4)) 120 | 121 | state = self.state_encoder(state) 122 | state = state.view(seq_l, batch_size, -1) 123 | 124 | state, hidden = self.gru(state, torch.zeros(1, batch_size, state.size(2)).cuda()) 125 | state = state.view(seq_l * batch_size, -1) 126 | 127 | action_mean = self.actor(state) 128 | 129 | cov_mat = torch.diag(self.action_var).cuda() 130 | 131 | dist = torch.distributions.multivariate_normal.MultivariateNormal(action_mean, scale_tril=cov_mat) 132 | 133 | action_logprobs = dist.log_prob(torch.squeeze(action.view(seq_l * batch_size, -1))).cuda() 134 | dist_entropy = dist.entropy().cuda() 135 | state_value = self.critic(state) 136 | 137 | return action_logprobs.view(seq_l, batch_size), \ 138 | state_value.view(seq_l, batch_size), \ 139 | dist_entropy.view(seq_l, batch_size) 140 | 141 | 142 | class PPO_Continuous: 143 | def __init__(self, feature_dim, state_dim, hidden_state_dim, policy_conv, gpu=0, action_std=0.1, 144 | lr=0.0003, betas=(0.9, 0.999), gamma=0.7, K_epochs=1, eps_clip=0.2, with_bn=False): 145 | self.lr = lr 146 | self.betas = betas 147 | self.gamma = gamma 148 | self.eps_clip = eps_clip 149 | self.K_epochs = K_epochs 150 | 151 | self.policy = ActorCritic(feature_dim, state_dim, hidden_state_dim, policy_conv, action_std, 152 | with_bn=with_bn).cuda(gpu) 153 | 154 | self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas) 155 | 156 | self.policy_old = ActorCritic(feature_dim, state_dim, hidden_state_dim, policy_conv, action_std, 157 | with_bn=with_bn).cuda(gpu) 158 | self.policy_old.load_state_dict(self.policy.state_dict()) 159 | 160 | self.MseLoss = nn.MSELoss() 161 | 162 | def select_action(self, state, memory, restart_batch=False, training=True): 163 | return self.policy_old.act(state, memory, restart_batch, training) 164 | 165 | def update(self, memory): 166 | rewards = [] 167 | discounted_reward = 0 168 | 169 | for reward in reversed(memory.rewards): 170 | discounted_reward = reward + (self.gamma * discounted_reward) 171 | rewards.insert(0, discounted_reward) 172 | 173 | rewards = torch.cat(rewards, 0).cuda() 174 | 175 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5) 176 | 177 | old_states = torch.stack(memory.states, 0).cuda().detach() 178 | old_actions = torch.stack(memory.actions, 0).cuda().detach() 179 | old_logprobs = torch.stack(memory.logprobs, 0).cuda().detach() 180 | 181 | for _ in range(self.K_epochs): 182 | logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions) 183 | 184 | ratios = torch.exp(logprobs - old_logprobs.detach()) 185 | 186 | advantages = rewards - state_values.detach() 187 | surr1 = ratios * advantages 188 | surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages 189 | 190 | loss = -torch.min(surr1, surr2) + 0.5 * self.MseLoss(state_values, rewards) - 0.01 * dist_entropy 191 | 192 | self.optimizer.zero_grad() 193 | loss.mean().backward() 194 | self.optimizer.step() 195 
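        # After the K_epochs optimization passes, the updated weights are copied into the
        # frozen behaviour policy (policy_old) below, so that the next rollout is sampled
        # with the latest parameters and its stored log-probs are computed from them.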
| 196 | self.policy_old.load_state_dict(self.policy.state_dict()) -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/resnet.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from .utils import load_state_dict_from_url 4 | 5 | __all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101', 6 | 'resnet152', 'resnext50_32x4d', 'resnext101_32x8d', 7 | 'wide_resnet50_2', 'wide_resnet101_2'] 8 | 9 | 10 | model_urls = { 11 | 'resnet18': 'https://download.pytorch.org/models/resnet18-5c106cde.pth', 12 | 'resnet34': 'https://download.pytorch.org/models/resnet34-333f7ec4.pth', 13 | 'resnet50': 'https://download.pytorch.org/models/resnet50-19c8e357.pth', 14 | 'resnet101': 'https://download.pytorch.org/models/resnet101-5d3b4d8f.pth', 15 | 'resnet152': 'https://download.pytorch.org/models/resnet152-b121ed2d.pth', 16 | 'resnext50_32x4d': 'https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth', 17 | 'resnext101_32x8d': 'https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth', 18 | 'wide_resnet50_2': 'https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth', 19 | 'wide_resnet101_2': 'https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth', 20 | } 21 | 22 | 23 | def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1): 24 | """3x3 convolution with padding""" 25 | return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, 26 | padding=dilation, groups=groups, bias=False, dilation=dilation) 27 | 28 | 29 | def conv1x1(in_planes, out_planes, stride=1): 30 | """1x1 convolution""" 31 | return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False) 32 | 33 | 34 | class BasicBlock(nn.Module): 35 | expansion = 1 36 | 37 | def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1, 38 | base_width=64, dilation=1, norm_layer=None): 39 | super(BasicBlock, self).__init__() 40 | if norm_layer is None: 41 | norm_layer = nn.BatchNorm2d 42 | if groups != 1 or base_width != 64: 43 | raise ValueError('BasicBlock only supports groups=1 and base_width=64') 44 | if dilation > 1: 45 | raise NotImplementedError("Dilation > 1 not supported in BasicBlock") 46 | # Both self.conv1 and self.downsample layers downsample the input when stride != 1 47 | self.conv1 = conv3x3(inplanes, planes, stride) 48 | self.bn1 = norm_layer(planes) 49 | self.relu = nn.ReLU(inplace=True) 50 | self.conv2 = conv3x3(planes, planes) 51 | self.bn2 = norm_layer(planes) 52 | self.downsample = downsample 53 | self.stride = stride 54 | 55 | def forward(self, x): 56 | identity = x 57 | 58 | out = self.conv1(x) 59 | out = self.bn1(out) 60 | out = self.relu(out) 61 | 62 | out = self.conv2(out) 63 | out = self.bn2(out) 64 | 65 | if self.downsample is not None: 66 | identity = self.downsample(x) 67 | 68 | out += identity 69 | out = self.relu(out) 70 | 71 | return out 72 | 73 | 74 | class Bottleneck(nn.Module): 75 | expansion = 4 76 | 77 | def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1, 78 | base_width=64, dilation=1, norm_layer=None): 79 | super(Bottleneck, self).__init__() 80 | if norm_layer is None: 81 | norm_layer = nn.BatchNorm2d 82 | width = int(planes * (base_width / 64.)) * groups 83 | # Both self.conv2 and self.downsample layers downsample the input when stride != 1 84 | self.conv1 = conv1x1(inplanes, width) 85 | self.bn1 = norm_layer(width) 86 | self.conv2 = conv3x3(width, 
width, stride, groups, dilation) 87 | self.bn2 = norm_layer(width) 88 | self.conv3 = conv1x1(width, planes * self.expansion) 89 | self.bn3 = norm_layer(planes * self.expansion) 90 | self.relu = nn.ReLU(inplace=True) 91 | self.downsample = downsample 92 | self.stride = stride 93 | 94 | def forward(self, x): 95 | identity = x 96 | 97 | out = self.conv1(x) 98 | out = self.bn1(out) 99 | out = self.relu(out) 100 | 101 | out = self.conv2(out) 102 | out = self.bn2(out) 103 | out = self.relu(out) 104 | 105 | out = self.conv3(out) 106 | out = self.bn3(out) 107 | 108 | if self.downsample is not None: 109 | identity = self.downsample(x) 110 | 111 | out += identity 112 | out = self.relu(out) 113 | 114 | return out 115 | 116 | 117 | class ResNet(nn.Module): 118 | 119 | def __init__(self, block, layers, num_classes=1000, zero_init_residual=False, 120 | groups=1, width_per_group=64, replace_stride_with_dilation=None, 121 | norm_layer=None): 122 | super(ResNet, self).__init__() 123 | if norm_layer is None: 124 | norm_layer = nn.BatchNorm2d 125 | self._norm_layer = norm_layer 126 | 127 | self.inplanes = 64 128 | self.dilation = 1 129 | if replace_stride_with_dilation is None: 130 | # each element in the tuple indicates if we should replace 131 | # the 2x2 stride with a dilated convolution instead 132 | replace_stride_with_dilation = [False, False, False] 133 | if len(replace_stride_with_dilation) != 3: 134 | raise ValueError("replace_stride_with_dilation should be None " 135 | "or a 3-element tuple, got {}".format(replace_stride_with_dilation)) 136 | self.groups = groups 137 | self.base_width = width_per_group 138 | self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, 139 | bias=False) 140 | self.bn1 = norm_layer(self.inplanes) 141 | self.relu = nn.ReLU(inplace=True) 142 | self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) 143 | self.layer1 = self._make_layer(block, 64, layers[0]) 144 | self.layer2 = self._make_layer(block, 128, layers[1], stride=2, 145 | dilate=replace_stride_with_dilation[0]) 146 | self.layer3 = self._make_layer(block, 256, layers[2], stride=2, 147 | dilate=replace_stride_with_dilation[1]) 148 | self.layer4 = self._make_layer(block, 512, layers[3], stride=2, 149 | dilate=replace_stride_with_dilation[2]) 150 | self.avgpool = nn.AdaptiveAvgPool2d((1, 1)) 151 | self.fc = nn.Linear(512 * block.expansion, num_classes) 152 | 153 | for m in self.modules(): 154 | if isinstance(m, nn.Conv2d): 155 | nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') 156 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 157 | nn.init.constant_(m.weight, 1) 158 | nn.init.constant_(m.bias, 0) 159 | 160 | # Zero-initialize the last BN in each residual branch, 161 | # so that the residual branch starts with zeros, and each residual block behaves like an identity. 
162 | # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677 163 | if zero_init_residual: 164 | for m in self.modules(): 165 | if isinstance(m, Bottleneck): 166 | nn.init.constant_(m.bn3.weight, 0) 167 | elif isinstance(m, BasicBlock): 168 | nn.init.constant_(m.bn2.weight, 0) 169 | 170 | def _make_layer(self, block, planes, blocks, stride=1, dilate=False): 171 | norm_layer = self._norm_layer 172 | downsample = None 173 | previous_dilation = self.dilation 174 | if dilate: 175 | self.dilation *= stride 176 | stride = 1 177 | if stride != 1 or self.inplanes != planes * block.expansion: 178 | downsample = nn.Sequential( 179 | conv1x1(self.inplanes, planes * block.expansion, stride), 180 | norm_layer(planes * block.expansion), 181 | ) 182 | 183 | layers = [] 184 | layers.append(block(self.inplanes, planes, stride, downsample, self.groups, 185 | self.base_width, previous_dilation, norm_layer)) 186 | self.inplanes = planes * block.expansion 187 | for _ in range(1, blocks): 188 | layers.append(block(self.inplanes, planes, groups=self.groups, 189 | base_width=self.base_width, dilation=self.dilation, 190 | norm_layer=norm_layer)) 191 | 192 | return nn.Sequential(*layers) 193 | 194 | def forward(self, x): 195 | x = self.conv1(x) 196 | x = self.bn1(x) 197 | x = self.relu(x) 198 | x = self.maxpool(x) 199 | 200 | x = self.layer1(x) 201 | x = self.layer2(x) 202 | x = self.layer3(x) 203 | x = self.layer4(x) 204 | 205 | x = self.avgpool(x) 206 | x = torch.flatten(x, 1) 207 | x = self.fc(x) 208 | 209 | return x 210 | 211 | def get_featmap(self, x, pooled=True): 212 | x = self.conv1(x) 213 | x = self.bn1(x) 214 | x = self.relu(x) 215 | x = self.maxpool(x) 216 | 217 | x = self.layer1(x) 218 | x = self.layer2(x) 219 | x = self.layer3(x) 220 | x = self.layer4(x) 221 | 222 | if pooled: 223 | return self.avgpool(x) 224 | else: 225 | return x 226 | 227 | def get_featvec(self, x): 228 | x = self.conv1(x) 229 | x = self.bn1(x) 230 | x = self.relu(x) 231 | x = self.maxpool(x) 232 | 233 | x = self.layer1(x) 234 | x = self.layer2(x) 235 | x = self.layer3(x) 236 | x = self.layer4(x) 237 | 238 | x = self.avgpool(x) 239 | featvec = torch.flatten(x, 1) 240 | return featvec 241 | 242 | @property 243 | def feature_dim(self): 244 | return self.fc.weight.shape[-1] 245 | 246 | 247 | def _resnet(arch, block, layers, pretrained, progress, **kwargs): 248 | model = ResNet(block, layers, **kwargs) 249 | if pretrained: 250 | state_dict = load_state_dict_from_url(model_urls[arch], 251 | progress=progress) 252 | model.load_state_dict(state_dict) 253 | return model 254 | 255 | 256 | def resnet18(pretrained=False, progress=True, **kwargs): 257 | r"""ResNet-18 model from 258 | `"Deep Residual Learning for Image Recognition" `_ 259 | 260 | Args: 261 | pretrained (bool): If True, returns a model pre-trained on ImageNet 262 | progress (bool): If True, displays a progress bar of the download to stderr 263 | """ 264 | return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress, 265 | **kwargs) 266 | 267 | 268 | def resnet34(pretrained=False, progress=True, **kwargs): 269 | r"""ResNet-34 model from 270 | `"Deep Residual Learning for Image Recognition" `_ 271 | 272 | Args: 273 | pretrained (bool): If True, returns a model pre-trained on ImageNet 274 | progress (bool): If True, displays a progress bar of the download to stderr 275 | """ 276 | return _resnet('resnet34', BasicBlock, [3, 4, 6, 3], pretrained, progress, 277 | **kwargs) 278 | 279 | 280 | def resnet50(pretrained=False, progress=True, 
**kwargs): 281 | r"""ResNet-50 model from 282 | `"Deep Residual Learning for Image Recognition" `_ 283 | 284 | Args: 285 | pretrained (bool): If True, returns a model pre-trained on ImageNet 286 | progress (bool): If True, displays a progress bar of the download to stderr 287 | """ 288 | return _resnet('resnet50', Bottleneck, [3, 4, 6, 3], pretrained, progress, 289 | **kwargs) 290 | 291 | 292 | def resnet101(pretrained=False, progress=True, **kwargs): 293 | r"""ResNet-101 model from 294 | `"Deep Residual Learning for Image Recognition" `_ 295 | 296 | Args: 297 | pretrained (bool): If True, returns a model pre-trained on ImageNet 298 | progress (bool): If True, displays a progress bar of the download to stderr 299 | """ 300 | return _resnet('resnet101', Bottleneck, [3, 4, 23, 3], pretrained, progress, 301 | **kwargs) 302 | 303 | 304 | def resnet152(pretrained=False, progress=True, **kwargs): 305 | r"""ResNet-152 model from 306 | `"Deep Residual Learning for Image Recognition" `_ 307 | 308 | Args: 309 | pretrained (bool): If True, returns a model pre-trained on ImageNet 310 | progress (bool): If True, displays a progress bar of the download to stderr 311 | """ 312 | return _resnet('resnet152', Bottleneck, [3, 8, 36, 3], pretrained, progress, 313 | **kwargs) 314 | 315 | 316 | def resnext50_32x4d(pretrained=False, progress=True, **kwargs): 317 | r"""ResNeXt-50 32x4d model from 318 | `"Aggregated Residual Transformation for Deep Neural Networks" `_ 319 | 320 | Args: 321 | pretrained (bool): If True, returns a model pre-trained on ImageNet 322 | progress (bool): If True, displays a progress bar of the download to stderr 323 | """ 324 | kwargs['groups'] = 32 325 | kwargs['width_per_group'] = 4 326 | return _resnet('resnext50_32x4d', Bottleneck, [3, 4, 6, 3], 327 | pretrained, progress, **kwargs) 328 | 329 | 330 | def resnext101_32x8d(pretrained=False, progress=True, **kwargs): 331 | r"""ResNeXt-101 32x8d model from 332 | `"Aggregated Residual Transformation for Deep Neural Networks" `_ 333 | 334 | Args: 335 | pretrained (bool): If True, returns a model pre-trained on ImageNet 336 | progress (bool): If True, displays a progress bar of the download to stderr 337 | """ 338 | kwargs['groups'] = 32 339 | kwargs['width_per_group'] = 8 340 | return _resnet('resnext101_32x8d', Bottleneck, [3, 4, 23, 3], 341 | pretrained, progress, **kwargs) 342 | 343 | 344 | def wide_resnet50_2(pretrained=False, progress=True, **kwargs): 345 | r"""Wide ResNet-50-2 model from 346 | `"Wide Residual Networks" `_ 347 | 348 | The model is the same as ResNet except for the bottleneck number of channels 349 | which is twice larger in every block. The number of channels in outer 1x1 350 | convolutions is the same, e.g. last block in ResNet-50 has 2048-512-2048 351 | channels, and in Wide ResNet-50-2 has 2048-1024-2048. 352 | 353 | Args: 354 | pretrained (bool): If True, returns a model pre-trained on ImageNet 355 | progress (bool): If True, displays a progress bar of the download to stderr 356 | """ 357 | kwargs['width_per_group'] = 64 * 2 358 | return _resnet('wide_resnet50_2', Bottleneck, [3, 4, 6, 3], 359 | pretrained, progress, **kwargs) 360 | 361 | 362 | def wide_resnet101_2(pretrained=False, progress=True, **kwargs): 363 | r"""Wide ResNet-101-2 model from 364 | `"Wide Residual Networks" `_ 365 | 366 | The model is the same as ResNet except for the bottleneck number of channels 367 | which is twice larger in every block. The number of channels in outer 1x1 368 | convolutions is the same, e.g. 
last block in ResNet-50 has 2048-512-2048 369 | channels, and in Wide ResNet-50-2 has 2048-1024-2048. 370 | 371 | Args: 372 | pretrained (bool): If True, returns a model pre-trained on ImageNet 373 | progress (bool): If True, displays a progress bar of the download to stderr 374 | """ 375 | kwargs['width_per_group'] = 64 * 2 376 | return _resnet('wide_resnet101_2', Bottleneck, [3, 4, 23, 3], 377 | pretrained, progress, **kwargs) 378 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/tsn.py: -------------------------------------------------------------------------------- 1 | # Code for "TSM: Temporal Shift Module for Efficient Video Understanding" 2 | # arXiv:1811.08383 3 | # Ji Lin*, Chuang Gan, Song Han 4 | # {jilin, songhan}@mit.edu, ganchuang@csail.mit.edu 5 | 6 | 7 | from ops.transforms import * 8 | from torch import nn 9 | 10 | 11 | class TSN(nn.Module): 12 | def __init__(self, 13 | # num_class, 14 | num_segments, 15 | modality='RGB', 16 | base_model='resnet50', 17 | new_length=None, 18 | # consensus_type='avg', 19 | # before_softmax=True, 20 | # dropout=0.8, 21 | # img_feature_dim=256, 22 | crop_num=1, 23 | partial_bn=True, 24 | print_spec=True, 25 | pretrain='imagenet', 26 | is_shift=False, 27 | shift_div=8, 28 | shift_place='blockres', 29 | fc_lr5=False, 30 | temporal_pool=False, 31 | non_local=False): 32 | super(TSN, self).__init__() 33 | self.modality = modality 34 | self.num_segments = num_segments 35 | self.reshape = False 36 | self.crop_num = crop_num 37 | self.pretrain = pretrain 38 | # self.before_softma 39 | # self.dropout = dropout 40 | # self.consensus_type = consensus_type 41 | # self.img_feature_dim = img_feature_dim # the dimension of the CNN feature to represent each frame 42 | 43 | self.is_shift = is_shift 44 | self.shift_div = shift_div 45 | self.shift_place = shift_place 46 | self.base_model_name = base_model 47 | self.fc_lr5 = fc_lr5 48 | self.temporal_pool = temporal_pool 49 | self.non_local = non_local 50 | 51 | # if not before_softmax and consensus_type != 'avg': 52 | # raise ValueError("Only avg consensus can be used after Softmax") 53 | 54 | if new_length is None: 55 | self.new_length = 1 if modality == "RGB" else 5 56 | else: 57 | self.new_length = new_length 58 | if print_spec: 59 | print((""" 60 | Initializing TSN with base model: {}. 61 | TSN Configurations: 62 | input_modality: {} 63 | num_segments: {} 64 | new_length: {} 65 | """.format(base_model, self.modality, self.num_segments, self.new_length))) 66 | 67 | self._prepare_base_model(base_model) 68 | 69 | # feature_dim = self._prepare_tsn(num_class) 70 | 71 | # if self.modality == 'Flow': 72 | # print("Converting the ImageNet model to a flow init model") 73 | # self.base_model = self._construct_flow_model(self.base_model) 74 | # print("Done. Flow model ready...") 75 | # elif self.modality == 'RGBDiff': 76 | # print("Converting the ImageNet model to RGB+Diff init model") 77 | # self.base_model = self._construct_diff_model(self.base_model) 78 | # print("Done. 
RGBDiff model ready.") 79 | assert self.modality == 'RGB' 80 | 81 | # self.consensus = ConsensusModule(consensus_type) 82 | 83 | # if not self.before_softmax: 84 | # self.softmax = nn.Softmax() 85 | 86 | self._enable_pbn = partial_bn 87 | if partial_bn: 88 | self.partialBN(True) 89 | 90 | # def _prepare_tsn(self, num_class): 91 | # feature_dim = getattr(self.base_model, self.base_model.last_layer_name).in_features 92 | # if self.dropout == 0: 93 | # setattr(self.base_model, self.base_model.last_layer_name, nn.Linear(feature_dim, num_class)) 94 | # self.new_fc = None 95 | # else: 96 | # setattr(self.base_model, self.base_model.last_layer_name, nn.Dropout(p=self.dropout)) 97 | # self.new_fc = nn.Linear(feature_dim, num_class) 98 | # 99 | # std = 0.001 100 | # if self.new_fc is None: 101 | # normal_(getattr(self.base_model, self.base_model.last_layer_name).weight, 0, std) 102 | # constant_(getattr(self.base_model, self.base_model.last_layer_name).bias, 0) 103 | # else: 104 | # if hasattr(self.new_fc, 'weight'): 105 | # normal_(self.new_fc.weight, 0, std) 106 | # constant_(self.new_fc.bias, 0) 107 | # return feature_dim 108 | 109 | def _prepare_base_model(self, base_model): 110 | print('=> base model: {}'.format(base_model)) 111 | 112 | if 'resnet' in base_model: 113 | # self.base_model = getattr(torchvision.models, base_model)(True if self.pretrain == 'imagenet' else False) 114 | self.base_model = getattr(torchvision.models, base_model)(False) 115 | # self.base_model = resnet50(pretrained=True) 116 | if self.is_shift: 117 | print('Adding temporal shift...') 118 | from ops.temporal_shift import make_temporal_shift 119 | make_temporal_shift(self.base_model, self.num_segments, 120 | n_div=self.shift_div, place=self.shift_place, temporal_pool=self.temporal_pool) 121 | 122 | if self.non_local: 123 | raise NotImplementedError 124 | # print('Adding non-local module...') 125 | # from ops.non_local import make_non_local 126 | # make_non_local(self.base_model, self.num_segments) 127 | 128 | # self.base_model.last_layer_name = 'fc' 129 | # self.input_size = 224 130 | # self.input_mean = [0.485, 0.456, 0.406] 131 | # self.input_std = [0.229, 0.224, 0.225] 132 | 133 | self.base_model.avgpool = nn.AdaptiveAvgPool2d(1) 134 | # self.base_model = torch.nn.Sequential(*list(self.base_model.children())[:-1]) # remove the last fc layer 135 | 136 | # if self.modality == 'Flow': 137 | # self.input_mean = [0.5] 138 | # self.input_std = [np.mean(self.input_std)] 139 | # elif self.modality == 'RGBDiff': 140 | # self.input_mean = [0.485, 0.456, 0.406] + [0] * 3 * self.new_length 141 | # self.input_std = self.input_std + [np.mean(self.input_std) * 2] * 3 * self.new_length 142 | assert self.modality == 'RGB' 143 | else: 144 | raise ValueError('Unknown base model: {}'.format(base_model)) 145 | 146 | def train(self, mode=True): 147 | """ 148 | Override the default train() to freeze the BN parameters 149 | :return: 150 | """ 151 | super(TSN, self).train(mode) 152 | count = 0 153 | if self._enable_pbn and mode: 154 | print("Freezing BatchNorm2D except the first one.") 155 | for m in self.base_model.modules(): 156 | if isinstance(m, nn.BatchNorm2d): 157 | count += 1 158 | if count >= (2 if self._enable_pbn else 1): 159 | m.eval() 160 | # shutdown update in frozen mode 161 | m.weight.requires_grad = False 162 | m.bias.requires_grad = False 163 | 164 | def partialBN(self, enable): 165 | self._enable_pbn = enable 166 | 167 | def get_optim_policies(self): 168 | first_conv_weight = [] 169 | first_conv_bias = [] 170 | normal_weight 
= [] 171 | normal_bias = [] 172 | bn = [] 173 | 174 | conv_cnt = 0 175 | bn_cnt = 0 176 | for m in self.modules(): 177 | if isinstance(m, torch.nn.Conv2d) or isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv3d): 178 | ps = list(m.parameters()) 179 | conv_cnt += 1 180 | if conv_cnt == 1: 181 | first_conv_weight.append(ps[0]) 182 | if len(ps) == 2: 183 | first_conv_bias.append(ps[1]) 184 | else: 185 | normal_weight.append(ps[0]) 186 | if len(ps) == 2: 187 | normal_bias.append(ps[1]) 188 | elif isinstance(m, torch.nn.BatchNorm2d): 189 | bn_cnt += 1 190 | # later BN's are frozen 191 | if not self._enable_pbn or bn_cnt == 1: 192 | bn.extend(list(m.parameters())) 193 | elif isinstance(m, torch.nn.BatchNorm3d): 194 | bn_cnt += 1 195 | # later BN's are frozen 196 | if not self._enable_pbn or bn_cnt == 1: 197 | bn.extend(list(m.parameters())) 198 | elif len(m._modules) == 0: 199 | if len(list(m.parameters())) > 0: 200 | raise ValueError("New atomic module type: {}. Need to give it a learning policy".format(type(m))) 201 | 202 | return [ 203 | {'params': first_conv_weight, 'lr_mult': 5 if self.modality == 'Flow' else 1, 'decay_mult': 1, 204 | 'name': "first_conv_weight"}, 205 | {'params': first_conv_bias, 'lr_mult': 10 if self.modality == 'Flow' else 2, 'decay_mult': 0, 206 | 'name': "first_conv_bias"}, 207 | {'params': normal_weight, 'lr_mult': 1, 'decay_mult': 1, 208 | 'name': "normal_weight"}, 209 | {'params': normal_bias, 'lr_mult': 2, 'decay_mult': 0, 210 | 'name': "normal_bias"}, 211 | {'params': bn, 'lr_mult': 1, 'decay_mult': 0, 212 | 'name': "BN scale/shift"}, 213 | ] 214 | 215 | def forward(self, input, no_reshape=False): 216 | if not no_reshape: 217 | sample_len = (3 if self.modality == "RGB" else 2) * self.new_length # self.new_length = 1 if modality == "RGB" else 5 218 | 219 | if self.modality == 'RGBDiff': 220 | sample_len = 3 * self.new_length 221 | input = self._get_diff(input) 222 | 223 | base_out = self.base_model(input.view((-1, sample_len) + input.size()[-2:])) # original input size: n, t*c, h, w; after reshape n*t, c, h, w 224 | else: 225 | base_out = self.base_model(input) 226 | 227 | # if self.dropout > 0: 228 | # base_out = self.new_fc(base_out) 229 | # 230 | # if not self.before_softmax: 231 | # base_out = self.softmax(base_out) 232 | 233 | if self.reshape: # self.reshape=True 234 | if self.is_shift and self.temporal_pool: 235 | base_out = base_out.view((-1, self.num_segments // 2) + base_out.size()[1:]) 236 | else: 237 | base_out = base_out.view((-1, self.num_segments) + base_out.size()[1:]) # base_out size: n, t, c, h, w 238 | # output = self.consensus(base_out) 239 | # return output.squeeze(1) 240 | 241 | return base_out.squeeze() 242 | 243 | @property 244 | def feature_dim(self): 245 | return 2048 -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/models/utils.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Description: 3 | Author: Zhaoxi Chen 4 | Github: https://github.com/FrozenBurning 5 | Date: 2020-12-14 13:54:36 6 | LastEditors: Zhaoxi Chen 7 | LastEditTime: 2020-12-14 15:02:20 8 | ''' 9 | try: 10 | from torch.hub import load_state_dict_from_url 11 | except ImportError: 12 | from torch.utils.model_zoo import load_url as load_state_dict_from_url 13 | 14 | import torchvision 15 | import torch 16 | import numpy as np 17 | 18 | def prep_a_net(model_name, shall_pretrain): 19 | model = getattr(torchvision.models, model_name)(shall_pretrain) 
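    # `shall_pretrain` is forwarded positionally as torchvision's `pretrained` flag,
    # so ImageNet weights are downloaded only when it is True.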
20 | if "resnet" in model_name: 21 | model.last_layer_name = 'fc' 22 | elif "mobilenet_v2" in model_name: 23 | model.last_layer_name = 'classifier' 24 | return model 25 | 26 | def zero_pad(im, pad_size): 27 | """Performs zero padding (CHW format).""" 28 | pad_width = ((0, 0), (pad_size, pad_size), (pad_size, pad_size)) 29 | return np.pad(im, pad_width, mode="constant") 30 | 31 | def random_crop(im, size, pad_size=0): 32 | """Performs random crop (CHW format).""" 33 | if pad_size > 0: 34 | im = zero_pad(im=im, pad_size=pad_size) 35 | h, w = im.shape[1:] 36 | if size == h: 37 | return im 38 | y = np.random.randint(0, h - size) 39 | x = np.random.randint(0, w - size) 40 | im_crop = im[:, y : (y + size), x : (x + size)] 41 | assert im_crop.shape[1:] == (size, size) 42 | return im_crop 43 | 44 | def get_patch(images, action_sequence, patch_size): 45 | """Get small patch of the original image""" 46 | batch_size = images.size(0) 47 | image_size = images.size(2) 48 | 49 | patch_coordinate = torch.floor(action_sequence * (image_size - patch_size)).int() 50 | patches = [] 51 | for i in range(batch_size): 52 | per_patch = images[i, :, 53 | (patch_coordinate[i, 0].item()): ((patch_coordinate[i, 0] + patch_size).item()), 54 | (patch_coordinate[i, 1].item()): ((patch_coordinate[i, 1] + patch_size).item())] 55 | 56 | patches.append(per_patch.view(1, per_patch.size(0), per_patch.size(1), per_patch.size(2))) 57 | 58 | return torch.cat(patches, 0) 59 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__init__.py: -------------------------------------------------------------------------------- 1 | from ops.basic_ops import * 2 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/temporal_shift.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/temporal_shift.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/transforms.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/transforms.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/Experiments on Something-Something V1&V2/ops/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /Experiments on Something-Something 
V1&V2/ops/basic_ops.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | class Identity(torch.nn.Module): 5 | def forward(self, input): 6 | return input 7 | 8 | 9 | class SegmentConsensus(torch.nn.Module): 10 | 11 | def __init__(self, consensus_type, dim=1): 12 | super(SegmentConsensus, self).__init__() 13 | self.consensus_type = consensus_type 14 | self.dim = dim 15 | self.shape = None 16 | 17 | def forward(self, input_tensor): 18 | self.shape = input_tensor.size() 19 | if self.consensus_type == 'avg': 20 | output = input_tensor.mean(dim=self.dim, keepdim=True) 21 | elif self.consensus_type == 'identity': 22 | output = input_tensor 23 | else: 24 | output = None 25 | 26 | return output 27 | 28 | 29 | class ConsensusModule(torch.nn.Module): 30 | 31 | def __init__(self, consensus_type, dim=1): 32 | super(ConsensusModule, self).__init__() 33 | self.consensus_type = consensus_type if consensus_type != 'rnn' else 'identity' 34 | self.dim = dim 35 | 36 | def forward(self, input): 37 | return SegmentConsensus(self.consensus_type, self.dim)(input) 38 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/dataset.py: -------------------------------------------------------------------------------- 1 | # Code for "TSM: Temporal Shift Module for Efficient Video Understanding" 2 | # arXiv:1811.08383 3 | # Ji Lin*, Chuang Gan, Song Han 4 | # {jilin, songhan}@mit.edu, ganchuang@csail.mit.edu 5 | 6 | import torch.utils.data as data 7 | 8 | from PIL import Image 9 | import os 10 | import numpy as np 11 | from numpy.random import randint 12 | 13 | 14 | class VideoRecord(object): 15 | def __init__(self, row): 16 | self._data = row 17 | 18 | @property 19 | def path(self): 20 | return self._data[0] 21 | 22 | @property 23 | def num_frames(self): 24 | return int(self._data[1]) 25 | 26 | @property 27 | def label(self): 28 | return int(self._data[2]) 29 | 30 | 31 | class TSNDataSet(data.Dataset): 32 | def __init__(self, root_path, list_file, num_segments_glancer=8, 33 | num_segments_focuser=8, new_length=1, modality='RGB', 34 | image_tmpl='img_{:05d}.jpg', transform=None, 35 | random_shift=True, test_mode=False, 36 | remove_missing=False, dense_sample=False, twice_sample=False): 37 | 38 | self.root_path = root_path 39 | self.list_file = list_file 40 | self.num_segments_glancer = num_segments_glancer 41 | self.num_segments_focuser = num_segments_focuser 42 | self.new_length = new_length 43 | self.modality = modality 44 | self.image_tmpl = image_tmpl 45 | self.transform = transform 46 | self.random_shift = random_shift 47 | self.test_mode = test_mode 48 | # print('self.test_mode:', self.test_mode) 49 | self.remove_missing = remove_missing 50 | self.dense_sample = dense_sample # using dense sample as I3D 51 | self.twice_sample = twice_sample # twice sample for more validation 52 | if self.dense_sample: 53 | print('=> Using dense sample for the dataset...') 54 | if self.twice_sample: 55 | print('=> Using twice sample for the dataset...') 56 | 57 | if self.modality == 'RGBDiff': 58 | self.new_length += 1 # Diff needs one more image to calculate diff 59 | 60 | self._parse_list() 61 | 62 | def _load_image(self, directory, idx): 63 | if self.modality == 'RGB' or self.modality == 'RGBDiff': 64 | try: 65 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(idx))).convert('RGB')] 66 | except Exception: 67 | print('error loading image:', 
os.path.join(self.root_path, directory, self.image_tmpl.format(idx))) 68 | return [Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(1))).convert('RGB')] 69 | elif self.modality == 'Flow': 70 | if self.image_tmpl == 'flow_{}_{:05d}.jpg': # ucf 71 | x_img = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format('x', idx))).convert( 72 | 'L') 73 | y_img = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format('y', idx))).convert( 74 | 'L') 75 | elif self.image_tmpl == '{:06d}-{}_{:05d}.jpg': # something v1 flow 76 | x_img = Image.open(os.path.join(self.root_path, '{:06d}'.format(int(directory)), self.image_tmpl. 77 | format(int(directory), 'x', idx))).convert('L') 78 | y_img = Image.open(os.path.join(self.root_path, '{:06d}'.format(int(directory)), self.image_tmpl. 79 | format(int(directory), 'y', idx))).convert('L') 80 | else: 81 | try: 82 | # idx_skip = 1 + (idx-1)*5 83 | flow = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(idx))).convert( 84 | 'RGB') 85 | except Exception: 86 | print('error loading flow file:', 87 | os.path.join(self.root_path, directory, self.image_tmpl.format(idx))) 88 | flow = Image.open(os.path.join(self.root_path, directory, self.image_tmpl.format(1))).convert('RGB') 89 | # the input flow file is RGB image with (flow_x, flow_y, blank) for each channel 90 | flow_x, flow_y, _ = flow.split() 91 | x_img = flow_x.convert('L') 92 | y_img = flow_y.convert('L') 93 | 94 | return [x_img, y_img] 95 | 96 | def _parse_list(self): 97 | # check the frame number is large >3: 98 | tmp = [x.strip().split(' ') for x in open(self.list_file)] 99 | if not self.test_mode or self.remove_missing: 100 | tmp = [item for item in tmp if int(item[1]) >= 3] 101 | self.video_list = [VideoRecord(item) for item in tmp] 102 | 103 | if self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 104 | for v in self.video_list: 105 | v._data[1] = int(v._data[1]) / 2 106 | print('video number:%d' % (len(self.video_list))) 107 | 108 | def _sample_indices(self, record): 109 | """ 110 | 111 | :param record: VideoRecord 112 | :return: list 113 | """ 114 | ### For glancer 115 | if self.dense_sample: # i3d dense sample 116 | sample_pos = max(1, 1 + record.num_frames - 64) 117 | t_stride = 64 // self.num_segments_glancer 118 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 119 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_glancer)] 120 | offsets_glancer = np.array(offsets) + 1 121 | else: # normal sample 122 | # print('normal sample') 123 | average_duration = (record.num_frames - self.new_length + 1) // self.num_segments_glancer # RGB self.new_length=1 124 | if average_duration > 0: 125 | offsets = np.multiply(list(range(self.num_segments_glancer)), average_duration) + randint(average_duration, 126 | size=self.num_segments_glancer) 127 | elif record.num_frames > self.num_segments_glancer: 128 | offsets = np.sort(randint(record.num_frames - self.new_length + 1, size=self.num_segments_glancer)) 129 | else: 130 | offsets = np.zeros((self.num_segments_glancer,)) 131 | offsets_glancer = offsets + 1 132 | 133 | ### For focuser 134 | if self.dense_sample: # i3d dense sample 135 | sample_pos = max(1, 1 + record.num_frames - 64) 136 | t_stride = 64 // self.num_segments_focuser 137 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 138 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_focuser)] 139 | 
offsets_focuser = np.array(offsets) + 1 140 | else: # normal sample 141 | # print('normal sample') 142 | average_duration = (record.num_frames - self.new_length + 1) // self.num_segments_focuser # RGB self.new_length=1 143 | if average_duration > 0: 144 | offsets = np.multiply(list(range(self.num_segments_focuser)), average_duration) + randint(average_duration, 145 | size=self.num_segments_focuser) 146 | elif record.num_frames > self.num_segments_focuser: 147 | offsets = np.sort(randint(record.num_frames - self.new_length + 1, size=self.num_segments_focuser)) 148 | else: 149 | offsets = np.zeros((self.num_segments_focuser,)) 150 | offsets_focuser = offsets + 1 151 | 152 | return offsets_glancer, offsets_focuser 153 | 154 | def _get_val_indices(self, record): 155 | 156 | ### For glancer 157 | if self.dense_sample: # i3d dense sample 158 | sample_pos = max(1, 1 + record.num_frames - 64) 159 | t_stride = 64 // self.num_segments_glancer 160 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 161 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_glancer)] 162 | offsets_glancer = np.array(offsets) + 1 163 | else: 164 | if record.num_frames > self.num_segments_glancer + self.new_length - 1: 165 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_glancer) 166 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_glancer)]) 167 | else: 168 | offsets = np.zeros((self.num_segments_glancer,)) 169 | offsets_glancer = offsets + 1 170 | 171 | ### For focuser 172 | if self.dense_sample: # i3d dense sample 173 | sample_pos = max(1, 1 + record.num_frames - 64) 174 | t_stride = 64 // self.num_segments_focuser 175 | start_idx = 0 if sample_pos == 1 else np.random.randint(0, sample_pos - 1) 176 | offsets = [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_focuser)] 177 | offsets_focuser = np.array(offsets) + 1 178 | else: 179 | if record.num_frames > self.num_segments_focuser + self.new_length - 1: 180 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_focuser) 181 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_focuser)]) 182 | else: 183 | offsets = np.zeros((self.num_segments_focuser,)) 184 | offsets_focuser = offsets + 1 185 | 186 | return offsets_glancer, offsets_focuser 187 | 188 | def _get_test_indices(self, record): 189 | 190 | ### For glancer 191 | if self.dense_sample: 192 | sample_pos = max(1, 1 + record.num_frames - 64) 193 | t_stride = 64 // self.num_segments_glancer 194 | start_list = np.linspace(0, sample_pos - 1, num=10, dtype=int) 195 | offsets = [] 196 | for start_idx in start_list.tolist(): 197 | offsets += [(idx * t_stride + start_idx) % record.num_frames for idx in range(self.num_segments_glancer)] 198 | offsets_glancer = np.array(offsets) + 1 199 | elif self.twice_sample: 200 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_glancer) 201 | 202 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_glancer)] + 203 | [int(tick * x) for x in range(self.num_segments_glancer)]) 204 | 205 | offsets_glancer = offsets + 1 206 | else: 207 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_glancer) 208 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_glancer)]) 209 | offsets_glancer = offsets + 1 210 | 211 | ### For focuser 212 | if self.dense_sample: 213 | sample_pos = 
max(1, 1 + record.num_frames - 64) 214 | t_stride = 64 // self.num_segments_focuser 215 | start_list = np.linspace(0, sample_pos - 1, num=10, dtype=int) 216 | offsets = [] 217 | for start_idx in start_list.tolist(): 218 | offsets += [(idx * t_stride + start_idx) % record.num_frames for idx in 219 | range(self.num_segments_focuser)] 220 | offsets_focuser = np.array(offsets) + 1 221 | elif self.twice_sample: 222 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_focuser) 223 | 224 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_focuser)] + 225 | [int(tick * x) for x in range(self.num_segments_focuser)]) 226 | 227 | offsets_focuser = offsets + 1 228 | else: 229 | tick = (record.num_frames - self.new_length + 1) / float(self.num_segments_focuser) 230 | offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments_focuser)]) 231 | offsets_focuser = offsets + 1 232 | 233 | return offsets_glancer, offsets_focuser 234 | 235 | def __getitem__(self, index): 236 | # print('get item') 237 | record = self.video_list[index] 238 | # check this is a legit video folder 239 | 240 | if self.image_tmpl == 'flow_{}_{:05d}.jpg': 241 | file_name = self.image_tmpl.format('x', 1) 242 | full_path = os.path.join(self.root_path, record.path, file_name) 243 | elif self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 244 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 245 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 246 | else: 247 | file_name = self.image_tmpl.format(1) 248 | full_path = os.path.join(self.root_path, record.path, file_name) 249 | 250 | while not os.path.exists(full_path): 251 | print('################## Not Found:', os.path.join(self.root_path, record.path, file_name)) 252 | index = np.random.randint(len(self.video_list)) 253 | record = self.video_list[index] 254 | if self.image_tmpl == 'flow_{}_{:05d}.jpg': 255 | file_name = self.image_tmpl.format('x', 1) 256 | full_path = os.path.join(self.root_path, record.path, file_name) 257 | elif self.image_tmpl == '{:06d}-{}_{:05d}.jpg': 258 | file_name = self.image_tmpl.format(int(record.path), 'x', 1) 259 | full_path = os.path.join(self.root_path, '{:06d}'.format(int(record.path)), file_name) 260 | else: 261 | file_name = self.image_tmpl.format(1) 262 | full_path = os.path.join(self.root_path, record.path, file_name) 263 | 264 | # print('record:', record) 265 | if not self.test_mode: 266 | # print('test model False') 267 | segment_indices_glancer, segment_indices_focuser = self._sample_indices(record) \ 268 | if self.random_shift else self._get_val_indices(record) 269 | else: 270 | # print('test model True') 271 | segment_indices_glancer, segment_indices_focuser = self._get_test_indices(record) 272 | 273 | return self.get(record, segment_indices_glancer, segment_indices_focuser) 274 | 275 | def get(self, record, indices_glancer, indices_focuser): 276 | 277 | images_glancer = list() 278 | for seg_ind in indices_glancer: 279 | p = int(seg_ind) 280 | for i in range(self.new_length): 281 | seg_imgs = self._load_image(record.path, p) 282 | images_glancer.extend(seg_imgs) 283 | if p < record.num_frames: 284 | p += 1 285 | 286 | process_data_glancer = self.transform(images_glancer) 287 | 288 | images_focuser = list() 289 | for seg_ind in indices_focuser: 290 | p = int(seg_ind) 291 | for i in range(self.new_length): 292 | seg_imgs = self._load_image(record.path, p) 293 | images_focuser.extend(seg_imgs) 294 | if p < record.num_frames: 295 | p 
+= 1 296 | 297 | process_data_focuser = self.transform(images_focuser) 298 | return process_data_glancer, process_data_focuser, record.label 299 | 300 | def __len__(self): 301 | return len(self.video_list) 302 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/dataset_config.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | 4 | def return_somethingv1(modality, root_dataset): 5 | filename_categories = 'something-something-v1/category.txt' 6 | if modality == 'RGB': 7 | root_data = os.path.join(root_dataset, 'something-something-v1/20bn-something-something-v1') 8 | filename_imglist_train = 'something-something-v1/train_videofolder.txt' 9 | filename_imglist_val = 'something-something-v1/val_videofolder.txt' 10 | prefix = '{:05d}.jpg' 11 | elif modality == 'Flow': 12 | root_data = os.path.join(root_dataset, 'something-something-v1/20bn-something-something-v1-flow') 13 | filename_imglist_train = 'something-something-v1/train_videofolder_flow.txt' 14 | filename_imglist_val = 'something-something-v1/val_videofolder_flow.txt' 15 | prefix = '{:06d}-{}_{:05d}.jpg' 16 | else: 17 | print('no such modality:'+modality) 18 | raise NotImplementedError 19 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 20 | 21 | 22 | def return_somethingv2(modality, root_dataset): 23 | filename_categories = 'something-something-v2/category.txt' 24 | if modality == 'RGB': 25 | root_data = os.path.join(root_dataset, 'something-something-v2/20bn-something-something-v2-frames') 26 | filename_imglist_train = 'something-something-v2/train_videofolder.txt' 27 | filename_imglist_val = 'something-something-v2/val_videofolder.txt' 28 | prefix = '{:06d}.jpg' 29 | elif modality == 'Flow': 30 | root_data = os.path.join(root_dataset, 'something-something-v2/20bn-something-something-v2-flow') 31 | filename_imglist_train = 'something-something-v2/train_videofolder_flow.txt' 32 | filename_imglist_val = 'something-something-v2/val_videofolder_flow.txt' 33 | prefix = '{:06d}.jpg' 34 | else: 35 | raise NotImplementedError('no such modality:'+modality) 36 | return filename_categories, filename_imglist_train, filename_imglist_val, root_data, prefix 37 | 38 | 39 | def return_dataset(dataset, modality, root_dataset): 40 | dict_single = {'somethingv1': return_somethingv1, 'somethingv2': return_somethingv2} 41 | if dataset in dict_single: 42 | file_categories, file_imglist_train, file_imglist_val, root_data, prefix = dict_single[dataset](modality, root_dataset) 43 | else: 44 | raise ValueError('Unknown dataset '+dataset) 45 | 46 | file_imglist_train = os.path.join(root_dataset, file_imglist_train) 47 | file_imglist_val = os.path.join(root_dataset, file_imglist_val) 48 | if isinstance(file_categories, str): 49 | file_categories = os.path.join(root_dataset, file_categories) 50 | with open(file_categories) as f: 51 | lines = f.readlines() 52 | categories = [item.rstrip() for item in lines] 53 | else: # number of categories 54 | categories = [None] * file_categories 55 | n_class = len(categories) 56 | print('{}: {} classes'.format(dataset, n_class)) 57 | return n_class, file_imglist_train, file_imglist_val, root_data, prefix -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/my_logger.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import time 3 
| from datetime import datetime 4 | 5 | 6 | # TODO(Yue) Overrides the default logger 7 | class Logger(object): 8 | def __init__(self, log_prefix=""): 9 | self._terminal = sys.stdout 10 | self._timestr = datetime.fromtimestamp(time.time()).strftime("%m%d-%H%M%S") 11 | self._log_path = None 12 | self._log_dir_name = None 13 | self._log_file_name = None 14 | self._history_records = [" ".join(["python"] + sys.argv + ["\n"])] # TODO(yue) remember the CLI input 15 | self._write_mode = "bear_in_mind" 16 | self._prefix = log_prefix 17 | 18 | # TODO(yue) buffered write() and create_log() help when we don't want to save logs too early because of some early-check process: 19 | # we just bear them in mind, and when we really need to write them down, we do that 20 | # without Logger: terminal 21 | # bear_in_mind: terminal->RAM 22 | # take_notes: RAM->FILE 23 | # normal: terminal->FILE 24 | def create_log(self, log_path, test_mode, t, bs, k): 25 | self._log_dir_name = log_path 26 | if test_mode: 27 | self._log_file_name = "test-%s-t%02d-bz%02d-k%02d.txt" % (self._timestr, t, bs, k) 28 | else: 29 | self._log_file_name = "log-%s.txt" % self._timestr 30 | self._log_path = log_path + "/" + self._log_file_name 31 | self.log = open(self._log_path, "a", 1) 32 | self._write_mode = "take_notes" 33 | for record in self._history_records: 34 | self.write(record) 35 | self._history_records = [] 36 | self._write_mode = "normal" 37 | 38 | def write(self, message): 39 | if self._write_mode in ["bear_in_mind", "normal"]: 40 | self._terminal.write(message) 41 | if self._write_mode in ["take_notes", "normal"]: 42 | self.log.write(message.replace("\033[0m", ""). \ 43 | replace("\033[95m", "").replace("\033[94m", "").replace("\033[93m", "").replace("\033[92m", 44 | "").replace( 45 | "\033[91m", "")) 46 | else: 47 | self._history_records.append(message) 48 | 49 | def flush(self): 50 | pass 51 | 52 | def close_log(self): 53 | a = 1 54 | self.log.close() 55 | return sys.stdout 56 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/net_flops_table.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | sys.path.insert(0, "../") 4 | 5 | import torch 6 | import torchvision 7 | from torch import nn 8 | from thop import profile 9 | 10 | feat_dim_dict = { 11 | "resnet18": 512, 12 | "resnet34": 512, 13 | "resnet50": 2048, 14 | "resnet101": 2048, 15 | "mobilenet_v2": 1280, 16 | "efficientnet-b0": 1280, 17 | "efficientnet-b1": 1280, 18 | "efficientnet-b2": 1408, 19 | "efficientnet-b3": 1536, 20 | "efficientnet-b4": 1792, 21 | "efficientnet-b5": 2048, 22 | } 23 | 24 | prior_dict = { 25 | "efficientnet-b0": (0.39, 5.3), 26 | "efficientnet-b1": (0.70, 7.8), 27 | "efficientnet-b2": (1.00, 9.2), 28 | "efficientnet-b3": (1.80, 12), 29 | "efficientnet-b4": (4.20, 19), 30 | "efficientnet-b5": (9.90, 30), 31 | } 32 | 33 | 34 | def get_gflops_params(model_name, resolution, num_classes, seg_len=-1, pretrained=True): 35 | if model_name in prior_dict: 36 | gflops, params = prior_dict[model_name] 37 | gflops = gflops / 224 / 224 * resolution * resolution 38 | return gflops, params 39 | 40 | if "resnet" in model_name: 41 | model = getattr(torchvision.models, model_name)(pretrained) 42 | last_layer = "fc" 43 | elif model_name == "mobilenet_v2": 44 | model = getattr(torchvision.models, model_name)(pretrained) 45 | last_layer = "classifier" 46 | else: 47 | exit("I don't know what is %s" % model_name) 48 | feat_dim = feat_dim_dict[model_name] 49 | 50 | 
setattr(model, last_layer, nn.Linear(feat_dim, num_classes)) 51 | 52 | if seg_len == -1: 53 | dummy_data = torch.randn(1, 3, resolution, resolution) 54 | else: 55 | dummy_data = torch.randn(1, 3, seg_len, resolution, resolution) 56 | 57 | hooks = {} 58 | flops, params = profile(model, inputs=(dummy_data,), custom_ops=hooks) 59 | gflops = flops / 1e9 60 | params = params / 1e6 61 | 62 | return gflops, params 63 | 64 | 65 | if __name__ == "__main__": 66 | str_list = [] 67 | for s in str_list: 68 | print(s) 69 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/sal_rank_loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | def cal_sal_rank_loss(real_pred, lite_pred, target, margin=0): 6 | B, T, K = real_pred.shape 7 | 8 | # TODO(shape) B * T 9 | b_idx = [[x] * T for x in range(B)] 10 | t_idx = [list(range(T)) for _ in range(B)] 11 | k_idx = [[tgt] * T for tgt in target[:, 0].cpu().numpy()] 12 | 13 | # TODO(shape) B * T 14 | real_cfd = real_pred[b_idx, t_idx, k_idx] 15 | lite_cfd = lite_pred[b_idx, t_idx, k_idx] 16 | 17 | x, y = torch.triu_indices(T - 1, T - 1) + torch.tensor([[0], [1]]) 18 | 19 | # TODO(shape) B * (T*(T-1)/2) 20 | real_cfd_x = real_cfd[:, x] 21 | real_cfd_y = real_cfd[:, y] 22 | lite_cfd_x = lite_cfd[:, x] 23 | lite_cfd_y = lite_cfd[:, y] 24 | 25 | psu_label = (real_cfd_x > real_cfd_y).double() * 2 - 1 26 | 27 | return F.margin_ranking_loss(lite_cfd_x, lite_cfd_y, psu_label, margin=margin, reduction="mean") 28 | 29 | 30 | if __name__ == "__main__": 31 | # B=2, T=3, K=4 32 | a = torch.tensor([[[0.1, 0.2, 0.3, 0.4], [0.5, 0.2, 0.1, 0.2], [0.3, 0.3, 0.3, 0.1]], 33 | [[0.3, 0.1, 0.1, 0.1], [0.2, 0.2, 0.5, 0.3], [0.4, 0.4, 0.3, 0.1]]]) 34 | b = torch.tensor([[[0.0, 0.0, 0.0, 0.3], [0.0, 0.0, 0.0, 0.2], [0.0, 0.0, 0.0, 0.3]], 35 | [[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.3, 0.0], [0.0, 0.0, 0.2, 0.0]]]) 36 | target = torch.tensor([3, 2]) 37 | margin = 0 38 | 39 | print("Expect: 0.0000, reality:", cal_sal_rank_loss(a, a, target, 0)) # TODO(yue) expect to become 0 40 | print("Expect: 0.0167, reality:", cal_sal_rank_loss(a, b, target, 0)) # TODO(yue) expect to become 0.1/6=0.0167 41 | print("Expect: 0.9333, reality:", cal_sal_rank_loss(a, b, target, 1)) # TODO(yue) expect to become 5.6/6=0.9333 42 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/temporal_shift.py: -------------------------------------------------------------------------------- 1 | # Code for "TSM: Temporal Shift Module for Efficient Video Understanding" 2 | # arXiv:1811.08383 3 | # Ji Lin*, Chuang Gan, Song Han 4 | # {jilin, songhan}@mit.edu, ganchuang@csail.mit.edu 5 | 6 | import torch 7 | import torch.nn as nn 8 | import torch.nn.functional as F 9 | 10 | from models.resnet import resnet50 11 | 12 | 13 | class TemporalShift(nn.Module): 14 | def __init__(self, net, n_segment=3, n_div=8, inplace=False): 15 | super(TemporalShift, self).__init__() 16 | self.net = net 17 | self.n_segment = n_segment 18 | self.fold_div = n_div 19 | self.inplace = inplace 20 | if inplace: 21 | print('=> Using in-place shift...') 22 | print('=> Using fold div: {}'.format(self.fold_div)) 23 | 24 | def forward(self, x): 25 | x = self.shift(x, self.n_segment, fold_div=self.fold_div, inplace=self.inplace) 26 | return self.net(x) 27 | 28 | @staticmethod 29 | def shift(x, n_segment, fold_div=3, 
inplace=False): 30 | nt, c, h, w = x.size() 31 | n_batch = nt // n_segment 32 | x = x.view(n_batch, n_segment, c, h, w) 33 | 34 | fold = c // fold_div 35 | if inplace: 36 | # Due to some out of order error when performing parallel computing. 37 | # May need to write a CUDA kernel. 38 | raise NotImplementedError 39 | # out = InplaceShift.apply(x, fold) 40 | else: 41 | out = torch.zeros_like(x) 42 | out[:, :-1, :fold] = x[:, 1:, :fold] # shift left 43 | out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold] # shift right 44 | out[:, :, 2 * fold:] = x[:, :, 2 * fold:] # not shift 45 | 46 | return out.view(nt, c, h, w) 47 | 48 | 49 | class InplaceShift(torch.autograd.Function): 50 | # Special thanks to @raoyongming for the help to this function 51 | @staticmethod 52 | def forward(ctx, input, fold): 53 | # not support higher order gradient 54 | # input = input.detach_() 55 | ctx.fold_ = fold 56 | n, t, c, h, w = input.size() 57 | buffer = input.data.new(n, t, fold, h, w).zero_() 58 | buffer[:, :-1] = input.data[:, 1:, :fold] 59 | input.data[:, :, :fold] = buffer 60 | buffer.zero_() 61 | buffer[:, 1:] = input.data[:, :-1, fold: 2 * fold] 62 | input.data[:, :, fold: 2 * fold] = buffer 63 | return input 64 | 65 | @staticmethod 66 | def backward(ctx, grad_output): 67 | # grad_output = grad_output.detach_() 68 | fold = ctx.fold_ 69 | n, t, c, h, w = grad_output.size() 70 | buffer = grad_output.data.new(n, t, fold, h, w).zero_() 71 | buffer[:, 1:] = grad_output.data[:, :-1, :fold] 72 | grad_output.data[:, :, :fold] = buffer 73 | buffer.zero_() 74 | buffer[:, :-1] = grad_output.data[:, 1:, fold: 2 * fold] 75 | grad_output.data[:, :, fold: 2 * fold] = buffer 76 | return grad_output, None 77 | 78 | 79 | class TemporalPool(nn.Module): 80 | def __init__(self, net, n_segment): 81 | super(TemporalPool, self).__init__() 82 | self.net = net 83 | self.n_segment = n_segment 84 | 85 | def forward(self, x): 86 | x = self.temporal_pool(x, n_segment=self.n_segment) 87 | return self.net(x) 88 | 89 | @staticmethod 90 | def temporal_pool(x, n_segment): 91 | nt, c, h, w = x.size() 92 | n_batch = nt // n_segment 93 | x = x.view(n_batch, n_segment, c, h, w).transpose(1, 2) # n, c, t, h, w 94 | x = F.max_pool3d(x, kernel_size=(3, 1, 1), stride=(2, 1, 1), padding=(1, 0, 0)) 95 | x = x.transpose(1, 2).contiguous().view(nt // 2, c, h, w) 96 | return x 97 | 98 | 99 | def make_temporal_shift(net, n_segment, n_div=8, place='blockres', temporal_pool=False): 100 | if temporal_pool: 101 | n_segment_list = [n_segment, n_segment // 2, n_segment // 2, n_segment // 2] 102 | else: 103 | n_segment_list = [n_segment] * 4 104 | assert n_segment_list[-1] > 0 105 | print('=> n_segment per stage: {}'.format(n_segment_list)) 106 | 107 | import torchvision 108 | # if isinstance(net, torchvision.models.ResNet): 109 | if isinstance(net, torch.nn.Module): 110 | if place == 'block': 111 | def make_block_temporal(stage, this_segment): 112 | blocks = list(stage.children()) 113 | print('=> Processing stage with {} blocks'.format(len(blocks))) 114 | for i, b in enumerate(blocks): 115 | blocks[i] = TemporalShift(b, n_segment=this_segment, n_div=n_div) 116 | return nn.Sequential(*(blocks)) 117 | 118 | net.layer1 = make_block_temporal(net.layer1, n_segment_list[0]) 119 | net.layer2 = make_block_temporal(net.layer2, n_segment_list[1]) 120 | net.layer3 = make_block_temporal(net.layer3, n_segment_list[2]) 121 | net.layer4 = make_block_temporal(net.layer4, n_segment_list[3]) 122 | 123 | elif 'blockres' in place: 124 | n_round = 1 125 | if 
len(list(net.layer3.children())) >= 23: 126 | n_round = 2 127 | print('=> Using n_round {} to insert temporal shift'.format(n_round)) 128 | 129 | def make_block_temporal(stage, this_segment): 130 | blocks = list(stage.children()) 131 | print('=> Processing stage with {} blocks residual'.format(len(blocks))) 132 | for i, b in enumerate(blocks): 133 | if i % n_round == 0: 134 | blocks[i].conv1 = TemporalShift(b.conv1, n_segment=this_segment, n_div=n_div) 135 | return nn.Sequential(*blocks) 136 | 137 | net.layer1 = make_block_temporal(net.layer1, n_segment_list[0]) 138 | net.layer2 = make_block_temporal(net.layer2, n_segment_list[1]) 139 | net.layer3 = make_block_temporal(net.layer3, n_segment_list[2]) 140 | net.layer4 = make_block_temporal(net.layer4, n_segment_list[3]) 141 | else: 142 | raise NotImplementedError(place) 143 | 144 | 145 | def make_temporal_pool(net, n_segment): 146 | import torchvision 147 | if isinstance(net, torchvision.models.ResNet): 148 | print('=> Injecting nonlocal pooling') 149 | net.layer2 = TemporalPool(net.layer2, n_segment) 150 | else: 151 | raise NotImplementedError 152 | 153 | 154 | if __name__ == '__main__': 155 | # test inplace shift v.s. vanilla shift 156 | tsm1 = TemporalShift(nn.Sequential(), n_segment=8, n_div=8, inplace=False) 157 | tsm2 = TemporalShift(nn.Sequential(), n_segment=8, n_div=8, inplace=True) 158 | 159 | print('=> Testing CPU...') 160 | # test forward 161 | with torch.no_grad(): 162 | for i in range(10): 163 | x = torch.rand(2 * 8, 3, 224, 224) 164 | y1 = tsm1(x) 165 | y2 = tsm2(x) 166 | assert torch.norm(y1 - y2).item() < 1e-5 167 | 168 | # test backward 169 | with torch.enable_grad(): 170 | for i in range(10): 171 | x1 = torch.rand(2 * 8, 3, 224, 224) 172 | x1.requires_grad_() 173 | x2 = x1.clone() 174 | y1 = tsm1(x1) 175 | y2 = tsm2(x2) 176 | grad1 = torch.autograd.grad((y1 ** 2).mean(), [x1])[0] 177 | grad2 = torch.autograd.grad((y2 ** 2).mean(), [x2])[0] 178 | assert torch.norm(grad1 - grad2).item() < 1e-5 179 | 180 | print('=> Testing GPU...') 181 | tsm1.cuda() 182 | tsm2.cuda() 183 | # test forward 184 | with torch.no_grad(): 185 | for i in range(10): 186 | x = torch.rand(2 * 8, 3, 224, 224).cuda() 187 | y1 = tsm1(x) 188 | y2 = tsm2(x) 189 | assert torch.norm(y1 - y2).item() < 1e-5 190 | 191 | # test backward 192 | with torch.enable_grad(): 193 | for i in range(10): 194 | x1 = torch.rand(2 * 8, 3, 224, 224).cuda() 195 | x1.requires_grad_() 196 | x2 = x1.clone() 197 | y1 = tsm1(x1) 198 | y2 = tsm2(x2) 199 | grad1 = torch.autograd.grad((y1 ** 2).mean(), [x1])[0] 200 | grad2 = torch.autograd.grad((y2 ** 2).mean(), [x2])[0] 201 | assert torch.norm(grad1 - grad2).item() < 1e-5 202 | print('Test passed.') 203 | 204 | 205 | 206 | 207 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/transforms.py: -------------------------------------------------------------------------------- 1 | import torchvision 2 | import random 3 | from PIL import Image, ImageOps 4 | import numpy as np 5 | import numbers 6 | import math 7 | import torch 8 | 9 | 10 | class GroupRandomCrop(object): 11 | def __init__(self, size): 12 | if isinstance(size, numbers.Number): 13 | self.size = (int(size), int(size)) 14 | else: 15 | self.size = size 16 | 17 | def __call__(self, img_group): 18 | 19 | w, h = img_group[0].size 20 | th, tw = self.size 21 | 22 | out_images = list() 23 | 24 | x1 = random.randint(0, w - tw) 25 | y1 = random.randint(0, h - th) 26 | 27 | for img in img_group: 28 | assert 
(img.size[0] == w and img.size[1] == h) 29 | if w == tw and h == th: 30 | out_images.append(img) 31 | else: 32 | out_images.append(img.crop((x1, y1, x1 + tw, y1 + th))) 33 | 34 | return out_images 35 | 36 | 37 | class GroupCenterCrop(object): 38 | def __init__(self, size): 39 | self.worker = torchvision.transforms.CenterCrop(size) 40 | 41 | def __call__(self, img_group): 42 | return [self.worker(img) for img in img_group] 43 | 44 | 45 | class GroupRandomHorizontalFlip(object): 46 | """Randomly horizontally flips the given PIL.Image with a probability of 0.5 47 | """ 48 | 49 | def __init__(self, is_flow=False): 50 | self.is_flow = is_flow 51 | 52 | def __call__(self, img_group, is_flow=False): 53 | v = random.random() 54 | if v < 0.5: 55 | ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group] 56 | if self.is_flow: 57 | for i in range(0, len(ret), 2): 58 | ret[i] = ImageOps.invert(ret[i]) # invert flow pixel values when flipping 59 | return ret 60 | else: 61 | return img_group 62 | 63 | 64 | class GroupNormalize(object): 65 | def __init__(self, mean, std): 66 | self.mean = mean 67 | self.std = std 68 | 69 | def __call__(self, tensor): 70 | rep_mean = self.mean * (tensor.size()[0] // len(self.mean)) 71 | rep_std = self.std * (tensor.size()[0] // len(self.std)) 72 | 73 | # TODO: make efficient 74 | for t, m, s in zip(tensor, rep_mean, rep_std): 75 | t.sub_(m).div_(s) 76 | 77 | return tensor 78 | 79 | 80 | class GroupScale(object): 81 | """ Rescales the input PIL.Image to the given 'size'. 82 | 'size' will be the size of the smaller edge. 83 | For example, if height > width, then image will be 84 | rescaled to (size * height / width, size) 85 | size: size of the smaller edge 86 | interpolation: Default: PIL.Image.BILINEAR 87 | """ 88 | 89 | def __init__(self, size, interpolation=Image.BILINEAR): 90 | self.worker = torchvision.transforms.Resize(size, interpolation) 91 | 92 | def __call__(self, img_group): 93 | return [self.worker(img) for img in img_group] 94 | 95 | 96 | class GroupOverSample(object): 97 | def __init__(self, crop_size, scale_size=None, flip=True): 98 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 99 | 100 | if scale_size is not None: 101 | self.scale_worker = GroupScale(scale_size) 102 | else: 103 | self.scale_worker = None 104 | self.flip = flip 105 | 106 | def __call__(self, img_group): 107 | 108 | if self.scale_worker is not None: 109 | img_group = self.scale_worker(img_group) 110 | 111 | image_w, image_h = img_group[0].size 112 | crop_w, crop_h = self.crop_size 113 | 114 | offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h) 115 | oversample_group = list() 116 | for o_w, o_h in offsets: 117 | normal_group = list() 118 | flip_group = list() 119 | for i, img in enumerate(img_group): 120 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 121 | normal_group.append(crop) 122 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 123 | 124 | if img.mode == 'L' and i % 2 == 0: 125 | flip_group.append(ImageOps.invert(flip_crop)) 126 | else: 127 | flip_group.append(flip_crop) 128 | 129 | oversample_group.extend(normal_group) 130 | if self.flip: 131 | oversample_group.extend(flip_group) 132 | return oversample_group 133 | 134 | 135 | class GroupFullResSample(object): 136 | def __init__(self, crop_size, scale_size=None, flip=True): 137 | self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size) 138 | 139 | if scale_size is not None: 140 | self.scale_worker 
= GroupScale(scale_size) 141 | else: 142 | self.scale_worker = None 143 | self.flip = flip 144 | 145 | def __call__(self, img_group): 146 | 147 | if self.scale_worker is not None: 148 | img_group = self.scale_worker(img_group) 149 | 150 | image_w, image_h = img_group[0].size 151 | crop_w, crop_h = self.crop_size 152 | 153 | w_step = (image_w - crop_w) // 4 154 | h_step = (image_h - crop_h) // 4 155 | 156 | offsets = list() 157 | offsets.append((0 * w_step, 2 * h_step)) # left 158 | offsets.append((4 * w_step, 2 * h_step)) # right 159 | offsets.append((2 * w_step, 2 * h_step)) # center 160 | 161 | oversample_group = list() 162 | for o_w, o_h in offsets: 163 | normal_group = list() 164 | flip_group = list() 165 | for i, img in enumerate(img_group): 166 | crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h)) 167 | normal_group.append(crop) 168 | if self.flip: 169 | flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT) 170 | 171 | if img.mode == 'L' and i % 2 == 0: 172 | flip_group.append(ImageOps.invert(flip_crop)) 173 | else: 174 | flip_group.append(flip_crop) 175 | 176 | oversample_group.extend(normal_group) 177 | oversample_group.extend(flip_group) 178 | return oversample_group 179 | 180 | 181 | class GroupMultiScaleCrop(object): 182 | 183 | def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True): 184 | self.scales = scales if scales is not None else [1, .875, .75, .66] 185 | self.max_distort = max_distort 186 | self.fix_crop = fix_crop 187 | self.more_fix_crop = more_fix_crop 188 | self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size] 189 | self.interpolation = Image.BILINEAR 190 | 191 | def __call__(self, img_group): 192 | 193 | im_size = img_group[0].size 194 | 195 | crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size) 196 | crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group] 197 | ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation) 198 | for img in crop_img_group] 199 | return ret_img_group 200 | 201 | def _sample_crop_size(self, im_size): 202 | image_w, image_h = im_size[0], im_size[1] 203 | 204 | # find a crop size 205 | base_size = min(image_w, image_h) 206 | crop_sizes = [int(base_size * x) for x in self.scales] 207 | crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes] 208 | crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes] 209 | 210 | pairs = [] 211 | for i, h in enumerate(crop_h): 212 | for j, w in enumerate(crop_w): 213 | if abs(i - j) <= self.max_distort: 214 | pairs.append((w, h)) 215 | 216 | crop_pair = random.choice(pairs) 217 | if not self.fix_crop: 218 | w_offset = random.randint(0, image_w - crop_pair[0]) 219 | h_offset = random.randint(0, image_h - crop_pair[1]) 220 | else: 221 | w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1]) 222 | 223 | return crop_pair[0], crop_pair[1], w_offset, h_offset 224 | 225 | def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h): 226 | offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h) 227 | return random.choice(offsets) 228 | 229 | @staticmethod 230 | def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h): 231 | w_step = (image_w - crop_w) // 4 232 | h_step = (image_h - crop_h) // 4 233 | 234 | ret = list() 235 | ret.append((0, 0)) # upper left 236 | ret.append((4 * w_step, 
0)) # upper right 237 | ret.append((0, 4 * h_step)) # lower left 238 | ret.append((4 * w_step, 4 * h_step)) # lower right 239 | ret.append((2 * w_step, 2 * h_step)) # center 240 | 241 | if more_fix_crop: 242 | ret.append((0, 2 * h_step)) # center left 243 | ret.append((4 * w_step, 2 * h_step)) # center right 244 | ret.append((2 * w_step, 4 * h_step)) # lower center 245 | ret.append((2 * w_step, 0 * h_step)) # upper center 246 | 247 | ret.append((1 * w_step, 1 * h_step)) # upper left quarter 248 | ret.append((3 * w_step, 1 * h_step)) # upper right quarter 249 | ret.append((1 * w_step, 3 * h_step)) # lower left quarter 250 | ret.append((3 * w_step, 3 * h_step)) # lower righ quarter 251 | 252 | return ret 253 | 254 | 255 | class GroupRandomSizedCrop(object): 256 | """Random crop the given PIL.Image to a random size of (0.08 to 1.0) of the original size 257 | and and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio 258 | This is popularly used to train the Inception networks 259 | size: size of the smaller edge 260 | interpolation: Default: PIL.Image.BILINEAR 261 | """ 262 | 263 | def __init__(self, size, interpolation=Image.BILINEAR): 264 | self.size = size 265 | self.interpolation = interpolation 266 | 267 | def __call__(self, img_group): 268 | for attempt in range(10): 269 | area = img_group[0].size[0] * img_group[0].size[1] 270 | target_area = random.uniform(0.08, 1.0) * area 271 | aspect_ratio = random.uniform(3. / 4, 4. / 3) 272 | 273 | w = int(round(math.sqrt(target_area * aspect_ratio))) 274 | h = int(round(math.sqrt(target_area / aspect_ratio))) 275 | 276 | if random.random() < 0.5: 277 | w, h = h, w 278 | 279 | if w <= img_group[0].size[0] and h <= img_group[0].size[1]: 280 | x1 = random.randint(0, img_group[0].size[0] - w) 281 | y1 = random.randint(0, img_group[0].size[1] - h) 282 | found = True 283 | break 284 | else: 285 | found = False 286 | x1 = 0 287 | y1 = 0 288 | 289 | if found: 290 | out_group = list() 291 | for img in img_group: 292 | img = img.crop((x1, y1, x1 + w, y1 + h)) 293 | assert (img.size == (w, h)) 294 | out_group.append(img.resize((self.size, self.size), self.interpolation)) 295 | return out_group 296 | else: 297 | # Fallback 298 | scale = GroupScale(self.size, interpolation=self.interpolation) 299 | crop = GroupRandomCrop(self.size) 300 | return crop(scale(img_group)) 301 | 302 | 303 | class Stack(object): 304 | 305 | def __init__(self, roll=False): 306 | self.roll = roll 307 | 308 | def __call__(self, img_group): 309 | if img_group[0].mode == 'L': 310 | return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2) 311 | elif img_group[0].mode == 'RGB': 312 | if self.roll: 313 | return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2) 314 | else: 315 | return np.concatenate(img_group, axis=2) 316 | 317 | 318 | class ToTorchFormatTensor(object): 319 | """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255] 320 | to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] """ 321 | 322 | def __init__(self, div=True): 323 | self.div = div 324 | 325 | def __call__(self, pic): 326 | if isinstance(pic, np.ndarray): 327 | # handle numpy array 328 | img = torch.from_numpy(pic).permute(2, 0, 1).contiguous() 329 | else: 330 | # handle PIL Image 331 | img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes())) 332 | img = img.view(pic.size[1], pic.size[0], len(pic.mode)) 333 | # put it from HWC to CHW format 334 | # yikes, this transpose takes 80% of the loading time/CPU 335 | img = 
img.transpose(0, 1).transpose(0, 2).contiguous() 336 | return img.float().div(255) if self.div else img.float() 337 | 338 | 339 | class IdentityTransform(object): 340 | 341 | def __call__(self, data): 342 | return data 343 | 344 | 345 | if __name__ == "__main__": 346 | trans = torchvision.transforms.Compose([ 347 | GroupScale(256), 348 | GroupRandomCrop(224), 349 | Stack(), 350 | ToTorchFormatTensor(), 351 | GroupNormalize( 352 | mean=[.485, .456, .406], 353 | std=[.229, .224, .225] 354 | )] 355 | ) 356 | 357 | im = Image.open('../tensorflow-model-zoo.torch/lena_299.png') 358 | 359 | color_group = [im] * 3 360 | rst = trans(color_group) 361 | 362 | gray_group = [im.convert('L')] * 9 363 | gray_rst = trans(gray_group) 364 | 365 | trans2 = torchvision.transforms.Compose([ 366 | GroupRandomSizedCrop(256), 367 | Stack(), 368 | ToTorchFormatTensor(), 369 | GroupNormalize( 370 | mean=[.485, .456, .406], 371 | std=[.229, .224, .225]) 372 | ]) 373 | print(trans2(color_group)) 374 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/ops/video_jpg.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import os 3 | import time 4 | import subprocess 5 | from tqdm import tqdm 6 | import argparse 7 | from multiprocessing import Pool 8 | 9 | parser = argparse.ArgumentParser(description="Dataset processor: Video->Frames") 10 | parser.add_argument("dir_path", type=str, help="original dataset path") 11 | parser.add_argument("dst_dir_path", type=str, help="dest path to save the frames") 12 | parser.add_argument("--prefix", type=str, default="image_%05d.jpg", help="output image type") 13 | parser.add_argument("--accepted_formats", type=str, default=[".mp4", ".mkv", ".webm"], nargs="+", 14 | help="list of input video formats") 15 | parser.add_argument("--begin", type=int, default=0) 16 | parser.add_argument("--end", type=int, default=666666666) 17 | parser.add_argument("--file_list", type=str, default="") 18 | parser.add_argument("--frame_rate", type=int, default=-1) 19 | parser.add_argument("--num_workers", type=int, default=16) 20 | parser.add_argument("--dry_run", action="store_true") 21 | parser.add_argument("--parallel", action="store_true") 22 | args = parser.parse_args() 23 | 24 | 25 | def par_job(command): 26 | if args.dry_run: 27 | print(command) 28 | else: 29 | subprocess.call(command, shell=True) 30 | 31 | 32 | if __name__ == "__main__": 33 | t0 = time.time() 34 | dir_path = args.dir_path 35 | dst_dir_path = args.dst_dir_path 36 | 37 | if args.file_list == "": 38 | file_names = sorted(os.listdir(dir_path)) 39 | else: 40 | file_names = [x.strip() for x in open(args.file_list).readlines()] 41 | del_list = [] 42 | for i, file_name in enumerate(file_names): 43 | if not any([x in file_name for x in args.accepted_formats]): 44 | del_list.append(i) 45 | file_names = [x for i, x in enumerate(file_names) if i not in del_list] 46 | file_names = file_names[args.begin:args.end + 1] 47 | print("%d videos to handle (after %d being removed)" % (len(file_names), len(del_list))) 48 | cmd_list = [] 49 | for file_name in tqdm(file_names): 50 | 51 | name, ext = os.path.splitext(file_name) 52 | dst_directory_path = os.path.join(dst_dir_path, name) 53 | 54 | video_file_path = os.path.join(dir_path, file_name) 55 | if not os.path.exists(dst_directory_path): 56 | os.makedirs(dst_directory_path, exist_ok=True) 57 | 58 | if args.frame_rate > 0: 59 | frame_rate_str = 
"-r %d" % args.frame_rate 60 | else: 61 | frame_rate_str = "" 62 | cmd = 'ffmpeg -nostats -loglevel 0 -i {} -vf scale=-1:360 {} {}/{}'.format(video_file_path, frame_rate_str, 63 | dst_directory_path, args.prefix) 64 | if not args.parallel: 65 | if args.dry_run: 66 | print(cmd) 67 | else: 68 | subprocess.call(cmd, shell=True) 69 | cmd_list.append(cmd) 70 | 71 | if args.parallel: 72 | with Pool(processes=args.num_workers) as pool: 73 | with tqdm(total=len(cmd_list)) as pbar: 74 | for _ in tqdm(pool.imap_unordered(par_job, cmd_list)): 75 | pbar.update() 76 | t1 = time.time() 77 | print("Finished in %.4f seconds" % (t1 - t0)) 78 | os.system("stty sane") 79 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/train_stage1.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0,1,2,3 python stage1.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=1 \ 5 | batch_size=64 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | glance_size=224 \ 9 | patch_size=144 \ 10 | random_patch=True \ 11 | epochs=10 \ 12 | backbone_lr=0.00001 \ 13 | fc_lr=0.01 \ 14 | lr_type=cos \ 15 | dropout=0.5 \ 16 | load_pretrained_focuser_fc=False \ 17 | dist_url=tcp://127.0.0.1:8816 \ 18 | eval_freq=1 \ 19 | start_eval=0 \ 20 | print_freq=25 \ 21 | workers=8 \ 22 | pretrained_glancer=PATH_TO_PRETRAINED_GLANCER \ 23 | pretrained_focuser=PATH_TO_PRETRAINED_FOCUSER # load the pretrained model 24 | 25 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/train_stage2.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python stage2.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=2 \ 5 | batch_size=64 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | glance_size=224 \ 9 | patch_size=144 \ 10 | random_patch=False \ 11 | epochs=50 \ 12 | policy_lr=0.0003 \ 13 | gamma=0.7 \ 14 | with_glancer=True \ 15 | ppo_continuous=True \ 16 | action_std=0.25 \ 17 | actorcritic_with_bn=True \ 18 | workers=8 \ 19 | eval_freq=1 \ 20 | pretrained=PATH_TO_STAGE1_PRETRAINED_MODEL # load the stage1 pretrained model 21 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /Experiments on Something-Something V1&V2/train_stage3.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=4,5,6,7 python stage3.py \ 2 | dataset=somethingv1 \ 3 | data_dir=PATH_TO_DATASET \ 4 | train_stage=3 \ 5 | batch_size=32 \ 6 | num_segments_glancer=8 \ 7 | num_segments_focuser=12 \ 8 | glance_size=224 \ 9 | patch_size=144 \ 10 | random_patch=False \ 11 | epochs=10 \ 12 | backbone_lr=0. 
\ 13 | fc_lr=0.0005 \ 14 | lr_type=cos \ 15 | workers=8 \ 16 | dropout=0 \ 17 | ppo_continuous=True \ 18 | action_std=0.25 \ 19 | actorcritic_with_bn=True \ 20 | with_glancer=True \ 21 | load_pretrained_s2_fc=True \ 22 | dist_url=tcp://127.0.0.1:8815 \ 23 | eval_freq=1 \ 24 | start_eval=0 \ 25 | print_freq=25 \ 26 | amp=False \ 27 | multiprocessing_distributed=False \ 28 | pretrained_s2=PATH_TO_STAGE2_PRETRAINED_MODEL # load the stage2 pretrained model 29 | 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AdaFocus (ICCV-2021 Oral) 2 | 3 | This repo contains the official code and pre-trained models for AdaFocus. 4 | 5 | - [Adaptive Focus for Efficient Video Recognition](http://arxiv.org/abs/2105.03245) 6 | 7 | **Update: The latest version of the AdaFocus series, [Uni-AdaFocus (TPAMI'24)](https://github.com/LeapLabTHU/Uni-AdaFocus) has been released!** This repository is no longer maintained. 8 | 9 | ## Reference 10 | If you find our code or paper useful for your research, please cite: 11 | ``` 12 | @InProceedings{wang2021adafocus, 13 | author = {Wang, Yulin and Chen, Zhaoxi and Jiang, Haojun and Song, Shiji and Han, Yizeng and Huang, Gao}, 14 | title = {Adaptive Focus for Efficient Video Recognition}, 15 | booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, 16 | month = {October}, 17 | year = {2021} 18 | } 19 | 20 | @InProceedings{wang2022adafocusv2, 21 | author = {Wang, Yulin and Yue, Yang and Lin, Yuanze and Jiang, Haojun and Lai, Zihang and Kulikov, Victor and Orlov, Nikita and Shi, Humphrey and Huang, Gao}, 22 | title = {AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition}, 23 | booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 24 | year = {2022} 25 | } 26 | ``` 27 | 28 | ## Introduction 29 | 30 | In this paper, we explore the spatial redundancy in video recognition with the aim to improve the computational efficiency. It is observed that the most informative region in each frame of a video is usually a small image patch, which shifts smoothly across frames. Therefore, we model the patch localization problem as a sequential decision task, and propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus). In specific, a light-weighted ConvNet is first adopted to quickly process the full video sequence, whose features are used by a recurrent policy network to localize the most task-relevant regions. Then the selected patches are inferred by a high-capacity network for the final prediction. During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices. In addition, we demonstrate that the proposed method can be easily extended by further considering the temporal redundancy, e.g., dynamically skipping less valuable frames. Extensive experiments on five benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1\&V2, demonstrate that our method is significantly more efficient than the competitive baselines. 31 | 32 | 33 |

[figure: ./figure/intro.png] 34 | 35 | 36 | 37 | 38 | ## Result 39 | 40 | - ActivityNet 41 | 42 | [figure: ./figure/actnet.png] 43 | 44 | 45 | 46 | 47 | - Something-Something V1&V2 48 | 49 | [figure: ./figure/sthsth.png] 50 | 51 | 52 | 53 | 54 | - Visualization 55 | 56 | [figure: ./figure/visualization.png] 57 | 58 | 
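As a reading aid, the glance-and-focus pipeline described in the Introduction can be summarized by the minimal sketch below. This is illustrative pseudocode under assumed interfaces: the `glancer`, `policy_rnn`, `focuser`, and `classifier` callables, their signatures, and the patch-cropping logic are hypothetical, not the implementation in this repo (the real models and training code live in `models/gfv_net.py` and `models/ppo.py` of each experiment folder).

```python
import torch
import torch.nn.functional as F


def adafocus_inference(frames, glancer, policy_rnn, focuser, classifier, patch_size=128):
    """Illustrative glance-and-focus loop (hypothetical API, not the repo's gfv_net.py).

    frames:     (T, 3, H, W) video frames, with H and W >= patch_size
    glancer:    lightweight CNN run on downscaled full frames
    policy_rnn: recurrent policy returning a patch location in [0, 1]^2 per frame
    focuser:    high-capacity CNN applied only to the cropped patches
    classifier: aggregates the per-frame patch features into a prediction
    """
    T, _, H, W = frames.shape
    hidden = None
    patch_feats = []
    for t in range(T):
        # 1) cheap global glance at low resolution
        glance = F.interpolate(frames[t:t + 1], size=(96, 96))
        global_feat = glancer(glance)                     # (1, D)
        # 2) recurrent policy picks where to look next (assumed output shape (1, 2) in [0, 1])
        action, hidden = policy_rnn(global_feat, hidden)
        cy = int(action[0, 0] * (H - patch_size))
        cx = int(action[0, 1] * (W - patch_size))
        patch = frames[t:t + 1, :, cy:cy + patch_size, cx:cx + patch_size]
        # 3) expensive local look at the selected patch only
        patch_feats.append(focuser(patch))
    # 4) temporal aggregation of the per-frame patch features
    return classifier(torch.stack(patch_feats, dim=1))    # (1, num_classes)
```

In the actual method these components are trained in the three stages driven by the shell scripts above, with the patch-selection policy optimized by PPO.
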
59 | 60 | ## Get Started 61 | 62 | Please go to the folder [Experiments on ActivityNet, FCVID and Mini-Kinetics](Experiments%20on%20ActivityNet,%20FCVID%20and%20Mini-Kinetics/) and [Experiments on Something-Something V1&V2](Experiments%20on%20Something-Something%20V1&V2) for specific docs. 63 | 64 | 65 | ## Contact 66 | If you have any question, feel free to contact the authors or raise an issue. 67 | Yulin Wang: wang-yl19@mails.tsinghua.edu.cn. 68 | -------------------------------------------------------------------------------- /figure/actnet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/actnet.png -------------------------------------------------------------------------------- /figure/intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/intro.png -------------------------------------------------------------------------------- /figure/sthsth.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/sthsth.png -------------------------------------------------------------------------------- /figure/visualization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blackfeather-wang/AdaFocus/8c0f8d256f448e52c693e23177fbf19fc309650a/figure/visualization.png --------------------------------------------------------------------------------