├── .gitignore
├── README.md
├── VLN-DUET-RVR
│   ├── README.md
│   ├── map_nav_src
│   │   ├── models
│   │   │   ├── graph_utils.py
│   │   │   ├── model.py
│   │   │   ├── ops.py
│   │   │   ├── transformer.py
│   │   │   ├── vilmodel.py
│   │   │   └── vlnbert_init.py
│   │   ├── reverie
│   │   │   ├── agent_base.py
│   │   │   ├── agent_obj.py
│   │   │   ├── data_utils.py
│   │   │   ├── env.py
│   │   │   ├── env_hm3d.py
│   │   │   ├── eval_utils.py
│   │   │   ├── main_nav_obj.py
│   │   │   ├── main_nav_obj_hm3d.py
│   │   │   └── parser.py
│   │   ├── scripts
│   │   │   └── run_reverie.sh
│   │   └── utils
│   │       ├── data.py
│   │       ├── distributed.py
│   │       ├── logger.py
│   │       ├── misc.py
│   │       └── ops.py
│   └── pretrain_src
│       ├── configs
│       │   ├── model_config.json
│       │   └── training_args.json
│       ├── data
│       │   ├── __init__.py
│       │   ├── common.py
│       │   ├── dataset.py
│       │   ├── loader.py
│       │   └── tasks.py
│       ├── model
│       │   ├── __init__.py
│       │   ├── ops.py
│       │   ├── pretrain_cmt.py
│       │   ├── transformer.py
│       │   └── vilmodel.py
│       ├── optim
│       │   ├── __init__.py
│       │   ├── adamw.py
│       │   ├── lookahead.py
│       │   ├── misc.py
│       │   ├── radam.py
│       │   ├── ralamb.py
│       │   ├── rangerlars.py
│       │   └── sched.py
│       ├── parser.py
│       ├── submit_reverie.sh
│       ├── test
│       │   ├── test_dataset.py
│       │   ├── test_tasks.py
│       │   └── test_vilmodel.py
│       ├── train_hm3d_reverie.py
│       ├── train_reverie_obj.py
│       ├── train_soon_obj.py
│       └── utils
│           ├── __init__.py
│           ├── distributed.py
│           ├── logger.py
│           ├── misc.py
│           └── save.py
├── VLN-DUET
│   ├── datasets
│   │   └── .gitkeep
│   ├── map_nav_src
│   │   ├── models
│   │   │   ├── graph_utils.py
│   │   │   ├── model.py
│   │   │   ├── ops.py
│   │   │   ├── transformer.py
│   │   │   ├── vilmodel.py
│   │   │   └── vlnbert_init.py
│   │   ├── r2r
│   │   │   ├── agent.py
│   │   │   ├── agent_base.py
│   │   │   ├── data_utils.py
│   │   │   ├── env.py
│   │   │   ├── eval_utils.py
│   │   │   ├── main_nav.py
│   │   │   └── parser.py
│   │   ├── scripts
│   │   │   ├── r2r_b16_mix.sh
│   │   │   └── r2r_h14_envedit_mix.sh
│   │   └── utils
│   │       ├── data.py
│   │       ├── distributed.py
│   │       ├── logger.py
│   │       ├── misc.py
│   │       └── ops.py
│   ├── pretrain_src
│   │   ├── config
│   │   │   ├── r2r_model_config_clip-b16.json
│   │   │   ├── r2r_model_config_clip-h14.json
│   │   │   ├── r2r_pretrain_hm3d+mp3d+gibson_clip-b16.json
│   │   │   └── r2r_pretrain_hm3d+mp3d+gibson_clip-h14.json
│   │   ├── data
│   │   │   ├── __init__.py
│   │   │   ├── common.py
│   │   │   ├── dataset.py
│   │   │   ├── loader.py
│   │   │   └── tasks.py
│   │   ├── model
│   │   │   ├── __init__.py
│   │   │   ├── ops.py
│   │   │   ├── pretrain_cmt.py
│   │   │   ├── transformer.py
│   │   │   └── vilmodel.py
│   │   ├── optim
│   │   │   ├── __init__.py
│   │   │   ├── adamw.py
│   │   │   ├── lookahead.py
│   │   │   ├── misc.py
│   │   │   ├── radam.py
│   │   │   ├── ralamb.py
│   │   │   ├── rangerlars.py
│   │   │   └── sched.py
│   │   ├── parser.py
│   │   ├── run_r2r_b16.sh
│   │   ├── run_r2r_h14.sh
│   │   ├── train_r2r.py
│   │   └── utils
│   │       ├── __init__.py
│   │       ├── distributed.py
│   │       ├── logger.py
│   │       ├── misc.py
│   │       └── save.py
│   └── requirements.txt
└── files
    └── overall.jpg

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.ftpignore
.ftpconfig

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

.vscode/

VLN-DUET/datasets/*
out/*
!VLN-DUET/datasets/.gitkeep

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ScaleVLN
Official implementation of the **ICCV 2023** paper:
**Scaling Data Generation in Vision-and-Language Navigation**
4 | [**Zun Wang**](https://zunwang1.github.io/), [**Jialu Li**](https://jialuli-luka.github.io/), [**Yicong Hong**](http://www.yiconghong.me/), [Yi Wang](https://shepnerd.github.io/), [Qi Wu](http://www.qi-wu.me/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/), [Stephen Gould](http://users.cecs.anu.edu.au/~sgould/), [Hao Tan](https://www.cs.unc.edu/~airsplay/), [Yu Qiao](https://scholar.google.com/citations?hl=en&user=gFtI-8QAAAAJ&view_op=list_works)

[Paper & Appendices](https://arxiv.org/abs/2307.15644)

![teaser](./files/overall.jpg)


## Abstract
Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11\% absolute with regard to previous SoTA) to a significantly new best of 80\% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1\% (versus 8\% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

## Updates
- **2023/07/31**🔥: We release the ScaleVLN dataset, models and training codes for R2R.

- **2023/07/14**: ScaleVLN is accepted by ICCV 2023! 🎉

## TODOs

- [x] ScaleVLN dataset
- [x] Code, data and trained models for R2R
- [x] Code, data and trained models for RVR
- [ ] Graph Construction Code
- [ ] Speaker Training Code

## Prerequisites

### Installation

1. Install the Matterport3D simulator: follow the instructions [here](https://github.com/peteanderson80/Matterport3DSimulator). We use the latest version instead of v0.1.
```
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
```

2. Install requirements:
```
conda create --name vlnde python=3.9
conda activate vlnde
cd VLN-DUET
pip install -r requirements.txt
```

### R2R

1. Download the required data from [here](https://huggingface.co/datasets/OpenGVLab/ScaleVLN/blob/main/r2r_preprocess_data.zip) and unzip it to `VLN-DUET/datasets/R2R`. It should include three folders `annotations, connectivity, connectivity_mp3d`.

2. Download the CLIP and EnvEdit features from [here](https://huggingface.co/datasets/OpenGVLab/ScaleVLN/blob/main/features.zip) and unzip it to `VLN-DUET/datasets/R2R`. It should include one folder `features`.

3. (Optional) Download the trained models from [here](https://huggingface.co/datasets/OpenGVLab/ScaleVLN/blob/main/r2r_trained_models.zip) and unzip it to `VLN-DUET/datasets/R2R`. It should include one folder `trained_models`.

4. Download the pretrained LXMERT from [here](https://nlp.cs.unc.edu/data/model_LXRT.pth) and place it at `VLN-DUET/datasets/pretrained/LXMERT`.
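
After these steps, the `VLN-DUET/datasets` folder should look roughly as sketched below (`trained_models` only appears if you downloaded the optional checkpoints):
```
datasets
├── R2R
│   ├── annotations
│   ├── connectivity
│   ├── connectivity_mp3d
│   ├── features
│   └── trained_models
└── pretrained
    └── LXMERT
        └── model_LXRT.pth
```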

## Running

### Pre-training

We use two NVIDIA A100 GPUs for pre-training agents on ScaleVLN.

```bash
bash run_r2r_b16.sh "0,1" 45008
bash run_r2r_h14.sh "0,1" 45009
```

### Fine-tuning

We use one NVIDIA A100 GPU for fine-tuning our agents.

```
bash scripts/r2r_b16_mix.sh 0
bash scripts/r2r_h14_envedit_mix.sh 0
...
```

## REVERIE
Please see `ScaleVLN/VLN-DUET-RVR/README.md`.

## Citation
Please cite our paper:
```
@InProceedings{wang2023scalevln,
  author    = {Zun Wang and Jialu Li and Yicong Hong and Yi Wang and Qi Wu and Mohit Bansal and Stephen Gould and Hao Tan and Yu Qiao},
  title     = {Scaling Data Generation in Vision-and-Language Navigation},
  booktitle = {ICCV 2023},
  year      = {2023}
}
```

## Acknowledgement

We thank the developers of [DUET](https://github.com/cshizhe/VLN-DUET), [EnvDrop](https://github.com/clip-vil/CLIP-ViL/tree/master/CLIP-ViL-VLN), [Co-Mod GAN](https://github.com/zsyzzsoft/co-mod-gan), [Discrete-Continuous VLN](https://github.com/YicongHong/Discrete-Continuous-VLN), [HAMT](https://github.com/cshizhe/VLN-HAMT) for their public code release.

--------------------------------------------------------------------------------
/VLN-DUET-RVR/README.md:
--------------------------------------------------------------------------------
1. Follow the ScaleVLN R2R instructions to get the R2R datasets and the pretrained LXMERT (the connectivity files and image features for RVR come from the R2R data), and unzip them in `datasets`.
2. Download the RVR data from [here](https://huggingface.co/datasets/OpenGVLab/ScaleVLN/blob/main/rvr_data.zip) and unzip it in `datasets`. We expect files in this structure:
```
datasets
├─ R2R
├─ REVERIE
├─ pretrained
```
3. Run `bash submit_reverie.sh` to start RVR pretraining (it may take ~1h to preload the data before training starts).
4. Run `bash scripts/run_reverie.sh` to start RVR fine-tuning.

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/models/graph_utils.py:
--------------------------------------------------------------------------------
from collections import defaultdict
import numpy as np

MAX_DIST = 30
MAX_STEP = 10

def calc_position_distance(a, b):
    # a, b: (x, y, z)
    dx = b[0] - a[0]
    dy = b[1] - a[1]
    dz = b[2] - a[2]
    dist = np.sqrt(dx**2 + dy**2 + dz**2)
    return dist

def calculate_vp_rel_pos_fts(a, b, base_heading=0, base_elevation=0):
    # a, b: (x, y, z)
    dx = b[0] - a[0]
    dy = b[1] - a[1]
    dz = b[2] - a[2]
    xy_dist = max(np.sqrt(dx**2 + dy**2), 1e-8)
    xyz_dist = max(np.sqrt(dx**2 + dy**2 + dz**2), 1e-8)

    # the simulator's api is weird (x-y axis is transposed)
    heading = np.arcsin(dx/xy_dist)  # [-pi/2, pi/2]
    if b[1] < a[1]:
        heading = np.pi - heading
    heading -= base_heading

    elevation = np.arcsin(dz/xyz_dist)  # [-pi/2, pi/2]
    elevation -= base_elevation

    return heading, elevation, xyz_dist

def get_angle_fts(headings, elevations, angle_feat_size):
    ang_fts = [np.sin(headings), np.cos(headings), np.sin(elevations), np.cos(elevations)]
    ang_fts = np.vstack(ang_fts).transpose().astype(np.float32)
    num_repeats = angle_feat_size // 4
    if num_repeats > 1:
        ang_fts = np.concatenate([ang_fts] * num_repeats, 1)
    return ang_fts


class FloydGraph(object):
    def __init__(self):
        self._dis = defaultdict(lambda :defaultdict(lambda: 95959595))
        self._point = defaultdict(lambda :defaultdict(lambda: ""))
        self._visited = set()

    def distance(self, x, y):
        if x == y:
            return 0
        else:
            return self._dis[x][y]

    def add_edge(self, x, y, dis):
        if dis < self._dis[x][y]:
            self._dis[x][y] = dis
            self._dis[y][x] = dis
            self._point[x][y] = ""
            self._point[y][x] = ""

    def update(self, k):
        for x in self._dis:
            for y in self._dis:
                if x != y:
                    if self._dis[x][k] + self._dis[k][y] < self._dis[x][y]:
                        self._dis[x][y] = self._dis[x][k] + self._dis[k][y]
                        self._dis[y][x] = self._dis[x][y]
                        self._point[x][y] = k
                        self._point[y][x] = k
        self._visited.add(k)

    def visited(self, k):
        return (k in self._visited)

    def path(self, x, y):
        """
        :param x: start
        :param y: end
        :return: the path from x to y [v1, v2, ..., v_n, y]
        """
        if x == y:
            return []
        if self._point[x][y] == "":  # Direct edge
            return [y]
        else:
            k = self._point[x][y]
            # print(x, y, k)
            # for x1 in (x, k, y):
            #     for x2 in (x, k, y):
            #         print(x1, x2, "%.4f" % self._dis[x1][x2])
            return self.path(x, k) + self.path(k, y)
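
# Usage sketch (illustrative, not part of the original file): FloydGraph keeps
# all-pairs shortest paths up to date incrementally; update(k) relaxes every
# node pair through a newly visited node k (one step of Floyd-Warshall).
#     g = FloydGraph()
#     g.add_edge('a', 'b', 1.0)
#     g.add_edge('b', 'c', 2.0)
#     g.update('b')           # relax paths through 'b'
#     g.distance('a', 'c')    # -> 3.0
#     g.path('a', 'c')        # -> ['b', 'c']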


class GraphMap(object):
    def __init__(self, start_vp):
        self.start_vp = start_vp    # start viewpoint

        self.node_positions = {}     # viewpoint to position (x, y, z)
        self.graph = FloydGraph()    # shortest path graph
        self.node_embeds = {}        # {viewpoint: feature (sum feature, count)}
        self.node_stop_scores = {}   # {viewpoint: prob}
        self.node_nav_scores = {}    # {viewpoint: {t: prob}}
        self.node_step_ids = {}

    def update_graph(self, ob):
        self.node_positions[ob['viewpoint']] = ob['position']
        for cc in ob['candidate']:
            self.node_positions[cc['viewpointId']] = cc['position']
            dist = calc_position_distance(ob['position'], cc['position'])
            self.graph.add_edge(ob['viewpoint'], cc['viewpointId'], dist)
        self.graph.update(ob['viewpoint'])

    def update_node_embed(self, vp, embed, rewrite=False):
        if rewrite:
            self.node_embeds[vp] = [embed, 1]
        else:
            if vp in self.node_embeds:
                self.node_embeds[vp][0] = self.node_embeds[vp][0] + embed
                self.node_embeds[vp][1] = self.node_embeds[vp][1] + 1
            else:
                self.node_embeds[vp] = [embed, 1]

    def get_node_embed(self, vp):
        return self.node_embeds[vp][0] / self.node_embeds[vp][1]

    def get_pos_fts(self, cur_vp, gmap_vpids, cur_heading, cur_elevation, angle_feat_size=4):
        # dim=7 (sin(heading), cos(heading), sin(elevation), cos(elevation),
        #        line_dist, shortest_dist, shortest_step)
        rel_angles, rel_dists = [], []
        for vp in gmap_vpids:
            if vp is None:
                rel_angles.append([0, 0])
                rel_dists.append([0, 0, 0])
            else:
                rel_heading, rel_elevation, rel_dist = calculate_vp_rel_pos_fts(
                    self.node_positions[cur_vp], self.node_positions[vp],
                    base_heading=cur_heading, base_elevation=cur_elevation,
                )
                rel_angles.append([rel_heading, rel_elevation])
                rel_dists.append(
                    [rel_dist / MAX_DIST, self.graph.distance(cur_vp, vp) / MAX_DIST, \
                     len(self.graph.path(cur_vp, vp)) / MAX_STEP]
                )
        rel_angles = np.array(rel_angles).astype(np.float32)
        rel_dists = np.array(rel_dists).astype(np.float32)
        rel_ang_fts = get_angle_fts(rel_angles[:, 0], rel_angles[:, 1], angle_feat_size)
        return np.concatenate([rel_ang_fts, rel_dists], 1)

    def save_to_json(self):
        nodes = {}
        for vp, pos in self.node_positions.items():
            nodes[vp] = {
                'location': pos,    # (x, y, z)
                'visited': self.graph.visited(vp),
            }
            if nodes[vp]['visited']:
                nodes[vp]['stop_prob'] = self.node_stop_scores[vp]['stop']
                nodes[vp]['og_objid'] = self.node_stop_scores[vp]['og']
            else:
                nodes[vp]['nav_prob'] = self.node_nav_scores[vp]

        edges = []
        for k, v in self.graph._dis.items():
            for kk in v.keys():
                edges.append((k, kk))

        return {'nodes': nodes, 'edges': edges}

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/models/model.py:
--------------------------------------------------------------------------------
import numpy as np
import collections

import torch
import torch.nn as nn
import torch.nn.functional as F

from transformers import BertPreTrainedModel

from .vlnbert_init import get_vlnbert_models

class VLNBert(nn.Module):
    def __init__(self, args):
        super().__init__()
        print('\nInitializing the VLN-BERT model ...')
        self.args = args

        self.vln_bert = get_vlnbert_models(args, config=None)  # initialize the VLN-BERT
        self.drop_env = nn.Dropout(p=args.feat_dropout)

    def forward(self, mode, batch):
        batch = collections.defaultdict(lambda: None, batch)

        if mode == 'language':
            txt_embeds = self.vln_bert(mode, batch)
            return txt_embeds

        elif mode == 'panorama':
            batch['view_img_fts'] = self.drop_env(batch['view_img_fts'])
            if 'obj_img_fts' in batch:
                batch['obj_img_fts'] = self.drop_env(batch['obj_img_fts'])
            pano_embeds, pano_masks = self.vln_bert(mode, batch)
            return pano_embeds, pano_masks

        elif mode == 'navigation':
            outs = self.vln_bert(mode, batch)
            return outs

        else:
            raise NotImplementedError('wrong mode: %s'%mode)
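
# Usage sketch (illustrative; the batch keys below are assumptions, the real
# ones are defined by the agent code): a single VLNBert serves three passes
# per step, selected by `mode`:
#     txt_embeds = vln_bert('language', {'txt_ids': ..., 'txt_masks': ...})
#     pano_embeds, pano_masks = vln_bert('panorama', pano_batch)  # view/object features
#     nav_outs = vln_bert('navigation', nav_batch)                # action prediction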


class Critic(nn.Module):
    def __init__(self, args):
        super(Critic, self).__init__()
        self.state2value = nn.Sequential(
            nn.Linear(768, 512),
            nn.ReLU(),
            nn.Dropout(args.dropout),
            nn.Linear(512, 1),
        )

    def forward(self, state):
        return self.state2value(state).squeeze()

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/models/ops.py:
--------------------------------------------------------------------------------
import torch

from .transformer import TransformerEncoder, TransformerEncoderLayer

# try:
#     from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
# except (ImportError, AttributeError) as e:
#     # logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
#     BertLayerNorm = torch.nn.LayerNorm
BertLayerNorm = torch.nn.LayerNorm

def create_transformer_encoder(config, num_layers, norm=False):
    enc_layer = TransformerEncoderLayer(
        config.hidden_size, config.num_attention_heads,
        dim_feedforward=config.intermediate_size,
        dropout=config.hidden_dropout_prob,
        activation=config.hidden_act,
        normalize_before=True
    )
    if norm:
        norm_layer = BertLayerNorm(config.hidden_size, eps=1e-12)
    else:
        norm_layer = None
    return TransformerEncoder(enc_layer, num_layers, norm=norm_layer, batch_first=True)

def extend_neg_masks(masks, dtype=None):
    """
    mask from (N, L) into (N, 1(H), 1(L), L) and make it negative
    """
    if dtype is None:
        dtype = torch.float
    extended_masks = masks.unsqueeze(1).unsqueeze(2)
    extended_masks = extended_masks.to(dtype=dtype)
    extended_masks = (1.0 - extended_masks) * -10000.0
    return extended_masks

def gen_seq_masks(seq_lens, max_len=None):
    if max_len is None:
        max_len = max(seq_lens)
    batch_size = len(seq_lens)
    device = seq_lens.device

    masks = torch.arange(max_len).unsqueeze(0).repeat(batch_size, 1).to(device)
    masks = masks < seq_lens.unsqueeze(1)
    return masks
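
# Example (sketch): boolean validity masks from sequence lengths.
#     gen_seq_masks(torch.tensor([2, 4]))
#     # tensor([[ True,  True, False, False],
#     #         [ True,  True,  True,  True]])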

def pad_tensors_wgrad(tensors, lens=None):
    """B x [T, ...] torch tensors"""
    if lens is None:
        lens = [t.size(0) for t in tensors]
    max_len = max(lens)
    batch_size = len(tensors)
    hid = list(tensors[0].size()[1:])

    device = tensors[0].device
    dtype = tensors[0].dtype

    output = []
    for i in range(batch_size):
        if lens[i] < max_len:
            tmp = torch.cat(
                [tensors[i], torch.zeros([max_len-lens[i]]+hid, dtype=dtype).to(device)],
                dim=0
            )
        else:
            tmp = tensors[i]
        output.append(tmp)
    output = torch.stack(output, 0)
    return output

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/models/vlnbert_init.py:
--------------------------------------------------------------------------------
import torch


def get_tokenizer(args):
    from transformers import AutoTokenizer
    if args.tokenizer == 'xlm':
        cfg_name = 'xlm-roberta-base'
    else:
        cfg_name = 'bert-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(cfg_name)
    return tokenizer

def get_vlnbert_models(args, config=None):

    from transformers import PretrainedConfig
    from models.vilmodel import GlocalTextPathNavCMT

    model_name_or_path = args.bert_ckpt_file
    new_ckpt_weights = {}
    if model_name_or_path is not None:
        ckpt_weights = torch.load(model_name_or_path)
        for k, v in ckpt_weights.items():
            if k.startswith('module'):
                k = k[7:]
            if '_head' in k or 'sap_fuse' in k:
                new_ckpt_weights['bert.' + k] = v
            else:
                new_ckpt_weights[k] = v

    if args.tokenizer == 'xlm':
        cfg_name = 'xlm-roberta-base'
    else:
        cfg_name = 'bert-base-uncased'
    vis_config = PretrainedConfig.from_pretrained(cfg_name)

    if args.tokenizer == 'xlm':
        vis_config.type_vocab_size = 2

    vis_config.max_action_steps = 100
    vis_config.image_feat_size = args.image_feat_size
    vis_config.angle_feat_size = args.angle_feat_size
    vis_config.obj_feat_size = args.obj_feat_size
    vis_config.obj_loc_size = 3
    vis_config.num_l_layers = args.num_l_layers
    vis_config.num_pano_layers = args.num_pano_layers
    vis_config.num_x_layers = args.num_x_layers
    vis_config.graph_sprels = args.graph_sprels
    vis_config.glocal_fuse = args.fusion == 'dynamic'

    vis_config.fix_lang_embedding = args.fix_lang_embedding
    vis_config.fix_pano_embedding = args.fix_pano_embedding
    vis_config.fix_local_branch = args.fix_local_branch

    vis_config.update_lang_bert = not args.fix_lang_embedding
    vis_config.output_attentions = True
    vis_config.pred_head_dropout_prob = 0.1
    vis_config.use_lang2visn_attn = False

    visual_model = GlocalTextPathNavCMT.from_pretrained(
        pretrained_model_name_or_path=None,
        config=vis_config,
        state_dict=new_ckpt_weights)

    return visual_model

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/reverie/data_utils.py:
--------------------------------------------------------------------------------
import os
import json
import jsonlines
import numpy as np

import lmdb
import msgpack
import msgpack_numpy
msgpack_numpy.patch()

from utils.data import angle_feature

class ObjectFeatureDB(object):
    def __init__(self, obj_ft_file, obj_feat_size, im_width=640, im_height=480):
        self.obj_feat_size = obj_feat_size
        self.obj_ft_file = obj_ft_file
        self._feature_store = {}
        self.im_width = im_width
        self.im_height = im_height
        self.env = lmdb.open(self.obj_ft_file, readonly=True)

    def load_feature(self, scan, viewpoint, max_objects=None):
        key = '%s_%s' % (scan, viewpoint)
        if key in self._feature_store:
            obj_fts, obj_attrs = self._feature_store[key]
        else:
            with self.env.begin() as txn:
                obj_data = txn.get(key.encode('ascii'))
            if obj_data is not None:
                obj_data = msgpack.unpackb(obj_data)
                obj_fts = obj_data['fts'][:, :self.obj_feat_size].astype(np.float32)
                obj_attrs = {k: v for k, v in obj_data.items() if k != 'fts'}
            else:
                obj_fts = np.zeros((0, self.obj_feat_size), dtype=np.float32)
                obj_attrs = {}
            self._feature_store[key] = (obj_fts, obj_attrs)

        if max_objects is not None:
            obj_fts = obj_fts[:max_objects]
            obj_attrs = {k: v[:max_objects] for k, v in obj_attrs.items()}
        return obj_fts, obj_attrs

    def get_object_feature(
        self, scan, viewpoint, base_heading, base_elevation, angle_feat_size,
        max_objects=None
    ):
        obj_fts, obj_attrs = self.load_feature(scan, viewpoint, max_objects=max_objects)
        obj_ang_fts = np.zeros((len(obj_fts), angle_feat_size), dtype=np.float32)
        obj_box_fts = np.zeros((len(obj_fts), 3), dtype=np.float32)
        obj_ids = []
        if len(obj_fts) > 0:
            for k, obj_ang in enumerate(obj_attrs['centers']):
                obj_ang_fts[k] = angle_feature(
                    obj_ang[0] - base_heading, obj_ang[1] - base_elevation, angle_feat_size
                )
                w, h = obj_attrs['bboxes'][k][2:]
                obj_box_fts[k, :2] = [h/self.im_height, w/self.im_width]
                obj_box_fts[k, 2] = obj_box_fts[k, 0] * obj_box_fts[k, 1]
            obj_ids = obj_attrs['obj_ids']
        return obj_fts, obj_ang_fts, obj_box_fts, obj_ids
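
# Usage sketch (illustrative path and arguments): object features for one viewpoint.
#     obj_db = ObjectFeatureDB('../datasets/REVERIE/features/obj_gtmax_timm_imagenet_vitb16', 768)
#     obj_fts, obj_ang_fts, obj_box_fts, obj_ids = obj_db.get_object_feature(
#         scan, viewpoint, base_heading=0.0, base_elevation=0.0,
#         angle_feat_size=4, max_objects=50)
#     # obj_fts: (n_objs, 768); obj_ang_fts: (n_objs, 4);
#     # obj_box_fts: (n_objs, 3) = normalized (h, w, area)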


def load_instr_datasets(anno_dir, dataset, splits, tokenizer):
    data = []
    for split in splits:
        if "/" not in split:    # the official splits
            if tokenizer == 'bert':
                filepath = os.path.join(anno_dir, 'REVERIE_%s_enc.json' % split)
            elif tokenizer == 'xlm':
                filepath = os.path.join(anno_dir, 'REVERIE_%s_enc_xlmr.json' % split)
            else:
                raise NotImplementedError('unsupported tokenizer %s' % tokenizer)

            with open(filepath) as f:
                new_data = json.load(f)
        else:   # augmented data
            print('\nLoading augmented data %s for pretraining...' % os.path.basename(split))
            if split.endswith('json'):
                with open(split) as f:
                    new_data = json.load(f)
            elif split.endswith('jsonl'):
                # reuse pretrain aug format
                with jsonlines.open(split) as f:
                    new_data = []
                    for item in f:
                        objid = item['instr_id'].split('_')[1]
                        new_data.append({
                            'scan': item['scan'],
                            'id': '%s_%d'%(item['instr_id'], len(new_data)),
                            'instructions': [''],
                            'instr_encodings': [item['instr_encoding']],
                            'path_id': '%s_%d'%(item['instr_id'], len(new_data)),
                            'objId': objid,
                            'path': item['path'],
                            'heading': np.random.rand() * np.pi * 2,
                            'end_vps': item['pos_vps'],
                        })
            else:
                raise NotImplementedError('unsupported aug data format %s' % split)
        # Join
        data += new_data
    return data

def construct_instrs(anno_dir, dataset, splits, tokenizer, max_instr_len=512):
    data = []
    for i, item in enumerate(load_instr_datasets(anno_dir, dataset, splits, tokenizer)):
        # Split multiple instructions into separate entries
        for j, instr in enumerate(item['instructions']):
            new_item = dict(item)
            if 'objId' in item:
                new_item['instr_id'] = '%s_%s_%d' % (str(item['path_id']), str(item['objId']), j)
            else:
                new_item['path_id'] = item['id']
                new_item['instr_id'] = '%s_%d' % (item['id'], j)
                new_item['objId'] = None
            new_item['instruction'] = instr
            new_item['instr_encoding'] = item['instr_encodings'][j][:max_instr_len]
            del new_item['instructions']
            del new_item['instr_encodings']
            data.append(new_item)
    return data

def load_obj2vps(bbox_file):
    obj2vps = {}
    bbox_data = json.load(open(bbox_file))
    for scanvp, value in bbox_data.items():
        scan, vp = scanvp.split('_')
        # for all visible objects at that viewpoint
        for objid, objinfo in value.items():
            if objinfo['visible_pos']:
                # if such object not already in the dict
                obj2vps.setdefault(scan+'_'+objid, [])
                obj2vps[scan+'_'+objid].append(vp)
    return obj2vps

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/reverie/env_hm3d.py:
--------------------------------------------------------------------------------
''' Batched REVERIE navigation environment '''

import json
import os
import numpy as np
import math
import random
import networkx as nx
import copy
import h5py
import jsonlines
import collections

from utils.data import load_nav_graphs, new_simulator
from utils.data import angle_feature, get_all_point_angle_feature

from .env import EnvBatch, ReverieObjectNavBatch
from utils.data import ImageFeaturesDB
from .data_utils import ObjectFeatureDB

def construct_instrs(instr_files, max_instr_len=512):
    data = []
    for instr_file in instr_files:
        with jsonlines.open(instr_file) as f:
            for item in f:
                newitem = {
                    'instr_id': item['instr_id'],
                    'objId': item['objid'],
                    'scan': item['scan'],
                    'path': item['path'],
                    'end_vps': item['pos_vps'],
                    'instruction': item['instruction'],
                    'instr_encoding': item['instr_encoding'][:max_instr_len],
                    'heading': np.random.rand() * np.pi * 2,
                }
                data.append(newitem)
    return data


class HM3DReverieObjectNavBatch(ReverieObjectNavBatch):
    ''' Implements the REVERIE navigation task, using discretized viewpoints and pretrained features '''

    def __init__(
        self, view_db_file, obj_db_file, instr_files, connectivity_dir,
        multi_endpoints=False, multi_startpoints=False,
        image_feat_size=768, obj_feat_size=768,
        batch_size=64, angle_feat_size=4, max_objects=None,
        seed=0, name=None, sel_data_idxs=None, scan_ranges=None
    ):
        view_db = ImageFeaturesDB(view_db_file, image_feat_size)
        obj_db = ObjectFeatureDB(obj_db_file, obj_feat_size, im_width=224, im_height=224)
        instr_data = construct_instrs(instr_files, max_instr_len=100)
        if scan_ranges is not None:
            scan_idxs = set(list(range(scan_ranges[0], scan_ranges[1])))
            new_instr_data = []
            for item in instr_data:
                if int(item['scan'].split('-')[0]) in scan_idxs:
                    new_instr_data.append(item)
            instr_data = new_instr_data
        #print(connectivity_dir)
        #exit()
        self.env = EnvBatch(connectivity_dir, feat_db=view_db, batch_size=batch_size)
        self.obj_db = obj_db
        self.data = instr_data
        self.scans = set([x['scan'] for x in self.data])
        self.multi_endpoints = multi_endpoints
        self.multi_startpoints = multi_startpoints
        self.connectivity_dir = connectivity_dir
        self.batch_size = batch_size
        self.angle_feat_size = angle_feat_size
        self.max_objects = max_objects
        self.name = name

        self.gt_trajs = self._get_gt_trajs(self.data)   # for evaluation

        # in validation, we would split the data
        if sel_data_idxs is not None:
            t_split, n_splits = sel_data_idxs
            ndata_per_split = len(self.data) // n_splits
            start_idx = ndata_per_split * t_split
            if t_split == n_splits - 1:
                end_idx = None
            else:
                end_idx = start_idx + ndata_per_split
            self.data = self.data[start_idx: end_idx]

        # use different seeds in different processes to shuffle data
        self.seed = seed
        random.seed(self.seed)
        random.shuffle(self.data)

        self.obj2vps = collections.defaultdict(list)    # {scan_objid: vp_list} (objects can be viewed at the viewpoints)
        for item in self.data:
            self.obj2vps['%s_%s'%(item['scan'], item['objId'])].extend(item['end_vps'])

        self.ix = 0
        self._load_nav_graphs()

        self.sim = new_simulator(self.connectivity_dir)
        self.angle_feature = get_all_point_angle_feature(self.sim, self.angle_feat_size)

        self.buffered_state_dict = {}
        print('%s loaded with %d instructions, using splits: %s' % (
            self.__class__.__name__, len(self.data), self.name))

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/reverie/eval_utils.py:
--------------------------------------------------------------------------------
''' Utils for evaluation '''

import numpy as np


def cal_dtw(shortest_distances, prediction, reference, success=None, threshold=3.0):
    dtw_matrix = np.inf * np.ones((len(prediction) + 1, len(reference) + 1))
    dtw_matrix[0][0] = 0
    for i in range(1, len(prediction)+1):
        for j in range(1, len(reference)+1):
            best_previous_cost = min(
                dtw_matrix[i-1][j], dtw_matrix[i][j-1], dtw_matrix[i-1][j-1])
            cost = shortest_distances[prediction[i-1]][reference[j-1]]
            dtw_matrix[i][j] = cost + best_previous_cost

    dtw = dtw_matrix[len(prediction)][len(reference)]
    ndtw = np.exp(-dtw/(threshold * len(reference)))
    if success is None:
        success = float(shortest_distances[prediction[-1]][reference[-1]] < threshold)
    sdtw = success * ndtw

    return {
        'DTW': dtw,
        'nDTW': ndtw,
        'SDTW': sdtw
    }
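
# Worked example (sketch): a prediction that matches the reference exactly has
# zero DTW cost, so nDTW = exp(0) = 1, and since the endpoint is within the
# threshold, SDTW = 1 as well.
#     D = {'a': {'a': 0.0, 'b': 5.0}, 'b': {'a': 5.0, 'b': 0.0}}
#     cal_dtw(D, prediction=['a', 'b'], reference=['a', 'b'])
#     # -> {'DTW': 0.0, 'nDTW': 1.0, 'SDTW': 1.0}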

def cal_cls(shortest_distances, prediction, reference, threshold=3.0):
    def length(nodes):
        return np.sum([
            shortest_distances[a][b]
            for a, b in zip(nodes[:-1], nodes[1:])
        ])

    coverage = np.mean([
        np.exp(-np.min([  # pylint: disable=g-complex-comprehension
            shortest_distances[u][v] for v in prediction
        ]) / threshold) for u in reference
    ])
    expected = coverage * length(reference)
    score = expected / (expected + np.abs(expected - length(prediction)))
    return coverage * score

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/reverie/parser.py:
--------------------------------------------------------------------------------
import argparse
import os


def parse_args():
    parser = argparse.ArgumentParser(description="")

    parser.add_argument('--root_dir', type=str, default='/sequoia/data3/shichen/datasets')
    parser.add_argument('--dataset', type=str, default='reverie', choices=['reverie'])
    parser.add_argument('--output_dir', type=str, default='default', help='experiment id')
    parser.add_argument('--seed', type=int, default=0)

    parser.add_argument('--tokenizer', choices=['bert', 'xlm'], default='bert')

    parser.add_argument('--fusion', choices=['global', 'local', 'avg', 'dynamic'])
    parser.add_argument('--dagger_sample', choices=['sample', 'expl_sample', 'argmax'])
    parser.add_argument('--expl_max_ratio', type=float, default=0.6)
    parser.add_argument('--loss_nav_3', action='store_true', default=False)

    # distributed training (single-node, multiple-gpus)
    parser.add_argument('--world_size', type=int, default=1, help='number of gpus')
    parser.add_argument('--local_rank', type=int, default=-1)
    parser.add_argument("--node_rank", type=int, default=0, help="Id of the node")

    # General
    parser.add_argument('--iters', type=int, default=100000, help='training iterations')
    parser.add_argument('--log_every', type=int, default=1000)
    parser.add_argument('--eval_first', action='store_true', default=False)

    # Data preparation
    parser.add_argument('--max_instr_len', type=int, default=80)
    parser.add_argument('--max_action_len', type=int, default=15)
    parser.add_argument('--max_objects', type=int, default=20)
    parser.add_argument('--batch_size', type=int, default=8)
    parser.add_argument('--ignoreid', type=int, default=-100, help='ignoreid for action')

    # Load the model from
    parser.add_argument("--resume_file", default=None, help='path of the trained model')
    parser.add_argument("--resume_optimizer", action="store_true", default=False)

    # Augmented Paths from
    parser.add_argument("--multi_endpoints", default=False, action="store_true")
    parser.add_argument("--multi_startpoints", default=False, action="store_true")
    parser.add_argument("--aug_only", default=False, action="store_true")
    parser.add_argument("--aug", default=None)
    parser.add_argument('--bert_ckpt_file', default=None, help='init vlnbert')

    # Listener Model Config
    parser.add_argument("--ml_weight", type=float, default=0.20)
    parser.add_argument('--entropy_loss_weight', type=float, default=0.01)

    parser.add_argument("--features", type=str, default='vitbase')
    parser.add_argument('--obj_features', type=str, default='vitbase')

    parser.add_argument('--fix_lang_embedding', action='store_true', default=False)
    parser.add_argument('--fix_pano_embedding', action='store_true', default=False)
    parser.add_argument('--fix_local_branch', action='store_true', default=False)

    parser.add_argument('--num_l_layers', type=int, default=9)
    parser.add_argument('--num_pano_layers', type=int, default=2)
    parser.add_argument('--num_x_layers', type=int, default=4)

    parser.add_argument('--enc_full_graph', default=False, action='store_true')
    parser.add_argument('--graph_sprels', action='store_true', default=False)

    # Dropout Param
    parser.add_argument('--dropout', type=float, default=0.5)
    parser.add_argument('--feat_dropout', type=float, default=0.3)

    # Submission configuration
    parser.add_argument('--test', action='store_true', default=False)
    parser.add_argument('--zero_shot', action='store_true', default=False)
    parser.add_argument("--submit", action='store_true', default=False)
    parser.add_argument('--no_backtrack', action='store_true', default=False)
    parser.add_argument('--detailed_output', action='store_true', default=False)

    # Training Configurations
    parser.add_argument(
        '--optim', type=str, default='rms',
        choices=['rms', 'adam', 'adamW', 'sgd']
    )    # rms, adam
    parser.add_argument('--lr', type=float, default=0.00001, help="the learning rate")
    parser.add_argument('--decay', dest='weight_decay', type=float, default=0.)
    parser.add_argument(
        '--feedback', type=str, default='sample',
        help='How to choose next position, one of ``teacher``, ``sample`` and ``argmax``'
    )
    parser.add_argument('--epsilon', type=float, default=0.1, help='')

    # Model hyper params:
    parser.add_argument("--angle_feat_size", type=int, default=4)
    parser.add_argument('--image_feat_size', type=int, default=2048)
    parser.add_argument('--obj_feat_size', type=int, default=2048)
    parser.add_argument('--views', type=int, default=36)

    # # A2C
    parser.add_argument("--gamma", default=0.9, type=float, help='reward discount factor')
    parser.add_argument(
        "--normalize", dest="normalize_loss", default="total",
        type=str, help='batch or total'
    )
    parser.add_argument('--train_alg',
        choices=['imitation', 'dagger', 'a3c', 'reinforce'],
        default='imitation'
    )

    parser.add_argument('--hm3d_scan_ranges', nargs='+', type=int, default=None)
    parser.add_argument('--hm3d_og_loss_weight', type=float, default=1)


    parser.add_argument('--num_mp3d_scans', default=None, type=int)

    args, _ = parser.parse_known_args()

    args = postprocess_args(args)

    return args


def postprocess_args(args):
    ROOTDIR = args.root_dir

    # Setup input paths
    ft_file_map = {
        'timm_vitb16': 'view_timm_imagenet_vitb16',
        # 'timm_vitb16': '../../../../data/img_features',
        'mmdet2d_frcnn_hm3d': 'obj_gtmax_mmdet2d_frcnn_hm3d/views',
        'clip-h14': 'clip_vit-h14_final.hdf5'
    }
    args.img_ft_file = os.path.join(ROOTDIR, 'R2R', 'features', ft_file_map[args.features])
    if not os.path.exists(args.img_ft_file):
        args.img_ft_file = os.path.join(ROOTDIR, 'R2R', 'features', ft_file_map[args.features])
    if args.features == 'timm_vitb16':
        args.img_ft_file = "/nvme/wangzun/vln/hm3d-vln/HM3DAutoVLN/datasets/R2R_hm3dautovln/features/view_timm_imagenet_vitb16"
    elif args.features == 'clip-h14':
        args.img_ft_file = "../datasets/R2R/features/clip_vit-h14_final.hdf5"
        args.aug_ft_file = "../datasets/R2R/features/clip_vit-h14_final.hdf5"

    args.mp3d_ft_files = [
        os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_img_image_synthesis.hdf5'),
        os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_img_mask_image_synthesis.hdf5'),
        os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_img_style_transfer.hdf5'),
        os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_original.hdf5'),
    ]
    # args.mp3d_ft_files = [
    #     os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_original.hdf5'),
    # ]

    obj_ft_file_map = {
        'timm_vitb16': 'obj_gtmax_timm_imagenet_vitb16',
        'mmdet2d_frcnn_hm3d': 'obj_gtmax_mmdet2d_frcnn_hm3d/gt_objs',
    }
    args.obj_ft_file = os.path.join(ROOTDIR, 'REVERIE', 'features', obj_ft_file_map[args.obj_features])
    args.obj_ft_file = "../datasets/REVERIE/features/obj_gtmax_timm_imagenet_vitb16"

    args.val_ft_file = os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_original.hdf5')

    args.connectivity_dir = os.path.join(ROOTDIR, 'R2R', 'connectivity')
    args.scan_data_dir = os.path.join(ROOTDIR, 'Matterport3D', 'v1_unzip_scans')

    args.anno_dir = os.path.join(ROOTDIR, 'REVERIE', 'annotations')

    # Build paths
    args.ckpt_dir = os.path.join(args.output_dir, 'ckpts')
    args.log_dir = os.path.join(args.output_dir, 'logs')
    if args.zero_shot:
        args.log_dir = os.path.join(args.output_dir, 'zero_shot_logs')
    args.pred_dir = os.path.join(args.output_dir, 'preds')

    os.makedirs(args.output_dir, exist_ok=True)
    os.makedirs(args.ckpt_dir, exist_ok=True)
    os.makedirs(args.log_dir, exist_ok=True)
    os.makedirs(args.pred_dir, exist_ok=True)

    return args

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/scripts/run_reverie.sh:
--------------------------------------------------------------------------------
#! /bin/bash

DATA_ROOT=../datasets

train_alg=dagger

features=clip-h14
ft_dim=1024
obj_features=timm_vitb16
obj_ft_dim=768

ngpus=1
seed=0

name=scalevln_rvr
outdir=${DATA_ROOT}/REVERIE/expr_duet/finetune/${name}

flag="--root_dir ${DATA_ROOT}
      --dataset reverie
      --output_dir ${outdir}
      --world_size ${ngpus}
      --seed ${seed}
      --tokenizer bert

      --enc_full_graph
      --graph_sprels
      --fusion dynamic
      --multi_endpoints

      --dagger_sample sample

      --train_alg ${train_alg}

      --num_l_layers 9
      --num_x_layers 4
      --num_pano_layers 2

      --max_action_len 15
      --max_instr_len 100
      --max_objects 50

      --batch_size 32
      --lr 2e-5
      --iters 200000
      --log_every 500
      --optim adamW

      --features ${features}
      --obj_features ${obj_features}
      --image_feat_size ${ft_dim}
      --angle_feat_size 4
      --obj_feat_size ${obj_ft_dim}

      --ml_weight 0.2

      --feat_dropout 0.4
      --dropout 0.5

      --gamma 0."

CUDA_VISIBLE_DEVICES=0 python reverie/main_nav_obj_hm3d.py $flag --tokenizer bert \
      --bert_ckpt_file ../datasets/REVERIE/trained_models/model_step_40000.pt \
      --aug ../datasets/REVERIE/annotations/ade20k_pseudo3d_depth2_epoch_94_beam0_sample10.jsonl \
      --eval_first

# python reverie/main_nav_obj_hm3d.py $flag --tokenizer bert \
#       --zero_shot

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/utils/data.py:
--------------------------------------------------------------------------------
import os
import json
import jsonlines
import networkx as nx
import math
import numpy as np

import lmdb
import msgpack
import msgpack_numpy
msgpack_numpy.patch()
import h5py
import random

class ImageFeaturesDB(object):
    def __init__(self, img_ft_file, image_feat_size):
        self.image_feat_size = image_feat_size
        self.img_ft_file = img_ft_file
        self._feature_store = {}
        if ".hdf5" not in self.img_ft_file:
            self.env = lmdb.open(self.img_ft_file, readonly=True)
        else:
            print('pass!')
            with h5py.File(self.img_ft_file, 'r') as f:
                for key in list(f.keys()):
                    self._feature_store[key] = f[key][...].astype(np.float32)

    def __del__(self):
        if ".hdf5" not in self.img_ft_file:
            self.env.close()

    def get_image_feature(self, scan, viewpoint):
        key = '%s_%s' % (scan, viewpoint)
        if key in self._feature_store:
            ft = self._feature_store[key]
        else:
            if ".hdf5" in self.img_ft_file:
                with h5py.File(self.img_ft_file, 'r') as f:
                    ft = f[key][...].astype(np.float32)
            else:
                with self.env.begin() as txn:
                    ft = msgpack.unpackb(txn.get(key.encode('ascii')))
            ft = ft[:, :self.image_feat_size].astype(np.float32)
            self._feature_store[key] = ft
        return ft
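
# Usage sketch (illustrative path): for .hdf5 feature files the whole file is
# preloaded into memory in __init__; lmdb files are instead read lazily per key
# and cached. Keys are '<scan>_<viewpoint>'.
#     db = ImageFeaturesDB('../datasets/R2R/features/clip_vit-h14_final.hdf5', 1024)
#     ft = db.get_image_feature(scan, viewpoint)   # (36 views, 1024) float32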

class ImageFeaturesDB2(object):
    def __init__(self, img_ft_files, image_feat_size):
        self.image_feat_size = image_feat_size
        self.img_ft_file = img_ft_files
        self._feature_stores = {}
        for name in img_ft_files:
            self._feature_stores[name] = {}
            with h5py.File(name, 'r') as f:
                for key in f.keys():
                    ft = f[key][...][:, :self.image_feat_size].astype(np.float32)
                    self._feature_stores[name][key] = ft
        self.env_names = list(self._feature_stores.keys())
        print(self.env_names)


    def get_image_feature(self, scan, viewpoint):
        key = '%s_%s' % (scan, viewpoint)
        env_name = random.choice(self.env_names)
        if key in self._feature_stores[env_name]:
            ft = self._feature_stores[env_name][key]
        else:
            with h5py.File(env_name, 'r') as f:
                ft = f[key][...][:, :self.image_feat_size].astype(np.float32)
                self._feature_stores[env_name][key] = ft
        return ft

def load_nav_graphs(connectivity_dir, scans):
    ''' Load connectivity graph for each scan '''

    def distance(pose1, pose2):
        ''' Euclidean distance between two graph poses '''
        return ((pose1['pose'][3]-pose2['pose'][3])**2\
            + (pose1['pose'][7]-pose2['pose'][7])**2\
            + (pose1['pose'][11]-pose2['pose'][11])**2)**0.5

    graphs = {}
    for scan in scans:
        with open(os.path.join(connectivity_dir, '%s_connectivity.json' % scan)) as f:
            G = nx.Graph()
            positions = {}
            data = json.load(f)
            for i, item in enumerate(data):
                if item['included']:
                    for j, conn in enumerate(item['unobstructed']):
                        if conn and data[j]['included']:
                            positions[item['image_id']] = np.array([item['pose'][3],
                                    item['pose'][7], item['pose'][11]])
                            assert data[j]['unobstructed'][i], 'Graph should be undirected'
                            G.add_edge(item['image_id'], data[j]['image_id'], weight=distance(item, data[j]))
            nx.set_node_attributes(G, values=positions, name='position')
            graphs[scan] = G
    return graphs

def new_simulator(connectivity_dir, scan_data_dir=None, width=640, height=480, vfov=60):
    import MatterSim

    sim = MatterSim.Simulator()
    if scan_data_dir:
        sim.setDatasetPath(scan_data_dir)
    sim.setNavGraphPath(connectivity_dir)
    sim.setRenderingEnabled(False)
    sim.setCameraResolution(width, height)
    sim.setCameraVFOV(math.radians(vfov))
    sim.setDiscretizedViewingAngles(True)
    sim.setBatchSize(1)
    sim.initialize()
    #sim.init()

    return sim

def angle_feature(heading, elevation, angle_feat_size):
    return np.array(
        [math.sin(heading), math.cos(heading), math.sin(elevation), math.cos(elevation)] * (angle_feat_size // 4),
        dtype=np.float32)

def get_point_angle_feature(sim, angle_feat_size, baseViewId=0):
    feature = np.empty((36, angle_feat_size), np.float32)
    base_heading = (baseViewId % 12) * math.radians(30)
    base_elevation = (baseViewId // 12 - 1) * math.radians(30)

    for ix in range(36):
        # if ix == 0:
        #     sim.newEpisode(['ZMojNkEp431'], ['2f4d90acd4024c269fb0efe49a8ac540'], [0], [math.radians(-30)])
        # elif ix % 12 == 0:
        #     sim.makeAction([0], [1.0], [1.0])
        # else:
        #     sim.makeAction([0], [1.0], [0])

        # state = sim.getState()[0]
        # assert state.viewIndex == ix

        # heading = state.heading - base_heading
        # elevation = state.elevation - base_elevation
        heading = (ix % 12) * math.radians(30) - base_heading
        elevation = (ix // 12 - 1) * math.radians(30) - base_elevation

        feature[ix, :] = angle_feature(heading, elevation, angle_feat_size)
    return feature

def get_all_point_angle_feature(sim, angle_feat_size):
    return [get_point_angle_feature(sim, angle_feat_size, baseViewId) for baseViewId in range(36)]

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/utils/distributed.py:
--------------------------------------------------------------------------------
"""
Distributed tools
"""
import os
from pathlib import Path
from pprint import pformat
import pickle

import torch
import torch.distributed as dist


def load_init_param(opts):
    """
    Load parameters for the rendezvous distributed procedure
    """
    # sync file
    if opts.output_dir != "":
        sync_dir = Path(opts.output_dir).resolve()
        sync_dir.mkdir(parents=True, exist_ok=True)
        sync_file = f"{sync_dir}/.torch_distributed_sync"
    else:
        raise RuntimeError("Can't find any sync dir")

    # world size
    if opts.world_size != -1:
        world_size = opts.world_size
    elif os.environ.get("WORLD_SIZE", "") != "":
        world_size = int(os.environ["WORLD_SIZE"])
    else:
        raise RuntimeError("Can't find any world size")

    # rank
    if os.environ.get("RANK", "") != "":
        # pytorch.distributed.launch provide this variable no matter what
        rank = int(os.environ["RANK"])
    else:
        if opts.node_rank != -1:
            node_rank = opts.node_rank
        elif os.environ.get("NODE_RANK", "") != "":
            node_rank = int(os.environ["NODE_RANK"])
        else:
            raise RuntimeError("Can't find any rank or node rank")

        if opts.local_rank != -1:
            local_rank = opts.local_rank
        elif os.environ.get("LOCAL_RANK", "") != "":
            local_rank = int(os.environ["LOCAL_RANK"])
        else:
            raise RuntimeError("Can't find any rank or local rank")

        # WARNING: this assumes that each node has the same number of GPUs
        n_gpus = torch.cuda.device_count()
        rank = local_rank + node_rank * n_gpus

    return {
        "backend": "nccl",
        # "init_method": f"file://{sync_file}",
        "rank": rank,
        "world_size": world_size,
    }


def init_distributed(opts):
    init_param = load_init_param(opts)
    rank = init_param["rank"]

    print(f"Init distributed {init_param['rank']} - {init_param['world_size']}")

    dist.init_process_group(**init_param)
    return rank


def is_default_gpu(opts) -> bool:
    return opts.local_rank == -1 or dist.get_rank() == 0


def is_dist_avail_and_initialized():
    if not dist.is_available():
        return False
    if not dist.is_initialized():
        return False
    return True

def get_world_size():
    if not is_dist_avail_and_initialized():
        return 1
    return dist.get_world_size()

def all_gather(data):
    """
    Run all_gather on arbitrary picklable data (not necessarily tensors)
    Args:
        data: any picklable object
    Returns:
        list[data]: list of data gathered from each rank
    """
    world_size = get_world_size()
    if world_size == 1:
        return [data]

    # serialized to a Tensor
    buffer = pickle.dumps(data)
    storage = torch.ByteStorage.from_buffer(buffer)
    tensor = torch.ByteTensor(storage).to("cuda")

    # obtain Tensor size of each rank
    local_size = torch.tensor([tensor.numel()], device="cuda")
    size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)]
    dist.all_gather(size_list, local_size)
    size_list = [int(size.item()) for size in size_list]
    max_size = max(size_list)

    # receiving Tensor from all ranks
    # we pad the tensor because torch all_gather does not support
    # gathering tensors of different shapes
    tensor_list = []
    for _ in size_list:
        tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
    if local_size != max_size:
        padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda")
        tensor = torch.cat((tensor, padding), dim=0)
    dist.all_gather(tensor_list, tensor)

    data_list = []
    for size, tensor in zip(size_list, tensor_list):
        buffer = tensor.cpu().numpy().tobytes()[:size]
        data_list.append(pickle.loads(buffer))

    return data_list
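
# Usage sketch (illustrative): gather arbitrary picklable results from all
# ranks, e.g. per-rank prediction lists during distributed evaluation, then
# flatten them with merge_dist_results (defined below).
#     preds = [...]                                      # this rank's results
#     all_preds = merge_dist_results(all_gather(preds))  # one flat list over all ranks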


def reduce_dict(input_dict, average=True):
    """
    Args:
        input_dict (dict): all the values will be reduced
        average (bool): whether to do average or sum
    Reduce the values in the dictionary from all processes so that all processes
    have the averaged results. Returns a dict with the same fields as
    input_dict, after reduction.
    """
    world_size = get_world_size()
    if world_size < 2:
        return input_dict
    with torch.no_grad():
        names = []
        values = []
        # sort the keys so that they are consistent across processes
        for k in sorted(input_dict.keys()):
            names.append(k)
            values.append(input_dict[k])
        values = torch.stack(values, dim=0)
        dist.all_reduce(values)
        if average:
            values /= world_size
        reduced_dict = {k: v for k, v in zip(names, values)}
    return reduced_dict


def merge_dist_results(results):
    outs = []
    for res in results:
        outs.extend(res)
    return outs

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/utils/logger.py:
--------------------------------------------------------------------------------
import os
import sys
import math
import time
from collections import OrderedDict


def write_to_record_file(data, file_path, verbose=True):
    if verbose:
        print(data)
    record_file = open(file_path, 'a')
    record_file.write(data+'\n')
    record_file.close()


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

class Timer:
    def __init__(self):
        self.cul = OrderedDict()
        self.start = {}
        self.iter = 0

    def reset(self):
        self.cul = OrderedDict()
        self.start = {}
        self.iter = 0

    def tic(self, key):
        self.start[key] = time.time()

    def toc(self, key):
        delta = time.time() - self.start[key]
        if key not in self.cul:
            self.cul[key] = delta
        else:
            self.cul[key] += delta

    def step(self):
        self.iter += 1

    def show(self):
        total = sum(self.cul.values())
        for key in self.cul:
            print("%s, total time %0.2f, avg time %0.2f, part of %0.2f" %
                  (key, self.cul[key], self.cul[key]*1./self.iter, self.cul[key]*1./total))
        print(total / self.iter)
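
# Usage sketch (illustrative keys): accumulate wall-clock time per labelled
# section across iterations, then report totals, per-iteration averages and
# each section's fraction of the overall time.
#     timer = Timer()
#     timer.tic('rollout'); run_rollout(); timer.toc('rollout')  # run_rollout is hypothetical
#     timer.step()
#     timer.show()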


def print_progress(iteration, total, prefix='', suffix='', decimals=1, bar_length=100):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        bar_length  - Optional  : character length of bar (Int)
    """
    str_format = "{0:." + str(decimals) + "f}"
    percents = str_format.format(100 * (iteration / float(total)))
    filled_length = int(round(bar_length * iteration / float(total)))
    bar = '█' * filled_length + '-' * (bar_length - filled_length)

    sys.stdout.write('\r%s |%s| %s%s %s' % (prefix, bar, percents, '%', suffix)),

    if iteration == total:
        sys.stdout.write('\n')
    sys.stdout.flush()

--------------------------------------------------------------------------------
/VLN-DUET-RVR/map_nav_src/utils/misc.py:
--------------------------------------------------------------------------------
import random
import numpy as np
import torch

def set_random_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    random.seed(seed)
    np.random.seed(seed)

def length2mask(length, size=None):
    batch_size = len(length)
    size = int(max(length)) if size is None else size
    mask = (torch.arange(size, dtype=torch.int64).unsqueeze(0).repeat(batch_size, 1)
            > (torch.LongTensor(length) - 1).unsqueeze(1)).cuda()
    return mask
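
# Example (sketch): True marks padded positions; the result lives on GPU.
#     length2mask([1, 3])
#     # tensor([[False,  True,  True],
#     #         [False, False, False]], device='cuda:0')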
= t.data 20 | return output 21 | 22 | def gen_seq_masks(seq_lens, max_len=None): 23 | if max_len is None: 24 | max_len = max(seq_lens) 25 | 26 | if isinstance(seq_lens, torch.Tensor): 27 | device = seq_lens.device 28 | masks = torch.arange(max_len).to(device).repeat(len(seq_lens), 1) < seq_lens.unsqueeze(1) 29 | return masks 30 | 31 | if max_len == 0: 32 | return np.zeros((len(seq_lens), 0), dtype=bool) 33 | 34 | seq_lens = np.array(seq_lens) 35 | batch_size = len(seq_lens) 36 | masks = np.arange(max_len).reshape(-1, max_len).repeat(batch_size, 0) 37 | masks = masks < seq_lens.reshape(-1, 1) 38 | return masks -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/configs/model_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "pred_head_dropout_prob": 0.1, 3 | "attention_probs_dropout_prob": 0.1, 4 | "finetuning_task": null, 5 | "hidden_act": "gelu", 6 | "hidden_dropout_prob": 0.1, 7 | "hidden_size": 768, 8 | "image_feat_size": 1024, 9 | "image_prob_size": 0, 10 | "angle_feat_size": 4, 11 | "obj_feat_size": 768, 12 | "obj_prob_size": 0, 13 | "share_scene_obj_enc": false, 14 | "img_feature_type": "imagenet", 15 | "initializer_range": 0.02, 16 | "intermediate_size": 3072, 17 | "num_l_layers": 9, 18 | "num_x_layers": 4, 19 | "num_pano_layers": 2, 20 | "layer_norm_eps": 1e-12, 21 | "max_position_embeddings": 512, 22 | "max_action_steps": 100, 23 | "num_attention_heads": 12, 24 | "num_hidden_layers": 12, 25 | "num_labels": 2, 26 | "output_attentions": false, 27 | "output_hidden_states": false, 28 | "pruned_heads": {}, 29 | "torchscript": false, 30 | "type_vocab_size": 2, 31 | "update_lang_bert": true, 32 | "vocab_size": 30522, 33 | "use_lang2visn_attn": true, 34 | "graph_sprels": true, 35 | "glocal_fuse": true, 36 | "lang_bert_name": "bert-base-uncased" 37 | } -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/configs/training_args.json: -------------------------------------------------------------------------------- 1 | { 2 | "vlnbert": "cmt", 3 | "model_config": "", 4 | "checkpoint": null, 5 | "output_dir": "", 6 | "train_batch_size": 128, 7 | "val_batch_size": 64, 8 | "gradient_accumulation_steps": 1, 9 | "learning_rate": 5e-05, 10 | "valid_steps": 5000, 11 | "log_steps": 1000, 12 | "num_train_steps": 200000, 13 | "optim": "adamw", 14 | "betas": [ 15 | 0.9, 16 | 0.98 17 | ], 18 | "dropout": 0.1, 19 | "weight_decay": 0.01, 20 | "grad_norm": 5.0, 21 | "warmup_steps": 10000, 22 | "seed": 0, 23 | "fp16": false, 24 | "n_workers": 0, 25 | "pin_mem": true, 26 | "local_rank": -1, 27 | "node_rank": 0, 28 | "world_size": 1, 29 | "mrc_mask_prob": 0.15, 30 | "itm_neg_imgs": 5, 31 | "nearby_vp_steps": null, 32 | "max_objects": 50, 33 | "max_txt_len": 100, 34 | "init_pretrained": "lxmert", 35 | "train_datasets": { 36 | "HM3D": { 37 | "name": "HM3D", 38 | "train_traj_files": [ 39 | "../datasets/REVERIE/annotations/pretrain/ade20k_pseudo3d_depth2_epoch_94_beam0_zun_3_none.jsonl", 40 | "../datasets/REVERIE/annotations/pretrain/ade20k_pseudo3d_depth2_epoch_94_beam0_zun_gibson_3_none.jsonl" 41 | ], 42 | "connectivity_dir": "../datasets/R2R/connectivity", 43 | "img_ft_file": "../datasets/R2R/features/clip_vit-h14_final.hdf5", 44 | "obj_ft_file": "../../data_all/obj_features_merged", 45 | "scanvp_cands_file": [ 46 | "../datasets/REVERIE/annotations/scanvp_candview_relangles_new.json", 47 |
"../datasets/R2R/annotations/scanvp_candview_relangles.json", 48 | "../datasets/REVERIE/annotations/scanvp_candview_relangles_new_gibson.json" 49 | ], 50 | "tasks": [ 51 | "mlm", 52 | "sap", 53 | "og" 54 | ], 55 | "mix_ratio": [ 56 | 1, 57 | 1, 58 | 1 59 | ], 60 | "scan_ranges": null 61 | }, 62 | "REVERIE": { 63 | "name": "REVERIE", 64 | "val_seen_traj_files": [ 65 | "../datasets/REVERIE/annotations/pretrain/REVERIE_val_seen_enc.jsonl" 66 | ], 67 | "val_unseen_traj_files": [ 68 | "../datasets/REVERIE/annotations/pretrain/REVERIE_val_unseen_enc.jsonl" 69 | ], 70 | "connectivity_dir": "../datasets/R2R/connectivity", 71 | "img_ft_file": "../datasets/R2R/features/clip_vit-h14_final.hdf5", 72 | "obj_ft_file": "../datasets/REVERIE/features/obj_gtmax_timm_imagenet_vitb16", 73 | "scanvp_cands_file": [ 74 | "../datasets/R2R/annotations/scanvp_candview_relangles.json" 75 | ] 76 | } 77 | } 78 | } -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/VLN-DUET-RVR/pretrain_src/data/__init__.py -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/data/common.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import json 4 | import numpy as np 5 | import networkx as nx 6 | 7 | import torch 8 | 9 | def pad_tensors(tensors, lens=None, pad=0): 10 | """B x [T, ...] torch tensors""" 11 | if lens is None: 12 | lens = [t.size(0) for t in tensors] 13 | max_len = max(lens) 14 | bs = len(tensors) 15 | hid = list(tensors[0].size()[1:]) 16 | size = [bs, max_len] + hid 17 | 18 | dtype = tensors[0].dtype 19 | output = torch.zeros(*size, dtype=dtype) 20 | if pad: 21 | output.data.fill_(pad) 22 | for i, (t, l) in enumerate(zip(tensors, lens)): 23 | output.data[i, :l, ...] 
= t.data 24 | return output 25 | 26 | def gen_seq_masks(seq_lens, max_len=None): 27 | """ 28 | Args: 29 | seq_lens: list or nparray int, shape=(N, ) 30 | Returns: 31 | masks: nparray, shape=(N, L), padded=0 32 | """ 33 | seq_lens = np.array(seq_lens) 34 | if max_len is None: 35 | max_len = max(seq_lens) 36 | if max_len == 0: 37 | return np.zeros((len(seq_lens), 0), dtype=bool) 38 | batch_size = len(seq_lens) 39 | masks = np.arange(max_len).reshape(-1, max_len).repeat(batch_size, 0) 40 | masks = masks < seq_lens.reshape(-1, 1) 41 | return masks 42 | 43 | def get_angle_fts(headings, elevations, angle_feat_size): 44 | ang_fts = [np.sin(headings), np.cos(headings), np.sin(elevations), np.cos(elevations)] 45 | ang_fts = np.vstack(ang_fts).transpose().astype(np.float32) 46 | num_repeats = angle_feat_size // 4 47 | if num_repeats > 1: 48 | ang_fts = np.concatenate([ang_fts] * num_repeats, 1) 49 | return ang_fts 50 | 51 | def get_view_rel_angles(baseViewId=0): 52 | rel_angles = np.zeros((36, 2), dtype=np.float32) 53 | 54 | base_heading = (baseViewId % 12) * math.radians(30) 55 | base_elevation = (baseViewId // 12 - 1) * math.radians(30) 56 | for ix in range(36): 57 | if ix == 0: 58 | heading = 0 59 | elevation = math.radians(-30) 60 | elif ix % 12 == 0: 61 | heading = 0 62 | elevation += math.radians(30) 63 | else: 64 | heading += math.radians(30) 65 | rel_angles[ix, 0] = heading - base_heading 66 | rel_angles[ix, 1] = elevation - base_elevation 67 | 68 | return rel_angles 69 | 70 | 71 | def load_nav_graphs(connectivity_dir): 72 | ''' Load connectivity graph for each scan ''' 73 | 74 | def distance(pose1, pose2): 75 | ''' Euclidean distance between two graph poses ''' 76 | return ((pose1['pose'][3]-pose2['pose'][3])**2\ 77 | + (pose1['pose'][7]-pose2['pose'][7])**2\ 78 | + (pose1['pose'][11]-pose2['pose'][11])**2)**0.5 79 | 80 | scans = [x.strip() for x in open(os.path.join(connectivity_dir, 'scans.txt')).readlines()] 81 | graphs = {} 82 | for scan in scans: 83 | with open(os.path.join(connectivity_dir, '%s_connectivity.json' % scan)) as f: 84 | G = nx.Graph() 85 | positions = {} 86 | data = json.load(f) 87 | for i, item in enumerate(data): 88 | if item['included']: 89 | for j,conn in enumerate(item['unobstructed']): 90 | if conn and data[j]['included']: 91 | positions[item['image_id']] = np.array([item['pose'][3], 92 | item['pose'][7], item['pose'][11]]) 93 | assert data[j]['unobstructed'][i], 'Graph should be undirected' 94 | G.add_edge(item['image_id'],data[j]['image_id'],weight=distance(item,data[j])) 95 | nx.set_node_attributes(G, values=positions, name='position') 96 | graphs[scan] = G 97 | 98 | shortest_distances = {} 99 | shortest_paths = {} 100 | for scan, G in graphs.items(): # compute all shortest paths 101 | shortest_distances[scan] = dict(nx.all_pairs_dijkstra_path_length(G)) 102 | shortest_paths[scan] = dict(nx.all_pairs_dijkstra_path(G)) 103 | return graphs, shortest_distances, shortest_paths 104 | 105 | def softmax(logits, dim=1): 106 | # logits: (n, d) 107 | tmp = np.exp(logits) 108 | return tmp / np.sum(tmp, axis=dim, keepdims=True) 109 | 110 | 111 | def calculate_vp_rel_pos_fts(a, b, base_heading=0, base_elevation=0): 112 | # a, b: (x, y, z) 113 | dx = b[0] - a[0] 114 | dy = b[1] - a[1] 115 | dz = b[2] - a[2] 116 | xy_dist = max(np.sqrt(dx**2 + dy**2), 1e-8) 117 | xyz_dist = max(np.sqrt(dx**2 + dy**2 + dz**2), 1e-8) 118 | 119 | # the simulator's api is weird (x-y axis is transposed) 120 | heading = np.arcsin(dx/xy_dist) # [-pi/2, pi/2] 121 | if b[1] < a[1]: 122 | heading = 
np.pi - heading 123 | heading -= base_heading 124 | 125 | elevation = np.arcsin(dz/xyz_dist) # [-pi/2, pi/2] 126 | elevation -= base_elevation 127 | 128 | return heading, elevation, xyz_dist 129 | 130 | def normalize_angle(x): 131 | '''convert radians into (-pi, pi]''' 132 | pi2 = 2 * math.pi 133 | x = x % pi2 # [0, 2pi] 134 | x = np.where(x > math.pi, x - pi2, x) 135 | return x -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/data/loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) Microsoft Corporation. 3 | Licensed under the MIT license. 4 | 5 | A prefetch loader to speedup data loading 6 | Modified from Nvidia Deep Learning Examples 7 | (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch). 8 | """ 9 | import random 10 | from typing import List, Dict, Tuple, Union, Iterator 11 | 12 | import torch 13 | from torch.utils.data import DataLoader, RandomSampler, SequentialSampler 14 | from torch.utils.data.distributed import DistributedSampler 15 | import torch.distributed as dist 16 | 17 | 18 | class MetaLoader: 19 | """wraps multiple data loaders""" 20 | 21 | def __init__( 22 | self, loaders, accum_steps: int = 1, distributed: bool = False, device=None 23 | ): 24 | assert isinstance(loaders, dict) 25 | self.name2loader = {} 26 | self.name2iter = {} 27 | self.name2pre_epoch = {} 28 | self.names: List[str] = [] 29 | ratios: List[int] = [] 30 | for n, l in loaders.items(): 31 | if isinstance(l, tuple): 32 | l, r, p = l 33 | elif isinstance(l, DataLoader): 34 | r = 1 35 | p = lambda e: None 36 | else: 37 | raise ValueError() 38 | self.names.append(n) 39 | self.name2loader[n] = l 40 | self.name2iter[n] = iter(l) 41 | self.name2pre_epoch[n] = p 42 | ratios.append(r) 43 | 44 | self.accum_steps = accum_steps 45 | self.device = device 46 | self.sampling_ratios = torch.tensor(ratios).float().to(self.device) 47 | self.distributed = distributed 48 | self.step = 0 49 | 50 | def __iter__(self) -> Iterator[Tuple]: 51 | """this iterator will run indefinitely""" 52 | task_id = None 53 | epoch_id = 0 54 | while True: 55 | if self.step % self.accum_steps == 0: 56 | task_id = torch.multinomial(self.sampling_ratios, 1) 57 | if self.distributed: 58 | # make sure all process is training same task 59 | dist.broadcast(task_id, 0) 60 | self.step += 1 61 | task = self.names[task_id.cpu().item()] 62 | iter_ = self.name2iter[task] 63 | try: 64 | batch = next(iter_) 65 | except StopIteration: 66 | epoch_id += 1 67 | # In distributed mode, calling the set_epoch() method at the beginning of each epoch 68 | # before creating the DataLoader iterator is necessary to make shuffling work properly 69 | # across multiple epochs. Otherwise, the same ordering will be always used. 
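The tuple form registered in `MetaLoader.__init__` is `(loader, sampling_ratio, pre_epoch_fn)`. A minimal usage sketch, assuming `mlm_loader`/`sap_loader` with their `pre_epoch` callbacks were produced by `build_dataloader` below, `num_train_steps` comes from the training args, and a GPU is available:

```python
meta_loader = MetaLoader(
    {'mlm': (mlm_loader, 2, mlm_pre_epoch),   # sampled ~twice as often as 'sap'
     'sap': (sap_loader, 1, sap_pre_epoch)},
    accum_steps=1, distributed=False, device=torch.device('cuda'),
)
for step, (task, batch) in enumerate(meta_loader):
    if step >= num_train_steps:  # the iterator never stops on its own
        break
    # dispatch on `task` ('mlm' or 'sap') and run one training step on `batch`
```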
70 | self.name2pre_epoch[task](epoch_id) 71 | iter_ = iter(self.name2loader[task]) 72 | batch = next(iter_) 73 | self.name2iter[task] = iter_ 74 | 75 | yield task, batch 76 | 77 | 78 | def move_to_cuda(batch: Union[List, Tuple, Dict, torch.Tensor], device: torch.device): 79 | if isinstance(batch, torch.Tensor): 80 | return batch.to(device, non_blocking=True) 81 | elif isinstance(batch, list): 82 | return [move_to_cuda(t, device) for t in batch] 83 | elif isinstance(batch, tuple): 84 | return tuple(move_to_cuda(t, device) for t in batch) 85 | elif isinstance(batch, dict): 86 | return {n: move_to_cuda(t, device) for n, t in batch.items()} 87 | return batch 88 | 89 | 90 | class PrefetchLoader(object): 91 | """ 92 | overlap compute and cuda data transfer 93 | """ 94 | def __init__(self, loader, device: torch.device): 95 | self.loader = loader 96 | self.device = device 97 | 98 | def __iter__(self): 99 | loader_it = iter(self.loader) 100 | self.preload(loader_it) 101 | batch = self.next(loader_it) 102 | while batch is not None: 103 | yield batch 104 | batch = self.next(loader_it) 105 | 106 | def __len__(self): 107 | return len(self.loader) 108 | 109 | def preload(self, it): 110 | try: 111 | self.batch = next(it) 112 | except StopIteration: 113 | self.batch = None 114 | return 115 | self.batch = move_to_cuda(self.batch, self.device) 116 | 117 | def next(self, it): 118 | batch = self.batch 119 | self.preload(it) 120 | return batch 121 | 122 | def __getattr__(self, name): 123 | method = self.loader.__getattribute__(name) 124 | return method 125 | 126 | 127 | def build_dataloader(task, dataset, collate_fn, is_train: bool, opts): 128 | 129 | batch_size = opts.train_batch_size if is_train else opts.val_batch_size 130 | # if task == 'itm': batch_size = max(1, batch_size // 2) 131 | 132 | if opts.local_rank == -1: 133 | if is_train: 134 | sampler: Union[ 135 | RandomSampler, SequentialSampler, DistributedSampler 136 | ] = RandomSampler(dataset) 137 | else: 138 | sampler = SequentialSampler(dataset) 139 | 140 | size = torch.cuda.device_count() if torch.cuda.is_available() else 1 141 | pre_epoch = lambda e: None 142 | 143 | # DataParallel: scale the batch size by the number of GPUs 144 | if size > 1: 145 | batch_size *= size 146 | 147 | else: 148 | size = dist.get_world_size() 149 | sampler = DistributedSampler( 150 | dataset, num_replicas=size, rank=dist.get_rank(), shuffle=is_train 151 | ) 152 | pre_epoch = sampler.set_epoch 153 | 154 | loader = DataLoader( 155 | dataset, 156 | sampler=sampler, 157 | batch_size=batch_size, 158 | num_workers=opts.n_workers, 159 | pin_memory=opts.pin_mem, 160 | collate_fn=collate_fn, 161 | drop_last=False, 162 | ) 163 | 164 | return loader, pre_epoch 165 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/VLN-DUET-RVR/pretrain_src/model/__init__.py -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/model/ops.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | from .transformer import TransformerEncoder, TransformerEncoderLayer 4 | 5 | # try: 6 | # from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm 7 | # except (ImportError, AttributeError) as e: 8 | # # logger.info("Better speed can be 
achieved with apex installed from https://www.github.com/nvidia/apex .") 9 | # BertLayerNorm = torch.nn.LayerNorm 10 | BertLayerNorm = torch.nn.LayerNorm 11 | 12 | def create_transformer_encoder(config, num_layers, norm=False): 13 | enc_layer = TransformerEncoderLayer( 14 | config.hidden_size, config.num_attention_heads, 15 | dim_feedforward=config.intermediate_size, 16 | dropout=config.hidden_dropout_prob, 17 | activation=config.hidden_act, 18 | normalize_before=True 19 | ) 20 | if norm: 21 | norm_layer = BertLayerNorm(config.hidden_size, eps=1e-12) 22 | else: 23 | norm_layer = None 24 | return TransformerEncoder(enc_layer, num_layers, norm=norm_layer, batch_first=True) 25 | 26 | def extend_neg_masks(masks, dtype=None): 27 | """ 28 | mask from (N, L) into (N, 1(H), 1(L), L) and make it negative 29 | """ 30 | if dtype is None: 31 | dtype = torch.float 32 | extended_masks = masks.unsqueeze(1).unsqueeze(2) 33 | extended_masks = extended_masks.to(dtype=dtype) 34 | extended_masks = (1.0 - extended_masks) * -10000.0 35 | return extended_masks 36 | 37 | def gen_seq_masks(seq_lens, max_len=None): 38 | if max_len is None: 39 | max_len = max(seq_lens) 40 | batch_size = len(seq_lens) 41 | device = seq_lens.device 42 | 43 | masks = torch.arange(max_len).unsqueeze(0).repeat(batch_size, 1).to(device) 44 | masks = masks < seq_lens.unsqueeze(1) 45 | return masks 46 | 47 | def pad_tensors_wgrad(tensors, lens=None): 48 | """B x [T, ...] torch tensors""" 49 | if lens is None: 50 | lens = [t.size(0) for t in tensors] 51 | max_len = max(lens) 52 | batch_size = len(tensors) 53 | hid = list(tensors[0].size()[1:]) 54 | 55 | device = tensors[0].device 56 | dtype = tensors[0].dtype 57 | 58 | output = [] 59 | for i in range(batch_size): 60 | if lens[i] < max_len: 61 | tmp = torch.cat( 62 | [tensors[i], torch.zeros([max_len-lens[i]]+hid, dtype=dtype).to(device)], 63 | dim=0 64 | ) 65 | else: 66 | tmp = tensors[i] 67 | output.append(tmp) 68 | output = torch.stack(output, 0) 69 | return output 70 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/optim/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) Microsoft Corporation. 3 | Licensed under the MIT license. 4 | 5 | """ 6 | from .sched import noam_schedule, warmup_linear, get_lr_sched 7 | from .adamw import AdamW 8 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/optim/adamw.py: -------------------------------------------------------------------------------- 1 | """ 2 | AdamW optimizer (weight decay fix) 3 | copied from hugginface (https://github.com/huggingface/transformers). 4 | """ 5 | 6 | import math 7 | from typing import Callable, Iterable, Tuple 8 | 9 | import torch 10 | 11 | from torch.optim import Optimizer 12 | 13 | class AdamW(Optimizer): 14 | """ 15 | Implements Adam algorithm with weight decay fix as introduced in `Decoupled Weight Decay Regularization 16 | `__. 17 | 18 | Parameters: 19 | params (:obj:`Iterable[torch.nn.parameter.Parameter]`): 20 | Iterable of parameters to optimize or dictionaries defining parameter groups. 21 | lr (:obj:`float`, `optional`, defaults to 1e-3): 22 | The learning rate to use. 23 | betas (:obj:`Tuple[float,float]`, `optional`, defaults to (0.9, 0.999)): 24 | Adam's betas parameters (b1, b2). 25 | eps (:obj:`float`, `optional`, defaults to 1e-6): 26 | Adam's epsilon for numerical stability. 
27 | weight_decay (:obj:`float`, `optional`, defaults to 0): 28 | Decoupled weight decay to apply. 29 | correct_bias (:obj:`bool`, `optional`, defaults to `True`): 30 | Whether ot not to correct bias in Adam (for instance, in Bert TF repository they use :obj:`False`). 31 | """ 32 | 33 | def __init__( 34 | self, 35 | params: Iterable[torch.nn.parameter.Parameter], 36 | lr: float = 1e-3, 37 | betas: Tuple[float, float] = (0.9, 0.999), 38 | eps: float = 1e-6, 39 | weight_decay: float = 0.0, 40 | correct_bias: bool = True, 41 | ): 42 | if lr < 0.0: 43 | raise ValueError("Invalid learning rate: {} - should be >= 0.0".format(lr)) 44 | if not 0.0 <= betas[0] < 1.0: 45 | raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0[".format(betas[0])) 46 | if not 0.0 <= betas[1] < 1.0: 47 | raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0[".format(betas[1])) 48 | if not 0.0 <= eps: 49 | raise ValueError("Invalid epsilon value: {} - should be >= 0.0".format(eps)) 50 | defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias) 51 | super().__init__(params, defaults) 52 | 53 | def step(self, closure: Callable = None): 54 | """ 55 | Performs a single optimization step. 56 | 57 | Arguments: 58 | closure (:obj:`Callable`, `optional`): A closure that reevaluates the model and returns the loss. 59 | """ 60 | loss = None 61 | if closure is not None: 62 | loss = closure() 63 | 64 | for group in self.param_groups: 65 | for p in group["params"]: 66 | if p.grad is None: 67 | continue 68 | grad = p.grad.data 69 | if grad.is_sparse: 70 | raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead") 71 | 72 | state = self.state[p] 73 | 74 | # State initialization 75 | if len(state) == 0: 76 | state["step"] = 0 77 | # Exponential moving average of gradient values 78 | state["exp_avg"] = torch.zeros_like(p.data) 79 | # Exponential moving average of squared gradient values 80 | state["exp_avg_sq"] = torch.zeros_like(p.data) 81 | 82 | exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"] 83 | beta1, beta2 = group["betas"] 84 | 85 | state["step"] += 1 86 | 87 | # Decay the first and second moment running average coefficient 88 | # In-place operations to update the averages at the same time 89 | exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1) 90 | exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2) 91 | denom = exp_avg_sq.sqrt().add_(group["eps"]) 92 | 93 | step_size = group["lr"] 94 | if group["correct_bias"]: # No bias correction for Bert 95 | bias_correction1 = 1.0 - beta1 ** state["step"] 96 | bias_correction2 = 1.0 - beta2 ** state["step"] 97 | step_size = step_size * math.sqrt(bias_correction2) / bias_correction1 98 | 99 | p.data.addcdiv_(exp_avg, denom, value=-step_size) 100 | 101 | # Just adding the square of the weights to the loss function is *not* 102 | # the correct way of using L2 regularization/weight decay with Adam, 103 | # since that will interact with the m and v parameters in strange ways. 104 | # 105 | # Instead we want to decay the weights in a manner that doesn't interact 106 | # with the m/v parameters. This is equivalent to adding the square 107 | # of the weights to the loss with plain (non-momentum) SGD. 
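A short sketch of how this `AdamW` is typically instantiated, with decoupled weight decay disabled for bias and LayerNorm parameters (the same grouping that `build_optimizer` in `optim/misc.py` applies below); `model` is a hypothetical `torch.nn.Module`:

```python
no_decay = ('bias', 'LayerNorm.bias', 'LayerNorm.weight')
grouped_params = [
    {'params': [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
]
optimizer = AdamW(grouped_params, lr=5e-5, betas=(0.9, 0.98))
```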
108 | # Add weight decay at the end (fixed version) 109 | if group["weight_decay"] > 0.0: 110 | p.data.add_(p.data, alpha=-group["lr"] * group["weight_decay"]) 111 | 112 | return loss 113 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/optim/lookahead.py: -------------------------------------------------------------------------------- 1 | # Lookahead implementation from https://github.com/rwightman/pytorch-image-models/blob/master/timm/optim/lookahead.py 2 | 3 | """ Lookahead Optimizer Wrapper. 4 | Implementation modified from: https://github.com/alphadl/lookahead.pytorch 5 | Paper: `Lookahead Optimizer: k steps forward, 1 step back` - https://arxiv.org/abs/1907.08610 6 | """ 7 | import torch 8 | from torch.optim.optimizer import Optimizer 9 | from torch.optim import Adam 10 | from collections import defaultdict 11 | 12 | class Lookahead(Optimizer): 13 | def __init__(self, base_optimizer, alpha=0.5, k=6): 14 | if not 0.0 <= alpha <= 1.0: 15 | raise ValueError(f'Invalid slow update rate: {alpha}') 16 | if not 1 <= k: 17 | raise ValueError(f'Invalid lookahead steps: {k}') 18 | defaults = dict(lookahead_alpha=alpha, lookahead_k=k, lookahead_step=0) 19 | self.base_optimizer = base_optimizer 20 | self.param_groups = self.base_optimizer.param_groups 21 | self.defaults = base_optimizer.defaults 22 | self.defaults.update(defaults) 23 | self.state = defaultdict(dict) 24 | # manually add our defaults to the param groups 25 | for name, default in defaults.items(): 26 | for group in self.param_groups: 27 | group.setdefault(name, default) 28 | 29 | def update_slow(self, group): 30 | for fast_p in group["params"]: 31 | if fast_p.grad is None: 32 | continue 33 | param_state = self.state[fast_p] 34 | if 'slow_buffer' not in param_state: 35 | param_state['slow_buffer'] = torch.empty_like(fast_p.data) 36 | param_state['slow_buffer'].copy_(fast_p.data) 37 | slow = param_state['slow_buffer'] 38 | slow.add_(fast_p.data - slow, alpha=group['lookahead_alpha']) 39 | fast_p.data.copy_(slow) 40 | 41 | def sync_lookahead(self): 42 | for group in self.param_groups: 43 | self.update_slow(group) 44 | 45 | def step(self, closure=None): 46 | # print(self.k) 47 | #assert id(self.param_groups) == id(self.base_optimizer.param_groups) 48 | loss = self.base_optimizer.step(closure) 49 | for group in self.param_groups: 50 | group['lookahead_step'] += 1 51 | if group['lookahead_step'] % group['lookahead_k'] == 0: 52 | self.update_slow(group) 53 | return loss 54 | 55 | def state_dict(self): 56 | fast_state_dict = self.base_optimizer.state_dict() 57 | slow_state = { 58 | (id(k) if isinstance(k, torch.Tensor) else k): v 59 | for k, v in self.state.items() 60 | } 61 | fast_state = fast_state_dict['state'] 62 | param_groups = fast_state_dict['param_groups'] 63 | return { 64 | 'state': fast_state, 65 | 'slow_state': slow_state, 66 | 'param_groups': param_groups, 67 | } 68 | 69 | def load_state_dict(self, state_dict): 70 | fast_state_dict = { 71 | 'state': state_dict['state'], 72 | 'param_groups': state_dict['param_groups'], 73 | } 74 | self.base_optimizer.load_state_dict(fast_state_dict) 75 | 76 | # We want to restore the slow state, but share param_groups reference 77 | # with base_optimizer. 
This is a bit redundant but least code 78 | slow_state_new = False 79 | if 'slow_state' not in state_dict: 80 | print('Loading state_dict from optimizer without Lookahead applied.') 81 | state_dict['slow_state'] = defaultdict(dict) 82 | slow_state_new = True 83 | slow_state_dict = { 84 | 'state': state_dict['slow_state'], 85 | 'param_groups': state_dict['param_groups'], # this is pointless but saves code 86 | } 87 | super(Lookahead, self).load_state_dict(slow_state_dict) 88 | self.param_groups = self.base_optimizer.param_groups # make both ref same container 89 | if slow_state_new: 90 | # reapply defaults to catch missing lookahead specific ones 91 | for name, default in self.defaults.items(): 92 | for group in self.param_groups: 93 | group.setdefault(name, default) 94 | 95 | def LookaheadAdam(params, alpha=0.5, k=6, *args, **kwargs): 96 | adam = Adam(params, *args, **kwargs) 97 | return Lookahead(adam, alpha, k) 98 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/optim/misc.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) Microsoft Corporation. 3 | Licensed under the MIT license. 4 | 5 | Misc lr helper 6 | """ 7 | from torch.optim import Adam, Adamax 8 | 9 | from .adamw import AdamW 10 | from .rangerlars import RangerLars 11 | 12 | def build_optimizer(model, opts): 13 | param_optimizer = list(model.named_parameters()) 14 | no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] 15 | optimizer_grouped_parameters = [ 16 | {'params': [p for n, p in param_optimizer 17 | if not any(nd in n for nd in no_decay)], 18 | 'weight_decay': opts.weight_decay}, 19 | {'params': [p for n, p in param_optimizer 20 | if any(nd in n for nd in no_decay)], 21 | 'weight_decay': 0.0} 22 | ] 23 | 24 | # currently Adam only 25 | if opts.optim == 'adam': 26 | OptimCls = Adam 27 | elif opts.optim == 'adamax': 28 | OptimCls = Adamax 29 | elif opts.optim == 'adamw': 30 | OptimCls = AdamW 31 | elif opts.optim == 'rangerlars': 32 | OptimCls = RangerLars 33 | else: 34 | raise ValueError('invalid optimizer') 35 | optimizer = OptimCls(optimizer_grouped_parameters, 36 | lr=opts.learning_rate, betas=opts.betas) 37 | return optimizer 38 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/optim/ralamb.py: -------------------------------------------------------------------------------- 1 | import torch, math 2 | from torch.optim.optimizer import Optimizer 3 | 4 | # RAdam + LARS 5 | class Ralamb(Optimizer): 6 | 7 | def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0): 8 | defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay) 9 | self.buffer = [[None, None, None] for ind in range(10)] 10 | super(Ralamb, self).__init__(params, defaults) 11 | 12 | def __setstate__(self, state): 13 | super(Ralamb, self).__setstate__(state) 14 | 15 | def step(self, closure=None): 16 | 17 | loss = None 18 | if closure is not None: 19 | loss = closure() 20 | 21 | for group in self.param_groups: 22 | 23 | for p in group['params']: 24 | if p.grad is None: 25 | continue 26 | grad = p.grad.data.float() 27 | if grad.is_sparse: 28 | raise RuntimeError('Ralamb does not support sparse gradients') 29 | 30 | p_data_fp32 = p.data.float() 31 | 32 | state = self.state[p] 33 | 34 | if len(state) == 0: 35 | state['step'] = 0 36 | state['exp_avg'] = torch.zeros_like(p_data_fp32) 37 | state['exp_avg_sq'] = torch.zeros_like(p_data_fp32) 38 | 
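A minimal sketch of the wrapping that the `LookaheadAdam` factory above performs, and of when to call `sync_lookahead`; `model`, `loader`, and `compute_loss` are hypothetical:

```python
base = Adam(model.parameters(), lr=1e-3)
optimizer = Lookahead(base, alpha=0.5, k=6)
for batch in loader:
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()        # steps the base optimizer; blends slow weights every k-th call
    optimizer.zero_grad()
optimizer.sync_lookahead()  # copy the slow weights into the model before evaluation
```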
else: 39 | state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32) 40 | state['exp_avg_sq'] = state['exp_avg_sq'].type_as(p_data_fp32) 41 | 42 | exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] 43 | beta1, beta2 = group['betas'] 44 | 45 | # Decay the first and second moment running average coefficient 46 | # m_t 47 | exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) 48 | # v_t 49 | exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2) 50 | 51 | state['step'] += 1 52 | buffered = self.buffer[int(state['step'] % 10)] 53 | 54 | if state['step'] == buffered[0]: 55 | N_sma, radam_step_size = buffered[1], buffered[2] 56 | else: 57 | buffered[0] = state['step'] 58 | beta2_t = beta2 ** state['step'] 59 | N_sma_max = 2 / (1 - beta2) - 1 60 | N_sma = N_sma_max - 2 * state['step'] * beta2_t / (1 - beta2_t) 61 | buffered[1] = N_sma 62 | 63 | # more conservative since it's an approximated value 64 | if N_sma >= 5: 65 | radam_step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step']) 66 | else: 67 | radam_step_size = 1.0 / (1 - beta1 ** state['step']) 68 | buffered[2] = radam_step_size 69 | 70 | if group['weight_decay'] != 0: 71 | p_data_fp32.add_(p_data_fp32, alpha=-group['weight_decay'] * group['lr']) 72 | 73 | # more conservative since it's an approximated value 74 | radam_step = p_data_fp32.clone() 75 | if N_sma >= 5: 76 | denom = exp_avg_sq.sqrt().add_(group['eps']) 77 | radam_step.addcdiv_(exp_avg, denom, value=-radam_step_size * group['lr']) 78 | else: 79 | radam_step.add_(exp_avg, alpha=-radam_step_size * group['lr']) 80 | 81 | radam_norm = radam_step.pow(2).sum().sqrt() 82 | weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10) 83 | if weight_norm == 0 or radam_norm == 0: 84 | trust_ratio = 1 85 | else: 86 | trust_ratio = weight_norm / radam_norm 87 | 88 | state['weight_norm'] = weight_norm 89 | state['adam_norm'] = radam_norm 90 | state['trust_ratio'] = trust_ratio 91 | 92 | if N_sma >= 5: 93 | p_data_fp32.addcdiv_(exp_avg, denom, value=-radam_step_size * group['lr'] * trust_ratio) 94 | else: 95 | p_data_fp32.add_(exp_avg, alpha=-radam_step_size * group['lr'] * trust_ratio) 96 | 97 | p.data.copy_(p_data_fp32) 98 | 99 | return loss 100 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/optim/rangerlars.py: -------------------------------------------------------------------------------- 1 | import torch, math 2 | from torch.optim.optimizer import Optimizer 3 | import itertools as it 4 | from .lookahead import * 5 | from .ralamb import * 6 | 7 | # RAdam + LARS + LookAHead 8 | 9 | # Lookahead implementation from https://github.com/lonePatient/lookahead_pytorch/blob/master/optimizer.py 10 | # RAdam + LARS implementation from https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20 11 | 12 | def RangerLars(params, alpha=0.5, k=6, *args, **kwargs): 13 | ralamb = Ralamb(params, *args, **kwargs) 14 | return Lookahead(ralamb, alpha, k) 15 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/optim/sched.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) Microsoft Corporation. 3 | Licensed under the MIT license. 
4 | 5 | optimizer learning rate scheduling helpers 6 | """ 7 | from math import ceil 8 | 9 | 10 | def noam_schedule(step, warmup_step=4000): 11 | """ original Transformer schedule""" 12 | if step <= warmup_step: 13 | return step / warmup_step 14 | return (warmup_step ** 0.5) * (step ** -0.5) 15 | 16 | 17 | def warmup_linear(step, warmup_step, tot_step): 18 | """ BERT schedule """ 19 | if step < warmup_step: 20 | return step / warmup_step 21 | return max(0, (tot_step-step)/(tot_step-warmup_step)) 22 | 23 | 24 | def get_lr_sched(global_step, opts): 25 | # learning rate scheduling 26 | lr_this_step = opts.learning_rate * warmup_linear( 27 | global_step, opts.warmup_steps, opts.num_train_steps) 28 | if lr_this_step <= 0: 29 | lr_this_step = 1e-8 30 | return lr_this_step 31 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/parser.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | import json 4 | 5 | 6 | def load_parser(): 7 | parser = argparse.ArgumentParser() 8 | 9 | # Required parameters 10 | # NOTE: train tasks and val tasks cannot take command line arguments 11 | parser.add_argument('--vlnbert', choices=['cmt', 'mmt', 'causal.cmt', 'cmt3']) 12 | parser.add_argument( 13 | "--model_config", type=str, help="path to model structure config json" 14 | ) 15 | parser.add_argument( 16 | "--checkpoint", default=None, type=str, help="path to model checkpoint (*.pt)" 17 | ) 18 | 19 | parser.add_argument( 20 | "--output_dir", 21 | default=None, 22 | type=str, 23 | help="The output directory where the model checkpoints will be written.", 24 | ) 25 | 26 | # training parameters 27 | parser.add_argument( 28 | "--train_batch_size", 29 | default=4096, 30 | type=int, 31 | help="Total batch size for training. ", 32 | ) 33 | parser.add_argument( 34 | "--val_batch_size", 35 | default=4096, 36 | type=int, 37 | help="Total batch size for validation. 
", 38 | ) 39 | parser.add_argument( 40 | "--gradient_accumulation_steps", 41 | type=int, 42 | default=16, 43 | help="Number of updates steps to accumualte before " 44 | "performing a backward/update pass.", 45 | ) 46 | parser.add_argument( 47 | "--learning_rate", 48 | default=3e-5, 49 | type=float, 50 | help="The initial learning rate for Adam.", 51 | ) 52 | parser.add_argument( 53 | "--valid_steps", default=1000, type=int, help="Run validation every X steps" 54 | ) 55 | parser.add_argument("--log_steps", default=1000, type=int) 56 | parser.add_argument( 57 | "--num_train_steps", 58 | default=100000, 59 | type=int, 60 | help="Total number of training updates to perform.", 61 | ) 62 | parser.add_argument( 63 | "--optim", 64 | default="adamw", 65 | choices=["adam", "adamax", "adamw"], 66 | help="optimizer", 67 | ) 68 | parser.add_argument( 69 | "--betas", default=[0.9, 0.98], nargs="+", help="beta for adam optimizer" 70 | ) 71 | parser.add_argument( 72 | "--dropout", default=0.1, type=float, help="tune dropout regularization" 73 | ) 74 | parser.add_argument( 75 | "--weight_decay", 76 | default=0.01, 77 | type=float, 78 | help="weight decay (L2) regularization", 79 | ) 80 | parser.add_argument( 81 | "--grad_norm", 82 | default=2.0, 83 | type=float, 84 | help="gradient clipping (-1 for no clipping)", 85 | ) 86 | parser.add_argument( 87 | "--warmup_steps", 88 | default=10000, 89 | type=int, 90 | help="Number of training steps to perform linear " "learning rate warmup for.", 91 | ) 92 | 93 | # device parameters 94 | parser.add_argument( 95 | "--seed", type=int, default=0, help="random seed for initialization" 96 | ) 97 | parser.add_argument( 98 | "--fp16", 99 | action="store_true", 100 | help="Whether to use 16-bit float precision instead of 32-bit", 101 | ) 102 | parser.add_argument( 103 | "--n_workers", type=int, default=4, help="number of data workers" 104 | ) 105 | parser.add_argument("--pin_mem", action="store_true", help="pin memory") 106 | 107 | # distributed computing 108 | parser.add_argument( 109 | "--local_rank", 110 | type=int, 111 | default=-1, 112 | help="local rank for distributed training on gpus", 113 | ) 114 | parser.add_argument( 115 | "--node_rank", 116 | type=int, 117 | default=0, 118 | help="Id of the node", 119 | ) 120 | parser.add_argument( 121 | "--world_size", 122 | type=int, 123 | default=1, 124 | help="Number of GPUs across all nodes", 125 | ) 126 | 127 | # can use config files 128 | parser.add_argument("--config", required=True, help="JSON config files") 129 | 130 | return parser 131 | 132 | 133 | def parse_with_config(parser): 134 | args = parser.parse_args() 135 | if args.config is not None: 136 | config_args = json.load(open(args.config)) 137 | override_keys = { 138 | arg[2:].split("=")[0] for arg in sys.argv[1:] if arg.startswith("--") 139 | } 140 | for k, v in config_args.items(): 141 | if k not in override_keys: 142 | setattr(args, k, v) 143 | del args.config 144 | return args 145 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/submit_reverie.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --gres=gpu:1 4 | #SBATCH --qos=gamma_access 5 | #SBATCH -p gamma-gpu 6 | #SBATCH --wrap='hostname' 7 | #SBATCH --mem=64G 8 | #SBATCH --time=2-0:00:00 9 | 10 | # module add cuda/11.2 11 | # module add gcc/9.1.0 12 | # module add nccl/1.3.3 13 | 14 | # CUDA_VISIBLE_DEVICES=$1 15 | # python train_hm3d_reverie.py --world_size 4 16 | 17 | export 
NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1; 18 | CUDA_VISIBLE_DEVICES=0 python -u train_hm3d_reverie.py \ 19 | --vlnbert cmt \ 20 | --model_config configs/model_config.json \ 21 | --config configs/training_args.json \ 22 | --output_dir ../datasets/REVERIE/expr_duet/pretrain/hm3d_rvr 23 | # --model_config config/hm3d_reverie_obj_model_config.json \ 24 | # --config config/hm3d_reverie_obj_pretrain_text.clip.json \ 25 | # --output_dir ../datasets/REVERIE/expr_duet/pretrain/agent7_text.clip.fix_nomlm -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/test/test_dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | 4 | from data.dataset import ReverieTextPathData 5 | 6 | if __name__ == '__main__': 7 | # test 8 | data_dir = '/sequoia/data3/shichen/datasets' 9 | train_traj_files = [os.path.join(data_dir, "REVERIE/annotations/pretrain/REVERIE_train_enc.jsonl")] 10 | connectivity_dir = os.path.join(data_dir, "R2R/connectivity") 11 | img_ft_file = os.path.join(data_dir, "R2R/features/pth_vit_base_patch16_224_imagenet.hdf5") 12 | obj_ft_file = os.path.join(data_dir, "REVERIE/features/obj.avg.top3.min80_vit_base_patch16_224_imagenet.hdf5") 13 | scanvp_cands_file = os.path.join(data_dir, "R2R/annotations/scanvp_candview_relangles.json") 14 | 15 | train_nav_db = ReverieTextPathData( 16 | train_traj_files, img_ft_file, obj_ft_file, 17 | scanvp_cands_file, connectivity_dir, 18 | image_prob_size=1000, image_feat_size=768, angle_feat_size=4, 19 | obj_feat_size=768, obj_prob_size=1000, 20 | max_txt_len=100, max_objects=20, in_memory=True 21 | ) 22 | print(len(train_nav_db)) 23 | 24 | print('\n\npos') 25 | print(train_nav_db.get_input(0, 'pos', return_act_label=True, return_img_probs=True, return_obj_label=True)) 26 | 27 | print('\n\nneg_in_gt_path') 28 | print(train_nav_db.get_input(0, 'neg_in_gt_path', return_act_label=True)) 29 | 30 | print('\n\nneg_others') 31 | print(train_nav_db.get_input(0, 'neg_others', return_act_label=True)) 32 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/test/test_tasks.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | 4 | from data.dataset import ReverieTextPathData 5 | from data.tasks import MlmDataset, mlm_collate 6 | 7 | import torch 8 | from torch.utils.data import DataLoader 9 | from transformers import AutoTokenizer 10 | 11 | if __name__ == '__main__': 12 | # test 13 | data_dir = '/sequoia/data3/shichen/datasets' 14 | train_traj_files = [os.path.join(data_dir, "REVERIE/annotations/pretrain/REVERIE_train_enc.jsonl")] 15 | connectivity_dir = os.path.join(data_dir, "R2R/connectivity") 16 | img_ft_file = os.path.join(data_dir, "R2R/features/pth_vit_base_patch16_224_imagenet.hdf5") 17 | obj_ft_file = os.path.join(data_dir, "REVERIE/features/obj.avg.top3.min80_vit_base_patch16_224_imagenet.hdf5") 18 | scanvp_cands_file = os.path.join(data_dir, "R2R/annotations/scanvp_candview_relangles.json") 19 | 20 | train_nav_db = ReverieTextPathData( 21 | train_traj_files, img_ft_file, obj_ft_file, 22 | scanvp_cands_file, connectivity_dir, 23 | image_prob_size=1000, image_feat_size=768, angle_feat_size=4, 24 | obj_feat_size=768, obj_prob_size=1000, 25 | max_txt_len=100, max_objects=20, in_memory=True 26 | ) 27 | 28 | tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') 29 | dataset = MlmDataset(train_nav_db, tokenizer) 30 | 
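The launch script above passes both `--config` and explicit flags; `parse_with_config` in `parser.py` gives the command line precedence over the JSON file. A small sketch of that behaviour (the flag values are illustrative):

```python
parser = load_parser()
args = parse_with_config(parser)
# e.g. python train_hm3d_reverie.py --config configs/training_args.json --learning_rate 1e-4
# -> args.learning_rate == 1e-4 (CLI override), while args.train_batch_size == 128
#    and every other unspecified option falls back to the values in the JSON config.
```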
31 | loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=mlm_collate) 32 | 33 | for batch in loader: 34 | for key, value in batch.items(): 35 | print(key) 36 | if isinstance(value, torch.Tensor): 37 | print(value.size()) 38 | break -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/test/test_vilmodel.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import json 4 | from easydict import EasyDict 5 | 6 | from data.dataset import ReverieTextPathData 7 | from data.tasks import MlmDataset, mlm_collate 8 | from model.vilmodel import GlocalTextPathCMT 9 | 10 | import torch 11 | from torch.utils.data import DataLoader 12 | from transformers import AutoTokenizer, PretrainedConfig 13 | 14 | if __name__ == '__main__': 15 | # test 16 | data_dir = '/sequoia/data3/shichen/datasets' 17 | train_traj_files = [os.path.join(data_dir, "REVERIE/annotations/pretrain/REVERIE_train_enc.jsonl")] 18 | connectivity_dir = os.path.join(data_dir, "R2R/connectivity") 19 | img_ft_file = os.path.join(data_dir, "R2R/features/pth_vit_base_patch16_224_imagenet.hdf5") 20 | obj_ft_file = os.path.join(data_dir, "REVERIE/features/obj.avg.top3.min80_vit_base_patch16_224_imagenet.hdf5") 21 | scanvp_cands_file = os.path.join(data_dir, "R2R/annotations/scanvp_candview_relangles.json") 22 | 23 | model_config = PretrainedConfig.from_json_file('config/reverie_obj_model_config.json') 24 | model = GlocalTextPathCMT(model_config).cuda() 25 | 26 | train_nav_db = ReverieTextPathData( 27 | train_traj_files, img_ft_file, obj_ft_file, 28 | scanvp_cands_file, connectivity_dir, 29 | image_prob_size=1000, image_feat_size=768, angle_feat_size=4, 30 | obj_feat_size=768, obj_prob_size=1000, 31 | max_txt_len=100, max_objects=20, in_memory=True 32 | ) 33 | 34 | tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') 35 | dataset = MlmDataset(train_nav_db, tokenizer) 36 | loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=mlm_collate) 37 | 38 | for batch in loader: 39 | for key, value in batch.items(): 40 | print(key) 41 | if isinstance(value, torch.Tensor): 42 | batch[key] = value.cuda() 43 | print('\t', value.size()) 44 | txt_embeds = model.forward_mlm( 45 | batch['txt_ids'], batch['txt_lens'], batch['traj_view_img_fts'], 46 | batch['traj_obj_img_fts'], batch['traj_loc_fts'], batch['traj_nav_types'], 47 | batch['traj_step_lens'], batch['traj_vp_view_lens'], batch['traj_vp_obj_lens'], 48 | batch['traj_vpids'], batch['traj_cand_vpids'], 49 | batch['gmap_lens'], batch['gmap_step_ids'], batch['gmap_pos_fts'], 50 | batch['gmap_pair_dists'], batch['gmap_vpids'], batch['vp_pos_fts'], 51 | ) 52 | print(txt_embeds) 53 | s = txt_embeds.sum() 54 | s.backward() 55 | print(s) 56 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/VLN-DUET-RVR/pretrain_src/utils/__init__.py -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/utils/distributed.py: -------------------------------------------------------------------------------- 1 | """ 2 | Distributed tools 3 | """ 4 | import os 5 | from pathlib import Path 6 | from pprint import pformat 7 | import pickle 8 | 9 | import torch 10 | import 
torch.distributed as dist 11 | 12 | 13 | DEFAULT_PORT = 8738 14 | DEFAULT_PORT_RANGE = 127 15 | # Default address of world rank 0 16 | DEFAULT_MASTER_ADDR = "127.0.0.1" 17 | SLURM_JOBID = os.environ.get("SLURM_JOB_ID", None) 18 | 19 | def parse_ip(s): 20 | s = s.split("-") 21 | s = [y for x in s for y in x.split("[") if y] 22 | s = [y for x in s for y in x.split(",") if y ] 23 | 24 | return ".".join(s[2:6]) 25 | 26 | def load_init_param(opts): 27 | # import pdb;pdb.set_trace() 28 | """ 29 | Load parameters for the rendezvous distributed procedure 30 | """ 31 | # # sync file 32 | # if opts.output_dir != "": 33 | # sync_dir = Path(opts.output_dir).resolve() 34 | # sync_dir.mkdir(parents=True, exist_ok=True) 35 | # sync_file = f"{sync_dir}/.torch_distributed_sync" 36 | # else: 37 | # raise RuntimeError("Can't find any sync dir") 38 | 39 | # # world size 40 | # if opts.world_size != -1: 41 | # world_size = opts.world_size 42 | # elif os.environ.get("WORLD_SIZE", "") != "": 43 | # world_size = int(os.environ["WORLD_SIZE"]) 44 | # else: 45 | # raise RuntimeError("Can't find any world size") 46 | 47 | # # rank 48 | # if os.environ.get("RANK", "") != "": 49 | # # pytorch.distributed.launch provide this variable no matter what 50 | # rank = int(os.environ["RANK"]) 51 | # else: 52 | # # if not provided, calculate the gpu rank 53 | # if opts.node_rank != -1: 54 | # node_rank = opts.node_rank 55 | # elif os.environ.get("NODE_RANK", "") != "": 56 | # node_rank = int(os.environ["NODE_RANK"]) 57 | # else: 58 | # raise RuntimeError("Can't find any rank or node rank") 59 | 60 | # if opts.local_rank != -1: 61 | # local_rank = opts.local_rank 62 | # elif os.environ.get("LOCAL_RANK", "") != "": 63 | # local_rank = int(os.environ["LOCAL_RANK"]) 64 | # else: 65 | # raise RuntimeError("Can't find any rank or local rank") 66 | 67 | # # WARNING: this assumes that each node has the same number of GPUs 68 | # n_gpus = torch.cuda.device_count() 69 | # rank = local_rank + node_rank * n_gpus 70 | # opts.rank = rank 71 | 72 | # Check to see if we should parse from torch.distributed.launch 73 | if os.environ.get("LOCAL_RANK", None) is not None: 74 | local_rank = int(os.environ["LOCAL_RANK"]) 75 | world_rank = int(os.environ["RANK"]) 76 | world_size = int(os.environ["WORLD_SIZE"]) 77 | # Else parse from SLURM is using SLURM 78 | elif SLURM_JOBID is not None: 79 | local_rank = int(os.environ["SLURM_LOCALID"]) 80 | world_rank = int(os.environ["SLURM_PROCID"]) 81 | world_size = int(os.environ["SLURM_NTASKS"]) 82 | # Otherwise setup for just 1 process, this is nice for testing 83 | else: 84 | local_rank = 0 85 | world_rank = 0 86 | world_size = 1 87 | 88 | opts.local_rank = local_rank 89 | opts.rank = world_rank 90 | opts.world_size = world_size 91 | 92 | print("tcp://{}:{}".format( 93 | parse_ip(os.environ['SLURM_STEP_NODELIST']), "9998")) 94 | return { 95 | "backend": "nccl", 96 | "init_method": "tcp://{}:{}".format( 97 | parse_ip(os.environ['SLURM_STEP_NODELIST']), "9998"), 98 | "rank": world_rank, 99 | "world_size": world_size, 100 | } 101 | 102 | 103 | def init_distributed(opts): 104 | init_param = load_init_param(opts) 105 | rank = init_param["rank"] 106 | 107 | print(f"Init distributed {init_param['rank']} - {init_param['world_size']}") 108 | 109 | dist.init_process_group(**init_param) 110 | 111 | 112 | def is_default_gpu(opts) -> bool: 113 | return opts.local_rank == -1 or dist.get_rank() == 0 114 | 115 | 116 | 117 | def is_dist_avail_and_initialized(): 118 | if not dist.is_available(): 119 | return False 120 | if not 
dist.is_initialized(): 121 | return False 122 | return True 123 | 124 | def get_world_size(): 125 | if not is_dist_avail_and_initialized(): 126 | return 1 127 | return dist.get_world_size() 128 | 129 | def all_gather(data): 130 | """ 131 | Run all_gather on arbitrary picklable data (not necessarily tensors) 132 | Args: 133 | data: any picklable object 134 | Returns: 135 | list[data]: list of data gathered from each rank 136 | """ 137 | world_size = get_world_size() 138 | if world_size == 1: 139 | return [data] 140 | 141 | # serialized to a Tensor 142 | buffer = pickle.dumps(data) 143 | storage = torch.ByteStorage.from_buffer(buffer) 144 | tensor = torch.ByteTensor(storage).to("cuda") 145 | 146 | # obtain Tensor size of each rank 147 | local_size = torch.tensor([tensor.numel()], device="cuda") 148 | size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)] 149 | dist.all_gather(size_list, local_size) 150 | size_list = [int(size.item()) for size in size_list] 151 | max_size = max(size_list) 152 | 153 | # receiving Tensor from all ranks 154 | # we pad the tensor because torch all_gather does not support 155 | # gathering tensors of different shapes 156 | tensor_list = [] 157 | for _ in size_list: 158 | tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda")) 159 | if local_size != max_size: 160 | padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda") 161 | tensor = torch.cat((tensor, padding), dim=0) 162 | dist.all_gather(tensor_list, tensor) 163 | 164 | data_list = [] 165 | for size, tensor in zip(size_list, tensor_list): 166 | buffer = tensor.cpu().numpy().tobytes()[:size] 167 | data_list.append(pickle.loads(buffer)) 168 | 169 | return data_list 170 | 171 | 172 | def reduce_dict(input_dict, average=True): 173 | """ 174 | Args: 175 | input_dict (dict): all the values will be reduced 176 | average (bool): whether to do average or sum 177 | Reduce the values in the dictionary from all processes so that all processes 178 | have the averaged results. Returns a dict with the same fields as 179 | input_dict, after reduction. 180 | """ 181 | world_size = get_world_size() 182 | if world_size < 2: 183 | return input_dict 184 | with torch.no_grad(): 185 | names = [] 186 | values = [] 187 | # sort the keys so that they are consistent across processes 188 | for k in sorted(input_dict.keys()): 189 | names.append(k) 190 | values.append(input_dict[k]) 191 | values = torch.stack(values, dim=0) 192 | dist.all_reduce(values) 193 | if average: 194 | values /= world_size 195 | reduced_dict = {k: v for k, v in zip(names, values)} 196 | return reduced_dict 197 | 198 | 199 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/utils/logger.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) Microsoft Corporation. 3 | Licensed under the MIT license. 
4 | 5 | helper for logging 6 | NOTE: loggers are global objects; use with caution 7 | """ 8 | import logging 9 | import math 10 | 11 | import tensorboardX 12 | 13 | 14 | _LOG_FMT = '%(asctime)s - %(levelname)s - %(name)s - %(message)s' 15 | _DATE_FMT = '%m/%d/%Y %H:%M:%S' 16 | logging.basicConfig(format=_LOG_FMT, datefmt=_DATE_FMT, level=logging.INFO) 17 | LOGGER = logging.getLogger('__main__') # this is the global logger 18 | 19 | 20 | def add_log_to_file(log_path): 21 | fh = logging.FileHandler(log_path) 22 | formatter = logging.Formatter(_LOG_FMT, datefmt=_DATE_FMT) 23 | fh.setFormatter(formatter) 24 | LOGGER.addHandler(fh) 25 | 26 | 27 | class TensorboardLogger(object): 28 | def __init__(self): 29 | self._logger = None 30 | self._global_step = 0 31 | 32 | def create(self, path): 33 | self._logger = tensorboardX.SummaryWriter(path) 34 | 35 | def noop(self, *args, **kwargs): 36 | return 37 | 38 | def step(self): 39 | self._global_step += 1 40 | 41 | @property 42 | def global_step(self): 43 | return self._global_step 44 | 45 | def log_scalar_dict(self, log_dict, prefix=''): 46 | """ log a dictionary of scalar values""" 47 | if self._logger is None: 48 | return 49 | if prefix: 50 | prefix = f'{prefix}_' 51 | for name, value in log_dict.items(): 52 | if isinstance(value, dict): 53 | self.log_scalar_dict(value, 54 | prefix=f'{prefix}{name}') 55 | else: 56 | self._logger.add_scalar(f'{prefix}{name}', value, 57 | self._global_step) 58 | 59 | def __getattr__(self, name): 60 | if self._logger is None: 61 | return self.noop 62 | return self._logger.__getattribute__(name) 63 | 64 | 65 | TB_LOGGER = TensorboardLogger() 66 | 67 | 68 | class RunningMeter(object): 69 | """ running meter of a scalar value 70 | (useful for monitoring training loss) 71 | """ 72 | def __init__(self, name, val=None, smooth=0.99): 73 | self._name = name 74 | self._sm = smooth 75 | self._val = val 76 | 77 | def __call__(self, value): 78 | val = (value if self._val is None 79 | else value*(1-self._sm) + self._val*self._sm) 80 | if not math.isnan(val): 81 | self._val = val 82 | 83 | def __str__(self): 84 | return f'{self._name}: {self._val:.4f}' 85 | 86 | @property 87 | def val(self): 88 | if self._val is None: 89 | return 0 90 | return self._val 91 | 92 | @property 93 | def name(self): 94 | return self._name 95 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/utils/misc.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | from typing import Tuple, Union, Dict, Any 4 | 5 | import os 6 | import torch 7 | import torch.distributed as dist 8 | from torch.nn.parallel import DistributedDataParallel as DDP 9 | 10 | from .distributed import init_distributed 11 | from .logger import LOGGER 12 | 13 | 14 | def set_random_seed(seed): 15 | random.seed(seed) 16 | np.random.seed(seed) 17 | torch.manual_seed(seed) 18 | torch.cuda.manual_seed_all(seed) 19 | 20 | def set_dropout(model, drop_p): 21 | for name, module in model.named_modules(): 22 | # we might want to tune dropout for smaller dataset 23 | if isinstance(module, torch.nn.Dropout): 24 | if module.p != drop_p: 25 | module.p = drop_p 26 | LOGGER.info(f'{name} set to {drop_p}') 27 | 28 | 29 | def set_cuda(opts) -> Tuple[bool, int, torch.device]: 30 | """ 31 | Initialize CUDA for distributed computing 32 | """ 33 | if not torch.cuda.is_available(): 34 | assert opts.local_rank == -1, opts.local_rank 35 | return True, 0, 
torch.device("cpu") 36 | 37 | # get device settings 38 | if opts.local_rank != -1: 39 | init_distributed(opts) 40 | torch.cuda.set_device(opts.local_rank) 41 | device = torch.device("cuda", opts.local_rank) 42 | n_gpu = 1 43 | default_gpu = dist.get_rank() == 0 44 | if default_gpu: 45 | LOGGER.info(f"Found {dist.get_world_size()} GPUs") 46 | elif os.environ.get("SLURM_JOBID", None) is not None: 47 | init_distributed(opts) 48 | torch.cuda.set_device(opts.local_rank) 49 | device = torch.device("cuda", opts.local_rank) 50 | n_gpu = 1 51 | default_gpu = dist.get_rank() == 0 52 | if default_gpu: 53 | LOGGER.info(f"Found {dist.get_world_size()} GPUs") 54 | else: 55 | default_gpu = True 56 | device = torch.device("cuda") 57 | n_gpu = torch.cuda.device_count() 58 | 59 | return default_gpu, n_gpu, device 60 | 61 | 62 | def wrap_model( 63 | model: torch.nn.Module, device: torch.device, local_rank: int 64 | ) -> torch.nn.Module: 65 | model.to(device) 66 | 67 | if local_rank != -1: 68 | model = DDP(model, device_ids=[local_rank], find_unused_parameters=True) 69 | # At the time of DDP wrapping, parameters and buffers (i.e., model.state_dict()) 70 | # on rank0 are broadcasted to all other ranks. 71 | elif torch.cuda.device_count() > 1: 72 | LOGGER.info("Using data parallel") 73 | model = torch.nn.DataParallel(model) 74 | 75 | return model 76 | 77 | 78 | class NoOp(object): 79 | """ useful for distributed training No-Ops """ 80 | def __getattr__(self, name): 81 | return self.noop 82 | 83 | def noop(self, *args, **kwargs): 84 | return 85 | 86 | -------------------------------------------------------------------------------- /VLN-DUET-RVR/pretrain_src/utils/save.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) Microsoft Corporation. 3 | Licensed under the MIT license. 
4 | 5 | saving utilities 6 | """ 7 | import json 8 | import os 9 | import torch 10 | 11 | 12 | def save_training_meta(args): 13 | os.makedirs(os.path.join(args.output_dir, 'logs'), exist_ok=True) 14 | os.makedirs(os.path.join(args.output_dir, 'ckpts'), exist_ok=True) 15 | 16 | with open(os.path.join(args.output_dir, 'logs', 'training_args.json'), 'w') as writer: 17 | json.dump(vars(args), writer, indent=4) 18 | model_config = json.load(open(args.model_config)) 19 | with open(os.path.join(args.output_dir, 'logs', 'model_config.json'), 'w') as writer: 20 | json.dump(model_config, writer, indent=4) 21 | 22 | 23 | class ModelSaver(object): 24 | def __init__(self, output_dir, prefix='model_step', suffix='pt'): 25 | self.output_dir = output_dir 26 | self.prefix = prefix 27 | self.suffix = suffix 28 | 29 | def save(self, model, step, optimizer=None): 30 | output_model_file = os.path.join(self.output_dir, 31 | f"{self.prefix}_{step}.{self.suffix}") 32 | state_dict = {} 33 | for k, v in model.state_dict().items(): 34 | if k.startswith('module.'): 35 | k = k[7:] 36 | if isinstance(v, torch.Tensor): 37 | state_dict[k] = v.cpu() 38 | else: 39 | state_dict[k] = v 40 | torch.save(state_dict, output_model_file) 41 | if optimizer is not None: 42 | dump = {'step': step, 'optimizer': optimizer.state_dict()} 43 | if hasattr(optimizer, '_amp_stash'): 44 | pass # TODO fp16 optimizer 45 | torch.save(dump, f'{self.output_dir}/train_state_{step}.pt') 46 | 
-------------------------------------------------------------------------------- /VLN-DUET/datasets/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/VLN-DUET/datasets/.gitkeep -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/models/graph_utils.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import numpy as np 3 | 4 | MAX_DIST = 30 5 | MAX_STEP = 10 6 | 7 | def calc_position_distance(a, b): 8 | # a, b: (x, y, z) 9 | dx = b[0] - a[0] 10 | dy = b[1] - a[1] 11 | dz = b[2] - a[2] 12 | dist = np.sqrt(dx**2 + dy**2 + dz**2) 13 | return dist 14 | 15 | def calculate_vp_rel_pos_fts(a, b, base_heading=0, base_elevation=0): 16 | # a, b: (x, y, z) 17 | dx = b[0] - a[0] 18 | dy = b[1] - a[1] 19 | dz = b[2] - a[2] 20 | xy_dist = max(np.sqrt(dx**2 + dy**2), 1e-8) 21 | xyz_dist = max(np.sqrt(dx**2 + dy**2 + dz**2), 1e-8) 22 | 23 | # the simulator's API is weird (the x-y axes are transposed) 24 | heading = np.arcsin(dx/xy_dist) # [-pi/2, pi/2] 25 | if b[1] < a[1]: 26 | heading = np.pi - heading 27 | heading -= base_heading 28 | 29 | elevation = np.arcsin(dz/xyz_dist) # [-pi/2, pi/2] 30 | elevation -= base_elevation 31 | 32 | return heading, elevation, xyz_dist 33 | 34 | def get_angle_fts(headings, elevations, angle_feat_size): 35 | ang_fts = [np.sin(headings), np.cos(headings), np.sin(elevations), np.cos(elevations)] 36 | ang_fts = np.vstack(ang_fts).transpose().astype(np.float32) 37 | num_repeats = angle_feat_size // 4 38 | if num_repeats > 1: 39 | ang_fts = np.concatenate([ang_fts] * num_repeats, 1) 40 | return ang_fts 41 | 42 | 43 | class FloydGraph(object): 44 | def __init__(self): 45 | self._dis = defaultdict(lambda :defaultdict(lambda: 95959595))  # sentinel "infinite" distance for unconnected pairs 46 | self._point = defaultdict(lambda :defaultdict(lambda: "")) 47 | self._visited = set() 48 | 49 | def distance(self, x, y): 50 | if x == y: 51 | return 0 52 | 
else: 53 | return self._dis[x][y] 54 | 55 | def add_edge(self, x, y, dis): 56 | if dis < self._dis[x][y]: 57 | self._dis[x][y] = dis 58 | self._dis[y][x] = dis 59 | self._point[x][y] = "" 60 | self._point[y][x] = "" 61 | 62 | def update(self, k): 63 | for x in self._dis: 64 | for y in self._dis: 65 | if x != y: 66 | if self._dis[x][k] + self._dis[k][y] < self._dis[x][y]: 67 | self._dis[x][y] = self._dis[x][k] + self._dis[k][y] 68 | self._dis[y][x] = self._dis[x][y] 69 | self._point[x][y] = k 70 | self._point[y][x] = k 71 | self._visited.add(k) 72 | 73 | def visited(self, k): 74 | return (k in self._visited) 75 | 76 | def path(self, x, y): 77 | """ 78 | :param x: start 79 | :param y: end 80 | :return: the path from x to y [v1, v2, ..., v_n, y] 81 | """ 82 | if x == y: 83 | return [] 84 | if self._point[x][y] == "": # Direct edge 85 | return [y] 86 | else: 87 | k = self._point[x][y] 88 | # print(x, y, k) 89 | # for x1 in (x, k, y): 90 | # for x2 in (x, k, y): 91 | # print(x1, x2, "%.4f" % self._dis[x1][x2]) 92 | return self.path(x, k) + self.path(k, y) 93 | 94 | 95 | class GraphMap(object): 96 | def __init__(self, start_vp): 97 | self.start_vp = start_vp # start viewpoint 98 | 99 | self.node_positions = {} # viewpoint to position (x, y, z) 100 | self.graph = FloydGraph() # shortest path graph 101 | self.node_embeds = {} # {viewpoint: feature (sum feature, count)} 102 | self.node_stop_scores = {} # {viewpoint: prob} 103 | self.node_nav_scores = {} # {viewpoint: {t: prob}} 104 | self.node_step_ids = {} 105 | 106 | def update_graph(self, ob): 107 | self.node_positions[ob['viewpoint']] = ob['position'] 108 | for cc in ob['candidate']: 109 | self.node_positions[cc['viewpointId']] = cc['position'] 110 | dist = calc_position_distance(ob['position'], cc['position']) 111 | self.graph.add_edge(ob['viewpoint'], cc['viewpointId'], dist) 112 | self.graph.update(ob['viewpoint']) 113 | 114 | def update_node_embed(self, vp, embed, rewrite=False): 115 | if rewrite: 116 | self.node_embeds[vp] = [embed, 1] 117 | else: 118 | if vp in self.node_embeds: 119 | self.node_embeds[vp][0] = self.node_embeds[vp][0] + embed 120 | self.node_embeds[vp][1] = self.node_embeds[vp][1] + 1 121 | else: 122 | self.node_embeds[vp] = [embed, 1] 123 | 124 | def get_node_embed(self, vp): 125 | return self.node_embeds[vp][0] / self.node_embeds[vp][1] 126 | 127 | def get_pos_fts(self, cur_vp, gmap_vpids, cur_heading, cur_elevation, angle_feat_size=4): 128 | # dim=7 (sin(heading), cos(heading), sin(elevation), cos(elevation), 129 | # line_dist, shortest_dist, shortest_step) 130 | rel_angles, rel_dists = [], [] 131 | for vp in gmap_vpids: 132 | if vp is None: 133 | rel_angles.append([0, 0]) 134 | rel_dists.append([0, 0, 0]) 135 | else: 136 | rel_heading, rel_elevation, rel_dist = calculate_vp_rel_pos_fts( 137 | self.node_positions[cur_vp], self.node_positions[vp], 138 | base_heading=cur_heading, base_elevation=cur_elevation, 139 | ) 140 | rel_angles.append([rel_heading, rel_elevation]) 141 | rel_dists.append( 142 | [rel_dist / MAX_DIST, self.graph.distance(cur_vp, vp) / MAX_DIST, \ 143 | len(self.graph.path(cur_vp, vp)) / MAX_STEP] 144 | ) 145 | rel_angles = np.array(rel_angles).astype(np.float32) 146 | rel_dists = np.array(rel_dists).astype(np.float32) 147 | rel_ang_fts = get_angle_fts(rel_angles[:, 0], rel_angles[:, 1], angle_feat_size) 148 | return np.concatenate([rel_ang_fts, rel_dists], 1) 149 | 150 | def save_to_json(self): 151 | nodes = {} 152 | for vp, pos in self.node_positions.items(): 153 | nodes[vp] = { 154 | 'location': pos, 
# (x, y, z) 155 | 'visited': self.graph.visited(vp), 156 | } 157 | if nodes[vp]['visited']: 158 | nodes[vp]['stop_prob'] = self.node_stop_scores[vp]['stop'] 159 | nodes[vp]['og_objid'] = self.node_stop_scores[vp]['og'] 160 | else: 161 | nodes[vp]['nav_prob'] = self.node_nav_scores[vp] 162 | 163 | edges = [] 164 | for k, v in self.graph._dis.items(): 165 | for kk in v.keys(): 166 | edges.append((k, kk)) 167 | 168 | return {'nodes': nodes, 'edges': edges} 169 | 170 | 171 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/models/model.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import collections 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | 8 | from transformers import BertPreTrainedModel 9 | 10 | from .vlnbert_init import get_vlnbert_models 11 | 12 | class VLNBert(nn.Module): 13 | def __init__(self, args): 14 | super().__init__() 15 | print('\nInitializing the VLN-BERT model ...') 16 | self.args = args 17 | 18 | self.vln_bert = get_vlnbert_models(args, config=None)  # initialize the VLN-BERT 19 | self.drop_env = nn.Dropout(p=args.feat_dropout) 20 | 21 | def forward(self, mode, batch): 22 | batch = collections.defaultdict(lambda: None, batch) 23 | 24 | if mode == 'language': 25 | txt_embeds = self.vln_bert(mode, batch) 26 | return txt_embeds 27 | 28 | elif mode == 'panorama': 29 | batch['view_img_fts'] = self.drop_env(batch['view_img_fts']) 30 | if 'obj_img_fts' in batch: 31 | batch['obj_img_fts'] = self.drop_env(batch['obj_img_fts']) 32 | pano_embeds, pano_masks = self.vln_bert(mode, batch) 33 | return pano_embeds, pano_masks 34 | 35 | elif mode == 'navigation': 36 | outs = self.vln_bert(mode, batch) 37 | return outs 38 | 39 | else: 40 | raise NotImplementedError('wrong mode: %s' % mode) 41 | 42 | 43 | class Critic(nn.Module): 44 | def __init__(self, args): 45 | super(Critic, self).__init__() 46 | self.state2value = nn.Sequential( 47 | nn.Linear(768, 512), 48 | nn.ReLU(), 49 | nn.Dropout(args.dropout), 50 | nn.Linear(512, 1), 51 | ) 52 | 53 | def forward(self, state): 54 | return self.state2value(state).squeeze() 55 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/models/ops.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | from .transformer import TransformerEncoder, TransformerEncoderLayer 4 | 5 | # try: 6 | #     from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm 7 | # except (ImportError, AttributeError) as e: 8 | #     # logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .") 9 | #     BertLayerNorm = torch.nn.LayerNorm 10 | BertLayerNorm = torch.nn.LayerNorm 11 | 12 | def create_transformer_encoder(config, num_layers, norm=False): 13 | enc_layer = TransformerEncoderLayer( 14 | config.hidden_size, config.num_attention_heads, 15 | dim_feedforward=config.intermediate_size, 16 | dropout=config.hidden_dropout_prob, 17 | activation=config.hidden_act, 18 | normalize_before=True 19 | ) 20 | if norm: 21 | norm_layer = BertLayerNorm(config.hidden_size, eps=1e-12) 22 | else: 23 | norm_layer = None 24 | return TransformerEncoder(enc_layer, num_layers, norm=norm_layer, batch_first=True) 25 | 26 | def extend_neg_masks(masks, dtype=None): 27 | """ 28 | expand mask from (N, L) to (N, 1(H), 1(L), L) and make it negative (masked positions become -10000, valid ones 0) 29 | """ 30 | if dtype is None: 31 | dtype = 
torch.float 32 | extended_masks = masks.unsqueeze(1).unsqueeze(2) 33 | extended_masks = extended_masks.to(dtype=dtype) 34 | extended_masks = (1.0 - extended_masks) * -10000.0 35 | return extended_masks 36 | 37 | def gen_seq_masks(seq_lens, max_len=None): 38 | if max_len is None: 39 | max_len = max(seq_lens) 40 | batch_size = len(seq_lens) 41 | device = seq_lens.device 42 | 43 | masks = torch.arange(max_len).unsqueeze(0).repeat(batch_size, 1).to(device) 44 | masks = masks < seq_lens.unsqueeze(1) 45 | return masks 46 | 47 | def pad_tensors_wgrad(tensors, lens=None): 48 | """B x [T, ...] torch tensors""" 49 | if lens is None: 50 | lens = [t.size(0) for t in tensors] 51 | max_len = max(lens) 52 | batch_size = len(tensors) 53 | hid = list(tensors[0].size()[1:]) 54 | 55 | device = tensors[0].device 56 | dtype = tensors[0].dtype 57 | 58 | output = [] 59 | for i in range(batch_size): 60 | if lens[i] < max_len: 61 | tmp = torch.cat( 62 | [tensors[i], torch.zeros([max_len-lens[i]]+hid, dtype=dtype).to(device)], 63 | dim=0 64 | ) 65 | else: 66 | tmp = tensors[i] 67 | output.append(tmp) 68 | output = torch.stack(output, 0) 69 | return output 70 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/models/vlnbert_init.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | def get_tokenizer(args): 5 | from transformers import AutoTokenizer 6 | if args.tokenizer == 'xlm': 7 | cfg_name = 'xlm-roberta-base' 8 | else: 9 | cfg_name = 'bert-base-uncased' 10 | tokenizer = AutoTokenizer.from_pretrained(cfg_name) 11 | return tokenizer 12 | 13 | def get_vlnbert_models(args, config=None): 14 | 15 | from transformers import PretrainedConfig 16 | from models.vilmodel import GlocalTextPathNavCMT 17 | 18 | model_name_or_path = args.bert_ckpt_file 19 | new_ckpt_weights = {} 20 | if model_name_or_path is not None: 21 | ckpt_weights = torch.load(model_name_or_path) 22 | for k, v in ckpt_weights.items(): 23 | if k.startswith('module'): 24 | k = k[7:] 25 | if '_head' in k or 'sap_fuse' in k: 26 | new_ckpt_weights['bert.' 
+ k] = v 27 | else: 28 | new_ckpt_weights[k] = v 29 | 30 | if args.tokenizer == 'xlm': 31 | cfg_name = 'xlm-roberta-base' 32 | else: 33 | cfg_name = 'bert-base-uncased' 34 | vis_config = PretrainedConfig.from_pretrained(cfg_name) 35 | 36 | if args.tokenizer == 'xlm': 37 | vis_config.type_vocab_size = 2 38 | 39 | vis_config.max_action_steps = 100 40 | vis_config.image_feat_size = args.image_feat_size 41 | vis_config.angle_feat_size = args.angle_feat_size 42 | vis_config.obj_feat_size = args.obj_feat_size 43 | vis_config.obj_loc_size = 3 44 | vis_config.num_l_layers = args.num_l_layers 45 | vis_config.num_pano_layers = args.num_pano_layers 46 | vis_config.num_x_layers = args.num_x_layers 47 | vis_config.graph_sprels = args.graph_sprels 48 | vis_config.glocal_fuse = args.fusion == 'dynamic' 49 | 50 | vis_config.fix_lang_embedding = args.fix_lang_embedding 51 | vis_config.fix_pano_embedding = args.fix_pano_embedding 52 | vis_config.fix_local_branch = args.fix_local_branch 53 | 54 | vis_config.update_lang_bert = not args.fix_lang_embedding 55 | vis_config.output_attentions = True 56 | vis_config.pred_head_dropout_prob = 0.1 57 | vis_config.use_lang2visn_attn = False 58 | 59 | visual_model = GlocalTextPathNavCMT.from_pretrained( 60 | pretrained_model_name_or_path=None, 61 | config=vis_config, 62 | state_dict=new_ckpt_weights) 63 | 64 | return visual_model 65 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/r2r/data_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import numpy as np 4 | 5 | def load_instr_datasets(anno_dir, dataset, splits, tokenizer, is_test=True): 6 | data = [] 7 | for split in splits: 8 | if "/" not in split:    # the official splits 9 | if tokenizer == 'bert': 10 | filepath = os.path.join(anno_dir, '%s_%s_enc.json' % (dataset.upper(), split)) 11 | elif tokenizer == 'xlm': 12 | filepath = os.path.join(anno_dir, '%s_%s_enc_xlmr.json' % (dataset.upper(), split)) 13 | else: 14 | raise NotImplementedError('unsupported tokenizer %s' % tokenizer) 15 | 16 | with open(filepath) as f: 17 | new_data = json.load(f) 18 | 19 | if split == 'val_train_seen': 20 | new_data = new_data[:50] 21 | 22 | if not is_test: 23 | if dataset == 'r4r' and split == 'val_unseen': 24 | ridxs = np.random.permutation(len(new_data))[:200] 25 | new_data = [new_data[ridx] for ridx in ridxs] 26 | else:   # augmented data 27 | print('\nLoading augmented data %s for pretraining...' 
% os.path.basename(split)) 28 | with open(split) as f: 29 | new_data = json.load(f) 30 | # Join 31 | data += new_data 32 | return data 33 | 34 | def construct_instrs(anno_dir, dataset, splits, tokenizer, max_instr_len=512, is_test=True): 35 | data = [] 36 | for i, item in enumerate(load_instr_datasets(anno_dir, dataset, splits, tokenizer, is_test=is_test)): 37 | # Split multiple instructions into separate entries 38 | for j, instr in enumerate(item['instructions']): 39 | new_item = dict(item) 40 | new_item['instr_id'] = '%s_%d' % (item['path_id'], j) 41 | new_item['instruction'] = instr 42 | new_item['instr_encoding'] = item['instr_encodings'][j][:max_instr_len] 43 | del new_item['instructions'] 44 | del new_item['instr_encodings'] 45 | data.append(new_item) 46 | return data -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/r2r/eval_utils.py: -------------------------------------------------------------------------------- 1 | ''' Utils for evaluation ''' 2 | 3 | import numpy as np 4 | 5 | 6 | def cal_dtw(shortest_distances, prediction, reference, success=None, threshold=3.0): 7 | dtw_matrix = np.inf * np.ones((len(prediction) + 1, len(reference) + 1)) 8 | dtw_matrix[0][0] = 0 9 | for i in range(1, len(prediction)+1): 10 | for j in range(1, len(reference)+1): 11 | best_previous_cost = min( 12 | dtw_matrix[i-1][j], dtw_matrix[i][j-1], dtw_matrix[i-1][j-1]) 13 | cost = shortest_distances[prediction[i-1]][reference[j-1]] 14 | dtw_matrix[i][j] = cost + best_previous_cost 15 | 16 | dtw = dtw_matrix[len(prediction)][len(reference)] 17 | ndtw = np.exp(-dtw/(threshold * len(reference))) 18 | if success is None: 19 | success = float(shortest_distances[prediction[-1]][reference[-1]] < threshold) 20 | sdtw = success * ndtw 21 | 22 | return { 23 | 'DTW': dtw, 24 | 'nDTW': ndtw, 25 | 'SDTW': sdtw 26 | } 27 | 28 | def cal_cls(shortest_distances, prediction, reference, threshold=3.0): 29 | def length(nodes): 30 | return np.sum([ 31 | shortest_distances[a][b] 32 | for a, b in zip(nodes[:-1], nodes[1:]) 33 | ]) 34 | 35 | coverage = np.mean([ 36 | np.exp(-np.min([ # pylint: disable=g-complex-comprehension 37 | shortest_distances[u][v] for v in prediction 38 | ]) / threshold) for u in reference 39 | ]) 40 | expected = coverage * length(reference) 41 | score = expected / (expected + np.abs(expected - length(prediction))) 42 | return coverage * score 43 | 44 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/r2r/parser.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from ast import arg 3 | import os 4 | 5 | 6 | def parse_args(): 7 | parser = argparse.ArgumentParser(description="") 8 | 9 | parser.add_argument('--root_dir', type=str, default='../datasets') 10 | parser.add_argument('--dataset', type=str, default='r2r', choices=['r2r', 'r4r']) 11 | parser.add_argument('--output_dir', type=str, default='default', help='experiment id') 12 | parser.add_argument('--seed', type=int, default=0) 13 | 14 | parser.add_argument('--tokenizer', choices=['bert', 'xlm'], default='bert') 15 | 16 | parser.add_argument('--act_visited_nodes', action='store_true', default=False) 17 | parser.add_argument('--fusion', choices=['global', 'local', 'avg', 'dynamic']) 18 | parser.add_argument('--expl_sample', action='store_true', default=False) 19 | parser.add_argument('--expl_max_ratio', type=float, default=0.6) 20 | parser.add_argument('--expert_policy', default='spl', 
choices=['spl', 'ndtw']) 21 | 22 | # distributed training (single-node, multi-GPU) 23 | parser.add_argument('--world_size', type=int, default=1, help='number of gpus') 24 | parser.add_argument('--local_rank', type=int, default=-1) 25 | parser.add_argument("--node_rank", type=int, default=0, help="Id of the node") 26 | 27 | # General 28 | parser.add_argument('--iters', type=int, default=100000, help='training iterations') 29 | parser.add_argument('--log_every', type=int, default=1000) 30 | parser.add_argument('--eval_first', action='store_true', default=False) 31 | 32 | # Data preparation 33 | parser.add_argument('--max_instr_len', type=int, default=80) 34 | parser.add_argument('--max_action_len', type=int, default=15) 35 | parser.add_argument('--batch_size', type=int, default=8) 36 | parser.add_argument('--ignoreid', type=int, default=-100, help='ignoreid for action') 37 | 38 | # Load the model from 39 | parser.add_argument("--resume_file", default=None, help='path of the trained model') 40 | parser.add_argument("--resume_optimizer", action="store_true", default=False) 41 | 42 | 43 | # Augmented Paths from 44 | parser.add_argument("--aug", default=None) 45 | parser.add_argument('--bert_ckpt_file', default=None, help='init vlnbert') 46 | 47 | # Listener Model Config 48 | parser.add_argument("--ml_weight", type=float, default=0.20) 49 | parser.add_argument('--entropy_loss_weight', type=float, default=0.01) 50 | 51 | parser.add_argument("--features", type=str, default='vitbase') 52 | parser.add_argument("--env_aug", action='store_true', default=False) 53 | parser.add_argument("--aug_times", type=int, default=19) 54 | 55 | parser.add_argument('--fix_lang_embedding', action='store_true', default=False) 56 | parser.add_argument('--fix_pano_embedding', action='store_true', default=False) 57 | parser.add_argument('--fix_local_branch', action='store_true', default=False) 58 | 59 | parser.add_argument('--num_l_layers', type=int, default=9) 60 | parser.add_argument('--num_pano_layers', type=int, default=2) 61 | parser.add_argument('--num_x_layers', type=int, default=4) 62 | 63 | parser.add_argument('--enc_full_graph', default=False, action='store_true') 64 | parser.add_argument('--graph_sprels', action='store_true', default=False) 65 | 66 | # Dropout Param 67 | parser.add_argument('--dropout', type=float, default=0.5) 68 | parser.add_argument('--feat_dropout', type=float, default=0.3) 69 | 70 | # Submission configuration 71 | parser.add_argument('--test', action='store_true', default=False) 72 | parser.add_argument('--zero_shot', action='store_true', default=False) 73 | parser.add_argument("--submit", action='store_true', default=False) 74 | parser.add_argument('--no_backtrack', action='store_true', default=False) 75 | parser.add_argument('--detailed_output', action='store_true', default=False) 76 | 77 | # Training Configurations 78 | parser.add_argument( 79 | '--optim', type=str, default='rms', 80 | choices=['rms', 'adam', 'adamW', 'sgd'] 81 | )    # rms, adam 82 | parser.add_argument('--lr', type=float, default=0.00001, help="the learning rate") 83 | parser.add_argument('--decay', dest='weight_decay', type=float, default=0.) 
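# A minimal usage sketch (not part of the original file) of how the optimizer
# flags above are typically consumed downstream; `model` and the flag-to-class
# mapping here are illustrative assumptions, not this repo's exact training code:
#     import torch.optim as optim
#     optim_cls = {'rms': optim.RMSprop, 'adam': optim.Adam,
#                  'adamW': optim.AdamW, 'sgd': optim.SGD}[args.optim]
#     optimizer = optim_cls(model.parameters(), lr=args.lr,
#                           weight_decay=args.weight_decay)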
84 | parser.add_argument( 85 | '--feedback', type=str, default='sample', 86 | help='How to choose next position, one of ``teacher``, ``sample`` and ``argmax``' 87 | ) 88 | parser.add_argument('--epsilon', type=float, default=0.1, help='') 89 | 90 | # Model hyper params: 91 | parser.add_argument("--angle_feat_size", type=int, default=4) 92 | parser.add_argument('--image_feat_size', type=int, default=2048) 93 | parser.add_argument('--obj_feat_size', type=int, default=0) 94 | parser.add_argument('--views', type=int, default=36) 95 | 96 | # # A2C 97 | parser.add_argument("--gamma", default=0.9, type=float, help='reward discount factor') 98 | parser.add_argument( 99 | "--normalize", dest="normalize_loss", default="total", 100 | type=str, help='batch or total' 101 | ) 102 | parser.add_argument('--train_alg', 103 | choices=['imitation', 'dagger'], 104 | default='imitation' 105 | ) 106 | 107 | args, _ = parser.parse_known_args() 108 | 109 | args = postprocess_args(args) 110 | 111 | return args 112 | 113 | 114 | def postprocess_args(args): 115 | ROOTDIR = args.root_dir 116 | 117 | # Setup input paths 118 | ft_file_map = { 119 | 'clip.h14': 'clip_vit-h14_mp3d_hm3d_gibson.hdf5', 120 | 'clip.b16': 'clip_vit-b16_mp3d_hm3d_gibson.hdf5' 121 | } 122 | 123 | args.aug_ft_file = os.path.join(ROOTDIR, 'R2R', 'features', ft_file_map[args.features]) 124 | 125 | if args.features == 'clip.h14': 126 | args.mp3d_ft_files = [os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_original.hdf5')] 127 | args.val_ft_file = os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_original.hdf5') 128 | elif args.features == 'clip.b16': 129 | args.mp3d_ft_files = [os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-b16_mp3d_original.hdf5')] 130 | args.val_ft_file = os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-b16_mp3d_original.hdf5') 131 | 132 | if args.env_aug: # only h14 133 | args.mp3d_ft_files = [ 134 | os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_img_image_synthesis.hdf5'), 135 | os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_img_mask_image_synthesis.hdf5'), 136 | os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_img_style_transfer.hdf5'), 137 | os.path.join(ROOTDIR, 'R2R', 'features', 'clip_vit-h14_mp3d_original.hdf5'), 138 | ] 139 | 140 | 141 | if args.aug: 142 | args.connectivity_dir = os.path.join(ROOTDIR, 'R2R', 'connectivity') 143 | else: 144 | args.connectivity_dir = os.path.join(ROOTDIR, 'R2R', 'connectivity_mp3d') 145 | 146 | args.scan_data_dir = os.path.join(ROOTDIR, 'Matterport3D', 'v1_unzip_scans') 147 | 148 | args.anno_dir = os.path.join(ROOTDIR, 'R2R', 'annotations') 149 | 150 | # Build paths 151 | args.ckpt_dir = os.path.join(args.output_dir, 'ckpts') 152 | if args.zero_shot: 153 | args.log_dir = os.path.join(args.output_dir, 'zero_shot_logs') 154 | else: 155 | args.log_dir = os.path.join(args.output_dir, 'logs') 156 | args.pred_dir = os.path.join(args.output_dir, 'preds') 157 | 158 | if not args.zero_shot: 159 | os.makedirs(args.output_dir, exist_ok=True) 160 | os.makedirs(args.ckpt_dir, exist_ok=True) 161 | os.makedirs(args.pred_dir, exist_ok=True) 162 | os.makedirs(args.log_dir, exist_ok=True) 163 | 164 | return args 165 | 166 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/scripts/r2r_b16_mix.sh: -------------------------------------------------------------------------------- 1 | DATA_ROOT=../datasets 2 | 3 | train_alg=dagger 4 | 5 | features=clip.b16 6 | ft_dim=512 7 | obj_features=vitbase 8 
| obj_ft_dim=768 9 | 10 | ngpus=1 11 | bs=16 12 | seed=0 13 | 14 | name=${train_alg}-${features} 15 | name=${name}-seed.${seed} 16 | name=${name}-aug.mp3d.prevalent.hm3d_gibson.envdrop.init.140k 17 | 18 | 19 | outdir=${DATA_ROOT}/R2R/exprs_map/finetune/${name}-aug.hm3d.envdrop 20 | 21 | flag="--root_dir ${DATA_ROOT} 22 | --dataset r2r 23 | --output_dir ${outdir} 24 | --world_size ${ngpus} 25 | --seed ${seed} 26 | --tokenizer bert 27 | 28 | --enc_full_graph 29 | --graph_sprels 30 | --fusion dynamic 31 | 32 | --expert_policy spl 33 | --train_alg ${train_alg} 34 | 35 | --num_l_layers 9 36 | --num_x_layers 4 37 | --num_pano_layers 2 38 | 39 | --max_action_len 15 40 | --max_instr_len 200 41 | 42 | --batch_size ${bs} 43 | --lr 1e-5 44 | --iters 200000 45 | --log_every 500 46 | --aug_times 9 47 | 48 | --optim adamW 49 | 50 | --features ${features} 51 | --image_feat_size ${ft_dim} 52 | --angle_feat_size 4 53 | 54 | --ml_weight 0.15 55 | 56 | --feat_dropout 0.4 57 | --dropout 0.5 58 | 59 | --gamma 0." 60 | 61 | # # zero shot 62 | # python r2r/main_nav.py $flag \ 63 | # --tokenizer bert \ 64 | # --zero_shot 65 | 66 | # train 67 | CUDA_VISIBLE_DEVICES=$1 python r2r/main_nav.py $flag \ 68 | --tokenizer bert \ 69 | --bert_ckpt_file ../datasets/R2R/trained_models/pretrain/duet_vit-b16_model_step_140000.pt \ 70 | --aug ../datasets/R2R/annotations/R2R_scalevln_ft_aug_enc.json 71 | 72 | # # test 73 | # CUDA_VISIBLE_DEVICES=$1 python r2r/main_nav.py $flag \ 74 | # --tokenizer bert \ 75 | # --resume_file ../datasets/R2R/trained_models/finetune/duet_vit-b16_ft_best_val_unseen \ 76 | # --test --submit -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/scripts/r2r_h14_envedit_mix.sh: -------------------------------------------------------------------------------- 1 | DATA_ROOT=../datasets 2 | 3 | train_alg=dagger 4 | 5 | features=clip.h14 6 | ft_dim=1024 7 | obj_features=vitbase 8 | obj_ft_dim=768 9 | 10 | ngpus=1 11 | bs=8 12 | seed=0 13 | 14 | name=${train_alg}-${features}-envedit 15 | name=${name}-seed.${seed} 16 | name=${name}-aug.mp3d.prevalent.hm3d_gibson.envdrop.init.190k 17 | 18 | 19 | outdir=${DATA_ROOT}/R2R/exprs_map/finetune/${name}-aug.hm3d.envdrop 20 | 21 | flag="--root_dir ${DATA_ROOT} 22 | --dataset r2r 23 | --output_dir ${outdir} 24 | --world_size ${ngpus} 25 | --seed ${seed} 26 | --tokenizer bert 27 | 28 | --enc_full_graph 29 | --graph_sprels 30 | --fusion dynamic 31 | 32 | --expert_policy spl 33 | --train_alg ${train_alg} 34 | 35 | --num_l_layers 9 36 | --num_x_layers 4 37 | --num_pano_layers 2 38 | 39 | --max_action_len 15 40 | --max_instr_len 200 41 | 42 | --batch_size ${bs} 43 | --lr 1e-5 44 | --iters 200000 45 | --log_every 500 46 | --aug_times 9 47 | 48 | --env_aug 49 | 50 | --optim adamW 51 | 52 | --features ${features} 53 | --image_feat_size ${ft_dim} 54 | --angle_feat_size 4 55 | 56 | --ml_weight 0.15 57 | 58 | --feat_dropout 0.4 59 | --dropout 0.5 60 | 61 | --gamma 0." 
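# Example invocation (a sketch; the GPU id is an assumption). Run from
# map_nav_src so the relative ../datasets paths resolve:
#   bash scripts/r2r_h14_envedit_mix.sh 0
# The train command below then expands to
# `CUDA_VISIBLE_DEVICES=0 python r2r/main_nav.py $flag ...`.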
62 | 63 | # # zero shot 64 | # python r2r/main_nav.py $flag \ 65 | # --tokenizer bert \ 66 | # --zero_shot 67 | 68 | # train 69 | CUDA_VISIBLE_DEVICES=$1 python r2r/main_nav.py $flag \ 70 | --tokenizer bert \ 71 | --bert_ckpt_file ../datasets/R2R/trained_models/pretrain/duet_vit-h14_model_step_190000.pt \ 72 | --aug ../datasets/R2R/annotations/R2R_scalevln_ft_aug_enc.json 73 | 74 | # # test 75 | # CUDA_VISIBLE_DEVICES='0' python r2r/main_nav.py $flag \ 76 | # --tokenizer bert \ 77 | # --resume_file ../datasets/R2R/trained_models/finetune/duet_vit-h14_ft_best_val_unseen \ 78 | # --test --submit -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/utils/data.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import jsonlines 4 | import h5py 5 | import networkx as nx 6 | import math 7 | import numpy as np 8 | import random 9 | 10 | class ImageFeaturesDB(object): 11 | def __init__(self, img_ft_file, image_feat_size): 12 | self.image_feat_size = image_feat_size 13 | self.img_ft_file = img_ft_file 14 | self._feature_store = {} 15 | with h5py.File(self.img_ft_file, 'r') as f: 16 | for key in f.keys(): 17 | ft = f[key][...][:, :self.image_feat_size].astype(np.float32) 18 | self._feature_store[key] = ft 19 | 20 | 21 | def get_image_feature(self, scan, viewpoint): 22 | key = '%s_%s' % (scan, viewpoint) 23 | if key in self._feature_store: 24 | ft = self._feature_store[key] 25 | else: 26 | with h5py.File(self.img_ft_file, 'r') as f: 27 | ft = f[key][...][:, :self.image_feat_size].astype(np.float32) 28 | self._feature_store[key] = ft 29 | return ft 30 | 31 | class ImageFeaturesDB2(object): 32 | def __init__(self, img_ft_files, image_feat_size): 33 | self.image_feat_size = image_feat_size 34 | self.img_ft_file = img_ft_files 35 | self._feature_stores = {} 36 | for name in img_ft_files: 37 | self._feature_stores[name] = {} 38 | with h5py.File(name, 'r') as f: 39 | for key in f.keys(): 40 | ft = f[key][...][:, :self.image_feat_size].astype(np.float32) 41 | self._feature_stores[name][key] = ft 42 | self.env_names = list(self._feature_stores.keys()) 43 | print(self.env_names) 44 | 45 | 46 | def get_image_feature(self, scan, viewpoint): 47 | key = '%s_%s' % (scan, viewpoint) 48 | env_name = random.choice(self.env_names) 49 | if key in self._feature_stores[env_name]: 50 | ft = self._feature_stores[env_name][key] 51 | else: 52 | with h5py.File(env_name, 'r') as f: 53 | ft = f[key][...][:, :self.image_feat_size].astype(np.float32) 54 | self._feature_stores[env_name][key] = ft 55 | return ft 56 | 57 | def load_nav_graphs(connectivity_dir, scans): 58 | ''' Load connectivity graph for each scan ''' 59 | 60 | def distance(pose1, pose2): 61 | ''' Euclidean distance between two graph poses ''' 62 | return ((pose1['pose'][3]-pose2['pose'][3])**2\ 63 | + (pose1['pose'][7]-pose2['pose'][7])**2\ 64 | + (pose1['pose'][11]-pose2['pose'][11])**2)**0.5 65 | 66 | graphs = {} 67 | for scan in scans: 68 | with open(os.path.join(connectivity_dir, '%s_connectivity.json' % scan)) as f: 69 | G = nx.Graph() 70 | positions = {} 71 | data = json.load(f) 72 | for i,item in enumerate(data): 73 | if item['included']: 74 | for j,conn in enumerate(item['unobstructed']): 75 | if conn and data[j]['included']: 76 | positions[item['image_id']] = np.array([item['pose'][3], 77 | item['pose'][7], item['pose'][11]]); 78 | assert data[j]['unobstructed'][i], 'Graph should be undirected' 79 | 
G.add_edge(item['image_id'],data[j]['image_id'],weight=distance(item,data[j])) 80 | nx.set_node_attributes(G, values=positions, name='position') 81 | graphs[scan] = G 82 | return graphs 83 | 84 | def new_simulator(connectivity_dir, scan_data_dir=None): 85 | import MatterSim 86 | 87 | # Simulator image parameters 88 | WIDTH = 640 89 | HEIGHT = 480 90 | VFOV = 60 91 | 92 | sim = MatterSim.Simulator() 93 | if scan_data_dir: 94 | sim.setDatasetPath(scan_data_dir) 95 | sim.setNavGraphPath(connectivity_dir) 96 | sim.setRenderingEnabled(False) 97 | sim.setCameraResolution(WIDTH, HEIGHT) 98 | sim.setCameraVFOV(math.radians(VFOV)) 99 | sim.setDiscretizedViewingAngles(True) 100 | sim.setBatchSize(1) 101 | sim.initialize() 102 | 103 | return sim 104 | 105 | def angle_feature(heading, elevation, angle_feat_size): 106 | return np.array( 107 | [math.sin(heading), math.cos(heading), math.sin(elevation), math.cos(elevation)] * (angle_feat_size // 4), 108 | dtype=np.float32) 109 | 110 | def get_point_angle_feature(sim, angle_feat_size, baseViewId=0): 111 | feature = np.empty((36, angle_feat_size), np.float32) 112 | base_heading = (baseViewId % 12) * math.radians(30) 113 | base_elevation = (baseViewId // 12 - 1) * math.radians(30) 114 | 115 | for ix in range(36): 116 | if ix == 0: 117 | sim.newEpisode(['ZMojNkEp431'], ['2f4d90acd4024c269fb0efe49a8ac540'], [0], [math.radians(-30)]) 118 | elif ix % 12 == 0: 119 | sim.makeAction([0], [1.0], [1.0]) 120 | else: 121 | sim.makeAction([0], [1.0], [0]) 122 | 123 | state = sim.getState()[0] 124 | assert state.viewIndex == ix 125 | 126 | heading = state.heading - base_heading 127 | elevation = state.elevation - base_elevation 128 | 129 | feature[ix, :] = angle_feature(heading, elevation, angle_feat_size) 130 | return feature 131 | 132 | def get_all_point_angle_feature(sim, angle_feat_size): 133 | return [get_point_angle_feature(sim, angle_feat_size, baseViewId) for baseViewId in range(36)] 134 | 135 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/utils/distributed.py: -------------------------------------------------------------------------------- 1 | """ 2 | Distributed tools 3 | """ 4 | import os 5 | from pathlib import Path 6 | from pprint import pformat 7 | import pickle 8 | 9 | import torch 10 | import torch.distributed as dist 11 | 12 | 13 | def load_init_param(opts): 14 | """ 15 | Load parameters for the rendezvous distributed procedure 16 | """ 17 | # sync file 18 | if opts.output_dir != "": 19 | sync_dir = Path(opts.output_dir).resolve() 20 | sync_dir.mkdir(parents=True, exist_ok=True) 21 | sync_file = f"{sync_dir}/.torch_distributed_sync" 22 | else: 23 | raise RuntimeError("Can't find any sync dir") 24 | 25 | # world size 26 | if opts.world_size != -1: 27 | world_size = opts.world_size 28 | elif os.environ.get("WORLD_SIZE", "") != "": 29 | world_size = int(os.environ["WORLD_SIZE"]) 30 | else: 31 | raise RuntimeError("Can't find any world size") 32 | 33 | # rank 34 | if os.environ.get("RANK", "") != "": 35 | # pytorch.distributed.launch provide this variable no matter what 36 | rank = int(os.environ["RANK"]) 37 | else: 38 | if opts.node_rank != -1: 39 | node_rank = opts.node_rank 40 | elif os.environ.get("NODE_RANK", "") != "": 41 | node_rank = int(os.environ["NODE_RANK"]) 42 | else: 43 | raise RuntimeError("Can't find any rank or node rank") 44 | 45 | if opts.local_rank != -1: 46 | local_rank = opts.local_rank 47 | elif os.environ.get("LOCAL_RANK", "") != "": 48 | local_rank = 
int(os.environ["LOCAL_RANK"]) 49 | else: 50 | raise RuntimeError("Can't find any rank or local rank") 51 | 52 | # WARNING: this assumes that each node has the same number of GPUs 53 | n_gpus = torch.cuda.device_count() 54 | rank = local_rank + node_rank * n_gpus 55 | 56 | return { 57 | "backend": "nccl", 58 | # "init_method": f"file://{sync_file}", 59 | "rank": rank, 60 | "world_size": world_size, 61 | } 62 | 63 | 64 | def init_distributed(opts): 65 | init_param = load_init_param(opts) 66 | rank = init_param["rank"] 67 | 68 | print(f"Init distributed {init_param['rank']} - {init_param['world_size']}") 69 | 70 | dist.init_process_group(**init_param) 71 | return rank 72 | 73 | 74 | def is_default_gpu(opts) -> bool: 75 | return opts.local_rank == -1 or dist.get_rank() == 0 76 | 77 | 78 | def is_dist_avail_and_initialized(): 79 | if not dist.is_available(): 80 | return False 81 | if not dist.is_initialized(): 82 | return False 83 | return True 84 | 85 | def get_world_size(): 86 | if not is_dist_avail_and_initialized(): 87 | return 1 88 | return dist.get_world_size() 89 | 90 | def all_gather(data): 91 | """ 92 | Run all_gather on arbitrary picklable data (not necessarily tensors) 93 | Args: 94 | data: any picklable object 95 | Returns: 96 | list[data]: list of data gathered from each rank 97 | """ 98 | world_size = get_world_size() 99 | if world_size == 1: 100 | return [data] 101 | 102 | # serialized to a Tensor 103 | buffer = pickle.dumps(data) 104 | storage = torch.ByteStorage.from_buffer(buffer) 105 | tensor = torch.ByteTensor(storage).to("cuda") 106 | 107 | # obtain Tensor size of each rank 108 | local_size = torch.tensor([tensor.numel()], device="cuda") 109 | size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)] 110 | dist.all_gather(size_list, local_size) 111 | size_list = [int(size.item()) for size in size_list] 112 | max_size = max(size_list) 113 | 114 | # receiving Tensor from all ranks 115 | # we pad the tensor because torch all_gather does not support 116 | # gathering tensors of different shapes 117 | tensor_list = [] 118 | for _ in size_list: 119 | tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda")) 120 | if local_size != max_size: 121 | padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda") 122 | tensor = torch.cat((tensor, padding), dim=0) 123 | dist.all_gather(tensor_list, tensor) 124 | 125 | data_list = [] 126 | for size, tensor in zip(size_list, tensor_list): 127 | buffer = tensor.cpu().numpy().tobytes()[:size] 128 | data_list.append(pickle.loads(buffer)) 129 | 130 | return data_list 131 | 132 | 133 | def reduce_dict(input_dict, average=True): 134 | """ 135 | Args: 136 | input_dict (dict): all the values will be reduced 137 | average (bool): whether to do average or sum 138 | Reduce the values in the dictionary from all processes so that all processes 139 | have the averaged results. Returns a dict with the same fields as 140 | input_dict, after reduction. 
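    Example (a sketch; assumes an initialized process group and one CUDA
    tensor per value):
        losses = {'mlm_loss': mlm_loss, 'sap_loss': sap_loss}
        avg = reduce_dict(losses)  # same keys; values averaged across ranks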
141 | """ 142 | world_size = get_world_size() 143 | if world_size < 2: 144 | return input_dict 145 | with torch.no_grad(): 146 | names = [] 147 | values = [] 148 | # sort the keys so that they are consistent across processes 149 | for k in sorted(input_dict.keys()): 150 | names.append(k) 151 | values.append(input_dict[k]) 152 | values = torch.stack(values, dim=0) 153 | dist.all_reduce(values) 154 | if average: 155 | values /= world_size 156 | reduced_dict = {k: v for k, v in zip(names, values)} 157 | return reduced_dict 158 | 159 | 160 | def merge_dist_results(results): 161 | outs = [] 162 | for res in results: 163 | outs.extend(res) 164 | return outs 165 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/utils/logger.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import math 4 | import time 5 | from collections import OrderedDict 6 | 7 | 8 | def write_to_record_file(data, file_path, verbose=True): 9 | if verbose: 10 | print(data) 11 | record_file = open(file_path, 'a') 12 | record_file.write(data+'\n') 13 | record_file.close() 14 | 15 | 16 | def asMinutes(s): 17 | m = math.floor(s / 60) 18 | s -= m * 60 19 | return '%dm %ds' % (m, s) 20 | 21 | def timeSince(since, percent): 22 | now = time.time() 23 | s = now - since 24 | es = s / (percent) 25 | rs = es - s 26 | return '%s (- %s)' % (asMinutes(s), asMinutes(rs)) 27 | 28 | class Timer: 29 | def __init__(self): 30 | self.cul = OrderedDict() 31 | self.start = {} 32 | self.iter = 0 33 | 34 | def reset(self): 35 | self.cul = OrderedDict() 36 | self.start = {} 37 | self.iter = 0 38 | 39 | def tic(self, key): 40 | self.start[key] = time.time() 41 | 42 | def toc(self, key): 43 | delta = time.time() - self.start[key] 44 | if key not in self.cul: 45 | self.cul[key] = delta 46 | else: 47 | self.cul[key] += delta 48 | 49 | def step(self): 50 | self.iter += 1 51 | 52 | def show(self): 53 | total = sum(self.cul.values()) 54 | for key in self.cul: 55 | print("%s, total time %0.2f, avg time %0.2f, part of %0.2f" % 56 | (key, self.cul[key], self.cul[key]*1./self.iter, self.cul[key]*1./total)) 57 | print(total / self.iter) 58 | 59 | 60 | def print_progress(iteration, total, prefix='', suffix='', decimals=1, bar_length=100): 61 | """ 62 | Call in a loop to create terminal progress bar 63 | @params: 64 | iteration - Required : current iteration (Int) 65 | total - Required : total iterations (Int) 66 | prefix - Optional : prefix string (Str) 67 | suffix - Optional : suffix string (Str) 68 | decimals - Optional : positive number of decimals in percent complete (Int) 69 | bar_length - Optional : character length of bar (Int) 70 | """ 71 | str_format = "{0:." 
+ str(decimals) + "f}" 72 | percents = str_format.format(100 * (iteration / float(total))) 73 | filled_length = int(round(bar_length * iteration / float(total))) 74 | bar = '█' * filled_length + '-' * (bar_length - filled_length) 75 | 76 | sys.stdout.write('\r%s |%s| %s%s %s' % (prefix, bar, percents, '%', suffix)), 77 | 78 | if iteration == total: 79 | sys.stdout.write('\n') 80 | sys.stdout.flush() 81 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/utils/misc.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import torch 4 | 5 | def set_random_seed(seed): 6 | torch.manual_seed(seed) 7 | torch.cuda.manual_seed(seed) 8 | torch.cuda.manual_seed_all(seed) 9 | random.seed(seed) 10 | np.random.seed(seed) 11 | 12 | def length2mask(length, size=None): 13 | batch_size = len(length) 14 | size = int(max(length)) if size is None else size 15 | mask = (torch.arange(size, dtype=torch.int64).unsqueeze(0).repeat(batch_size, 1) 16 | > (torch.LongTensor(length) - 1).unsqueeze(1)).cuda() 17 | return mask 18 | -------------------------------------------------------------------------------- /VLN-DUET/map_nav_src/utils/ops.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | def pad_tensors(tensors, lens=None, pad=0): 5 | """B x [T, ...]""" 6 | if lens is None: 7 | lens = [t.size(0) for t in tensors] 8 | max_len = max(lens) 9 | bs = len(tensors) 10 | hid = list(tensors[0].size()[1:]) 11 | size = [bs, max_len] + hid 12 | 13 | dtype = tensors[0].dtype 14 | device = tensors[0].device 15 | output = torch.zeros(*size, dtype=dtype).to(device) 16 | if pad: 17 | output.data.fill_(pad) 18 | for i, (t, l) in enumerate(zip(tensors, lens)): 19 | output.data[i, :l, ...] 
= t.data 20 | return output 21 | 22 | def gen_seq_masks(seq_lens, max_len=None): 23 | if max_len is None: 24 | max_len = max(seq_lens) 25 | 26 | if isinstance(seq_lens, torch.Tensor): 27 | device = seq_lens.device 28 | masks = torch.arange(max_len).to(device).repeat(len(seq_lens), 1) < seq_lens.unsqueeze(1) 29 | return masks 30 | 31 | if max_len == 0: 32 | return np.zeros((len(seq_lens), 0), dtype=bool) 33 | 34 | seq_lens = np.array(seq_lens) 35 | batch_size = len(seq_lens) 36 | masks = np.arange(max_len).reshape(-1, max_len).repeat(batch_size, 0) 37 | masks = masks < seq_lens.reshape(-1, 1) 38 | return masks -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/config/r2r_model_config_clip-b16.json: -------------------------------------------------------------------------------- 1 | { 2 | "pred_head_dropout_prob": 0.1, 3 | "attention_probs_dropout_prob": 0.1, 4 | "finetuning_task": null, 5 | "hidden_act": "gelu", 6 | "hidden_dropout_prob": 0.1, 7 | "hidden_size": 768, 8 | "image_feat_size": 512, 9 | "image_prob_size": 1000, 10 | "angle_feat_size": 4, 11 | "obj_feat_size": 0, 12 | "obj_prob_size": 0, 13 | "img_feature_type": "imagenet", 14 | "initializer_range": 0.02, 15 | "intermediate_size": 3072, 16 | "num_l_layers": 9, 17 | "num_x_layers": 4, 18 | "num_pano_layers": 2, 19 | "layer_norm_eps": 1e-12, 20 | "max_position_embeddings": 512, 21 | "max_action_steps": 100, 22 | "num_attention_heads": 12, 23 | "num_hidden_layers": 12, 24 | "num_labels": 2, 25 | "output_attentions": false, 26 | "output_hidden_states": false, 27 | "pruned_heads": {}, 28 | "torchscript": false, 29 | "type_vocab_size": 2, 30 | "update_lang_bert": true, 31 | "vocab_size": 30522, 32 | "use_lang2visn_attn": true, 33 | "graph_sprels": true, 34 | "glocal_fuse": true, 35 | "lang_bert_name": "bert-base-uncased" 36 | } 37 | -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/config/r2r_model_config_clip-h14.json: -------------------------------------------------------------------------------- 1 | { 2 | "pred_head_dropout_prob": 0.1, 3 | "attention_probs_dropout_prob": 0.1, 4 | "finetuning_task": null, 5 | "hidden_act": "gelu", 6 | "hidden_dropout_prob": 0.1, 7 | "hidden_size": 768, 8 | "image_feat_size": 1024, 9 | "image_prob_size": 1000, 10 | "angle_feat_size": 4, 11 | "obj_feat_size": 0, 12 | "obj_prob_size": 0, 13 | "img_feature_type": "imagenet", 14 | "initializer_range": 0.02, 15 | "intermediate_size": 3072, 16 | "num_l_layers": 9, 17 | "num_x_layers": 4, 18 | "num_pano_layers": 2, 19 | "layer_norm_eps": 1e-12, 20 | "max_position_embeddings": 512, 21 | "max_action_steps": 100, 22 | "num_attention_heads": 12, 23 | "num_hidden_layers": 12, 24 | "num_labels": 2, 25 | "output_attentions": false, 26 | "output_hidden_states": false, 27 | "pruned_heads": {}, 28 | "torchscript": false, 29 | "type_vocab_size": 2, 30 | "update_lang_bert": true, 31 | "vocab_size": 30522, 32 | "use_lang2visn_attn": true, 33 | "graph_sprels": true, 34 | "glocal_fuse": true, 35 | "lang_bert_name": "bert-base-uncased" 36 | } 37 | -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/config/r2r_pretrain_hm3d+mp3d+gibson_clip-b16.json: -------------------------------------------------------------------------------- 1 | { 2 | "model_config": "", 3 | "checkpoint": null, 4 | "output_dir": "", 5 | "mrc_mask_prob": 0.15, 6 | "max_txt_len": 200, 7 | "train_batch_size": 128, 8 | "val_batch_size": 128, 9 | 
"gradient_accumulation_steps": 1, 10 | "learning_rate": 5e-05, 11 | "valid_steps": 10000, 12 | "log_steps": 1000, 13 | "num_train_steps":200000, 14 | "optim": "adamw", 15 | "betas": [ 16 | 0.9, 17 | 0.98 18 | ], 19 | "dropout": 0.1, 20 | "weight_decay": 0.01, 21 | "grad_norm": 5.0, 22 | "warmup_steps": 10000, 23 | "seed": 0, 24 | "fp16": false, 25 | "n_workers": 1, 26 | "pin_mem": false, 27 | "init_pretrained": "lxmert", 28 | 29 | "train_datasets": { 30 | "R2R": { 31 | "name": "R2R", 32 | "train_traj_files": ["../datasets/R2R/annotations/pretrain_map/R2R_train_enc.jsonl", 33 | "../datasets/R2R/annotations/pretrain_map/R2R_hm3d_aug_envdrop_generated_enc.jsonl", 34 | "../datasets/R2R/annotations/pretrain_map/R2R_prevalent_aug_train_enc.jsonl", 35 | "../datasets/R2R/annotations/pretrain_map/R2R_gibson_aug_envdrop_generated_enc.jsonl"], 36 | "val_seen_traj_files": ["../datasets/R2R/annotations/pretrain_map/R2R_val_seen_enc.jsonl"], 37 | "val_unseen_traj_files": ["../datasets/R2R/annotations/pretrain_map/R2R_val_unseen_enc.jsonl"], 38 | "connectivity_dir": "../datasets/R2R/connectivity", 39 | "img_ft_file": "../datasets/R2R/features/clip_vit-b16_mp3d_hm3d_gibson.hdf5", 40 | "scanvp_cands_file": "../datasets/R2R/annotations/scanvp_candview_relangles_with_hm3d_gibson.json", 41 | "tasks": [ 42 | "mlm", 43 | "sap" 44 | ], 45 | "mix_ratio": [ 46 | 1, 47 | 1 48 | ] 49 | } 50 | } 51 | } 52 | -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/config/r2r_pretrain_hm3d+mp3d+gibson_clip-h14.json: -------------------------------------------------------------------------------- 1 | { 2 | "model_config": "", 3 | "checkpoint": null, 4 | "output_dir": "", 5 | "mrc_mask_prob": 0.15, 6 | "max_txt_len": 200, 7 | "train_batch_size": 128, 8 | "val_batch_size": 128, 9 | "gradient_accumulation_steps": 1, 10 | "learning_rate": 5e-05, 11 | "valid_steps": 10000, 12 | "log_steps": 1000, 13 | "num_train_steps":200000, 14 | "optim": "adamw", 15 | "betas": [ 16 | 0.9, 17 | 0.98 18 | ], 19 | "dropout": 0.1, 20 | "weight_decay": 0.01, 21 | "grad_norm": 5.0, 22 | "warmup_steps": 10000, 23 | "seed": 0, 24 | "fp16": false, 25 | "n_workers": 1, 26 | "pin_mem": false, 27 | "init_pretrained": "lxmert", 28 | 29 | "train_datasets": { 30 | "R2R": { 31 | "name": "R2R", 32 | "train_traj_files": ["../datasets/R2R/annotations/pretrain_map/R2R_train_enc.jsonl", 33 | "../datasets/R2R/annotations/pretrain_map/R2R_hm3d_aug_envdrop_generated_enc.jsonl", 34 | "../datasets/R2R/annotations/pretrain_map/R2R_prevalent_aug_train_enc.jsonl", 35 | "../datasets/R2R/annotations/pretrain_map/R2R_gibson_aug_envdrop_generated_enc.jsonl"], 36 | "val_seen_traj_files": ["../datasets/R2R/annotations/pretrain_map/R2R_val_seen_enc.jsonl"], 37 | "val_unseen_traj_files": ["../datasets/R2R/annotations/pretrain_map/R2R_val_unseen_enc.jsonl"], 38 | "connectivity_dir": "../datasets/R2R/connectivity", 39 | "img_ft_file": "../datasets/R2R/features/clip_vit-h14_mp3d_hm3d_gibson.hdf5", 40 | "scanvp_cands_file": "../datasets/R2R/annotations/scanvp_candview_relangles_with_hm3d_gibson.json", 41 | "tasks": [ 42 | "mlm", 43 | "sap" 44 | ], 45 | "mix_ratio": [ 46 | 1, 47 | 1 48 | ] 49 | } 50 | } 51 | } 52 | -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/data/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/VLN-DUET/pretrain_src/data/__init__.py -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/data/common.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import json 4 | import numpy as np 5 | import networkx as nx 6 | 7 | import torch 8 | 9 | def pad_tensors(tensors, lens=None, pad=0): 10 | """B x [T, ...] torch tensors""" 11 | if lens is None: 12 | lens = [t.size(0) for t in tensors] 13 | max_len = max(lens) 14 | bs = len(tensors) 15 | hid = list(tensors[0].size()[1:]) 16 | size = [bs, max_len] + hid 17 | 18 | dtype = tensors[0].dtype 19 | output = torch.zeros(*size, dtype=dtype) 20 | if pad: 21 | output.data.fill_(pad) 22 | for i, (t, l) in enumerate(zip(tensors, lens)): 23 | output.data[i, :l, ...] = t.data 24 | return output 25 | 26 | def gen_seq_masks(seq_lens, max_len=None): 27 | """ 28 | Args: 29 | seq_lens: list or nparray int, shape=(N, ) 30 | Returns: 31 | masks: nparray, shape=(N, L), padded=0 32 | """ 33 | seq_lens = np.array(seq_lens) 34 | if max_len is None: 35 | max_len = max(seq_lens) 36 | if max_len == 0: 37 | return np.zeros((len(seq_lens), 0), dtype=bool) 38 | batch_size = len(seq_lens) 39 | masks = np.arange(max_len).reshape(-1, max_len).repeat(batch_size, 0) 40 | masks = masks < seq_lens.reshape(-1, 1) 41 | return masks 42 | 43 | def get_angle_fts(headings, elevations, angle_feat_size): 44 | ang_fts = [np.sin(headings), np.cos(headings), np.sin(elevations), np.cos(elevations)] 45 | ang_fts = np.vstack(ang_fts).transpose().astype(np.float32) 46 | num_repeats = angle_feat_size // 4 47 | if num_repeats > 1: 48 | ang_fts = np.concatenate([ang_fts] * num_repeats, 1) 49 | return ang_fts 50 | 51 | def get_view_rel_angles(baseViewId=0): 52 | rel_angles = np.zeros((36, 2), dtype=np.float32) 53 | 54 | base_heading = (baseViewId % 12) * math.radians(30) 55 | base_elevation = (baseViewId // 12 - 1) * math.radians(30) 56 | for ix in range(36): 57 | if ix == 0: 58 | heading = 0 59 | elevation = math.radians(-30) 60 | elif ix % 12 == 0: 61 | heading = 0 62 | elevation += math.radians(30) 63 | else: 64 | heading += math.radians(30) 65 | rel_angles[ix, 0] = heading - base_heading 66 | rel_angles[ix, 1] = elevation - base_elevation 67 | 68 | return rel_angles 69 | 70 | 71 | def load_nav_graphs(connectivity_dir): 72 | ''' Load connectivity graph for each scan ''' 73 | 74 | def distance(pose1, pose2): 75 | ''' Euclidean distance between two graph poses ''' 76 | return ((pose1['pose'][3]-pose2['pose'][3])**2\ 77 | + (pose1['pose'][7]-pose2['pose'][7])**2\ 78 | + (pose1['pose'][11]-pose2['pose'][11])**2)**0.5 79 | 80 | scans = [x.strip() for x in open(os.path.join(connectivity_dir, 'scans.txt')).readlines()] 81 | graphs = {} 82 | for scan in scans: 83 | with open(os.path.join(connectivity_dir, '%s_connectivity.json' % scan)) as f: 84 | G = nx.Graph() 85 | positions = {} 86 | data = json.load(f) 87 | for i, item in enumerate(data): 88 | if item['included']: 89 | for j,conn in enumerate(item['unobstructed']): 90 | if conn and data[j]['included']: 91 | positions[item['image_id']] = np.array([item['pose'][3], 92 | item['pose'][7], item['pose'][11]]) 93 | assert data[j]['unobstructed'][i], 'Graph should be undirected' 94 | G.add_edge(item['image_id'],data[j]['image_id'],weight=distance(item,data[j])) 95 | nx.set_node_attributes(G, values=positions, name='position') 96 | graphs[scan] 
= G 97 | 98 | shortest_distances = {} 99 | shortest_paths = {} 100 | for scan, G in graphs.items(): # compute all shortest paths 101 | shortest_distances[scan] = dict(nx.all_pairs_dijkstra_path_length(G)) 102 | shortest_paths[scan] = dict(nx.all_pairs_dijkstra_path(G)) 103 | return graphs, shortest_distances, shortest_paths 104 | 105 | def softmax(logits, dim=1): 106 | # logits: (n, d); subtract the per-row max before exp for numerical stability 107 | tmp = np.exp(logits - np.max(logits, axis=dim, keepdims=True)) 108 | return tmp / np.sum(tmp, axis=dim, keepdims=True) 109 | 110 | 111 | def calculate_vp_rel_pos_fts(a, b, base_heading=0, base_elevation=0): 112 | # a, b: (x, y, z) 113 | dx = b[0] - a[0] 114 | dy = b[1] - a[1] 115 | dz = b[2] - a[2] 116 | xy_dist = max(np.sqrt(dx**2 + dy**2), 1e-8) 117 | xyz_dist = max(np.sqrt(dx**2 + dy**2 + dz**2), 1e-8) 118 | 119 | # the simulator's API is weird (the x-y axes are transposed) 120 | heading = np.arcsin(dx/xy_dist) # [-pi/2, pi/2] 121 | if b[1] < a[1]: 122 | heading = np.pi - heading 123 | heading -= base_heading 124 | 125 | elevation = np.arcsin(dz/xyz_dist) # [-pi/2, pi/2] 126 | elevation -= base_elevation 127 | 128 | return heading, elevation, xyz_dist 129 | 130 | def normalize_angle(x): 131 | '''convert radians into (-pi, pi]''' 132 | pi2 = 2 * math.pi 133 | x = x % pi2 # [0, 2pi] 134 | x = np.where(x > math.pi, x - pi2, x) 135 | return x -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/data/loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) Microsoft Corporation. 3 | Licensed under the MIT license. 4 | 5 | A prefetch loader to speed up data loading 6 | Modified from Nvidia Deep Learning Examples 7 | (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch). 8 | """ 9 | import random 10 | from typing import List, Dict, Tuple, Union, Iterator 11 | 12 | import torch 13 | from torch.utils.data import DataLoader, RandomSampler, SequentialSampler 14 | from torch.utils.data.distributed import DistributedSampler 15 | import torch.distributed as dist 16 | 17 | 18 | class MetaLoader: 19 | """wraps multiple data loaders""" 20 | 21 | def __init__( 22 | self, loaders, accum_steps: int = 1, distributed: bool = False, device=None 23 | ): 24 | assert isinstance(loaders, dict) 25 | self.name2loader = {} 26 | self.name2iter = {} 27 | self.name2pre_epoch = {} 28 | self.names: List[str] = [] 29 | ratios: List[int] = [] 30 | for n, l in loaders.items(): 31 | if isinstance(l, tuple): 32 | l, r, p = l 33 | elif isinstance(l, DataLoader): 34 | r = 1 35 | p = lambda e: None 36 | else: 37 | raise ValueError() 38 | self.names.append(n) 39 | self.name2loader[n] = l 40 | self.name2iter[n] = iter(l) 41 | self.name2pre_epoch[n] = p 42 | ratios.append(r) 43 | 44 | self.accum_steps = accum_steps 45 | self.device = device 46 | self.sampling_ratios = torch.tensor(ratios).float().to(self.device) 47 | self.distributed = distributed 48 | self.step = 0 49 | 50 | def __iter__(self) -> Iterator[Tuple]: 51 | """this iterator will run indefinitely""" 52 | task_id = None 53 | epoch_id = 0 54 | while True: 55 | if self.step % self.accum_steps == 0: 56 | task_id = torch.multinomial(self.sampling_ratios, 1) 57 | if self.distributed: 58 | # make sure all processes are training the same task 59 | dist.broadcast(task_id, 0) 60 | self.step += 1 61 | task = self.names[task_id.cpu().item()] 62 | iter_ = self.name2iter[task] 63 | try: 64 | batch = next(iter_) 65 | except StopIteration: 66 | epoch_id += 1 67 | # In distributed mode, calling the set_epoch() method at the 
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/data/loader.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright (c) Microsoft Corporation.
3 | Licensed under the MIT license.
4 |
5 | A prefetch loader to speed up data loading
6 | Modified from Nvidia Deep Learning Examples
7 | (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch).
8 | """
9 | import random
10 | from typing import List, Dict, Tuple, Union, Iterator
11 |
12 | import torch
13 | from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
14 | from torch.utils.data.distributed import DistributedSampler
15 | import torch.distributed as dist
16 |
17 |
18 | class MetaLoader:
19 |     """wraps multiple data loaders"""
20 |
21 |     def __init__(
22 |         self, loaders, accum_steps: int = 1, distributed: bool = False, device=None
23 |     ):
24 |         assert isinstance(loaders, dict)
25 |         self.name2loader = {}
26 |         self.name2iter = {}
27 |         self.name2pre_epoch = {}
28 |         self.names: List[str] = []
29 |         ratios: List[int] = []
30 |         for n, l in loaders.items():
31 |             if isinstance(l, tuple):
32 |                 l, r, p = l
33 |             elif isinstance(l, DataLoader):
34 |                 r = 1
35 |                 p = lambda e: None
36 |             else:
37 |                 raise ValueError()
38 |             self.names.append(n)
39 |             self.name2loader[n] = l
40 |             self.name2iter[n] = iter(l)
41 |             self.name2pre_epoch[n] = p
42 |             ratios.append(r)
43 |
44 |         self.accum_steps = accum_steps
45 |         self.device = device
46 |         self.sampling_ratios = torch.tensor(ratios).float().to(self.device)
47 |         self.distributed = distributed
48 |         self.step = 0
49 |
50 |     def __iter__(self) -> Iterator[Tuple]:
51 |         """this iterator will run indefinitely"""
52 |         task_id = None
53 |         epoch_id = 0
54 |         while True:
55 |             if self.step % self.accum_steps == 0:
56 |                 task_id = torch.multinomial(self.sampling_ratios, 1)
57 |                 if self.distributed:
58 |                     # make sure all processes are training the same task
59 |                     dist.broadcast(task_id, 0)
60 |             self.step += 1
61 |             task = self.names[task_id.cpu().item()]
62 |             iter_ = self.name2iter[task]
63 |             try:
64 |                 batch = next(iter_)
65 |             except StopIteration:
66 |                 epoch_id += 1
67 |                 # In distributed mode, calling the set_epoch() method at the beginning of each epoch
68 |                 # before creating the DataLoader iterator is necessary to make shuffling work properly
69 |                 # across multiple epochs. Otherwise, the same ordering will always be used.
70 |                 self.name2pre_epoch[task](epoch_id)
71 |                 iter_ = iter(self.name2loader[task])
72 |                 batch = next(iter_)
73 |                 self.name2iter[task] = iter_
74 |
75 |             yield task, batch
76 |
77 |
78 | def move_to_cuda(batch: Union[List, Tuple, Dict, torch.Tensor], device: torch.device):
79 |     if isinstance(batch, torch.Tensor):
80 |         return batch.to(device, non_blocking=True)
81 |     elif isinstance(batch, list):
82 |         return [move_to_cuda(t, device) for t in batch]
83 |     elif isinstance(batch, tuple):
84 |         return tuple(move_to_cuda(t, device) for t in batch)
85 |     elif isinstance(batch, dict):
86 |         return {n: move_to_cuda(t, device) for n, t in batch.items()}
87 |     return batch
88 |
89 |
90 | class PrefetchLoader(object):
91 |     """
92 |     overlap compute and cuda data transfer
93 |     """
94 |     def __init__(self, loader, device: torch.device):
95 |         self.loader = loader
96 |         self.device = device
97 |
98 |     def __iter__(self):
99 |         loader_it = iter(self.loader)
100 |         self.preload(loader_it)
101 |         batch = self.next(loader_it)
102 |         while batch is not None:
103 |             yield batch
104 |             batch = self.next(loader_it)
105 |
106 |     def __len__(self):
107 |         return len(self.loader)
108 |
109 |     def preload(self, it):
110 |         try:
111 |             self.batch = next(it)
112 |         except StopIteration:
113 |             self.batch = None
114 |             return
115 |         self.batch = move_to_cuda(self.batch, self.device)
116 |
117 |     def next(self, it):
118 |         batch = self.batch
119 |         self.preload(it)
120 |         return batch
121 |
122 |     def __getattr__(self, name):
123 |         method = self.loader.__getattribute__(name)
124 |         return method
125 |
126 |
127 | def build_dataloader(task, dataset, collate_fn, is_train: bool, opts):
128 |
129 |     batch_size = opts.train_batch_size if is_train else opts.val_batch_size
130 |     # if task == 'itm': batch_size = max(1, batch_size // 2)
131 |
132 |     if opts.local_rank == -1:
133 |         if is_train:
134 |             sampler: Union[
135 |                 RandomSampler, SequentialSampler, DistributedSampler
136 |             ] = RandomSampler(dataset)
137 |         else:
138 |             sampler = SequentialSampler(dataset)
139 |
140 |         size = torch.cuda.device_count() if torch.cuda.is_available() else 1
141 |         pre_epoch = lambda e: None
142 |
143 |         # DataParallel: scale the batch size by the number of GPUs
144 |         if size > 1:
145 |             batch_size *= size
146 |
147 |     else:
148 |         size = dist.get_world_size()
149 |         sampler = DistributedSampler(
150 |             dataset, num_replicas=size, rank=dist.get_rank(), shuffle=is_train
151 |         )
152 |         pre_epoch = sampler.set_epoch
153 |
154 |     loader = DataLoader(
155 |         dataset,
156 |         sampler=sampler,
157 |         batch_size=batch_size,
158 |         num_workers=opts.n_workers,
159 |         pin_memory=opts.pin_mem,
160 |         collate_fn=collate_fn,
161 |         drop_last=False,
162 |     )
163 |
164 |     return loader, pre_epoch
165 |
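[Editor's note] A hedged sketch of how MetaLoader mixes task loaders; the task names, datasets, and sampling ratios are made up for illustration. In training, the MetaLoader is typically further wrapped in PrefetchLoader so batches land on the GPU ahead of time:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    mlm_loader = DataLoader(TensorDataset(torch.arange(8)), batch_size=2)
    sap_loader = DataLoader(TensorDataset(torch.arange(8)), batch_size=2)
    noop = lambda epoch: None
    loaders = {
        'mlm': (mlm_loader, 5, noop),   # (loader, sampling ratio, pre-epoch hook)
        'sap': (sap_loader, 5, noop),
    }
    meta_loader = MetaLoader(loaders, accum_steps=1, distributed=False, device=torch.device('cpu'))
    for step, (task, batch) in enumerate(meta_loader):
        # dispatch on `task`; the iterator runs forever, so break on a step budget
        if step >= 100:
            break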
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/model/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/VLN-DUET/pretrain_src/model/__init__.py
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/model/ops.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 | from .transformer import TransformerEncoder, TransformerEncoderLayer
4 |
5 | # try:
6 | #     from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
7 | # except (ImportError, AttributeError) as e:
8 | #     # logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
9 | #     BertLayerNorm = torch.nn.LayerNorm
10 | BertLayerNorm = torch.nn.LayerNorm
11 |
12 | def create_transformer_encoder(config, num_layers, norm=False):
13 |     enc_layer = TransformerEncoderLayer(
14 |         config.hidden_size, config.num_attention_heads,
15 |         dim_feedforward=config.intermediate_size,
16 |         dropout=config.hidden_dropout_prob,
17 |         activation=config.hidden_act,
18 |         normalize_before=True
19 |     )
20 |     if norm:
21 |         norm_layer = BertLayerNorm(config.hidden_size, eps=1e-12)
22 |     else:
23 |         norm_layer = None
24 |     return TransformerEncoder(enc_layer, num_layers, norm=norm_layer, batch_first=True)
25 |
26 | def extend_neg_masks(masks, dtype=None):
27 |     """
28 |     mask from (N, L) into (N, 1(H), 1(L), L) and make it negative
29 |     """
30 |     if dtype is None:
31 |         dtype = torch.float
32 |     extended_masks = masks.unsqueeze(1).unsqueeze(2)
33 |     extended_masks = extended_masks.to(dtype=dtype)
34 |     extended_masks = (1.0 - extended_masks) * -10000.0
35 |     return extended_masks
36 |
37 | def gen_seq_masks(seq_lens, max_len=None):
38 |     if max_len is None:
39 |         max_len = max(seq_lens)
40 |     batch_size = len(seq_lens)
41 |     device = seq_lens.device
42 |
43 |     masks = torch.arange(max_len).unsqueeze(0).repeat(batch_size, 1).to(device)
44 |     masks = masks < seq_lens.unsqueeze(1)
45 |     return masks
46 |
47 | def pad_tensors_wgrad(tensors, lens=None):
48 |     """B x [T, ...] torch tensors"""
49 |     if lens is None:
50 |         lens = [t.size(0) for t in tensors]
51 |     max_len = max(lens)
52 |     batch_size = len(tensors)
53 |     hid = list(tensors[0].size()[1:])
54 |
55 |     device = tensors[0].device
56 |     dtype = tensors[0].dtype
57 |
58 |     output = []
59 |     for i in range(batch_size):
60 |         if lens[i] < max_len:
61 |             tmp = torch.cat(
62 |                 [tensors[i], torch.zeros([max_len-lens[i]]+hid, dtype=dtype).to(device)],
63 |                 dim=0
64 |             )
65 |         else:
66 |             tmp = tensors[i]
67 |         output.append(tmp)
68 |     output = torch.stack(output, 0)
69 |     return output
70 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/optim/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright (c) Microsoft Corporation.
3 | Licensed under the MIT license.
4 |
5 | """
6 | from .sched import noam_schedule, warmup_linear, get_lr_sched
7 | from .adamw import AdamW
8 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/optim/adamw.py:
--------------------------------------------------------------------------------
1 | """
2 | AdamW optimizer (weight decay fix)
3 | copied from huggingface (https://github.com/huggingface/transformers).
4 | """
5 |
6 | import math
7 | from typing import Callable, Iterable, Tuple
8 |
9 | import torch
10 |
11 | from torch.optim import Optimizer
12 |
13 | class AdamW(Optimizer):
14 |     """
15 |     Implements Adam algorithm with weight decay fix as introduced in `Decoupled Weight Decay Regularization
16 |     <https://arxiv.org/abs/1711.05101>`__.
17 |
18 |     Parameters:
19 |         params (:obj:`Iterable[torch.nn.parameter.Parameter]`):
20 |             Iterable of parameters to optimize or dictionaries defining parameter groups.
21 |         lr (:obj:`float`, `optional`, defaults to 1e-3):
22 |             The learning rate to use.
23 |         betas (:obj:`Tuple[float,float]`, `optional`, defaults to (0.9, 0.999)):
24 |             Adam's betas parameters (b1, b2).
25 |         eps (:obj:`float`, `optional`, defaults to 1e-6):
26 |             Adam's epsilon for numerical stability.
27 |         weight_decay (:obj:`float`, `optional`, defaults to 0):
28 |             Decoupled weight decay to apply.
29 |         correct_bias (:obj:`bool`, `optional`, defaults to `True`):
30 |             Whether or not to correct bias in Adam (for instance, in the BERT TF repository they use :obj:`False`).
31 |     """
32 |
33 |     def __init__(
34 |         self,
35 |         params: Iterable[torch.nn.parameter.Parameter],
36 |         lr: float = 1e-3,
37 |         betas: Tuple[float, float] = (0.9, 0.999),
38 |         eps: float = 1e-6,
39 |         weight_decay: float = 0.0,
40 |         correct_bias: bool = True,
41 |     ):
42 |         if lr < 0.0:
43 |             raise ValueError("Invalid learning rate: {} - should be >= 0.0".format(lr))
44 |         if not 0.0 <= betas[0] < 1.0:
45 |             raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0)".format(betas[0]))
46 |         if not 0.0 <= betas[1] < 1.0:
47 |             raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0)".format(betas[1]))
48 |         if not 0.0 <= eps:
49 |             raise ValueError("Invalid epsilon value: {} - should be >= 0.0".format(eps))
50 |         defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)
51 |         super().__init__(params, defaults)
52 |
53 |     def step(self, closure: Callable = None):
54 |         """
55 |         Performs a single optimization step.
56 |
57 |         Arguments:
58 |             closure (:obj:`Callable`, `optional`): A closure that reevaluates the model and returns the loss.
59 |         """
60 |         loss = None
61 |         if closure is not None:
62 |             loss = closure()
63 |
64 |         for group in self.param_groups:
65 |             for p in group["params"]:
66 |                 if p.grad is None:
67 |                     continue
68 |                 grad = p.grad.data
69 |                 if grad.is_sparse:
70 |                     raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")
71 |
72 |                 state = self.state[p]
73 |
74 |                 # State initialization
75 |                 if len(state) == 0:
76 |                     state["step"] = 0
77 |                     # Exponential moving average of gradient values
78 |                     state["exp_avg"] = torch.zeros_like(p.data)
79 |                     # Exponential moving average of squared gradient values
80 |                     state["exp_avg_sq"] = torch.zeros_like(p.data)
81 |
82 |                 exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
83 |                 beta1, beta2 = group["betas"]
84 |
85 |                 state["step"] += 1
86 |
87 |                 # Decay the first and second moment running average coefficient
88 |                 # In-place operations to update the averages at the same time
89 |                 exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
90 |                 exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
91 |                 denom = exp_avg_sq.sqrt().add_(group["eps"])
92 |
93 |                 step_size = group["lr"]
94 |                 if group["correct_bias"]:  # No bias correction for BERT
95 |                     bias_correction1 = 1.0 - beta1 ** state["step"]
96 |                     bias_correction2 = 1.0 - beta2 ** state["step"]
97 |                     step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
98 |
99 |                 p.data.addcdiv_(exp_avg, denom, value=-step_size)
100 |
101 |                 # Just adding the square of the weights to the loss function is *not*
102 |                 # the correct way of using L2 regularization/weight decay with Adam,
103 |                 # since that will interact with the m and v parameters in strange ways.
104 |                 #
105 |                 # Instead we want to decay the weights in a manner that doesn't interact
106 |                 # with the m/v parameters. This is equivalent to adding the square
107 |                 # of the weights to the loss with plain (non-momentum) SGD.
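                # [Editor's note] Concretely, the decoupled update performed here is
                #     p <- p - step_size * exp_avg / (sqrt(exp_avg_sq) + eps)
                #            - lr * weight_decay * p
                # so the decay term never flows through the exp_avg / exp_avg_sq
                # moment estimates.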
108 |                 # Add weight decay at the end (fixed version)
109 |                 if group["weight_decay"] > 0.0:
110 |                     p.data.add_(p.data, alpha=-group["lr"] * group["weight_decay"])
111 |
112 |         return loss
113 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/optim/lookahead.py:
--------------------------------------------------------------------------------
1 | # Lookahead implementation from https://github.com/rwightman/pytorch-image-models/blob/master/timm/optim/lookahead.py
2 |
3 | """ Lookahead Optimizer Wrapper.
4 | Implementation modified from: https://github.com/alphadl/lookahead.pytorch
5 | Paper: `Lookahead Optimizer: k steps forward, 1 step back` - https://arxiv.org/abs/1907.08610
6 | """
7 | import torch
8 | from torch.optim.optimizer import Optimizer
9 | from torch.optim import Adam
10 | from collections import defaultdict
11 |
12 | class Lookahead(Optimizer):
13 |     def __init__(self, base_optimizer, alpha=0.5, k=6):
14 |         if not 0.0 <= alpha <= 1.0:
15 |             raise ValueError(f'Invalid slow update rate: {alpha}')
16 |         if not 1 <= k:
17 |             raise ValueError(f'Invalid lookahead steps: {k}')
18 |         defaults = dict(lookahead_alpha=alpha, lookahead_k=k, lookahead_step=0)
19 |         self.base_optimizer = base_optimizer
20 |         self.param_groups = self.base_optimizer.param_groups
21 |         self.defaults = base_optimizer.defaults
22 |         self.defaults.update(defaults)
23 |         self.state = defaultdict(dict)
24 |         # manually add our defaults to the param groups
25 |         for name, default in defaults.items():
26 |             for group in self.param_groups:
27 |                 group.setdefault(name, default)
28 |
29 |     def update_slow(self, group):
30 |         for fast_p in group["params"]:
31 |             if fast_p.grad is None:
32 |                 continue
33 |             param_state = self.state[fast_p]
34 |             if 'slow_buffer' not in param_state:
35 |                 param_state['slow_buffer'] = torch.empty_like(fast_p.data)
36 |                 param_state['slow_buffer'].copy_(fast_p.data)
37 |             slow = param_state['slow_buffer']
38 |             slow.add_(fast_p.data - slow, alpha=group['lookahead_alpha'])
39 |             fast_p.data.copy_(slow)
40 |
41 |     def sync_lookahead(self):
42 |         for group in self.param_groups:
43 |             self.update_slow(group)
44 |
45 |     def step(self, closure=None):
46 |         # print(self.k)
47 |         # assert id(self.param_groups) == id(self.base_optimizer.param_groups)
48 |         loss = self.base_optimizer.step(closure)
49 |         for group in self.param_groups:
50 |             group['lookahead_step'] += 1
51 |             if group['lookahead_step'] % group['lookahead_k'] == 0:
52 |                 self.update_slow(group)
53 |         return loss
54 |
55 |     def state_dict(self):
56 |         fast_state_dict = self.base_optimizer.state_dict()
57 |         slow_state = {
58 |             (id(k) if isinstance(k, torch.Tensor) else k): v
59 |             for k, v in self.state.items()
60 |         }
61 |         fast_state = fast_state_dict['state']
62 |         param_groups = fast_state_dict['param_groups']
63 |         return {
64 |             'state': fast_state,
65 |             'slow_state': slow_state,
66 |             'param_groups': param_groups,
67 |         }
68 |
69 |     def load_state_dict(self, state_dict):
70 |         fast_state_dict = {
71 |             'state': state_dict['state'],
72 |             'param_groups': state_dict['param_groups'],
73 |         }
74 |         self.base_optimizer.load_state_dict(fast_state_dict)
75 |
76 |         # We want to restore the slow state, but share the param_groups reference
77 |         # with base_optimizer. This is a bit redundant but the least code
78 |         slow_state_new = False
79 |         if 'slow_state' not in state_dict:
80 |             print('Loading state_dict from optimizer without Lookahead applied.')
81 |             state_dict['slow_state'] = defaultdict(dict)
82 |             slow_state_new = True
83 |         slow_state_dict = {
84 |             'state': state_dict['slow_state'],
85 |             'param_groups': state_dict['param_groups'],  # this is pointless but saves code
86 |         }
87 |         super(Lookahead, self).load_state_dict(slow_state_dict)
88 |         self.param_groups = self.base_optimizer.param_groups  # make both ref the same container
89 |         if slow_state_new:
90 |             # reapply defaults to catch missing lookahead-specific ones
91 |             for name, default in self.defaults.items():
92 |                 for group in self.param_groups:
93 |                     group.setdefault(name, default)
94 |
95 | def LookaheadAdam(params, alpha=0.5, k=6, *args, **kwargs):
96 |     adam = Adam(params, *args, **kwargs)
97 |     return Lookahead(adam, alpha, k)
98 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/optim/misc.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright (c) Microsoft Corporation.
3 | Licensed under the MIT license.
4 |
5 | Misc lr helper
6 | """
7 | from torch.optim import Adam, Adamax
8 |
9 | from .adamw import AdamW
10 | from .rangerlars import RangerLars
11 |
12 | def build_optimizer(model, opts):
13 |     param_optimizer = list(model.named_parameters())
14 |     no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
15 |     optimizer_grouped_parameters = [
16 |         {'params': [p for n, p in param_optimizer
17 |                     if not any(nd in n for nd in no_decay)],
18 |          'weight_decay': opts.weight_decay},
19 |         {'params': [p for n, p in param_optimizer
20 |                     if any(nd in n for nd in no_decay)],
21 |          'weight_decay': 0.0}
22 |     ]
23 |
24 |     # choose the optimizer class (adam / adamax / adamw / rangerlars)
25 |     if opts.optim == 'adam':
26 |         OptimCls = Adam
27 |     elif opts.optim == 'adamax':
28 |         OptimCls = Adamax
29 |     elif opts.optim == 'adamw':
30 |         OptimCls = AdamW
31 |     elif opts.optim == 'rangerlars':
32 |         OptimCls = RangerLars
33 |     else:
34 |         raise ValueError('invalid optimizer')
35 |     optimizer = OptimCls(optimizer_grouped_parameters,
36 |                          lr=opts.learning_rate, betas=opts.betas)
37 |     return optimizer
38 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/optim/ralamb.py:
--------------------------------------------------------------------------------
1 | import torch, math
2 | from torch.optim.optimizer import Optimizer
3 |
4 | # RAdam + LARS
5 | class Ralamb(Optimizer):
6 |
7 |     def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
8 |         defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
9 |         self.buffer = [[None, None, None] for ind in range(10)]
10 |         super(Ralamb, self).__init__(params, defaults)
11 |
12 |     def __setstate__(self, state):
13 |         super(Ralamb, self).__setstate__(state)
14 |
15 |     def step(self, closure=None):
16 |
17 |         loss = None
18 |         if closure is not None:
19 |             loss = closure()
20 |
21 |         for group in self.param_groups:
22 |
23 |             for p in group['params']:
24 |                 if p.grad is None:
25 |                     continue
26 |                 grad = p.grad.data.float()
27 |                 if grad.is_sparse:
28 |                     raise RuntimeError('Ralamb does not support sparse gradients')
29 |
30 |                 p_data_fp32 = p.data.float()
31 |
32 |                 state = self.state[p]
33 |
34 |                 if len(state) == 0:
35 |                     state['step'] = 0
36 |                     state['exp_avg'] = torch.zeros_like(p_data_fp32)
37 |                     state['exp_avg_sq'] = torch.zeros_like(p_data_fp32)
38 |                 else:
39 |                     state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32)
40 |                     state['exp_avg_sq'] = state['exp_avg_sq'].type_as(p_data_fp32)
41 |
42 |                 exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
43 |                 beta1, beta2 = group['betas']
44 |
45 |                 # Decay the first and second moment running average coefficient
46 |                 # m_t
47 |                 exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
48 |                 # v_t
49 |                 exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
50 |
51 |                 state['step'] += 1
52 |                 buffered = self.buffer[int(state['step'] % 10)]
53 |
54 |                 if state['step'] == buffered[0]:
55 |                     N_sma, radam_step_size = buffered[1], buffered[2]
56 |                 else:
57 |                     buffered[0] = state['step']
58 |                     beta2_t = beta2 ** state['step']
59 |                     N_sma_max = 2 / (1 - beta2) - 1
60 |                     N_sma = N_sma_max - 2 * state['step'] * beta2_t / (1 - beta2_t)
61 |                     buffered[1] = N_sma
62 |
63 |                     # more conservative since it's an approximated value
64 |                     if N_sma >= 5:
65 |                         radam_step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
66 |                     else:
67 |                         radam_step_size = 1.0 / (1 - beta1 ** state['step'])
68 |                     buffered[2] = radam_step_size
69 |
70 |                 if group['weight_decay'] != 0:
71 |                     p_data_fp32.add_(p_data_fp32, alpha=-group['weight_decay'] * group['lr'])
72 |
73 |                 # more conservative since it's an approximated value
74 |                 radam_step = p_data_fp32.clone()
75 |                 if N_sma >= 5:
76 |                     denom = exp_avg_sq.sqrt().add_(group['eps'])
77 |                     radam_step.addcdiv_(exp_avg, denom, value=-radam_step_size * group['lr'])
78 |                 else:
79 |                     radam_step.add_(exp_avg, alpha=-radam_step_size * group['lr'])
80 |
81 |                 radam_norm = radam_step.pow(2).sum().sqrt()
82 |                 weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)
83 |                 if weight_norm == 0 or radam_norm == 0:
84 |                     trust_ratio = 1
85 |                 else:
86 |                     trust_ratio = weight_norm / radam_norm
87 |
88 |                 state['weight_norm'] = weight_norm
89 |                 state['adam_norm'] = radam_norm
90 |                 state['trust_ratio'] = trust_ratio
91 |
92 |                 if N_sma >= 5:
93 |                     p_data_fp32.addcdiv_(exp_avg, denom, value=-radam_step_size * group['lr'] * trust_ratio)
94 |                 else:
95 |                     p_data_fp32.add_(exp_avg, alpha=-radam_step_size * group['lr'] * trust_ratio)
96 |
97 |                 p.data.copy_(p_data_fp32)
98 |
99 |         return loss
100 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/optim/rangerlars.py:
--------------------------------------------------------------------------------
1 | import torch, math
2 | from torch.optim.optimizer import Optimizer
3 | import itertools as it
4 | from .lookahead import *
5 | from .ralamb import *
6 |
7 | # RAdam + LARS + Lookahead
8 |
9 | # Lookahead implementation from https://github.com/lonePatient/lookahead_pytorch/blob/master/optimizer.py
10 | # RAdam + LARS implementation from https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20
11 |
12 | def RangerLars(params, alpha=0.5, k=6, *args, **kwargs):
13 |     ralamb = Ralamb(params, *args, **kwargs)
14 |     return Lookahead(ralamb, alpha, k)
15 |
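[Editor's note] A hedged sketch of what this factory composes; the model and hyper-parameters are illustrative only:

    import torch
    model = torch.nn.Linear(10, 2)
    optimizer = RangerLars(model.parameters(), lr=1e-3)   # == Lookahead(Ralamb(...), alpha=0.5, k=6)
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()   # inner Ralamb (RAdam + LARS) step; slow weights sync every k=6 steps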
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/optim/sched.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright (c) Microsoft Corporation.
3 | Licensed under the MIT license.
4 |
5 | optimizer learning rate scheduling helpers
6 | """
7 | from math import ceil
8 |
9 |
10 | def noam_schedule(step, warmup_step=4000):
11 |     """ original Transformer schedule """
12 |     if step <= warmup_step:
13 |         return step / warmup_step
14 |     return (warmup_step ** 0.5) * (step ** -0.5)
15 |
16 |
17 | def warmup_linear(step, warmup_step, tot_step):
18 |     """ BERT schedule """
19 |     if step < warmup_step:
20 |         return step / warmup_step
21 |     return max(0, (tot_step-step)/(tot_step-warmup_step))
22 |
23 |
24 | def get_lr_sched(global_step, opts):
25 |     # learning rate scheduling
26 |     lr_this_step = opts.learning_rate * warmup_linear(
27 |         global_step, opts.warmup_steps, opts.num_train_steps)
28 |     if lr_this_step <= 0:
29 |         lr_this_step = 1e-8
30 |     return lr_this_step
31 |
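[Editor's note] A small worked example of the warmup-then-linear-decay factor computed above, using the repository's defaults of warmup_steps=10000 and num_train_steps=100000:

    warmup_linear(5000, 10000, 100000)    # -> 0.5  (halfway through warmup)
    warmup_linear(10000, 10000, 100000)   # -> 1.0  (peak, end of warmup)
    warmup_linear(55000, 10000, 100000)   # -> 0.5  (halfway through the decay)
    # get_lr_sched scales these factors by opts.learning_rate and floors the result at 1e-8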
", 38 | ) 39 | parser.add_argument( 40 | "--gradient_accumulation_steps", 41 | type=int, 42 | default=16, 43 | help="Number of updates steps to accumualte before " 44 | "performing a backward/update pass.", 45 | ) 46 | parser.add_argument( 47 | "--learning_rate", 48 | default=3e-5, 49 | type=float, 50 | help="The initial learning rate for Adam.", 51 | ) 52 | parser.add_argument( 53 | "--valid_steps", default=1000, type=int, help="Run validation every X steps" 54 | ) 55 | parser.add_argument("--log_steps", default=1000, type=int) 56 | parser.add_argument( 57 | "--num_train_steps", 58 | default=100000, 59 | type=int, 60 | help="Total number of training updates to perform.", 61 | ) 62 | parser.add_argument( 63 | "--optim", 64 | default="adamw", 65 | choices=["adam", "adamax", "adamw"], 66 | help="optimizer", 67 | ) 68 | parser.add_argument( 69 | "--betas", default=[0.9, 0.98], nargs="+", help="beta for adam optimizer" 70 | ) 71 | parser.add_argument( 72 | "--dropout", default=0.1, type=float, help="tune dropout regularization" 73 | ) 74 | parser.add_argument( 75 | "--weight_decay", 76 | default=0.01, 77 | type=float, 78 | help="weight decay (L2) regularization", 79 | ) 80 | parser.add_argument( 81 | "--grad_norm", 82 | default=2.0, 83 | type=float, 84 | help="gradient clipping (-1 for no clipping)", 85 | ) 86 | parser.add_argument( 87 | "--warmup_steps", 88 | default=10000, 89 | type=int, 90 | help="Number of training steps to perform linear " "learning rate warmup for.", 91 | ) 92 | 93 | # device parameters 94 | parser.add_argument( 95 | "--seed", type=int, default=0, help="random seed for initialization" 96 | ) 97 | parser.add_argument( 98 | "--fp16", 99 | action="store_true", 100 | help="Whether to use 16-bit float precision instead of 32-bit", 101 | ) 102 | parser.add_argument( 103 | "--n_workers", type=int, default=4, help="number of data workers" 104 | ) 105 | parser.add_argument("--pin_mem", action="store_true", help="pin memory") 106 | 107 | # distributed computing 108 | parser.add_argument( 109 | "--local_rank", 110 | type=int, 111 | default=-1, 112 | help="local rank for distributed training on gpus", 113 | ) 114 | parser.add_argument( 115 | "--node_rank", 116 | type=int, 117 | default=0, 118 | help="Id of the node", 119 | ) 120 | parser.add_argument( 121 | "--world_size", 122 | type=int, 123 | default=1, 124 | help="Number of GPUs across all nodes", 125 | ) 126 | 127 | # can use config files 128 | parser.add_argument("--config", required=True, help="JSON config files") 129 | 130 | return parser 131 | 132 | 133 | def parse_with_config(parser): 134 | args = parser.parse_args() 135 | if args.config is not None: 136 | config_args = json.load(open(args.config)) 137 | override_keys = { 138 | arg[2:].split("=")[0] for arg in sys.argv[1:] if arg.startswith("--") 139 | } 140 | for k, v in config_args.items(): 141 | if k not in override_keys: 142 | setattr(args, k, v) 143 | del args.config 144 | return args 145 | -------------------------------------------------------------------------------- /VLN-DUET/pretrain_src/run_r2r_b16.sh: -------------------------------------------------------------------------------- 1 | 2 | NODE_RANK=0 3 | NUM_GPUS=2 4 | outdir=../datasets/R2R/exprs_map/pretrain/cmt-clip.vit.b16-mlm.sap-init.lxmert-aug.mp3d.prevalent.hm3d_gibson.envdrop 5 | 6 | # train 7 | CUDA_VISIBLE_DEVICES=$1 python -m torch.distributed.launch \ 8 | --master_port $2 \ 9 | --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \ 10 | train_r2r.py --world_size ${NUM_GPUS} \ 11 | --vlnbert cmt \ 12 
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/run_r2r_b16.sh:
--------------------------------------------------------------------------------
1 |
2 | NODE_RANK=0
3 | NUM_GPUS=2
4 | outdir=../datasets/R2R/exprs_map/pretrain/cmt-clip.vit.b16-mlm.sap-init.lxmert-aug.mp3d.prevalent.hm3d_gibson.envdrop
5 |
6 | # train
7 | CUDA_VISIBLE_DEVICES=$1 python -m torch.distributed.launch \
8 |     --master_port $2 \
9 |     --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
10 |     train_r2r.py --world_size ${NUM_GPUS} \
11 |     --vlnbert cmt \
12 |     --model_config config/r2r_model_config_clip-b16.json \
13 |     --config config/r2r_pretrain_hm3d+mp3d+gibson_clip-b16.json \
14 |     --output_dir $outdir
15 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/run_r2r_h14.sh:
--------------------------------------------------------------------------------
1 |
2 | NODE_RANK=0
3 | NUM_GPUS=2
4 | outdir=../datasets/R2R/exprs_map/pretrain/cmt-clip.vit.h14-mlm.sap-init.lxmert-aug.mp3d.prevalent.hm3d_gibson.envdrop
5 |
6 | # train
7 | CUDA_VISIBLE_DEVICES=$1 python -m torch.distributed.launch \
8 |     --master_port $2 \
9 |     --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
10 |     train_r2r.py --world_size ${NUM_GPUS} \
11 |     --vlnbert cmt \
12 |     --model_config config/r2r_model_config_clip-h14.json \
13 |     --config config/r2r_pretrain_hm3d+mp3d+gibson_clip-h14.json \
14 |     --output_dir $outdir
15 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/VLN-DUET/pretrain_src/utils/__init__.py
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/utils/distributed.py:
--------------------------------------------------------------------------------
1 | """
2 | Distributed tools
3 | """
4 | import os
5 | from pathlib import Path
6 | from pprint import pformat
7 | import pickle
8 |
9 | import torch
10 | import torch.distributed as dist
11 |
12 |
13 | def load_init_param(opts):
14 |     """
15 |     Load parameters for the rendezvous distributed procedure
16 |     """
17 |     # sync file
18 |     if opts.output_dir != "":
19 |         sync_dir = Path(opts.output_dir).resolve()
20 |         sync_dir.mkdir(parents=True, exist_ok=True)
21 |         sync_file = f"{sync_dir}/.torch_distributed_sync"
22 |     else:
23 |         raise RuntimeError("Can't find any sync dir")
24 |
25 |     # world size
26 |     if opts.world_size != -1:
27 |         world_size = opts.world_size
28 |     elif os.environ.get("WORLD_SIZE", "") != "":
29 |         world_size = int(os.environ["WORLD_SIZE"])
30 |     else:
31 |         raise RuntimeError("Can't find any world size")
32 |
33 |     # rank
34 |     if os.environ.get("RANK", "") != "":
35 |         # pytorch.distributed.launch provides this variable no matter what
36 |         rank = int(os.environ["RANK"])
37 |     else:
38 |         # if not provided, calculate the gpu rank
39 |         if opts.node_rank != -1:
40 |             node_rank = opts.node_rank
41 |         elif os.environ.get("NODE_RANK", "") != "":
42 |             node_rank = int(os.environ["NODE_RANK"])
43 |         else:
44 |             raise RuntimeError("Can't find any rank or node rank")
45 |
46 |         if opts.local_rank != -1:
47 |             local_rank = opts.local_rank
48 |         elif os.environ.get("LOCAL_RANK", "") != "":
49 |             local_rank = int(os.environ["LOCAL_RANK"])
50 |         else:
51 |             raise RuntimeError("Can't find any rank or local rank")
52 |
53 |         # WARNING: this assumes that each node has the same number of GPUs
54 |         n_gpus = torch.cuda.device_count()
55 |         rank = local_rank + node_rank * n_gpus
56 |     opts.rank = rank
57 |
58 |     return {
59 |         "backend": "nccl",
60 |         # "init_method": f"file://{sync_file}",
61 |         "rank": rank,
62 |         "world_size": world_size,
63 |     }
64 |
65 |
66 | def init_distributed(opts):
67 |     init_param = load_init_param(opts)
68 |     rank = init_param["rank"]
69 |
70 |     print(f"Init distributed {init_param['rank']} - {init_param['world_size']}")
71 |
72 |     dist.init_process_group(**init_param)
73 |
74 |
75 | def is_default_gpu(opts) -> bool:
76 |     return opts.local_rank == -1 or dist.get_rank() == 0
77 |
78 |
79 | def is_dist_avail_and_initialized():
80 |     if not dist.is_available():
81 |         return False
82 |     if not dist.is_initialized():
83 |         return False
84 |     return True
85 |
86 | def get_world_size():
87 |     if not is_dist_avail_and_initialized():
88 |         return 1
89 |     return dist.get_world_size()
90 |
91 | def all_gather(data):
92 |     """
93 |     Run all_gather on arbitrary picklable data (not necessarily tensors)
94 |     Args:
95 |         data: any picklable object
96 |     Returns:
97 |         list[data]: list of data gathered from each rank
98 |     """
99 |     world_size = get_world_size()
100 |     if world_size == 1:
101 |         return [data]
102 |
103 |     # serialize the data to a Tensor
104 |     buffer = pickle.dumps(data)
105 |     storage = torch.ByteStorage.from_buffer(buffer)
106 |     tensor = torch.ByteTensor(storage).to("cuda")
107 |
108 |     # obtain the Tensor size of each rank
109 |     local_size = torch.tensor([tensor.numel()], device="cuda")
110 |     size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)]
111 |     dist.all_gather(size_list, local_size)
112 |     size_list = [int(size.item()) for size in size_list]
113 |     max_size = max(size_list)
114 |
115 |     # receive Tensors from all ranks
116 |     # we pad the tensor because torch all_gather does not support
117 |     # gathering tensors of different shapes
118 |     tensor_list = []
119 |     for _ in size_list:
120 |         tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
121 |     if local_size != max_size:
122 |         padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda")
123 |         tensor = torch.cat((tensor, padding), dim=0)
124 |     dist.all_gather(tensor_list, tensor)
125 |
126 |     data_list = []
127 |     for size, tensor in zip(size_list, tensor_list):
128 |         buffer = tensor.cpu().numpy().tobytes()[:size]
129 |         data_list.append(pickle.loads(buffer))
130 |
131 |     return data_list
132 |
133 |
134 | def reduce_dict(input_dict, average=True):
135 |     """
136 |     Args:
137 |         input_dict (dict): all the values will be reduced
138 |         average (bool): whether to do average or sum
139 |     Reduce the values in the dictionary from all processes so that all processes
140 |     have the averaged results. Returns a dict with the same fields as
141 |     input_dict, after reduction.
142 |     """
143 |     world_size = get_world_size()
144 |     if world_size < 2:
145 |         return input_dict
146 |     with torch.no_grad():
147 |         names = []
148 |         values = []
149 |         # sort the keys so that they are consistent across processes
150 |         for k in sorted(input_dict.keys()):
151 |             names.append(k)
152 |             values.append(input_dict[k])
153 |         values = torch.stack(values, dim=0)
154 |         dist.all_reduce(values)
155 |         if average:
156 |             values /= world_size
157 |         reduced_dict = {k: v for k, v in zip(names, values)}
158 |     return reduced_dict
159 |
160 |
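[Editor's note] A hedged sketch of the gather/reduce helpers above inside an evaluation loop; the metric names and values are hypothetical. With world_size == 1 both calls are no-ops, so the sketch also runs single-process:

    import torch

    local_stats = {'success_rate': 0.61, 'spl': 0.55}    # per-rank picklable object
    gathered = all_gather(local_stats)                   # list with world_size dicts
    reduced = reduce_dict({'loss': torch.tensor(1.23)})  # tensor values averaged across ranks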
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/utils/logger.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright (c) Microsoft Corporation.
3 | Licensed under the MIT license.
4 |
5 | helper for logging
6 | NOTE: loggers are global objects, use with caution
7 | """
8 | import logging
9 | import math
10 |
11 | import tensorboardX
12 |
13 |
14 | _LOG_FMT = '%(asctime)s - %(levelname)s - %(name)s - %(message)s'
15 | _DATE_FMT = '%m/%d/%Y %H:%M:%S'
16 | logging.basicConfig(format=_LOG_FMT, datefmt=_DATE_FMT, level=logging.INFO)
17 | LOGGER = logging.getLogger('__main__')  # this is the global logger
18 |
19 |
20 | def add_log_to_file(log_path):
21 |     fh = logging.FileHandler(log_path)
22 |     formatter = logging.Formatter(_LOG_FMT, datefmt=_DATE_FMT)
23 |     fh.setFormatter(formatter)
24 |     LOGGER.addHandler(fh)
25 |
26 |
27 | class TensorboardLogger(object):
28 |     def __init__(self):
29 |         self._logger = None
30 |         self._global_step = 0
31 |
32 |     def create(self, path):
33 |         self._logger = tensorboardX.SummaryWriter(path)
34 |
35 |     def noop(self, *args, **kwargs):
36 |         return
37 |
38 |     def step(self):
39 |         self._global_step += 1
40 |
41 |     @property
42 |     def global_step(self):
43 |         return self._global_step
44 |
45 |     def log_scalar_dict(self, log_dict, prefix=''):
46 |         """ log a dictionary of scalar values"""
47 |         if self._logger is None:
48 |             return
49 |         if prefix:
50 |             prefix = f'{prefix}_'
51 |         for name, value in log_dict.items():
52 |             if isinstance(value, dict):
53 |                 self.log_scalar_dict(  # bug fix: prefix was passed twice (positionally and by keyword)
54 |                     value, prefix=f'{prefix}{name}')
55 |             else:
56 |                 self._logger.add_scalar(f'{prefix}{name}', value,
57 |                                         self._global_step)
58 |
59 |     def __getattr__(self, name):
60 |         if self._logger is None:
61 |             return self.noop
62 |         return self._logger.__getattribute__(name)
63 |
64 |
65 | TB_LOGGER = TensorboardLogger()
66 |
67 |
68 | class RunningMeter(object):
69 |     """ running meter of a scalar value
70 |         (useful for monitoring training loss)
71 |     """
72 |     def __init__(self, name, val=None, smooth=0.99):
73 |         self._name = name
74 |         self._sm = smooth
75 |         self._val = val
76 |
77 |     def __call__(self, value):
78 |         val = (value if self._val is None
79 |                else value*(1-self._sm) + self._val*self._sm)
80 |         if not math.isnan(val):
81 |             self._val = val
82 |
83 |     def __str__(self):
84 |         return f'{self._name}: {self._val:.4f}'
85 |
86 |     @property
87 |     def val(self):
88 |         if self._val is None:
89 |             return 0
90 |         return self._val
91 |
92 |     @property
93 |     def name(self):
94 |         return self._name
95 |
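[Editor's note] A minimal illustration of RunningMeter's exponential smoothing (with the default smooth=0.99, each update keeps 99% of the previous value):

    meter = RunningMeter('loss')
    meter(4.0)             # first value is taken as-is
    meter(2.0)
    str(meter)             # -> 'loss: 3.9800'  (0.99 * 4.0 + 0.01 * 2.0)
    TB_LOGGER.log_scalar_dict({'loss': meter.val}, prefix='train')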
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/utils/misc.py:
--------------------------------------------------------------------------------
1 | import random
2 | import numpy as np
3 | from typing import Tuple, Union, Dict, Any
4 |
5 | import torch
6 | import torch.distributed as dist
7 | from torch.nn.parallel import DistributedDataParallel as DDP
8 |
9 | from .distributed import init_distributed
10 | from .logger import LOGGER
11 |
12 |
13 | def set_random_seed(seed):
14 |     random.seed(seed)
15 |     np.random.seed(seed)
16 |     torch.manual_seed(seed)
17 |     torch.cuda.manual_seed_all(seed)
18 |
19 | def set_dropout(model, drop_p):
20 |     for name, module in model.named_modules():
21 |         # we might want to tune dropout for smaller datasets
22 |         if isinstance(module, torch.nn.Dropout):
23 |             if module.p != drop_p:
24 |                 module.p = drop_p
25 |                 LOGGER.info(f'{name} set to {drop_p}')
26 |
27 | def set_cuda(opts) -> Tuple[bool, int, torch.device]:
28 |     """
29 |     Initialize CUDA for distributed computing
30 |     """
31 |     if not torch.cuda.is_available():
32 |         assert opts.local_rank == -1, opts.local_rank
33 |         return True, 0, torch.device("cpu")
34 |
35 |     # get device settings
36 |     if opts.local_rank != -1:
37 |         init_distributed(opts)
38 |         torch.cuda.set_device(opts.local_rank)
39 |         device = torch.device("cuda", opts.local_rank)
40 |         n_gpu = 1
41 |         default_gpu = dist.get_rank() == 0
42 |         if default_gpu:
43 |             LOGGER.info(f"Found {dist.get_world_size()} GPUs")
44 |     else:
45 |         default_gpu = True
46 |         device = torch.device("cuda")
47 |         n_gpu = torch.cuda.device_count()
48 |
49 |     return default_gpu, n_gpu, device
50 |
51 |
52 | def wrap_model(
53 |     model: torch.nn.Module, device: torch.device, local_rank: int
54 | ) -> torch.nn.Module:
55 |     model.to(device)
56 |
57 |     if local_rank != -1:
58 |         model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
59 |         # At the time of DDP wrapping, parameters and buffers (i.e., model.state_dict())
60 |         # on rank0 are broadcast to all other ranks.
61 |     elif torch.cuda.device_count() > 1:
62 |         LOGGER.info("Using data parallel")
63 |         model = torch.nn.DataParallel(model)
64 |
65 |     return model
66 |
67 |
68 | class NoOp(object):
69 |     """ useful no-op for distributed training """
70 |     def __getattr__(self, name):
71 |         return self.noop
72 |
73 |     def noop(self, *args, **kwargs):
74 |         return
75 |
76 |
--------------------------------------------------------------------------------
/VLN-DUET/pretrain_src/utils/save.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright (c) Microsoft Corporation.
3 | Licensed under the MIT license.
4 |
5 | saving utilities
6 | """
7 | import json
8 | import os
9 | import torch
10 |
11 |
12 | def save_training_meta(args):
13 |     os.makedirs(os.path.join(args.output_dir, 'logs'), exist_ok=True)
14 |     os.makedirs(os.path.join(args.output_dir, 'ckpts'), exist_ok=True)
15 |
16 |     with open(os.path.join(args.output_dir, 'logs', 'training_args.json'), 'w') as writer:
17 |         json.dump(vars(args), writer, indent=4)
18 |     model_config = json.load(open(args.model_config))
19 |     with open(os.path.join(args.output_dir, 'logs', 'model_config.json'), 'w') as writer:
20 |         json.dump(model_config, writer, indent=4)
21 |
22 |
23 | class ModelSaver(object):
24 |     def __init__(self, output_dir, prefix='model_step', suffix='pt'):
25 |         self.output_dir = output_dir
26 |         self.prefix = prefix
27 |         self.suffix = suffix
28 |
29 |     def save(self, model, step, optimizer=None):
30 |         output_model_file = os.path.join(self.output_dir,
31 |                                          f"{self.prefix}_{step}.{self.suffix}")
32 |         state_dict = {}
33 |         for k, v in model.state_dict().items():
34 |             if k.startswith('module.'):
35 |                 k = k[7:]
36 |             if isinstance(v, torch.Tensor):
37 |                 state_dict[k] = v.cpu()
38 |             else:
39 |                 state_dict[k] = v
40 |         torch.save(state_dict, output_model_file)
41 |         if optimizer is not None:
42 |             dump = {'step': step, 'optimizer': optimizer.state_dict()}
43 |             if hasattr(optimizer, '_amp_stash'):
44 |                 pass  # TODO fp16 optimizer
45 |             torch.save(dump, f'{self.output_dir}/train_state_{step}.pt')
46 |
--------------------------------------------------------------------------------
/VLN-DUET/requirements.txt:
--------------------------------------------------------------------------------
1 | # jsonlines==2.0.0
2 | # tqdm==4.62.0
3 | # easydict==1.9
4 | # Shapely==1.7.1
5 | # h5py==2.10.0
6 | timm==0.4.9
7 | # networkx==2.5.1
8 | # numpy==1.20.3
9 | # tensorboardX==2.4.1
10 | protobuf==3.20.1
11 | line_profiler==4.0.3
12 | # transformers==4.12.5
--------------------------------------------------------------------------------
/files/overall.jpg:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/wz0919/ScaleVLN/1189fe898462e2e10908631070bcf2d4ec2204b2/files/overall.jpg --------------------------------------------------------------------------------