└── README.md
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | Awesome Visual-Language-Navigation (VLN)
4 |
5 |
6 |
7 | This repository contains a curated list of resources on Vision-and-Language Navigation (VLN).
8 | It also includes related papers from adjacent areas such as Learning-based Navigation and Occupancy Perception.
9 |
10 | If you notice any missing papers, **feel free to [*create a pull request*](https://github.com/KwanWaiPang/Awesome-Transformer-based-SLAM/blob/pdf/How-to-PR.md) or [*open an issue*](https://github.com/KwanWaiPang/Awesome-VLN/issues/new)**.
11 |
12 | Contributions in any form to make this list more comprehensive are welcome.
13 |
14 | If you find this repository useful, a simple star is the best affirmation. 😊
15 |
16 | Feel free to share this list with others!
17 |
18 | # Overview
19 | - [VLN](#VLN)
20 |   - [Simulator and Dataset](#Simulator-and-Dataset)
21 |   - [Survey Paper](#Survey-Paper)
22 | - [Learning-based Navigation](#Learning-based-Navigation)
23 |   - [Mapless navigation](#Mapless-navigation)
24 | - [Others](#Others)
25 |   - [Occupancy Perception](#Occupancy-Perception)
26 |   - [VLA](#VLA)
27 |
28 | # VLN
29 |
30 |
31 | | Year | Venue | Paper Title | Repository | Note |
32 | |:----:|:-----:| ----------- |:----------:|:----:|
33 | |2025|`arXiv`|[Embodied Navigation Foundation Model](https://arxiv.org/pdf/2509.12129)|---|[website](https://pku-epic.github.io/NavFoM-Web/)<br>NavFoM|
34 | |2025|`arXiv`|[InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf)|[](https://github.com/InternRobotics/InternNav) |[website](https://internrobotics.github.io/internvla-n1.github.io/)|
35 | |2025|`arXiv`|[Odyssey: Open-world quadrupeds exploration and manipulation for long-horizon tasks](https://arxiv.org/pdf/2508.08240)|---|[website](https://kaijwang.github.io/odyssey.github.io/)|
36 | |2025|`arXiv`|[OpenVLN: Open-world aerial Vision-Language Navigation](https://arxiv.org/pdf/2511.06182)|---|---|
37 | |2025|`arXiv`|[VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation](https://arxiv.org/pdf/2509.18592)|[](https://github.com/VLN-Zero/vln-zero.github.io)|[website](https://vln-zero.github.io/)|
38 | |2025|`arXiv`|[JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation](https://arxiv.org/pdf/2509.22548)|[](https://github.com/MIV-XJTU/JanusVLN)|[website](https://miv-xjtu.github.io/JanusVLN.github.io/)|
39 | |2025|`arXiv`|[StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling](https://arxiv.org/pdf/2507.05240)|[](https://github.com/InternRobotics/StreamVLN)|[website](https://streamvln.github.io/)|
40 | |2025|`arXiv`|[GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation](https://arxiv.org/pdf/2509.10454)|[](https://github.com/bagh2178/GC-VLN)|[website](https://bagh2178.github.io/GC-VLN/)|
41 | |2025|`arXiv`|[Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting](https://arxiv.org/pdf/2509.20499)|---|---|
42 | |2025|`arXiv`|[SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning](https://arxiv.org/pdf/2509.20739)|---|---|
43 | |2025|`arXiv`|[Zero-shot Object-Centric Instruction Following: Integrating Foundation Models with Traditional Navigation](https://arxiv.org/pdf/2411.07848)|---|[website](https://sonia-raychaudhuri.github.io/nlslam/)|
44 | |2025|`RSS`|[Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks](https://arxiv.org/pdf/2412.06224)|[](https://github.com/jzhzhang/Uni-NaVid)|[website](https://pku-epic.github.io/Uni-NaVid/)|
45 | |2025|`RSS`|[NaVILA: Legged Robot Vision-Language-Action Model for Navigation](https://arxiv.org/pdf/2412.04453)|[](https://github.com/AnjieCheng/NaVILA)|[website](https://navila-bot.github.io/)|
46 | |2025|`ICCV`|[Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation](https://arxiv.org/pdf/2507.04047)|[](https://github.com/MTU3D/MTU3D)|[website](https://mtu3d.github.io/)|
47 | | 2025 | `ACL` | [MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation](https://arxiv.org/pdf/2502.13451) |---|---|
48 | | 2025 | `CVPR` | [Scene Map-based Prompt Tuning for Navigation Instruction Generation](https://openaccess.thecvf.com/content/CVPR2025/papers/Fan_Scene_Map-based_Prompt_Tuning_for_Navigation_Instruction_Generation_CVPR_2025_paper.pdf) |---|---|
49 | | 2025 | `ACL` | [NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM](https://arxiv.org/pdf/2502.11142) | [](https://github.com/MrZihan/NavRAG) |---|
50 | | 2025 | `ICLR` | [Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel](https://arxiv.org/abs/2412.08467) | [](https://github.com/wz0919/VLN-SRDF) |---|
51 | | 2025 | `ICCV` | [SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts](https://arxiv.org/pdf/2412.05552) | [](https://github.com/GengzeZhou/SAME) |---|
52 | | 2025 | `ICCV` | [NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments](https://arxiv.org/pdf/2506.23468) | [](https://github.com/Feliciaxyao/NavMorph) |---|
53 | | 2025 | `AAAI` | [Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation](https://arxiv.org/abs/2407.05890) |---|---|
54 | | 2025 | `arXiv` | [EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation](https://arxiv.org/pdf/2506.01551) | [](https://github.com/expectorlin/EvolveNav) |---|
55 | | 2025 | `CVPR` | [Do Visual Imaginations Improve Vision-and-Language Navigation Agents?](https://arxiv.org/pdf/2503.16394) |---|---|
56 | | 2024 | `AAAI` | [VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation](https://arxiv.org/abs/2402.03561) |---|---|
57 | | 2024 | `CVPR` | [Volumetric Environment Representation for Vision-Language Navigation](https://arxiv.org/pdf/2403.14158) | [](https://github.com/DefaultRui/VLN-VER) |---|
58 | |2024|`ECCV`|[NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models](https://arxiv.org/pdf/2407.12366)|[](https://github.com/GengzeZhou/NavGPT-2)|---|
59 | | 2024 | `CVPR` | [Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation](https://arxiv.org/pdf/2404.01943) | [](https://github.com/MrZihan/HNR-VLN) |---|
60 | | 2024 | `TPAMI` | [ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments](https://arxiv.org/abs/2304.03047v2) | [](https://github.com/MarSaKi/ETPNav) |---|
61 | | 2024 | `MM` | [Narrowing the Gap between Vision and Action in Navigation](https://www.arxiv.org/abs/2408.10388) |---|---|
62 | | 2024 | `ECCV` | [LLM as Copilot for Coarse-grained Vision-and-Language Navigation](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00833.pdf) |---|---|
63 | | 2024 | `ICRA` | [Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions](https://ieeexplore.ieee.org/abstract/document/10611565) | [](https://github.com/LYX0501/DiscussNav) |---|
64 | | 2024 | `ACL` | [MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation](https://arxiv.org/abs/2401.07314) | [](https://chen-judge.github.io/MapGPT/) |---|
65 | | 2024 |`arXiv`| [MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains](https://arxiv.org/pdf/2405.10620) |---|---|
66 | | 2024 |`arXiv`| [InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment](https://arxiv.org/pdf/2406.04882) | [](https://github.com/LYX0501/InstructNav) |---|
67 | | 2024 | `AAAI` | [NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models](https://arxiv.org/abs/2305.16986) | [](https://github.com/GengzeZhou/NavGPT) |---|
68 | | 2024 | `NAACL Findings` | [LangNav: Language as a Perceptual Representation for Navigation](https://aclanthology.org/2024.findings-naacl.60.pdf) | [](https://github.com/pbw-Berwin/LangNav) |---|
69 | | 2024 |`arXiv`| [NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning](https://arxiv.org/abs/2403.07376) | [](https://github.com/expectorlin/NavCoT) |---|
70 | | 2024 | `CVPR` | [Towards Learning a Generalist Model for Embodied Navigation](https://arxiv.org/abs/2312.02010) | [](https://github.com/LaVi-Lab/NaviLLM) |NaviLLM|
71 | | 2024 | `RSS` | [NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation](https://arxiv.org/pdf/2402.15852) | [](https://github.com/jzhzhang/NaVid-VLN-CE) |[website](https://pku-epic.github.io/NaVid/)|
72 | | 2024 |`EMNLP`| [Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation](https://arxiv.org/pdf/2409.17313) | [](https://github.com/zehao-wang/navnuances) |---|
73 | | 2023 | `CVPR` | [Behavioral Analysis of Vision-and-Language Navigation Agents](https://yoark.github.io/assets/pdf/vln-behave/vln-behave.pdf) | [](https://github.com/Yoark/vln-behave) |---|
74 | | 2023 | `ICCV` | [March in Chat: Interactive Prompting for Remote Embodied Referring Expression](https://openaccess.thecvf.com//content/ICCV2023/papers/Qiao_March_in_Chat_Interactive_Prompting_for_Remote_Embodied_Referring_Expression_ICCV_2023_paper.pdf) | [](https://github.com/YanyuanQiao/MiC) |---|
75 | | 2023 |`arXiv`| [Vision and Language Navigation in the Real World via Online Visual Language Mapping](https://arxiv.org/pdf/2310.10822) |---|---|
76 | | 2023 | `NeurIPS` | [A2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models](https://peihaochen.github.io/files/publications/A2Nav.pdf) |---|---|
77 | | 2023 | `ICCV` | [BEVBert: Multimodal Map Pre-training for Language-guided Navigation](https://arxiv.org/pdf/2212.04385) | [](https://github.com/MarSaKi/VLN-BEVBert) |---|
78 | |2023|`CVPR`|[CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation](https://openaccess.thecvf.com/content/CVPR2023/papers/Gadre_CoWs_on_Pasture_Baselines_and_Benchmarks_for_Language-Driven_Zero-Shot_Object_CVPR_2023_paper.pdf)|[](https://github.com/real-stanford/cow)|CLIP on Wheels<br>[website](https://cow.cs.columbia.edu/)|
79 | |2023|`NeurIPS`|[Frequency-enhanced data augmentation for vision-and-language navigation](https://proceedings.neurips.cc/paper_files/paper/2023/file/0d9e08f247ca7fbbfd5e50b7ff9cf357-Paper-Conference.pdf)|[](https://github.com/hekj/FDA)|---|
80 | |2023|`NeurIPS`|[Find what you want: Learning demand-conditioned object attribute space for demand-driven navigation](https://proceedings.neurips.cc/paper_files/paper/2023/file/34e278fbbd7d6d7d788c98065988e1a9-Paper-Conference.pdf)|[](https://github.com/whcpumpkin/Demand-driven-navigation)|[website](https://sites.google.com/view/demand-driven-navigation)|
81 | |2023|`ACL`|[Aerial vision-and-dialog navigation](https://arxiv.org/pdf/2205.12219)|[](https://github.com/eric-ai-lab/Aerial-Vision-and-Dialog-Navigation)|[website](https://sites.google.com/view/aerial-vision-and-dialog/home)|
82 | | 2023 | `AAAI` | [Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation](https://arxiv.org/pdf/2302.06072) |---|---|
83 | | 2023 | `ICCV` | [Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation](https://arxiv.org/abs/2308.12587) | [](https://github.com/CSir1996/VLN-GELA) |---|
84 | | 2023 | `CVPR` | [Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation](https://openaccess.thecvf.com/content/CVPR2023/papers/Gao_Adaptive_Zone-Aware_Hierarchical_Planner_for_Vision-Language_Navigation_CVPR_2023_paper.pdf) | [](https://github.com/chengaopro/AZHP) |---|
85 | | 2023 | `ICCV` | [Bird's-Eye-View Scene Graph for Vision-Language Navigation](https://arxiv.org/abs/2308.04758) |---|---|
86 | | 2023 |`EMNLP`| [Masked Path Modeling for Vision-and-Language Navigation](https://arxiv.org/abs/2305.14268) |---|---|
87 | | 2023 | `CVPR` | [Improving Vision-and-Language Navigation by Generating Future-View Image Semantics](https://arxiv.org/pdf/2304.04907) | [](https://github.com/jialuli-luka/VLN-SIG) |---|
88 | | 2023 | `TPAMI` | [HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation](https://ieeexplore.ieee.org/document/10006384) |---|---|
89 | | 2023 | `TPAMI` | [Learning to Follow and Generate Instructions for Language-Capable Navigation](https://ieeexplore.ieee.org/document/10359152) |---|---|
90 | | 2023 | `CVPR` | [A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning](https://arxiv.org/pdf/2210.03112) | --- |[Dataset](https://github.com/google-research-datasets/RxR/tree/main/marky-mT5)|
91 | | 2023 | `CVPR` | [Lana: A Language-Capable Navigator for Instruction Following and Generation](https://arxiv.org/abs/2303.08409) | [](https://github.com/wxh1996/LANA-VLN) |---|
92 | | 2023 | `CVPR` | [KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation](https://openaccess.thecvf.com/content/CVPR2023/papers/Li_KERM_Knowledge_Enhanced_Reasoning_for_Vision-and-Language_Navigation_CVPR_2023_paper.pdf) | [](https://github.com/xiangyangli-cn/KERM) |---|
93 | | 2023 | `MM` | [PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation](https://arxiv.org/pdf/2305.11918) |---|---|
94 | | 2023 |`arXiv`| [CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation](https://arxiv.org/abs/2103.00852) |---|---|
95 | | 2023 | `ACL` | [VLN-Trans: Translator for the Vision and Language Navigation Agent](https://arxiv.org/pdf/2302.09230) | [](https://github.com/HLR/VLN-trans) |---|
96 | | 2022 | `ACL` | [Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration](https://arxiv.org/pdf/2203.04006) | [](https://github.com/liangcici/Probes-VLN) |---|
97 | | 2022 | `CVPR` | [Less is More: Generating Grounded Navigation Instructions from Landmarks](https://arxiv.org/pdf/2004.14973) | [](https://github.com/google-research-datasets/RxR/tree/main/marky-mT5) |---|
98 | | 2022 | `MM` | [Target-Driven Structured Transformer Planner for Vision-Language Navigation](https://arxiv.org/pdf/2207.11201) | [](https://github.com/YushengZhao/TD-STP) |---|
99 | | 2022 | `CVPR` | [HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation](https://ieeexplore.ieee.org/document/9880046) | [](https://github.com/YanyuanQiao/HOP-VLN) |---|
100 | | 2022 | `COLING` | [LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation](https://aclanthology.org/2022.coling-1.505.pdf) | [](https://github.com/HLR/LOViS) |---|
101 | | 2022 | `NAACL` | [Diagnosing Vision-and-Language Navigation: What Really Matters](https://aclanthology.org/2022.naacl-main.438.pdf) | [](https://github.com/VegB/Diagnose_VLN) |---|
102 | | 2022 |`arXiv`| [CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation](https://arxiv.org/pdf/2211.16649) |---|---|
103 | | 2022 | `CVPR` | [Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation](https://arxiv.org/abs/2203.02764) | [](https://github.com/YicongHong/Discrete-Continuous-VLN) |---|
104 | | 2021 | `CVPR` | [Scene-Intuitive Agent for Remote Embodied Visual Grounding](https://arxiv.org/pdf/2103.12944) |---|---|
105 | | 2021 | `NeurIPS` | [SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation](https://arxiv.org/abs/2110.14143) |---|---|
106 | | 2021 | `ICCV` | [The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation](https://openaccess.thecvf.com/content/ICCV2021/papers/Qi_The_Road_To_Know-Where_An_Object-and-Room_Informed_Sequential_BERT_for_ICCV_2021_paper.pdf) | [](https://github.com/YuankaiQi/ORIST) |---|
107 | | 2021 | `CVPR` | [VLN BERT: A Recurrent Vision-and-Language BERT for Navigation](https://openaccess.thecvf.com/content/CVPR2021/papers/Hong_VLN_BERT_A_Recurrent_Vision-and-Language_BERT_for_Navigation_CVPR_2021_paper.pdf) | [](https://github.com/YicongHong/Recurrent-VLN-BERT) |---|
108 | | 2021 | `EACL` | [On the Evaluation of Vision-and-Language Navigation Instructions](https://arxiv.org/abs/2101.10504) |---|---|
109 | |2022|`CoRL`| [Do As I Can, Not As I Say: Grounding Language in Robotic Affordances](https://say-can.github.io/assets/palm_saycan.pdf) | [](https://say-can.github.io/) |SayCan|
110 | |2021|`NeurIPS`|[History aware multimodal transformer for vision-and-language navigation](https://proceedings.neurips.cc/paper/2021/file/2e5c2cb8d13e8fba78d95211440ba326-Paper.pdf)|[](https://github.com/cshizhe/VLN-HAMT)|[website](https://cshizhe.github.io/projects/vln_hamt.html)|
111 | |2021|`CVPR`|[Room-and-object aware knowledge reasoning for remote embodied referring expression](https://openaccess.thecvf.com/content/CVPR2021/papers/Gao_Room-and-Object_Aware_Knowledge_Reasoning_for_Remote_Embodied_Referring_Expression_CVPR_2021_paper.pdf)|[](https://github.com/alloldman/CKR)|---|
112 | |2021|`ICCV`|[Vision-language navigation with random environmental mixup](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Vision-Language_Navigation_With_Random_Environmental_Mixup_ICCV_2021_paper.pdf)|---|---|
113 | |2021|`ICRA`|[Hierarchical cross-modal agent for robotics vision-and-language navigation](https://arxiv.org/pdf/2104.10674)|---|Robo-VLN<br>first continuous-action-space VLN|
114 | | 2020 | `CVPR` | [Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training](https://arxiv.org/abs/2002.10638) | [](https://github.com/weituo12321/PREVALENT) |---|
115 | |2020|`ECCV`|[Active visual information gathering for vision-language navigation](https://arxiv.org/pdf/2007.08037)|[](https://github.com/HanqingWangAI/Active_VLN)|---|
116 | |2020|`CVPR`|[Vision-language navigation with self-supervised auxiliary reasoning tasks](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_Vision-Language_Navigation_With_Self-Supervised_Auxiliary_Reasoning_Tasks_CVPR_2020_paper.pdf)|---|---|
117 | |2020|`ECCV`|[Improving vision-and-language navigation with image-text pairs from the web](https://arxiv.org/pdf/2004.14973)|---|VLN-BERT|
118 | | 2020 | `ECCV` | [Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments](https://arxiv.org/abs/2004.02857) | [](https://github.com/jacobkrantz/VLN-CE) |---|
119 | |2019|`EMNLP`|[Robust navigation with language pretraining and stochastic sampling](https://arxiv.org/pdf/1909.02244)|[](https://github.com/xjli/r2r_vln)|---|
120 | |2019|`CoRL`|[Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight](https://arxiv.org/pdf/1910.09664)|[](https://github.com/lil-lab/drif)|---|
121 | |2018|`NeurIPS`|[Speaker-follower models for vision-and-language navigation](https://arxiv.org/pdf/1806.02724)|[](https://github.com/ronghanghu/speaker_follower)|[website](https://ronghanghu.com/speaker_follower/)|
122 | |2018|`RSS`|[Following high-level navigation instructions on a simulated quadcopter with imitation learning](https://arxiv.org/pdf/1806.00047)|[](https://github.com/lil-lab/gsmn)|---|
123 |
124 | ## Simulator and Dataset
125 |
126 |
127 |
128 | | Year | Venue | Paper Title | Repository | Note |
129 | |:----:|:-----:| ----------- |:----------:|:----:|
130 | |2025|`arXiv`|[InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf)|[](https://github.com/InternRobotics/InternNav)|[website](https://internrobotics.github.io/internvla-n1.github.io/)<br>InternData-N1 Dataset|
131 | |2025|`arXiv`|[HA-VLN: A Benchmark for Human-Aware Navigation in Discrete–Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard](https://arxiv.org/pdf/2503.14229)|[](https://github.com/F1y1113/HA-VLN)|[website](https://ha-vln-project.vercel.app/)|
132 | |2023|`ICCV`|[Learning Vision-and-Language Navigation from YouTube Videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_Learning_Vision-and-Language_Navigation_from_YouTube_Videos_ICCV_2023_paper.pdf)|[](https://github.com/JeremyLinky/YouTube-VLN)|YouTube-VLN|
133 | |2023|`ICCV`|[AerialVLN: Vision-and-Language Navigation for UAVs](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_AerialVLN_Vision-and-Language_Navigation_for_UAVs_ICCV_2023_paper.pdf)|[](https://github.com/AirVLN/AirVLN)|AerialVLN|
134 | |2023|`ICCV`|[Scaling data generation in vision-and-language navigation](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_Scaling_Data_Generation_in_Vision-and-Language_Navigation_ICCV_2023_paper.pdf)|[](https://github.com/wz0919/ScaleVLN)|ScaleVLN|
135 | |2022|`CVPR`|[Habitat-web: Learning embodied object-search strategies from human demonstrations at scale](https://openaccess.thecvf.com/content/CVPR2022/papers/Ramrakhya_Habitat-Web_Learning_Embodied_Object-Search_Strategies_From_Human_Demonstrations_at_Scale_CVPR_2022_paper.pdf)|[](https://github.com/Ram81/habitat-web)|[website](https://ram81.github.io/projects/habitat-web)|
136 | |2022|`CVPR`|[Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation](https://openaccess.thecvf.com/content/CVPR2022/papers/Hong_Bridging_the_Gap_Between_Learning_in_Discrete_and_Continuous_Environments_CVPR_2022_paper.pdf)|[](https://github.com/YicongHong/Discrete-Continuous-VLN)|R2R-CE|
137 | |2021|`CVPR`|[SOON: Scenario Oriented Object Navigation with Graph-based Exploration](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhu_SOON_Scenario_Oriented_Object_Navigation_With_Graph-Based_Exploration_CVPR_2021_paper.pdf)|[](https://github.com/ZhuFengdaaa/SOON)|SOON|
138 | |2020|`CVPR`|[ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks](https://openaccess.thecvf.com/content_CVPR_2020/papers/Shridhar_ALFRED_A_Benchmark_for_Interpreting_Grounded_Instructions_for_Everyday_Tasks_CVPR_2020_paper.pdf)|[](https://github.com/askforalfred/alfred)|ALFRED<br>[website](https://askforalfred.com/)|
139 | |2020|`CVPR`|[REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments](https://openaccess.thecvf.com/content_CVPR_2020/papers/Qi_REVERIE_Remote_Embodied_Visual_Referring_Expression_in_Real_Indoor_Environments_CVPR_2020_paper.pdf)|[](https://github.com/YuankaiQi/REVERIE)|REVERIE|
140 | |2020|`EMNLP`|[Where are you? localization from embodied dialog](https://arxiv.org/pdf/2011.08277)|---|[website](https://meerahahn.github.io/way/)|
141 | |2020|`CoRL`|[Vision-and-Dialog Navigation](https://arxiv.org/pdf/1907.04957)|[](https://github.com/mmurray/cvdn/)|CVDN<br>[website](https://cvdn.dev/)|
142 | |2020|`EMNLP`|[Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding](https://arxiv.org/pdf/2010.07954)|[](https://github.com/google-research-datasets/RxR)|RxR|
143 | |2019|`EMNLP`|[Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning](https://arxiv.org/pdf/1909.01871)|[](https://github.com/khanhptnk/hanna)|HANNA|
144 | |2019|`ACL`|[Stay on the path: Instruction fidelity in vision-and-language navigation](https://arxiv.org/pdf/1905.12255)|---|R4R<br>[website](https://github.com/google-research/google-research/tree/master/r4r)|
145 | |2019|`arXiv`|[Learning to navigate unseen environments: Back translation with environmental dropout](https://arxiv.org/pdf/1904.04195)|[](https://github.com/airsplay/R2R-EnvDrop)|R2R-EnvDrop-CE|
146 | |2018|`CVPR`|[Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments](https://openaccess.thecvf.com/content_cvpr_2018/papers/Anderson_Vision-and-Language_Navigation_Interpreting_CVPR_2018_paper.pdf)|[](https://github.com/peteanderson80/Matterport3DSimulator)|R2R<br>[website](https://bringmeaspoon.org/)|
147 |
148 |
149 | ## Survey Paper
150 |
151 |
152 | | Year | Venue | Paper Title | Repository | Note |
153 | |:----:|:-----:| ----------- |:----------:|:----:|
154 | |2025|`arXiv`|[Sensing, Social, and Motion Intelligence in Embodied Navigation: A Comprehensive Survey](https://arxiv.org/pdf/2508.15354)|[](https://github.com/Franky-X/Awesome-Embodied-Navigation)|Survey on embodied navigation<br>[blog](https://kwanwaipang.github.io/Enbodied-Navigation/)|
155 | |2025|`IEEE/ASME Transactions on Mechatronics`|[Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI](https://arxiv.org/pdf/2407.06886)|[](https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List)|---|
156 | |2024|`Transactions on Machine Learning Research`|[Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models](https://openreview.net/pdf?id=yiqeh2ZYUh)|[](https://github.com/zhangyuejoslin/VLN-Survey-with-Foundation-Models)|[blog](https://kwanwaipang.github.io/VLNsurvery2024/)|
157 | |2024|`arXiv`|[A Survey on Vision-Language-Action Models for Embodied AI](https://arxiv.org/pdf/2405.14093)|[](https://github.com/yueen-ma/Awesome-VLA)|Survey for VLA|
158 | |2024|`Neural Computing and Applications`|[Vision-language navigation: a survey and taxonomy](https://arxiv.org/pdf/2108.11544)|---|---|
159 | |2023|`Artificial Intelligence Review`|[Visual language navigation: A survey and open challenges](https://link.springer.com/article/10.1007/s10462-022-10174-9)|---|---|
160 | |2022|`ACL`|[Vision-and-language navigation: A survey of tasks, methods, and future directions](https://arxiv.org/pdf/2203.12667)|[](https://github.com/eric-ai-lab/awesome-vision-language-navigation)|---|
161 |
162 |
163 |
164 |
179 |
180 |
181 |
182 |
183 | # Learning-based Navigation
184 | Including image-goal navigation and object-goal navigation.
185 |
186 |
187 | | Year | Venue | Paper Title | Repository | Note |
188 | |:----:|:-----:| ----------- |:----------:|:----:|
189 | |2025|`arXiv`|[Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation](https://arxiv.org/pdf/2510.08713)|[](https://github.com/F1y1113/UniWM)|---|
190 | |2025|`arXiv`|[NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance](https://arxiv.org/pdf/2505.08712)|[](https://github.com/InternRobotics/NavDP)|[website](https://wzcai99.github.io/navigation-diffusion-policy.github.io/)|
191 | |2025|`arXiv`|[MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation](https://arxiv.org/pdf/2511.10376v2)|[](https://github.com/ylwhxht/MSGNav)|---|
192 | |2025|`arXiv`|[Adaptive Interactive Navigation of Quadruped Robots using Large Language Models](https://arxiv.org/pdf/2503.22942)|---|[Video](https://www.youtube.com/watch?v=W5ttPnSap2g)|
193 | |2025|`arXiv`|[DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction](https://www.arxiv.org/pdf/2510.07152)|---|---|
194 | |2025|`arXiv`|[IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation](https://arxiv.org/pdf/2508.00823)|[](https://github.com/GWxuan/IGL-Nav)|[website](https://gwxuan.github.io/IGL-Nav/)<br>Exploration+target matching|
195 | |2025|`arXiv`|[LOVON: Legged Open-Vocabulary Object Navigator](https://arxiv.org/pdf/2507.06747)|[](https://github.com/DaojiePENG/LOVON)|[website](https://daojiepeng.github.io/LOVON/)|
196 | |2025|`ICRA`|[TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals](https://arxiv.org/pdf/2509.08699)|[](https://github.com/podgorki/TANGO)|[website](https://podgorki.github.io/TANGO/)|
197 | |2025|`RSS`|[Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation](https://arxiv.org/pdf/2504.19322)|[](https://github.com/leggedrobotics/fdm)|[website](https://leggedrobotics.github.io/fdm.github.io/)|
198 | |2025|`arXiv`|[Parkour in the Wild: Learning a General and Extensible Agile Locomotion Policy Using Multi-expert Distillation and RL Fine-tuning](https://arxiv.org/pdf/2505.11164)|---|---|
199 | |2025|`CoRL`|[Omni-Perception: Omnidirectional Collision Avoidance for Legged Locomotion in Dynamic Environments](https://arxiv.org/pdf/2505.19214)|[](https://github.com/aCodeDog/OmniPerception)|---|
200 | |2024|`ICRA`|[VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation](https://arxiv.org/pdf/2312.03275)|[](https://github.com/bdaiinstitute/vlfm)|[website](https://naoki.io/portfolio/vlfm)|
201 | |2024|`Science Robotics`|[Learning Robust Autonomous Navigation and Locomotion for Wheeled-Legged Robots](https://arxiv.org/pdf/2405.01792)|---|---|
202 | |2024|`RAL`|[PIE: Parkour With Implicit-Explicit Learning Framework for Legged Robots](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678805)|---|---|
203 | |2024|`ICRA`|[Extreme Parkour with Legged Robots](https://arxiv.org/pdf/2309.14341)|[](https://github.com/chengxuxin/extreme-parkour)|[website](https://extreme-parkour.github.io/)|
204 | |2023|`ICML`|[ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation](https://proceedings.mlr.press/v202/zhou23r/zhou23r.pdf)|---|---|
205 | |2023|`ICRA`|[Zero-shot object goal visual navigation](https://arxiv.org/pdf/2206.07423)|[](https://github.com/pioneer-innovation/Zero-Shot-Object-Navigation)|---|
206 | |2023|`ICRA`|[ViNL: Visual Navigation and Locomotion Over Obstacles](https://arxiv.org/pdf/2210.14791)|[](https://github.com/SimarKareer/ViNL)|[website](https://www.joannetruong.com/projects/vinl.html)|
207 | |2023|`Field Robotics`|[ArtPlanner: Robust Legged Robot Navigation in the Field](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10876046)|[](https://github.com/leggedrobotics/art_planner)|---|
208 |
209 |
210 | ## Mapless navigation
211 |
212 |
213 | | Year | Venue | Paper Title | Repository | Note |
214 | |:----:|:-----:| ----------- |:----------:|:----:|
215 | |2025|`RSS`|[CREStE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance](https://arxiv.org/pdf/2503.03921)|[](https://github.com/ut-amrl/creste_public)|[website](https://amrl.cs.utexas.edu/creste/)|
216 |
217 |
218 | # Others
219 |
220 |
221 | | Year | Venue | Paper Title | Repository | Note |
222 | |:----:|:-----:| ----------- |:----------:|:----:|
223 | |2025|`IEEE/ASME Transactions on Mechatronics`|[Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI](https://arxiv.org/pdf/2407.06886)|[](https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List)|---|
224 | |2025|`arXiv`|[HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots](https://arxiv.org/pdf/2503.09010)|---|---|
225 | |2021|`ICML`|[Learning transferable visual models from natural language supervision](https://proceedings.mlr.press/v139/radford21a/radford21a.pdf)|[](https://github.com/OpenAI/CLIP)|CLIP<br>[website](https://openai.com/index/clip/)|
226 |
227 | ## Occupancy Perception
228 |
229 |
230 | | Year | Venue | Paper Title | Repository | Note |
231 | |:----:|:-----:| ----------- |:----------:|:----:|
232 | |2025|`arXiv`|[Humanoid Occupancy: Enabling A Generalized Multimodal Occupancy Perception System on Humanoid Robots](https://arxiv.org/pdf/2507.20217)|[](https://github.com/Open-X-Humanoid/Humanoid-Occupancy)|[website](https://humanoid-occupancy.github.io/)<br>Multimodal Occupancy Perception|
233 | |2025|`arXiv`|[RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots](https://arxiv.org/pdf/2504.14604)|---|3DGS|
234 | |2025|`ICCV`|[EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding](https://arxiv.org/pdf/2412.04380)|[](https://github.com/YkiWu/EmbodiedOcc)|[website](https://ykiwu.github.io/EmbodiedOcc/)|
235 | |2023|`ICCV`|[Scene as occupancy](https://openaccess.thecvf.com/content/ICCV2023/papers/Tong_Scene_as_Occupancy_ICCV_2023_paper.pdf)|[](https://github.com/OpenDriveLab/OccNet)|[Challenge and dataset](https://github.com/OpenDriveLab/OpenScene)|
236 |
237 | ## VLA
238 | * Paper list for [VLA (Vision-Language-Action)](https://github.com/KwanWaiPang/Awesome-VLA)
239 |
240 |
241 |
--------------------------------------------------------------------------------