# Awesome Visual-Language-Navigation (VLN)

This repository contains a curated list of resources on Vision-and-Language Navigation (VLN).
It also includes related papers from adjacent areas such as learning-based navigation.

If you find any missing papers, **feel free to [*create pull requests*](https://github.com/KwanWaiPang/Awesome-Transformer-based-SLAM/blob/pdf/How-to-PR.md) or [*open issues*](https://github.com/KwanWaiPang/Awesome-VLN/issues/new)**.

Contributions in any form that make this list more comprehensive are welcome.

If you find this repository useful, a simple star is the best affirmation. 😊

Feel free to share this list with others!

# Overview
- [VLN](#VLN)
  - [Simulator and Dataset](#Simulator-and-Dataset)
  - [Survey Paper](#Survey-Paper)
- [Learning-based Navigation](#Learning-based-Navigation)
  - [Mapless navigation](#Mapless-navigation)
- [Others](#Others)
  - [Occupancy Perception](#Occupancy-Perception)
  - [VLA](#VLA)

# VLN

| Year | Venue | Paper Title | Repository | Note |
|:----:|:-----:| ----------- |:----------:|:----:|
|2025|`arXiv`|[Embodied Navigation Foundation Model](https://arxiv.org/pdf/2509.12129)|---|[website](https://pku-epic.github.io/NavFoM-Web/)<br>NavFoM|
|2025|`arXiv`|[InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf)|[![Github stars](https://img.shields.io/github/stars/InternRobotics/InternNav.svg)](https://github.com/InternRobotics/InternNav)|[website](https://internrobotics.github.io/internvla-n1.github.io/)|
|2025|`arXiv`|[Odyssey: Open-world quadrupeds exploration and manipulation for long-horizon tasks](https://arxiv.org/pdf/2508.08240)|---|[website](https://kaijwang.github.io/odyssey.github.io/)|
|2025|`arXiv`|[OpenVLN: Open-world aerial Vision-Language Navigation](https://arxiv.org/pdf/2511.06182)|---|---|
|2025|`arXiv`|[VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation](https://arxiv.org/pdf/2509.18592)|[![Github stars](https://img.shields.io/github/stars/VLN-Zero/vln-zero.github.io.svg)](https://github.com/VLN-Zero/vln-zero.github.io)|[website](https://vln-zero.github.io/)|
|2025|`arXiv`|[JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation](https://arxiv.org/pdf/2509.22548)|[![Github stars](https://img.shields.io/github/stars/MIV-XJTU/JanusVLN.svg)](https://github.com/MIV-XJTU/JanusVLN)|[website](https://miv-xjtu.github.io/JanusVLN.github.io/)|
|2025|`arXiv`|[StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling](https://arxiv.org/pdf/2507.05240)|[![Github stars](https://img.shields.io/github/stars/InternRobotics/StreamVLN.svg)](https://github.com/InternRobotics/StreamVLN)|[website](https://streamvln.github.io/)|
|2025|`arXiv`|[GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation](https://arxiv.org/pdf/2509.10454)|[![Github stars](https://img.shields.io/github/stars/bagh2178/GC-VLN.svg)](https://github.com/bagh2178/GC-VLN)|[website](https://bagh2178.github.io/GC-VLN/)|
|2025|`arXiv`|[Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting](https://arxiv.org/pdf/2509.20499)|---|---|
|2025|`arXiv`|[SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning](https://arxiv.org/pdf/2509.20739)|---|---|
|2025|`arXiv`|[Zero-shot Object-Centric Instruction Following: Integrating Foundation Models with Traditional Navigation](https://arxiv.org/pdf/2411.07848)|---|[website](https://sonia-raychaudhuri.github.io/nlslam/)|
|2025|`RSS`|[Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks](https://arxiv.org/pdf/2412.06224)|[![Github stars](https://img.shields.io/github/stars/jzhzhang/Uni-NaVid.svg)](https://github.com/jzhzhang/Uni-NaVid)|[website](https://pku-epic.github.io/Uni-NaVid/)|
|2025|`RSS`|[NaVILA: Legged Robot Vision-Language-Action Model for Navigation](https://arxiv.org/pdf/2412.04453)|[![Github stars](https://img.shields.io/github/stars/AnjieCheng/NaVILA.svg)](https://github.com/AnjieCheng/NaVILA)|[website](https://navila-bot.github.io/)|
|2025|`ICCV`|[Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation](https://arxiv.org/pdf/2507.04047)|[![Github stars](https://img.shields.io/github/stars/MTU3D/MTU3D.svg)](https://github.com/MTU3D/MTU3D)|[website](https://mtu3d.github.io/)|
| 2025 | `ACL` | [MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation](https://arxiv.org/pdf/2502.13451) |---|---|
| 2025 | `CVPR` | [Scene Map-based Prompt Tuning for Navigation Instruction Generation](https://openaccess.thecvf.com/content/CVPR2025/papers/Fan_Scene_Map-based_Prompt_Tuning_for_Navigation_Instruction_Generation_CVPR_2025_paper.pdf) |---|---|
| 2025 | `ACL` | [NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM](https://arxiv.org/pdf/2502.11142) | [![Github stars](https://img.shields.io/github/stars/MrZihan/NavRAG.svg)](https://github.com/MrZihan/NavRAG) |---|
| 2025 | `ICLR` | [Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel](https://arxiv.org/abs/2412.08467) | [![Github stars](https://img.shields.io/github/stars/wz0919/VLN-SRDF.svg)](https://github.com/wz0919/VLN-SRDF) |---|
| 2025 | `ICCV` | [SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts](https://arxiv.org/pdf/2412.05552) | [![Github stars](https://img.shields.io/github/stars/GengzeZhou/SAME.svg)](https://github.com/GengzeZhou/SAME) |---|
| 2025 | `ICCV` | [NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments](https://arxiv.org/pdf/2506.23468) | [![Github stars](https://img.shields.io/github/stars/Feliciaxyao/NavMorph.svg)](https://github.com/Feliciaxyao/NavMorph) |---|
| 2025 | `AAAI` | [Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation](https://arxiv.org/abs/2407.05890) |---|---|
| 2025 | `arXiv` | [EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation](https://arxiv.org/pdf/2506.01551) | [![Github stars](https://img.shields.io/github/stars/expectorlin/EvolveNav.svg)](https://github.com/expectorlin/EvolveNav) |---|
| 2025 | `CVPR` | [Do Visual Imaginations Improve Vision-and-Language Navigation Agents?](https://arxiv.org/pdf/2503.16394) |---|---|
| 2024 | `AAAI` | [VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation](https://arxiv.org/abs/2402.03561) |---|---|
| 2024 | `CVPR` | [Volumetric Environment Representation for Vision-Language Navigation](https://arxiv.org/pdf/2403.14158) | [![Github stars](https://img.shields.io/github/stars/DefaultRui/VLN-VER.svg)](https://github.com/DefaultRui/VLN-VER) |---|
|2024|`ECCV`|[NavGPT-2: Unleashing navigational reasoning capability for large vision-language models](https://arxiv.org/pdf/2407.12366)|[![Github stars](https://img.shields.io/github/stars/GengzeZhou/NavGPT-2.svg)](https://github.com/GengzeZhou/NavGPT-2)|---|
| 2024 | `CVPR` | [Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation](https://arxiv.org/pdf/2404.01943) | [![Github stars](https://img.shields.io/github/stars/MrZihan/HNR-VLN.svg)](https://github.com/MrZihan/HNR-VLN) |---|
| 2024 | `TPAMI` | [ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments](https://arxiv.org/abs/2304.03047v2) | [![Github stars](https://img.shields.io/github/stars/MarSaKi/ETPNav.svg)](https://github.com/MarSaKi/ETPNav) |---|
| 2024 | `MM` | [Narrowing the Gap between Vision and Action in Navigation](https://www.arxiv.org/abs/2408.10388) |---|---|
| 2024 | `ECCV` | [LLM as Copilot for Coarse-grained Vision-and-Language Navigation](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00833.pdf) |---|---|
| 2024 | `ICRA` | [Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions](https://ieeexplore.ieee.org/abstract/document/10611565) | [![Github stars](https://img.shields.io/github/stars/LYX0501/DiscussNav.svg)](https://github.com/LYX0501/DiscussNav) |---|
| 2024 | `ACL` | [MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation](https://arxiv.org/abs/2401.07314) | [![Github stars](https://img.shields.io/github/stars/chen-judge/MapGPT.svg)](https://github.com/chen-judge/MapGPT) |[website](https://chen-judge.github.io/MapGPT/)|
| 2024 |`arXiv`| [MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains](https://arxiv.org/pdf/2405.10620) |---|---|
| 2024 |`arXiv`| [InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment](https://arxiv.org/pdf/2406.04882) | [![Github stars](https://img.shields.io/github/stars/LYX0501/InstructNav.svg)](https://github.com/LYX0501/InstructNav) |---|
| 2024 | `AAAI` | [NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models](https://arxiv.org/abs/2305.16986) | [![Github stars](https://img.shields.io/github/stars/GengzeZhou/NavGPT.svg)](https://github.com/GengzeZhou/NavGPT) |---|
| 2024 | `NAACL Findings` | [LangNav: Language as a Perceptual Representation for Navigation](https://aclanthology.org/2024.findings-naacl.60.pdf) | [![Github stars](https://img.shields.io/github/stars/pbw-Berwin/LangNav.svg)](https://github.com/pbw-Berwin/LangNav) |---|
| 2024 |`arXiv`| [NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning](https://arxiv.org/abs/2403.07376) | [![Github stars](https://img.shields.io/github/stars/expectorlin/NavCoT.svg)](https://github.com/expectorlin/NavCoT) |---|
| 2024 | `CVPR` | [Towards Learning a Generalist Model for Embodied Navigation](https://arxiv.org/abs/2312.02010) | [![Github stars](https://img.shields.io/github/stars/LaVi-Lab/NaviLLM.svg)](https://github.com/LaVi-Lab/NaviLLM) |NaviLLM|
| 2024 | `RSS` | [NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation](https://arxiv.org/pdf/2402.15852) | [![Github stars](https://img.shields.io/github/stars/jzhzhang/NaVid-VLN-CE.svg)](https://github.com/jzhzhang/NaVid-VLN-CE) |[website](https://pku-epic.github.io/NaVid/)|
| 2024 |`EMNLP`| [Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation](https://arxiv.org/pdf/2409.17313) | [![Github stars](https://img.shields.io/github/stars/zehao-wang/navnuances.svg)](https://github.com/zehao-wang/navnuances) |---|
| 2023 | `CVPR` | [Behavioral Analysis of Vision-and-Language Navigation Agents](https://yoark.github.io/assets/pdf/vln-behave/vln-behave.pdf) | [![Github stars](https://img.shields.io/github/stars/Yoark/vln-behave.svg)](https://github.com/Yoark/vln-behave) |---|
| 2023 | `ICCV` | [March in Chat: Interactive Prompting for Remote Embodied Referring Expression](https://openaccess.thecvf.com//content/ICCV2023/papers/Qiao_March_in_Chat_Interactive_Prompting_for_Remote_Embodied_Referring_Expression_ICCV_2023_paper.pdf) | [![Github stars](https://img.shields.io/github/stars/YanyuanQiao/MiC.svg)](https://github.com/YanyuanQiao/MiC) |---|
| 2023 |`arXiv`| [Vision and Language Navigation in the Real World via Online Visual Language Mapping](https://arxiv.org/pdf/2310.10822) |---|---|
| 2023 | `NeurIPS` | [A2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models](https://peihaochen.github.io/files/publications/A2Nav.pdf) |---|---|
| 2023 | `ICCV` | [BEVBert: Multimodal Map Pre-training for Language-guided Navigation](https://arxiv.org/pdf/2212.04385) | [![Github stars](https://img.shields.io/github/stars/MarSaKi/VLN-BEVBert.svg)](https://github.com/MarSaKi/VLN-BEVBert) |---|
|2023|`CVPR`|[CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation](https://openaccess.thecvf.com/content/CVPR2023/papers/Gadre_CoWs_on_Pasture_Baselines_and_Benchmarks_for_Language-Driven_Zero-Shot_Object_CVPR_2023_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/real-stanford/cow.svg)](https://github.com/real-stanford/cow)|CLIP on Wheels<br>[website](https://cow.cs.columbia.edu/)|
|2023|`NeurIPS`|[Frequency-enhanced data augmentation for vision-and-language navigation](https://proceedings.neurips.cc/paper_files/paper/2023/file/0d9e08f247ca7fbbfd5e50b7ff9cf357-Paper-Conference.pdf)|[![Github stars](https://img.shields.io/github/stars/hekj/FDA.svg)](https://github.com/hekj/FDA)|---|
|2023|`NeurIPS`|[Find what you want: Learning demand-conditioned object attribute space for demand-driven navigation](https://proceedings.neurips.cc/paper_files/paper/2023/file/34e278fbbd7d6d7d788c98065988e1a9-Paper-Conference.pdf)|[![Github stars](https://img.shields.io/github/stars/whcpumpkin/Demand-driven-navigation.svg)](https://github.com/whcpumpkin/Demand-driven-navigation)|[website](https://sites.google.com/view/demand-driven-navigation)|
|2023|`ACL`|[Aerial vision-and-dialog navigation](https://arxiv.org/pdf/2205.12219)|[![Github stars](https://img.shields.io/github/stars/eric-ai-lab/Aerial-Vision-and-Dialog-Navigation.svg)](https://github.com/eric-ai-lab/Aerial-Vision-and-Dialog-Navigation)|[website](https://sites.google.com/view/aerial-vision-and-dialog/home)|
| 2023 | `AAAI` | [Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation](https://arxiv.org/pdf/2302.06072) |---|---|
| 2023 | `ICCV` | [Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation](https://arxiv.org/abs/2308.12587) | [![Github stars](https://img.shields.io/github/stars/CSir1996/VLN-GELA.svg)](https://github.com/CSir1996/VLN-GELA) |---|
| 2023 | `CVPR` | [Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation](https://openaccess.thecvf.com/content/CVPR2023/papers/Gao_Adaptive_Zone-Aware_Hierarchical_Planner_for_Vision-Language_Navigation_CVPR_2023_paper.pdf) | [![Github stars](https://img.shields.io/github/stars/chengaopro/AZHP.svg)](https://github.com/chengaopro/AZHP) |---|
| 2023 | `ICCV` | [Bird's-Eye-View Scene Graph for Vision-Language Navigation](https://arxiv.org/abs/2308.04758) |---|---|
| 2023 |`EMNLP`| [Masked Path Modeling for Vision-and-Language Navigation](https://arxiv.org/abs/2305.14268) |---|---|
| 2023 | `CVPR` | [Improving Vision-and-Language Navigation by Generating Future-View Image Semantics](https://arxiv.org/pdf/2304.04907) | [![Github stars](https://img.shields.io/github/stars/jialuli-luka/VLN-SIG.svg)](https://github.com/jialuli-luka/VLN-SIG) |---|
| 2023 | `TPAMI` | [HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation](https://ieeexplore.ieee.org/document/10006384) |---|---|
| 2023 | `TPAMI` | [Learning to Follow and Generate Instructions for Language-Capable Navigation](https://ieeexplore.ieee.org/document/10359152) |---|---|
| 2023 | `CVPR` | [A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning](https://arxiv.org/pdf/2210.03112) | --- |[Dataset](https://github.com/google-research-datasets/RxR/tree/main/marky-mT5)|
| 2023 | `CVPR` | [Lana: A Language-Capable Navigator for Instruction Following and Generation](https://arxiv.org/abs/2303.08409) | [![Github stars](https://img.shields.io/github/stars/wxh1996/LANA-VLN.svg)](https://github.com/wxh1996/LANA-VLN) |---|
| 2023 | `CVPR` | [KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation](https://openaccess.thecvf.com/content/CVPR2023/papers/Li_KERM_Knowledge_Enhanced_Reasoning_for_Vision-and-Language_Navigation_CVPR_2023_paper.pdf) | [![Github stars](https://img.shields.io/github/stars/xiangyangli-cn/KERM.svg)](https://github.com/xiangyangli-cn/KERM) |---|
| 2023 | `MM` | [PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation](https://arxiv.org/pdf/2305.11918) |---|---|
| 2023 |`arXiv`| [CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation](https://arxiv.org/abs/2103.00852) |---|---|
| 2023 | `ACL` | [VLN-Trans: Translator for the Vision and Language Navigation Agent](https://arxiv.org/pdf/2302.09230) | [![Github stars](https://img.shields.io/github/stars/HLR/VLN-trans.svg)](https://github.com/HLR/VLN-trans) |---|
| 2022 | `ACL` | [Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration](https://arxiv.org/pdf/2203.04006) | [![Github stars](https://img.shields.io/github/stars/liangcici/Probes-VLN.svg)](https://github.com/liangcici/Probes-VLN) |---|
| 2022 | `CVPR` | [Less is More: Generating Grounded Navigation Instructions from Landmarks](https://arxiv.org/pdf/2004.14973) | [![Github stars](https://img.shields.io/github/stars/google-research-datasets/RxR.svg)](https://github.com/google-research-datasets/RxR/tree/main/marky-mT5) |---|
| 2022 | `MM` | [Target-Driven Structured Transformer Planner for Vision-Language Navigation](https://arxiv.org/pdf/2207.11201) | [![Github stars](https://img.shields.io/github/stars/YushengZhao/TD-STP.svg)](https://github.com/YushengZhao/TD-STP) |---|
| 2022 | `CVPR` | [HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation](https://ieeexplore.ieee.org/document/9880046) | [![Github stars](https://img.shields.io/github/stars/YanyuanQiao/HOP-VLN.svg)](https://github.com/YanyuanQiao/HOP-VLN) |---|
| 2022 | `COLING` | [LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation](https://aclanthology.org/2022.coling-1.505.pdf) | [![Github stars](https://img.shields.io/github/stars/HLR/LOViS.svg)](https://github.com/HLR/LOViS) |---|
| 2022 | `NAACL` | [Diagnosing Vision-and-Language Navigation: What Really Matters](https://aclanthology.org/2022.naacl-main.438.pdf) | [![Github stars](https://img.shields.io/github/stars/VegB/Diagnose_VLN.svg)](https://github.com/VegB/Diagnose_VLN) |---|
| 2022 |`arXiv`| [CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation](https://arxiv.org/pdf/2211.16649) |---|---|
| 2022 | `CVPR` | [Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation](https://arxiv.org/abs/2203.02764) | [![Github stars](https://img.shields.io/github/stars/YicongHong/Discrete-Continuous-VLN.svg)](https://github.com/YicongHong/Discrete-Continuous-VLN) |---|
| 2021 | `CVPR` | [Scene-Intuitive Agent for Remote Embodied Visual Grounding](https://arxiv.org/pdf/2103.12944) |---|---|
| 2021 | `NeurIPS` | [SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation](https://arxiv.org/abs/2110.14143) |---|---|
| 2021 | `ICCV` | [The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation](https://openaccess.thecvf.com/content/ICCV2021/papers/Qi_The_Road_To_Know-Where_An_Object-and-Room_Informed_Sequential_BERT_for_ICCV_2021_paper.pdf) | [![Github stars](https://img.shields.io/github/stars/YuankaiQi/ORIST.svg)](https://github.com/YuankaiQi/ORIST) |---|
| 2021 | `CVPR` | [VLN BERT: A Recurrent Vision-and-Language BERT for Navigation](https://openaccess.thecvf.com/content/CVPR2021/papers/Hong_VLN_BERT_A_Recurrent_Vision-and-Language_BERT_for_Navigation_CVPR_2021_paper.pdf) | [![Github stars](https://img.shields.io/github/stars/YicongHong/Recurrent-VLN-BERT.svg)](https://github.com/YicongHong/Recurrent-VLN-BERT) |---|
| 2021 | `EACL` | [On the Evaluation of Vision-and-Language Navigation Instructions](https://arxiv.org/abs/2101.10504) |---|---|
| 2022 | `CoRL` | [Do As I Can, Not As I Say: Grounding Language in Robotic Affordances](https://say-can.github.io/assets/palm_saycan.pdf) | [![Github stars](https://img.shields.io/github/stars/say-can/say-can.github.io.svg)](https://say-can.github.io/) |SayCan|
|2021|`NeurIPS`|[History aware multimodal transformer for vision-and-language navigation](https://proceedings.neurips.cc/paper/2021/file/2e5c2cb8d13e8fba78d95211440ba326-Paper.pdf)|[![Github stars](https://img.shields.io/github/stars/cshizhe/VLN-HAMT.svg)](https://github.com/cshizhe/VLN-HAMT)|[website](https://cshizhe.github.io/projects/vln_hamt.html)|
|2021|`CVPR`|[Room-and-object aware knowledge reasoning for remote embodied referring expression](https://openaccess.thecvf.com/content/CVPR2021/papers/Gao_Room-and-Object_Aware_Knowledge_Reasoning_for_Remote_Embodied_Referring_Expression_CVPR_2021_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/alloldman/CKR.svg)](https://github.com/alloldman/CKR)|---|
|2021|`ICCV`|[Vision-language navigation with random environmental mixup](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Vision-Language_Navigation_With_Random_Environmental_Mixup_ICCV_2021_paper.pdf)|---|---|
|2021|`ICRA`|[Hierarchical cross-modal agent for robotics vision-and-language navigation](https://arxiv.org/pdf/2104.10674)|---|Robo-VLN<br>First continuous|
| 2020 | `CVPR` | [Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training](https://arxiv.org/abs/2002.10638) | [![Github stars](https://img.shields.io/github/stars/weituo12321/PREVALENT.svg)](https://github.com/weituo12321/PREVALENT) |---|
|2020|`ECCV`|[Active visual information gathering for vision-language navigation](https://arxiv.org/pdf/2007.08037)|[![Github stars](https://img.shields.io/github/stars/HanqingWangAI/Active_VLN.svg)](https://github.com/HanqingWangAI/Active_VLN)|---|
|2020|`CVPR`|[Vision-language navigation with self-supervised auxiliary reasoning tasks](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_Vision-Language_Navigation_With_Self-Supervised_Auxiliary_Reasoning_Tasks_CVPR_2020_paper.pdf)|---|---|
|2020|`ECCV`|[Improving vision-and-language navigation with image-text pairs from the web](https://arxiv.org/pdf/2004.14973)|---|VLN-BERT|
| 2020 | `ECCV` | [Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments](https://arxiv.org/abs/2004.02857) | [![Github stars](https://img.shields.io/github/stars/jacobkrantz/VLN-CE.svg)](https://github.com/jacobkrantz/VLN-CE) |---|
|2019|`EMNLP`|[Robust navigation with language pretraining and stochastic sampling](https://arxiv.org/pdf/1909.02244)|[![Github stars](https://img.shields.io/github/stars/xjli/r2r_vln.svg)](https://github.com/xjli/r2r_vln)|---|
|2019|`CoRL`|[Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight](https://arxiv.org/pdf/1910.09664)|[![Github stars](https://img.shields.io/github/stars/lil-lab/drif.svg)](https://github.com/lil-lab/drif)|---|
|2018|`NeurIPS`|[Speaker-follower models for vision-and-language navigation](https://arxiv.org/pdf/1806.02724)|[![Github stars](https://img.shields.io/github/stars/ronghanghu/speaker_follower.svg)](https://github.com/ronghanghu/speaker_follower)|[website](https://ronghanghu.com/speaker_follower/)|
|2018|`RSS`|[Following high-level navigation instructions on a simulated quadcopter with imitation learning](https://arxiv.org/pdf/1806.00047)|[![Github stars](https://img.shields.io/github/stars/lil-lab/gsmn.svg)](https://github.com/lil-lab/gsmn)|---|

## Simulator and Dataset

| Year | Venue | Paper Title | Repository | Note |
|:----:|:-----:| ----------- |:----------:|:----:|
|2025|`arXiv`|[InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf)|[![Github stars](https://img.shields.io/github/stars/InternRobotics/InternNav.svg)](https://github.com/InternRobotics/InternNav)|[website](https://internrobotics.github.io/internvla-n1.github.io/)<br>InternData-N1 Dataset|
|2025|`arXiv`|[HA-VLN: A Benchmark for Human-Aware Navigation in Discrete–Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard](https://arxiv.org/pdf/2503.14229)|[![Github stars](https://img.shields.io/github/stars/F1y1113/HA-VLN.svg)](https://github.com/F1y1113/HA-VLN)|[website](https://ha-vln-project.vercel.app/)|
|2023|`ICCV`|[Learning vision-and-language navigation from YouTube videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_Learning_Vision-and-Language_Navigation_from_YouTube_Videos_ICCV_2023_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/JeremyLinky/YouTube-VLN.svg)](https://github.com/JeremyLinky/YouTube-VLN)|YouTube-VLN|
|2023|`ICCV`|[AerialVLN: Vision-and-language navigation for UAVs](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_AerialVLN_Vision-and-Language_Navigation_for_UAVs_ICCV_2023_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/AirVLN/AirVLN.svg)](https://github.com/AirVLN/AirVLN)|AerialVLN|
|2023|`ICCV`|[Scaling data generation in vision-and-language navigation](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_Scaling_Data_Generation_in_Vision-and-Language_Navigation_ICCV_2023_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/wz0919/ScaleVLN.svg)](https://github.com/wz0919/ScaleVLN)|ScaleVLN|
|2022|`CVPR`|[Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale](https://openaccess.thecvf.com/content/CVPR2022/papers/Ramrakhya_Habitat-Web_Learning_Embodied_Object-Search_Strategies_From_Human_Demonstrations_at_Scale_CVPR_2022_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/Ram81/habitat-web.svg)](https://github.com/Ram81/habitat-web)|[website](https://ram81.github.io/projects/habitat-web)|
|2022|`CVPR`|[Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation](https://openaccess.thecvf.com/content/CVPR2022/papers/Hong_Bridging_the_Gap_Between_Learning_in_Discrete_and_Continuous_Environments_CVPR_2022_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/YicongHong/Discrete-Continuous-VLN.svg)](https://github.com/YicongHong/Discrete-Continuous-VLN)|R2R-CE|
|2021|`CVPR`|[SOON: Scenario oriented object navigation with graph-based exploration](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhu_SOON_Scenario_Oriented_Object_Navigation_With_Graph-Based_Exploration_CVPR_2021_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/ZhuFengdaaa/SOON.svg)](https://github.com/ZhuFengdaaa/SOON)|SOON|
|2020|`CVPR`|[ALFRED: A benchmark for interpreting grounded instructions for everyday tasks](https://openaccess.thecvf.com/content_CVPR_2020/papers/Shridhar_ALFRED_A_Benchmark_for_Interpreting_Grounded_Instructions_for_Everyday_Tasks_CVPR_2020_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/askforalfred/alfred.svg)](https://github.com/askforalfred/alfred)|ALFRED<br>[website](https://askforalfred.com/)|
|2020|`CVPR`|[REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments](https://openaccess.thecvf.com/content_CVPR_2020/papers/Qi_REVERIE_Remote_Embodied_Visual_Referring_Expression_in_Real_Indoor_Environments_CVPR_2020_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/YuankaiQi/REVERIE.svg)](https://github.com/YuankaiQi/REVERIE)|REVERIE|
|2020|`EMNLP`|[Where are you? Localization from embodied dialog](https://arxiv.org/pdf/2011.08277)|---|[website](https://meerahahn.github.io/way/)|
|2020|`CoRL`|[Vision-and-Dialog Navigation](https://arxiv.org/pdf/1907.04957)|[![Github stars](https://img.shields.io/github/stars/mmurray/cvdn.svg)](https://github.com/mmurray/cvdn/)|CVDN<br>[website](https://cvdn.dev/)|
|2020|`EMNLP`|[Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding](https://arxiv.org/pdf/2010.07954)|[![Github stars](https://img.shields.io/github/stars/google-research-datasets/RxR.svg)](https://github.com/google-research-datasets/RxR)|RxR|
|2019|`EMNLP`|[Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning](https://arxiv.org/pdf/1909.01871)|[![Github stars](https://img.shields.io/github/stars/khanhptnk/hanna.svg)](https://github.com/khanhptnk/hanna)|HANNA|
|2019|`ACL`|[Stay on the path: Instruction fidelity in vision-and-language navigation](https://arxiv.org/pdf/1905.12255)|---|R4R<br>[website](https://github.com/google-research/google-research/tree/master/r4r)|
|2019|`arXiv`|[Learning to navigate unseen environments: Back translation with environmental dropout](https://arxiv.org/pdf/1904.04195)|[![Github stars](https://img.shields.io/github/stars/airsplay/R2R-EnvDrop.svg)](https://github.com/airsplay/R2R-EnvDrop)|R2R-EnvDrop-CE|
|2018|`CVPR`|[Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments](https://openaccess.thecvf.com/content_cvpr_2018/papers/Anderson_Vision-and-Language_Navigation_Interpreting_CVPR_2018_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/peteanderson80/Matterport3DSimulator.svg)](https://github.com/peteanderson80/Matterport3DSimulator)|R2R<br>[website](https://bringmeaspoon.org/)|

## Survey Paper

| Year | Venue | Paper Title | Repository | Note |
|:----:|:-----:| ----------- |:----------:|:----:|
|2025|`arXiv`|[Sensing, Social, and Motion Intelligence in Embodied Navigation: A Comprehensive Survey](https://arxiv.org/pdf/2508.15354)|[![Github stars](https://img.shields.io/github/stars/Franky-X/Awesome-Embodied-Navigation.svg)](https://github.com/Franky-X/Awesome-Embodied-Navigation)|Survey on embodied navigation<br>[blog](https://kwanwaipang.github.io/Enbodied-Navigation/)|
|2025|`IEEE/ASME Transactions on Mechatronics`|[Aligning cyber space with physical world: A comprehensive survey on embodied AI](https://arxiv.org/pdf/2407.06886)|[![Github stars](https://img.shields.io/github/stars/HCPLab-SYSU/Embodied_AI_Paper_List.svg)](https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List)|---|
|2024|`Transactions on Machine Learning Research`|[Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models](https://openreview.net/pdf?id=yiqeh2ZYUh)|[![Github stars](https://img.shields.io/github/stars/zhangyuejoslin/VLN-Survey-with-Foundation-Models.svg)](https://github.com/zhangyuejoslin/VLN-Survey-with-Foundation-Models)|[blog](https://kwanwaipang.github.io/VLNsurvery2024/)|
|2024|`arXiv`|[A Survey on Vision-Language-Action Models for Embodied AI](https://arxiv.org/pdf/2405.14093)|[![Github stars](https://img.shields.io/github/stars/yueen-ma/Awesome-VLA.svg)](https://github.com/yueen-ma/Awesome-VLA)|Survey on VLA|
|2024|`Neural Computing and Applications`|[Vision-language navigation: a survey and taxonomy](https://arxiv.org/pdf/2108.11544)|---|---|
|2023|`Artificial Intelligence Review`|[Visual language navigation: A survey and open challenges](https://link.springer.com/article/10.1007/s10462-022-10174-9)|---|---|
|2022|`ACL`|[Vision-and-language navigation: A survey of tasks, methods, and future directions](https://arxiv.org/pdf/2203.12667)|[![Github stars](https://img.shields.io/github/stars/eric-ai-lab/awesome-vision-language-navigation.svg)](https://github.com/eric-ai-lab/awesome-vision-language-navigation)|---|

# Learning-based Navigation
Also covers image-goal navigation and object-goal navigation.

| Year | Venue | Paper Title | Repository | Note |
|:----:|:-----:| ----------- |:----------:|:----:|
|2025|`arXiv`|[Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation](https://arxiv.org/pdf/2510.08713)|[![Github stars](https://img.shields.io/github/stars/F1y1113/UniWM.svg)](https://github.com/F1y1113/UniWM)|---|
|2025|`arXiv`|[NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance](https://arxiv.org/pdf/2505.08712)|[![Github stars](https://img.shields.io/github/stars/InternRobotics/NavDP.svg)](https://github.com/InternRobotics/NavDP)|[website](https://wzcai99.github.io/navigation-diffusion-policy.github.io/)|
|2025|`arXiv`|[MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation](https://arxiv.org/pdf/2511.10376v2)|[![Github stars](https://img.shields.io/github/stars/ylwhxht/MSGNav.svg)](https://github.com/ylwhxht/MSGNav)|---|
|2025|`arXiv`|[Adaptive Interactive Navigation of Quadruped Robots using Large Language Models](https://arxiv.org/pdf/2503.22942)|---|[Video](https://www.youtube.com/watch?v=W5ttPnSap2g)|
|2025|`arXiv`|[DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction](https://www.arxiv.org/pdf/2510.07152)|---|---|
|2025|`arXiv`|[IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation](https://arxiv.org/pdf/2508.00823)|[![Github stars](https://img.shields.io/github/stars/GWxuan/IGL-Nav.svg)](https://github.com/GWxuan/IGL-Nav)|[website](https://gwxuan.github.io/IGL-Nav/)<br>Exploration + target matching|
|2025|`arXiv`|[LOVON: Legged Open-Vocabulary Object Navigator](https://arxiv.org/pdf/2507.06747)|[![Github stars](https://img.shields.io/github/stars/DaojiePENG/LOVON.svg)](https://github.com/DaojiePENG/LOVON)|[website](https://daojiepeng.github.io/LOVON/)|
|2025|`ICRA`|[TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals](https://arxiv.org/pdf/2509.08699)|[![Github stars](https://img.shields.io/github/stars/podgorki/TANGO.svg)](https://github.com/podgorki/TANGO)|[website](https://podgorki.github.io/TANGO/)|
|2025|`RSS`|[Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation](https://arxiv.org/pdf/2504.19322)|[![Github stars](https://img.shields.io/github/stars/leggedrobotics/fdm.svg)](https://github.com/leggedrobotics/fdm)|[website](https://leggedrobotics.github.io/fdm.github.io/)|
|2025|`arXiv`|[Parkour in the Wild: Learning a General and Extensible Agile Locomotion Policy Using Multi-expert Distillation and RL Fine-tuning](https://arxiv.org/pdf/2505.11164)|---|---|
|2025|`CoRL`|[Omni-Perception: Omnidirectional Collision Avoidance for Legged Locomotion in Dynamic Environments](https://arxiv.org/pdf/2505.19214)|[![Github stars](https://img.shields.io/github/stars/aCodeDog/OmniPerception.svg)](https://github.com/aCodeDog/OmniPerception)|---|
|2024|`ICRA`|[VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation](https://arxiv.org/pdf/2312.03275)|[![Github stars](https://img.shields.io/github/stars/bdaiinstitute/vlfm.svg)](https://github.com/bdaiinstitute/vlfm)|[website](https://naoki.io/portfolio/vlfm)|
|2024|`Science Robotics`|[Learning Robust Autonomous Navigation and Locomotion for Wheeled-Legged Robots](https://arxiv.org/pdf/2405.01792)|---|---|
|2024|`RAL`|[PIE: Parkour With Implicit-Explicit Learning Framework for Legged Robots](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678805)|---|---|
|2024|`ICRA`|[Extreme Parkour with Legged Robots](https://arxiv.org/pdf/2309.14341)|[![Github stars](https://img.shields.io/github/stars/chengxuxin/extreme-parkour.svg)](https://github.com/chengxuxin/extreme-parkour)|[website](https://extreme-parkour.github.io/)|
|2023|`ICML`|[ESC: Exploration with soft commonsense constraints for zero-shot object navigation](https://proceedings.mlr.press/v202/zhou23r/zhou23r.pdf)|---|---|
|2023|`ICRA`|[Zero-shot object goal visual navigation](https://arxiv.org/pdf/2206.07423)|[![Github stars](https://img.shields.io/github/stars/pioneer-innovation/Zero-Shot-Object-Navigation.svg)](https://github.com/pioneer-innovation/Zero-Shot-Object-Navigation)|---|
|2023|`ICRA`|[ViNL: Visual Navigation and Locomotion Over Obstacles](https://arxiv.org/pdf/2210.14791)|[![Github stars](https://img.shields.io/github/stars/SimarKareer/ViNL.svg)](https://github.com/SimarKareer/ViNL)|[website](https://www.joannetruong.com/projects/vinl.html)|
|2023|`Field Robotics`|[ArtPlanner: Robust Legged Robot Navigation in the Field](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10876046)|[![Github stars](https://img.shields.io/github/stars/leggedrobotics/art_planner.svg)](https://github.com/leggedrobotics/art_planner)|---|

## Mapless navigation

| Year | Venue | Paper Title | Repository | Note |
|:----:|:-----:| ----------- |:----------:|:----:|
|2025|`RSS`|[CREStE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance](https://arxiv.org/pdf/2503.03921)|[![Github stars](https://img.shields.io/github/stars/ut-amrl/creste_public.svg)](https://github.com/ut-amrl/creste_public)|[website](https://amrl.cs.utexas.edu/creste/)|

# Others

| Year | Venue | Paper Title | Repository | Note |
|:----:|:-----:| ----------- |:----------:|:----:|
|2025|`IEEE/ASME Transactions on Mechatronics`|[Aligning cyber space with physical world: A comprehensive survey on embodied AI](https://arxiv.org/pdf/2407.06886)|[![Github stars](https://img.shields.io/github/stars/HCPLab-SYSU/Embodied_AI_Paper_List.svg)](https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List)|---|
|2025|`arXiv`|[HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots](https://arxiv.org/pdf/2503.09010)|---|---|
|2021|`ICML`|[Learning transferable visual models from natural language supervision](https://proceedings.mlr.press/v139/radford21a/radford21a.pdf)|[![Github stars](https://img.shields.io/github/stars/OpenAI/CLIP.svg)](https://github.com/OpenAI/CLIP)|CLIP<br>[website](https://openai.com/index/clip/)|

## Occupancy Perception

| Year | Venue | Paper Title | Repository | Note |
|:----:|:-----:| ----------- |:----------:|:----:|
|2025|`arXiv`|[Humanoid Occupancy: Enabling A Generalized Multimodal Occupancy Perception System on Humanoid Robots](https://arxiv.org/pdf/2507.20217)|[![Github stars](https://img.shields.io/github/stars/Open-X-Humanoid/Humanoid-Occupancy.svg)](https://github.com/Open-X-Humanoid/Humanoid-Occupancy)|[website](https://humanoid-occupancy.github.io/)<br>Multimodal occupancy perception|
|2025|`arXiv`|[RoboOcc: Enhancing the geometric and semantic scene understanding for robots](https://arxiv.org/pdf/2504.14604)|---|3DGS|
|2025|`ICCV`|[EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding](https://arxiv.org/pdf/2412.04380)|[![Github stars](https://img.shields.io/github/stars/YkiWu/EmbodiedOcc.svg)](https://github.com/YkiWu/EmbodiedOcc)|[website](https://ykiwu.github.io/EmbodiedOcc/)|
|2023|`ICCV`|[Scene as occupancy](https://openaccess.thecvf.com/content/ICCV2023/papers/Tong_Scene_as_Occupancy_ICCV_2023_paper.pdf)|[![Github stars](https://img.shields.io/github/stars/OpenDriveLab/OccNet.svg)](https://github.com/OpenDriveLab/OccNet)|[Challenge and dataset](https://github.com/OpenDriveLab/OpenScene)|

## VLA
* Paper list for [VLA (Vision-Language-Action)](https://github.com/KwanWaiPang/Awesome-VLA)