├── survey_tree.png
├── LICENSE
├── .gitignore
└── README.md

/survey_tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robotics-survey/Awesome-Robotics-Foundation-Models/HEAD/survey_tree.png

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 robotics-survey

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome-Robotics-Foundation-Models

[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

![Survey taxonomy tree](https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models/blob/main/survey_tree.png)

This is the partner repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope this repository can serve as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods. The organization of this README follows Figure 1 in the paper (shown above): it is divided into foundation models that have been applied to robotics and foundation models that are relevant to robotics (perception and embodied AI).

We welcome contributions to this repository to add more resources. Please submit a pull request if you want to contribute!

## Table of Contents

- [Survey](#survey)
- [Robotics](#robotics)
  - [Neural Scaling Laws for Embodied AI](#neural-scaling-laws-for-embodied-ai)
  - [Robot Policy Learning for Decision-Making and Controls](#robot-policy-learning-for-decision-making-and-controls)
  - [Language-Image Goal-Conditioned Value Learning](#language-image-goal-conditioned-value-learning)
  - [Robot Task Planning Using Large Language Models](#robot-task-planning-using-large-language-models)
  - [LLM-Based Code Generation](#llm-based-code-generation)
  - [Robot Transformers](#robot-transformers)
  - [In-context Learning for Decision-Making](#in-context-learning-for-decision-making)
  - [Open-Vocabulary Robot Navigation and Manipulation](#open-vocabulary-robot-navigation-and-manipulation)
- [Relevant to Robotics (Perception)](#relevant-to-robotics-perception)
  - [Open-Vocabulary Object Detection and 3D Classification](#open-vocabulary-object-detection-and-3d-classification)
  - [Open-Vocabulary Semantic Segmentation](#open-vocabulary-semantic-segmentation)
  - [Open-Vocabulary 3D Scene Representations](#open-vocabulary-3d-scene-representations)
  - [Object Representations](#object-representations)
  - [Affordance Information](#affordance-information)
  - [Predictive Models](#predictive-models)
- [Relevant to Robotics (Embodied AI)](#relevant-to-robotics-embodied-ai)
  - [Generalist AI](#generalist-ai)
  - [Simulators](#simulators)


## Survey

This repository is largely based on the following paper:

**[Foundation Models in Robotics: Applications, Challenges, and the Future](https://arxiv.org/abs/2312.07843)**

Roya Firoozi,
Johnathan Tucker,
Stephen Tian,
Anirudha Majumdar,
Jiankai Sun,
Weiyu Liu,
Yuke Zhu,
Shuran Song,
Ashish Kapoor,
Karol Hausman,
Brian Ichter,
Danny Driess,
Jiajun Wu,
Cewu Lu,
Mac Schwager

If you find this repository helpful, please consider citing:
```bibtex
@article{firoozi2024foundation,
  title={Foundation Models in Robotics: Applications, Challenges, and the Future},
  author={Firoozi, Roya and Tucker, Johnathan and Tian, Stephen and Majumdar, Anirudha and Sun, Jiankai and Liu, Weiyu and Zhu, Yuke and Song, Shuran and Kapoor, Ashish and Hausman, Karol and others},
  journal={The International Journal of Robotics Research},
  year={2024},
  doi={https://doi.org/10.1177/02783649241281508}
}
```

```bibtex
@article{firoozi2023foundation,
  title={Foundation Models in Robotics: Applications, Challenges, and the Future},
  author={Firoozi, Roya and Tucker, Johnathan and Tian, Stephen and Majumdar, Anirudha and Sun, Jiankai and Liu, Weiyu and Zhu, Yuke and Song, Shuran and Kapoor, Ashish and Hausman, Karol and others},
  journal={arXiv preprint arXiv:2312.07843},
  year={2023}
}
```


## Robotics

### Neural Scaling Laws for Embodied AI
* Neural Scaling Laws for Embodied AI [[Paper]](https://arxiv.org/abs/2405.14005)


### Robot Policy Learning for Decision-Making and Controls
#### Language-Conditioned Imitation Learning
* CLIPort: What and Where Pathways for Robotic Manipulation [[Paper]](https://arxiv.org/abs/2109.12098)[[Project]](https://cliport.github.io/)[[Code]](https://github.com/cliport/cliport)
* Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [[Paper]](https://arxiv.org/abs/2209.05451)[[Project]](https://peract.github.io/)[[Code]](https://github.com/peract/peract)
* Play-LMP: Learning Latent Plans from Play [[Project]](https://learning-from-play.github.io/)
* Multi-Context Imitation: Language-Conditioned Imitation Learning over Unstructured Data [[Project]](https://language-play.github.io)

#### Language-Assisted Reinforcement Learning
* Towards A Unified Agent with Foundation Models [[Paper]](https://arxiv.org/abs/2307.09668)
* Reward Design with Language Models [[Paper]](https://arxiv.org/abs/2303.00001)
* Learning to Generate Better Than Your LLM [[Paper]](https://arxiv.org/pdf/2306.11816.pdf)[[Code]](https://github.com/Cornell-RL/tril)
* Guiding Pretraining in Reinforcement Learning with Large Language Models [[Paper]](https://arxiv.org/abs/2302.06692)[[Code]](https://github.com/yuqingd/ellm)
* Motif: Intrinsic Motivation from Artificial Intelligence Feedback [[Paper]](https://arxiv.org/abs/2310.00166)[[Code]](https://github.com/facebookresearch/motif)

### Language-Image Goal-Conditioned Value Learning
* SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [[Paper]](https://arxiv.org/abs/2204.01691)[[Project]](https://say-can.github.io/)[[Code]](https://github.com/google-research/google-research/tree/master/saycan)
* Zero-Shot Reward Specification via Grounded Natural Language [[Paper]](https://proceedings.mlr.press/v162/mahmoudieh22a/mahmoudieh22a.pdf)
* VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [[Project]](https://voxposer.github.io)
* VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [[Paper]](https://arxiv.org/abs/2210.00030)[[Project]](https://sites.google.com/view/vip-rl)
* LIV: Language-Image Representations and Rewards for Robotic Control [[Paper]](https://arxiv.org/abs/2306.00958)[[Project]](https://penn-pal-lab.github.io/LIV/)
* LOReL: Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [[Paper]](https://arxiv.org/abs/2109.01115)[[Project]](https://sites.google.com/view/robotlorel)
* Text2Motion: From Natural Language Instructions to Feasible Plans [[Paper]](https://arxiv.org/abs/2303.12153)[[Project]](https://sites.google.com/stanford.edu/text2motion)
* MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [[Paper]](https://arxiv.org/pdf/2403.12037.pdf)[[Project]](https://sites.google.com/view/minedreamer/main)[[Code]](https://github.com/Zhoues/MineDreamer)
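
The methods above share a simple core idea: embed the current observation and a language goal in a joint vision-language space and treat their similarity as a goal-conditioned value or reward. A minimal sketch with OpenAI's CLIP package; the checkpoint, image path, and the use of raw cosine similarity as the reward are illustrative choices, not the recipe of any specific paper above:

```python
# Illustrative sketch (not any single paper's method): score how well the
# current camera image matches a natural-language goal with CLIP embeddings,
# and use the similarity as a rough goal-conditioned reward signal.
# Assumes `pip install torch pillow git+https://github.com/openai/CLIP.git`.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # checkpoint choice is arbitrary

def language_goal_reward(image_path: str, goal_text: str) -> float:
    """Cosine similarity between the image and the language goal in CLIP space."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([goal_text]).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(image)
        text_emb = model.encode_text(tokens)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

# Example: a policy could be rewarded for increasing this score over time.
print(language_goal_reward("camera_frame.png", "a drawer that is open"))
```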

### Robot Task Planning Using Large Language Models
* Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [[Paper]](https://arxiv.org/abs/2201.07207)[[Project]](https://wenlong.page/language-planner/)
* Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [[Paper]](https://arxiv.org/pdf/2209.09874.pdf)[[Project]](https://nlmap-saycan.github.io/)
* NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models [[Paper]](https://arxiv.org/pdf/2305.07766.pdf)[[Project]](https://yongchao98.github.io/MIT-realm-NL2TL/)[[Code]](https://github.com/yongchao98/NL2TL)
* AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers [[Paper]](https://arxiv.org/abs/2306.06531)[[Project]](https://yongchao98.github.io/MIT-REALM-AutoTAMP/)
* LATTE: LAnguage Trajectory TransformEr [[Paper]](https://arxiv.org/abs/2208.02918)[[Code]](https://github.com/arthurfenderbucker/LaTTe-Language-Trajectory-TransformEr)
* Planning with Large Language Models via Corrective Re-prompting [[Paper]](https://arxiv.org/abs/2211.09935)
* Describe, explain, plan and select: interactive planning with LLMs enables open-world multi-task agents [[Paper]](https://arxiv.org/pdf/2302.01560.pdf)[[Code]](https://github.com/CraftJarvis/MC-Planner)
* JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [[Paper]](https://arxiv.org/pdf/2311.05997.pdf)[[Project]](https://craftjarvis.github.io/JARVIS-1/)[[Code]](https://github.com/CraftJarvis/JARVIS-1)
* An Embodied Generalist Agent in 3D World [[Paper]](https://arxiv.org/pdf/2311.12871.pdf)[[Project]](https://embodied-generalist.github.io/)[[Code]](https://github.com/embodied-generalist/embodied-generalist)
* LLM+P: Empowering Large Language Models with Optimal Planning Proficiency [[Paper]](https://arxiv.org/pdf/2304.11477.pdf)[[Code]](https://github.com/Cranial-XIX/llm-pddl)
* MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [[Paper]](https://arxiv.org/pdf/2312.07472.pdf)[[Project]](https://iranqin.github.io/MP5.github.io/)[[Code]](https://github.com/IranQin/MP5)
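
Most of the planners above follow the same prompting pattern: constrain a language model to a known skill library and ask it to decompose an instruction into executable steps. A minimal sketch using the `openai` client; the skill set, prompt, and model name are placeholders, and real systems add affordance grounding, feedback, and replanning:

```python
# Minimal sketch of LLM-based task planning with a fixed skill library.
# Everything here is illustrative, not any single paper's method.
# Assumes `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

SKILLS = ["find(object)", "pick(object)", "place(object, location)", "open(container)", "done()"]

def plan(instruction: str) -> list[str]:
    client = OpenAI()
    prompt = (
        "Decompose the instruction into a numbered list of steps.\n"
        f"Only use these primitive skills: {', '.join(SKILLS)}.\n"
        f"Instruction: {instruction}\nSteps:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Keep only lines that look like numbered steps.
    return [line.strip() for line in text.splitlines() if line.strip() and line.strip()[0].isdigit()]

print(plan("put the apple in the drawer"))
```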

### LLM-Based Code Generation
* ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [[Paper]](https://arxiv.org/abs/2209.11302)[[Project]](https://progprompt.github.io/)
* Code as Policies: Language Model Programs for Embodied Control [[Paper]](https://arxiv.org/abs/2209.07753)[[Project]](https://code-as-policies.github.io/)
* ChatGPT for Robotics: Design Principles and Model Abilities [[Paper]](https://arxiv.org/abs/2306.17582)[[Project]](https://www.microsoft.com/en-us/research/group/autonomous-systems-group-robotics/articles/chatgpt-for-robotics/)[[Code]](https://github.com/microsoft/PromptCraft-Robotics)
* Voyager: An Open-Ended Embodied Agent with Large Language Models [[Paper]](https://arxiv.org/abs/2305.16291)[[Project]](https://voyager.minedojo.org/)
* Visual Programming: Compositional visual reasoning without training [[Paper]](https://arxiv.org/abs/2211.11559)[[Project]](https://prior.allenai.org/projects/visprog)[[Code]](https://github.com/allenai/visprog)
* Deploying and Evaluating LLMs to Program Service Mobile Robots [[Paper]](https://arxiv.org/abs/2311.11183)[[Project]](https://amrl.cs.utexas.edu/codebotler/)[[Code]](https://github.com/ut-amrl/codebotler)
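
A minimal sketch of the code-generation idea behind Code as Policies and ProgPrompt: the language model emits a short Python program against a small whitelisted robot API, and that program is executed to produce behavior. The robot API and the "generated" snippet below are hypothetical stand-ins:

```python
# Illustrative sketch of the "code as policies" idea. The robot API below is a
# hypothetical stand-in; on a real robot these functions would call controllers.
def move_to(name: str) -> None:
    print(f"moving to {name}")

def grasp(name: str) -> None:
    print(f"grasping {name}")

def place_on(name: str) -> None:
    print(f"placing held object on {name}")

ROBOT_API = {"move_to": move_to, "grasp": grasp, "place_on": place_on}

# In a real system this string would come from an LLM prompted with the API
# signatures and a user instruction such as "stack the red block on the tray".
generated_policy = """
move_to("red block")
grasp("red block")
move_to("tray")
place_on("tray")
"""

# Execute the generated program with access to only the whitelisted functions.
exec(compile(generated_policy, "<llm_policy>", "exec"), {"__builtins__": {}}, dict(ROBOT_API))
```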

### Robot Transformers
* MotionGPT: Finetuned LLMs are General-Purpose Motion Generators [[Paper]](https://arxiv.org/abs/2306.10900)[[Project]](https://qiqiapink.github.io/MotionGPT/)
* RT-1: Robotics Transformer for Real-World Control at Scale [[Paper]](https://robotics-transformer.github.io/assets/rt1.pdf)[[Project]](https://robotics-transformer.github.io/)[[Code]](https://github.com/google-research/robotics_transformer)
* Masked Visual Pre-training for Motor Control [[Paper]](https://arxiv.org/abs/2203.06173)[[Project]](https://tetexiao.com/projects/mvp)[[Code]](https://github.com/ir413/mvp)
* Real-World Robot Learning with Masked Visual Pre-training [[Paper]](https://arxiv.org/abs/2210.03109)[[Project]](https://tetexiao.com/projects/real-mvp)
* R3M: A Universal Visual Representation for Robot Manipulation [[Paper]](https://arxiv.org/abs/2203.12601)[[Project]](https://sites.google.com/view/robot-r3m/)[[Code]](https://github.com/facebookresearch/r3m)
* Robot Learning with Sensorimotor Pre-training [[Paper]](https://arxiv.org/abs/2306.10007)[[Project]](https://robotic-pretrained-transformer.github.io/)
* RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [[Paper]](https://arxiv.org/abs/2307.15818)[[Project]](https://robotics-transformer2.github.io/)
* PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [[Paper]](https://arxiv.org/abs/2209.11133)
* GROOT: Learning to Follow Instructions by Watching Gameplay Videos [[Paper]](https://arxiv.org/pdf/2310.08235.pdf)[[Project]](https://craftjarvis.github.io/GROOT/)[[Code]](https://github.com/CraftJarvis/GROOT)
* Behavior Transformers (BeT): Cloning k modes with one stone [[Paper]](https://arxiv.org/abs/2206.11251)[[Project]](https://mahis.life/bet/)[[Code]](https://github.com/notmahi/bet)
* Conditional Behavior Transformers (C-BeT), From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data [[Paper]](https://arxiv.org/abs/2210.10047)[[Project]](https://play-to-policy.github.io/)[[Code]](https://github.com/jeffacce/play-to-policy)
* MAGICVFM: Meta-learning Adaptation for Ground Interaction Control with Visual Foundation Models [[Paper]](https://arxiv.org/abs/2407.12304)

### In-context Learning for Decision-Making
* A Survey on In-context Learning [[Paper]](https://arxiv.org/abs/2301.00234)
* Large Language Models as General Pattern Machines [[Paper]](https://arxiv.org/abs/2307.04721)
* Chain-of-Thought Predictive Control [[Paper]](https://arxiv.org/abs/2304.00776)
* ReAct: Synergizing Reasoning and Acting in Language Models [[Paper]](https://arxiv.org/abs/2210.03629)
* ICRT: In-Context Imitation Learning via Next-Token Prediction [[Paper]](https://arxiv.org/abs/2408.15980) [[Project]](https://icrt.dev/) [[Code]](https://github.com/Max-Fu/icrt)

### Open-Vocabulary Robot Navigation and Manipulation
* CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation [[Paper]](https://arxiv.org/pdf/2203.10421.pdf)[[Project]](https://cow.cs.columbia.edu/)
* Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [[Paper]](https://arxiv.org/pdf/2209.09874.pdf)[[Project]](https://nlmap-saycan.github.io/)
* LSC: Language-guided Skill Coordination for Open-Vocabulary Mobile Pick-and-Place [[Project]](https://languageguidedskillcoordination.github.io/)
* L3MVN: Leveraging Large Language Models for Visual Target Navigation [[Paper]](https://arxiv.org/abs/2304.05501)
* Open-World Object Manipulation using Pre-trained Vision-Language Models [[Paper]](https://robot-moo.github.io/assets/moo.pdf)[[Project]](https://robot-moo.github.io/)
* VIMA: General Robot Manipulation with Multimodal Prompts [[Paper]](https://arxiv.org/abs/2210.03094)[[Project]](https://vimalabs.github.io/)[[Code]](https://github.com/vimalabs/VIMA)
* Diffusion-based Generation, Optimization, and Planning in 3D Scenes [[Paper]](https://arxiv.org/pdf/2301.06015.pdf)[[Project]](https://scenediffuser.github.io/)[[Code]](https://github.com/scenediffuser/Scene-Diffuser)
* LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery [[Paper]](http://arxiv.org/abs/2311.02058) [[Project]](https://ut-austin-rpl.github.io/Lotus/)
* Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World [[Paper]](https://arxiv.org/abs/2312.02976) [[Project]](https://spoc-robot.github.io/)
* ThinkBot: Embodied Instruction Following with Thought Chain Reasoning [[Paper]](https://arxiv.org/abs/2312.07062) [[Project]](https://guanxinglu.github.io/thinkbot/)
* CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [[Paper]](https://arxiv.org/abs/2210.05663) [[Project]](https://mahis.life/clip-fields) [[Code]](https://github.com/notmahi/clip-fields)
* USA-Net: Unified Semantic and Affordance Representations for Robot Memory [[Paper]](https://arxiv.org/abs/2304.12164) [[Project]](https://usa.bolte.cc/) [[Code]](https://github.com/codekansas/usa)

## Relevant to Robotics (Perception)

### Open-Vocabulary Object Detection and 3D Classification
* Simple Open-Vocabulary Object Detection with Vision Transformers [[Paper]](https://arxiv.org/pdf/2205.06230.pdf)[[Code]](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
* Grounded Language-Image Pre-training [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Grounded_Language-Image_Pre-Training_CVPR_2022_paper.pdf)[[Code]](https://github.com/microsoft/GLIP)
* Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [[Paper]](https://arxiv.org/abs/2303.05499)[[Code]](https://github.com/IDEA-Research/GroundingDINO)
* PointCLIP: Point Cloud Understanding by CLIP [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhang_PointCLIP_Point_Cloud_Understanding_by_CLIP_CVPR_2022_paper.pdf)[[Code]](https://github.com/ZrrSkywalker/PointCLIP)
* Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [[Paper]](https://arxiv.org/abs/2111.14819)[[Code]](https://github.com/lulutang0608/Point-BERT)
* ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [[Paper]](https://arxiv.org/abs/2212.05171)[[Project]](https://tycho-xue.github.io/ULIP/)[[Code]](https://github.com/salesforce/ULIP)
* ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [[Paper]](https://arxiv.org/pdf/2305.08275.pdf)[[Code]](https://github.com/salesforce/ULIP)
* 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [[Paper]](https://arxiv.org/pdf/2308.04352.pdf)[[Project]](https://3d-vista.github.io/)[[Code]](https://github.com/3d-vista/3D-VisTA)
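
For a quick start on open-vocabulary detection, the OWL-ViT model from the first entry above can be run through the Hugging Face `transformers` interface. A hedged sketch; the image path, text queries, and score threshold are placeholders:

```python
# Sketch of open-vocabulary object detection with OWL-ViT via Hugging Face
# transformers (one convenient way to run the model; not the only interface).
# Assumes `pip install torch transformers pillow`.
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("tabletop.jpg")           # placeholder image path
queries = ["a coffee mug", "a screwdriver"]  # free-form text classes

inputs = processor(text=[queries], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in pixel coordinates; the threshold is arbitrary.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[int(label)]}: {score:.2f} at {box.tolist()}")
```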

### Open-Vocabulary Semantic Segmentation
* Language-driven Semantic Segmentation [[Paper]](https://arxiv.org/abs/2201.03546)[[Code]](https://github.com/isl-org/lang-seg)
* Emerging Properties in Self-Supervised Vision Transformers [[Paper]](https://arxiv.org/abs/2104.14294)[[Code]](https://github.com/facebookresearch/dino)
* Segment Anything [[Paper]](https://arxiv.org/abs/2304.02643)[[Project]](https://segment-anything.com/)
* Fast Segment Anything [[Paper]](https://arxiv.org/abs/2306.12156)[[Code]](https://github.com/CASIA-IVA-Lab/FastSAM)
* Faster Segment Anything: Towards Lightweight SAM for Mobile Applications [[Paper]](https://arxiv.org/abs/2306.14289)[[Code]](https://github.com/ChaoningZhang/MobileSAM)
* Track Anything: Segment Anything Meets Videos [[Paper]](https://arxiv.org/abs/2304.11968)[[Code]](https://github.com/gaomingqi/Track-Anything)
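
A minimal sketch of promptable segmentation with the `segment_anything` package listed above; the checkpoint path, image path, and the single point prompt are placeholders:

```python
# Sketch of promptable segmentation with Segment Anything (SAM).
# Assumes `pip install opencv-python` and
# `pip install git+https://github.com/facebookresearch/segment-anything.git`,
# plus a locally downloaded SAM checkpoint file.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("tabletop.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt SAM with a single foreground point (a pixel on the object of interest).
point = np.array([[320, 240]])
label = np.array([1])  # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(point_coords=point, point_labels=label, multimask_output=True)
print(f"got {masks.shape[0]} candidate masks, best score {scores.max():.2f}")
```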

### Open-Vocabulary 3D Scene Representations
* Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [[Paper]](https://arxiv.org/pdf/2209.09874.pdf)[[Project]](https://nlmap-saycan.github.io/)
* CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields [[Paper]](https://arxiv.org/abs/2112.05139)[[Project]](https://cassiepython.github.io/clipnerf/)
* CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [[Paper]](https://arxiv.org/abs/2210.05663) [[Project]](https://mahis.life/clip-fields) [[Code]](https://github.com/notmahi/clip-fields)
* LERF: Language Embedded Radiance Fields [[Paper]](https://arxiv.org/abs/2303.09553)[[Project]](https://www.lerf.io/)[[Code]](https://github.com/kerrj/lerf)
* Decomposing NeRF for editing via feature field distillation [[Paper]](https://arxiv.org/abs/2205.15585)[[Project]](https://pfnet-research.github.io/distilled-feature-fields)

### Object Representations
* FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [[Paper]](https://arxiv.org/abs/2312.08344)[[Project]](https://nvlabs.github.io/FoundationPose/)
* BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects [[Paper]](https://arxiv.org/abs/2303.14158)[[Project]](https://bundlesdf.github.io/)
* Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation [[Paper]](https://arxiv.org/abs/2112.05124)[[Project]](https://yilundu.github.io/ndf/)
* Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation [[Paper]](https://arxiv.org/abs/2308.07931)[[Project]](https://f3rm.github.io/)
* You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example [[Paper]](https://arxiv.org/abs/2305.12626)
* Zero-Shot Category-Level Object Pose Estimation [[Paper]](https://arxiv.org/abs/2204.03635)[[Code]](https://github.com/applied-ai-lab/zero-shot-pose)
* VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors [[Paper]](https://arxiv.org/abs/2210.11339)[[Project]](https://ut-austin-rpl.github.io/VIOLA/)[[Code]](https://github.com/UT-Austin-RPL/VIOLA)
* Learning Generalizable Manipulation Policies with Object-Centric 3D Representations [[Paper]](http://arxiv.org/abs/2310.14386)[[Project]](https://ut-austin-rpl.github.io/GROOT/)[[Code]](https://github.com/UT-Austin-RPL/GROOT)

### Affordance Information
* Affordance Diffusion: Synthesizing Hand-Object Interactions [[Paper]](https://arxiv.org/abs/2303.12538)[[Project]](https://judyye.github.io/affordiffusion-www/)
* Affordances from Human Videos as a Versatile Representation for Robotics [[Paper]](https://arxiv.org/abs/2304.08488)[[Project]](https://robo-affordances.github.io/)

### Predictive Models
* Adversarial Inverse Reinforcement Learning With Self-Attention Dynamics Model [[Paper]](https://ieeexplore.ieee.org/document/9361118)
* Connected Autonomous Vehicle Motion Planning with Video Predictions from Smart, Self-Supervised Infrastructure [[Paper]](https://arxiv.org/pdf/2309.07504.pdf)
* Self-Supervised Traffic Advisors: Distributed, Multi-view Traffic Prediction for Smart Cities [[Paper]](https://arxiv.org/abs/2204.06171)
* Planning with diffusion for flexible behavior synthesis [[Paper]](https://arxiv.org/abs/2205.09991)
* Phenaki: Variable-length video generation from open domain textual description [[Paper]](https://arxiv.org/abs/2210.02399)
* RoboNet: Large-scale multi-robot learning [[Paper]](https://arxiv.org/abs/1910.11215)
* GAIA-1: A Generative World Model for Autonomous Driving [[Paper]](https://arxiv.org/abs/2309.17080)
* Learning universal policies via text-guided video generation [[Paper]](https://arxiv.org/abs/2302.00111)
* Video language planning [[Paper]](https://arxiv.org/abs/2310.10625)
* MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [[Paper]](https://arxiv.org/pdf/2403.12037.pdf)[[Project]](https://sites.google.com/view/minedreamer/main)[[Code]](https://github.com/Zhoues/MineDreamer)
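
A common way to turn a predictive model into a controller is sampling-based planning: roll candidate action sequences through the model and execute the first action of the best sequence. A toy sketch with a hypothetical point-mass dynamics function standing in for a learned video or dynamics model:

```python
# Illustrative sketch of model-predictive control with a predictive model via
# random shooting. `predict_next_state` is a hypothetical stand-in for a
# learned model, so the example runs on its own.
import numpy as np

def predict_next_state(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    return state + 0.1 * action  # toy point-mass dynamics

def cost(state: np.ndarray, goal: np.ndarray) -> float:
    return float(np.linalg.norm(state - goal))

def plan_action(state, goal, horizon=10, n_samples=256, rng=np.random.default_rng(0)):
    """Sample action sequences, roll them out through the model, return the best first action."""
    best_action, best_cost = None, np.inf
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
        s, total = state.copy(), 0.0
        for a in seq:
            s = predict_next_state(s, a)
            total += cost(s, goal)
        if total < best_cost:
            best_cost, best_action = total, seq[0]
    return best_action

state, goal = np.zeros(2), np.array([1.0, 0.5])
for _ in range(50):  # closed-loop control: replan at every step
    state = predict_next_state(state, plan_action(state, goal))
print("final state:", state, "distance to goal:", cost(state, goal))
```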

## Relevant to Robotics (Embodied AI)
* Inner Monologue: Embodied Reasoning through Planning with Language Models [[Paper]](https://arxiv.org/abs/2207.05608)[[Project]](https://innermonologue.github.io/)
* Statler: State-Maintaining Language Models for Embodied Reasoning [[Paper]](https://arxiv.org/abs/2306.17840)[[Project]](https://statler-lm.github.io/)
* EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [[Paper]](https://arxiv.org/pdf/2305.15021.pdf)[[Project]](https://embodiedgpt.github.io/)
* MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [[Paper]](https://openreview.net/forum?id=rc8o_j8I8PX)[[Code]](https://github.com/MineDojo/MineDojo)
* Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [[Paper]](https://arxiv.org/abs/2206.11795)
* Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction [[Paper]](https://arxiv.org/pdf/2301.10034.pdf)[[Code]](https://github.com/CraftJarvis/MC-Controller)
* Describe, explain, plan and select: interactive planning with LLMs enables open-world multi-task agents [[Paper]](https://arxiv.org/pdf/2302.01560.pdf)[[Code]](https://github.com/CraftJarvis/MC-Planner)
* Voyager: An Open-Ended Embodied Agent with Large Language Models [[Paper]](https://arxiv.org/abs/2305.16291)[[Project]](https://voyager.minedojo.org/)[[Code]](https://github.com/MineDojo/Voyager)
* Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [[Paper]](https://arxiv.org/abs/2305.17144)[[Project]](https://github.com/OpenGVLab/GITM)
* Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [[Paper]](https://arxiv.org/pdf/2201.07207.pdf)[[Project]](https://wenlong.page/language-planner/)[[Code]](https://github.com/huangwl18/language-planner)
* GROOT: Learning to Follow Instructions by Watching Gameplay Videos [[Paper]](https://arxiv.org/pdf/2310.08235.pdf)[[Project]](https://craftjarvis.github.io/GROOT/)[[Code]](https://github.com/CraftJarvis/GROOT)
* JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [[Paper]](https://arxiv.org/pdf/2311.05997.pdf)[[Project]](https://craftjarvis.github.io/JARVIS-1/)[[Code]](https://github.com/CraftJarvis/JARVIS-1)
* SQA3D: Situated Question Answering in 3D Scenes [[Paper]](https://arxiv.org/pdf/2210.07474.pdf)[[Project]](https://sqa3d.github.io/)[[Code]](https://github.com/SilongYong/SQA3D)
* MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [[Paper]](https://arxiv.org/pdf/2312.07472.pdf)[[Project]](https://iranqin.github.io/MP5.github.io/)[[Code]](https://github.com/IranQin/MP5)
* MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [[Paper]](https://arxiv.org/pdf/2403.12037.pdf)[[Project]](https://sites.google.com/view/minedreamer/main)[[Code]](https://github.com/Zhoues/MineDreamer)

### Generalist AI
* Generative Agents: Interactive Simulacra of Human Behavior [[Paper]](https://arxiv.org/abs/2304.03442)
* Towards Generalist Robots: A Promising Paradigm via Generative Simulation [[Paper]](https://arxiv.org/abs/2305.10455)
* A generalist agent [[Paper]](https://arxiv.org/abs/2205.06175)
* An Embodied Generalist Agent in 3D World [[Paper]](https://arxiv.org/pdf/2311.12871.pdf)[[Project]](https://embodied-generalist.github.io/)[[Code]](https://github.com/embodied-generalist/embodied-generalist)

### Simulators
* Gibson Env: real-world perception for embodied agents [[Paper]](https://arxiv.org/abs/1808.10654)
* iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [[Paper]](https://arxiv.org/abs/2108.03272)[[Project]](https://svl.stanford.edu/igibson/)
* BEHAVIOR-1K: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation [[Paper]](https://openreview.net/forum?id=_8DoIe8G3t)[[Project]](https://behavior.stanford.edu/behavior-1k)
* Habitat: A Platform for Embodied AI Research [[Paper]](https://arxiv.org/abs/1904.01201)[[Project]](https://aihabitat.org/)
* Habitat 2.0: Training home assistants to rearrange their habitat [[Paper]](https://arxiv.org/abs/2106.14405)
* RoboTHOR: An open simulation-to-real embodied AI platform [[Paper]](https://arxiv.org/abs/2004.06799)[[Project]](https://ai2thor.allenai.org/robothor/)
* VirtualHome: Simulating Household Activities via Programs [[Paper]](https://arxiv.org/abs/1806.07011)
* ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes [[Paper]](https://arxiv.org/pdf/2304.04321.pdf)[[Project]](https://arnold-benchmark.github.io/)[[Code]](https://github.com/arnold-benchmark/arnold)
* ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks [[Paper]](https://arxiv.org/abs/1912.01734)[[Project]](https://askforalfred.com/)[[Code]](https://github.com/askforalfred/alfred)
* LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning [[Paper]](https://arxiv.org/pdf/2306.03310.pdf)[[Project]](https://lifelong-robot-learning.github.io/LIBERO/html/getting_started/overview.html)[[Code]](https://github.com/Lifelong-Robot-Learning/LIBERO)
* ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [[Paper]](https://arxiv.org/abs/2206.06994)[[Project]](https://procthor.allenai.org/)[[Code]](https://github.com/allenai/procthor)
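
Many of the simulators above expose a simple step-based Python API. A hedged sketch using AI2-THOR (the engine behind RoboTHOR and ProcTHOR); the scene name and action sequence are placeholders:

```python
# Sketch of stepping an embodied-AI simulator from Python with AI2-THOR.
# Assumes `pip install ai2thor`; the scene and actions are illustrative.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # a kitchen scene from the iTHOR split

for action in ["MoveAhead", "RotateRight", "MoveAhead", "LookDown"]:
    event = controller.step(action=action)
    agent = event.metadata["agent"]
    print(action, "->", agent["position"], "success:", event.metadata["lastActionSuccess"])

controller.stop()
```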

--------------------------------------------------------------------------------