├── survey_tree.png
├── LICENSE
├── .gitignore
└── README.md
/survey_tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robotics-survey/Awesome-Robotics-Foundation-Models/HEAD/survey_tree.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 robotics-survey
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 |
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 |
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 |
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | *.py,cover
50 | .hypothesis/
51 | .pytest_cache/
52 | cover/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | .pybuilder/
76 | target/
77 |
78 | # Jupyter Notebook
79 | .ipynb_checkpoints
80 |
81 | # IPython
82 | profile_default/
83 | ipython_config.py
84 |
85 | # pyenv
86 | # For a library or package, you might want to ignore these files since the code is
87 | # intended to run in multiple environments; otherwise, check them in:
88 | # .python-version
89 |
90 | # pipenv
91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
94 | # install all needed dependencies.
95 | #Pipfile.lock
96 |
97 | # poetry
98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more
100 | # commonly ignored for libraries.
101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102 | #poetry.lock
103 |
104 | # pdm
105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106 | #pdm.lock
107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108 | # in version control.
109 | # https://pdm.fming.dev/#use-with-ide
110 | .pdm.toml
111 |
112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
113 | __pypackages__/
114 |
115 | # Celery stuff
116 | celerybeat-schedule
117 | celerybeat.pid
118 |
119 | # SageMath parsed files
120 | *.sage.py
121 |
122 | # Environments
123 | .env
124 | .venv
125 | env/
126 | venv/
127 | ENV/
128 | env.bak/
129 | venv.bak/
130 |
131 | # Spyder project settings
132 | .spyderproject
133 | .spyproject
134 |
135 | # Rope project settings
136 | .ropeproject
137 |
138 | # mkdocs documentation
139 | /site
140 |
141 | # mypy
142 | .mypy_cache/
143 | .dmypy.json
144 | dmypy.json
145 |
146 | # Pyre type checker
147 | .pyre/
148 |
149 | # pytype static type analyzer
150 | .pytype/
151 |
152 | # Cython debug symbols
153 | cython_debug/
154 |
155 | # PyCharm
156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
158 | # and can be added to the global gitignore or merged into this file. For a more nuclear
159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
160 | #.idea/
161 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome-Robotics-Foundation-Models
2 |
3 | [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
4 |
5 | ![Survey tree](./survey_tree.png)
6 |
7 | This is the companion repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope this repository can serve as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods. The organization of this README follows Figure 1 in the paper (shown above): it is divided into foundation models that have been applied to robotics and foundation models that are otherwise relevant to robotics (e.g., for perception and embodied AI).
8 |
9 | We welcome contributions that add more resources to this repository. To contribute, please open a pull request that follows the entry format shown below.
10 |
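New entries generally follow the list format already used throughout this README; a minimal sketch is below (the section name, paper title, and all three URLs are placeholders, and the [[Project]] and [[Code]] links can be omitted when they do not exist):

```markdown
### Section Name
* Paper Title: Subtitle If Any [[Paper]](https://arxiv.org/abs/XXXX.XXXXX)[[Project]](https://project-page.example.org/)[[Code]](https://github.com/example-org/example-repo)
```
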
11 | ## Table of Contents
12 |
13 | - [Survey](#survey)
14 | - [Robotics](#robotics)
15 | - [Neural Scaling Laws for Embodied AI](#neural-scaling-laws-for-embodied-ai)
16 | - [Robot Policy Learning for Decision-Making and Controls](#robot-policy-learning-for-decision-making-and-controls)
17 | - [Language-Image Goal-Conditioned Value Learning](#language-image-goal-conditioned-value-learning)
18 | - [Robot Task Planning Using Large Language Models](#robot-task-planning-using-large-language-models)
19 | - [Robot Transformers](#robot-transformers)
20 | - [In-context Learning for Decision-Making](#in-context-learning-for-decision-making)
21 | - [Open-Vocabulary Robot Navigation and Manipulation](#open-vocabulary-robot-navigation-and-manipulation)
22 | - [Relevant to Robotics (Perception)](#relevant-to-robotics-perception)
23 | - [Open-Vocabulary Object Detection and 3D Classification](#open-vocabulary-object-detection-and-3d-classification)
24 | - [Open-Vocabulary Semantic Segmentation](#open-vocabulary-semantic-segmentation)
25 | - [Open-Vocabulary 3D Scene Representations](#open-vocabulary-3d-scene-representations)
26 | - [Open-Vocabulary Object Representations](#open-vocabulary-object-representations)
27 | - [Affordance Information](#affordance-information)
28 | - [Predictive Models](#predictive-models)
29 | - [Generalist AI](#generalist-ai)
30 | - [Simulators](#simulators)
31 |
32 |
33 | ## Survey
34 |
35 | This repository is largely based on the following paper:
36 |
37 | **[Foundation Models in Robotics: Applications, Challenges, and the Future](https://arxiv.org/abs/2312.07843)**
38 |
39 | Roya Firoozi,
40 | Johnathan Tucker,
41 | Stephen Tian,
42 | Anirudha Majumdar,
43 | Jiankai Sun,
44 | Weiyu Liu,
45 | Yuke Zhu,
46 | Shuran Song,
47 | Ashish Kapoor,
48 | Karol Hausman,
49 | Brian Ichter,
50 | Danny Driess,
51 | Jiajun Wu,
52 | Cewu Lu,
53 | Mac Schwager
54 |
55 |
56 | If you find this repository helpful, please consider citing:
57 | ```bibtex
58 | @article{firoozi2024foundation,
59 | title={Foundation Models in Robotics: Applications, Challenges, and the Future},
60 | author={Firoozi, Roya and Tucker, Johnathan and Tian, Stephen and Majumdar, Anirudha and Sun, Jiankai and Liu, Weiyu and Zhu, Yuke and Song, Shuran and Kapoor, Ashish and Hausman, Karol and others},
61 | journal={The International Journal of Robotics Research},
62 | year={2024},
63 | doi={10.1177/02783649241281508}
64 | }
65 | ```
66 |
67 | Alternatively, cite the arXiv preprint:
68 | ```bibtex
69 | @article{firoozi2023foundation,
70 | title={Foundation Models in Robotics: Applications, Challenges, and the Future},
71 | author={Firoozi, Roya and Tucker, Johnathan and Tian, Stephen and Majumdar, Anirudha and Sun, Jiankai and Liu, Weiyu and Zhu, Yuke and Song, Shuran and Kapoor, Ashish and Hausman, Karol and others},
72 | journal={arXiv preprint arXiv:2312.07843},
73 | year={2023}
74 | }
75 | ```
76 |
77 |
78 | ## Robotics
79 |
80 | ### Neural Scaling Laws for Embodied AI
81 | * Neural Scaling Laws for Embodied AI [[Paper]](https://arxiv.org/abs/2405.14005)
82 |
83 |
84 | ### Robot Policy Learning for Decision-Making and Controls
85 | #### Language-Conditioned Imitation Learning
86 | * CLIPort: What and Where Pathways for Robotic Manipulation [[Paper]](https://arxiv.org/abs/2109.12098)[[Project]](https://cliport.github.io/)[[Code]](https://github.com/cliport/cliport)
87 | * Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [[Paper]](https://arxiv.org/abs/2209.05451)[[Project]](https://peract.github.io/)[[Code]](https://github.com/peract/peract)
88 | * Play-LMP: Learning Latent Plans from Play [[Project]](https://learning-from-play.github.io/)
89 | * Multi-Context Imitation: Language-Conditioned Imitation Learning over Unstructured Data [[Project]](https://language-play.github.io)
90 |
91 | #### Language-Assisted Reinforcement Learning
92 | * Towards A Unified Agent with Foundation Models [[Paper]](https://arxiv.org/abs/2307.09668)
93 | * Reward Design with Language Models [[Paper]](https://arxiv.org/abs/2303.00001)
94 | * Learning to Generate Better Than Your LLM [[Paper]](https://arxiv.org/pdf/2306.11816.pdf)[[Code]](https://github.com/Cornell-RL/tril)
95 | * Guiding Pretraining in Reinforcement Learning with Large Language Models [[Paper]](https://arxiv.org/abs/2302.06692)[[Code]](https://github.com/yuqingd/ellm)
96 | * Motif: Intrinsic Motivation from Artificial Intelligence Feedback [[Paper]](https://arxiv.org/abs/2310.00166)[[Code]](https://github.com/facebookresearch/motif)
97 |
98 | ### Language-Image Goal-Conditioned Value Learning
99 | * SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [[Paper]](https://arxiv.org/abs/2204.01691)[[Project]](https://say-can.github.io/)[[Code]](https://github.com/google-research/google-research/tree/master/saycan)
100 | * Zero-Shot Reward Specification via Grounded Natural Language [[Paper]](https://proceedings.mlr.press/v162/mahmoudieh22a/mahmoudieh22a.pdf)
101 | * VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [[Project]](https://voxposer.github.io)
102 | * VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [[Paper]](https://arxiv.org/abs/2210.00030)[[Project]](https://sites.google.com/view/vip-rl)
103 | * LIV: Language-Image Representations and Rewards for Robotic Control [[Paper]](https://arxiv.org/abs/2306.00958)[[Project]](https://penn-pal-lab.github.io/LIV/)
104 | * LOReL: Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [[Paper]](https://arxiv.org/abs/2109.01115)[[Project]](https://sites.google.com/view/robotlorel)
105 | * Text2Motion: From Natural Language Instructions to Feasible Plans [[Paper]](https://arxiv.org/abs/2303.12153)[[Project]](https://sites.google.com/stanford.edu/text2motion)
106 | * MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [[Paper]](https://arxiv.org/pdf/2403.12037.pdf)[[Project]](https://sites.google.com/view/minedreamer/main)[[Code]](https://github.com/Zhoues/MineDreamer)
107 |
108 | ### Robot Task Planning Using Large Language Models
109 | * Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [[Paper]](https://arxiv.org/abs/2201.07207)[[Project]](https://wenlong.page/language-planner/)
110 | * Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [[Paper]](https://arxiv.org/pdf/2209.09874.pdf)[[Project]](https://nlmap-saycan.github.io/)
111 | * NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models [[Paper]](https://arxiv.org/pdf/2305.07766.pdf)[[Project]](https://yongchao98.github.io/MIT-realm-NL2TL/)[[Code]](https://github.com/yongchao98/NL2TL)
112 | * AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers [[Paper]](https://arxiv.org/abs/2306.06531)[[Project]](https://yongchao98.github.io/MIT-REALM-AutoTAMP/)
113 | * LATTE: LAnguage Trajectory TransformEr [[Paper]](https://arxiv.org/abs/2208.02918)[[Code]](https://github.com/arthurfenderbucker/LaTTe-Language-Trajectory-TransformEr)
114 | * Planning with Large Language Models via Corrective Re-prompting [[Paper]](https://arxiv.org/abs/2211.09935)
115 | * Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [[Paper]](https://arxiv.org/pdf/2302.01560.pdf)[[Code]](https://github.com/CraftJarvis/MC-Planner)
116 | * JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [[Paper]](https://arxiv.org/pdf/2311.05997.pdf)[[Project]](https://craftjarvis.github.io/JARVIS-1/)[[Code]](https://github.com/CraftJarvis/JARVIS-1)
117 | * An Embodied Generalist Agent in 3D World [[Paper]](https://arxiv.org/pdf/2311.12871.pdf)[[Project]](https://embodied-generalist.github.io/)[[Code]](https://github.com/embodied-generalist/embodied-generalist)
118 | * LLM+P: Empowering Large Language Models with Optimal Planning Proficiency [[Paper]](https://arxiv.org/pdf/2304.11477.pdf)[[Code]](https://github.com/Cranial-XIX/llm-pddl)
119 | * MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [[Paper]](https://arxiv.org/pdf/2312.07472.pdf)[[Project]](https://iranqin.github.io/MP5.github.io/)[[Code]](https://github.com/IranQin/MP5)
120 |
121 | ### LLM-Based Code Generation
122 | * ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [[Paper]](https://arxiv.org/abs/2209.11302)[[Project]](https://progprompt.github.io/)
123 | * Code as Policies: Language Model Programs for Embodied Control [[Paper]](https://arxiv.org/abs/2209.07753)[[Project]](https://code-as-policies.github.io/)
124 | * ChatGPT for Robotics: Design Principles and Model Abilities [[Paper]](https://arxiv.org/abs/2306.17582)[[Project]](https://www.microsoft.com/en-us/research/group/autonomous-systems-group-robotics/articles/chatgpt-for-robotics/)[[Code]](https://github.com/microsoft/PromptCraft-Robotics)
125 | * Voyager: An Open-Ended Embodied Agent with Large Language Models [[Paper]](https://arxiv.org/abs/2305.16291)[[Project]](https://voyager.minedojo.org/)
126 | * Visual Programming: Compositional visual reasoning without training [[Paper]](https://arxiv.org/abs/2211.11559)[[Project]](https://prior.allenai.org/projects/visprog)[[Code]](https://github.com/allenai/visprog)
127 | * Deploying and Evaluating LLMs to Program Service Mobile Robots [[Paper]](https://arxiv.org/abs/2311.11183)[[Project]](https://amrl.cs.utexas.edu/codebotler/)[[Code]](https://github.com/ut-amrl/codebotler)
128 |
129 | ### Robot Transformers
130 | * MotionGPT: Finetuned LLMs are General-Purpose Motion Generators [[Paper]](https://arxiv.org/abs/2306.10900)[[Project]](https://qiqiapink.github.io/MotionGPT/)
131 | * RT-1: Robotics Transformer for Real-World Control at Scale [[Paper]](https://robotics-transformer.github.io/assets/rt1.pdf)[[Project]](https://robotics-transformer.github.io/)[[Code]](https://github.com/google-research/robotics_transformer)
132 | * Masked Visual Pre-training for Motor Control [[Paper]](https://arxiv.org/abs/2203.06173)[[Project]](https://tetexiao.com/projects/mvp)[[Code]](https://github.com/ir413/mvp)
133 | * Real-world robot learning with masked visual pre-training [[Paper]](https://arxiv.org/abs/2210.03109)[[Project]](https://tetexiao.com/projects/real-mvp)
134 | * R3M: A Universal Visual Representation for Robot Manipulation [[Paper]](https://arxiv.org/abs/2203.12601)[[Project]](https://sites.google.com/view/robot-r3m/)[[Code]](https://github.com/facebookresearch/r3m)
135 | * Robot Learning with Sensorimotor Pre-training [[Paper]](https://arxiv.org/abs/2306.10007)[[Project]](https://robotic-pretrained-transformer.github.io/)
136 | * RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [[Paper]](https://arxiv.org/abs/2307.15818)[[Project]](https://robotics-transformer2.github.io/)
137 | * PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [[Paper]](https://arxiv.org/abs/2209.11133)
138 | * GROOT: Learning to Follow Instructions by Watching Gameplay Videos [[Paper]](https://arxiv.org/pdf/2310.08235.pdf)[[Project]](https://craftjarvis.github.io/GROOT/)[[Code]](https://github.com/CraftJarvis/GROOT)
139 | * Behavior Transformers (BeT): Cloning k modes with one stone [[Paper]](https://arxiv.org/abs/2206.11251)[[Project]](https://mahis.life/bet/)[[Code]](https://github.com/notmahi/bet)
140 | * Conditional Behavior Transformers (C-BeT), From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data [[Paper]](https://arxiv.org/abs/2210.10047)[[Project]](https://play-to-policy.github.io/)[[Code]](https://github.com/jeffacce/play-to-policy)
141 | * MAGICVFM: Meta-learning Adaptation for Ground Interaction Control with Visual Foundation Models [[Paper]](https://arxiv.org/abs/2407.12304)
142 |
143 | ### In-context Learning for Decision-Making
144 | * A Survey on In-context Learning [[Paper]](https://arxiv.org/abs/2301.00234)
145 | * Large Language Models as General Pattern Machines [[Paper]](https://arxiv.org/abs/2307.04721)
146 | * Chain-of-Thought Predictive Control [[Paper]](https://arxiv.org/abs/2304.00776)
147 | * ReAct: Synergizing Reasoning and Acting in Language Models [[Paper]](https://arxiv.org/abs/2210.03629)
148 | * ICRT: In-Context Imitation Learning via Next-Token Prediction [[Paper]](https://arxiv.org/abs/2408.15980) [[Project]](https://icrt.dev/) [[Code]](https://github.com/Max-Fu/icrt)
149 |
150 | ### Open-Vocabulary Robot Navigation and Manipulation
151 | * CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation [[Paper]](https://arxiv.org/pdf/2203.10421.pdf)[[Project]](https://cow.cs.columbia.edu/)
152 | * Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [[Paper]](https://arxiv.org/pdf/2209.09874.pdf)[[Project]](https://nlmap-saycan.github.io/)
153 | * LSC: Language-guided Skill Coordination for Open-Vocabulary Mobile Pick-and-Place [[Project]](https://languageguidedskillcoordination.github.io/)
154 | * L3MVN: Leveraging Large Language Models for Visual Target Navigation [[Paper]](https://arxiv.org/abs/2304.05501)
155 | * Open-World Object Manipulation using Pre-trained Vision-Language Models [[Paper]](https://robot-moo.github.io/assets/moo.pdf)[[Project]](https://robot-moo.github.io/)
156 | * VIMA: General Robot Manipulation with Multimodal Prompts [[Paper]](https://arxiv.org/abs/2210.03094)[[Project]](https://vimalabs.github.io/)[[Code]](https://github.com/vimalabs/VIMA)
157 | * Diffusion-based Generation, Optimization, and Planning in 3D Scenes [[Paper]](https://arxiv.org/pdf/2301.06015.pdf)[[Project]](https://scenediffuser.github.io/)[[Code]](https://github.com/scenediffuser/Scene-Diffuser)
158 | * LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery [[Paper]](http://arxiv.org/abs/2311.02058) [[Project]](https://ut-austin-rpl.github.io/Lotus/)
159 | * Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World [[Paper]](https://arxiv.org/abs/2312.02976) [[Project]](https://spoc-robot.github.io/)
160 | * ThinkBot: Embodied Instruction Following with Thought Chain Reasoning [[Paper]](https://arxiv.org/abs/2312.07062) [[Project]](https://guanxinglu.github.io/thinkbot/)
161 | * CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [[Paper]](https://arxiv.org/abs/2210.05663) [[Project]](https://mahis.life/clip-fields) [[Code]](https://github.com/notmahi/clip-fields)
162 | * USA-Net: Unified Semantic and Affordance Representations for Robot Memory [[Paper]](https://arxiv.org/abs/2304.12164) [[Project]](https://usa.bolte.cc/) [[Code]](https://github.com/codekansas/usa)
163 |
164 | ## Relevant to Robotics (Perception)
165 |
166 | ### Open-Vocabulary Object Detection and 3D Classification
167 | * Simple Open-Vocabulary Object Detection with Vision Transformers [[Paper]](https://arxiv.org/pdf/2205.06230.pdf)[[Code]](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
168 | * Grounded Language-Image Pre-training [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Grounded_Language-Image_Pre-Training_CVPR_2022_paper.pdf)[[Code]](https://github.com/microsoft/GLIP)
169 | * Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [[Paper]](https://arxiv.org/abs/2303.05499)[[Code]](https://github.com/IDEA-Research/GroundingDINO)
170 | * PointCLIP: Point Cloud Understanding by CLIP [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhang_PointCLIP_Point_Cloud_Understanding_by_CLIP_CVPR_2022_paper.pdf)[[Code]](https://github.com/ZrrSkywalker/PointCLIP)
171 | * Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [[Paper]](https://arxiv.org/abs/2111.14819)[[Code]](https://github.com/lulutang0608/Point-BERT)
172 | * ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [[Paper]](https://arxiv.org/abs/2212.05171)[[Project]](https://tycho-xue.github.io/ULIP/)[[Code]](https://github.com/salesforce/ULIP)
173 | * ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [[Paper]](https://arxiv.org/pdf/2305.08275.pdf)[[Code]](https://github.com/salesforce/ULIP)
174 | * 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [[Paper]](https://arxiv.org/pdf/2308.04352.pdf)[[Project]](https://3d-vista.github.io/)[[Code]](https://github.com/3d-vista/3D-VisTA)
175 |
176 | ### Open-Vocabulary Semantic Segmentation
177 | * Language-driven Semantic Segmentation [[Paper]](https://arxiv.org/abs/2201.03546)[[Code]](https://github.com/isl-org/lang-seg)
178 | * Emerging Properties in Self-Supervised Vision Transformers [[Paper]](https://arxiv.org/abs/2104.14294)[[Code]](https://github.com/facebookresearch/dino)
179 | * Segment Anything [[Paper]](https://arxiv.org/abs/2304.02643)[[Project]](https://segment-anything.com/)
180 | * Fast Segment Anything [[Paper]](https://arxiv.org/abs/2306.12156)[[Code]](https://github.com/CASIA-IVA-Lab/FastSAM)
181 | * Faster Segment Anything: Towards Lightweight SAM for Mobile Applications [[Paper]](https://arxiv.org/abs/2306.14289)[[Code]](https://github.com/ChaoningZhang/MobileSAM)
182 | * Track Anything: Segment Anything Meets Videos [[Paper]](https://arxiv.org/abs/2304.11968)[[Code]](https://github.com/gaomingqi/Track-Anything)
183 |
184 | ### Open-Vocabulary 3D Scene Representations
185 | * Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [[Paper]](https://arxiv.org/pdf/2209.09874.pdf)[[Project]](https://nlmap-saycan.github.io/)
186 | * CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [[Paper]](https://arxiv.org/abs/2112.05139)[[Project]](https://cassiepython.github.io/clipnerf/)
187 | * CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [[Paper]](https://arxiv.org/abs/2210.05663) [[Project]](https://mahis.life/clip-fields) [[Code]](https://github.com/notmahi/clip-fields)
188 | * LERF: Language Embedded Radiance Fields [[Paper]](https://arxiv.org/abs/2303.09553)[[Project]](https://www.lerf.io/)[[Code]](https://github.com/kerrj/lerf)
189 | * Decomposing NeRF for Editing via Feature Field Distillation [[Paper]](https://arxiv.org/abs/2205.15585)[[Project]](https://pfnet-research.github.io/distilled-feature-fields)
190 |
191 | ### Open-Vocabulary Object Representations
192 | * FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [[Paper]](https://arxiv.org/abs/2312.08344)[[Project]](https://nvlabs.github.io/FoundationPose/)
193 | * BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects [[Paper]](https://arxiv.org/abs/2303.14158)[[Project]](https://bundlesdf.github.io/)
194 | * Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation [[Paper]](https://arxiv.org/abs/2112.05124)[[Project]](https://yilundu.github.io/ndf/)
195 | * Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation [[Paper]](https://arxiv.org/abs/2308.07931)[[Project]](https://f3rm.github.io/)
196 | * You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example [[Paper]](https://arxiv.org/abs/2305.12626)
197 | * Zero-Shot Category-Level Object Pose Estimation [[Paper]](https://arxiv.org/abs/2204.03635)[[Code]](https://github.com/applied-ai-lab/zero-shot-pose)
198 | * VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors [[Paper]](https://arxiv.org/abs/2210.11339)[[Project]](https://ut-austin-rpl.github.io/VIOLA/)[[Code]](https://github.com/UT-Austin-RPL/VIOLA)
199 | * Learning Generalizable Manipulation Policies with Object-Centric 3D Representations [[Paper]](http://arxiv.org/abs/2310.14386)[[Project]](https://ut-austin-rpl.github.io/GROOT/)[[Code]](https://github.com/UT-Austin-RPL/GROOT)
200 |
201 | ### Affordance Information
202 | * Affordance Diffusion: Synthesizing Hand-Object Interactions [[Paper]](https://arxiv.org/abs/2303.12538)[[Project]](https://judyye.github.io/affordiffusion-www/)
203 | * Affordances from Human Videos as a Versatile Representation for Robotics [[Paper]](https://arxiv.org/abs/2304.08488)[[Project]](https://robo-affordances.github.io/)
204 |
205 | ### Predictive Models
206 | * Adversarial Inverse Reinforcement Learning With Self-Attention Dynamics Model [[Paper]](https://ieeexplore.ieee.org/document/9361118)
207 | * Connected Autonomous Vehicle Motion Planning with Video Predictions from Smart, Self-Supervised Infrastructure [[Paper]](https://arxiv.org/pdf/2309.07504.pdf)
208 | * Self-Supervised Traffic Advisors: Distributed, Multi-view Traffic Prediction for Smart Cities [[Paper]](https://arxiv.org/abs/2204.06171)
209 | * Planning with Diffusion for Flexible Behavior Synthesis [[Paper]](https://arxiv.org/abs/2205.09991)
210 | * Phenaki: Variable-Length Video Generation from Open Domain Textual Description [[Paper]](https://arxiv.org/abs/2210.02399)
211 | * RoboNet: Large-Scale Multi-Robot Learning [[Paper]](https://arxiv.org/abs/1910.11215)
212 | * GAIA-1: A Generative World Model for Autonomous Driving [[Paper]](https://arxiv.org/abs/2309.17080)
213 | * Learning Universal Policies via Text-Guided Video Generation [[Paper]](https://arxiv.org/abs/2302.00111)
214 | * Video Language Planning [[Paper]](https://arxiv.org/abs/2310.10625)
215 | * MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [[Paper]](https://arxiv.org/pdf/2403.12037.pdf)[[Project]](https://sites.google.com/view/minedreamer/main)[[Code]](https://github.com/Zhoues/MineDreamer)
216 |
217 | ## Relevant to Robotics (Embodied AI)
218 | * Inner Monologue: Embodied Reasoning through Planning with Language Models [[Paper]](https://arxiv.org/abs/2207.05608)[[Project]](https://innermonologue.github.io/)
219 | * Statler: State-Maintaining Language Models for Embodied Reasoning [[Paper]](https://arxiv.org/abs/2306.17840)[[Project]](https://statler-lm.github.io/)
220 | * EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [[Paper]](https://arxiv.org/pdf/2305.15021.pdf)[[Project]](https://embodiedgpt.github.io/)
221 | * MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [[Paper]](https://openreview.net/forum?id=rc8o_j8I8PX)[[Code]](https://github.com/MineDojo/MineDojo)
222 | * Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [[Paper]](https://arxiv.org/abs/2206.11795)
223 | * Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction [[Paper]](https://arxiv.org/pdf/2301.10034.pdf)[[Code]](https://github.com/CraftJarvis/MC-Controller)
224 | * Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [[Paper]](https://arxiv.org/pdf/2302.01560.pdf)[[Code]](https://github.com/CraftJarvis/MC-Planner)
225 | * Voyager: An Open-Ended Embodied Agent with Large Language Models [[Paper]](https://arxiv.org/abs/2305.16291)[[Project]](https://voyager.minedojo.org/)[[Code]](https://github.com/MineDojo/Voyager)
226 | * Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [[Paper]](https://arxiv.org/abs/2305.17144)[[Project]](https://github.com/OpenGVLab/GITM)
227 | * Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [[Paper]](https://arxiv.org/pdf/2201.07207.pdf)[[Project]](https://wenlong.page/language-planner/)[[Code]](https://github.com/huangwl18/language-planner)
228 | * GROOT: Learning to Follow Instructions by Watching Gameplay Videos [[Paper]](https://arxiv.org/pdf/2310.08235.pdf)[[Project]](https://craftjarvis.github.io/GROOT/)[[Code]](https://github.com/CraftJarvis/GROOT)
229 | * JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [[Paper]](https://arxiv.org/pdf/2311.05997.pdf)[[Project]](https://craftjarvis.github.io/JARVIS-1/)[[Code]](https://github.com/CraftJarvis/JARVIS-1)
230 | * SQA3D: Situated Question Answering in 3D Scenes [[Paper]](https://arxiv.org/pdf/2210.07474.pdf)[[Project]](https://sqa3d.github.io/)[[Code]](https://github.com/SilongYong/SQA3D)
231 | * MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [[Paper]](https://arxiv.org/pdf/2312.07472.pdf)[[Project]](https://iranqin.github.io/MP5.github.io/)[[Code]](https://github.com/IranQin/MP5)
232 | * MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [[Paper]](https://arxiv.org/pdf/2403.12037.pdf)[[Project]](https://sites.google.com/view/minedreamer/main)[[Code]](https://github.com/Zhoues/MineDreamer)
233 |
234 | ### Generalist AI
235 | * Generative Agents: Interactive Simulacra of Human Behavior [[Paper]](https://arxiv.org/abs/2304.03442)
236 | * Towards Generalist Robots: A Promising Paradigm via Generative Simulation [[Paper]](https://arxiv.org/abs/2305.10455)
237 | * A Generalist Agent [[Paper]](https://arxiv.org/abs/2205.06175)
238 | * An Embodied Generalist Agent in 3D World [[Paper]](https://arxiv.org/pdf/2311.12871.pdf)[[Project]](https://embodied-generalist.github.io/)[[Code]](https://github.com/embodied-generalist/embodied-generalist)
239 |
240 | ### Simulators
241 | * Gibson Env: Real-World Perception for Embodied Agents [[Paper]](https://arxiv.org/abs/1808.10654)
242 | * iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [[Paper]](https://arxiv.org/abs/2108.03272)[[Project]](https://svl.stanford.edu/igibson/)
243 | * BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [[Paper]](https://openreview.net/forum?id=_8DoIe8G3t)[[Project]](https://behavior.stanford.edu/behavior-1k)
244 | * Habitat: A Platform for Embodied AI Research [[Paper]](https://arxiv.org/abs/1904.01201)[[Project]](https://aihabitat.org/)
245 | * Habitat 2.0: Training Home Assistants to Rearrange Their Habitat [[Paper]](https://arxiv.org/abs/2106.14405)
246 | * RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [[Paper]](https://arxiv.org/abs/2004.06799)[[Project]](https://ai2thor.allenai.org/robothor/)
247 | * VirtualHome: Simulating Household Activities via Programs [[Paper]](https://arxiv.org/abs/1806.07011)
248 | * ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes [[Paper]](https://arxiv.org/pdf/2304.04321.pdf)[[Project]](https://arnold-benchmark.github.io/)[[Code]](https://github.com/arnold-benchmark/arnold)
249 | * ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks [[Paper]](https://arxiv.org/abs/1912.01734)[[Project]](https://askforalfred.com/)[[Code]](https://github.com/askforalfred/alfred)
250 | * LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning [[Paper]](https://arxiv.org/pdf/2306.03310.pdf)[[Project]](https://lifelong-robot-learning.github.io/LIBERO/html/getting_started/overview.html)[[Code]](https://github.com/Lifelong-Robot-Learning/LIBERO)
251 | * ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [[Paper]](https://arxiv.org/abs/2206.06994)[[Project]](https://procthor.allenai.org/)[[Code]](https://github.com/allenai/procthor)
252 |
--------------------------------------------------------------------------------