# AwesomeGAIManipulation

## Survey

## Data Generation
- **GRUtopia: Dream General Robots in a City at Scale**
[[paper]](https://arxiv.org/abs/2407.10943)
[[code]](https://github.com/OpenRobotLab/GRUtopia)
- **Diffusion for Multi-Embodiment Grasping**
[[paper]](https://arxiv.org/html/2410.18835v1)
- **Gen2Sim: Scaling Up Robot Learning in Simulation with Generative Models (ICRA 2024)**
[[paper]](https://arxiv.org/abs/2310.18308)
[[code]](https://github.com/pushkalkatara/Gen2Sim)
[[webpage]](https://gen2sim.github.io/)
- **RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation (ICML 2024)**
[[paper]](https://arxiv.org/abs/2311.01455)
[[code]](https://github.com/Genesis-Embodied-AI/RoboGen)
[[webpage]](https://robogen-ai.github.io/)
- **Holodeck: Language Guided Generation of 3D Embodied AI Environments (CVPR 2024)**
[[paper]](https://arxiv.org/abs/2312.09067)
[[code]](https://github.com/allenai/Holodeck)
[[webpage]](https://yueyang1996.github.io/holodeck/)
- **Video Generation Models as World Simulators**
[[paper]](https://arxiv.org/abs/2410.18072)
[[webpage]](https://openai.com/research/video-generation-models-as-world-simulators)
- **Learning Interactive Real-World Simulators**
[[paper]](https://arxiv.org/abs/2310.06114)
- **MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations (CoRL 2023)**
[[paper]](https://proceedings.mlr.press/v229/mandlekar23a/mandlekar23a.pdf)
[[code]](https://github.com/NVlabs/mimicgen)
[[webpage]](https://mimicgen.github.io/)
- **CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation (CVPR 2024)**
[[paper]](https://arxiv.org/pdf/2402.14795)
[[code]](https://github.com/wang59695487/hand_teleop_real_sim_mix_adr)
[[webpage]](https://cyber-demo.github.io/)
- **Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning**
[[paper]](https://arxiv.org/abs/2402.17768)
[[code]](https://github.com/ErinZhang1998/dmd_diffusion)
- **DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning (ICRA 2025)**
[[paper]](https://arxiv.org/pdf/2410.24185)
[[code]](https://github.com/NVlabs/dexmimicgen/)
[[webpage]](https://dexmimicgen.github.io/)
- **IntervenGen: Interventional Data Generation for Robust and Data-Efficient Robot Imitation Learning**
[[paper]](https://arxiv.org/abs/2405.01472)
[[webpage]](https://sites.google.com/view/intervengen2024)
- **Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (CoRL 2023)**
[[paper]](https://proceedings.mlr.press/v229/ha23a/ha23a.pdf)
[[code]](https://github.com/real-stanford/scalingup)
[[webpage]](https://www.cs.columbia.edu/~huy/scalingup/)
- **GenAug: Retargeting Behaviors to Unseen Situations via Generative Augmentation (RSS 2023)**
[[paper]](https://arxiv.org/abs/2302.06671)
[[code]](https://github.com/genaug/genaug)
[[webpage]](https://genaug.github.io/)
- **Scaling Robot Learning with Semantically Imagined Experience (RSS 2023)**
[[paper]](https://arxiv.org/abs/2302.11550)
[[webpage]](https://diffusion-rosie.github.io/)
- **RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning (CoRL 2024)**
[[paper]](https://rovi-aug.github.io/static/pdf/rovi_aug_paper.pdf)
[[code]](https://github.com/BerkeleyAutomation/rovi-aug)
[[webpage]](https://rovi-aug.github.io/)
- **Learning Robust Real-World Dexterous Grasping Policies via Implicit Shape Augmentation (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2210.13638)
[[webpage]](https://sites.google.com/view/implicitaugmentation/home)
- **DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics**
[[paper]](https://arxiv.org/abs/2210.02438)
- **Shadow: Leveraging Segmentation Masks for Cross-Embodiment Policy Transfer (CoRL 2024)**
[[paper]](https://shadow-cross-embodiment.github.io/static/shadow24.pdf)
[[webpage]](https://shadow-cross-embodiment.github.io/)
- **Human-to-Robot Imitation in the Wild (RSS 2022)**
[[paper]](https://arxiv.org/abs/2207.09450)
[[webpage]](https://human2robot.github.io/)
- **Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting (RSS 2024)**
[[paper]](https://robot-mirage.github.io/static/pdf/mirage_paper.pdf)
[[code]](https://github.com/BerkeleyAutomation/mirage)
[[webpage]](https://robot-mirage.github.io/)
- **CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning**
[[paper]](https://arxiv.org/abs/2212.05711)
[[webpage]](https://cacti-framework.github.io/)
- **RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking (ICRA 2024)**
[[paper]](https://arxiv.org/abs/2309.01918)
[[code]](https://github.com/robopen/roboagent/)
[[webpage]](https://robopen.github.io/)
- **ExAug: Robot-Conditioned Navigation Policies via Geometric Experience Augmentation (ICRA 2023)**
[[paper]](https://arxiv.org/abs/2210.07450)
[[code]](https://github.com/NHirose/ExAug)
[[webpage]](https://sites.google.com/view/exaug-nav)
- **RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning (ICRA 2025)**
[[paper]](https://arxiv.org/abs/2409.14674)
[[code]](https://github.com/sled-group/RACER)
[[webpage]](https://rich-language-failure-recovery.github.io/)
- **Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models (RSS 2023)**
[[paper]](https://arxiv.org/abs/2211.11736)
[[webpage]](https://instructionaugmentation.github.io/)


## Reward Generation
- **Language to Rewards for Robotic Skill Synthesis (CoRL 2023)**
[[paper]](https://openreview.net/forum?id=SgTPdyehXMA)
[[code]](https://github.com/google-deepmind/language_to_reward_2023)
[[webpage]](https://language-to-reward.github.io/)
- **Vision-Language Models as Success Detectors (CoLLAs 2023)**
[[paper]](https://proceedings.mlr.press/v232/du23b/du23b.pdf)
- **Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models (CoRL 2024)**
[[paper]](https://arxiv.org/abs/2410.17772)
[[code]](https://robottasklabeling.github.io/)
[[webpage]](https://robottasklabeling.github.io/)
- **FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning (ICML 2024)**
[[paper]](https://arxiv.org/abs/2406.00645)
[[code]](https://github.com/fuyw/FuRL)
- **Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning (ICLR 2024)**
[[paper]](https://openreview.net/forum?id=tUM39YTRxH)
- **Eureka: Human-Level Reward Design via Coding Large Language Models (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2310.12931)
- **Agentic Skill Discovery (CoRL 2024 Workshop & ICRA@40)**
[[paper]](https://arxiv.org/abs/2405.15019)
[[code]](https://github.com/xf-zhao/Agentic-Skill-Discovery)
- **CLIPort: What and Where Pathways for Robotic Manipulation**
[[paper]](https://arxiv.org/abs/2109.12098)
- **R3M: A Universal Visual Representation for Robot Manipulation**
[[paper]](https://arxiv.org/abs/2203.12601)
[[code]](https://github.com/facebookresearch/r3m)
[[webpage]](https://sites.google.com/view/robot-r3m/?pli=1)
- **LIV: Language-Image Representations and Rewards for Robotic Control (ICML 2023)**
[[paper]](https://arxiv.org/abs/2306.00958)
[[code]](https://github.com/penn-pal-lab/LIV)
[[webpage]](https://penn-pal-lab.github.io/LIV/)
- **Learning Reward Functions for Robotic Manipulation by Observing Humans**
[[paper]](https://arxiv.org/abs/2211.09019)
- **Deep Visual Foresight for Planning Robot Motion (ICRA 2017)**
[[paper]](https://arxiv.org/abs/1610.00696)
- **VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation (RSS 2024)**
[[paper]](https://arxiv.org/abs/2407.09829)
[[code]](https://github.com/PPjmchen/VLMPC)
- **Learning Reward for Robot Skills Using Large Language Models via Self-Alignment (ICML 2024)**
[[paper]](https://arxiv.org/abs/2405.07162)
- **Video Prediction Models as Rewards for Reinforcement Learning**
[[paper]](https://arxiv.org/abs/2305.14343)
[[webpage]](https://escontrela.me/viper)
- **VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (ICLR 2023)**
[[paper]](https://arxiv.org/abs/2210.00030)
[[code]](https://github.com/facebookresearch/vip)
- **Learning to Understand Goal Specifications by Modelling Reward**
[[paper]](https://arxiv.org/pdf/1806.01946)
- **Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks**
[[paper]](https://arxiv.org/abs/2405.01534)
- **Policy Improvement Using Language Feedback Models (NeurIPS 2024)**
[[paper]](https://arxiv.org/abs/2402.07876)


## State Generation

- **Reinforcement Learning with Action-Free Pre-Training from Videos (ICML 2022)**
[[paper]](https://proceedings.mlr.press/v162/seo22a/seo22a.pdf)
[[code]](https://github.com/younggyoseo/apv)
- **Mastering Diverse Domains through World Models**
[[paper]](https://arxiv.org/pdf/2301.04104v2)
[[code]](https://github.com/danijar/dreamerv3)
[[webpage]](https://danijar.com/project/dreamerv3/)
- **Dream to Control: Learning Behaviors by Latent Imagination**
[[paper]](https://arxiv.org/abs/1912.01603)
- **Robot Shape and Location Retention in Video Generation Using Diffusion Models**
[[paper]](https://arxiv.org/abs/2407.02873)
[[code]](https://github.com/PengPaulWang/diffusion-robots)
- **Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions**
[[paper]](https://arxiv.org/abs/2404.01812)
[[code]](https://github.com/ActNeRF/ActNeRF)
[[webpage]](https://actnerf.github.io/)
- **Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors**
[[paper]](https://arxiv.org/abs/2403.14526)
[[code]](https://github.com/tsagkas/click2grasp)
[[webpage]](https://tsagkas.github.io/click2grasp/)
- **Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2404.09857)
- **DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects**
[[paper]](https://arxiv.org/abs/2404.12524)
- **KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations (ICML 2024)**
[[paper]](https://openreview.net/pdf?id=oCI9gHocws)
- **DynSyn: Dynamical Synergistic Representation for Efficient Learning and Control in Overactuated Embodied Systems (ICML 2024)**
[[paper]](https://arxiv.org/abs/2407.11472)
- **Symmetry-Aware Robot Design with Structured Subgroups (ICML 2023)**
[[paper]](https://arxiv.org/abs/2306.00036)
- **Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis (ICCV 2023)**
[[paper]](https://arxiv.org/abs/2304.12317)
[[code & data]](https://github.com/andrewsonga/Total-Recon)
[[webpage]](https://andrewsonga.github.io/totalrecon)
- **Explore and Tell: Embodied Visual Captioning in 3D Environments (ICCV 2023)**
[[paper]](https://arxiv.org/abs/2308.10447)
[[code & data]](https://aim3-ruc.github.io/ExploreAndTell)
- **Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2405.01527)
- **ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2403.08321)
[[code]](https://github.com/GuanxingLu/ManiGaussian)
[[webpage]](https://guanxinglu.github.io/ManiGaussian/)
- **Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics**
[[paper]](https://arxiv.org/abs/2406.10788)
- **Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training (NeurIPS 2024)**
[[paper]](https://arxiv.org/pdf/2402.14407)
[[code]](https://github.com/tinnerhrhe/VPDD)
[[webpage]](https://video-diff.github.io/)
- **PreLAR: World Model Pre-training with Learnable Action Representation (ECCV 2024)**
[[paper]](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03363.pdf)
[[code]](https://github.com/zhanglixuan0720/PreLAR)
- **Octopus: Embodied Vision-Language Programmer from Environmental Feedback**
[[paper]](https://arxiv.org/abs/2310.08588)
[[code]](https://github.com/dongyh20/Octopus)
[[webpage]](https://choiszt.github.io/Octopus/)
- **EC2: Emergent Communication for Embodied Control (CVPR 2023)**
[[paper]](https://arxiv.org/abs/2304.09448)
- **VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models**
[[paper]](https://arxiv.org/abs/2307.05973)

## Language Generation
- **Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents (ICML 2022)**
[[paper]](https://arxiv.org/pdf/2201.07207.pdf)
[[code]](https://github.com/huangwl18/language-planner)
- **Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2307.14535)
[[code]](https://github.com/real-stanford/scalingup)
- **Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks (ICLR 2024)**
[[paper]](https://arxiv.org/pdf/2405.01534)
[[code]](https://github.com/mihdalal/planseqlearn)
- **Large Language Models as Commonsense Knowledge for Large-Scale Task Planning (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2305.14078)
[[code]](https://github.com/1989Ryan/llm-mcts)
- **REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2306.15724)
[[code]](https://github.com/real-stanford/reflect)
- **Gesture-Informed Robot Assistance via Foundation Models (CoRL 2023)**
[[paper]](https://openreview.net/pdf?id=Ffn8Z4Q-zU)
- **Large Language Models for Robotics: Opportunities, Challenges, and Perspectives**
[[paper]](https://arxiv.org/pdf/2401.04334)
- **Embodied Agent Interface (EAI): Benchmarking LLMs for Embodied Decision Making (NeurIPS 2024 Datasets and Benchmarks Track)**
[[paper]](https://arxiv.org/abs/2410.07166)
[[code]](https://github.com/embodied-agent-interface/embodied-agent-interface)
- **EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought (NeurIPS 2023)**
[[paper]](https://arxiv.org/pdf/2305.15021.pdf)
[[code]](https://github.com/OpenGVLab/EmbodiedGPT)
- **Chat with the Environment: Interactive Multimodal Perception using Large Language Models (IROS 2023)**
[[paper]](https://arxiv.org/abs/2303.08268)
[[code]](https://github.com/xf-zhao/Matcha)
- **Embodied CoT Distillation From LLM to Off-the-Shelf Agents (ICML 2024)**
[[paper]](https://arxiv.org/html/2412.11499v1)
- **Do As I Can, Not As I Say: Grounding Language in Robotic Affordances**
[[paper]](https://say-can.github.io/assets/palm_saycan.pdf)
[[code]](https://github.com/google-research/google-research/tree/master/saycan)
- **Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents (NeurIPS 2023)**
[[paper]](https://openreview.net/pdf?id=JCCi58IUsh)
- **Inner Monologue: Embodied Reasoning through Planning with Language Models (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2207.05608)
- **PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models**
[[paper]](https://arxiv.org/pdf/2402.16836.pdf)
[[code]](https://github.com/dkguo/PhyGrasp)
- **SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2307.06135)
- **RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models (ICML 2024)**
[[paper]](https://arxiv.org/abs/2404.04929)
[[code]](https://github.com/aopolin-lv/RoboMP2)
- **Text2Motion: From Natural Language Instructions to Feasible Plans (Autonomous Robots 2023)**
[[paper]](https://openreview.net/pdf?id=M1yTyG5P7Cl)
- **STAP: Sequencing Task-Agnostic Policies (ICRA 2023)**
[[paper]](https://arxiv.org/abs/2210.12250)
[[code]](https://github.com/agiachris/STAP)

## Code Generation

- **Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2404.10220)
- **ProgPrompt: Program Generation for Situated Robot Task Planning Using Large Language Models (Autonomous Robots 2023)**
[[paper]](https://arxiv.org/abs/2209.11302)
- **See and Think: Embodied Agent in Virtual Environment (arXiv 2023)**
[[paper]](https://arxiv.org/abs/2311.15209)
- **Octopus: Embodied Vision-Language Programmer from Environmental Feedback (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2310.08588)
[[webpage]](https://choiszt.github.io/Octopus/)
[[code]](https://github.com/dongyh20/Octopus)
- **Demo2Code: From Summarizing Demonstrations to Synthesizing Code via Extended Chain-of-Thought (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2305.16744)
[[webpage]](https://portal-cornell.github.io/demo2code/)
[[code]](https://github.com/portal-cornell/demo2code)
- **EC2: Emergent Communication for Embodied Control (CVPR 2023)**
[[paper]](https://arxiv.org/abs/2304.09448)
- **When Prolog Meets Generative Models: A New Approach for Managing Knowledge and Planning in Robotic Applications (ICRA 2024)**
[[paper]](https://arxiv.org/abs/2309.15049)
- **Code as Policies: Language Model Programs for Embodied Control (ICRA 2023)**
[[paper]](https://arxiv.org/abs/2209.07753)
[[webpage]](https://code-as-policies.github.io/)
[[code]](https://github.com/google-research/google-research/tree/master/code_as_policies)
- **GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2404.06645)
- **VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2307.05973)
[[webpage]](https://voxposer.github.io/)
[[code]](https://github.com/huangwl18/VoxPoser)
- **ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2409.01652)
[[webpage]](https://rekep-robot.github.io/)
[[code]](https://github.com/huangwl18/ReKep)
- **RoboScript: Code Generation for Free-Form Manipulation Tasks Across Real and Simulation (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2402.16117)
- **RobotGPT: Robot Manipulation Learning From ChatGPT (RAL 2024)**
[[paper]](https://arxiv.org/abs/2312.01421)
- **RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis (ICML 2024)**
[[paper]](https://arxiv.org/abs/2402.16117)
[[webpage]](https://sites.google.com/view/robocodex)
[[code]](https://github.com/RoboCodeX-source/RoboCodeX_code)
- **Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model (arXiv 2023)**
[[paper]](https://arxiv.org/abs/2305.11176)
[[code]](https://github.com/OpenGVLab/Instruct2Act)
- **GenSim: Generating Robotic Simulation Tasks via Large Language Models (ICLR 2024)**
[[paper]](https://arxiv.org/abs/2310.01361)
[[code]](https://github.com/liruiw/GenSim)

## Visual Generation
- **Learning Universal Policies via Text-Guided Video Generation (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2302.00111)
[[webpage]](https://universal-policy.github.io/unipi/)
- **SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation (ICLR 2025)**
[[paper]](https://arxiv.org/abs/2410.23277)
[[webpage]](https://slowfast-vgen.github.io/)
- **Using Left and Right Brains Together: Towards Vision and Language Planning (ICML 2024)**
[[paper]](https://arxiv.org/abs/2402.10534)
- **Compositional Foundation Models for Hierarchical Planning (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2309.08587)
[[webpage]](https://hierarchical-planning-foundation-model.github.io/)
- **Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation (NeurIPS 2024)**
[[paper]](https://arxiv.org/abs/2409.09016)
[[code]](https://github.com/OpenDriveLab/CLOVER)
- **GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation**
[[webpage]](https://gr1-manipulation.github.io)
[[code]](https://github.com/bytedance/GR-1)
- **GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation**
[[webpage]](https://gr2-manipulation.github.io)
- **Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models (ICLR 2024)**
[[paper]](https://arxiv.org/abs/2310.10639)
[[webpage]](https://rail-berkeley.github.io/susie/)
[[code]](https://github.com/kvablack/susie)
- **Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts (CVPR 2024)**
[[paper]](https://openaccess.thecvf.com/content/CVPR2024/html/Ni_Generate_Subgoal_Images_before_Act_Unlocking_the_Chain-of-Thought_Reasoning_in_CVPR_2024_paper.html)
[[webpage]](https://cotdiffusion.github.io/)
- **Surfer: Progressive Reasoning with World Models for Robotic Manipulation**
[[paper]](https://arxiv.org/abs/2306.11335)
- **TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2211.09325)
[[webpage]](https://sites.google.com/view/tax-pose/home)
[[code]](https://github.com/r-pad/taxpose)
- **Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies (CoRL 2024)**
[[paper]](https://arxiv.org/abs/2406.11740)
[[webpage]](https://haojhuang.github.io/imagine_page/)
[[code]](https://github.com/HaojHuang/imagination-policy-cor24)
- **Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions**
[[paper]](https://arxiv.org/abs/2404.01812)
[[webpage]](https://actnerf.github.io/)
[[code]](https://github.com/ActNeRF/ActNeRF)
- **Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2209.05451)
[[webpage]](https://peract.github.io/)
[[code]](https://github.com/peract/peract)
- **ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2403.08321)
[[webpage]](https://guanxinglu.github.io/ManiGaussian/)
[[code]](https://github.com/GuanxingLu/ManiGaussian)
- **GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2308.16891)
[[webpage]](https://yanjieze.com/GNFactor/)
[[code]](https://github.com/YanjieZe/GNFactor)
- **WorldVLA: Towards Autoregressive Action World Model**
[[paper]](https://arxiv.org/pdf/2506.21539)
[[webpage]](https://github.com/alibaba-damo-academy/WorldVLA)
[[code]](https://github.com/alibaba-damo-academy/WorldVLA)


## Grasp Generation

## Trajectory Generation
- **Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation**
[[webpage]](https://mobile-aloha.github.io)
- **Diffusion Policy: Visuomotor Policy Learning via Action Diffusion**
[[webpage]](https://diffusion-policy.cs.columbia.edu)
- **3D Diffuser Actor: Policy Diffusion with 3D Scene Representations**
[[webpage]](https://3d-diffuser-actor.github.io)
- **RT-1: Robotics Transformer for Real-World Control at Scale**
[[webpage]](https://robotics-transformer1.github.io)
- **RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control**
[[webpage]](https://robotics-transformer2.github.io)
- **RVT: Robotic View Transformer for 3D Object Manipulation**
[[webpage]](https://robotic-view-transformer.github.io)
- **RVT-2: Learning Precise Manipulation from Few Examples**
[[webpage]](https://robotic-view-transformer-2.github.io)
- **GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation**
[[webpage]](https://gr1-manipulation.github.io)
- **GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation**
[[webpage]](https://gr2-manipulation.github.io)
- **ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation**
[[webpage]](https://rekep-robot.github.io)
- **Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation**
[[webpage]](https://homangab.github.io/gen2act)
- **OpenVLA: An Open-Source Vision-Language-Action Model**
[[webpage]](https://openvla.github.io/)
- **RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation**
[[webpage]](https://rdt-robotics.github.io/rdt-robotics)
- **π0: Our First Generalist Policy**
[[webpage]](https://www.physicalintelligence.company/blog/pi0)