# AwesomeGAIManipulation

## Survey

## Data Generation
- **GRUtopia: Dream General Robots in a City at Scale**
[[paper]](https://arxiv.org/abs/2407.10943)
[[code]](https://github.com/OpenRobotLab/GRUtopia)
- **Diffusion for Multi-Embodiment Grasping**
[[paper]](https://arxiv.org/html/2410.18835v1)
- **Gen2Sim: Scaling Up Robot Learning in Simulation with Generative Models (ICRA 2024)**
[[paper]](https://arxiv.org/abs/2310.18308)
[[code]](https://github.com/pushkalkatara/Gen2Sim)
[[webpage]](https://gen2sim.github.io/)
- **RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation (ICML 2024)**
[[paper]](https://arxiv.org/abs/2311.01455)
[[code]](https://github.com/Genesis-Embodied-AI/RoboGen)
[[webpage]](https://robogen-ai.github.io/)
- **Holodeck: Language Guided Generation of 3D Embodied AI Environments (CVPR 2024)**
[[paper]](https://arxiv.org/abs/2312.09067)
[[code]](https://github.com/allenai/Holodeck)
[[webpage]](https://yueyang1996.github.io/holodeck/)
- **Video Generation Models as World Simulators**
[[paper]](https://arxiv.org/abs/2410.18072)
[[webpage]](https://openai.com/research/video-generation-models-as-world-simulators)
- **Learning Interactive Real-World Simulators**
[[paper]](https://arxiv.org/abs/2310.06114)
- **MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations (CoRL 2023)**
[[paper]](https://proceedings.mlr.press/v229/mandlekar23a/mandlekar23a.pdf)
[[code]](https://github.com/NVlabs/mimicgen)
[[webpage]](https://mimicgen.github.io/)
- **CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation (CVPR 2024)**
[[paper]](https://arxiv.org/pdf/2402.14795)
[[code]](https://github.com/wang59695487/hand_teleop_real_sim_mix_adr)
[[webpage]](https://cyber-demo.github.io/)
- **Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning**
[[paper]](https://arxiv.org/abs/2402.17768)
[[code]](https://github.com/ErinZhang1998/dmd_diffusion)
- **DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning (ICRA 2025)**
[[paper]](https://arxiv.org/pdf/2410.24185)
[[code]](https://github.com/NVlabs/dexmimicgen/)
[[webpage]](https://dexmimicgen.github.io/)
- **IntervenGen: Interventional Data Generation for Robust and Data-Efficient Robot Imitation Learning**
[[paper]](https://arxiv.org/abs/2405.01472)
[[webpage]](https://sites.google.com/view/intervengen2024)
- **Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (CoRL 2023)**
[[paper]](https://proceedings.mlr.press/v229/ha23a/ha23a.pdf)
[[code]](https://github.com/real-stanford/scalingup)
[[webpage]](https://www.cs.columbia.edu/~huy/scalingup/)
- **GenAug: Retargeting Behaviors to Unseen Situations via Generative Augmentation (RSS 2023)**
[[paper]](https://arxiv.org/abs/2302.06671)
[[code]](https://github.com/genaug/genaug)
[[webpage]](https://genaug.github.io/)
- **Scaling Robot Learning with Semantically Imagined Experience (RSS 2023)**
[[paper]](https://arxiv.org/abs/2302.11550)
[[webpage]](https://diffusion-rosie.github.io/)
- **RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning (CoRL 2024)**
[[paper]](https://rovi-aug.github.io/static/pdf/rovi_aug_paper.pdf)
[[code]](https://github.com/BerkeleyAutomation/rovi-aug)
[[webpage]](https://rovi-aug.github.io/)
- **Learning Robust Real-World Dexterous Grasping Policies via Implicit Shape Augmentation (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2210.13638)
[[webpage]](https://sites.google.com/view/implicitaugmentation/home)
- **DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics**
[[paper]](https://arxiv.org/abs/2210.02438)
- **Shadow: Leveraging Segmentation Masks for Cross-Embodiment Policy Transfer (CoRL 2024)**
[[paper]](https://shadow-cross-embodiment.github.io/static/shadow24.pdf)
[[webpage]](https://shadow-cross-embodiment.github.io/)
- **Human-to-Robot Imitation in the Wild (RSS 2022)**
[[paper]](https://arxiv.org/abs/2207.09450)
[[webpage]](https://human2robot.github.io/)
- **Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting (RSS 2024)**
[[paper]](https://robot-mirage.github.io/static/pdf/mirage_paper.pdf)
[[code]](https://github.com/BerkeleyAutomation/mirage)
[[webpage]](https://robot-mirage.github.io/)
- **CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning**
[[paper]](https://arxiv.org/abs/2212.05711)
[[webpage]](https://cacti-framework.github.io/)
- **RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking (ICRA 2024)**
[[paper]](https://arxiv.org/abs/2309.01918)
[[code]](https://github.com/robopen/roboagent/)
[[webpage]](https://robopen.github.io/)
- **ExAug: Robot-Conditioned Navigation Policies via Geometric Experience Augmentation (ICRA 2023)**
[[paper]](https://arxiv.org/abs/2210.07450)
[[code]](https://github.com/NHirose/ExAug)
[[webpage]](https://sites.google.com/view/exaug-nav)
- **RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning (ICRA 2025)**
[[paper]](https://arxiv.org/abs/2409.14674)
[[code]](https://github.com/sled-group/RACER)
[[webpage]](https://rich-language-failure-recovery.github.io/)
- **Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models (RSS 2023)**
[[paper]](https://arxiv.org/abs/2211.11736)
[[webpage]](https://instructionaugmentation.github.io/)


## Reward Generation
- **Language to Rewards for Robotic Skill Synthesis (CoRL 2023)**
[[paper]](https://openreview.net/forum?id=SgTPdyehXMA)
[[code]](https://github.com/google-deepmind/language_to_reward_2023)
[[webpage]](https://language-to-reward.github.io/)
- **Vision-Language Models as Success Detectors (CoLLAs 2023)**
[[paper]](https://proceedings.mlr.press/v232/du23b/du23b.pdf)
- **Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models (CoRL 2024)**
[[paper]](https://arxiv.org/abs/2410.17772)
[[code]](https://robottasklabeling.github.io/)
[[webpage]](https://robottasklabeling.github.io/)
- **FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning (ICML 2024)**
[[paper]](https://arxiv.org/abs/2406.00645)
[[code]](https://github.com/fuyw/FuRL)
- **Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning (ICLR 2024)**
[[paper]](https://openreview.net/forum?id=tUM39YTRxH)
- **Eureka: Human-Level Reward Design via Coding Large Language Models (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2310.12931)
- **Agentic Skill Discovery (CoRL 2024 Workshop & ICRA@40)**
[[paper]](https://arxiv.org/abs/2405.15019)
[[code]](https://github.com/xf-zhao/Agentic-Skill-Discovery)
- **CLIPort: What and Where Pathways for Robotic Manipulation**
[[paper]](https://arxiv.org/abs/2109.12098)
- **R3M: A Universal Visual Representation for Robot Manipulation**
[[paper]](https://arxiv.org/abs/2203.12601)
[[code]](https://github.com/facebookresearch/r3m)
[[webpage]](https://sites.google.com/view/robot-r3m/?pli=1)
- **LIV: Language-Image Representations and Rewards for Robotic Control (ICML 2023)**
[[paper]](https://arxiv.org/abs/2306.00958)
[[code]](https://github.com/penn-pal-lab/LIV)
[[webpage]](https://penn-pal-lab.github.io/LIV/)
- **Learning Reward Functions for Robotic Manipulation by Observing Humans**
[[paper]](https://arxiv.org/abs/2211.09019)
- **Deep Visual Foresight for Planning Robot Motion (ICRA 2017)**
[[paper]](https://arxiv.org/abs/1610.00696)
- **VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation (RSS 2024)**
[[paper]](https://arxiv.org/abs/2407.09829)
[[code]](https://github.com/PPjmchen/VLMPC)
- **Learning Reward for Robot Skills Using Large Language Models via Self-Alignment (ICML 2024)**
[[paper]](https://arxiv.org/abs/2405.07162)
- **Video Prediction Models as Rewards for Reinforcement Learning**
[[paper]](https://arxiv.org/abs/2305.14343)
[[webpage]](https://escontrela.me/viper)
- **VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (ICLR 2023)**
[[paper]](https://arxiv.org/abs/2210.00030)
[[code]](https://github.com/facebookresearch/vip)
- **Learning to Understand Goal Specifications by Modelling Reward**
[[paper]](https://arxiv.org/pdf/1806.01946)
- **Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks**
[[paper]](https://arxiv.org/abs/2405.01534)
- **Policy Improvement Using Language Feedback Models (NeurIPS 2024)**
[[paper]](https://arxiv.org/abs/2402.07876)


## State Generation

- **Reinforcement Learning with Action-Free Pre-Training from Videos (ICML 2022)**
[[paper]](https://proceedings.mlr.press/v162/seo22a/seo22a.pdf)
[[code]](https://github.com/younggyoseo/apv)
- **Mastering Diverse Domains through World Models**
[[paper]](https://arxiv.org/pdf/2301.04104v2)
[[code]](https://github.com/danijar/dreamerv3)
[[webpage]](https://danijar.com/project/dreamerv3/)
- **Dream to Control: Learning Behaviors by Latent Imagination**
[[paper]](https://arxiv.org/abs/1912.01603)
- **Robot Shape and Location Retention in Video Generation Using Diffusion Models**
[[paper]](https://arxiv.org/abs/2407.02873)
[[code]](https://github.com/PengPaulWang/diffusion-robots)
- **Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions**
[[paper]](https://arxiv.org/abs/2404.01812)
[[code]](https://github.com/ActNeRF/ActNeRF)
[[webpage]](https://actnerf.github.io/)
- **Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors**
[[paper]](https://arxiv.org/abs/2403.14526)
[[code]](https://github.com/tsagkas/click2grasp)
[[webpage]](https://tsagkas.github.io/click2grasp/)
- **Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2404.09857)
- **DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects**
[[paper]](https://arxiv.org/abs/2404.12524)
- **KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations (ICML 2024)**
[[paper]](https://openreview.net/pdf?id=oCI9gHocws)
- **DynSyn: Dynamical Synergistic Representation for Efficient Learning and Control in Overactuated Embodied Systems (ICML 2024)**
[[paper]](https://arxiv.org/abs/2407.11472)
- **Symmetry-Aware Robot Design with Structured Subgroups (ICML 2023)**
[[paper]](https://arxiv.org/abs/2306.00036)
- **Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis (ICCV 2023)**
[[paper]](https://arxiv.org/abs/2304.12317)
[[code & data]](https://github.com/andrewsonga/Total-Recon)
[[webpage]](https://andrewsonga.github.io/totalrecon)
- **Explore and Tell: Embodied Visual Captioning in 3D Environments (ICCV 2023)**
[[paper]](https://arxiv.org/abs/2308.10447)
[[code & data]](https://aim3-ruc.github.io/ExploreAndTell)
- **Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2405.01527)
- **ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2403.08321)
[[code]](https://github.com/GuanxingLu/ManiGaussian)
[[webpage]](https://guanxinglu.github.io/ManiGaussian/)
- **Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics**
[[paper]](https://arxiv.org/abs/2406.10788)
- **Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training (NeurIPS 2024)**
[[paper]](https://arxiv.org/pdf/2402.14407)
[[code]](https://github.com/tinnerhrhe/VPDD)
[[webpage]](https://video-diff.github.io/)
- **PreLAR: World Model Pre-training with Learnable Action Representation (ECCV 2024)**
[[paper]](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03363.pdf)
[[code]](https://github.com/zhanglixuan0720/PreLAR)
- **Octopus: Embodied Vision-Language Programmer from Environmental Feedback**
[[paper]](https://arxiv.org/abs/2310.08588)
[[code]](https://github.com/dongyh20/Octopus)
[[webpage]](https://choiszt.github.io/Octopus/)
- **EC2: Emergent Communication for Embodied Control (CVPR 2023)**
[[paper]](https://arxiv.org/abs/2304.09448)
- **VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models**
[[paper]](https://arxiv.org/abs/2307.05973)

## Language Generation
- **Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents (ICML 2022)**
[[paper]](https://arxiv.org/pdf/2201.07207.pdf)
[[code]](https://github.com/huangwl18/language-planner)
- **Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2307.14535)
[[code]](https://github.com/real-stanford/scalingup)
- **Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks (ICLR 2024)**
[[paper]](https://arxiv.org/pdf/2405.01534)
[[code]](https://github.com/mihdalal/planseqlearn)
- **Large Language Models as Commonsense Knowledge for Large-Scale Task Planning (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2305.14078)
[[code]](https://github.com/1989Ryan/llm-mcts)
- **REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2306.15724)
[[code]](https://github.com/real-stanford/reflect)
- **Gesture-Informed Robot Assistance via Foundation Models (CoRL 2023)**
[[paper]](https://openreview.net/pdf?id=Ffn8Z4Q-zU)
- **Large Language Models for Robotics: Opportunities, Challenges, and Perspectives**
[[paper]](https://arxiv.org/pdf/2401.04334)
- **Embodied Agent Interface (EAI): Benchmarking LLMs for Embodied Decision Making (NeurIPS 2024 Datasets and Benchmarks Track)**
[[paper]](https://arxiv.org/abs/2410.07166)
[[code]](https://github.com/embodied-agent-interface/embodied-agent-interface)
- **EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought (NeurIPS 2023)**
[[paper]](https://arxiv.org/pdf/2305.15021.pdf)
[[code]](https://github.com/OpenGVLab/EmbodiedGPT)
- **Chat with the Environment: Interactive Multimodal Perception using Large Language Models (IROS 2023)**
[[paper]](https://arxiv.org/abs/2303.08268)
[[code]](https://github.com/xf-zhao/Matcha)
- **Embodied CoT Distillation From LLM to Off-the-Shelf Agents (ICML 2024)**
[[paper]](https://arxiv.org/html/2412.11499v1)
- **Do As I Can, Not As I Say: Grounding Language in Robotic Affordances**
[[paper]](https://say-can.github.io/assets/palm_saycan.pdf)
[[code]](https://github.com/google-research/google-research/tree/master/saycan)
- **Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents (NeurIPS 2023)**
[[paper]](https://openreview.net/pdf?id=JCCi58IUsh)
- **Inner Monologue: Embodied Reasoning through Planning with Language Models (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2207.05608)
- **PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models**
[[paper]](https://arxiv.org/pdf/2402.16836.pdf)
[[code]](https://github.com/dkguo/PhyGrasp)
- **SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2307.06135)
- **RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models (ICML 2024)**
[[paper]](https://arxiv.org/abs/2404.04929)
[[code]](https://github.com/aopolin-lv/RoboMP2)
- **Text2Motion: From Natural Language Instructions to Feasible Plans (Autonomous Robots 2023)**
[[paper]](https://openreview.net/pdf?id=M1yTyG5P7Cl)
- **STAP: Sequencing Task-Agnostic Policies (ICRA 2023)**
[[paper]](https://arxiv.org/abs/2210.12250)
[[code]](https://github.com/agiachris/STAP)

## Code Generation

- **Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2404.10220)
- **ProgPrompt: Program Generation for Situated Robot Task Planning Using Large Language Models (Autonomous Robots 2023)**
[[paper]](https://arxiv.org/abs/2209.11302)
- **See and Think: Embodied Agent in Virtual Environment (arXiv 2023)**
[[paper]](https://arxiv.org/abs/2311.15209)
- **Octopus: Embodied Vision-Language Programmer from Environmental Feedback (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2310.08588)
[[webpage]](https://choiszt.github.io/Octopus/)
[[code]](https://github.com/dongyh20/Octopus)
- **Demo2Code: From Summarizing Demonstrations to Synthesizing Code via Extended Chain-of-Thought (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2305.16744)
[[webpage]](https://portal-cornell.github.io/demo2code/)
[[code]](https://github.com/portal-cornell/demo2code)
- **EC2: Emergent Communication for Embodied Control (CVPR 2023)**
[[paper]](https://arxiv.org/abs/2304.09448)
- **When Prolog Meets Generative Models: A New Approach for Managing Knowledge and Planning in Robotic Applications (ICRA 2024)**
[[paper]](https://arxiv.org/abs/2309.15049)
- **Code as Policies: Language Model Programs for Embodied Control (ICRA 2023)**
[[paper]](https://arxiv.org/abs/2209.07753)
[[webpage]](https://code-as-policies.github.io/)
[[code]](https://github.com/google-research/google-research/tree/master/code_as_policies)
- **GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2404.06645)
- **VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2307.05973)
[[webpage]](https://voxposer.github.io/)
[[code]](https://github.com/huangwl18/VoxPoser)
- **ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2409.01652)
[[webpage]](https://rekep-robot.github.io/)
[[code]](https://github.com/huangwl18/ReKep)
- **RoboScript: Code Generation for Free-Form Manipulation Tasks Across Real and Simulation (arXiv 2024)**
[[paper]](https://arxiv.org/abs/2402.16117)
- **RobotGPT: Robot Manipulation Learning From ChatGPT (RAL 2024)**
[[paper]](https://arxiv.org/abs/2312.01421)
- **RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis (ICML 2024)**
[[paper]](https://arxiv.org/abs/2402.16117)
[[webpage]](https://sites.google.com/view/robocodex)
[[code]](https://github.com/RoboCodeX-source/RoboCodeX_code)
- **Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model (arXiv 2023)**
[[paper]](https://arxiv.org/abs/2305.11176)
[[code]](https://github.com/OpenGVLab/Instruct2Act)
- **GenSim: Generating Robotic Simulation Tasks via Large Language Models (ICLR 2024)**
[[paper]](https://arxiv.org/abs/2310.01361)
[[code]](https://github.com/liruiw/GenSim)

## Visual Generation
- **Learning Universal Policies via Text-Guided Video Generation (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2302.00111)
[[webpage]](https://universal-policy.github.io/unipi/)
- **SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation (ICLR 2025)**
[[paper]](https://arxiv.org/abs/2410.23277)
[[webpage]](https://slowfast-vgen.github.io/)
- **Using Left and Right Brains Together: Towards Vision and Language Planning (ICML 2024)**
[[paper]](https://arxiv.org/abs/2402.10534)
- **Compositional Foundation Models for Hierarchical Planning (NeurIPS 2023)**
[[paper]](https://arxiv.org/abs/2309.08587)
[[webpage]](https://hierarchical-planning-foundation-model.github.io/)
- **Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation (NeurIPS 2024)**
[[paper]](https://arxiv.org/abs/2409.09016)
[[code]](https://github.com/OpenDriveLab/CLOVER)
- **GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation**
[[webpage]](https://gr1-manipulation.github.io)
[[code]](https://github.com/bytedance/GR-1)
- **GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation**
[[webpage]](https://gr2-manipulation.github.io)
- **Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models (ICLR 2024)**
[[paper]](https://arxiv.org/abs/2310.10639)
[[webpage]](https://rail-berkeley.github.io/susie/)
[[code]](https://github.com/kvablack/susie)
- **Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts (CVPR 2024)**
[[paper]](https://openaccess.thecvf.com/content/CVPR2024/html/Ni_Generate_Subgoal_Images_before_Act_Unlocking_the_Chain-of-Thought_Reasoning_in_CVPR_2024_paper.html)
[[webpage]](https://cotdiffusion.github.io/)
- **Surfer: Progressive Reasoning with World Models for Robotic Manipulation**
[[paper]](https://arxiv.org/abs/2306.11335)
- **TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2211.09325)
[[webpage]](https://sites.google.com/view/tax-pose/home)
[[code]](https://github.com/r-pad/taxpose)
- **Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies (CoRL 2024)**
[[paper]](https://arxiv.org/abs/2406.11740)
[[webpage]](https://haojhuang.github.io/imagine_page/)
[[code]](https://github.com/HaojHuang/imagination-policy-cor24)
- **Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions**
[[paper]](https://arxiv.org/abs/2404.01812)
[[webpage]](https://actnerf.github.io/)
[[code]](https://github.com/ActNeRF/ActNeRF)
- **Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation (CoRL 2022)**
[[paper]](https://arxiv.org/abs/2209.05451)
[[webpage]](https://peract.github.io/)
[[code]](https://github.com/peract/peract)
- **ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation (ECCV 2024)**
[[paper]](https://arxiv.org/abs/2403.08321)
[[webpage]](https://guanxinglu.github.io/ManiGaussian/)
[[code]](https://github.com/GuanxingLu/ManiGaussian)
- **GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields (CoRL 2023)**
[[paper]](https://arxiv.org/abs/2308.16891)
[[webpage]](https://yanjieze.com/GNFactor/)
[[code]](https://github.com/YanjieZe/GNFactor)
- **WorldVLA: Towards Autoregressive Action World Model**
[[paper]](https://arxiv.org/pdf/2506.21539)
[[webpage]](https://github.com/alibaba-damo-academy/WorldVLA)
[[code]](https://github.com/alibaba-damo-academy/WorldVLA)


## Grasp Generation

## Trajectory Generation
- **Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation**
[[webpage]](https://mobile-aloha.github.io)
- **Diffusion Policy: Visuomotor Policy Learning via Action Diffusion**
[[webpage]](https://diffusion-policy.cs.columbia.edu)
- **3D Diffuser Actor: Policy Diffusion with 3D Scene Representations**
[[webpage]](https://3d-diffuser-actor.github.io)
- **RT-1: Robotics Transformer for Real-World Control at Scale**
[[webpage]](https://robotics-transformer1.github.io)
- **RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control**
[[webpage]](https://robotics-transformer2.github.io)
- **RVT: Robotic View Transformer for 3D Object Manipulation**
[[webpage]](https://robotic-view-transformer.github.io)
- **RVT-2: Learning Precise Manipulation from Few Examples**
[[webpage]](https://robotic-view-transformer-2.github.io)
- **GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation**
[[webpage]](https://gr1-manipulation.github.io)
- **GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation**
[[webpage]](https://gr2-manipulation.github.io)
- **ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation**
[[webpage]](https://rekep-robot.github.io)
- **Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation**
[[webpage]](https://homangab.github.io/gen2act)
- **OpenVLA: An Open-Source Vision-Language-Action Model**
[[webpage]](https://openvla.github.io/)
- **RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation**
[[webpage]](https://rdt-robotics.github.io/rdt-robotics)
- **π0: Our First Generalist Policy**
[[webpage]](https://www.physicalintelligence.company/blog/pi0)