├── README.md
└── assest
    ├── assemb_1.gif
    ├── bin_1.gif
    ├── cabinet_3.gif
    ├── door_1.gif
    ├── drawer_1.gif
    ├── hammer_2.gif
    ├── hammer_4.gif
    ├── imgs
    ├── kn_3.gif
    ├── light_1.gif
    ├── main_features_embodiedgpt.png
    ├── mic_1.gif
    └── overall_frame_embodiedgpt.png

/README.md:
--------------------------------------------------------------------------------

# EmbodiedGPT [[Paper](https://arxiv.org/pdf/2305.15021.pdf)]

Embodied AI is a crucial frontier in robotics, enabling robots to plan and execute action sequences that accomplish long-horizon tasks in physical environments. In this work, we introduce **EmbodiedGPT**, an end-to-end multi-modal foundation model for embodied AI that equips embodied agents with multi-modal understanding and execution capabilities. To achieve this, we make the following contributions: (i) We craft a large-scale embodied planning dataset, termed **EgoCOT**, consisting of carefully selected videos from the Ego4D dataset paired with high-quality language instructions; specifically, we generate sequences of sub-goals in a chain-of-thought style for effective embodied planning. (ii) We introduce an efficient training approach for high-quality plan generation, adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries, forming a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly improves the success rate of embodied control by extracting more effective features, achieving a 1.6-times increase in success rate on the Franka Kitchen benchmark and a 1.3-times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned on the Ego4D dataset.
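As a rough illustration of the prefix-tuning recipe in (ii), the snippet below is a minimal, self-contained PyTorch sketch of the general technique: a small set of learnable "virtual token" embeddings is prepended to the input of a frozen language model, and only those prefix parameters receive gradients. The toy backbone, dimensions, and names here are hypothetical and are not taken from the EmbodiedGPT codebase.

```python
# Minimal prefix-tuning sketch: learnable "virtual tokens" are prepended to
# the input sequence of a frozen LM; only the prefix is trained. All sizes
# and modules below are illustrative placeholders, not EmbodiedGPT's code.
import torch
import torch.nn as nn

class PrefixTunedLM(nn.Module):
    def __init__(self, frozen_lm: nn.Module, embed: nn.Embedding, prefix_len: int = 10):
        super().__init__()
        self.frozen_lm = frozen_lm
        self.embed = embed
        d_model = embed.embedding_dim
        # The only trainable parameters: prefix_len virtual-token embeddings.
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        # Freeze the pretrained backbone and its token embeddings.
        for p in self.frozen_lm.parameters():
            p.requires_grad_(False)
        for p in self.embed.parameters():
            p.requires_grad_(False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(token_ids)                               # (B, T, D)
        pre = self.prefix.unsqueeze(0).expand(tok.size(0), -1, -1)  # (B, P, D)
        x = torch.cat([pre, tok], dim=1)                          # (B, P+T, D)
        return self.frozen_lm(x)

# Toy stand-in for a frozen pretrained LM backbone.
vocab, d_model = 1000, 64
embed = nn.Embedding(vocab, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
model = PrefixTunedLM(backbone, embed, prefix_len=10)

# Only the prefix is optimized; the backbone stays frozen.
opt = torch.optim.AdamW([model.prefix], lr=1e-3)
ids = torch.randint(0, vocab, (2, 16))
loss = model(ids).pow(2).mean()  # dummy objective, just to show a gradient step
loss.backward()
opt.step()
```

Because the backbone stays frozen, the trainable parameter count is only `prefix_len * d_model`, which is what makes this style of adaptation lightweight even for a 7B LLM.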
## 🤖💬 Online Demo

[**TODO**] EmbodiedGPT will be integrated into [InternGPT](https://github.com/OpenGVLab/InternGPT).

**InternGPT** is online (see [https://igpt.opengvlab.com](https://igpt.opengvlab.com/)). Let's try it!

## 🗓️ Schedule

- [ ] Release EgoCOT dataset
- [ ] Release EgoVQA dataset
- [ ] Release code and models

## 🏠 Overview

![Overall framework of EmbodiedGPT](./assest/overall_frame_embodiedgpt.png)

## 🎁 Major Features

![Major features of EmbodiedGPT](./assest/main_features_embodiedgpt.png)

## Demos

### Demos in Franka Kitchen

| Open the cabinet | Turn the light on | Open the microwave oven | Slide the door open |
|------------------|-------------------|-------------------------|---------------------|
| ![GIF1](./assest/cabinet_3.gif) | ![GIF2](./assest/light_1.gif) | ![GIF3](./assest/mic_1.gif) | ![GIF4](./assest/door_1.gif) |

### Demos in Meta-World

| Assembly task | Place the bin | Hammer the nail | Open the drawer |
|---------------|---------------|-----------------|-----------------|
| ![GIF1](./assest/assemb_1.gif) | ![GIF2](./assest/bin_1.gif) | ![GIF3](./assest/hammer_2.gif) | ![GIF4](./assest/drawer_1.gif) |

## 🎫 License

This project is released under the [Apache 2.0 license](LICENSE).

## 🖊️ Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@misc{2023embodiedgpt,
    title={EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought},
    author={Yao Mu and Qinglong Zhang and Mengkang Hu and Wenhai Wang and Mingyu Ding and Jun Jin and Bin Wang and Jifeng Dai and Yu Qiao and Ping Luo},
    howpublished={\url{https://arxiv.org/abs/2305.15021}},
    year={2023}
}
```