# Video Language Planning
## [Project Page](https://video-language-planning.github.io/)

[//]: # (### Abstract)

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget, where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains, from multi-object rearrangement to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).

For more info, see the [project webpage](https://video-language-planning.github.io/).

## Code

Most of the code in the paper was run using Google infrastructure. We have attached an [illustrative Colab notebook](https://github.com/video-language-planning/vlp_code/blob/master/example.ipynb) that walks through the code for VLP. Please see this [codebase](https://github.com/UMass-Foundation-Model/COMBO/) for an open-source implementation of VLP, which includes code for training video and VLM models as well as for tree search. A schematic sketch of the planning loop is also included at the end of this README.

## Data

The long-horizon dataset used in the paper can be found in the Google Cloud Storage bucket: gs://gresearch/robotics/language_table/captions/

Each loaded sample consists of:

    long_horizon_instructions
    start_times
    captions
    frames
    end_times

In this dataset, `long_horizon_instructions` is the long-horizon text goal of the video, while `captions` are labels for the short-horizon text goals within the video; `start_times` and `end_times` indicate which frames each short-horizon text goal corresponds to. A sketch of how one might load and inspect this data is included at the end of this README.

## Bibtex

```
@article{du2023video,
  title={Video Language Planning},
  author={Du, Yilun and Yang, Mengjiao and Florence, Pete and Xia, Fei and Wahid, Ayzaan and Ichter, Brian and Sermanet, Pierre and Yu, Tianhe and Abbeel, Pieter and Tenenbaum, Joshua B and others},
  journal={arXiv preprint arXiv:2310.10625},
  year={2023}
}
```
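
For orientation, the following is a schematic sketch of the tree-search procedure described above, written as Python-style pseudocode. The callables `vlm_policy`, `video_model`, and `vlm_value`, as well as the branching and beam parameters, are hypothetical placeholders and not the API of the released notebook or the COMBO codebase; consult those for the actual implementation.

```python
def video_language_plan(vlm_policy, video_model, vlm_value,
                        goal, image, n_branch=4, beam=2, depth=5):
    """Schematic VLP-style tree search (hypothetical interfaces).

    vlm_policy(goal, frame)  -> list of candidate short-horizon text goals
    video_model(frame, text) -> short video rollout (list of frames) from `frame`
    vlm_value(goal, video)   -> scalar estimate of task progress
    """
    # Each beam entry is (video plan so far, captions so far, last frame).
    beams = [([], [], image)]
    for _ in range(depth):
        candidates = []
        for plan, captions, frame in beams:
            # (i) VLM as policy: propose next short-horizon text goals.
            for text in vlm_policy(goal, frame)[:n_branch]:
                # (ii) Text-to-video model as dynamics: roll the plan forward.
                video = video_model(frame, text)
                # VLM as value function: score the resulting rollout.
                score = vlm_value(goal, video)
                candidates.append(
                    (score, plan + [video], captions + [text], video[-1]))
        # Keep only the highest-value branches and continue the search.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(p, c, f) for _, p, c, f in candidates[:beam]]
    best_plan, best_captions, _ = beams[0]
    return best_plan, best_captions
```

The returned video plan and captions would then be handed to a goal-conditioned policy that is conditioned on each intermediate frame, as described in the abstract.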
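
The on-disk format of the captions dataset is not documented here, so the following is only a minimal sketch of how one might inspect it. It assumes TFRecord shards of `tf.train.Example` protos carrying the fields listed in the Data section; the actual storage layout (for example an RLDS/TFDS build) may differ, in which case the feature spec below has to be adapted.

```python
import tensorflow as tf

# Path from the Data section above.
BUCKET = "gs://gresearch/robotics/language_table/captions/"

# List the shard files in the bucket and build a record dataset.
# (Adjust the glob pattern if the bucket contains subdirectories.)
files = tf.io.gfile.glob(BUCKET + "*")
dataset = tf.data.TFRecordDataset(files)

# Assumed feature spec -- dtypes and encodings are guesses and may need
# to be changed to match the actual serialization.
features = {
    "long_horizon_instructions": tf.io.VarLenFeature(tf.string),
    "captions": tf.io.VarLenFeature(tf.string),
    "start_times": tf.io.VarLenFeature(tf.int64),
    "end_times": tf.io.VarLenFeature(tf.int64),
    "frames": tf.io.VarLenFeature(tf.string),  # e.g. encoded images
}

# Parse and inspect a single example.
for record in dataset.take(1):
    example = tf.io.parse_single_example(record, features)
    print({k: tf.sparse.to_dense(v).shape for k, v in example.items()})
```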