├── index.html
└── static
    ├── css
        ├── bulma-carousel.min.css
        ├── bulma-slider.min.css
        ├── bulma.css.map.txt
        ├── bulma.min.css
        ├── fontawesome.all.min.css
        └── index.css
    ├── images
        ├── archi2.jpg
        ├── carousel1.jpg
        ├── carousel2.jpg
        ├── carousel3.jpg
        ├── carousel4.jpg
        ├── dive-b.ico
        ├── favicon.ico
        ├── progress.png
        ├── results.png
        ├── scene_edit.jpg
        ├── table.png
        └── vis_cmp.png
    ├── js
        ├── bulma-carousel.js
        ├── bulma-carousel.min.js
        ├── bulma-slider.js
        ├── bulma-slider.min.js
        ├── fontawesome.all.min.js
        └── index.js
    ├── pdfs
        ├── archi2.pdf
        └── sample.pdf
    └── videos
        ├── banner_video.mp4
        ├── carousel1.mp4
        ├── carousel2.mp4
        ├── carousel3.mp4
        └── dive
            ├── .DS_Store
            ├── night
                ├── 23779301ebc34c1284e539ddf057f0b4.mp4
                ├── 604d12ebcf784c3f945189f79262f19c.mp4
                └── 80c91574d0174206a74435200aba8ba8.mp4
            ├── rainy
                ├── 23779301ebc34c1284e539ddf057f0b4.mp4
                ├── 604d12ebcf784c3f945189f79262f19c.mp4
                └── 80c91574d0174206a74435200aba8ba8.mp4
            └── sunny
                ├── .DS_Store
                ├── 23779301ebc34c1284e539ddf057f0b4.mp4
                ├── 604d12ebcf784c3f945189f79262f19c.mp4
                └── 80c91574d0174206a74435200aba8ba8.mp4

/index.html:
--------------------------------------------------------------------------------
Generating high-fidelity, temporally consistent videos for autonomous driving remains a significant challenge, e.g. for problematic maneuvers in corner cases. Although recent video generation models built on Diffusion Transformers (DiT) have been proposed to tackle this problem, work that explores their potential for multi-view video generation is still missing. We propose the first DiT-based framework specifically designed to generate temporally and multi-view consistent videos that precisely match a given bird's-eye view (BEV) layout. Specifically, the framework uses a parameter-free spatial view-inflated attention mechanism to ensure cross-view consistency, and integrates joint cross-attention modules and a ControlNet-Transformer to further improve control precision. To demonstrate these advantages, we present extensive qualitative comparisons on the nuScenes dataset, particularly in some of the most challenging corner cases. In summary, our method proves effective at producing long, controllable, and highly consistent videos under difficult conditions.
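The abstract names a parameter-free spatial view-inflated attention mechanism but does not spell it out on this page. The sketch below shows one plausible reading, under stated assumptions: tokens from all camera views are concatenated into a single sequence before ordinary self-attention, so every token can attend across views, and the only change is a reshape that adds no weights. The function names (`self_attention`, `view_inflated_attention`) and the identity Q/K/V projections are hypothetical simplifications for illustration, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention over a list of token vectors.

    Assumption: Q, K and V are the tokens themselves (identity projections).
    A real model would apply learned projections here.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(dim)])
    return out


def view_inflated_attention(views):
    """Attend jointly over the tokens of all views.

    views: list of V views, each a list of N token vectors.
    The V x N tokens are flattened into one sequence of V * N tokens,
    passed through the same attention, then split back per view.
    The flatten and split steps introduce no parameters.
    """
    n = len(views[0])
    flat = [tok for view in views for tok in view]
    attended = self_attention(flat)
    return [attended[i * n:(i + 1) * n] for i in range(len(views))]


if __name__ == "__main__":
    # Two views with one 2-D token each (toy values).
    views = [[[1.0, 0.0]], [[0.0, 1.0]]]
    print(view_inflated_attention(views))
```

With per-view attention, the single token in each toy view can only attend to itself and is returned unchanged. After inflation, each output token is a weighted mix of tokens from both views, which is the cross-view information flow the abstract attributes to the mechanism.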
Long videos generated by DiVE (up to 240 frames at 12 Hz) on the nuScenes dataset.
BibTeX Code Here