├── .gitattributes
├── README.md
└── figs
    ├── pipeline-8-15.PNG
    └── teaser-8-13.PNG

--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![Python >=3.5](https://img.shields.io/badge/Python->=3.5-yellow.svg)
![PyTorch >=1.6](https://img.shields.io/badge/PyTorch->=1.6-blue.svg)

# Progressive Text-to-Image Diffusion with Soft Latent Direction

The *official* repository for Progressive Text-to-Image Diffusion with Soft Latent Direction.

## News
- 2023.09 Code will be released soon.

## Progressive Text-to-Image

![framework](figs/teaser-8-13.PNG)

Existing text-to-image synthesis approaches struggle with textual prompts involving multiple entities and specified relational directions. We propose to decompose the long prompt into a set of short commands, including synthesis, editing, and erasing operations, using a Large Language Model (LLM), and to generate the image progressively. Our strategy enhances both controllability and fidelity, and allows interactive modification through user intervention at each generation step (a hypothetical decomposition example appears at the end of this document).

## Pipeline

![framework](figs/pipeline-8-15.PNG)

Overview of our unified framework for progressive synthesis, editing, and erasing. In each progressive step, a random latent \(z_t\) is directed through the cross-attention map during inverse diffusion. Specifically, we design a soft stimulus loss that evaluates the positional difference between the entity's attention and the target mask region; its gradient updates the latent to \(z_t^{*}\) as a latent response. Subsequently, another forward pass of the diffusion model denoises \(z_t^{*}\), yielding \(z_{t-1}^{*}\). In the latent fusion phase, we transform the previous \(i\)-th image into a latent code \(z^{bg}_{t-1}\) using DDIM inversion. The blending of \(z^{*}_{t-1}\) with \(z^{bg}_{t-1}\) uses a dynamically evolving mask, which starts as a layout box and gradually shifts to the cross-attention map. Finally, \(z^{*}_{t-1}\) undergoes multiple reverse diffusion steps, producing the \((i+1)\)-th image (a minimal latent-update sketch appears at the end of this document).

--------------------------------------------------------------------------------
/figs/pipeline-8-15.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/babahui/Progressive-Text-to-Image/e2a2fe2c107afac498706396656a36f11a4a7f5b/figs/pipeline-8-15.PNG

--------------------------------------------------------------------------------
/figs/teaser-8-13.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/babahui/Progressive-Text-to-Image/e2a2fe2c107afac498706396656a36f11a4a7f5b/figs/teaser-8-13.PNG
--------------------------------------------------------------------------------
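
As a concrete illustration of the prompt decomposition described in the Progressive Text-to-Image section above, the snippet below shows how an LLM might split a long prompt into ordered short commands. Since the code has not been released, the command schema (the `op`, `entity`, `relation`, and `anchor` fields) is purely a hypothetical sketch, not the paper's actual format.

```python
# Hypothetical illustration of the prompt-decomposition idea. The real command
# schema is not released; every field name here is an assumption.

prompt = ("A cat sleeping on a red sofa, a lamp to the right of the sofa, "
          "and no painting on the wall")

# An LLM could break the prompt into ordered short commands, one entity each:
commands = [
    {"op": "synthesize", "entity": "red sofa"},
    {"op": "synthesize", "entity": "cat", "relation": "on", "anchor": "red sofa"},
    {"op": "edit", "entity": "lamp", "relation": "right of", "anchor": "red sofa"},
    {"op": "erase", "entity": "painting"},
]

# Progressive generation: each command refines the image from the previous step.
for i, cmd in enumerate(commands, start=1):
    print(f"step {i}: {cmd['op']} -> {cmd['entity']}")
```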
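
The latent-direction step from the Pipeline overview can likewise be sketched in PyTorch. This is a minimal sketch under stated assumptions: `get_entity_attn` (returns the target entity's cross-attention map as a differentiable function of the latent), `denoise` (one scheduler reverse step), and the step size `eta` are hypothetical stand-ins, since the official implementation is not yet available.

```python
# Minimal PyTorch sketch of one latent-direction step. The helper callables
# passed in (get_entity_attn, denoise) and the step size eta are assumptions
# for illustration; none of these names come from the released code.
import torch

def soft_stimulus_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Positional difference between an entity's attention and its target mask:
    the loss is small when the normalized attention mass falls inside the mask."""
    attn = attn_map / (attn_map.sum() + 1e-8)
    return 1.0 - (attn * mask).sum()

def directed_denoise_step(z_t, t, get_entity_attn, denoise, mask, eta=0.1):
    """Update z_t against the soft stimulus loss (latent response), then denoise."""
    z_t = z_t.detach().requires_grad_(True)
    loss = soft_stimulus_loss(get_entity_attn(z_t, t), mask)  # entity cross-attention
    grad, = torch.autograd.grad(loss, z_t)
    z_t_star = (z_t - eta * grad).detach()   # directed latent z_t^*
    return denoise(z_t_star, t)              # one reverse step -> z_{t-1}^*

def latent_fusion(z_star, z_bg, blend_mask):
    """Blend the directed latent with the DDIM-inverted background latent
    under a dynamically evolving mask (layout box -> cross-attention)."""
    return blend_mask * z_star + (1.0 - blend_mask) * z_bg
```

In a full generation loop, `latent_fusion` would blend each \(z^{*}_{t-1}\) with the DDIM-inverted background latent \(z^{bg}_{t-1}\), with `blend_mask` interpolating from the layout box toward the thresholded cross-attention map as denoising progresses.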