├── .gitattributes
├── README.md
└── figs
    ├── pipeline-8-15.PNG
    └── teaser-8-13.PNG

--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![Python >=3.5](https://img.shields.io/badge/Python->=3.5-yellow.svg)
![PyTorch >=1.6](https://img.shields.io/badge/PyTorch->=1.6-blue.svg)

# Progressive Text-to-Image Diffusion with Soft Latent Direction

The *official* repository for Progressive Text-to-Image Diffusion with Soft Latent Direction.

## News
- 2023.09 Code will be released soon.

## Progressive Text-to-Image

![framework](figs/teaser-8-13.PNG)

Existing text-to-image synthesis approaches struggle with textual prompts involving multiple entities and specified relational directions. We propose to decompose the long prompt into a set of short commands, including synthesis, editing, and erasing operations, using a Large Language Model (LLM), and to generate the image progressively. Our strategy enhances both controllability and fidelity, and allows interactive modification through user intervention at each generation step (a hypothetical decomposition example appears at the end of this document).

## Pipeline

![framework](figs/pipeline-8-15.PNG)

Overview of our unified framework for progressive synthesis, editing, and erasing. In each progressive step, a random latent \(z_t\) is directed through the cross-attention map during inverse diffusion. Specifically, we design a soft stimulus loss that evaluates the positional difference between the entity's attention and the target mask region; its gradient updates the latent to \(z_t^{*}\) as a latent response. Subsequently, another forward pass of the diffusion model denoises \(z_t^{*}\), yielding \(z_{t-1}^{*}\). In the latent fusion phase, we transform the previous \(i\)-th image into a latent code \(z^{bg}_{t-1}\) using DDIM inversion. The blending of \(z^{*}_{t-1}\) with \(z^{bg}_{t-1}\) uses a dynamically evolving mask, which starts as a layout box and gradually shifts to the cross-attention map. Finally, \(z^{*}_{t-1}\) undergoes multiple reverse diffusion steps, producing the \((i+1)\)-th image (a minimal latent-update sketch appears at the end of this document).

--------------------------------------------------------------------------------
/figs/pipeline-8-15.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/babahui/Progressive-Text-to-Image/e2a2fe2c107afac498706396656a36f11a4a7f5b/figs/pipeline-8-15.PNG

--------------------------------------------------------------------------------
/figs/teaser-8-13.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/babahui/Progressive-Text-to-Image/e2a2fe2c107afac498706396656a36f11a4a7f5b/figs/teaser-8-13.PNG
--------------------------------------------------------------------------------
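
As a concrete illustration of the prompt decomposition described in the Progressive Text-to-Image section above, the snippet below shows how an LLM might split a long prompt into ordered short commands. Since the code has not been released, the command schema (the `op`, `entity`, `relation`, and `anchor` fields) is purely a hypothetical sketch, not the paper's actual format.

```python
# Hypothetical illustration of the prompt-decomposition idea. The real command
# schema is not released; every field name here is an assumption.

prompt = ("A cat sleeping on a red sofa, a lamp to the right of the sofa, "
          "and no painting on the wall")

# An LLM could break the prompt into ordered short commands, one entity each:
commands = [
    {"op": "synthesize", "entity": "red sofa"},
    {"op": "synthesize", "entity": "cat", "relation": "on", "anchor": "red sofa"},
    {"op": "edit", "entity": "lamp", "relation": "right of", "anchor": "red sofa"},
    {"op": "erase", "entity": "painting"},
]

# Progressive generation: each command refines the image from the previous step.
for i, cmd in enumerate(commands, start=1):
    print(f"step {i}: {cmd['op']} -> {cmd['entity']}")
```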
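
The latent-direction step from the Pipeline overview can likewise be sketched in PyTorch. This is a minimal sketch under stated assumptions: `get_entity_attn` (returns the target entity's cross-attention map as a differentiable function of the latent), `denoise` (one scheduler reverse step), and the step size `eta` are hypothetical stand-ins, since the official implementation is not yet available.

```python
# Minimal PyTorch sketch of one latent-direction step. The helper callables
# passed in (get_entity_attn, denoise) and the step size eta are assumptions
# for illustration; none of these names come from the released code.
import torch

def soft_stimulus_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Positional difference between an entity's attention and its target mask:
    the loss is small when the normalized attention mass falls inside the mask."""
    attn = attn_map / (attn_map.sum() + 1e-8)
    return 1.0 - (attn * mask).sum()

def directed_denoise_step(z_t, t, get_entity_attn, denoise, mask, eta=0.1):
    """Update z_t against the soft stimulus loss (latent response), then denoise."""
    z_t = z_t.detach().requires_grad_(True)
    loss = soft_stimulus_loss(get_entity_attn(z_t, t), mask)  # entity cross-attention
    grad, = torch.autograd.grad(loss, z_t)
    z_t_star = (z_t - eta * grad).detach()   # directed latent z_t^*
    return denoise(z_t_star, t)              # one reverse step -> z_{t-1}^*

def latent_fusion(z_star, z_bg, blend_mask):
    """Blend the directed latent with the DDIM-inverted background latent
    under a dynamically evolving mask (layout box -> cross-attention)."""
    return blend_mask * z_star + (1.0 - blend_mask) * z_bg
```

In a full generation loop, `latent_fusion` would blend each \(z^{*}_{t-1}\) with the DDIM-inverted background latent \(z^{bg}_{t-1}\), with `blend_mask` interpolating from the layout box toward the thresholded cross-attention map as denoising progresses.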