# Awesome Pixel Diffusion Papers

A curated list of notable papers on pixel-space diffusion models for image and video generation. Papers are sorted by publication year in descending order, focusing on end-to-end pixel diffusion approaches that operate directly in raw pixel space, avoiding latent encodings where possible.

## 2025

- **PixelDiT: Pixel Diffusion Transformers for Image Generation**
  [arXiv:2511.20645](https://arxiv.org/abs/2511.20645)
  A fully transformer-based model with a dual-level design (patch-level and pixel-level DiTs) for end-to-end pixel-space diffusion, achieving strong FID scores on high-resolution images.

- **Back to Basics: Let Denoising Generative Models Denoise**
  [arXiv:2511.13720](https://arxiv.org/abs/2511.13720)
  Proposes "Just image Transformers" (JiT), which directly predict clean images in pixel space, leveraging the manifold assumption for efficient high-dimensional generation without noise prediction.

- **DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation**
  [arXiv:2511.19365](https://arxiv.org/abs/2511.19365)
  Introduces frequency decoupling to enable stable end-to-end pixel diffusion, addressing VAE limitations for high-fidelity synthesis.

- **DiP: Taming Diffusion Models in Pixel Space**
  [arXiv:2511.18822](https://arxiv.org/abs/2511.18822)
  Decouples global structure (via a DiT on large patches) and local details (via a lightweight head) for efficient pixel-space generation, balancing quality and compute.

- **PixNerd: Pixel Neural Field Diffusion**
  [arXiv:2507.23268](https://arxiv.org/abs/2507.23268)
  Combines pixel-space diffusion with neural fields for continuous image representation and generation.
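The JiT entry above hinges on the choice of prediction target: the network outputs the clean image `x0` directly rather than the added noise. A minimal sketch of that parameterization under the standard forward process `x_t = alpha_t * x0 + sigma_t * eps` (the function names and the toy model here are illustrative assumptions, not code from the paper):

```python
import torch

def forward_process(x0, eps, alpha_t, sigma_t):
    """Standard diffusion forward process: x_t = alpha_t * x0 + sigma_t * eps."""
    return alpha_t * x0 + sigma_t * eps

def x_prediction_loss(model, x0, t, alpha_t, sigma_t):
    """x-prediction training step (the target JiT advocates): the network
    estimates the clean image directly and is trained with simple MSE to x0."""
    eps = torch.randn_like(x0)
    x_t = forward_process(x0, eps, alpha_t, sigma_t)
    x0_hat = model(x_t, t)  # toy stand-in for a pixel-space transformer
    return ((x0_hat - x0) ** 2).mean()

def eps_from_x_prediction(x0_hat, x_t, alpha_t, sigma_t):
    """The two targets are algebraically interchangeable: a noise estimate
    can always be recovered as eps_hat = (x_t - alpha_t * x0_hat) / sigma_t."""
    return (x_t - alpha_t * x0_hat) / sigma_t
```

The interchangeability is exact, so the argument in the paper is about conditioning and loss scaling in high dimensions, not about which quantity is representable.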
- **Advancing End-to-End Pixel Space Generative Modeling via Self-Supervised Pre-training**
  [arXiv:2510.12586](https://arxiv.org/abs/2510.12586)
  A two-stage framework with self-supervised distillation to bridge performance gaps in pixel-space diffusion and consistency models.

- **PixelFlow: Pixel-Space Generative Models with Flow**
  [arXiv:2504.07963](https://arxiv.org/abs/2504.07963)
  An end-to-end flow-based framework for direct pixel-space image generation, eliminating VAEs and upsamplers for simplicity and performance.

## 2024

- **Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with Pixel-Space Diffusion**
  [arXiv:2410.19324](https://arxiv.org/abs/2410.19324)
  A scalable recipe for pixel-space diffusion using sigmoid loss weighting and guidance intervals, achieving state-of-the-art results on ImageNet at 512x512 resolution.

- **Novel View Synthesis with Pixel-Space Diffusion Models**
  [arXiv:2411.07765](https://arxiv.org/abs/2411.07765)
  Adapts diffusion architectures for end-to-end novel view synthesis directly in pixel space, outperforming prior methods.

- **Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models**
  [arXiv:2411.07126](https://arxiv.org/abs/2411.07126)
  Cascaded models using Laplacian diffusion for photorealistic, pixel-accurate generation.

- **Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers**
  [arXiv:2401.11605](https://arxiv.org/abs/2401.11605)
  Hourglass transformers that scale linearly with pixel count, enabling high-resolution training.

- **Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis**
  [arXiv:2402.14797](https://arxiv.org/abs/2402.14797)
  Extends pixel diffusion to video with scaled spatiotemporal transformers for text-conditioned synthesis.
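The "sigmoid loss weighting" named in the SiD2 entry weights the denoising loss as a function of the log-SNR, down-weighting nearly-clean (high-SNR) timesteps. A hypothetical sketch following the sigmoid form from Kingma & Gao's diffusion-objective analysis, `w(lambda) = sigmoid(b - lambda)`; the bias value and exact schedule used by SiD2 are assumptions here, not taken from the paper:

```python
import math

def sigmoid_weight(log_snr, bias=-1.0):
    """Sigmoid loss weight w(lambda) = sigmoid(bias - lambda).

    Nearly-clean timesteps (large log-SNR) get weight near 0, noisy ones
    near 1; the bias shifts where the transition happens. The default
    bias here is an illustrative assumption."""
    return 1.0 / (1.0 + math.exp(log_snr - bias))

def weighted_loss(per_step_mse, log_snrs, bias=-1.0):
    """Average per-timestep MSE losses under the sigmoid weighting."""
    weighted = [sigmoid_weight(l, bias) * e for l, e in zip(log_snrs, per_step_mse)]
    return sum(weighted) / len(weighted)
```

The practical effect is that training capacity is spent on the noise levels that matter for pixel-space generation at high resolution, rather than on denoising steps the model already finds easy.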
## 2023 and Earlier

- **Simple Diffusion: End-to-End Diffusion for High Resolution Images**
  [arXiv:2301.11093](https://arxiv.org/abs/2301.11093)
  Simplifies high-resolution diffusion training in pixel space without complex add-ons, achieving state-of-the-art FID on ImageNet.

- **Hierarchical Text-Conditional Image Generation with CLIP Latents** (unCLIP / DALL-E 2)
  [arXiv:2204.06125](https://arxiv.org/abs/2204.06125)
  Two-stage model that generates a CLIP image embedding from text, then decodes it to pixels with a diffusion decoder; influential for later pixel-space text-to-image work.

- **Cascaded Diffusion Models for High Fidelity Image Generation**
  [arXiv:2106.15282](https://arxiv.org/abs/2106.15282)
  Introduces a pipeline of diffusion models generating images at increasing resolutions: a base low-resolution model followed by super-resolution stages, enabling high-fidelity class-conditional ImageNet generation without auxiliary classifiers.

This list covers works found on arXiv up to December 2025. For updates, search arXiv for "pixel diffusion" or "pixel-space diffusion models." Contributions welcome!