# Awesome Pixel Diffusion Papers

A curated list of notable papers on pixel-space diffusion models for image and video generation. Papers are sorted by publication year in descending order, focusing on end-to-end pixel diffusion approaches that operate directly in raw pixel space, avoiding latent encodings where possible.

## 2025

- **PixelDiT: Pixel Diffusion Transformers for Image Generation**
  [arXiv:2511.20645](https://arxiv.org/abs/2511.20645)
  A fully transformer-based model with a dual-level design (patch-level and pixel-level DiTs) for end-to-end pixel-space diffusion, achieving strong FID scores on high-resolution images.

- **Back to Basics: Let Denoising Generative Models Denoise**
  [arXiv:2511.13720](https://arxiv.org/abs/2511.13720)
  Proposes "Just image Transformers" (JiT), which directly predict clean images in pixel space, leveraging the manifold assumption for efficient high-dimensional generation without noise prediction.

- **DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation**
  [arXiv:2511.19365](https://arxiv.org/abs/2511.19365)
  Introduces frequency decoupling to enable stable end-to-end pixel diffusion, addressing VAE limitations for high-fidelity synthesis.

- **DiP: Taming Diffusion Models in Pixel Space**
  [arXiv:2511.18822](https://arxiv.org/abs/2511.18822)
  Decouples global structure (via a DiT on large patches) and local details (via a lightweight head) for efficient pixel-space generation, balancing quality and compute.

- **PixNerd: Pixel Neural Field Diffusion**
  [arXiv:2507.23268](https://arxiv.org/abs/2507.23268)
  Combines pixel-space diffusion with neural fields for continuous image representation and generation.
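The JiT entry above hinges on the choice of prediction target: the network outputs the clean image `x0` directly rather than the added noise. A minimal sketch of that parameterization under the standard forward process `x_t = alpha_t * x0 + sigma_t * eps` (the function names and the toy model here are illustrative assumptions, not code from the paper):

```python
import torch

def forward_process(x0, eps, alpha_t, sigma_t):
    """Standard diffusion forward process: x_t = alpha_t * x0 + sigma_t * eps."""
    return alpha_t * x0 + sigma_t * eps

def x_prediction_loss(model, x0, t, alpha_t, sigma_t):
    """x-prediction training step (the target JiT advocates): the network
    estimates the clean image directly and is trained with simple MSE to x0."""
    eps = torch.randn_like(x0)
    x_t = forward_process(x0, eps, alpha_t, sigma_t)
    x0_hat = model(x_t, t)  # toy stand-in for a pixel-space transformer
    return ((x0_hat - x0) ** 2).mean()

def eps_from_x_prediction(x0_hat, x_t, alpha_t, sigma_t):
    """The two targets are algebraically interchangeable: a noise estimate
    can always be recovered as eps_hat = (x_t - alpha_t * x0_hat) / sigma_t."""
    return (x_t - alpha_t * x0_hat) / sigma_t
```

The interchangeability is exact, so the argument in the paper is about conditioning and loss scaling in high dimensions, not about which quantity is representable.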
- **Advancing End-to-End Pixel Space Generative Modeling via Self-Supervised Pre-training**
  [arXiv:2510.12586](https://arxiv.org/abs/2510.12586)
  A two-stage framework with self-supervised distillation to bridge performance gaps in pixel-space diffusion and consistency models.

- **PixelFlow: Pixel-Space Generative Models with Flow**
  [arXiv:2504.07963](https://arxiv.org/abs/2504.07963)
  An end-to-end flow-based framework for direct pixel-space image generation, eliminating VAEs and upsamplers for simplicity and performance.

## 2024

- **Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with Pixel-Space Diffusion**
  [arXiv:2410.19324](https://arxiv.org/abs/2410.19324)
  A scalable recipe for pixel-space diffusion using sigmoid loss weighting and guidance intervals, achieving state-of-the-art results on ImageNet at 512x512 resolution.

- **Novel View Synthesis with Pixel-Space Diffusion Models**
  [arXiv:2411.07765](https://arxiv.org/abs/2411.07765)
  Adapts diffusion architectures for end-to-end novel view synthesis directly in pixel space, outperforming prior methods.

- **Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models**
  [arXiv:2411.07126](https://arxiv.org/abs/2411.07126)
  Cascaded models using Laplacian diffusion for photorealistic, pixel-accurate generation.

- **Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers**
  [arXiv:2401.11605](https://arxiv.org/abs/2401.11605)
  Hourglass transformers that scale linearly with pixel count, enabling high-resolution training.

- **Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis**
  [arXiv:2402.14797](https://arxiv.org/abs/2402.14797)
  Extends pixel diffusion to video with scaled spatiotemporal transformers for text-conditioned synthesis.
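The "sigmoid loss weighting" named in the SiD2 entry weights the denoising loss as a function of the log-SNR, down-weighting nearly-clean (high-SNR) timesteps. A hypothetical sketch following the sigmoid form from Kingma & Gao's diffusion-objective analysis, `w(lambda) = sigmoid(b - lambda)`; the bias value and exact schedule used by SiD2 are assumptions here, not taken from the paper:

```python
import math

def sigmoid_weight(log_snr, bias=-1.0):
    """Sigmoid loss weight w(lambda) = sigmoid(bias - lambda).

    Nearly-clean timesteps (large log-SNR) get weight near 0, noisy ones
    near 1; the bias shifts where the transition happens. The default
    bias here is an illustrative assumption."""
    return 1.0 / (1.0 + math.exp(log_snr - bias))

def weighted_loss(per_step_mse, log_snrs, bias=-1.0):
    """Average per-timestep MSE losses under the sigmoid weighting."""
    weighted = [sigmoid_weight(l, bias) * e for l, e in zip(log_snrs, per_step_mse)]
    return sum(weighted) / len(weighted)
```

The practical effect is that training capacity is spent on the noise levels that matter for pixel-space generation at high resolution, rather than on denoising steps the model already finds easy.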
## 2023 and Earlier

- **Simple Diffusion: End-to-End Diffusion for High Resolution Images**
  [arXiv:2301.11093](https://arxiv.org/abs/2301.11093)
  Simplifies high-resolution diffusion training in pixel space without complex add-ons, achieving state-of-the-art FID on ImageNet.

- **Hierarchical Text-Conditional Image Generation with CLIP Latents** (unCLIP / DALL-E 2)
  [arXiv:2204.06125](https://arxiv.org/abs/2204.06125)
  Two-stage model that generates a CLIP image embedding from text, then decodes it to pixels with a diffusion decoder; influential for later pixel-space text-to-image work.

- **Cascaded Diffusion Models for High Fidelity Image Generation**
  [arXiv:2106.15282](https://arxiv.org/abs/2106.15282)
  Introduces a pipeline of diffusion models generating images at increasing resolutions: a base low-resolution model followed by super-resolution stages, enabling high-fidelity class-conditional ImageNet generation without auxiliary classifiers.

This list covers works found on arXiv up to December 2025. For updates, search arXiv for "pixel diffusion" or "pixel-space diffusion models." Contributions welcome!