├── LICENSE.md └── README.md /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Mike Brave 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | - [Mikes-StableDiffusionNotes](#mikes-stablediffusionnotes) 2 | - [What is Stable Diffusion](#what-is-stable-diffusion) 3 | - [Origins and Research of Stable Diffusion](#origins-and-research-of-stable-diffusion) 4 | - [Initial Training Data](#initial-training-data) 5 | - [Core Technologies](#core-technologies) 6 | - [Tech That Stable Diffusion is Built On \& Technical Terms](#tech-that-stable-diffusion-is-built-on--technical-terms) 7 | - [Similar Technology / Top Competitors](#similar-technology--top-competitors) 8 | - [DALL-E2:](#dall-e2) 9 | - [Google's Imagen:](#googles-imagen) 10 | - [Midjourney:](#midjourney) 11 | - [Stable Diffusion Powered Websites and Communities](#stable-diffusion-powered-websites-and-communities) 12 | - [DreamStudio (Official by StabilityAI):](#dreamstudio-official-by-stabilityai) 13 | - [PlaygroundAI:](#playgroundai) 14 | - [LeonardoAI:](#leonardoai) 15 | - [NightCafe:](#nightcafe) 16 | - [BlueWillow:](#bluewillow) 17 | - [DreamUp By DeviantArt:](#dreamup-by-deviantart) 18 | - [Lexica:](#lexica) 19 | - [Dreamlike Art:](#dreamlike-art) 20 | - [Art Breeder Collage Tool:](#art-breeder-collage-tool) 21 | - [Dream by Wombo:](#dream-by-wombo) 22 | - [Draw Things](#draw-things) 23 | - [Krea AI](#krea-ai) 24 | - [Community Chatrooms and Gathering Locations](#community-chatrooms-and-gathering-locations) 25 | - [Prompt Inspiration Communities \& Tools](#prompt-inspiration-communities--tools) 26 | - [Use Cases of Stable Diffusion](#use-cases-of-stable-diffusion) 27 | - [Core Functionality \& Use Cases](#core-functionality--use-cases) 28 | - [Image Generation](#image-generation) 29 | - [Upscaling Images](#upscaling-images) 30 | - [Editing Images](#editing-images) 31 | - [Style Transfer](#style-transfer) 32 | - [Photo Repair/Touchups](#photo-repairtouchups) 33 | - [Color/Texture Filling](#colortexture-filling) 34 | - [Image Completion/Polishing](#image-completionpolishing) 35 | - [Image Variation](#image-variation) 36 | - [Outpainting](#outpainting) 37 | - [Character Design](#character-design) 38 | - [Video Game Asset 
Creation](#video-game-asset-creation) 39 | - [Architecture and Interior Design](#architecture-and-interior-design) 40 | - [Use Cases Other Than Image Generation](#use-cases-other-than-image-generation) 41 | - [Video \& Animation](#video--animation) 42 | - [Deforum Animation](#deforum-animation) 43 | - [Depth Module for Stable Diffusion](#depth-module-for-stable-diffusion) 44 | - [Gen1](#gen1) 45 | - [3D Generation Techniques for Stable Diffusion \& Related Diffusion Based 3D Generation](#3d-generation-techniques-for-stable-diffusion--related-diffusion-based-3d-generation) 46 | - [Text to 3D](#text-to-3d) 47 | - [DMT Meshes / Point Cloud Based](#dmt-meshes--point-cloud-based) 48 | - [3D radiance Fields](#3d-radiance-fields) 49 | - [Novel View Synthesis](#novel-view-synthesis) 50 | - [NeRF Based:](#nerf-based) 51 | - [Img to Fspy to Blender:](#img-to-fspy-to-blender) 52 | - [Image to Shapes](#image-to-shapes) 53 | - [3D Texturing Techniques for Stable Diffusion](#3d-texturing-techniques-for-stable-diffusion) 54 | - [Using Stable Diffusion for 3D Texturing:](#using-stable-diffusion-for-3d-texturing) 55 | - [Dream Textures:](#dream-textures) 56 | - [Music](#music) 57 | - [Riffusion](#riffusion) 58 | - [Image-Based Mind Reading](#image-based-mind-reading) 59 | - [Synthetic Data Creation](#synthetic-data-creation) 60 | - [How Stable Diffusion Works](#how-stable-diffusion-works) 61 | - [Hardware Requirements and Cloud-Based Solutions](#hardware-requirements-and-cloud-based-solutions) 62 | - [Methods of Compute](#methods-of-compute) 63 | - [Xformers for Stable Diffusion](#xformers-for-stable-diffusion) 64 | - [Beginner's How To](#beginners-how-to) 65 | - [Basics, Settings and Operations](#basics-settings-and-operations) 66 | - [Popular UIs](#popular-uis) 67 | - [Automatic 1111](#automatic-1111) 68 | - [Automatic 1111 Extensions](#automatic-1111-extensions) 69 | - [Ultimate Upscale:](#ultimate-upscale) 70 | - [Config Presets:](#config-presets) 71 | - [Image Browser:](#image-browser) 72 | - [Prompt Tag Autocomplete:](#prompt-tag-autocomplete) 73 | - [Txt2Mask:](#txt2mask) 74 | - [Ultimate HD Upscaler:](#ultimate-hd-upscaler) 75 | - [Aesthetic Scorer:](#aesthetic-scorer) 76 | - [Tagger:](#tagger) 77 | - [Inspiration Images:](#inspiration-images) 78 | - [Depth Map Library and Poser:](#depth-map-library-and-poser) 79 | - [OpenPose Editor:](#openpose-editor) 80 | - [Shift Attention Script](#shift-attention-script) 81 | - [prompt interpolation](#prompt-interpolation) 82 | - [Text2Palette](#text2palette) 83 | - [Multiple Hypernetworks](#multiple-hypernetworks) 84 | - [Img2Tiles \& Img2Mosaic](#img2tiles--img2mosaic) 85 | - [Depthmap \& Stereo Image](#depthmap--stereo-image) 86 | - [Layers Editing, Blending](#layers-editing-blending) 87 | - [Model Toolkit](#model-toolkit) 88 | - [Prompt Test](#prompt-test) 89 | - [Booru Tag Autocomplete](#booru-tag-autocomplete) 90 | - [Alpha Canvas](#alpha-canvas) 91 | - [Unofficial PEZ - hard prompts made easy](#unofficial-pez---hard-prompts-made-easy) 92 | - [Two Shot](#two-shot) 93 | - [Composable Lora](#composable-lora) 94 | - [Couple Helper - lets you choose where to apply prompts on a grid](#couple-helper---lets-you-choose-where-to-apply-prompts-on-a-grid) 95 | - [Latent Couple Extension](#latent-couple-extension) 96 | - [Remove Background](#remove-background) 97 | - [Models for Background Removal](#models-for-background-removal) 98 | - [Anime Background Remover](#anime-background-remover) 99 | - [Kohya](#kohya) 100 | - [Addons](#addons) 101 | - [EasyDiffusion 
(Formerly Stable Diffusion UI)](#easydiffusion-formerly-stable-diffusion-ui) 102 | - [InvokeAI](#invokeai) 103 | - [DiffusionBee (Mac OS)](#diffusionbee-mac-os) 104 | - [NKMD GUI](#nkmd-gui) 105 | - [ComfyUi](#comfyui) 106 | - [AINodes](#ainodes) 107 | - [Model Training and Other Training UIs](#model-training-and-other-training-uis) 108 | - [Other Sofware Addons that Act like a UI](#other-sofware-addons-that-act-like-a-ui) 109 | - [Resources \& Useful Links](#resources--useful-links) 110 | - [Helpful Tools](#helpful-tools) 111 | - [Tool Directories and Explanations](#tool-directories-and-explanations) 112 | - [Where to Get Models Made By Community](#where-to-get-models-made-by-community) 113 | - [Notes About Models](#notes-about-models) 114 | - [Model Safety Measures](#model-safety-measures) 115 | - [Generating Images \& Methods of Image Generation](#generating-images--methods-of-image-generation) 116 | - [Text2Image](#text2image) 117 | - [Notes on Resolution](#notes-on-resolution) 118 | - [Prompt Editing](#prompt-editing) 119 | - [Negative Prompts](#negative-prompts) 120 | - [Alternating Words](#alternating-words) 121 | - [Prompt Delay](#prompt-delay) 122 | - [Prompt Weighting](#prompt-weighting) 123 | - [Ui specific Syntax](#ui-specific-syntax) 124 | - [Exploring](#exploring) 125 | - [Randomness](#randomness) 126 | - [Random Words](#random-words) 127 | - [Wildcards](#wildcards) 128 | - [Brute Force](#brute-force) 129 | - [Prompt Matrix](#prompt-matrix) 130 | - [XY Grid](#xy-grid) 131 | - [One Parameter](#one-parameter) 132 | - [Editing Composition](#editing-composition) 133 | - [Image2Image](#image2image) 134 | - [Img2Img](#img2img) 135 | - [Inpainting](#inpainting) 136 | - [Outpainting](#outpainting-1) 137 | - [Loopback](#loopback) 138 | - [InstructPix2Pix](#instructpix2pix) 139 | - [Depth2Image](#depth2image) 140 | - [Depth Map](#depth-map) 141 | - [Depth Preserving Img2Img](#depth-preserving-img2img) 142 | - [ControlNet](#controlnet) 143 | - [Pix2Pix-zero](#pix2pix-zero) 144 | - [Seed Resize](#seed-resize) 145 | - [Variations](#variations) 146 | - [Finishing](#finishing) 147 | - [Upscaling](#upscaling) 148 | - [BSRGAN](#bsrgan) 149 | - [ESRGAN](#esrgan) 150 | - [4x RealESRGAN](#4x-realesrgan) 151 | - [Lollypop](#lollypop) 152 | - [Universal Upscaler](#universal-upscaler) 153 | - [Ultrasharp](#ultrasharp) 154 | - [Uniscale](#uniscale) 155 | - [NMKD Superscale](#nmkd-superscale) 156 | - [Remacri by Foolhardy](#remacri-by-foolhardy) 157 | - [SD Upscale](#sd-upscale) 158 | - [SD 2.0 4xUpscaler](#sd-20-4xupscaler) 159 | - [Restoring](#restoring) 160 | - [Face Restoration](#face-restoration) 161 | - [GFPGAN](#gfpgan) 162 | - [Code Former](#code-former) 163 | - [Models ETC](#models-etc) 164 | - [Base Models for Stable Diffusion](#base-models-for-stable-diffusion) 165 | - [Stable Diffusion Models 1.4 and 1.5](#stable-diffusion-models-14-and-15) 166 | - [Stable Diffusion Models 2.0 and 2.1](#stable-diffusion-models-20-and-21) 167 | - [512-Depth Model for Image-to-Image Translation](#512-depth-model-for-image-to-image-translation) 168 | - [Community Models](#community-models) 169 | - [Fine Tuned](#fine-tuned) 170 | - [Merged/Merges](#mergedmerges) 171 | - [Tutorial for Add Difference Method](#tutorial-for-add-difference-method) 172 | - [Megamerged/MegaMerges](#megamergedmegamerges) 173 | - [Embeddings](#embeddings) 174 | - [Community Forks](#community-forks) 175 | - [VAE (Variational Autoencoder) in Stable Diffusion](#vae-variational-autoencoder-in-stable-diffusion) 176 | - [Original 
Autoencoder in Stable Diffusion](#original-autoencoder-in-stable-diffusion) 177 | - [EMA VAE in Stable Diffusion](#ema-vae-in-stable-diffusion) 178 | - [MSE VAE in Stable Diffusion](#mse-vae-in-stable-diffusion) 179 | - [Samplers](#samplers) 180 | - [Ancestral Samplers](#ancestral-samplers) 181 | - [DPM++ 2S A Karras](#dpm-2s-a-karras) 182 | - [DPM++ A](#dpm-a) 183 | - [Euler A](#euler-a) 184 | - [DPM Fast](#dpm-fast) 185 | - [DPM Adaptive](#dpm-adaptive) 186 | - [DPM++](#dpm) 187 | - [DPM++ SDE](#dpm-sde) 188 | - [DPM++ 2M](#dpm-2m) 189 | - [Common Samplers / Equilibrium Samplers](#common-samplers--equilibrium-samplers) 190 | - [k\_LMS](#k_lms) 191 | - [DDIM](#ddim) 192 | - [k\_euler\_a and Heun](#k_euler_a-and-heun) 193 | - [k\_dpm\_2\_a](#k_dpm_2_a) 194 | - [Methods of Training Models and Creating Embeddings](#methods-of-training-models-and-creating-embeddings) 195 | - [Dataset and Image Preparation](#dataset-and-image-preparation) 196 | - [Choosing Images](#choosing-images) 197 | - [Tip for training faces and characters](#tip-for-training-faces-and-characters) 198 | - [Captioning](#captioning) 199 | - [Regularization/Classifier Images](#regularizationclassifier-images) 200 | - [Links to Some Regularization Images](#links-to-some-regularization-images) 201 | - [Training Tutorials](#training-tutorials) 202 | - [Types of Training](#types-of-training) 203 | - [File Type Overview](#file-type-overview) 204 | - [CKPT/Diffuser/Safetensor](#ckptdiffusersafetensor) 205 | - [Textual Inversion](#textual-inversion) 206 | - [Negative Embedding](#negative-embedding) 207 | - [LORA](#lora) 208 | - [LoHa](#loha) 209 | - [Hypernetworks](#hypernetworks) 210 | - [Aescetic Gradients](#aescetic-gradients) 211 | - [Fine Tuning / Checkpoints/Diffusers/Safetensors](#fine-tuning--checkpointsdiffuserssafetensors) 212 | - [Token Based](#token-based) 213 | - [Dreambooth](#dreambooth) 214 | - [Custom Diffusion by Adobe](#custom-diffusion-by-adobe) 215 | - [Caption Based Fine Tuning](#caption-based-fine-tuning) 216 | - [Fine Tuning](#fine-tuning) 217 | - [EveryDream 2](#everydream-2) 218 | - [Stable Tuner](#stable-tuner) 219 | - [Dream Artist Auto1111 Extension](#dream-artist-auto1111-extension) 220 | - [Decoding Checkpoints](#decoding-checkpoints) 221 | - [Mixing](#mixing) 222 | - [Using Multiple types of models and embeddings](#using-multiple-types-of-models-and-embeddings) 223 | - [Multiple Embeddings](#multiple-embeddings) 224 | - [Multiple Hypernetworks](#multiple-hypernetworks-1) 225 | - [Multiple LORA's](#multiple-loras) 226 | - [Merging](#merging) 227 | - [Merging Checkpoints](#merging-checkpoints) 228 | - [Converting Checkpoints/Diffusers/LORAs](#converting-checkpointsdiffusersloras) 229 | - [Image2Text](#image2text) 230 | - [CLIP Interrogation](#clip-interrogation) 231 | - [BLIP Captioning](#blip-captioning) 232 | - [DanBooru Tags / Deepdanbooru](#danbooru-tags--deepdanbooru) 233 | - [Waifu Diffusion 1.4 tagger - Using DeepDanBooru Tags](#waifu-diffusion-14-tagger---using-deepdanbooru-tags) 234 | - [Pruning Models](#pruning-models) 235 | - [One Shot Learning \& Similar](#one-shot-learning--similar) 236 | - [DreamArtist (WebUI Extension)](#dreamartist-webui-extension) 237 | - [Universal Guided Diffusion](#universal-guided-diffusion) 238 | - [Other Software Addons](#other-software-addons) 239 | - [Blender Addons](#blender-addons) 240 | - [Blender ControlNet](#blender-controlnet) 241 | - [Makes Textures / Vision](#makes-textures--vision) 242 | - [OpenPose](#openpose) 243 | - [OpenPose 
Editor](#openpose-editor-1) 244 | - [Dream Textures](#dream-textures-1) 245 | - [AI Render](#ai-render) 246 | - [Stability AI's official Blender](#stability-ais-official-blender) 247 | - [CEB Stable Diffusion (Paid)](#ceb-stable-diffusion-paid) 248 | - [Cozy Auto Texture](#cozy-auto-texture) 249 | - [Blender Rigs/Bones](#blender-rigsbones) 250 | - [ImpactFrames' OpenPose Rig](#impactframes-openpose-rig) 251 | - [ToyXYZ's Character bones that look like Openpose for blender](#toyxyzs-character-bones-that-look-like-openpose-for-blender) 252 | - [3D posable Mannequin Doll](#3d-posable-mannequin-doll) 253 | - [Riggify model](#riggify-model) 254 | - [Maya](#maya) 255 | - [ControlNet Maya Rig](#controlnet-maya-rig) 256 | - [Photoshop](#photoshop) 257 | - [Stable.Art](#stableart) 258 | - [Auto Photoshop Plugin](#auto-photoshop-plugin) 259 | - [Daz](#daz) 260 | - [Daz Control Rig](#daz-control-rig) 261 | - [Cinema4D](#cinema4d) 262 | - [Colors Scene (possibly no longer needed since controlNet Update)](#colors-scene-possibly-no-longer-needed-since-controlnet-update) 263 | - [Unity](#unity) 264 | - [Stable Diffusion Unity Integration](#stable-diffusion-unity-integration) 265 | - [Related Technologies, Communities and Tools, not necessarily Stable Diffusion, but Adjacent](#related-technologies-communities-and-tools-not-necessarily-stable-diffusion-but-adjacent) 266 | - [Techniques \& Possibilities](#techniques--possibilities) 267 | - [Seed and prompt blending](#seed-and-prompt-blending) 268 | - [Loopback Superimpose](#loopback-superimpose) 269 | - [txt2img2img](#txt2img2img) 270 | - [Seed Traveling](#seed-traveling) 271 | - [Alternate Noise Samplers](#alternate-noise-samplers) 272 | - [Clip Skip \& Alternating](#clip-skip--alternating) 273 | - [Multi Control Net and blender for perfect Hands](#multi-control-net-and-blender-for-perfect-hands) 274 | - [Blender to Depth Map](#blender-to-depth-map) 275 | - [Blender to depth map for concept art](#blender-to-depth-map-for-concept-art) 276 | - [depth map for terrain and map generation?](#depth-map-for-terrain-and-map-generation) 277 | - [Detextify - removes pseudo text from generations](#detextify---removes-pseudo-text-from-generations) 278 | - [Blender as Camera Rig](#blender-as-camera-rig) 279 | - [SD depthmap to blender for stretched single viewpoint depth perception model](#sd-depthmap-to-blender-for-stretched-single-viewpoint-depth-perception-model) 280 | - [Daz3D for posing](#daz3d-for-posing) 281 | - [Mixamo for Posing](#mixamo-for-posing) 282 | - [Figure Drawing Poses as Reference Poses](#figure-drawing-poses-as-reference-poses) 283 | - [Generating Images to turn into 3D sculpting brushes](#generating-images-to-turn-into-3d-sculpting-brushes) 284 | - [Stable Diffusion to Blender to create particles using automesh plugin](#stable-diffusion-to-blender-to-create-particles-using-automesh-plugin) 285 | - [Not Stable Diffusion But Relevant Techniques](#not-stable-diffusion-but-relevant-techniques) 286 | - [Other Resources](#other-resources) 287 | - [API's](#apis) 288 | 289 | # Mikes-StableDiffusionNotes 290 | Notes on Stable Diffusion: An attempt at a comprehensive list 291 | 292 | The following is a list of stable diffusion tools and resources compiled from personal research and understanding, with a focus on what is possible to do with this technology while also cataloging resources and useful links along with explanations. Please note that an item or link listed here is not a recommendation unless stated otherwise. 
Feedback, suggestions and corrections are welcomed and can be submitted through a pull request or by contacting me on Reddit (https://www.reddit.com/user/mikebrave) or Discord (MikeBrave#6085). 293 | 294 | 295 | 296 | 297 | ## What is Stable Diffusion 298 | 299 | Stable Diffusion is an open-source machine learning model that can generate images from text, modify images based on text or enhance low-resolution or low-detail images. It has been trained on billions of images and can produce results that are on par with those generated by DALL-E 2 and MidJourney. 300 | 301 | Stable Diffusion (SD) is a deep-learning, text-to-image model that was released in 2022. Its primary function is to generate detailed images based on text descriptions. The model uses a combination of random static generation, noise, and pattern recognition through neural nets that are trained on keyword pairs. These pairs correspond to patterns found in a given training image that match a particular keyword. 302 | 303 | To generate an image, the user inputs a text description, and the SD model references the keyword pairs associated with the words in the description. The model then produces a shape that corresponds to the patterns identified in the image. Over several passes, the image becomes clearer and eventually results in a final image that matches the text prompt. 304 | 305 | Stable Diffusion is a latent diffusion model, which is a type of deep generative neural network. It was developed by the CompVis group at LMU Munich in collaboration with Stability AI, Runway, EleutherAI, and LAION. In October 2022, Stability AI raised US$101 million in a round led by Lightspeed Venture Partners and Coatue Management. 306 | 307 | Stable Diffusion's code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. This marks a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services. 308 | 309 | To better understand Stable Diffusion and how it works, there are several visual guides available. Jalammar's blog (https://jalammar.github.io/illustrated-stable-diffusion/) provides an illustrated guide to the model, while the Stable Diffusion Art website (https://stable-diffusion-art.com/how-stable-diffusion-work/) offers a step-by-step breakdown of the process. 310 | 311 | In addition, a Colab notebook (https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wFCHhGLFooW_pf1?usp=sharing) is available to allow users to experiment with and gain a deeper understanding of the Stable Diffusion model. 312 | 313 | Wikiepedia: https://en.wikipedia.org/wiki/Stable_Diffusion 314 | source code: https://github.com/justinpinkney/stable-diffusion 315 | Homepage: https://stability.ai/ 316 | 317 | 318 | 319 | ### Origins and Research of Stable Diffusion 320 | 321 | Stable Diffusion (SD) is a deep-learning, text-to-image model that was released in 2022. It was developed by the CompVis group at LMU Munich in collaboration with Stability AI, Runway, EleutherAI, and LAION. The model was created through extensive research into deep generative neural networks and the diffusion process. 322 | 323 | In the original announcement (https://stability.ai/blog/stable-diffusion-announcement), the creators of SD outlined the model's key features and capabilities. 
These include the ability to generate high-quality images based on text descriptions, as well as the flexibility to be applied to other tasks such as inpainting and image-to-image translation. 324 | 325 | Stable Diffusion is a latent diffusion model, which is a type of deep generative neural network that uses a process of random noise generation and diffusion to create images. The model is trained on large datasets of images and text descriptions to learn the relationships between the two. This training process involves extensive experimentation and optimization to ensure that the model can accurately generate images based on text prompts. 326 | 327 | The source code for Stable Diffusion is publicly available on GitHub (https://github.com/CompVis/stable-diffusion). This allows researchers and developers to experiment with the model, contribute to its development, and use it for their own projects. 328 | 329 | Stability AI, the primary sponsor of Stable Diffusion, raised US$101 million in October 2022 to support further research and development of the model. The success of the model has highlighted the potential of deep learning and generative neural networks in the field of computer vision and image generation. 330 | 331 | https://research.runwayml.com/the-research-origins-of-stable-difussion 332 | 333 | #### Initial Training Data 334 | LAION-5B - 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution 335 | Laion-Aesthetics v2 5+ 336 | 337 | #### Core Technologies 338 | 339 | Variational Autoencoder (VAE) 340 | - The simplest explanation is that it makes an image small then makes it bigger again. 341 | - A Variational Autoencoder (VAE) is an artificial neural network architecture that belongs to the families of probabilistic graphical models and variational Bayesian methods. It is a type of neural network that learns to reproduce its input, and also map data to latent space. VAEs use probability modeling in a neural network system to provide the kinds of equilibrium that autoencoders are typically used to produce. The neural network components are typically referred to as the encoder and decoder for the first and second component respectively. VAE's are part of the neural network model that encodes and decodes the images to and from the smaller latent space, so that computation can be faster. Any models you use, be it v1, v2 or custom, already comes with a default VAE 342 | - See also [VAE (Variational Autoencoder) in Stable Diffusion](#vae-variational-autoencoder-in-stable-diffusion) 343 | 344 | 345 | U-Net 346 | - U-Net is used in Stable Diffusion to reduce the noise (denoises) in the image using the text prompt as a conditional. The U-Net model is used in the diffusion process to generate images. The network is based on the fully convolutional network and its architecture was modified and extended to work with fewer training images and to yield more precise segmentations. 347 | - In the case of image segmentation, the goal is to classify each pixel of an image into a specific class. For example, in medical imaging, the goal is to classify each pixel of an image into a specific organ or tissue type. 
U-Net is used to perform image segmentation by taking an image as input and outputting a segmentation map that classifies each pixel of the input image into a specific class 348 | - U-Net is designed to work with fewer training images by using data augmentation to use the available annotated samples more efficiently 349 | - The architecture of U-Net is also designed to yield more precise segmentations by using a contracting path to capture context and a symmetric expanding path that enables precise localization 350 | 351 | 352 | Text Encoder 353 | - Stable Diffusion is a latent diffusion model conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder1. The text encoder is used to turn your prompt into a latent vector 354 | - In the context of machine learning, a latent vector is a vector that represents a learned feature or representation of a data point that is not directly observable. For example, in the case of Stable Diffusion, the text encoder is used to turn your prompt into a latent vector that represents a learned feature or representation of the prompt that is not directly observable. 355 | 356 | #### Tech That Stable Diffusion is Built On & Technical Terms 357 | Transformers 358 | - A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV) 359 | - Transformers are neural networks that learn context and understanding through sequential data analysis. The Transformer models use a modern and evolving mathematical techniques set, generally known as attention or self-attention. This set helps identify how distant data elements influence and depend on one another 360 | 361 | 362 | LLM 363 | - LLM stands for Large Language Model. Large language models are a type of neural network that can generate human-like text by predicting the probability of the next word in a sequence of words. a good example of this would be ChatGPT 364 | 365 | 366 | VQGAN 367 | - VQGAN is short for Vector Quantized Generative Adversarial Network and is utilized for high-resolution images; and is a type of neural network architecture that combines convolutional neural networks with Transformers. VQGAN employs the same two-stage structure by learning an intermediary representation before feeding it to a transformer. However, instead of downsampling the image, VQGAN uses a codebook to represent visual parts. 368 | - https://compvis.github.io/taming-transformers/ 369 | 370 | 371 | Diffusion Models 372 | - a simple explanation is that it uses noising and denoising to learn how to reconstruct images. 373 | - Diffusion models are a class of generative models used in machine learning to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space1. They are Markov chains trained using variational inference1. The goal of diffusion models is to generate data similar to the data on which they are trained by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process2. 
374 | - Diffusion models have emerged as a powerful new family of deep generative models with record-breaking performance in many applications, including image synthesis, video generation, and molecule design 375 | 376 | 377 | Latent Diffusion Models 378 | - Latent diffusion models are machine learning models designed to learn the underlying structure of a dataset by mapping it to a lower-dimensional latent space. This latent space is a representation of the data in which the relationships between different data points are more easily understood and analyzed. Latent diffusion models use an auto-encoder to map between image space and latent space. The diffusion model works on the latent space, which makes it a lot easier to train. According to the original paper, latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs 379 | 380 | 381 | CLIP 382 | - CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similar to the zero-shot capabilities of GPT-2 and GPT-3. CLIP is much more efficient than comparable approaches and achieves the same accuracy roughly 10x faster. Because they learn a wide range of visual concepts directly from natural language, CLIP models are significantly more flexible and general than existing ImageNet models 383 | - https://research.runwayml.com/ 384 | 385 | 386 | Gaussian Noise 387 | - the simplest way to explain it is random static that gets used a lot for things we want randomness for. 388 | - Gaussian noise is a term from signal processing theory denoting a kind of signal noise that has a probability density function (pdf) equal to that of the normal distribution (which is also known as the Gaussian distribution). In image processing, this noise is generated by adding a random Gaussian function to the image function. A Gaussian filter is a related tool for de-noising, smoothing and blurring 389 | 390 | 391 | Denoising Autoencoders 392 | - A Denoising Autoencoder (DAE) is a type of autoencoder, which is a type of neural network used for unsupervised learning. The DAE is used to remove noise from data, making it better for analysis. The DAE works by taking a noisy input signal and encoding it into a smaller representation, removing the noise. The smaller representation is then decoded back into the original input signal. Denoising autoencoders are a stochastic version of standard autoencoders that reduces the risk of learning the identity function. Specifically, if the autoencoder is too big, then it can just learn the data, so the output equals the input, and does not perform any useful representation learning or dimensionality reduction 393 | 394 | 395 | ResNet 396 | - ResNet, short for Residual Network, is a specific type of neural network that was introduced in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun in their paper “Deep Residual Learning for Image Recognition”. 
The ResNet models were extremely successful, as you can guess from the following: ResNet won the ImageNet and COCO 2015 competitions, and its variants were the foundations of the first places in all five main tracks of the ImageNet and COCO 2016 competitions. A Residual Neural Network (ResNet) is an Artificial Neural Network (ANN) that stacks residual blocks on top of each other to form a network. This design makes it practical to train networks that are hundreds or even thousands of layers deep 397 | 398 | 399 | Latent Space 400 | - Latent space, also known as a latent feature space or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another in the latent space. Position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances between the items. Described in one sentence, latent space is simply a representation of compressed data. Latent space is a concept in machine learning and deep learning that refers to the space of latent variables that are learned by a model 401 | 402 | 403 | Watermark Detection 404 | - The creators of LAION-5B trained a watermark detection model and used it to calculate confidence scores for every image in LAION-5B 405 | - https://github.com/LAION-AI/LAION-5B-WatermarkDetection 406 | 407 | 408 | 409 | 410 | ### Similar Technology / Top Competitors 411 | 412 | Stable Diffusion (SD) is a cutting-edge text-to-image generation model that has been receiving significant attention since its release in 2022. However, there are other similar technologies and programs that have also been developed for this purpose. Some of the most notable ones are: 413 | 414 | #### DALL-E2: 415 | This is a text-to-image model developed by OpenAI that is similar to Stable Diffusion in its approach. It uses a CLIP-based prior together with a diffusion decoder to generate high-quality images based on text prompts. 416 | https://openai.com/product/dall-e-2 417 | 418 | #### Google's Imagen: 419 | This is a machine learning system developed by Google that generates realistic images from textual descriptions. It uses a cascade of diffusion models guided by a large frozen language-model text encoder to create images that match the text prompt. This has yet to be released to the public. It is notably good at rendering legible text within images and was trained on a massive amount of image data. 420 | https://imagen.research.google/ 421 | 422 | #### Midjourney: 423 | This is a text-to-image model developed by the independent research lab Midjourney, Inc. that generates images from textual descriptions. Its architecture has not been published, but it is known for producing high-quality, heavily stylized images that match the text input. 424 | https://www.midjourney.com/ 425 | 426 | Each of these programs uses different approaches and techniques to generate images from text descriptions. While Stable Diffusion has received significant attention recently, these other programs offer alternative methods for generating images based on text prompts. 427 | 428 | 429 | ### Stable Diffusion Powered Websites and Communities 430 | Some of the most notable websites and communities based on SD are: 431 | 432 | #### DreamStudio (Official by StabilityAI): 433 | This website uses Stable Diffusion to generate high-quality images based on user-submitted text prompts. It offers a simple and intuitive user interface and allows users to download or share their generated images. 
434 | https://dreamstudio.ai/ 435 | 436 | #### PlaygroundAI: 437 | This is an online community that focuses on exploring the capabilities of Stable Diffusion and other deep-learning models. It provides a platform for researchers and enthusiasts to share their work, collaborate on projects, and discuss the latest developments in the field. 438 | https://playgroundai.com/ 439 | 440 | #### LeonardoAI: 441 | This is an online community that uses Stable Diffusion and other AI models to generate high-quality art and design. It provides a platform for artists and designers to experiment with new tools and techniques and showcase their work to a wider audience. 442 | https://app.leonardo.ai/ 443 | 444 | #### NightCafe: 445 | This website uses Stable Diffusion to generate surreal and dreamlike images based on user-submitted text prompts. It offers a unique and creative approach to image generation and has gained a dedicated following among art enthusiasts. 446 | https://nightcafe.studio/ 447 | 448 | #### BlueWillow: 449 | This is a free AI image-generation service that runs through a Discord bot, similar in workflow to Midjourney, and uses Stable Diffusion along with other deep-learning models under the hood. It has gained a following as a no-cost way to experiment with AI image generation. 450 | https://www.bluewillow.ai/ 451 | 452 | #### DreamUp By DeviantArt: 453 | DreamUp is a prompt-driven image-generation tool operated by DeviantArt, Inc., designed to create AI art in a way that treats creators and their work fairly. You can create any image you can imagine with the power of artificial intelligence, and new users can try DreamUp with 5 free prompts. DeviantArt CEO Moti Levy says that the site isn’t doing any DeviantArt-specific training for DreamUp and that the tool is Stable Diffusion. 454 | https://www.deviantart.com/dreamup 455 | 456 | #### Lexica: 457 | Lexica is a self-styled Stable Diffusion search engine: a web app that provides access to a massive database of AI-generated images and their accompanying text prompts. It features a simple search box and Discord link, a grid layout mode to view hundreds of images on one page, and a slider to change the size of the image previews. It also has image-generation capabilities, which can be especially useful when you find a prompt you like and want to try it immediately. 458 | https://lexica.art/ 459 | 460 | #### Dreamlike Art: 461 | Dreamlike.art is a website that lets you generate free AI art straight from your browser. It features an “Infinity Canvas” that allows you to outpaint images, letting you create images larger than usual, which can result in some amazing panoramic-style pictures. 462 | https://dreamlike.art/ 463 | https://www.reddit.com/r/DreamlikeArt/ 464 | 465 | #### Art Breeder Collage Tool: 466 | Artbreeder Collage is a structured image generation tool with prompts and simple drawing tools. It allows mixing different pictures and shapes, which you can choose from the library or draw yourself, with a text prompt to generate new art with the power of neural networks. You can start with a collage that someone else has already created and make your own tweaks by moving, resizing and changing the colors of elements or by adding new ones. 
Or you can start out from scratch, either using a text prompt generated by the platform or by writing your own. 467 | https://www.artbreeder.com/browse 468 | 469 | #### Dream by Wombo: 470 | This is a mobile application that generates stylized images based on user-submitted text prompts and a choice of art styles. It has gained significant popularity for how quick and entertaining it is to use. 471 | https://dream.ai/ 472 | 473 | #### Draw Things 474 | https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820 475 | 476 | #### Krea AI 477 | https://www.krea.ai/ 478 | 479 | 480 | This is not a comprehensive list; there are many other websites and communities that use Stable Diffusion and other text-to-image models. Please contribute to this list. 481 | 482 | ### Community Chatrooms and Gathering Locations 483 | Reddit Core Communities 484 | - /r/StableDiffusion https://www.reddit.com/r/StableDiffusion 485 | - /r/sdforall https://www.reddit.com/r/sdforall 486 | - /r/dreambooth https://www.reddit.com/r/dreambooth 487 | - /r/stablediffusionUI https://www.reddit.com/r/stablediffusionUI 488 | - /r/civitai https://www.reddit.com/r/civitai 489 | 490 | Reddit Related Communities 491 | - /r/aiArt https://www.reddit.com/r/aiArt 492 | - /r/AIArtistWorkflows https://www.reddit.com/r/AIArtistWorkflows 493 | - /r/aigamedev https://www.reddit.com/r/aigamedev 494 | - /r/AItoolsCatalog https://www.reddit.com/r/AItoolsCatalog 495 | - /r/artificial https://www.reddit.com/r/artificial 496 | - /r/bigsleep https://www.reddit.com/r/bigsleep 497 | - /r/deepdream https://www.reddit.com/r/deepdream 498 | - /r/dndai https://www.reddit.com/r/dndai 499 | - /r/dreamlikeart https://www.reddit.com/r/dreamlikeart 500 | - /r/MediaSynthesis https://www.reddit.com/r/MediaSynthesis 501 | 502 | Discord 503 | - Stable Foundation https://discord.gg/stablediffusion 504 | 505 | #### Prompt Inspiration Communities & Tools 506 | Websites and platforms that offer prompt inspiration for SD. 507 | 508 | Libraire.ai: 509 | This is a searchable library of millions of AI-generated images together with the text prompts used to create them. Browsing it is a quick way to find prompt ideas and phrasings that work well with SD. 510 | 511 | Lexica.art: 512 | As noted above, Lexica is a search engine for Stable Diffusion images and their prompts. Searching for a subject or style and studying the prompts behind the results you like is an easy way to refine your own prompts. 513 | 514 | Krea.ai: 515 | Also listed above, Krea is a platform for browsing AI-generated images and the prompts behind them, which makes it useful for finding prompt ideas and styles to adapt for SD. 516 | 517 | PromptHero.com: 518 | This is a search engine for AI art prompts covering Stable Diffusion, Midjourney and other models. These prompts can be used to generate ideas for SD images and to refine text prompts for better results. 519 | 520 | OpenArt.ai: 521 | This is a platform for discovering AI-generated images along with their prompts, and it also hosts prompt templates and community challenges for artists and designers. These can be used to generate ideas for SD images and to refine text prompts for better results. 522 | 523 | PageBrain.ai: 524 | This is a website that offers a range of writing prompts and exercises for writers. Many of these prompts can be adapted for use with SD to generate images based on text. 
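A quick way to put prompt ideas gathered from these sites to work is to run a few candidate phrasings against the same seed and compare the results side by side. The sketch below is illustrative only: it assumes the Hugging Face `diffusers` library, a CUDA GPU, and the `runwayml/stable-diffusion-v1-5` checkpoint, and the prompts and seed are made-up examples rather than anything taken from the sites above.

```python
# Hypothetical sketch: comparing candidate prompts with a fixed seed so that
# differences between images come from the wording, not from random noise.
# Assumes: pip install diffusers transformers accelerate torch, and a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

candidate_prompts = [
    "a lighthouse at dusk, oil painting, dramatic lighting",
    "a lighthouse at dusk, watercolor, soft pastel colors",
    "a lighthouse at dusk, 35mm photograph, golden hour",
]

for i, prompt in enumerate(candidate_prompts):
    # Re-seeding inside the loop keeps the starting noise identical per prompt.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
    image.save(f"prompt_test_{i}.png")
```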
525 | 526 | 527 | 528 | 529 | 530 | 531 | ## Use Cases of Stable Diffusion 532 | 533 | ### Core Functionality & Use Cases 534 | Stable diffusion is primarily used for image generation, upscaling images and editing images. Subsets of these activities could be style transfer, photo repair, color or texture filling, image completion or polishing, and image variation. 535 | 536 | #### Image Generation 537 | #### Upscaling Images 538 | #### Editing Images 539 | #### Style Transfer 540 | #### Photo Repair/Touchups 541 | #### Color/Texture Filling 542 | #### Image Completion/Polishing 543 | #### Image Variation 544 | #### Outpainting 545 | 546 | #### Character Design 547 | 548 | #### Video Game Asset Creation 549 | 550 | #### Architecture and Interior Design 551 | 552 | ### Use Cases Other Than Image Generation 553 | 554 | #### Video & Animation 555 | 556 | ##### Deforum Animation 557 | https://github.com/deforum-art/deforum-stable-diffusion 558 | 559 | helpful Addons: 560 | https://github.com/deforum-art/deforum-for-automatic1111-webui 561 | https://github.com/rewbs/sd-parseq 562 | 563 | ##### Depth Module for Stable Diffusion 564 | 565 | Stable Diffusion (SD) is a powerful text-to-image generation model that can be used for a wide range of applications. To generate videos with a 3D perspective, a Depth Module has been developed that adds a mesh generation capability to SD. 566 | 567 | The Depth Module can be accessed through the Github repository (https://github.com/thygate/stable-diffusion-webui-depthmap-script). To generate the mesh required for video generation, the user needs to enable the "Generate 3D inpainted mesh" option on the Depth tab. This option can take several minutes to an hour, depending on the size of the image being processed. Once completed, the mesh in PLY format and four demo videos are generated, and all files are saved to the extras directory. 568 | 569 | The Depth Module also allows for the generation of videos from the PLY mesh on the Depth tab. This option requires the mesh created by the extension, as files created elsewhere might not work correctly. Some additional information is stored in the file that is required for the video generation process, such as the required value for dolly. Most options are self-explanatory and can be adjusted to achieve the desired results. 570 | 571 | The Depth Module is a useful extension to Stable Diffusion that enables users to create videos with a 3D perspective. It requires some additional processing time, but the results can be impressive and add a new dimension to the images generated by the model. 572 | 573 | ##### Gen1 574 | though not publicly released and technically separate from stable diffusion, it is created by the same company and original authors of stable diffusion and we can assume that a lot of the technology under the hood is similar if not the same. But a note about it should be included here. 575 | 576 | Gen1 takes a video and a style image and applies that style to that image, this allows for things like a video of stacks of boxes to be turned into a cityscape or things like that. 577 | https://research.runwayml.com/gen1 578 | 579 | 580 | #### 3D Generation Techniques for Stable Diffusion & Related Diffusion Based 3D Generation 581 | Stable Diffusion (SD) is a powerful text-to-image generation model that has inspired the development of several techniques for generating 3D images and scenes based on text prompts. 
Some of the most notable methods are: 582 | 583 | ##### Text to 3D 584 | https://dreamfusion3d.github.io/ 585 | https://github.com/ashawkey/stable-dreamfusion 586 | 587 | ##### DMT Meshes / Point Cloud Based 588 | https://github.com/Firework-Games-AI-Division/dmt-meshes 589 | 590 | ##### 3D radiance Fields 591 | Not technically Stable Diffusion, but diffusion-based 3D modeling. 592 | https://sirwyver.github.io/DiffRF/ 593 | 594 | ##### Novel View Synthesis 595 | Not technically Stable Diffusion, but related. 596 | https://3d-diffusion.github.io/ 597 | 598 | ##### NeRF Based: 599 | This technique uses the Neural Radiance Fields (NeRF) algorithm to generate 3D models based on 2D images. The Stable Dreamfusion repository on Github (https://github.com/ashawkey/stable-dreamfusion) is an implementation of this technique for Stable Diffusion. It allows users to generate high-quality 3D models from text prompts and can be customized to achieve specific effects and styles. 600 | 601 | ##### Img to Fspy to Blender: 602 | This technique uses a combination of image analysis and 3D modeling software to create 3D scenes based on 2D images. It involves using the fSpy tool (https://fspy.io/) to analyze an image and estimate a camera position, then importing that camera into Blender to create a 3D scene. A tutorial on this technique is available on YouTube (https://youtu.be/5ntdkwAt3Uw) and provides step-by-step instructions for generating 3D scenes based on images. 603 | 604 | Both of these techniques offer powerful tools for generating 3D images and scenes based on text prompts. They require some additional software and processing time, but the results can be impressive and add a new dimension to the images generated by Stable Diffusion. 605 | 606 | ##### Image to Shapes 607 | This technique builds 3D shapes on top of images. A tutorial on this technique is available on YouTube by Albert Bozesan (https://youtu.be/ooSW5kcA6gI) and provides step-by-step instructions for building 3D shapes based on images. Roughly, you lay out the image inside Blender, then extrude the shapes and polish the model while using the image as a texture. 608 | 609 | Similar to https://github.com/jeacom25b/blender-boundary-aligned-remesh https://www.youtube.com/watch?v=AQckQBNHRMA 610 | 611 | #### 3D Texturing Techniques for Stable Diffusion 612 | Stable Diffusion (SD) is a powerful text-to-image generation model that has inspired the development of several techniques for generating 3D textures based on text prompts. Two of the most notable methods are: 613 | 614 | ##### Using Stable Diffusion for 3D Texturing: 615 | This technique involves using Stable Diffusion to generate high-quality images based on text prompts, and then using those images as textures for 3D models. This technique is described in detail in an article on 80.lv (https://80.lv/articles/using-stable-diffusion-for-3d-texturing/) and offers a powerful tool for generating realistic and detailed 3D textures. 616 | 617 | ##### Dream Textures: 618 | This is a project on Github (https://github.com/carson-katri/dream-textures) that uses Stable Diffusion to generate high-quality textures for 3D models. It allows users to customize the texture generation process and create unique and creative textures based on text prompts. 
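For the simplest version of the texturing workflow described above (generate an image from a prompt, then use it as a texture on a 3D model), a plain `diffusers` call is enough to produce a candidate texture map. This is a hedged sketch, not how Dream Textures itself is implemented: the model id, prompt, and resolution are assumptions, and unlike Dream Textures it does not guarantee a seamlessly tileable result.

```python
# Minimal sketch: generate a texture candidate from a text prompt and save it
# for use as an image texture in Blender or another 3D tool.
# Assumes a CUDA GPU and the diffusers library; names are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "seamless texture of weathered oak planks, top-down view, even diffuse lighting"
image = pipe(prompt, width=512, height=512, num_inference_steps=30).images[0]

# In Blender, this file can then be plugged into the Base Color input of a
# Principled BSDF material.
image.save("oak_planks_texture.png")
```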
619 | 620 | 621 | 622 | #### Music 623 | 624 | ##### Riffusion 625 | https://en.wikipedia.org/wiki/Riffusion 626 | 627 | #### Image-Based Mind Reading 628 | https://the-decoder.com/stable-diffusion-can-visualize-human-thoughts-from-mri-data/ 629 | 630 | #### Synthetic Data Creation 631 | https://hai.stanford.edu/news/could-stable-diffusion-solve-gap-medical-imaging-data 632 | 633 | 634 | 635 | ## How Stable Diffusion Works 636 | 637 | 638 | 639 | ## Hardware Requirements and Cloud-Based Solutions 640 | 641 | ### Methods of Compute 642 | Personal Hardware 643 | - Requires Cuda GPU 644 | - Requires Minimum of 8gb VRAM, more is better 645 | 646 | 647 | Community Contributed Compute 648 | - Stable Horde 649 | 650 | Cloud Based Solutions 651 | - Colab 652 | - 653 | 654 | 655 | ### Xformers for Stable Diffusion 656 | 657 | Xformers is a set of transformers that can be used as an alternative to Stable Diffusion's built-in transformers for text-to-image generation. Xformers can run on fewer resources and provide comparable or better results than built-in transformers, making them a popular choice for many users. 658 | 659 | However, Xformers can be prone to compatibility issues when upgrading, and many users have reported problems when upgrading to newer versions. Some users have had to downgrade to previous versions to resolve these issues. 660 | 661 | To downgrade Xformers, users can follow these instructions: 662 | 663 | Navigate to your Stable Diffusion webUI folder and go into venv, then scripts. 664 | 665 | Select the navigation bar and type in CMD. This should open a CMD window in this folder. Alternatively, users can open the CMD window and navigate to this folder. 666 | 667 | Type "activate" and hit enter to activate the virtual environment. 668 | 669 | Run the following command: "pip install xformers==0.0.17.dev449". 670 | 671 | This will downgrade Xformers to the specified version and resolve any compatibility issues. However, users should be aware that downgrading may result in some loss of functionality or performance compared to newer versions. It is recommended to carefully evaluate the specific needs and requirements of your project before downgrading. 672 | 673 | 674 | ## Beginner's How To 675 | 676 | 677 | ### Basics, Settings and Operations 678 | 679 | different sample methods 680 | 681 | sample steps 682 | 683 | CFG (Classifier-Free Guidance) Scale 684 | it is a setting that tells the AI how much effort it should use to force your prompt onto the seed theme. 685 | Higher CFG can cause higher contrast and saturation, lower can be blurry and desaturated, this is due to CFG stacking layers of influence each pass. 686 | - https://arxiv.org/abs/2112.10741 687 | 688 | denoising settings 689 | 690 | Seed Selection and Randomization 691 | Seeds that look kind of like what you want or have similar coloration to what you want will help you make that image easier and clearer and can do so with lower CFG. 692 | https://www.reddit.com/r/StableDiffusion/comments/xhsf8c/a_seed_tutorial/ 693 | https://www.reddit.com/r/StableDiffusion/comments/x8szj9/tutorial_seed_selection_and_the_impact_on_your/ 694 | 695 | 696 | 697 | 698 | 699 | 700 | 701 | 702 | ## Popular UIs 703 | 704 | 705 | ### Automatic 1111 706 | Automatic 1111's superpower is its rapid development speed and leveraging of community addons, usually within days of research being shown an addon for it in Auto1111 appears, if those addons prove popular enough they are eventually merged into standard features of the UI. 
Because of this, Auto1111 is likely the default choice of UI for most users until they have a specialized need or desire something easier to use. It is a powerful and comprehensive UI. 707 | 708 | Github: 709 | https://github.com/AUTOMATIC1111/stable-diffusion-webui 710 | 711 | Features: 712 | https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features 713 | 714 | Wiki: 715 | https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki 716 | 717 | Local Installation: 718 | https://github.com/AUTOMATIC1111/stable-diffusion-webui#installation-and-running 719 | - Windows Auto installer https://github.com/EmpireMediaScience/A1111-Web-UI-Installer 720 | 721 | Colab: 722 | 723 | Tutorials / How to Use: 724 | 725 | #### Automatic 1111 Extensions 726 | Stable Diffusion (SD) is a powerful text-to-image generation model that has inspired the development of several extensions and plugins that enhance its capabilities and offer new features. Many of these extensions can be found on the Github repository for AUTOMATIC1111 (https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Extensions) and can be installed through the extensions tab inside of AUTOMATIC1111, or by cloning the respective Github repositories into the extensions folder inside your AUTOMATIC1111 webUI/extensions directory. 727 | Github: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Extensions 728 | Github: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Custom-Scripts 729 | 730 | Some of the most notable extensions for Stable Diffusion are: 731 | 732 | ##### Ultimate Upscale: 733 | This is an extension that uses the ESRGAN algorithm to upscale images generated by Stable Diffusion to high-resolution versions. 734 | Github: https://github.com/Coyote-A/ultimate-upscale-for-automatic1111 735 | FAQ: https://github.com/Coyote-A/ultimate-upscale-for-automatic1111/wiki/FAQ 736 | 737 | ##### Config Presets: 738 | This is an extension that allows users to save and load configuration presets for Stable Diffusion. It simplifies the process of setting up Stable Diffusion for specific tasks and allows users to switch between presets quickly. The Github repository for this extension is available at https://github.com/Zyin055/Config-Presets. 739 | 740 | ##### Image Browser: 741 | This is an extension that provides a visual interface for browsing and managing the images you have generated with Stable Diffusion. It simplifies the process of finding and organizing past generations and allows users to preview images and their generation parameters. 742 | 743 | ##### Prompt Tag Autocomplete: 744 | This is an extension that provides autocomplete suggestions for text prompts based on previously used prompts. It speeds up the process of entering prompts and reduces the likelihood of errors. 745 | 746 | ##### Txt2Mask: 747 | This is an extension that generates masks from text prompts. It allows users to select specific regions of an image to generate output from and can be useful for tasks such as object removal or image editing. 748 | 749 | ##### Ultimate HD Upscaler: 750 | This is an extension that uses a neural network to upscale images generated by Stable Diffusion to high-resolution versions. It offers improved upscaling quality compared to traditional algorithms. 751 | 752 | ##### Aesthetic Scorer: 753 | This is an extension that uses a neural network to score the aesthetic quality of images generated by Stable Diffusion. It can be used to evaluate the quality of generated images and provide feedback for improvement. 
754 | https://github.com/grexzen/SD-Chad 755 | 756 | ##### Tagger: 757 | This is an extension that adds tags to generated images based on the input text prompts. It can be useful for organizing and managing large numbers of generated images. 758 | 759 | ##### Inspiration Images: 760 | This is an extension that provides a database of images for use as input prompts. It can be useful for generating images based on specific themes or styles. 761 | 762 | ##### Depth Map Library and Poser: 763 | https://github.com/jexom/sd-webui-depth-lib 764 | 765 | ##### OpenPose Editor: 766 | https://github.com/fkunn1326/openpose-editor 767 | 768 | ##### Shift Attention Script 769 | https://github.com/yownas/shift-attention 770 | 771 | ##### prompt interpolation 772 | https://github.com/EugeoSynthesisThirtyTwo/prompt-interpolation-script-for-sd-webui 773 | 774 | ##### Text2Palette 775 | https://github.com/1ort/txt2palette 776 | 777 | ##### Multiple Hypernetworks 778 | https://github.com/antis0007/sd-webui-multiple-hypernetworks 779 | 780 | ##### Img2Tiles & Img2Mosaic 781 | https://github.com/arcanite24/img2tiles 782 | https://github.com/1ort/img2mosaic 783 | 784 | ##### Depthmap & Stereo Image 785 | https://github.com/thygate/stable-diffusion-webui-depthmap-script 786 | 787 | ##### Layers Editing, Blending 788 | https://github.com/KohakuBlueleaf/a1111-sd-webui-haku-img 789 | 790 | ##### Model Toolkit 791 | https://github.com/arenatemp/stable-diffusion-webui-model-toolkit 792 | 793 | ##### Prompt Test 794 | it creates a grid of entire prompt but each image has one item of the prompt removed so you can see which part of the prompt affected the image and which did not 795 | https://github.com/Extraltodeus/test_my_prompt 796 | 797 | ##### Booru Tag Autocomplete 798 | https://github.com/DominikDoom/a1111-sd-webui-tagcomplete 799 | 800 | ##### Alpha Canvas 801 | https://github.com/TKoestlerx/sdexperiments 802 | 803 | ##### Unofficial PEZ - hard prompts made easy 804 | https://github.com/YuxinWenRick/hard-prompts-made-easy 805 | 806 | ##### Two Shot 807 | https://github.com/opparco/stable-diffusion-webui-two-shot 808 | 809 | ##### Composable Lora 810 | https://github.com/opparco/stable-diffusion-webui-composable-lora 811 | 812 | ##### Couple Helper - lets you choose where to apply prompts on a grid 813 | https://github.com/Zuntan03/LatentCoupleHelper 814 | https://github-com.translate.goog/Zuntan03/LatentCoupleHelper?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp 815 | 816 | ##### Latent Couple Extension 817 | https://github.com/miZyind/sd-webui-latent-couple 818 | https://github.com/ashen-sensored/stable-diffusion-webui-two-shot 819 | 820 | ##### Remove Background 821 | https://github.com/AUTOMATIC1111/stable-diffusion-webui-rembg 822 | 823 | ###### Models for Background Removal 824 | taken from this comment: https://www.reddit.com/r/StableDiffusion/comments/11s02mx/comment/jcbe029/?utm_source=share&utm_medium=web2x&context=3 825 | 826 | u2net (download:https://github.com/danielgatis/rembg/releases/download/v0.0.0/u2net.onnx, source:https://github.com/xuebinqin/U-2-Net): A pre-trained model for general use cases. 827 | u2netp (download:https://github.com/danielgatis/rembg/releases/download/v0.0.0/u2netp.onnx, source:https://github.com/xuebinqin/U-2-Net): A lightweight version of u2net model. 828 | u2net_human_seg (download:https://github.com/danielgatis/rembg/releases/download/v0.0.0/u2net_human_seg.onnx, source:https://github.com/xuebinqin/U-2-Net): A pre-trained model for human segmentation. 
829 | u2net_cloth_seg (download:https://github.com/danielgatis/rembg/releases/download/v0.0.0/u2net_cloth_seg.onnx, source:https://github.com/levindabhi/cloth-segmentation): A pre-trained model for parsing clothes from a human portrait. Clothes are parsed into 3 categories: upper body, lower body, and full body. 830 | silueta (download:https://github.com/danielgatis/rembg/releases/download/v0.0.0/silueta.onnx, source:https://github.com/xuebinqin/U-2-Net/issues/295): Same as u2net but the size is reduced to 43Mb. 831 | 832 | ##### Anime Background Remover 833 | https://github.com/KutsuyaYuki/ABG_extension 834 | 835 | 836 | 837 | 838 | 839 | ### Kohya 840 | Kohya's superpower is its LoRA support: it can train and use LoRAs, merge them with ckpts, and even extract LoRAs from ckpts. 841 | 842 | Windows: 843 | https://github.com/bmaltais/kohya_ss 844 | 845 | Linux: 846 | https://github.com/Thund3rPat/kohya_ss-linux 847 | 848 | Colab: 849 | https://github.com/Spaceginner/kohya_ss_colab 850 | 851 | Colab and/or Auto1111 addon: 852 | https://github.com/ddPn08/kohya-sd-scripts-webui 853 | 854 | #### Addons 855 | https://github.com/kohya-ss/sd-webui-additional-networks 856 | 857 | 858 | 859 | 860 | ### EasyDiffusion (Formerly Stable Diffusion UI) 861 | https://github.com/cmdr2/stable-diffusion-ui 862 | 863 | 864 | ### InvokeAI 865 | https://github.com/invoke-ai/InvokeAI 866 | 867 | Unified Canvas Option 868 | 869 | Diffusers can be used natively 870 | 871 | 872 | ### DiffusionBee (Mac OS) 873 | https://github.com/divamgupta/diffusionbee-stable-diffusion-ui 874 | 875 | ### NMKD GUI 876 | https://nmkd.itch.io/t2i-gui 877 | https://github.com/n00mkrad/text2image-gui 878 | 879 | Apparently it has tools for pruning models 880 | 881 | Requirements: 882 | https://github.com/n00mkrad/text2image-gui/blob/main/README.md#system-requirements 883 | 884 | Features: 885 | https://github.com/n00mkrad/text2image-gui/blob/main/README.md#features-and-how-to-use-them 886 | 887 | ### ComfyUI 888 | https://github.com/comfyanonymous/ComfyUI 889 | 890 | ### AINodes 891 | It's not popular yet, but I expect it will be. 892 | https://www.reddit.com/r/StableDiffusion/comments/11psrvp/ainodes_teaser_update/ 893 | https://github.com/XmYx/ainodes-engine 894 | 895 | ## Model Training and Other Training UIs 896 | webui model toolkit https://github.com/arenatemp/stable-diffusion-webui-model-toolkit 897 | 898 | 899 | 900 | 901 | 902 | ### Other Software Addons that Act like a UI 903 | https://github.com/carson-katri/dream-textures 904 | 905 | 906 | 907 | 908 | 909 | 910 | 911 | 912 | 913 | 914 | 915 | ## Resources & Useful Links 916 | 917 | ### Helpful Tools 918 | 919 | #### Tool Directories and Explanations 920 | https://sdtools.org/ 921 | 922 | https://diffusiondb.com/ 923 | 924 | 925 | 926 | ### Where to Get Models Made By Community 927 | https://civitai.com/ 928 | 929 | https://huggingface.co/spaces/huggingface-projects/diffusers-gallery 930 | 931 | https://huggingface.co/sd-dreambooth-library 932 | 933 | https://fantasy.ai/ 934 | 935 | https://sinkin.ai/ 936 | 937 | 938 | #### Notes About Models 939 | 940 | ##### Model Safety Measures 941 | 942 | In the world of machine learning, there are two formats in which models can be saved: .ckpt and .safetensors. The older format, .ckpt, is a pickled Python object; loading it can execute arbitrary code and therefore has the potential to do anything that a program can do, including erasing or modifying files on your computer. The newer format, .safetensors, was created to address this weakness and supposedly loads faster when switching models.
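As a small illustration of the difference, here is a minimal sketch (file names are placeholders) that loads a pickled .ckpt with PyTorch and re-saves its weights as .safetensors. Note that loading the .ckpt itself runs the unpickler, so only do this with files you already trust:

```python
# Minimal sketch: re-save a pickled .ckpt checkpoint as .safetensors.
# "model.ckpt" and "model.safetensors" are placeholder file names.
import torch
from safetensors.torch import save_file

checkpoint = torch.load("model.ckpt", map_location="cpu")  # this unpickles, so only run it on trusted files
state_dict = checkpoint.get("state_dict", checkpoint)      # SD checkpoints usually nest weights under "state_dict"

# safetensors stores plain tensors only, so drop anything else (e.g. optimizer state)
tensors = {k: v.contiguous() for k, v in state_dict.items() if isinstance(v, torch.Tensor)}
save_file(tensors, "model.safetensors")
```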
943 | 944 | To ensure the safety and performance of your machine learning models, it is important to only download from trusted sources and to scan .ckpt files with a pickle scanner, or to download a .safetensors version of the model instead. While there haven't been any reported cases of code injection through a .ckpt model, there is always a possibility, and it is better to err on the side of caution. 945 | 946 | Both pickling and SafeTensors are widely used for saving, loading, and transferring machine learning models, and understanding the difference between them matters for keeping your setup safe. 947 | 948 | Pickling: 949 | This is a technique used to serialize and deserialize Python objects. Pickling is used in SD to save and load models, as well as to transfer data between processes. Pickling can be used to save the state of the model at various stages of training or to transfer a model between different machines or environments. However, pickling can also introduce security risks if used improperly, as it allows for arbitrary code execution. 950 | 951 | SafeTensors: 952 | This is a file format designed to store tensors safely. A .safetensors file contains only raw tensor data plus metadata that defines each tensor's type and shape, so loading it cannot execute arbitrary code the way unpickling a .ckpt can. 953 | 954 | 955 | 956 | 957 | 958 | 959 | ## Generating Images & Methods of Image Generation 960 | In the context of Stable Diffusion, image generation generally refers to the process of generating an image from scratch using a combination of textual prompts and/or image inputs. This process can be done using various techniques such as fine-tuning pre-trained models, using multiple embeddings, hypernetworks, and LORAs, merging models, and utilizing aesthetic gradients. The goal is to generate an image that reflects the desired style, subject, or concept that the user has in mind. Once an image has been generated, it can be further refined and tweaked using techniques such as image manipulation, denoising, and interpolation to achieve the desired outcome. 961 | 962 | 963 | ### Text2Image 964 | Stable Diffusion is a machine learning framework that is used for generating images from textual prompts. This is achieved through a process known as Text2Image, where textual input is used to generate corresponding images. The core functionality of Stable Diffusion is based on the use of a diffusion process, where a series of random noise vectors are iteratively modified to generate high-quality images. This process involves using a series of convolutional neural networks and other machine-learning techniques to generate the final image output. 965 | 966 | The Text2Image functionality of Stable Diffusion has been detailed in a paper available on arXiv, and there are also various tutorials and videos available to help users understand how the framework works. The main advantage of using Stable Diffusion for generating images from text is that it can produce high-quality, realistic images with relatively little input. This makes it a useful tool for a wide range of applications, from generating art to creating realistic simulations for computer games and other applications.
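For readers who prefer to script this instead of using a web UI, a minimal text-to-image sketch with the Hugging Face diffusers library might look like the following; the model id, prompt, and settings are only examples:

```python
# Minimal text-to-image sketch with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,   # how many denoising steps to run
    guidance_scale=7.5,       # how strongly the prompt guides the image
    height=512,
    width=512,
).images[0]
image.save("lighthouse.png")
```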
967 | Paper: https://arxiv.org/pdf/2112.10752.pdf 968 | How SD Works: https://www.youtube.com/watch?v=1CIpzeNxIhU 969 | 970 | #### Notes on Resolution 971 | The initial model was trained on 512x512px images, so when one deviates from that size it can sometimes act like it's generating and merging two images; this is the usual culprit when an image has a double head (one stacked on top of another). Other models like 2.0 have also been trained at a larger size of 768x768, and some custom user models are trained on custom image sizes. The most important thing to note is that deviating from the size a model was trained on can sometimes cause unforeseen strangeness in the generated images. 972 | 973 | #### Prompt Editing 974 | Prompt editing is a powerful tool in Stable Diffusion that allows users to manipulate and refine prompts to guide the generation process. Prompts come in two types: positive prompts and negative prompts. Positive prompts encourage the model to generate specific features, while negative prompts discourage the model from generating unwanted features. Prompt editing techniques include prompt emphasis, which allows users to highlight specific words or phrases in the prompt, and prompt delay, which holds back part of the prompt until a set fraction of the sampling steps has passed, allowing more fine-tuned control over the generation process. Other techniques include alternating words and using prompts that contain specific features, such as the rule of thirds, contrasting colors, sharp focus, and intricate details. These prompt editing techniques can help users achieve more precise and nuanced control over the generated images. 975 | 976 | https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#prompt-editing 977 | 978 | prompt engineering resources https://www.reddit.com/r/StableDiffusion/comments/xcrm4d/useful_prompt_engineering_tools_and_resources/?utm_source=share&utm_medium=web2x&context=3 979 | 980 | #### Negative Prompts 981 | Negative prompts are used to guide Stable Diffusion models away from certain image characteristics. However, the impact of negative prompts can be unpredictable and requires experimentation. It is important to note that there is no guaranteed set of negative prompts that will always produce the desired outcome, and the effectiveness of negative prompts can vary depending on the specific model, textual inversions, hypernetworks, or LoRA being used. It is recommended to focus on negative prompts that are relevant to the specific image you are trying to generate, rather than including irrelevant or meaningless prompts. Ultimately, it is important to experiment with different prompts and learn what works best for each specific use case. 982 | 983 | #### Alternating Words 984 | Alternating Words is a feature in Auto1111 that allows users to alternate between two keywords at each time step. This feature can be used by specifying the two keywords in square brackets separated by a vertical bar, such as [Salvador Dali|Pixel Art]. The model will then alternate between these two keywords when generating the image. This can be useful for exploring different styles or concepts in the generated images, as well as adding variety to the output. 985 | https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#alternating-words 986 | 987 | #### Prompt Delay 988 | Prompt Delay is a feature that can be used in various Stable Diffusion interfaces, allowing users to delay the appearance of certain keywords until a minimum number of steps have been reached.
The syntax for Prompt Delay involves adding a delay value to the end of a keyword, represented as a decimal between 0 and 1. For example, the prompt [Salvador Dali:Pixel Art:0.2] would delay the appearance of "Pixel Art" until 20% of the steps have been completed, with "Salvador Dali" being used for the first 20% and "Pixel Art" for the remaining 80%. This feature can be useful for fine-tuning the progression and appearance of keywords in the generation process. 989 | 990 | #### Prompt Weighting 991 | Prompt Weighting can be used in several interfaces for Stable Diffusion, though the exact syntax varies by UI. In AUTOMATIC1111, for example, the syntax is (Salvador Dali:1.1) (Pixel Art:1), where Salvador Dali has a weight of 1.1 and Pixel Art has a weight of 1. The weights allow you to adjust the importance of each keyword in the prompt, with higher weights indicating more importance. 992 | 993 | #### UI-Specific Syntax 994 | Stable Diffusion offers various UI frontends, each with its own unique or specific syntax for prompting. Examples of these syntaxes are provided with links to demonstrate the differences. 995 | 996 | 997 | ### Exploring 998 | Exploring the latent space in Stable Diffusion can be a daunting task due to its sheer size. However, there are several methods to explore it and find the desired image. One approach is to use brute force on a small part of the space near the optimal solution. Another method is to use random words or parameters to explore the space and discover new and interesting images. Overall, exploring the latent space is a key component of using Stable Diffusion effectively, and there are various techniques available to help with this task. 999 | 1000 | #### Randomness 1001 | Using randomness in the prompts and parameters is a powerful tool to explore different styles and types of images in Stable Diffusion. Randomness can be introduced in various ways, such as through the use of random words or phrases in the prompt, or through random adjustments to parameters such as the seed or CFG scale. This approach can lead to surprising and creative results, but can also be unpredictable and may require experimentation to achieve the desired outcome. Overall, incorporating randomness into the Stable Diffusion process can be a useful way to expand the range of possible images and generate novel and unexpected results. 1002 | 1003 | ##### Random Words 1004 | Random Words are a technique used in Stable Diffusion to explore different styles and types of images. The approach involves generating a large number of images using a combination of words that are randomly chosen. This technique can help to uncover new and interesting combinations of keywords and produce unique and unexpected results. There are several tools and libraries available for generating random words and incorporating them into Stable Diffusion prompts. One example is the sd-dynamic-prompts extension available on GitHub: https://github.com/adieyal/sd-dynamic-prompts 1005 | 1006 | ##### Wildcards 1007 | Wildcards are a feature in Stable Diffusion that allow users to explore the latent space by using a combination of random words and placeholders. These placeholders, represented by asterisks (*), can be used to substitute for any word or phrase, allowing for greater flexibility in the prompts. Users can generate a large number of images with randomly chosen words and placeholders, allowing for a wider exploration of the model's capabilities. This feature is available in the Stable Diffusion AI Prompt Examples repository on GitHub.
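A toy sketch of the random-words/wildcard idea in plain Python (the word lists and prompt template are made up) simply fills placeholder slots with a random pick each time a prompt is generated:

```python
# Toy sketch: build prompts by sampling random words from small lists,
# the same idea a wildcard file or dynamic-prompts extension automates.
import random

styles = ["watercolor", "pixel art", "oil painting", "ukiyo-e"]
subjects = ["a lighthouse", "a red fox", "an abandoned castle"]
lighting = ["golden hour", "moonlight", "neon glow"]

def random_prompt() -> str:
    # each call fills the placeholders with a random pick
    return f"{random.choice(subjects)}, {random.choice(styles)}, {random.choice(lighting)}"

for _ in range(5):
    print(random_prompt())
```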
1008 | https://github.com/joetech/stable-diffusion-ai-prompt-examples 1009 | 1010 | #### Brute Force 1011 | Brute force is a method of exploring the parameter space systematically. It can be performed in one, two, or multiple dimensions, such as exploring the impact of the CFG scale, steps, samplers, denoising strength, etc. This approach involves systematically testing all possible combinations of parameters to find the optimal solution or to explore the parameter space. However, brute force can be computationally expensive and time-consuming, especially when exploring high-dimensional spaces. As such, it is important to carefully consider the trade-off between computational cost and the potential benefits of using brute force. 1012 | 1013 | ##### Prompt Matrix 1014 | A prompt matrix is a method of generating a grid of images by combining parts of a prompt to create all possible combinations. For example, with a base prompt "a castle" and the extra prompt parts "chaotic" and "evil," a prompt matrix would generate a grid of images covering every combination: "a castle," "a castle, chaotic," "a castle, evil," and "a castle, chaotic, evil." This technique can be useful for exploring different combinations of prompts and generating a wide range of images. 1015 | 1016 | ##### XY Grid 1017 | XY Grid exploration is a method of exploring the parameter space of stable diffusion by generating a grid of images through varying two parameters. For example, steps and cfg scale can be varied to generate a grid of images with different values of these two parameters. This method can be useful for systematically exploring how changes in different parameters affect the output image. By generating a grid of images with different parameter values, it is possible to compare and analyze the effects of different parameter settings on the output image. 1018 | 1019 | ##### One Parameter 1020 | One parameter exploration involves generating a set of images by varying a single parameter, such as the delay in prompt delay. It can be useful for fine-tuning the impact of a particular parameter on image generation. 1021 | 1022 | 1023 | 1024 | 1025 | ## Editing Composition 1026 | Tools in Stable Diffusion used to edit the composition of an image. 1027 | 1028 | 1029 | 1030 | ### Image2Image 1031 | Img2img, or image-to-image, is a feature of Stable Diffusion that allows for image generation using both a prompt and an existing image. Users upload a base photo, and the AI applies changes based on entered prompts, resulting in refined and sophisticated art. The feature is similar to text-to-image generation, but with the added component of an existing image as a starting point. The possibilities for img2img generation are endless, with users experimenting with messy drawings, portraits, landscapes, and more to create a wide range of unique and creative artwork. The higher the denoising strength, the more different the image obtained will be. 1032 | 1033 | #### Img2Img 1034 | If you like the general composition of the image but don't want to change very many of the details, use img2img with a low denoising strength. If you want to change it a lot more, just use a higher denoising strength. 1035 | 1036 | #### Inpainting 1037 | Inpainting is a feature of Stable Diffusion that allows users to change small details within an image composition. For example, if a user is creating scenery and wants to change part of a river, they can use inpainting to edit the river until it appears as desired.
Similarly, if a user is creating a character and wants to add or edit features such as hands or a hat, they can use inpainting to make those changes. Inpainting uses specifically trained inpainting models that can be merged with other models. This feature enables users to create highly detailed and customized images with ease. 1038 | 1039 | #### Outpainting 1040 | Outpainting is a feature in Stable Diffusion that allows you to extend the boundaries of your image to create a larger composition. For example, if you have a character that you want to show in a specific environment, you can use outpainting to gradually extend the scenery around the character to create a more complete and consistent image. The feature uses specifically trained models for outpainting and can be merged with other models for more creative possibilities. 1041 | 1042 | #### Loopback 1043 | Loopback is a feature of Stable Diffusion where the output of image2image is fed into the input of the next image2image in a loop. This can be useful for creating a sequence of images with gradually decreasing changes between each image. By adjusting the denoising strength factor between each run, the number of changes can be progressively reduced, resulting in a smoother and more gradual transition between images. Loopback can also be used for creating animated sequences, where the output of each loop is fed into a video encoder to create a final animation. 1044 | 1045 | #### InstructPix2Pix 1046 | InstructPix2Pix is a tool that allows users to provide natural language instructions to Stable Diffusion for changing specific parts of an image. It uses a Pix2Pix-based neural network to generate the changed image. Users can input a sentence such as "make the sky red" or "remove the trees," and the tool will generate a modified version of the original image according to the instruction. It provides an easy and intuitive way for users to edit their images without requiring specific technical knowledge or skills. The tool is available on GitHub for free use and experimentation. 1047 | https://github.com/timothybrooks/instruct-pix2pix 1048 | 1049 | #### Depth2Image 1050 | Depth2Image is a feature of Stable Diffusion that performs image generation similar to img2img, but also takes into account depth information estimated using the monocular depth estimator MIDAS. This allows for better preservation of composition in the generated image compared to img2img. 1051 | 1052 | A depth-guided model, named "depth2img", was introduced with the release of Stable Diffusion 2.0 on November 24, 2022; this model infers the depth of the provided input image, and generates a new output image based on both the text prompt and the depth information, which allows the coherence and depth of the original input image to be maintained in the generated output. 1053 | 1054 | https://zenn.dev/discus0434/articles/ef418a8b0b3dc0 (Japanese) 1055 | 1056 | ##### Depth Map 1057 | A depth map is an image that assigns a depth value to each pixel in a given image. It provides information about the distance of objects in the scene from the viewpoint of the camera. In the context of Stable Diffusion, a depth map can be used as a reference to generate images with higher accuracy and create 3D-like effects. It can also be used to separate objects and perform post-processing, such as creating videos. There are scripts available on GitHub that allow for depth map functionality in Stable Diffusion's web interface. 
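To make the depth-guided workflow above concrete, here is a minimal, hedged sketch using the diffusers pipeline for the official depth2img model; the file names and prompt are placeholders:

```python
# Minimal depth2img sketch: the depth model estimates a depth map from the
# input photo and uses it to preserve the composition while restyling.
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png").convert("RGB")
result = pipe(
    prompt="a cartoon version of this scene",
    image=init_image,
    negative_prompt="blurry",
    strength=0.7,  # how far to move away from the original image
).images[0]
result.save("cartoon.png")
```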
1058 | https://github.com/thygate/stable-diffusion-webui-depthmap-script 1059 | 1060 | ##### Depth Preserving Img2Img 1061 | Depth Preserving Image2Image is a feature in Stable Diffusion that preserves the depth information of the original image during the image generation process. This allows for more accurate and consistent results when applying prompts and generating new images. For example, if you want to cartoonize a photo, using a conventional Image2Image with a prompt may change the proportions and positioning of the elements in the image. However, with depth preserving Image2Image, the generated image will maintain the same proportions and positions as in the original photo, while still applying the desired style or effect. This allows for greater creative flexibility while preserving the composition of the original image. 1062 | 1063 | #### ControlNet 1064 | ControlNet is an upgraded version of img2img that emphasizes edges and uses them in newly generated images. It refines the images by using special ControlNet models and can be used with any normal model. It allows for greater control of inputs and outputs and is ideal for coloring, filling in linework, texture reskin, style changes, or marking complex edges in an image that you don't want changed. ControlNet can also use scribbles as inputs and play well with larger and custom resolutions. The weights and models of ControlNet vary in their function and can include Midas/Depth, Canny-linework, HED-a mask, MLSD- for Architecture/Buildings/Straight Lines, OpenPose-pose transfer, and Scribble- a cross between Canny/HED for drawing scribbles. However, ControlNet has limitations with variation beyond "filling in" since it keeps the edges strongly. Overall, it is similar to Depth Maps, Normal Maps, and Holistically-nested edge detection. The ControlNet demo can be found on Hugging Face, and the research paper, repository, models, and tutorial can be found on GitHub. 1065 | 1066 | Different models of this do different things, and weight of it affects it too 1067 | - Midas - Depth 1068 | - Canny - linework 1069 | - HED - a mask? 1070 | - MLSD - for Architecture / Buildings / Straight Lines 1071 | - OpenPose - can transfer a pose from one image to another 1072 | - Scribble - like a cross between Canny/HED but meant to be used for drawing scribbles 1073 | 1074 | Demo - https://huggingface.co/spaces/hysts/ControlNet 1075 | Research Paper - https://raw.githubusercontent.com/lllyasviel/ControlNet/main/github_page/control.pdf 1076 | Repo - https://github.com/lllyasviel/ControlNet 1077 | Models - https://huggingface.co/lllyasviel/ControlNet 1078 | Compressed Models - https://huggingface.co/webui/ControlNet-modules-safetensors/tree/main 1079 | Automatic 1111 Addon - https://github.com/Mikubill/sd-webui-controlnet 1080 | Tutorial - https://youtu.be/vhqqmkTBMlU https://youtu.be/OxFcIv8Gq8o 1081 | Explanation - https://www.reddit.com/r/StableDiffusion/comments/119o71b/a1111_controlnet_extension_explained_like_youre_5/?utm_source=share&utm_medium=web2x&context=3 1082 | 1083 | 1084 | ### Pix2Pix-zero 1085 | Pix2Pix-zero is an interactive image-to-image translation tool built on top of the Pix2Pix architecture. It allows users to sketch simple drawings, which are then transformed into a fully realized image by the model. The unique aspect of Pix2Pix-zero is that it is a zero-shot learning approach, meaning that it can generate images based on unseen or incomplete sketches. 
1086 | 1087 | The interface of Pix2Pix-zero is simple and easy to use, with a sketch pad on the left and a preview of the generated image on the right. Users can select from several different models trained on different datasets to generate images in different styles. The models are trained on datasets such as horses, shoes, and handbags. 1088 | 1089 | The Pix2Pix-zero repository on GitHub includes pre-trained models as well as code for training your own models on custom datasets. Additionally, the website provides a live demo where users can try out the tool and generate their own images from sketches. Overall, Pix2Pix-zero provides an intuitive and interactive way for users to create images without needing advanced artistic skills. 1090 | https://pix2pixzero.github.io/ 1091 | https://github.com/pix2pixzero/pix2pix-zero 1092 | 1093 | 1094 | 1095 | ### Seed Resize 1096 | Seed resize is a feature in Stable Diffusion that allows users to preserve the composition of an image while changing its size. Users can resize the seed image, which is the initial image that is fed into the image generation process to generate images of different sizes while maintaining the same composition. This feature is useful for creating images of different resolutions or aspect ratios without sacrificing the overall composition. It is also helpful in generating images for specific platforms or devices that require specific resolutions or sizes. 1097 | 1098 | #### Variations 1099 | Variations are a feature of Stable Diffusion that allows for traversing latent space near the seed with a defined amount of difference. It generates a set of images that are similar to the original but with variations based on the given parameters. The variations can be used to explore different styles and variations for the same image, or to fine-tune the final output to the desired result. The feature can be useful in creating art that has a consistent theme or style while still being unique and interesting. 1100 | 1101 | 1102 | 1103 | 1104 | ## Finishing 1105 | Finishing in Stable Diffusion refers to the final touches required to display the generated image. These include correcting any issues with faces using face restoration techniques. Once the image is satisfactory, it can be upscaled to the desired image size using SD upscaling, which is considered one of the best methods for this task. In some cases, inpainting can also be used to touch up small details after upscaling. 1106 | 1107 | 1108 | ### Upscaling 1109 | Upscaling is a process of increasing the resolution of an image. In Stable Diffusion, images are usually generated at a lower resolution such as 512x512 or 768x768 for faster processing. However, to obtain higher-quality output or to use the generated image for printing or large displays, upscaling is necessary. There are various upscaling techniques available, including interpolation-based methods and deep learning-based methods. In Stable Diffusion, the preferred upscaling method is SD Upscale, which is a deep learning-based method specifically designed for stable diffusion. 1110 | 1111 | https://upscale.wiki/wiki/Model_Database 1112 | 1113 | #### BSRGAN 1114 | BSRGAN is a type of GAN (Generative Adversarial Network) that can be used for image super-resolution. It is designed to produce high-quality images with finer details and better textures than traditional methods. BSRGAN uses a combination of a generator network and a discriminator network to produce realistic images with high resolution. 
The generator network upscales a low-resolution image to a high-resolution image, while the discriminator network evaluates the quality of the generated image. The generator network is trained using a loss function that includes both adversarial loss and content loss. BSRGAN has been shown to produce high-quality super-resolved images in comparison to other state-of-the-art methods. The code for BSRGAN is available on GitHub. 1115 | https://github.com/cszn/BSRGAN 1116 | 1117 | #### ESRGAN 1118 | ESRGAN (Enhanced Super-Resolution Generative Adversarial Networks) is an image upscaling method that uses deep neural networks to generate high-resolution images from low-resolution inputs. It was introduced in a 2018 research paper by Wang et al. and has since been widely used in image-processing tasks. 1119 | 1120 | ESRGAN is based on the super-resolution GAN (SRGAN) model, which was introduced in 2017. However, ESRGAN improves upon SRGAN by incorporating residual blocks and a novel architecture called the "Enhancement Network" to enhance the high-frequency details in the output images. It also uses a perceptual loss function that takes into account both the content and style of the input image, resulting in more visually pleasing outputs. 1121 | 1122 | ESRGAN has been used in a variety of applications, including image restoration, image super-resolution, and image synthesis. It has shown promising results in producing high-quality, detailed images from low-resolution inputs, making it a useful tool for various industries such as film, gaming, and art. 1123 | 1124 | ##### 4x RealESRGAN 1125 | 4x RealESRGAN is an algorithm that is an upgrade to the ESRGAN algorithm. It is capable of upscaling images up to four times their original size while maintaining high image quality. RealESRGAN is based on deep neural networks and is trained on a large dataset of high-resolution images to learn how to upscale images without losing quality. The RealESRGAN algorithm can be accessed on GitHub, and a demo is available on the Hugging Face website. 1126 | https://github.com/xinntao/Real-ESRGAN 1127 | DEMO: https://huggingface.co/spaces/akhaliq/Real-ESRGAN 1128 | 1129 | ##### Lollypop 1130 | Lollipop is exceptional at making cartoon, manga, anime and pixel art content. 1131 | 1132 | Lollypop upscaler is a universal model aimed at pre-rendered images, including realistic faces, manga, pixel art, and dithering. The model is trained using the patchgan discriminator with cx loss, cutmixup, and frequency separation, resulting in good results with a slight grain due to patchgan and sharpening using cutmixup. It can handle a variety of image types and is designed for upscaling images to a higher resolution. 1133 | 1134 | ##### Universal Upscaler 1135 | Seems well-liked, It comes with a different level of sharpness. Universal Upscaler Neutral, Universal Upscaler Sharp, Universal Upscaler Sharper. 1136 | 1137 | ##### Ultrasharp 1138 | 4x-ultrasharp is a powerful upscaling model that generates high amounts of detail and texture, particularly for images with JPEG compression. It can also restore highly compressed images. If a more balanced output is desired, the UltraMix Collection is recommended, which is a set of interpolated models based on UltraSharp and other models. 1139 | 1140 | ##### Uniscale 1141 | Uniscale is a tool that is useful for upscaling images, and it comes in various settings depending on whether the user wants a sharper or softer upscale. 
Some of these settings include Uniscale Balanced, Uniscale Strong, Uniscale V2 Soft, Uniscale V2 Moderate, Uniscale V2 Sharp, Uniscale NR Balanced, Uniscale NR Strong, and Uniscale Interp. 1142 | 1143 | ##### NMKD Superscale 1144 | NMKD Superscale is a model specifically designed for upscaling realistic images and photos that contain noise and compression artifacts. It is trained using a combination of adversarial and perceptual losses, which helps to preserve details and textures while removing artifacts. The model has been optimized for JPEG and WebP compressed images, making it well-suited for images downloaded from the internet or taken on a mobile device. NMKD Superscale has been well-received by users for its ability to produce high-quality upscaled images with minimal artifacts. 1145 | 1146 | ##### Remacri by Foolhardy 1147 | Remacri is an image upscaler that is an interpolated version of IRL models like Siax, Superscale, Superscale Artisoft, Pixel Perfect, and more. It is based on BSRGAN but has more details and less smoothing, which helps preserve features like skin texture and other fine details. The goal is to prevent images from becoming mushy and blurry during the upscaling process. 1148 | 1149 | #### SD Upscale 1150 | SD Upscale is a method of upscaling images that uses Stable Diffusion to add details tile by tile after upscaling with a conventional upscaler. This is done to avoid running out of VRAM when processing the entire upscaled image. Any Stable Diffusion checkpoint can be used for this process. For example, an image can be generated using Stable Diffusion 1.5 and then upscaled using the depth model, or it can be generated using Stable Diffusion 2.1 and then upscaled using Robodiffusion. 1151 | 1152 | ##### SD 2.0 4xUpscaler 1153 | SD 2.0 4x Upscaler is the official model from stability.ai that allows for upscaling images by a factor of four. However, it requires a lot of VRAM to use, which can be a limitation for some users. 1154 | 1155 | 1156 | ### Restoring 1157 | Restoring is a process of fixing and improving the quality of an image. It can involve sharpening the image to enhance its details, or it can be used to fix specific issues like smoothing out skin textures or removing noise and artifacts. Restoring can be performed using various techniques and algorithms, depending on the specific needs of the image. For example, face restoration can be used to improve the quality of facial features and expressions, while denoising algorithms can be used to remove unwanted noise and improve the clarity of the image. Restoring is an important step in the image creation process to ensure that the final product is of high quality and meets the desired standards. 1158 | 1159 | #### Face Restoration 1160 | Face restoration algorithms are used to adjust the details of a face in an image, such as the eyes, skin texture, and overall clarity. These algorithms use machine learning techniques to identify facial features and make targeted adjustments to improve the overall appearance of the face. They can be used to enhance the quality of portrait photographs, as well as to correct facial imperfections or blemishes. Some popular face restoration algorithms include DeepFaceLab, Faceswap, and OpenCV. 1161 | 1162 | ##### GFPGAN 1163 | GFPGAN is an algorithm that uses StyleGAN for face restoration. The algorithm is based on a generative adversarial network that is trained to generate high-quality images of faces. 
It can be used for tasks such as face super-resolution, face inpainting, and face colorization. GFPGAN is an improvement over previous face restoration algorithms because it is able to produce more realistic results with better detail and texture. It is open source and available on GitHub, and a demo can be found on Hugging Face. 1164 | https://github.com/TencentARC/GFPGAN 1165 | DEMO: https://huggingface.co/spaces/akhaliq/GFPGAN 1166 | 1167 | ##### Code Former 1168 | Code Former is a face restoration algorithm that utilizes a convolutional neural network (CNN) to restore and refine facial features. The algorithm uses an encoder-decoder architecture with skip connections to effectively capture facial features and details while maintaining a smooth output. It also incorporates adversarial training to improve the realism of the output. The Code Former algorithm can be implemented using Python and Tensorflow. It has been shown to produce high-quality results in facial restoration tasks. 1169 | https://github.com/sczhou/CodeFormer 1170 | DEMO: https://huggingface.co/spaces/sczhou/CodeFormer 1171 | 1172 | 1173 | 1174 | ## Models ETC 1175 | 1176 | At the core of SD is the stable diffusion model, which is contained in a ckpt file. The stable diffusion model consists of three sub-models: 1177 | 1178 | Variational autoencoder (VAE): This sub-model is responsible for compressing and decompressing the image data into a smaller latent space. The VAE is used to generate a representation of the input image that can be easily manipulated by the other sub-models. 1179 | 1180 | U-Net: This sub-model is responsible for performing the diffusion process that generates the final image. The U-Net is used to gradually refine the image by adding or removing noise and information based on the text prompts. 1181 | 1182 | CLIP: This sub-model is responsible for guiding the diffusion process with text prompts. CLIP is a natural language processing model that is used to generate embeddings of the text prompts that are used to guide the diffusion process. 1183 | 1184 | Different models can use different versions of the VAE, U-Net, and CLIP models, depending on the specific requirements of the project. In addition, different samplers can be used to perform denoising in different ways, providing additional flexibility and control over the image generation process. 1185 | 1186 | Understanding the core components and models of SD is important for optimizing its performance and for selecting the appropriate models and settings for specific projects. 1187 | 1188 | 1189 | 1190 | ### Base Models for Stable Diffusion 1191 | 1192 | Stable Diffusion (SD) relies on pre-trained models to generate high-quality images from text prompts. These models can be broadly categorized into two types: official models and community models. 1193 | 1194 | Official models are trained on large datasets of images, typically billions of images, and are often referred to by their dataset size. For example, the LAION-2B model was trained on a dataset of 2 billion images, while the LAION-5B model was trained on a dataset of 5.6 billion images. These models are typically trained on a wide range of images and can generate high-quality images that are suitable for many different applications. 1195 | 1196 | Community models, on the other hand, are models that have been finetuned by users for specific styles or objects. These models are often based on the official models, but with modifications to the Unet and decoder or just the Unet. 
For example, a user might finetune an official model to generate images of specific animals or to generate images with a particular style or aesthetic. 1197 | 1198 | The choice of which model to use depends on the specific requirements of the project. Official models are generally more versatile and can be used for a wide range of applications, but may not produce the specific style or quality of image desired. on the other hand, community models may be more tailored to specific applications but may not be as versatile as official models. 1199 | 1200 | It is important to carefully evaluate the specific needs and requirements of a project before selecting a model and to consider factors such as dataset size, style, object, and computational resources when making a decision. 1201 | 1202 | #### Stable Diffusion Models 1.4 and 1.5 1203 | 1204 | Stable Diffusion (SD) has gone through several iterations of models, each trained on different datasets and with different hyperparameters. The earliest models, 1.1, 1.2, and 1.3, were trained on subsets of the LAION-2B dataset at resolutions of 256x256 and 512x512. 1205 | 1206 | Model 1.4 was the first SD model to really stand out, and it was trained on the LAION-aesthetics v2.5+ dataset at a resolution of 512x512 for 225k steps. Model 1.5 was also trained on the LAION-aesthetics v2.5+ dataset, but for 595k steps. It comes in two flavors: vanilla 1.5 and inpainting 1.5. 1207 | 1208 | Both models are widely used in the SD community, with many finetuned models and embeddings based on 1.4. However, 1.5 is considered the dominant model in use because it produces good results and is a solid all-purpose model. 1209 | 1210 | One important consideration for users is compatibility between models. Most things are compatible between 1.4 and 1.5, which makes it easier for users to switch between models and take advantage of different features or capabilities. 1211 | 1212 | It is important to evaluate the specific needs and requirements of a project when selecting a model and to consider factors such as dataset size, resolution, and hyperparameters when making a decision. 1213 | 1214 | #### Stable Diffusion Models 2.0 and 2.1 1215 | 1216 | Stable Diffusion (SD) models 2.0 and 2.1 were released closely together, with 2.1 considered an improvement over 2.0. Both models were trained on the LAION-5B dataset, which contains roughly 5 billion images, compared to the LAION-2B dataset used for earlier models. 1217 | 1218 | One of the biggest changes from a user perspective was the switch from CLIP (OpenAI) to OpenCLIP, which is an open-source version of CLIP. While this is a positive development from an open-source perspective, it does mean that some workflows and capabilities that were easy to achieve in earlier versions may not be as easy to replicate in 2.0 and 2.1. 1219 | 1220 | SD2.1 comes in both 512x512 and 768x768 versions. Because it uses OpenCLIP instead of CLIP, some users have expressed frustration at not being able to replicate their SD1.5 workflows on SD2.1. However, new fine-tuned models and embeddings are emerging rapidly, which are extending the capabilities of SD2.1 and making it more versatile for different applications. 1221 | 1222 | As with earlier models, it is important to carefully evaluate the specific needs and requirements of a project when selecting a model and to consider factors such as dataset size, resolution, and hyperparameters when making a decision. 
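In scripted workflows the base checkpoint is just an argument, so comparing 1.5 and 2.1 is mostly a matter of swapping the model id and the native resolution. A rough sketch with diffusers (the model ids are the official Hugging Face repos; the prompt is arbitrary):

```python
# Rough sketch: switching base models is a different checkpoint id
# and a different native resolution (512px for 1.5, 768px for 2.1).
import torch
from diffusers import StableDiffusionPipeline

prompt = "a portrait photo of an astronaut"

pipe_v15 = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image_512 = pipe_v15(prompt, height=512, width=512).images[0]

pipe_v21 = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
image_768 = pipe_v21(prompt, height=768, width=768).images[0]
```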
1223 | 1224 | ##### 512-Depth Model for Image-to-Image Translation 1225 | 1226 | The 512-depth model is a Stable Diffusion model that enables image-to-image translation at a resolution of 512x512. While conventional image-to-image translation methods can suffer from issues with preserving the composition of the original image, the 512-depth model is designed to preserve composition much better. However, it is important to note that this model is limited to image-to-image translation and does not support other tasks such as text-to-image generation or inpainting. 1227 | 1228 | 1229 | ### Community Models 1230 | #### Fine Tuned 1231 | Fine-tuned models for Stable Diffusion are models that have been trained on top of the pre-trained Stable Diffusion model using a specific dataset or a specific task. These fine-tuned models can be more specialized and provide better results for certain tasks, such as generating images of specific objects or styles. 1232 | 1233 | For example, a fine-tuned model for generating anime-style images can be trained on a dataset of anime images. Similarly, a fine-tuned model for generating high-resolution images can be trained on a dataset of high-resolution images. 1234 | 1235 | Fine-tuned models can be created using transfer learning, where the pre-trained model is used as a starting point and the weights are fine-tuned on the specific task or dataset. This approach can significantly reduce the time and resources required to train a new model from scratch. 1236 | 1237 | There are many fine-tuned models available for Stable Diffusion, and they can be found on various repositories and platforms, such as Hugging Face, GitHub, and other online communities. 1238 | 1239 | #### Merged/Merges 1240 | In Stable Diffusion, merged models are created by combining the weights of two or more pre-trained models to create a new model. This process involves taking the learned parameters of each model and averaging them to create a new set of weights. 1241 | 1242 | Merging models is often done to combine the strengths of multiple models and create a new model that is better suited for a specific task. For example, one might merge a model that is good at generating realistic faces with a model that excels at generating landscapes to create a new model that can generate realistic faces in landscapes. 1243 | 1244 | Merging models requires some knowledge of deep learning and neural networks, as the models being merged need to have similar architectures and be trained on similar tasks to be effectively combined. However, there are many pre-trained models available in Stable Diffusion that have already been merged and fine-tuned for specific tasks, making it easier for users to quickly find and use models that are suitable for their needs. 1245 | 1246 | ##### Tutorial for Add Difference Method 1247 | An alternative method to merge models is the use of the merge_lora script by kohya_ss. 1248 | 1249 | To use this method, first, create a mix of the target model and the LoRa model using the merge_lora script. The resulting image is almost identical to just adding the LoRa, with the difference attributed to small rounding errors. 1250 | 1251 | Next, add the LoRa to the target model, and also add the result of the add_difference method applied to the fine-tuned model and the mix of the target model and LoRa. 
The resulting merge, called the Ultimate_Merge, is 99.99% similar to the target model and can handle massive merges of hundreds of specialized models with the preferred mix without affecting it much. The Ultimate_Merge only loses 0.01% or even less of the information. 1252 | 1253 | link to original tutorial/comment (NSFW) https://www.reddit.com/r/sdnsfw/comments/10nb2jr/comment/j67trgn/ 1254 | 1255 | 1256 | 1257 | #### Megamerged/MegaMerges 1258 | Megamerged models in Stable Diffusion are models that have been created by merging more than 5 models with a specific style, object, or capabilities in mind. These models can be quite complex and powerful and are often used for specific purposes or applications. 1259 | 1260 | Creating a megamerged model involves taking several existing models and merging them together in a way that preserves the desired features of each individual model. This can be done using techniques like add_difference or merge_lora, as well as other methods. The resulting megamerged model is a new model that combines the strengths of each of the individual models that were used to create it. 1261 | 1262 | Megamerged models can be quite powerful and effective, but they can also be more complex and difficult to work with than simpler models. They may require more VRAM and longer training times, and they may require more expertise to fine-tune and optimize for specific tasks. However, for certain applications and use cases, megamerged models can be an effective tool for achieving high-quality results. 1263 | 1264 | #### Embeddings 1265 | Embeddings in Stable Diffusion are a way to add additional information to the model through text prompts. Community embeddings are created through textual inversion and can be added to prompts to achieve a desired style or object without using a fully fine-tuned model. These embeddings are not a checkpoint, but rather a new set of embeddings created by the community. Using embeddings can improve the quality and specificity of the generated images. Embeddings can be used to reduce biases within the original model or mimic visual styles. 1266 | 1267 | #### Community Forks 1268 | Style2Paints 1269 | Community forks are variations of the Stable Diffusion model that are developed and maintained by individuals or groups within the community. One such fork is Style2Paints, which is focused on being more of an artist's assistant than creating random generations. It seems to be highly anime-focused, but it is doing some interesting things with sketch infilling. The Style2Paints fork can be found on GitHub and includes a preview of version 5. 1270 | https://github.com/lllyasviel/style2paints/tree/master/V5_preview 1271 | 1272 | 1273 | 1274 | 1275 | 1276 | ### VAE (Variational Autoencoder) in Stable Diffusion 1277 | 1278 | In Stable Diffusion, the VAE (or encoder-decoder) component is responsible for compressing the input images into a smaller, latent space, which helps to reduce the VRAM requirements for the diffusion process. In practice, it is important to use a decoder that can effectively reconstruct the original image from the latent space representation. 1279 | 1280 | While the default VAE models included with Stable Diffusion are suitable for many applications, there are other fine-tuned models available that may better meet specific needs. For example, the Hugging Face model repository includes a range of fine-tuned VAE models that may be useful for certain tasks. 
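For example, in a diffusers-based workflow a fine-tuned VAE can simply be passed into an existing pipeline. A minimal sketch, using Stability AI's published sd-vae-ft-mse as the stand-in fine-tuned VAE:

```python
# Sketch: drop a fine-tuned VAE into an existing Stable Diffusion pipeline.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

image = pipe("a close-up portrait, natural light").images[0]
image.save("portrait.png")
```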
1281 | 1282 | When selecting a VAE model, it is important to consider factors such as dataset size, resolution, and other hyperparameters that may impact performance. Ultimately, the choice of VAE model will depend on the specific needs and requirements of the project at hand. 1283 | 1284 | #### Original Autoencoder in Stable Diffusion 1285 | 1286 | The original autoencoder included in Stable Diffusion is the default encoder-decoder used in the model. While it is generally effective at compressing images into a latent space for the diffusion process, it may not perform as well on certain types of images, particularly human faces. 1287 | 1288 | Over time, several fine-tuned autoencoder models have been developed and made available to the community. These models often perform better than the original autoencoder for specific tasks and image types. 1289 | 1290 | When selecting an autoencoder model for a specific application, it is important to consider factors such as image resolution, dataset size, and other hyperparameters that may impact performance. Ultimately, the choice of the autoencoder model will depend on the specific needs and requirements of the project at hand. 1291 | 1292 | #### EMA VAE in Stable Diffusion 1293 | 1294 | The EMA (Exponential Moving Average) VAE is a fine-tuned encoder-decoder included in Stable Diffusion that is specifically designed to perform well on human faces. This model was fine-tuned using an exponential moving average of the model weights during training, which helps to stabilize the training process and improve overall performance. 1295 | 1296 | Compared to the original autoencoder included with Stable Diffusion, the EMA VAE generally produces better results on images of human faces. However, it is important to consider other factors such as image resolution, dataset size, and other hyperparameters when selecting a VAE model for a specific application. 1297 | 1298 | Overall, the EMA VAE is a valuable addition to the range of encoder-decoder models available in Stable Diffusion, particularly for applications that require high-quality image generation of human faces. 1299 | 1300 | #### MSE VAE in Stable Diffusion 1301 | 1302 | The MSE (Mean Squared Error) VAE is another fine-tuned encoder-decoder included in Stable Diffusion that is designed to perform well on images of human faces. This model uses MSE as the reconstruction loss during training, which can help to improve the quality of the reconstructed images. 1303 | 1304 | Compared to the original autoencoder and other VAE models included with Stable Diffusion, the MSE VAE generally produces better results on images of human faces. However, as with any model selection, it is important to consider other factors such as image resolution, dataset size, and other hyperparameters. 1305 | 1306 | Overall, the MSE VAE is a useful option for applications that require high-quality image generation of human faces, particularly when used in combination with other techniques such as diffusion and CLIP-guidance. 1307 | 1308 | 1309 | 1310 | ### Samplers 1311 | Samplers are used in Stable Diffusion to denoise images during the diffusion process. They are different methods of solving the differential equations involved, and there are both classic numerical methods like Euler and Heun as well as newer solvers designed specifically for diffusion models, like DDIM, DPM, and DPM2. Some samplers are faster than others, and some converge to a final image while others, like ancestral samplers, simply keep generating new images with an increasing number of steps.
It's important to test and compare the speed and performance of different samplers for different use cases, but generally, the DPM++ sampler is considered the best option for most situations. 1312 | 1313 | https://www.youtube.com/watch?v=gtr-4CUBfeQ 1314 | 1315 | #### Ancestral Samplers 1316 | Ancestral samplers are designed to maintain the stochasticity of the diffusion process, where a small amount of noise is added to the image at each step, leading to different possible outcomes. This is in contrast to non-ancestral samplers, which are deterministic for a given seed and converge toward a single image as the number of steps increases. Ancestral samplers can produce interesting and diverse results with a low number of steps, but the downside is that the generated images can be more noisy and less realistic compared to the results obtained from non-ancestral samplers. 1317 | 1318 | ##### DPM++ 2S A Karras 1319 | DPM++ 2S A Karras is a two-step DPM++ solver. The "2S" in the name stands for "two-step". The "A" means it is an ancestral sampler, and "Karras" refers to the noise schedule proposed by Tero Karras et al. in the paper "Elucidating the Design Space of Diffusion-Based Generative Models". 1320 | 1321 | ##### DPM++ A 1322 | DPM++ A is an ancestral sampler version of the DPM++ sampler, meaning that it adds a little bit of noise at each step and never converges to a final image. It is a multi-step solver for the diffusion process. It has been shown to produce high-quality results and is often used for generating images with complex textures and patterns. However, it can be computationally expensive and may take longer to generate images compared to other samplers. 1323 | 1324 | ##### Euler A 1325 | Euler A is an ancestral sampler that uses the classic Euler method to solve the discretized differential equations involved in the denoising process but adds a bit of noise at each step. This results in an image that is not necessarily converging to a single solution but rather keeps generating new variations at each step. Euler A is particularly effective at generating high-quality images at low step counts and offers a degree of control over the amount of noise added at each step for adjusting the output image. 1326 | 1327 | ##### DPM Fast 1328 | DPM Fast is a fast variant of the DPM (Diffusion Probabilistic Model) solver used to denoise images in Stable Diffusion. It is designed to converge in fewer steps than other methods, but it sacrifices some image quality to achieve this speed. DPM Fast is typically used for large batch processing, where speed is of the utmost importance. However, it may not be suitable for high-quality image generation where image fidelity is a priority. 1329 | 1330 | ##### DPM Adaptive 1331 | DPM Adaptive is a sampling method for Stable Diffusion that chooses its own step sizes based on an error tolerance instead of using a fixed, user-set number of steps. It is designed to be more efficient than other methods by skipping unnecessary steps, although this also makes the overall processing time less predictable. It is particularly useful for large images that require more processing time to denoise. 1332 | 1333 | #### DPM++ 1334 | DPM++ is a fast, high-order solver for diffusion probabilistic models that is used to speed up guided sampling.
Compared to other samplers like Euler, LMS, PLMS, and DDIM, DPM++ is super fast and can achieve the same result in fewer steps. Its speed makes it a popular choice for generating high-quality images quickly. The DPM++ model is described in two research papers, available at the links provided. 1335 | PAPER: https://arxiv.org/pdf/2211.01095.pdf 1336 | PAPER: https://arxiv.org/pdf/2206.00364.pdf 1337 | 1338 | ##### DPM++ SDE 1339 | DPM++ SDE is a stochastic version of the DPM++ sampler. It solves the diffusion process using a stochastic differential equation (SDE) solver, which can handle both continuous and discrete-time noise. This sampler is designed to handle larger-scale guided sampling and can generate high-quality images in a relatively small number of steps. It is also one of the fastest DPM++ samplers available. The Karras version is a similar sampler that produces similar images but is optimized for smaller guidance scales. 1340 | 1341 | ##### DPM++ 2M 1342 | DPM++ 2M is a multi-step sampler based on the Diffusion Probabilistic Models (DPM++) solver. It is designed to perform better for large guidance scales and produces high-quality images in fewer steps compared to other samplers. The Karras version is also available, which produces similar results to the original DPM++ 2M sampler. DPM++ 2M is recommended for users who want to generate high-quality images with large guidance scales efficiently. 1343 | 1344 | #### Common Samplers / Equilibrium Samplers 1345 | 1346 | ##### k_LMS 1347 | The k-LMS Stable Diffusion technique involves a sequence of minute, stochastic increments that proceed along the gradient of the distribution, originating from a specific location within the parameter space. By adapting the step magnitude according to the curvature of the distribution, this method reduces sample variance. Consequently, it facilitates swifter and more efficient sampling in the direction of the desired distribution. 1348 | 1349 | ##### DDIM 1350 | The DDIM Stable Diffusion technique represents an advanced adaptation of the k-LMS Stable Diffusion algorithm, delivering superior sampling accuracy. By further reducing sample variance and bolstering convergence towards the target distribution, this method attains enhanced performance. This improvement is achieved through the incorporation of additional information regarding the distribution's curvature into the model. Distinct from alternative algorithms, DDIM necessitates a mere eight steps to generate exceptional imagery. 1351 | 1352 | ##### k_euler_a and Heun 1353 | Analogous to DDIM, the k_euler_a and Heun samplers exhibit remarkable speed and generate outstanding outcomes with a minimal number of steps. Nonetheless, these methods also considerably modify the generative style. To achieve the optimal result, it is advised to transfer a promising image discovered in k_euler and Heun samplers to DDIM, or vice versa, and iterate until the ideal outcome is obtained. 1354 | 1355 | ##### k_dpm_2_a 1356 | Regarded by numerous experts as surpassing its counterparts, the k_dpm_2_a sampler prioritizes quality over speed. Entailing a 30- to 80-step procedure, this sampler yields exceptional outcomes. Ideally, it is employed for meticulously refined prompts exhibiting minimal inaccuracies, and may not be the most suitable sampler for exploratory purposes. 
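In diffusers-based scripts the sampler is the pipeline's scheduler object, so trying different samplers is a configuration swap rather than a different model. A minimal sketch (the model id and prompt are placeholders, and the scheduler-to-sampler mapping noted in the comments is approximate):

```python
# Sketch: swap the sampler (scheduler) on an existing pipeline.
import torch
from diffusers import (
    StableDiffusionPipeline,
    DPMSolverMultistepScheduler,      # roughly the "DPM++ 2M" family
    EulerAncestralDiscreteScheduler,  # the ancestral "Euler a" sampler
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
fast_image = pipe("an isometric diorama of a tiny village", num_inference_steps=20).images[0]

pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
varied_image = pipe("an isometric diorama of a tiny village", num_inference_steps=20).images[0]
```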
1357 | 1358 | 1359 | 1360 | 1361 | 1362 | 1363 | ## Methods of Training Models and Creating Embeddings 1364 | Capturing concepts involves training a model to generate images that match a certain style or object. This can be done in several ways, such as using a dataset of images that represent the desired style or object, or by fine-tuning an existing model on a small dataset of images that match the desired concept. 1365 | 1366 | One approach to capturing concepts is to use a method called "guided diffusion," which involves generating images that match a given prompt or text description. This can be done by using a pre-trained model and fine-tuning it on a small dataset of images that match the desired concept, or by using a style transfer method to transfer the desired style onto a set of images. 1367 | 1368 | Another approach is to use a method called "latent space interpolation," which involves exploring the latent space of a pre-trained model and manipulating the latent vectors to generate images that match a desired style or object. This method can be used to generate new images that are similar to a given image or to explore the space of different styles or objects. 1369 | 1370 | Overall, capturing concepts involves training a model to generate images that match a desired style or object, and there are several methods available for doing so, including guided diffusion and latent space interpolation. 1371 | 1372 | ### Dataset and Image Preparation 1373 | Dataset and image preparation is a crucial step in training and generating images with Stable Diffusion models. A well-prepared dataset can lead to better image quality and more efficient training. 1374 | 1375 | Image preparation is also important to ensure that images are of good quality and uniform in size. Images can be resized and cropped to a consistent aspect ratio, and color correction can be applied to ensure consistency across the dataset. 1376 | 1377 | A screenshot pipeline can be used to automatically extract screenshots from anime or video game footage. This can be a more efficient way to gather images for training or generating images in a specific style. 1378 | 1379 | Overall, preparing a high-quality dataset is essential for Stable Diffusion models to generate high-quality images. 1380 | 1381 | Tutorial: https://github.com/nitrosocke/dreambooth-training-guide 1382 | 1383 | Screenshot Pipeline: https://github.com/cyber-meow/anime_screenshot_pipeline 1384 | 1385 | #### Choosing Images 1386 | Try to use only high-resolution images that you shrink down to the training size. Stretching smaller images up to size leads to low-quality training and produces results that look blurry and pixelated. If you do have to upscale, use a proper upscaler, or the Photoshop blur/sharpen/Neural Filters JPEG-artifact-removal workflow. 1387 | 1388 | ##### Tip for training faces and characters 1389 | Aim for close to 30 images: about 10 face shots, 10 head shots, 6 torso shots, and 4 full-body shots. Use different outfits and backgrounds unless the outfit is core to the character. Label all the parts that are not an inherent part of the character; for example, if a hairstyle is part of the character you don't need to label it, but if the character often changes their hairstyle then it should be labelled. 1390 | 1391 | #### Captioning 1392 | Captioning is the process of providing textual descriptions or labels to images, which is a crucial step in many machine-learning tasks, such as image recognition, object detection, and image captioning.
In the context of training Stable Diffusion models, captioning can be helpful in providing additional context and guidance to the model, particularly when dealing with images of specific objects or styles. 1393 | 1394 | For example, when training a model to generate images of a particular character with different hairstyles or clothing, providing captions that mention the character's hair or clothing can help the model to focus on remembering the character's other built-in features and reproduce these features more consistently while allowing for variation of the features that were captioned. 1395 | 1396 | Captioning can also be useful in creating training datasets by automatically generating captions for images using tools like image captioning models or tag-based classifiers. These captions can then be used to train models for a variety of image-related tasks, including Stable Diffusion. 1397 | 1398 | #### Regularization/Classifier Images 1399 | Regularization/classifier images (often called class images) are extra images of the generic class your subject belongs to, for example "person" or "dog", used during the training process to help stabilize and regularize the model. They are typically generated with the base model itself from a simple class prompt and then mixed into DreamBooth-style training through the prior-preservation loss. 1400 | 1401 | The use of regularization images was initially met with skepticism in the Stable Diffusion community but has since been shown to be effective in improving model stability and image quality. 1402 | 1403 | Training on these class images alongside your subject images helps to ensure that the model is not overfitting to the training data and does not forget what the general class looks like, so it remains able to generalize to new images. 1404 | 1405 | Because they stand in for the generic class, it also helps to check that the class images themselves look reasonable; if they do not, the class prompt used to generate them usually needs adjusting. 1406 | 1407 | Overall, regularization/classifier images are an important tool in the Stable Diffusion training process, helping to ensure that models are stable, generalizable, and capable of generating high-quality images. 1408 | https://www.reddit.com/r/StableDiffusion/comments/z9g46h/i_was_wrong_classifierregularization_images_do/ 1409 | 1410 | ##### Links to Some Regularization Images 1411 | https://github.com/aitrepreneur/REGULARIZATION-IMAGES-SD 1412 | 1413 | 1414 | #### Training Tutorials 1415 | 1416 | BlueFaux's Tutorial: https://www.reddit.com/r/StableDiffusion/comments/10zze8f/how_to_train_your_series_abridged_link_to_full/ 1417 | 1418 | 1419 | ### Types of Training 1420 | Training is the process of fine-tuning a pre-existing model or creating a new one from scratch to generate images based on a specific subject or style. This is achieved by feeding the model with a large dataset of images that represent the subject or style. The model then learns the patterns and features of the input images and uses them to generate new images that are similar in style or subject. 1421 | 1422 | Training a model can be done in various ways, including transfer learning, where a pre-existing model is fine-tuned on a new dataset, or by creating a new model from scratch. The process typically involves setting hyperparameters, selecting the training dataset, defining the loss function, and training the model using an optimizer.
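As a rough illustration of what that loop looks like in code, here is a minimal sketch of the core training step that most of the methods below (Dreambooth, fine tuning, LORA) build on, written against diffusers-style components. The names `unet`, `vae`, `text_encoder`, `noise_scheduler`, and `dataloader` are assumed to be set up already; this is illustrative, not a complete training script.

```python
# Minimal sketch of the core diffusion fine-tuning step (noise-prediction objective).
# Assumes diffusers-style components and a dataloader yielding {"pixel_values", "input_ids"}.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

for batch in dataloader:
    # Encode images into the latent space the diffusion model works in.
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample() * 0.18215

    # Pick a random timestep per image and add the corresponding amount of noise.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Condition on the caption and train the UNet to predict the noise that was added.
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred.float(), noise.float())

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The methods below mostly differ around this step: token-based approaches like Dreambooth change how the dataset and captions are constructed (and whether class images are mixed in), while LORA and hypernetworks change which parameters the optimizer is allowed to touch.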
1423 | 1424 | Once a model is trained, it can be used to generate new images that represent the subject or style it was trained on. This can be useful for creating custom art, generating images for specific applications, or even creating new datasets for further training. Training a model can be a complex and time-consuming process, but it can also be very rewarding in terms of the results that can be achieved. 1425 | 1426 | #### File Type Overview 1427 | The most common file types used as models or embeddings 1428 | 1429 | Models: 1430 | .ckpt (Checkpoint file): A model checkpoint containing the weights and biases of the trained model, which can be loaded to restore the model at a later time. Stable Diffusion .ckpt files are PyTorch pickle-based checkpoints (TensorFlow uses the same extension for its own, unrelated checkpoint format). 1431 | .safetensors (SafeTensors file): A file format from Hugging Face used to store models and embeddings. It is optimized for fast, memory-mapped loading of large models and embeddings, and is also designed to be more secure, since loading it cannot execute arbitrary code the way pickle-based files can. 1432 | .pth (PyTorch model file): This is a file format used by PyTorch to save trained models. It contains the learned parameters and, in some cases, the model architecture. 1433 | .pkl: A Python pickle file, which is a serialized object that can be saved to disk and loaded later. Pickle is also what .ckpt files are built on under the hood; standalone .pkl models are less common in Stable Diffusion, and loading untrusted pickles is a security risk. 1434 | .pt: A PyTorch model file, which is used to save PyTorch models. This file type is also used in Stable Diffusion for saved models. 1435 | .h5: A Hierarchical Data Format file, which is commonly used in machine learning for saving models. This file type is used for some Stable Diffusion models. 1436 | 1437 | Embeddings: 1438 | .pt: PyTorch tensor file, which is a file format used for PyTorch tensors. This is the most common file type for embeddings in Stable Diffusion. 1439 | .npy: NumPy array file, which is a file format used for NumPy arrays. Some Stable Diffusion embeddings are saved in this format. 1440 | .h5: Hierarchical Data Format file, which can also be used for saving embeddings in Stable Diffusion. 1441 | .bin (Binary file): This is a general-purpose file format that can be used to store binary data, including models and embeddings. It is a compact format that is efficient for storing large amounts of data. 1442 | 1443 | #### CKPT/Diffuser/Safetensor 1444 | 1445 | 1446 | #### Textual Inversion 1447 | Textual inversion is a technique in which a new keyword (embedding) is created to represent data that is already known to the model, without changing the model's weights. It can be particularly useful for creating images of characters or people. Textual inversion can be used in conjunction with almost any other option and can help achieve more consistent results when training models. It is not simply a compilation of prompts, but rather a way to push the output toward a desired outcome. By mixing and matching different techniques, interesting and unique results can be achieved. 1448 | 1449 | A textual inversion embedding is trained against a specific base model, so although it will often work with related, compatible models, this is not always the case. 1450 | 1451 | https://github.com/rinongal/textual_inversion 1452 | COLAB: https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion_textual_inversion_library_navigator.ipynb 1453 | 1454 | Train New Embedding Tutorial: https://youtu.be/7OnZ_I5dYgw 1455 | 1456 | ##### Negative Embedding 1457 | A negative embedding is an embedding used in the negative prompt to avoid certain unwanted aspects in generated images.
These embeddings are typically created by generating images using only negative prompts. They can be used to group or condense a long negative prompt into a single word or phrase. Negative embeddings are useful in improving the consistency and quality of generated images, particularly in avoiding undesirable artistic aspects. 1458 | 1459 | #### LORA 1460 | LORA, or Low-Rank Adaptation, is a technique for training a model to a specific subject or style. LORA is advantageous over Dreambooth in that it only requires 6GB of VRAM to run and produces two small files of 6MB, making it less hardware-intensive. However, it is less flexible than Dreambooth and primarily focuses on faces. LORA can be thought of as injecting a part of a model and teaching it new concepts, making it a powerful tool for fine-tuning generated images without altering the underlying model architecture. One of the primary benefits of LORA is that it has a lower hardware requirement to train, although it can be more complex to train than other techniques. It also does not water down the model in the same way that merging models does. 1461 | 1462 | Training LORA requires following a specific set of instructions, which can be found in various tutorials available online. It is important to consider the weight of the LORA during training, with a recommended weight range of 0.5 to 0.7. 1463 | 1464 | LORA is not solely used in Stable Diffusion and is used in other machine learning projects as well. Additionally, DIM-Networks can be used in conjunction with LORA to further enhance training. 1465 | 1466 | https://github.com/cloneofsimo/lora 1467 | DEMO - Broken?: https://huggingface.co/spaces/ysharma/Low-rank-Adaptation 1468 | 1469 | Training LORA 1470 | Tutorial: https://www.reddit.com/r/StableDiffusion/comments/111mhsl/lora_training_guide_version_20_i_added_multiple/?utm_source=share&utm_medium=web2x&context=3 1471 | 1472 | Changing Lora Weight example: 0.5-:0.7 1473 | 1474 | 1475 | Number of Images in training data 1476 | 1477 | Converting Checkpoint to LORA 1478 | 1479 | ##### LoHa 1480 | Seems to be a LORA that has something to do with federated learning, meaning can be trained in small pieces by many computers instead of all at once in one large go? I'm not completely sure yet 1481 | Github: https://github.com/KohakuBlueleaf/LyCORIS 1482 | Paper: https://openreview.net/pdf?id=d71n4ftoCBy 1483 | 1484 | 1485 | #### Hypernetworks 1486 | Hypernetworks are a machine learning technique that allows for the training of a model without altering its weights. This technique involves the use of a separate small network, known as a hypernetwork, to modify the generated images after they have been created. This approach can be useful for fine-tuning generated images without changing the underlying model architecture. 1487 | 1488 | How Hypernetworks Work: 1489 | 1490 | Hypernetworks are typically applied to various points within a larger neural network. This allows them to steer results in a particular direction, such as imitating the art style of a specific artist, even if the artist is not recognized by the original model. Hypernetworks work by finding key areas of importance in the image, such as hair and eyes, and then patching these areas in secondary latent space. 1491 | 1492 | Benefits of Hypernetworks: 1493 | 1494 | One of the main benefits of hypernetworks is that they can be used to fine-tune generated images without changing the underlying model architecture. 
This can be useful in situations where changing the model architecture is not feasible or desirable. Additionally, hypernetworks are known for their lower hardware requirements compared to other training methods. 1495 | 1496 | Limitations of Hypernetworks: 1497 | 1498 | Despite their benefits, hypernetworks can be difficult to train effectively. Many users have voiced that hypernetworks work best with styles rather than faces or characters. This means that hypernetworks may not be suitable for all types of image-generation tasks. 1499 | 1500 | Tutorial: https://www.youtube.com/watch?v=1mEggRgRgfg 1501 | 1502 | G.A.? 1503 | 1504 | 1505 | #### Aesthetic Gradients 1506 | Aesthetic gradients are a type of image input that can be used as an alternative to textual prompts. They are useful when trying to generate an image that is difficult to describe in words, allowing for a more intuitive approach to image generation. However, some users have reported underwhelming results when using aesthetic gradients as input. The settings to modify weight may be unclear and unintuitive, making experimentation necessary. Aesthetic gradients may work best as a supplement to a trained model, as both the model and the gradients have been trained on the same data, allowing for added variation in generated images. 1507 | 1508 | 1509 | 1510 | ### Fine Tuning / Checkpoints/Diffusers/Safetensors 1511 | To fine-tune a model, you start with a pre-trained checkpoint or diffuser and then continue training it on your own dataset or with your own prompts. This allows you to customize the model to better fit your specific needs. Checkpoints are saved models that can be loaded to continue training or to generate images. Diffusers, on the other hand, refers here to the folder-based model format used by the Hugging Face Diffusers library. 1512 | 1513 | Fine-tuning can be done on a variety of pre-trained models, including the base models such as 1.4, 1.5, 2.0, 2.1, as well as custom models. Fine-tuning can be useful for training a model to recognize a specific subject or style, or for improving the performance of a model on a specific task. 1514 | 1515 | A diffuser, checkpoint (ckpt), and safetensor are all related to the process of training and using neural network models, but they serve different purposes: 1516 | 1517 | A diffuser, in the sense used here, is a model stored in the Hugging Face Diffusers format: instead of a single file, the model is split into folders for its components (UNet, VAE, text encoder, scheduler) together with their configuration files. This layout is what the Diffusers library and many training scripts load, and it holds the same kind of weights as a single-file checkpoint. 1518 | 1519 | A checkpoint (ckpt) is a file that contains the trained parameters (weights) of a neural network model at a particular point in the training process. Checkpoints are typically used for saving the progress of a training session so that it can be resumed later, or for transferring a pre-trained model to another computer or environment. Checkpoints can also be used to fine-tune a pre-trained model on a new dataset or task. 1520 | 1521 | A safetensor is a file that stores the trained parameters (weights) of a neural network model in the safetensors format, which is optimized for fast and efficient loading and processing. Safetensors are similar to checkpoints in that they store the model parameters, but the format is framework-agnostic and designed so that loading a file cannot execute arbitrary code, unlike pickle-based checkpoints.
Safetensors files can be loaded from most major frameworks (including PyTorch, which Stable Diffusion uses), and models stored as safetensors can be fine-tuned or used for transfer learning just like .ckpt files. 1522 | 1523 | In summary, "diffusers" here refers to the folder-based Hugging Face Diffusers model format, while checkpoints and safetensors are single-file formats used to store and load the trained parameters of a neural network model. All three store the same underlying weights and can generally be converted between one another; which one you need depends on the tools you are using. 1524 | 1525 | #### Token Based 1526 | Token-based fine-tuning is a simplified form of fine-tuning that requires fewer images and utilizes a single token to modify the model. This approach does not require captions for each image, making it easier to execute and reducing the chances of error. The single token is used to modify the model's weights to achieve the desired outcome. While token-based fine-tuning is a simpler method, it may not provide the same level of accuracy and customization as other forms of fine-tuning that use more detailed captions or multiple tokens. 1527 | 1528 | ##### Dreambooth 1529 | Dreambooth is a tool that allows you to fine-tune a Stable Diffusion checkpoint based on a single keyword that represents all of your images, for example, "mycat." This approach does not require you to caption each individual image, which can save time and effort. To use Dreambooth, you need to prepare at least 20 images in a square format of 512x512 or 768x768 and fine-tune the Stable Diffusion checkpoint on them. This process requires a significant amount of VRAM, typically above 15GB, and will produce a file ranging from 2GB to 5GB. Accumulative stacking is also possible in Dreambooth, which involves consecutive training while maintaining the structure of the models. However, this technique is challenging to execute. Overall, Dreambooth can be a useful tool for fine-tuning a Stable Diffusion checkpoint to a specific set of images using a single keyword. 1530 | 1531 | PAPER: https://dreambooth.github.io/ 1532 | TUTORIAL: https://www.youtube.com/watch?v=7m__xadX0z0 or https://www.youtube.com/watch?v=Bdl-jWR3Ukc 1533 | COLAB: https://colab.research.google.com/github/TheLastBen/fast-stable-diffusion/blob/main/fast-DreamBooth.ipynb 1534 | 1535 | 1536 | 1537 | ##### Custom Diffusion by Adobe 1538 | Custom Diffusion is a fine-tuning technique from Adobe Research for adapting a Stable Diffusion model to a new concept or small dataset. Rather than retraining the whole network, it fine-tunes only a small subset of the weights (primarily the cross-attention layers), which keeps training comparatively fast, keeps the resulting files small, and allows multiple new concepts to be combined in a single model. 1539 | 1540 | One of the key benefits of Custom Diffusion is its ability to generate high-quality images that are visually consistent with the training data. This makes it a powerful tool for a wide range of applications, from generating art and design to creating realistic simulations for video games and movies. 1541 | 1542 | However, like other fine-tuning approaches, Custom Diffusion still requires a capable GPU, a carefully prepared dataset, and some familiarity with machine-learning tooling to use effectively, so it may be more suitable for advanced users than for beginners.
1543 | https://github.com/adobe-research/custom-diffusion 1544 | https://huggingface.co/spaces/nupurkmr9/custom-diffusion 1545 | 1546 | #### Caption Based Fine Tuning 1547 | Caption-based fine-tuning is a method of fine-tuning a stable diffusion model that requires a large number of images, typically in the range of hundreds to thousands. In this approach, the image captions are used as the basis for fine-tuning, allowing for multi-concept training. While this method allows for more flexibility in training, it requires more work than other methods such as token-based fine-tuning. The key advantage of this approach is its ability to capture multiple concepts in the fine-tuning process, enabling more nuanced image generation. 1548 | 1549 | Caption-based fine-tuning requires a lot of captions, not necessarily a lot of images. It can be done with a smaller set of images, as long as they have a diverse range of captions that represent the desired concepts or styles. 1550 | 1551 | #### Fine Tuning 1552 | Fine tuning is a technique used to create a new checkpoint based on image captions. Unlike token-based fine tuning, this method requires a lot of images, ranging from hundreds to thousands. With fine tuning, you can choose to tune just the Unet or both the Unet and the decoder. This process requires a minimum of 15GB VRAM and produces a file ranging from 2GB to 5GB in size. While conventional dreambooth codes can be used for fine tuning, it is important to select the options that allow the use of captions instead of tokens. 1553 | 1554 | ##### EveryDream 2 1555 | I've found this one to personally give great results 1556 | 1557 | Github: https://github.com/victorchall/EveryDream2trainer 1558 | Discord: https://discord.gg/uheqxU6sXN 1559 | 1560 | TUTORIAL: https://docs.google.com/document/d/1x9B08tMeAxdg87iuc3G4TQZeRv8YmV4tAcb-irTjuwc/edit 1561 | 1562 | ##### Stable Tuner 1563 | Github: https://github.com/devilismyfriend/StableTuner 1564 | Discord: https://discord.gg/DahNECrBUZ 1565 | 1566 | ##### Dream Artist Auto1111 Extension 1567 | some have used this for single image training 1568 | 1569 | Github: https://github.com/7eu7d7/DreamArtist-sd-webui-extension 1570 | 1571 | #### Decoding Checkpoints 1572 | Decoding checkpoints refer to a method of using pre-trained models to generate images based on textual prompts or other inputs. These checkpoints contain a set of weights that have been optimized during the training process to produce high-quality images. The decoding process involves feeding a textual prompt into the model and using the learned weights to generate an image that matches the input. These checkpoints can be used for a wide variety of image generation tasks, including creating artwork, generating realistic photographs, or creating new designs for products. Different types of decoding checkpoints may be used for different types of tasks, and users may experiment with different models to find the one that works best for their specific needs. Overall, decoding checkpoints are a powerful tool for generating high-quality images quickly and efficiently. 1573 | 1574 | 1575 | 1576 | ### Mixing 1577 | Mixing in Stable Diffusion refers to combining different models, embeddings, prompts, or other inputs to generate novel and varied images. Image2text is a tool that can be used to analyze existing images and generate prompts that capture the style or content of the image. These prompts can then be used to generate new images using Stable Diffusion models. 
Additionally, mixing can be achieved by combining different models or embeddings together, either through merging or using hypernetworks. This can allow for greater flexibility in generating images with unique styles and content. 1578 | 1579 | #### Using Multiple types of models and embeddings 1580 | Using multiple types of models and embeddings such as hypernetworks, embeddings, or LORA can be useful for mixing different styles and objects together. By combining the strengths of multiple models, you can create more unique and diverse images. For example, using multiple embeddings can give you a wider range of prompts to use in image generation, while combining hypernetworks can help fine-tune the generated images without changing the underlying model architecture. However, using too many models at once can lead to decreased performance and longer training times. It is important to find a balance between using multiple models and keeping your system resources efficient. 1581 | 1582 | ##### Multiple Embeddings 1583 | When using Stable Diffusion for image generation, it is possible to use multiple embeddings simultaneously by adding the different keywords of the embeddings to your prompt. This can be helpful when attempting to mix different styles or objects together in your generated image. By using multiple embeddings, you can create more complex and nuanced prompts for the model to generate images from. 1584 | 1585 | ##### Multiple Hypernetworks 1586 | Using multiple hypernetworks can help mix styles and objects together in image generation. These hypernetworks can be added to the model to modify images in a certain way after they are created, without changing the underlying model architecture. While powerful, hypernetworks can be difficult to train and require a lower hardware requirement than fine-tuning models. By using multiple hypernetworks, users can achieve more diverse and nuanced results in their image generation. 1587 | https://github.com/antis0007/sd-webui-multiple-hypernetworks 1588 | 1589 | ##### Multiple LORA's 1590 | To achieve a more customized image output, multiple LORA models can be used in combination with custom models and embeddings. However, some users have reported that using more than 5 LORA models simultaneously can lead to poor results. It is important to experiment with different combinations and find the optimal balance of LORA models to achieve the desired output. 1591 | 1592 | #### Merging 1593 | Merging checkpoints allows for mixing two concepts together. This can be done by combining the weights of two or more pre-trained models. However, it is important to note that merging can cause a loss or weakening of some concepts in the final output due to the differences in the underlying architectures and training data of the models being merged. It is recommended to experiment with different merging approaches and models to achieve the desired results. 1594 | 1595 | ##### Merging Checkpoints 1596 | Merging checkpoints is a technique used to combine two different models to create a new model with characteristics of both. This process allows you to mix the models together in various proportions, ranging from 0% to 100%. By merging models, you can create entirely new styles and outputs that wouldn't be possible with a single model. However, it's important to note that merging models can also result in a loss or weakening of certain concepts. Therefore, it's important to experiment with different combinations and proportions to achieve the desired result. 
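As an illustration of what a "weighted sum" merge actually does, here is a minimal sketch that blends two checkpoints key by key. It assumes both checkpoints share the same architecture (the same state-dict keys); the file names and the interpolation ratio are placeholders.

```python
# Minimal sketch of a weighted checkpoint merge (the "weighted sum" mode most UIs offer).
# File names are placeholders; both checkpoints must share the same architecture/keys.
import torch

alpha = 0.3  # 0.0 = pure model A, 1.0 = pure model B

model_a = torch.load("modelA.ckpt", map_location="cpu")["state_dict"]
model_b = torch.load("modelB.ckpt", map_location="cpu")["state_dict"]

merged = {}
for key, tensor_a in model_a.items():
    if key in model_b and torch.is_tensor(tensor_a):
        # Linear interpolation between the two sets of weights.
        merged[key] = (1.0 - alpha) * tensor_a + alpha * model_b[key]
    else:
        merged[key] = tensor_a  # keep A's value for keys B lacks

torch.save({"state_dict": merged}, "merged.ckpt")
```

This is essentially what the checkpoint-merger tools in popular UIs do in their simplest mode; the more elaborate "add difference" mode instead adds the difference between two models, scaled by a multiplier, onto a third.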
1597 | 1598 | #### Converting Checkpoints/Diffusers/LORAs 1599 | Converting between checkpoints, the Diffusers folder format, LORAs, and safetensors means re-saving (or extracting) the same trained weights in a different on-disk format so they can be used by other tools and applications. 1600 | 1601 | To extract a LORA from a checkpoint, the usual approach is to take the difference between a fine-tuned checkpoint and its base model and approximate that difference with low-rank matrices; the LORA repository and Kohya's tools provide scripts for this, and it can also be done with the Kohya UI. 1602 | 1603 | To convert a checkpoint to a safetensors file, you re-save the same weights in the safetensors format. This is not compression or encryption, but a safer serialization that cannot execute arbitrary code when loaded, which makes it better suited for sharing with others. The conversion can be done with small standalone scripts or with most UIs. 1604 | 1605 | Converting checkpoints to LORA or safetensors can be useful for sharing models with others or for using them in other applications that expect a particular model format. 1606 | 1607 | 1608 | 1609 | 1610 | ### Image2Text 1611 | Image2text is a technique used to convert images into text descriptions, also known as image captioning. It involves using a trained model to generate a textual description of the content of an image. This can be useful for a variety of applications, such as generating captions for social media posts or providing context for image datasets used in machine learning. 1612 | 1613 | There are a few different approaches to image captioning, such as using a CNN-RNN model, which involves using a convolutional neural network to extract features from an image and then passing those features to a recurrent neural network to generate a description. Other models may use attention mechanisms or transformer architectures. 1614 | 1615 | To train an image captioning model, a large dataset of images with corresponding text descriptions is typically used. The model is then trained on this dataset using a loss function that compares the generated captions to the actual captions. Once trained, the model can be used to generate captions for new images. 1616 | 1617 | In the context of mixing two concepts, image2text can be used to generate textual descriptions of the different styles or objects being combined. These descriptions can then be used as prompts for a diffusion model to generate an image that combines those concepts. 1618 | 1619 | #### CLIP Interrogation 1620 | CLIP Interrogator is a Python package that enables users to find the most suitable text prompts that describe an existing image based on the CLIP model. This tool can be useful for generating and refining prompts for image generation models or for labeling images programmatically during training. 1621 | 1622 | CLIP Interrogator is available on GitHub and can be installed via pip. The package also includes a demo notebook showcasing the tool's functionality. Additionally, the package can be used with the Hugging Face Transformers library to further streamline the prompt generation process. 1623 | https://github.com/pharmapsychotic/clip-interrogator 1624 | DEMO: https://huggingface.co/spaces/pharma/CLIP-Interrogator 1625 | 1626 | #### BLIP Captioning 1627 | BLIP (Bootstrapping Language-Image Pre-training) is a framework for pre-training vision and language models that can generate captions for images.
It jointly trains a vision encoder and a text decoder on large-scale image-caption datasets, and uses a bootstrapping scheme (CapFilt) in which a captioner generates synthetic captions for web images and a filter removes the noisy ones, cleaning up the training data. The resulting model can be used both for vision-language understanding tasks and for generating captions for images. 1628 | 1629 | BLIP Image Captioning allows you to generate prompts for an existing image by interrogating the model. This is helpful in crafting your own prompts or for programmatically labeling images during training. BLIP-2 is the latest version of BLIP, further improved with a new architecture that builds on frozen image encoders and large language models. A demo of BLIP Image Captioning can be found on the Hugging Face website. 1630 | Paper: https://arxiv.org/pdf/2201.12086.pdf 1631 | Summary: https://ahmed-sabir.medium.com/paper-summary-blip-bootstrapping-language-image-pre-training-for-unified-vision-language-c1df6f6c9166 1632 | DEMO: https://huggingface.co/spaces/Salesforce/BLIP 1633 | BLIP2: 1634 | 1635 | #### DanBooru Tags / Deepdanbooru 1636 | Danbooru is a popular anime and manga imageboard website where users can upload and tag images. DeepDanbooru is a neural network trained on the Danbooru2018 dataset to automatically tag images with relevant tags. The tags can then be used as prompts to generate images in a particular style or with certain objects. 1637 | 1638 | DeepDanbooru is available as a web service or can be run locally on a machine with GPU support. The DeepDanbooru model is trained on more than 3 million images and over 10,000 tags, and is capable of tagging images with a high degree of accuracy. 1639 | 1640 | Using DeepDanbooru tags as prompts can be a powerful tool for generating anime and manga-style images or images featuring particular characters or objects. It can also be used for automating the tagging process for large collections of images. 1641 | 1642 | #### Waifu Diffusion 1.4 tagger - Using DeepDanBooru Tags 1643 | The Waifu Diffusion 1.4 tagger is a WebUI extension that uses tagger models trained for the Waifu Diffusion 1.4 project (in the style of DeepDanbooru) to automatically generate booru-style tags for images. The generated tags can be used for various purposes, such as captioning training data or organizing and searching images. 1644 | 1645 | The tagger works by taking an input image and generating tags for it using the tagger model. The generated tags are then displayed alongside the image. The user can edit the generated tags and add new tags as required. Once the tags are finalized, they can be saved and used for captioning or organizing images. 1646 | 1647 | The tool is available as an open-source project on GitHub and can be used by anyone for free. 1648 | https://github.com/toriato/stable-diffusion-webui-wd14-tagger 1649 | 1650 | 1651 | 1652 | 1653 | 1654 | 1655 | ### Pruning Models 1656 | NMKD GUI has pruning functionality 1657 | Dreambooth has this functionality?
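The scripts linked below differ in the details, but a typical .ckpt prune boils down to dropping entries that are not needed for inference (the EMA copy of the weights, optimizer state) and casting what remains to fp16. The sketch below is a minimal, illustrative version of that idea; it assumes the usual layout of Stable Diffusion 1.x checkpoints, and the file names are placeholders.

```python
# Minimal sketch of checkpoint pruning: drop inference-irrelevant entries and cast to fp16.
# Assumes the usual Stable Diffusion .ckpt layout; file names are placeholders.
import torch

ckpt = torch.load("model-full.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

pruned = {}
for key, value in state_dict.items():
    if key.startswith("model_ema."):        # EMA duplicate of the weights
        continue
    if torch.is_tensor(value) and value.dtype == torch.float32:
        value = value.half()                # fp16 roughly halves the file size
    pruned[key] = value

torch.save({"state_dict": pruned}, "model-pruned.ckpt")
```

Pruned checkpoints are fine for image generation, but keep the full checkpoint around if you plan to resume training from it.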
1658 | 1659 | https://medium.com/@souvik.paul01/pruning-in-deep-learning-models-1067a19acd89 1660 | https://www.tensorflow.org/model_optimization/guide/pruning/comprehensive_guide 1661 | https://colab.research.google.com/drive/1bBWC_MNN6MJvPXxw4e4paVwPwdzSJ-X0?usp=sharing 1662 | 1663 | https://raw.githubusercontent.com/prettydeep/Dreambooth-SD-ckpt-pruning/main/prune-ckpt.py 1664 | https://github.com/JoePenna/Dreambooth-Stable-Diffusion/blob/main/prune_ckpt.py 1665 | https://github.com/lopho/stable-diffusion-prune 1666 | 1667 | 1668 | 1669 | 1670 | 1671 | 1672 | ### One Shot Learning & Similar 1673 | One-shot learning is a machine learning technique where a model is trained on a small set of examples to classify new examples. In the context of Stable Diffusion, one-shot learning can be used to quickly train a model on a new concept or object with just a few images. 1674 | 1675 | One way to do this is to use a technique called fine-tuning, where a pre-trained model is modified to fit the new data. For example, if you want to train a model to generate images of your pet cat, you can fine-tune an existing Stable Diffusion model on a small set of images of your cat. This will allow the model to learn the specific characteristics of your cat and generate new images of it. 1676 | 1677 | Another approach is to use a technique called contrastive learning, where a model is trained to differentiate between positive and negative examples of a concept. For example, you can train a model to recognize your cat by showing it a few positive examples of your cat, and many negative examples of other cats or animals. This will allow the model to learn the unique features of your cat and distinguish it from other animals. 1678 | 1679 | One-shot learning can be useful in scenarios where there are only a few examples of a concept, or where collecting large amounts of data is not feasible. However, it may not always produce the same level of accuracy as traditional training methods that use large datasets. Additionally, the quality of the generated images may depend on the quality of the initial few examples used for training. 1680 | 1681 | #### DreamArtist (WebUI Extension) 1682 | DreamArtist is an Automatic1111 WebUI extension that implements one-shot learning for Stable Diffusion: from a single reference image it trains a pair of embeddings (one positive, one negative) that together capture the image's content and style. The learned embeddings can then be used in prompts like any other textual inversion embedding, making it a convenient way to teach the model a new concept from very little data and without any coding. 1683 | https://github.com/7eu7d7/DreamArtist-sd-webui-extension 1684 | 1685 | #### Universal Guided Diffusion 1686 | Universal Guided Diffusion (universal guidance) is a method that allows arbitrary off-the-shelf models, such as CLIP, segmentation networks, face recognition, or object detectors, to guide the sampling process of a frozen diffusion model without any retraining or fine-tuning. This makes it possible to steer generation with many different kinds of conditioning signals beyond text prompts, which suits a wide range of image-generation tasks. The code is available on GitHub, and a paper describing the method is available on arXiv.
1687 | https://github.com/arpitbansal297/Universal-Guided-Diffusion 1688 | PAPER: https://arxiv.org/abs/2302.07121 1689 | 1690 | 1691 | 1692 | 1693 | 1694 | 1695 | 1696 | 1697 | 1698 | ## Other Software Addons 1699 | 1700 | ### Blender Addons 1701 | #### Blender ControlNet 1702 | - https://github.com/coolzilj/Blender-ControlNet 1703 | #### Makes Textures / Vision 1704 | - https://www.reddit.com/r/blender/comments/11pudeo/create_a_360_nonerepetitive_textures_with_stable/ 1705 | #### OpenPose 1706 | - https://gitlab.com/sat-mtl/metalab/blender-addon-openpose 1707 | #### OpenPose Editor 1708 | - https://github.com/fkunn1326/openpose-editor 1709 | #### Dream Textures 1710 | - https://github.com/carson-katri/dream-textures https://www.youtube.com/watch?v=yqQvMnJFtfE https://www.youtube.com/watch?v=4C_3HCKn10A, similar to materialize https://boundingboxsoftware.com/materialize/ https://github.com/BoundingBoxSoftware/Materialize 1711 | #### AI Render 1712 | - https://blendermarket.com/products/ai-render https://www.youtube.com/watch?v=goRvGFs1sdc https://github.com/benrugg/AI-Render https://airender.gumroad.com/l/ai-render https://blendermarket.com/products/ai-render https://www.youtube.com/watch?v=tmyln5bwnO8 https://github.com/benrugg/AI-Render/wiki/Animation 1713 | #### Stability AI's official Blender 1714 | - https://platform.stability.ai/docs/integrations/blender 1715 | #### CEB Stable Diffusion (Paid) 1716 | - https://carlosedubarreto.gumroad.com/l/ceb_sd 1717 | #### Cozy Auto Texture 1718 | - https://github.com/torrinworx/Cozy-Auto-Texture 1719 | 1720 | ### Blender Rigs/Bones 1721 | #### ImpactFrames' OpenPose Rig 1722 | - https://ko-fi.com/s/f3da7bd683 https://impactframes.gumroad.com/l/fxnyez https://www.youtube.com/watch?v=MGjdLiz2YLk https://www.reddit.com/r/StableDiffusion/comments/11cxy5h/comment/jacorrt/?utm_source=share&utm_medium=web2x&context=3 1723 | #### ToyXYZ's Character bones that look like Openpose for blender 1724 | - https://toyxyz.gumroad.com/l/ciojz script to help it https://www.reddit.com/r/StableDiffusion/comments/11fyd6q/blender_script_for_toyxyzs_46_handfootpose/ 1725 | #### 3D posable Mannequin Doll 1726 | - https://www.artstation.com/marketplace/p/VOAyv/stable-diffusion-3d-posable-manekin-doll https://www.youtube.com/watch?v=MClbPwu-75o 1727 | #### Riggify model 1728 | - https://3dcinetv.gumroad.com/l/osezw 1729 | - 1730 | 1731 | ### Maya 1732 | #### ControlNet Maya Rig 1733 | - https://impactframes.gumroad.com/l/gtefj https://youtu.be/CFrAEp-qSsU 1734 | 1735 | ### Photoshop 1736 | #### Stable.Art 1737 | - https://github.com/isekaidev/stable.art 1738 | #### Auto Photoshop Plugin 1739 | - https://github.com/AbdullahAlfaraj/Auto-Photoshop-StableDiffusion-Plugin 1740 | 1741 | ### Daz 1742 | #### Daz Control Rig 1743 | - https://civitai.com/models/13478/dazstudiog8openposerig 1744 | 1745 | ### Cinema4D 1746 | #### Colors Scene (possibly no longer needed since controlNet Update) 1747 | - https://www.reddit.com/r/StableDiffusion/comments/11flemo/color150_segmentation_colors_for_cinema4d_and/ 1748 | 1749 | ### Unity 1750 | #### Stable Diffusion Unity Integration 1751 | - https://github.com/dobrado76/Stable-Diffusion-Unity-Integration 1752 | 1753 | ## Related Technologies, Communities and Tools, not necessarily Stable Diffusion, but Adjacent 1754 | DeepDream 1755 | - https://deepdreamgenerator.com/ 1756 | 1757 | StylGAN Transfer 1758 | 1759 | AI Colorizers 1760 | - DeOldify 1761 | 1762 | - Style2Paint https://github.com/lllyasviel/style2paints 1763 | 1764 | ## Techniques 
& Possibilities 1765 | 1766 | ### Seed and prompt blending 1767 | https://github.com/amotile/stable-diffusion-backend/tree/master/src/process/implementations/automatic1111_scripts 1768 | 1769 | ### Loopback Superimpose 1770 | https://github.com/DiceOwl/StableDiffusionStuff 1771 | https://github.com/Extraltodeus/advanced-loopback-for-sd-webui 1772 | 1773 | ### txt2img2img 1774 | https://github.com/ThereforeGames/txt2img2img (Outdated) 1775 | https://github.com/ThereforeGames/unprompted 1776 | 1777 | ### Seed Traveling 1778 | https://github.com/yownas/seed_travel 1779 | 1780 | ### Alternate Noise Samplers 1781 | https://gist.github.com/dfaker/f88aa62e3a14b559fe4e5f6b345db664 1782 | 1783 | ### Clip Skip & Alternating 1784 | CLIP-Skip is a slider option in the settings of Stable Diffusion that controls how early the processing of prompt by the CLIP network should be stopped. It is important to note that CLIP-Skip should only be used with models that were trained with this kind of tweak, which in this case are the NovelAI models. When using CLIP-Skip, the output of the neural network will be based on fewer layers of processing, resulting in better image generation on the appropriate models. 1785 | https://www.youtube.com/watch?v=IkMIoRCfCgE 1786 | https://www.reddit.com/r/StableDiffusion/comments/yj58r0/psa_clipskip_should_only_be_used_with_models/ 1787 | 1788 | ### Multi Control Net and blender for perfect Hands 1789 | https://www.youtube.com/watch?v=ptEZQrKgHAg&t=4s 1790 | 1791 | ### Blender to Depth Map 1792 | https://www.reddit.com/r/StableDiffusion/comments/115ieay/how_do_i_feed_normal_map_created_in_blender/ 1793 | 1794 | Many use freestyle to controlNet instead, claim it gives best results 1795 | 1796 | https://www.reddit.com/r/StableDiffusion/comments/zh8ava/comment/izks993/?utm_source=share&utm_medium=web2x&context=3 1797 | https://stable-diffusion-art.com/depth-to-image/ 1798 | 1799 | #### Blender to depth map for concept art 1800 | https://www.youtube.com/watch?v=L6J4IGjjr9w 1801 | 1802 | #### depth map for terrain and map generation? 
1803 | 1804 | #### Detextify - removes pseudo text from generations 1805 | https://github.com/iuliaturc/detextify 1806 | 1807 | 1808 | ### Blender as Camera Rig 1809 | https://www.reddit.com/r/StableDiffusion/comments/10fqg7u/quick_test_of_ai_and_blender_with_camera/ 1810 | 1811 | 1812 | ### SD depthmap to blender for stretched single viewpoint depth perception model 1813 | https://www.youtube.com/watch?v=vfu5yzs_2EU https://github.com/Ladypoly/Serpens-Bledner-Addons importdepthmap 1814 | 1815 | similar to https://huggingface.co/spaces/mattiagatti/image2mesh https://towardsdatascience.com/generate-a-3d-mesh-from-an-image-with-python-12210c73e5cc 1816 | similar to https://github.com/hesom/depth_to_mesh 1817 | 1818 | ### Daz3D for posing 1819 | https://www.reddit.com/r/StableDiffusion/comments/11owo31/comment/jbvdmsm/?utm_source=share&utm_medium=web2x&context=3 1820 | 1821 | ### Mixamo for Posing 1822 | https://www.reddit.com/r/StableDiffusion/comments/11owo31/something_that_might_help_ppl_with_posing/ 1823 | 1824 | ### Figure Drawing Poses as Reference Poses 1825 | https://figurosity.com/figure-drawing-poses 1826 | 1827 | 1828 | ### Generating Images to turn into 3D sculpting brushes 1829 | https://www.reddit.com/r/StableDiffusion/comments/xjju0q/ai_generated_3d_sculpting_brushes/ 1830 | 1831 | 1832 | ### Stable Diffusion to Blender to create particles using automesh plugin 1833 | https://twitter.com/subcivic/status/1570754141995290626 1834 | https://wesxdz.gumroad.com/l/xfdmzx 1835 | 1836 | ## Not Stable Diffusion But Relevant Techniques 1837 | 3D photo effect https://shihmengli.github.io/3D-Photo-Inpainting/ 1838 | 1839 | ## Other Resources 1840 | 1841 | ### API's 1842 | 1843 | NextML API for STable Diffusion https://api.stable-diffusion.nextml.com/redoc 1844 | 1845 | DreamStudio API --------------------------------------------------------------------------------
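As a rough sketch of what calling one of these hosted APIs looks like, the example below posts a text prompt to a Stability/DreamStudio-style REST endpoint and saves the returned images. The endpoint URL, engine ID, and payload fields are assumptions based on the v1 REST API and may differ from the current service; check the provider's documentation, and keep your API key in an environment variable rather than in code.

```python
# Rough sketch of calling a hosted Stable Diffusion REST API (DreamStudio / Stability-style).
# The endpoint, engine ID, and payload fields are assumptions; verify against current docs.
import base64
import os
import requests

API_KEY = os.environ["STABILITY_API_KEY"]  # assumed environment variable
URL = "https://api.stability.ai/v1/generation/stable-diffusion-v1-5/text-to-image"

response = requests.post(
    URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/json",
    },
    json={
        "text_prompts": [{"text": "a watercolor fox in a snowy forest"}],
        "cfg_scale": 7,
        "steps": 30,
        "samples": 1,
    },
    timeout=120,
)
response.raise_for_status()

# The v1-style API returns generated images as base64-encoded artifacts.
for i, artifact in enumerate(response.json().get("artifacts", [])):
    with open(f"out_{i}.png", "wb") as f:
        f.write(base64.b64decode(artifact["base64"]))
```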