├── README.md
├── examples
│   ├── example0_0.jpg
│   ├── example0_1.jpg
│   └── example1_0.jpg
└── scripts
    └── depth-image-io.py

/README.md:
--------------------------------------------------------------------------------
1 | # Depth Image I/O for stable-diffusion-webui
2 | An extension for the stable-diffusion-webui repo that allows supplying custom depth inputs to Stable Diffusion depth2img models.
3 | 
4 | For instance, an artist may wish to use depth data rendered from a model made in a 3D rendering suite (Blender, Maya, etc.) and have Stable Diffusion "render" the scene, or one may wish to take the MiDaS-estimated depth (which the vanilla depth2img pipeline uses) of one image and combine it with the colour information from another.
5 | 
6 | Here are some examples taking depth data from an old 3D model of mine:
7 | 
8 | ![An image showing different txt2img and depth2img outputs](./examples/example1_0.jpg)
9 | 
10 | This code is somewhat fragile at present because the manner in which stable-diffusion-webui handles depth2img models is not currently exposed usefully; accordingly, for now this extension monkeypatches some behaviour (although in a manner which should be totally harmless when the extension's script is not active).
11 | 
12 | **Aside:** An 8-bit greyscale image might seem like a potentially insufficient data format for linear depth data, but in practice it seems to suffice here. I say this because of the following:
13 | 
14 | Firstly, the model was trained on the estimated depth fields from MiDaS, which tend to put most of their variation in values within a relatively small range that makes visual-but-not-physical sense, and/or to be somewhat compositionally aware of what's "background" and more-or-less ignore it. For instance:
15 | 
16 | ![man looking out over a city from a room](./examples/example0_0.jpg) ![MiDaS depth estimation of the former image](./examples/example0_1.jpg)
17 | 
18 | Note that MiDaS **A)** decides the skyscrapers are small blocks just outside the man's window, and **B)** seemingly hand-waves all the furniture at the back of the room as 'unimportant'. This is the sort of data the Stable Diffusion 2.0 depth2img model was trained to reproduce images from.
19 | 
20 | Accordingly, one will find that attempting to give realistic linear depth values for the skyscrapers produces bad results (de facto obliterating the model's sense of everything in the room, as it all gets compressed into a relatively tiny range), and one should expect Stable Diffusion to "fill in"/"hallucinate" objects on vague background-ish swathes of depth, like the back wall of the room here.
21 | 
22 | Additionally, for the model this is designed for, the depth is downsampled to 1/8th of the target dimensions (so 64x64 for 512x512) and normalized to a [-1.0, 1.0] range, and the model does not seem particularly sensitive to very fine differences in the depth values; the lack of precision from having only 256 possible depth values didn't have much of any effect in a handful of casual tests. (I had tried to support floating-point depth values in an OpenEXR format at first; it doesn't seem worth it.) A rough sketch of this preprocessing is included at the end of this README.
23 | 
24 | (A strange upshot of all this is that whether a depth image "looks" correct at a glance can matter much more than whether it "is" physically correct! I've even thrown the linearity of the depth out the window in a few of my tests (e.g. used logarithmic depth) and it never seemed to hurt much, so long as it looked "sort of like what MiDaS outputs" to my eyeballs.)
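25 | 
26 | For the curious, here is a rough sketch (plain PyTorch, in the spirit of what the extension does internally rather than its exact code path) of the preprocessing described above: the depth image is reduced to its first channel, downsampled to 1/8th of the target resolution, and rescaled to [-1.0, 1.0]. The filename and sizes are placeholders only.
27 | 
28 | ```python
29 | import numpy as np
30 | import torch
31 | from PIL import Image
32 | 
33 | depth = np.array(Image.open("my_depth.png"))  # hypothetical 8-bit greyscale depth image
34 | if depth.ndim == 3:
35 |     depth = depth[..., 0]  # keep only the first (red) channel
36 | depth = torch.from_numpy(depth).float()[None, None]  # shape (1, 1, H, W)
37 | 
38 | target_h, target_w = 512, 512  # intended output resolution
39 | cond = torch.nn.functional.interpolate(
40 |     depth, size=(target_h // 8, target_w // 8), mode="bicubic", align_corners=False
41 | )  # 64x64 for a 512x512 output
42 | (d_min, d_max) = torch.aminmax(cond)
43 | cond = 2.0 * (cond - d_min) / (d_max - d_min) - 1.0  # normalized to [-1.0, 1.0]
44 | ```
45 | 
46 | In other words, a small 64x64, renormalized map is all the model ever sees of your depth input, which is part of why an image that merely *looks* right tends to work.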
47 | 
--------------------------------------------------------------------------------
/examples/example0_0.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AnonymousCervine/depth-image-io-for-SDWebui/eb22863da52f3cd10d28087b43870f1419178d4c/examples/example0_0.jpg
--------------------------------------------------------------------------------
/examples/example0_1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AnonymousCervine/depth-image-io-for-SDWebui/eb22863da52f3cd10d28087b43870f1419178d4c/examples/example0_1.jpg
--------------------------------------------------------------------------------
/examples/example1_0.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AnonymousCervine/depth-image-io-for-SDWebui/eb22863da52f3cd10d28087b43870f1419178d4c/examples/example1_0.jpg
--------------------------------------------------------------------------------
/scripts/depth-image-io.py:
--------------------------------------------------------------------------------
1 | # Re: Copyright and licenses
2 | #
3 | # Firstly: This uses a few lines of boilerplate code from the Stable Diffusion 2.0 reference code at https://github.com/Stability-AI/stablediffusion
4 | # To the extent those few lines of boilerplate are copyrightable, the MIT license applies as per https://github.com/Stability-AI/stablediffusion/blob/main/LICENSE
5 | #
6 | # Secondly: This uses some lines of code from https://github.com/AUTOMATIC1111/stable-diffusion-webui, with which this code is designed to interoperate; almost all of the
7 | # code in question from this latter source is, in turn, lightly-modified boilerplate from the former.
8 | # stable-diffusion-webui, however, has no license. The best that can be said is that de facto it is generally understood that permission is granted to copy the code for (at least)
9 | # personal use by its (myriad and various) contributors.
10 | #
11 | # Because of that quibble I cannot in good faith assign a license to this code; it is not fully clear to me that I have the rights to do so in the first place (even if
12 | # I BELIEVE in good faith that the few lines reproduced here are very likely not sufficient in nature, nor extent, to constitute any copyright violation via their inclusion).
13 | # Notwithstanding that, for all such portions of the code as I do own, I freely permit their use as-is, with no representations made as to their performance or dangers.
14 | # ...That much should, I think, be enough for anyone who is willing, in the first place, to use the nebulously-copyrighted codebase that this code is designed to operate with.
15 | 
16 | import os
17 | import traceback
18 | from types import MethodType
19 | 
20 | import gradio as gr
21 | import numpy as np
22 | from PIL import Image
23 | import torch
24 | from einops import repeat, rearrange
25 | from ldm.data.util import AddMiDaS
26 | 
27 | import modules.processing as processing
28 | import modules.scripts as scripts
29 | import modules.shared as shared
30 | 
31 | instructions = """\
32 | ### Depth Image I/O
33 | ###### (Only applicable to Depth2Image models)
34 | 
35 | This is a script for the combined purposes of: **A)** Inserting custom depth images *into* a Depth2Img model and **B)** Getting the depth images *generated* by MiDaS back out (when a custom depth image is *not* specified). The depth2img model can infer quite a lot from just depth!
36 | 
37 | **General Notes and Observations:**
38 | - The depth image should be greyscale. (Accordingly, anything beyond the first channel (i.e. red) of an RGB image will be ignored; if you put a color image in, it will gamely try to interpret the red channel as depth, with probably-undesired results!)
39 | - White is nearest to the camera, black is farthest.
40 | - If you don't use the whole black-to-white range, the image will be normalized to the full range automatically (which is what the model expects and was trained on, although in reality it isn't too picky).
41 | - The distance values should theoretically be linear (although again, in practice, it turns out it's not terribly picky about this).
42 |   - But pick your range wisely, and 'faraway' distances should all perhaps just be sort of uniformly blackish, depending (see next bullet).
43 | - **FOR BEST RESULTS with handmade input, base your inputs on what the auto-generated depth images look like.**
44 |   - All the important features of your composition should take up most-if-not-all of the available depth space.
45 |   - To permit that, anything in the "distant (or even not-so-distant) background" can just be a uniform-ish black/dark-grey.
46 |   - Such "far from the camera" distances seem indeed to be treated by this model as "draw whatever you like past here".
47 |   - Alternatively, if one wants to keep faraway details, one may instead make them smaller and closer than they physically would be (in order to fit nicely into the limited depth space).
48 | - **NOTE**: You can leave `Append (normalized) depth image to outputs` checked when using your own images, to confirm that your depth values look the way you expect after being normalized.
49 | - Note that the image will be downscaled to 1/8th of the target image size (so 64x64 for 512x512 output) internally, so fine details may be lost.
50 |   - That said, the model can extrapolate a surprising amount of detail from a downscaled 64x64 image!
51 | 
52 | *Finally, be aware that this code is slightly fragile and may break in a future update! Have fun, and good luck!*
53 | """
54 | 
55 | batching_notes = """\
56 | **Batching Usage Notes (Experimental!):**
57 | - Batch inputs will (hopefully) override whatever fields they replace, fully and equivalently.
58 | - ***EXCEPT (IMPORTANT):*** When using color batch input, the img2img tab still needs a dummy image in the regular spot in order to function (otherwise it will fail before it hands things off to this extension).
59 |   - Color image inputs will also not work on the txt2img tab, as one might expect.
60 | - If both color and depth batches are provided, batching will make pairs of color and depth images under the assumption that they are matched pairs provided in alphabetical order (by filename), unless `Batch each depth image against every single color image` is checked.
61 |   - (But see important notes below on caveats to "alphabetical" ordering.)
62 | - There are some complicating limitations imposed by how Gradio transfers files:
63 |   - You have to drop (or select) all the files in the batch in one go; Gradio's file-upload interface is not sophisticated.
64 |   - Images will be processed in alphabetical order... **approximately**. (Important if you want your depth and color images to line up correctly.)
65 |     - NOT alphanumeric, so `img10_depth.png` will come before `img2_depth.png`.
66 |   - Gradio appears to mess with the file names before handing them off, via its temporary-file creation mechanisms, placing a random string before the first dot **on Windows** (I'm not sure about other operating systems' implementations; please report any unexpected behaviour!)
67 |     - So `img01.png` internally becomes something like `img01a6doxg5s.png`
68 |     - This means that if you provide images where the *length* of the name is counted on for sorting—for instance `a.png` and `ab.png`—they may be processed in a random order!
69 |     - If in doubt and having trouble: keep file names an identical length, and keep all distinguishing information before the first period in the filename. (For example `img0001_color.png`, `img0002_color.png`, ...)
70 |   - Temporary copies will be made of the input files (because Gradio doesn't know whether they're coming from a local or a remote machine). This normally won't matter much at all, but do note that:
71 |     - While I try my best to clean them up, if the program crashes or throws an error at an inopportune time some copies may be left in whatever location your OS provisions for temporary files (e.g. `%TEMP%` on Windows).
72 |     - For *excessively* large batches, note the disk-space usage this implies (since a full temporary copy of all the input files will be made).
73 | """
74 | 
75 | # Used to remove any temp files that were downloaded for batches, lest we clutter the computer.
76 | def cleanup_temp_files(tempfile_list):
77 |     if tempfile_list:
78 |         for tmp_file in tempfile_list:
79 |             filename = tmp_file.name
80 |             tmp_file.close()
81 |             os.remove(filename)
82 | 
83 | def concat_processed_outputs(p1, p2):
84 |     if not p1:
85 |         return p2
86 |     if not p2:
87 |         return p1
88 |     p1.images += p2.images
89 |     return p1
90 | 
91 | def sanitize_pil_image_mode(img):
92 |     # Set of PIL image modes for which, in a greyscale image, the first channel would NOT be a good representation of brightness:
93 |     invalid_modes = {'P', 'CMYK', 'HSV'}
94 |     if img.mode in invalid_modes:
95 |         img = img.convert(mode='RGB')
96 |     return img
97 | 
98 | class Script(scripts.Script):
99 |     def title(self):
100 |         return "Custom Depth Images (input/output)"
101 | 
102 |     def ui(self, is_img2img):
103 |         show_depth = gr.Checkbox(True, label="Append (normalized) depth image to outputs\n(yours if supplied; the auto-generated one otherwise)")
104 |         with gr.Accordion("Notes and Hints (click to expand)", open=False):
105 |             gr.Markdown(instructions)
106 |         gr.Markdown("---\n\nPut depth image here ⤵")
107 |         input_depth_img = gr.Image(source='upload', type="pil", image_mode=None) # If we don't set image_mode to None, Gradio will auto-convert to RGB and potentially destroy data.
108 |         with gr.Accordion("Batch Processing (Experimental)", open=False):
109 |             batch_many_to_many = gr.Checkbox(False, label="Batch each depth image against every single color image. (Warning: Use cautiously with large batches!)")
110 |             batch_img_input = gr.File(file_types=['image'], file_count='multiple', label="Input Color Images")
111 |             batch_depth_input = gr.File(file_types=['image'], file_count='multiple', label="Input Depth Images")
112 |             gr.Markdown(batching_notes)
113 |         return [input_depth_img, show_depth, batch_img_input, batch_depth_input, batch_many_to_many]
114 | 
115 |     def run_inner(self, p, input_depth_img, show_depth):
116 |         is_img2img = isinstance(p, processing.StableDiffusionProcessingImg2Img)
117 |         use_custom_depth_input = bool(input_depth_img)
118 |         if not is_img2img and not use_custom_depth_input:
119 |             raise RuntimeError("If using the Custom Depth Images I/O script with txt2img, you MUST PROVIDE A DEPTH IMAGE (because there's no other image to infer depth from!)")
120 |         p.out_depth_image = None
121 | 
122 |         # Monkeypatch depth2img_image_conditioning on this Processing instance to let us feed in our own depth and/or capture the auto-generated one.
123 |         old_depth2img_image_conditioning = p.depth2img_image_conditioning  # Save a reference to the original method (currently unused).
124 |         def alt_depth_image_conditioning(self, source_image):
125 |             conditioning_image = self.sd_model.get_first_stage_encoding(self.sd_model.encode_first_stage(source_image))
126 |             if use_custom_depth_input:
127 |                 depth_data = np.array(sanitize_pil_image_mode(input_depth_img))
128 |                 if len(np.shape(depth_data)) == 2: # That is, if it's a single-channel image with only width and height.
129 |                     depth_data = rearrange(depth_data, "h w -> 1 1 h w")
130 |                 else:
131 |                     depth_data = rearrange(depth_data, "h w c -> c 1 1 h w")[0] # Rearrange and discard anything past the first channel.
132 |                 depth_data = torch.from_numpy(depth_data).to(device=shared.device).to(dtype=torch.float32) # Whatever the color range was (e.g. 0 to 255 for 8-bit) doesn't matter; we're going to normalize it anyway.
133 |                 depth_data = repeat(depth_data, "1 ... -> n ...", n=self.batch_size)
134 |             else:
135 |                 transformer = AddMiDaS(model_type="dpt_hybrid")
136 |                 transformed = transformer({"jpg": rearrange(source_image[0], "c h w -> h w c")})
137 |                 midas_in = torch.from_numpy(transformed["midas_in"][None, ...]).to(device=shared.device)
138 |                 midas_in = repeat(midas_in, "1 ... -> n ...", n=self.batch_size)
139 |                 depth_data = self.sd_model.depth_model(midas_in)
140 | 
141 |             if show_depth:
142 |                 (depth_min, depth_max) = torch.aminmax(depth_data)
143 |                 display_depth = (depth_data - depth_min) / (depth_max - depth_min)
144 |                 depth_image = Image.fromarray(
145 |                     (display_depth[0, 0, ...].cpu().numpy() * 255.).astype(np.uint8))
146 |                 self.out_depth_image = depth_image
147 | 
148 |             conditioning = torch.nn.functional.interpolate(
149 |                 depth_data,
150 |                 size=conditioning_image.shape[2:],
151 |                 mode="bicubic",
152 |                 align_corners=False,
153 |             )
154 |             (depth_min, depth_max) = torch.aminmax(conditioning) # Recalculating may be unnecessary, but bicubic interpolation could theoretically let these expand infinitesimally.
155 |             conditioning = 2. * (conditioning - depth_min) / (depth_max - depth_min) - 1.
156 |             return conditioning
157 |         p.depth2img_image_conditioning = MethodType(alt_depth_image_conditioning, p)
158 | 
159 |         # Also monkeypatch txt2img_image_conditioning to handle the txt2img side.
160 |         # This approach is just barely possible solely because the txt2img_image_conditioning method already exists to fill in a blank mask on dedicated inpainting models.
161 |         def alt_txt2img_image_conditioning(self, x, width=None, height=None):
162 |             fake_img = torch.zeros(1, 3, height or self.height, width or self.width).to(shared.device).type(self.sd_model.dtype) # Single fake input RGB 'image', used just to conform to existing code paths.
163 |             # Ultimately all we'll get from processing it is learning that the target depth
164 |             # image size is (width/8, height/8).
165 |             return self.depth2img_image_conditioning(fake_img) # Point at our new depth2img_image_conditioning function that we've just overridden.
166 |         p.txt2img_image_conditioning = MethodType(alt_txt2img_image_conditioning, p)
167 | 
168 |         processed_output = processing.process_images(p)
169 |         if show_depth and p.out_depth_image:
170 |             processed_output.images.append(p.out_depth_image)
171 |         return processed_output
172 | 
173 |     def run(self, p, input_depth_img, show_depth, batch_img_input, batch_depth_input, batch_many_to_many):
174 |         if not (batch_img_input or batch_depth_input):
175 |             return self.run_inner(p, input_depth_img, show_depth)
176 | 
177 |         try:
178 |             if batch_img_input and batch_depth_input and not batch_many_to_many:
179 |                 depth_batch_size = len(batch_depth_input)
180 |                 color_batch_size = len(batch_img_input)
181 |                 larger_batch_size = max(color_batch_size, depth_batch_size)
182 |                 shorter_batch_size = min(color_batch_size, depth_batch_size)
183 |                 if larger_batch_size != shorter_batch_size:
184 |                     raise RuntimeError(f"Depth Image batch size ({depth_batch_size}) and Color Image batch size ({color_batch_size}) do not match; unclear how to proceed.")
185 | 
186 | 
187 |             batch_color = []
188 |             batch_depth = []
189 | 
190 |             if batch_img_input:
191 |                 batch_img_input.sort(key=lambda file: file.name)
192 |                 for imgfile in batch_img_input:
193 |                     batch_color.append(Image.open(imgfile))
194 |             else:
195 |                 batch_many_to_many = True # Well, one-to-many in this case, really.
196 |                 batch_color = [None] # Note: For color batching, we will treat None as "no override"; it will not prevent a traditional img2img input.
197 | 
198 |             if batch_depth_input:
199 |                 batch_depth_input.sort(key=lambda file: file.name)
200 |                 for imgfile in batch_depth_input:
201 |                     batch_depth.append(Image.open(imgfile))
202 |             else:
203 |                 batch_many_to_many = True # One-to-many
204 |                 batch_depth = [input_depth_img]
205 | 
206 |             processed_output = None
207 | 
208 |             if batch_many_to_many:
209 |                 num_outputs = len(batch_color) * len(batch_depth)
210 |             else:
211 |                 num_outputs = max(len(batch_color), len(batch_depth))
212 |             shared.state.job_count = num_outputs * p.n_iter
213 | 
214 |             def batch_item(depth_img, color_img):
215 |                 if color_img:
216 |                     # Override the img2img input image(s) with this batch color image.
217 |                     p.init_images = [color_img] * max(len(p.init_images), 1)
218 |                 return self.run_inner(p, depth_img, show_depth)
219 | 
220 |             for i, depth_image in enumerate(batch_depth):
221 |                 if batch_many_to_many:
222 |                     for color_image in batch_color:
223 |                         processed_output = concat_processed_outputs(processed_output, batch_item(depth_image, color_image))
224 |                 else:
225 |                     processed_output = concat_processed_outputs(processed_output, batch_item(depth_image, batch_color[i]))
226 | 
227 |         finally:
228 |             cleanup_temp_files(batch_img_input)
229 |             cleanup_temp_files(batch_depth_input)
230 | 
231 |         return processed_output
232 | 
--------------------------------------------------------------------------------