├── README.md
├── LICENSE
└── StableDiffusion.livemd

/README.md:
--------------------------------------------------------------------------------
# BumbleBooth
An implementation of Dreambooth in Elixir using Stable Diffusion and Bumblebee.

## Notebooks
[StableDiffusion.livemd](StableDiffusion.livemd) - Contains a breakdown of basic Stable Diffusion inference and Image-to-Image using SD

## Roadmap
* [x] SD inference running using the Bumblebee example
* [x] SD inference without `Nx.Serving`
* [x] Image2Image
* [ ] SD + Textual Inversion
* [ ] SD + Dreambooth

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Rohan Relan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/StableDiffusion.livemd:
--------------------------------------------------------------------------------
# Stable Diffusion using Bumblebee

```elixir
# This should be set appropriately for the system,
# e.g. cuda118 for a machine with a GPU and CUDA 11.8+
IO.puts(System.get_env("XLA_TARGET"))

Mix.install(
  [
    {:bumblebee, github: "elixir-nx/bumblebee", branch: "main", override: true},
    {:exla, ">= 0.0.0"},
    {:kino_bumblebee, "~> 0.1.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

Nx.global_default_backend(EXLA.Backend)
Nx.Defn.global_default_options(compiler: EXLA)
```

## Run SD inference with Nx.Serving

To start, we're going to try to get Stable Diffusion running using `Bumblebee` and `Nx.Serving`. This should (!)
be pretty straightforward; we'll follow the example notebook [here](https://github.com/elixir-nx/bumblebee/blob/main/notebooks/stable_diffusion.livemd).

```elixir
repository_id = "CompVis/stable-diffusion-v1-4"

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"})

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin"
  )

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin"
  )

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, repository_id, subdir: "feature_extractor"})
{:ok, safety_checker} = Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"})

:ok
```

```elixir
%Bumblebee.Diffusion.DdimScheduler{}
```

```elixir
serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    num_images_per_prompt: 2,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 60],
    defn_options: [compiler: EXLA],
    seed: 0
  )

text_input =
  Kino.Input.text("Prompt", default: "numbat, forest, high quality, detailed, digital art")
```

```elixir
prompt = Kino.Input.read(text_input)

output = Nx.Serving.run(serving, prompt)

for result <- output.results do
  Kino.Image.new(result.image)
end
|> Kino.Layout.grid(columns: 2)
```

That worked! If you got a cuDNN error in the last step (as I did), check your `XLA_TARGET` and make sure you haven't picked one that requires a newer cuDNN than the one you have installed. If so, downgrade your `XLA_TARGET` or upgrade cuDNN.

## SD Inference broken down

Now that we know the basics are working (which is a huge step, since it means CUDA, XLA and Nx are all set up correctly), we're going to go a little deeper. Right now, a lot of the details of Stable Diffusion are hidden behind the `Bumblebee.Diffusion.StableDiffusion.text_to_image` function. We're going to break this function down into its parts in this notebook so we can start modifying the pieces in the next step.

For this breakdown, we're going to ignore some of the unnecessary/less interesting bits like the safety checker and doing multiple images per prompt.

This breakdown is based on the code underlying [Bumblebee.Diffusion.StableDiffusion.text_to_image](https://github.com/elixir-nx/bumblebee/blob/main/lib/bumblebee/diffusion/stable_diffusion.ex) and [this notebook](https://github.com/fastai/diffusion-nbs/blob/master/Stable%20Diffusion%20Deep%20Dive.ipynb) from [fast.ai](https://fast.ai).

Let's start by processing the prompt - in this case, processing means turning the natural language text into text embeddings.

The way SD works is that we create outputs for two prompts: the conditional prompt, which is the prompt we give it, and the unconditional prompt, which is simply the empty string. The final result is a sort of weighted sum of these two results.
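
To make "weighted sum" concrete: if $\epsilon_{\text{uncond}}$ is the noise the UNet predicts given the empty prompt and $\epsilon_{\text{cond}}$ is the noise it predicts given our prompt, the prediction actually used at each denoising step is

$$
\hat{\epsilon} = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})
$$

where $s$ is the guidance scale (7.5 below). This is classifier-free guidance, and it's exactly what the `apply_guidance` helper we define further down computes.
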
```elixir
prompt = Kino.Input.read(text_input)
seq_length = 60
batch_size = 2
num_steps = 20
guidance_scale = 7.5

tokenizer_options = [
  length: seq_length,
  return_token_type_ids: false,
  return_attention_mask: false
]

cond_tokens = Bumblebee.Text.ClipTokenizer.apply(tokenizer, prompt, tokenizer_options)
uncond_tokens = Bumblebee.Text.ClipTokenizer.apply(tokenizer, "", tokenizer_options)
# Since cond_tokens and uncond_tokens are maps, this concats the corresponding keys correctly
tokens = Bumblebee.Utils.Nx.composite_concatenate(uncond_tokens, cond_tokens)
%{hidden_state: text_embeddings} = Axon.predict(clip.model, clip.params, tokens)
# Shape = {2, 60, 768}
text_embeddings
```

We have the text embeddings for the conditional and unconditional prompt, but we need to replicate these to match our batch size. We're going to put our batch size in the *2nd* dimension so later we can easily split the output into the conditional and unconditional parts.

```elixir
text_embeddings =
  text_embeddings
  |> Nx.new_axis(1)
  |> Nx.tile([1, batch_size, 1, 1])
  |> Nx.reshape({:auto, seq_length, 768})
```

The final shape of `text_embeddings` is `{batch_size*2, 60, 768}`, with the first `batch_size` entries being the embeddings for the empty prompt (unconditional) and the second `batch_size` entries being the embeddings for our target prompt (conditional).

Next, we'll create our starting random latent vectors, one for each generation we're going to do. For this model, we can look at the spec to determine these are 64x64x4 tensors. So our final output is going to have shape `{batch_size, 64, 64, 4}`, one latent for each image we're going to generate.

```elixir
latents_shape = {batch_size, unet.spec.sample_size, unet.spec.sample_size, unet.spec.in_channels}
key = Nx.Random.key(0)
{latents, _new_key} = Nx.Random.normal(key, shape: latents_shape)
latents
```

Now that we have all the pieces, we can do a single step of our eventual loop. We'll initialize the scheduler and then predict the noise in our current (totally noisy) latents. We can even try to visualize our outputs - though don't expect much... yet.
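
(As a purely optional sanity check before the real initialization in the next cell, you can peek at what `Bumblebee.scheduler_init/3` hands back: the scheduler state plus the timestep schedule that the denoising loop will walk through.)

```elixir
# Optional peek, not needed for the rest of the notebook.
# The second element is the timestep schedule, one entry per denoising step (20 here).
{_peek_state, peek_timesteps} = Bumblebee.scheduler_init(scheduler, num_steps, latents_shape)
Nx.to_flat_list(peek_timesteps)
```
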
```elixir
defmodule SDHelper do
  import Nx.Defn

  # We need this in a module so we can use `Nx.Defn`, so the operators work on tensors
  defn apply_guidance(noise_pred_uncond, noise_pred_cond, guidance_scale) do
    noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
  end

  # TODO: remove this once https://github.com/elixir-nx/bumblebee/issues/123 lands in main
  defn scheduler_step(step_fn, scheduler_state, latents, noise_pred) do
    step_fn.(scheduler_state, latents, noise_pred)
  end
end

{scheduler_state, timesteps} = Bumblebee.scheduler_init(scheduler, num_steps, latents_shape)

unet_inputs = %{
  # One batch for uncond_tokens and one for cond_tokens
  "sample" => Nx.concatenate([latents, latents]),
  "timestep" => timesteps[0],
  "encoder_hidden_state" => text_embeddings
}

%{sample: noise_pred} = Axon.predict(unet.model, unet.params, unet_inputs)
noise_pred_uncond = noise_pred[0..(batch_size - 1)]
noise_pred_cond = noise_pred[batch_size..-1//1]
noise_pred = SDHelper.apply_guidance(noise_pred_uncond, noise_pred_cond, guidance_scale)
scheduler_step_fn = &Bumblebee.scheduler_step(scheduler, &1, &2, &3)

{_state, new_latents} =
  SDHelper.scheduler_step(scheduler_step_fn, scheduler_state, latents, noise_pred)
```

Now that we have our new latents after one step of the diffusion process, we can run them through the VAE to see what the image looks like. But we're not expecting much since it's just a single step.

```elixir
# Undo the 0.18215 latent scaling factor before decoding with the VAE
new_latents = Nx.multiply(new_latents, 1 / 0.18215)
%{sample: image} = Axon.predict(vae.model, vae.params, new_latents)
images = NxImage.from_continuous(image, -1, 1)

Kino.Layout.grid(
  [
    Kino.Image.new(images[0]),
    Kino.Image.new(images[1])
  ],
  boxed: true,
  columns: 2
)
```

It's a noisy mess, but that's what we expected. Let's put this into a loop and run all 20 steps so we can get a real image. We'll use a `while` loop in `defn` for a performance gain. To do that, we'll have to use the pre-built versions of our models to avoid issues with passing them into the `defn`.
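
A quick aside on what `Axon.build/2` gives us: it compiles the model into an `{init_fn, predict_fn}` pair, and `predict_fn` is an ordinary two-argument function of `(params, inputs)`, which is why we can pass it into a `defn` as an argument. A minimal sketch, reusing the already-loaded CLIP model and the `tokens` from earlier:

```elixir
# Axon.build/2 returns {init_fn, predict_fn}; here we only need the predict function
{_clip_init, clip_predict} = Axon.build(clip.model, compiler: EXLA)

# predict_fn.(params, inputs) produces the same output as Axon.predict/3
%{hidden_state: _embeddings} = clip_predict.(clip.params, tokens)
```
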
```elixir
{_, unet_predict} = Axon.build(unet.model, compiler: EXLA)
{_, vae_predict} = Axon.build(vae.model, compiler: EXLA)

defmodule SDLoop do
  import Nx.Defn

  defn run(
         guidance_scale,
         latents,
         timesteps,
         text_embeddings,
         unet_predict,
         unet_params,
         scheduler_step_fn,
         scheduler_state
       ) do
    {scheduler_state, latents, _, _, _} =
      while {scheduler_state, latents, unet_params, text_embeddings, guidance_scale},
            timestep <- timesteps do
        unet_inputs = %{
          "sample" => Nx.concatenate([latents, latents]),
          "timestep" => timestep,
          "encoder_hidden_state" => text_embeddings
        }

        %{sample: noise_pred} = unet_predict.(unet_params, unet_inputs)
        batch_size = div(Nx.axis_size(noise_pred, 0), 2)
        noise_pred_uncond = noise_pred[0..(batch_size - 1)]
        noise_pred_cond = noise_pred[batch_size..-1//1]
        noise_pred = SDHelper.apply_guidance(noise_pred_uncond, noise_pred_cond, guidance_scale)

        {scheduler_state, latents} =
          SDHelper.scheduler_step(scheduler_step_fn, scheduler_state, latents, noise_pred)

        {scheduler_state, latents, unet_params, text_embeddings, guidance_scale}
      end

    {scheduler_state, latents}
  end
end

{_final_scheduler_state, final_latents} =
  SDLoop.run(
    guidance_scale,
    latents,
    timesteps,
    text_embeddings,
    unet_predict,
    unet.params,
    scheduler_step_fn,
    scheduler_state
  )
```

Now `final_latents` represents the latents for our image after many steps of the diffusion denoising process. Let's run them through the VAE to see what they look like.

```elixir
# Undo the 0.18215 latent scaling factor before decoding with the VAE
final_latents = Nx.multiply(final_latents, 1 / 0.18215)
%{sample: image} = Axon.predict(vae.model, vae.params, final_latents)
images = NxImage.from_continuous(image, -1, 1)

Kino.Layout.grid(
  [
    Kino.Image.new(images[0]),
    Kino.Image.new(images[1])
  ],
  boxed: true,
  columns: 2
)
```

Success! It matches our original generation using `Bumblebee` and `Nx.Serving` because we used the same seed of 0.

The advantage of breaking the model down like this is that we now have the control we need to add features. Let's do that now.

## Using our controllable SD inference

Waiting with no feedback while an image generates requires too much patience. Let's make it so we can see intermediate results as they're generated: we'll run the compiled loop a chunk of timesteps at a time and decode and render the latents in between.
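
The live-update mechanism here is `Kino.Frame`: render a frame once, then keep replacing its contents. A tiny standalone sketch of the pattern (using placeholder markdown instead of images):

```elixir
# Render an empty frame, then overwrite its contents a few times
demo_frame = Kino.Frame.new() |> Kino.render()

for step <- 1..3 do
  Kino.Frame.render(demo_frame, Kino.Markdown.new("finished chunk #{step} of 3"))
  Process.sleep(300)
end

:ok
```
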
```elixir
frame = Kino.Frame.new() |> Kino.render()

defmodule SDRenderer do
  def render_latents(latents, vae) do
    latents = Nx.multiply(latents, 1 / 0.18215)
    %{sample: image} = Axon.predict(vae.model, vae.params, latents)
    images = NxImage.from_continuous(image, -1, 1)
    Enum.map(0..(Nx.axis_size(images, 0) - 1), &Kino.Image.new(images[&1]))
  end

  def render_latents(latents, vae, frame) do
    kino_images = render_latents(latents, vae)
    image_grid = Kino.Layout.grid(kino_images, boxed: true, columns: 2)
    Kino.Frame.render(frame, image_grid)
  end
end

chunked_timesteps =
  timesteps
  |> Nx.to_flat_list()
  |> Enum.chunk_every(4)
  |> Enum.map(&Nx.tensor/1)

{_, final_latents} =
  Enum.reduce(chunked_timesteps, {scheduler_state, latents}, fn timesteps,
                                                                {scheduler_state, latents} ->
    {scheduler_state, latents} =
      SDLoop.run(
        guidance_scale,
        latents,
        timesteps,
        text_embeddings,
        unet_predict,
        unet.params,
        scheduler_step_fn,
        scheduler_state
      )

    SDRenderer.render_latents(latents, vae, frame)
    {scheduler_state, latents}
  end)
```

## Image-to-Image

With controllable SD inference, we can try something new - image2image! First, we'll need both the VAE decoder and *encoder*. We already have the decoder from earlier, so let's load the encoder.

```elixir
vae_decoder = vae

{:ok, vae_encoder} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :encoder,
    params_filename: "diffusion_pytorch_model.bin"
  )

:ok
```

Next, we can use Kino to create a control for uploading an image, which we'll center crop and then preprocess into the right `Nx` format.

```elixir
image = Kino.Input.image("Source image", size: {512, 512}, fit: :crop)
```

We need to extract the binary from the uploaded image, convert it to `Nx` in HWC format, put it into the range -1 to 1 instead of 0 to 255, and add a batch dimension.

```elixir
%{data: content, format: _, height: height, width: width} = Kino.Input.read(image)

source_image =
  Nx.from_binary(content, :u8)
  |> Nx.reshape({height, width, 3})

image_tensor =
  source_image
  |> NxImage.to_continuous(-1, 1)
  |> Nx.new_axis(0)
```

The image tensor is ready to pass to our VAE. A VAE doesn't directly output the latents - instead it outputs the distribution from which we sample the latents, so we need a sampling function.

```elixir
sample = fn posterior ->
  z = Nx.random_normal(Nx.shape(posterior.mean))
  Nx.add(posterior.mean, Nx.multiply(posterior.std, z))
end

%{latent_dist: posterior} = Axon.predict(vae_encoder.model, vae_encoder.params, image_tensor)
# Scale the latents by the SD latent scaling factor (0.18215)
latent = Nx.multiply(sample.(posterior), 0.18215)
```

We can make sure that we did everything right by running the decoder on our latent. We should get back our original image.

```elixir
frame = Kino.Frame.new() |> Kino.render()
SDRenderer.render_latents(latent, vae_decoder, frame)
```

Looks right!

The way image2image works is, instead of starting with a random latent like we did earlier, we're going to start with a noisy version of the latent for our source image (which we just calculated). So first, we need to add the right amount of noise to the source latent, where the right amount is determined by the scheduler. Then we run the diffusion process and get our new image.

To add the "right amount of noise", we'll have to use the scheduler to compute it. The function below is a port from the Python [diffusers](https://github.com/huggingface/diffusers/blob/v0.10.2/src/diffusers/schedulers/scheduling_pndm.py#L401) library. We're going to use the `DdimScheduler` because it's simpler to understand.
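
In symbols, with $\bar{\alpha}_t$ the scheduler's cumulative alpha product at timestep $t$, noising a clean latent $x_0$ up to timestep $t$ is

$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

which is exactly what `add_noise` below computes (`scheduler_state.alpha_bars` holds the $\bar{\alpha}_t$ values).
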
```elixir
num_steps = 40

scheduler = %Bumblebee.Diffusion.DdimScheduler{
  beta_start: 0.00085,
  beta_end: 0.012,
  clip_denoised_sample: false,
  alpha_clip_strategy: :alpha_zero
}

latents_shape = {1, unet.spec.sample_size, unet.spec.sample_size, unet.spec.in_channels}
{scheduler_state, timesteps} = Bumblebee.scheduler_init(scheduler, num_steps, latents_shape)

defmodule Noiser do
  import Nx.Defn

  defn add_noise(scheduler_state, original_samples, noise, timesteps) do
    alpha_bars = scheduler_state.alpha_bars

    sqrt_alpha_bars =
      (alpha_bars[timesteps] ** 0.5)
      |> Nx.flatten()
      |> expand_dims(Nx.rank(original_samples))

    sqrt_one_minus_alpha_bars =
      ((1 - alpha_bars[timesteps]) ** 0.5)
      |> Nx.flatten()
      |> expand_dims(Nx.rank(original_samples))

    sqrt_alpha_bars * original_samples + sqrt_one_minus_alpha_bars * noise
  end

  # Adds dimensions at the end until the tensor rank matches `rank`
  defn expand_dims(tensor, rank) do
    if Nx.rank(tensor) < rank do
      expand_dims(Nx.new_axis(tensor, -1), rank)
    else
      tensor
    end
  end
end

sampling_step = 15
key = Nx.Random.key(0)
{noise, _new_key} = Nx.Random.normal(key, shape: latent)

noisy_latent =
  Noiser.add_noise(scheduler_state, latent, noise, Nx.new_axis(timesteps[sampling_step], 0))

# Note how we have to update the scheduler_state to reflect we've "done" some iterations
scheduler_state = %{scheduler_state | iteration: sampling_step}

SDRenderer.render_latents(noisy_latent, vae_decoder, Kino.render(Kino.Frame.new()))
```

A noisy image! We set the `sampling_step` to 15, so it's as if we had already run the diffusion process for 15 steps and the `noisy_latent` was the result. Now we can run the remaining steps, starting from step 15, to finish the diffusion process and get our new image.
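
In the Python diffusers pipeline this knob is usually exposed as a `strength` value between 0 and 1 rather than a raw step index. If you wanted the same interface here (note: `strength` is a hypothetical parameter for illustration, not something Bumblebee provides), the conversion would look roughly like this:

```elixir
# Hypothetical knob: strength 0.0 keeps the source image untouched,
# strength 1.0 ignores it entirely (pure text-to-image)
strength = 0.625
num_steps = 40

# Higher strength means starting earlier in the (noisier part of the) schedule
sampling_step = num_steps - round(num_steps * strength)
# => 15, matching the value hard-coded above
```
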
We'll also need new text embeddings to guide the process.

```elixir
im2im_text_input =
  Kino.Input.text("Prompt", default: "numbat, forest, high quality, detailed, digital art")
```

```elixir
defmodule SDEmbeddings do
  def get_embeddings(text_input, clip, tokenizer, seq_length, batch_size) do
    prompt = Kino.Input.read(text_input)

    tokenizer_options = [
      length: seq_length,
      return_token_type_ids: false,
      return_attention_mask: false
    ]

    cond_tokens = Bumblebee.Text.ClipTokenizer.apply(tokenizer, prompt, tokenizer_options)
    uncond_tokens = Bumblebee.Text.ClipTokenizer.apply(tokenizer, "", tokenizer_options)
    tokens = Bumblebee.Utils.Nx.composite_concatenate(uncond_tokens, cond_tokens)
    %{hidden_state: text_embeddings} = Axon.predict(clip.model, clip.params, tokens)

    text_embeddings
    |> Nx.new_axis(1)
    |> Nx.tile([1, batch_size, 1, 1])
    |> Nx.reshape({:auto, seq_length, 768})
  end
end

im2im_embeddings = SDEmbeddings.get_embeddings(im2im_text_input, clip, tokenizer, 60, 1)
```

```elixir
scheduler_step_fn = &Bumblebee.scheduler_step(scheduler, &1, &2, &3)
frame = Kino.Frame.new() |> Kino.render()
render = &Kino.Frame.render(frame, Kino.Layout.grid(&1, boxed: true, columns: 2))
source_image_kino = Kino.Image.new(source_image)
render.([source_image_kino | SDRenderer.render_latents(noisy_latent, vae_decoder)])

chunked_timesteps =
  timesteps[sampling_step..-1//1]
  |> Nx.to_flat_list()
  |> Enum.chunk_every(4)
  |> Enum.map(&Nx.tensor/1)

{_, final_latents} =
  Enum.reduce(chunked_timesteps, {scheduler_state, noisy_latent}, fn timesteps,
                                                                     {scheduler_state, latents} ->
    {scheduler_state, latents} =
      SDLoop.run(
        # guidance scale (note: 15 here instead of the 7.5 we used for text-to-image)
        15,
        latents,
        timesteps,
        im2im_embeddings,
        unet_predict,
        unet.params,
        scheduler_step_fn,
        scheduler_state
      )

    kino_images = SDRenderer.render_latents(latents, vae)

    [source_image_kino | kino_images]
    |> render.()

    {scheduler_state, latents}
  end)

:ok
```

--------------------------------------------------------------------------------