CS 180: Intro to Computer Vision and Computational Photography, Fall 2024
Project 5A: Fun with Diffusion Models
Ian Dong
Overview
In this first part of the project, I played around with the DeepFloyd IF diffusion model, implemented diffusion sampling loops, and then used them for inpainting and optical illusions. I learned a lot about diffusion models and how they can be used to create cool effects on images.
Section 0: Setup
Using the DeepFloyd IF Diffusion Model
For this part, I instantiated DeepFloyd's stage_1 and stage_2 models and passed in the following text prompts: an oil painting of a snowy mountain village, a man wearing a hat, and a rocket ship. I set the seed to 180 for all parts. Then, I varied the number of inference steps to generate the following images:
The quality of the images becomes much clearer after stage 2. All of the images fit their prompts fairly well. For each prompt, as the number of inference steps increases, the images become more detailed and refined. For example, the man with the hat looks more realistic, and the snow in the village scene gains much more detail and shading.
Section I: Implementing the Forward Process
Implementing the Forward Process
In this section, I implemented the forward process of the diffusion model to gradually add more noise to a clean image. The forward process is defined by: $$ q(x_{t} \mid x_{0}) = \mathcal{N}(x_{t};\, \sqrt{\bar{\alpha}_{t}}\,x_{0},\, (1 - \bar{\alpha}_{t})\mathbf{I})$$ which is equivalent to: $$x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon \quad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$ I sampled $\epsilon$ from a Gaussian distribution and scaled the image accordingly to add progressively more noise. A short sketch of this noising step and the results at several timesteps are shown below:
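As a rough illustration, here is a minimal sketch of the noising step in PyTorch, assuming `alphas_cumprod` is the scheduler's precomputed tensor of $\bar\alpha_t$ values (the variable names are mine, not the starter code's):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image and add Gaussian noise."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)                                   # eps ~ N(0, I)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps
    return x_t, eps
```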
Section II: Classical Denoising
Classical Denoising
Then, I applied Gaussian blur filtering to the noisy images above in an attempt to reduce the noise; a small sketch of the filtering call and the results are shown below:
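A minimal sketch of the blur step; the kernel size and sigma here are illustrative choices, not necessarily the values used for the figures:

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy, kernel_size=5, sigma=2.0):
    """Classical denoising: low-pass the noisy image with a Gaussian blur."""
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```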
Section III: One-Step Denoising
One-Step Denoising
For the next step, I implemented one-step denoising: I used stage_1.unet to estimate the Gaussian noise in each noisy image and then removed it to recover an estimate of the clean image.
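A minimal sketch of this step, assuming `unet(x_t, t, prompt_embeds)` returns the predicted noise (the real DeepFloyd UNet is called through its diffusers interface and also predicts variance channels that need to be split off):

```python
import torch

def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Estimate the clean image from a noisy one in a single UNet call."""
    with torch.no_grad():
        eps_hat = unet(x_t, t, prompt_embeds)        # assumed interface: predicted noise
    alpha_bar = alphas_cumprod[t]
    # Invert the forward-process equation to get an estimate of x_0.
    x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
    return x0_hat
```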
Here are the results:
Section IV: Iterative Denoising
Iterative Denoising
Finally, I implemented iterative denoising to further reduce the noise and get a better estimate. I used strided timesteps to skip steps and speed things up. The update I used was: $$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma$$ where $t'$ is the next (less noisy) strided timestep, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current estimate of the clean image, and $v_\sigma$ is added noise. Each step moves from a noisier image to a cleaner one, much like linearly interpolating between the current sample and the clean-image estimate. A short sketch of one update and the results are shown below:
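A hedged sketch of a single strided update using the formula above; `x0_hat` is the clean-image estimate from the current noise prediction, and the variance term $v_\sigma$ is omitted for brevity:

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod):
    """One strided DDPM update from timestep t to the earlier timestep t'."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp                 # effective alpha over the strided step
    beta_t = 1 - alpha_t
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
```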
Section V: Diffusion Model Sampling
Diffusion Model Sampling
In this section, I passed in random noise and used iterative_denoise to denoise pure noise. This lets me generate images from scratch by setting i_start = 0 with the prompt a high quality photo. A short sketch of the call and the generated results are shown below:
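A minimal sketch of the call, assuming `iterative_denoise` takes a starting image, a starting index, and the prompt embeddings (the exact signature in my notebook may differ, and `prompt_embeds` is assumed to be precomputed):

```python
import torch

x_T = torch.randn(1, 3, 64, 64)       # pure Gaussian noise at the stage-1 resolution
sample = iterative_denoise(x_T, i_start=0, prompt_embeds=prompt_embeds)
```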
Section VI: Classifier Free Guidance
Classifier Free Guidance
In the previous section, the generated images did not have great quality, so I used classifier-free guidance (CFG) to improve it. First, I computed both a noise estimate conditioned on a text prompt and an unconditional noise estimate. I then let the new noise estimate be: $$ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$$ where $\gamma$ controls the strength of CFG. A short sketch and the results with the prompt a high quality photo are shown below:
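A minimal sketch of the CFG noise estimate; `unet`, `cond_embeds`, and `uncond_embeds` (the empty-prompt embeddings) are assumed names, and $\gamma = 7$ is just an illustrative guidance scale:

```python
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: blend conditional and unconditional noise estimates."""
    eps_c = unet(x_t, t, cond_embeds)       # conditioned on the text prompt
    eps_u = unet(x_t, t, uncond_embeds)     # conditioned on the empty prompt
    return eps_u + gamma * (eps_c - eps_u)
```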
Section VII: Image to Image Translation
Image to Image Translation
In the previous sections, we saw that adding noise to an image and then denoising it lets us modify existing images; the more noise we add, the larger the potential edits. I took some original images, added noise to them with the forward process at a variety of starting indices, and then ran iterative denoising without any extra conditioning. A short sketch and the results are shown below:
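A minimal sketch of the edit, reusing the `forward` and `iterative_denoise` sketches from Sections I and IV (all names are assumptions):

```python
def edit_image(x_orig, i_start, strided_timesteps, alphas_cumprod, prompt_embeds):
    """SDEdit-style edit: noise the original to an intermediate step, then denoise it."""
    t = strided_timesteps[i_start]
    x_t, _ = forward(x_orig, t, alphas_cumprod)        # add noise up to timestep t
    return iterative_denoise(x_t, i_start=i_start, prompt_embeds=prompt_embeds)
```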
Results are shown for i_start = 1, 3, 5, 7, 10, 20 for each of the three test images.
Editing Hand-Drawn and Web Images
I tried the same approach on non-realistic images: one was downloaded from the internet, and the other two are hand-drawn. Here are the results:
Results are shown for i_start = 1, 3, 5, 7, 10, 20 for each of the three images.
Inpainting
I used the same approach to implement inpainting. Given an original image $x_\text{orig}$ and a binary mask $\mathbf{m}$, I can generate a new image that retains the original content where $\mathbf{m} = 0$ while generating new content where $\mathbf{m} = 1$. After each denoising step, I applied: $$x_{t} \leftarrow \mathbf{m}\,x_{t} + (1 - \mathbf{m})\,\text{forward}(x_{\text{orig}}, t)$$ This keeps everything outside the mask consistent with the (appropriately noised) original image while denoising and generating new content only inside the mask. A short sketch and the results are shown below:
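A minimal sketch of the masking step, applied after every denoising update; the names reuse the earlier sketches and are assumptions:

```python
def apply_inpaint_mask(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep denoised content where mask == 1; force the noised original elsewhere."""
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)   # original image, noised to level t
    return mask * x_t + (1 - mask) * x_orig_t
```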
In the first image, I made the Campanile look more like a lighthouse. Then I changed the Doe Library to look more like an oil painting by replacing the left side with a mountainous background. Finally, for Moffitt Library, I replaced the building with a cool-looking red shed and some nice green scenery.
Text-Conditional Image-to-image Translation
Afterwards, I followed the same procedure as above, but guided the projection with a text prompt. This means it is no longer a pure "projection to the natural image manifold" but also adds control through language. I changed the prompt from a high quality photo to a rocket ship for the Campanile image, an oil painting of a snowy mountain village for the Doe Library image, and a lithograph of waterfalls for the Moffitt Library image. Here are the results:
Results are shown for i_start = 1, 3, 5, 7, 10, 20 for each of the three images.
Section VIII: Visual Anagrams
Visual Anagrams
In this section, I created images that look like one prompt right-side up but like a completely different prompt when flipped upside down. To achieve this, I took two prompts and computed their respective noise estimates: $$ \begin{align*} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\ \epsilon &= (\epsilon_1 + \epsilon_2) / 2 \end{align*} $$ By flipping the image before the UNet call and flipping its output back, the reverse diffusion step also generates content in the opposite orientation. A short sketch and the results are shown below:
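A minimal sketch of the combined noise estimate; `unet` and the embedding names are assumptions, and CFG can be applied to each estimate before averaging:

```python
import torch

def anagram_noise_estimate(unet, x_t, t, embeds_1, embeds_2):
    """Average the noise for prompt 1 on the upright image and prompt 2 on the flipped image."""
    eps_1 = unet(x_t, t, embeds_1)
    x_flip = torch.flip(x_t, dims=[-2])                        # flip along the height axis
    eps_2 = torch.flip(unet(x_flip, t, embeds_2), dims=[-2])   # flip the estimate back
    return (eps_1 + eps_2) / 2
```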
Section IX: Hybrid Images
Hybrid Images
In this section, I implemented factorized diffusion to create hybrid images similar to Project 2. I took two prompts, passed one prompt's noise estimate through a low-pass filter and the other's through a high-pass filter, and used their sum as the final noise estimate. For the low-pass and high-pass filters, I simply used a Gaussian filter with a kernel size of 33 and a sigma of 2. These were the equations I used: $$ \begin{align*} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{UNet}(x_t, t, p_2) \\ \epsilon &= f_\text{low pass}(\epsilon_1) + f_\text{high pass}(\epsilon_2) \end{align*} $$ A short sketch and the results are shown below:
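A minimal sketch of the factorized noise estimate, using the Gaussian filter (kernel size 33, sigma 2) mentioned above as the low-pass; names are assumptions:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, embeds_low, embeds_high):
    """Low frequencies follow prompt 1, high frequencies follow prompt 2."""
    eps_1 = unet(x_t, t, embeds_low)
    eps_2 = unet(x_t, t, embeds_high)
    low = TF.gaussian_blur(eps_1, kernel_size=33, sigma=2)            # low-pass
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=33, sigma=2)   # high-pass
    return low + high
```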
Section X: Conclusion
Learnings
The coolest thing I learned from this project was how to compute noise estimates, build sampling loops, and then use them to edit images. I loved using them for inpainting and visual anagrams. It was also very cool to generate an image from pure random noise.
Project 5B: Diffusion Models from Scratch
Overview
The second part of this project focuses on building diffusion models from scratch for the MNIST dataset. I implemented a single-step denoising UNet and then added time and class conditioning to iteratively denoise images and get better results.
Section I: Training a Single-Step Denoising UNet
Implementing the UNet
In Part A, I used a pretrained diffusion model to implement sampling loops, inpainting, and hybrid images. In this part, I implemented a single-step denoising UNet to denoise images. I used the following architecture for the UNet:
I defined the standard tensor operations within the notebook and then composed them into the UNet architecture, using downsampling and upsampling blocks joined by skip connections.
Using the UNet to Train a Denoiser
In this section, I am solving the following denoising problem: given a noisy image $z$, train a denoiser $D_\theta$ that maps $z$ back to a clean image $x$. I used the following loss function: $$ L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2.$$ For each training batch, I generated $z$ with: $$ z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim \mathcal{N}(0, \mathbf{I}).$$ To train the model, I took clean images, added noise to them, and passed the noisy images into the UNet, which tries to return the denoised images. I then computed the MSE loss between the denoised images and the originals; minimizing this loss trains the model to denoise. The hyperparameters and other architecture choices were as follows (a minimal training-loop sketch follows the list):
- Batch Size: 256
- Hidden Dim: 128
- $\sigma:$ 0.5
- Learning Rate: 1e-4
- Optimizer: Adam
- Epochs: 5
- Loss Function: MSE Loss
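A minimal sketch of the training loop under these hyperparameters; `UNet` (whose constructor arguments here are illustrative), `device`, and `train_loader` (a standard MNIST DataLoader) are assumed to be defined elsewhere in the notebook:

```python
import torch
import torch.nn.functional as F

model = UNet(hidden_dim=128).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
sigma = 0.5

for epoch in range(5):
    for x, _ in train_loader:                      # class labels are unused here
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)        # z = x + sigma * eps
        loss = F.mse_loss(model(z), x)             # L = ||D_theta(z) - x||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```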
Here is a visualization of the noising process for $\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$:
I trained the model for 5 epochs over the entire MNIST training dataset. Here is the training loss curve over the whole process:
Here are sample results after the 1st and 5th epoch:
Out-of-Distribution Testing
Once the model has been trained, I tested the denoising UNet on noisy samples from the test dataset. I kept the same image but varied the noise added to it. Here are the results:
Section II: Training a Diffusion Model
Adding Time Conditioning to UNet
In this section, I added time conditioning to the previous UNet so that it can iteratively denoise an image. The small change to the problem is that the UNet now predicts the added noise $\epsilon$ instead of the clean image $x$. The loss function becomes: $$L = \mathbb{E}_{\epsilon, x_t} \|\epsilon_{\theta}(x_t, t) - \epsilon\|^2.$$ To iteratively denoise an image, I need to generate noisy images $x_t$ from $x_0$ using: $$ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon \quad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$ Intuitively, at $t = 0$ I should get back the clean image, while at $t = T$ it should be pure noise. I used the DDPM schedule below to build the list of $\bar{\alpha}$ values for training (a short sketch follows the list).
- $\beta$ is a list of length $T = 300$ equally spaced between 0.0001 and 0.02
- $\alpha_t = 1 - \beta_t$
- $\bar{\alpha}_t$ is a cumulative product of $\alpha_t$ for $t \in \{1, \cdots, T\}$
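A minimal sketch of that schedule:

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)              # beta_1 .. beta_T, equally spaced
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)      # alpha-bar_t = prod of alpha_1..alpha_t
```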
To inject the timestep into the network, I modified two intermediate tensors:
- unflatten = unflatten + t1
- upsample = upsample + t2
where t1, t2 are the results of passing the timestep through FCBlocks.
Finally, I conditioned a single UNet on the timestep $t$. I used the following architecture for the time-conditioned UNet:
As before, I defined the standard tensor operations within the notebook and composed them into the UNet, with downsampling and upsampling blocks joined by skip connections. I embedded the time conditioning by normalizing $t$ and adding the resulting FCBlock outputs to the unflatten and upsample blocks; a rough sketch is shown below.
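A hedged sketch of one way this injection could look; the actual FCBlock definition in the starter code may differ, and the shapes and names here are assumptions:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """A small MLP that maps a conditioning signal to a per-channel offset."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return self.net(x)

# Inside the UNet forward pass (sketch):
#   t_norm = t.float().view(-1, 1) / T                # normalize the timestep
#   t1 = self.t_fc1(t_norm)[:, :, None, None]         # broadcast over H, W
#   t2 = self.t_fc2(t_norm)[:, :, None, None]
#   unflatten = unflatten + t1
#   upsample  = upsample + t2
```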
Training the UNet
To train this model, I took clean images, uniformly sampled timesteps $t$, used the corresponding $\bar{\alpha}_t$ values to produce noisy images, and trained the UNet to predict the noise. I then computed the MSE loss between the predicted noise and the noise that was actually added; minimizing this loss trains the model to denoise images. The hyperparameters and other architecture choices were as follows:
- Batch Size: 128
- Hidden Dim: 64
- Learning Rate: 1e-3
- Optimizer: Adam
- Scheduler Gamma: $0.1^{(1 / \text{epochs})}$
- Epochs: 20
- Loss Function: MSE Loss
I trained the time-conditioned UNet with the standard DDPM training algorithm: sample a batch of clean images, pick random timesteps, noise the images accordingly, and regress the predicted noise onto the true noise. A rough sketch is shown below.
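A hedged sketch of that training algorithm, reusing the schedule above; `model`, `optimizer`, `device`, and `train_loader` are assumed names, and the exact UNet call signature is an assumption:

```python
import torch
import torch.nn.functional as F

for x0, _ in train_loader:
    x0 = x0.to(device)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=device)    # random timestep per image
    abar = alphas_cumprod.to(device)[t - 1].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps               # forward process
    loss = F.mse_loss(model(x_t, (t.float() / T).view(-1, 1)), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```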
Here is a training loss curve plot during the entire process:
Sampling from the UNet
I also sampled directly from the time-conditioned UNet by following the DDPM sampling algorithm, starting from pure noise and denoising step by step; a rough sketch is shown below.
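A hedged sketch of that sampling loop, following the per-step update from Part A's Section IV; the variance term is added only for t > 1, and all names are assumptions:

```python
import torch

@torch.no_grad()
def sample(model, n=16):
    x = torch.randn(n, 1, 28, 28, device=device)                   # start from pure noise
    for t in range(T, 0, -1):
        t_norm = torch.full((n, 1), t / T, device=device)
        eps = model(x, t_norm)                                     # predicted noise
        abar = alphas_cumprod.to(device)[t - 1]
        a, b = alphas.to(device)[t - 1], betas.to(device)[t - 1]
        abar_prev = alphas_cumprod.to(device)[t - 2] if t > 1 else torch.ones((), device=device)
        x0_hat = (x - (1 - abar).sqrt() * eps) / abar.sqrt()       # clean-image estimate
        x = (abar_prev.sqrt() * b / (1 - abar)) * x0_hat \
            + (a.sqrt() * (1 - abar_prev) / (1 - abar)) * x
        if t > 1:
            x = x + b.sqrt() * torch.randn_like(x)                 # add the variance term
    return x
```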
Here are some sampling results for the time conditioned UNet model:
Adding Class-Conditioning to UNet
To improve results and allow more control over image generation, I added class conditioning to the previous UNet for the digit classes 0-9. This involved adding two more FCBlocks and conditioning on a one-hot encoded vector $c$ for each data point instead of a single scalar. Since I don't want the model to rely only on the class conditioning, I dropped the one-hot encoded vector with a probability of 0.1 so the model still learns unconditional generation. I replaced the unflatten and upsample updates with the following:
- unflatten = c1 * unflatten + t1
- upsample = c2 * upsample + t2
where c1, c2 are the results of passing the one-hot class vector through FCBlocks and t1, t2 are the results of passing the timestep through FCBlocks.
Finally, I conditioned a single UNet on both the timestep $t$ and the class $c$.
As before, I defined the standard tensor operations in the notebook and composed them into the UNet with downsampling and upsampling blocks joined by skip connections. I embedded the time conditioning by normalizing $t$ and adding it to the unflatten and upsample blocks, and the class conditioning by passing the one-hot vector through FCBlocks and using the results to scale those same blocks; a rough sketch is shown below.
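A hedged sketch of the class-conditioning path with the 10% dropout; names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def make_class_vector(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels and zero out ~10% of rows for unconditional training."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(c.shape[0], device=c.device) >= p_uncond).float()
    return c * keep[:, None]                        # dropped rows become all-zero vectors

# Inside the UNet forward pass (sketch):
#   c1 = self.c_fc1(c)[:, :, None, None]
#   c2 = self.c_fc2(c)[:, :, None, None]
#   unflatten = c1 * unflatten + t1
#   upsample  = c2 * upsample + t2
```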
Compared to the previous time-conditioned UNet's training algorithm, the only major differences are the addition of the conditioning vector $c$ and periodically dropping it so the model also learns unconditional generation.
Here is a training loss curve plot during the entire process:
Sampling from the Class-Conditioned UNet
I also sampled directly from the class-conditioned UNet, using the same DDPM sampling loop as before but with classifier-free guidance; a rough sketch of the guided noise estimate is shown below.
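A minimal sketch of the guided noise estimate used inside the sampling loop ($\gamma = 5.0$); the unconditional estimate comes from passing an all-zero class vector, and all names are assumptions:

```python
import torch

def cfg_eps(model, x, t_norm, c, gamma=5.0):
    """Classifier-free guidance for the class-conditioned UNet."""
    eps_cond = model(x, t_norm, c)                      # conditioned on the one-hot class
    eps_uncond = model(x, t_norm, torch.zeros_like(c))  # unconditional (zero class vector)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```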
Here are some sampling results for the class conditioned UNet model with classifier-free guidance of $\gamma = 5.0$:
Section III: Conclusion
Learnings
The coolest thing I learned from this project was how to build my own diffusion model with each of the components and how to connect them properly. I learned the importance of having correct tensor shapes so that the model can properly train on the images.