CS 180: Intro to Computer Vision and Computational Photography, Fall 2024

Project 5A: Fun with Diffusion Models

Ian Dong



Overview

In this first part of the project, I played around with the DeepFloyd IF diffusion model, implemented diffusion sampling loops, and then used them to create inpainted images and optical illusions. I learned a lot about diffusion models and how they can be used to create interesting effects on images.



Section 0: Setup

Using the DeepFloyd IF Diffusion Model

For this part, I instantiated DeepFloyd's stage_1 and stage_2 and passed in the following text prompts: an oil painting of a snowy mountain village, a man wearing a hat, and a rocket ship. I set the seed to be 180 for all parts. Then, I varied the inference steps to generate the following images:

Stage 1 (20 Steps)
Stage 2 (20 Steps)
Stage 1 (50 Steps)
Stage 2 (50 Steps)
Stage 1 (20 Steps)
Stage 2 (20 Steps)
Stage 1 (50 Steps)
Stage 2 (50 Steps)
Stage 1 (20 Steps)
Stage 2 (20 Steps)
Stage 1 (50 Steps)
Stage 2 (50 Steps)

The quality of the images becomes much clearer after stage 2, and all of the images fit their given prompts well. For each prompt, increasing the number of inference steps makes the images noticeably more detailed; for example, the man with the hat looks more realistic, and the snow gains much more detail in its shading.



Section I: Implementing the Forward Process

Implementing the Forward Process

In this section, I implemented the forward process of the diffusion model to gradually add more noise to a clean image. The forward process is defined by: $$ q(x_{t} \mid x_{0}) = \mathcal{N}\!\left(x_{t};\ \sqrt{\bar\alpha_{t}}\,x_{0},\ (1 - \bar\alpha_{t})\mathbf{I}\right)$$ which is equivalent to: $$x_{t} = \sqrt{\bar\alpha_{t}}\,x_{0} + \sqrt{1 - \bar\alpha_{t}}\,\epsilon \quad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$ I sampled from a Gaussian distribution to add the appropriate amount of noise to the original image at each noise level. Here are the results:

Campanile
Noise Level = 250
Noise Level = 500
Noise Level = 750
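For reference, here is a minimal sketch of that forward step (helper and variable names are illustrative, assuming `alphas_cumprod` holds the precomputed $\bar\alpha_t$ values):

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0) by scaling the clean image and adding Gaussian noise."""
    alpha_bar = alphas_cumprod[t]              # \bar{alpha}_t for this timestep
    eps = torch.randn_like(x0)                 # epsilon ~ N(0, I)
    x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps
    return x_t, eps
```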


Section II: Classical Denoising

Classical Denoising

Then, I applied Gaussian blur filtering over the above images in an attempt to reduce the noise. Here are the results:

Noise Level = 250
Noise Level = 500
Noise Level = 750
Gaussian Denoise Level = 250
Gaussian Denoise Level = 500
Gaussian Denoise Level = 750
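For reference, a minimal sketch of this classical baseline (the kernel size and sigma here are illustrative, not the exact values I used):

```python
import torchvision.transforms.functional as TF

def classical_denoise(x_t, kernel_size=5, sigma=2.0):
    """Attempt to remove noise with a simple Gaussian blur (no learning involved)."""
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```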


Section III: One-Step Denoising

One-Step Denoising

For the next step, I implemented the one-step denoising process by using stage_1.unet to estimate the Gaussian noise in each image and then removing that estimate (appropriately rescaled) from the noisy images to recover an estimate of the original. Here are the results:

Original
Original
Original
Noise Level = 250
Noise Level = 500
Noise Level = 750
Estimate of Original
Estimate of Original
Estimate of Original
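The recovery step simply inverts the forward equation. A minimal sketch, assuming `eps_hat` is the noise predicted by stage_1.unet and `alphas_cumprod` holds the $\bar\alpha_t$ schedule:

```python
import torch

def one_step_denoise(x_t, eps_hat, t, alphas_cumprod):
    """Invert the forward equation: recover an estimate of x_0 from x_t and the predicted noise."""
    alpha_bar = alphas_cumprod[t]
    # x_0 ≈ (x_t - sqrt(1 - alpha_bar) * eps_hat) / sqrt(alpha_bar)
    return (x_t - torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha_bar)
```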


Section IV: Iterative Denoising

Iterative Denoising

Finally, I implemented iterative denoising to further reduce the noise and get a better estimate. I used strided timesteps to skip steps and speed things up. The update formula I used was: $$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma$$ where $x_0$ is the current estimate of the clean image. This moves from a noisier image at timestep $t$ to a cleaner image at timestep $t'$, much like linearly interpolating between the noisy image and the clean estimate. Here are the results:

Noise Level = 690
Noise Level = 540
Noise Level = 390
Noise Level = 240
Noise Level = 90
Original
Iteratively Denoised
One Step Denoised
Gaussian Denoised
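A minimal sketch of one strided update, assuming `x0_hat` is the current clean-image estimate computed from the noise prediction (the added-noise term $v_\sigma$ is noted but omitted):

```python
import torch

def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod):
    """Move from the noisier x_t at timestep t to a cleaner x_{t'} at the earlier timestep t'."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp        # per-step alpha between the two strided timesteps
    beta_t = 1 - alpha_t
    # Interpolate between the clean estimate and the current noisy image
    x_tp = (torch.sqrt(alpha_bar_tp) * beta_t / (1 - alpha_bar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t
    return x_tp  # the v_sigma noise term would be added here in the full loop
```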


Section V: Diffusion Model Sampling

Diffusion Model Sampling

In this section, I passed in random noise and used iterative_denoise to effectively denoise pure noise. This allows me to generate images from scratch by setting i_start = 0. Here are the results with the prompt = a high quality photo:

Image 1
Image 2
Image 3
Image 4
Image 5


Section VI: Classifier Free Guidance

Classifier Free Guidance

In the previous section, the generated images did not have great quality. I used classifier-free guidance (CFG) to greatly improve image quality. First, I computed both a noise estimate conditioned on a text prompt and an unconditional noise estimate. I then let the new noise estimate be: $$ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$$ where $\gamma$ controls the strength of CFG. Here are the results with the prompt = a high quality photo:

CFG Image 1
CFG Image 2
CFG Image 3
CFG Image 4
CFG Image 5
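A minimal sketch of the CFG combination, where `noise_fn` stands in for a wrapper around stage_1.unet and the guidance scale shown is illustrative:

```python
def cfg_noise_estimate(noise_fn, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Combine conditional and unconditional noise estimates with classifier-free guidance."""
    eps_cond = noise_fn(x_t, t, cond_embeds)      # noise estimate conditioned on the text prompt
    eps_uncond = noise_fn(x_t, t, uncond_embeds)  # noise estimate for the empty prompt ""
    # gamma > 1 pushes the result further toward the text condition
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```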


Section VII: Image to Image Translation

Image to Image Translation

In the previous sections, we saw that adding noise and then denoising lets us make modifications to existing images: the more noise we add, the more the result can deviate from the original. I took some original images, added noise to them with the forward process, and then ran the iterative denoising loop, without any special text conditioning, for a variety of starting indices. Here are the results:

i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original
i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original
i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original
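Conceptually, the whole procedure is just: noise the image to the level that corresponds to `i_start`, then hand it to the usual iterative denoiser. A rough sketch, reusing the hypothetical `forward_noise` helper from earlier and an abstract `denoise_fn`:

```python
def image_to_image(x_orig, i_start, timesteps, alphas_cumprod, denoise_fn):
    """Edit an image by noising it and projecting it back onto the natural image manifold."""
    t = timesteps[i_start]                          # larger i_start => less noise => smaller edit
    x_t, _ = forward_noise(x_orig, t, alphas_cumprod)
    return denoise_fn(x_t, i_start)                 # run the usual iterative denoising loop
```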

Editing Hand-Drawn and Web Images

I decided to try out the same approach as above but on nonrealistic images. One of them was downloaded from the internet, while the other two were hand-drawn. Here are the results:

i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original
i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original
i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original

Inpainting

I used the same approach above to implement inpainting of images. Given an original image $x_\text{orig}$ and a binary mask $m$, I can generate a new image that retains the original content where $m = 0$ while generating new content in the regions where $m = 1$. After each denoising step, I used the following expression to get the new image: $$x_{t} \leftarrow \mathbf{m}x_{t} + (1 - \mathbf{m})\text{forward}(x_{\text{orig}}, t)$$ This leaves everything outside the mask region untouched while only denoising, and therefore replacing, the content inside the mask. Here are the results:

Campanile
Mask
To Replace
Campanile Inpainted
Doe Library
Mask
To Replace
Doe Library Inpainted
Moffitt Library
Mask
To Replace
Moffitt Inpainted

In the first image, I made the Campanile look more like a lighthouse. Then I changed the Doe Library to look more like an oil painting by replacing the left side with a mountainous background. Finally, for Moffitt Library, I replaced the building with a cool-looking red shed and some nice green scenery.
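A minimal sketch of the masked update that runs after each denoising step (again reusing the hypothetical `forward_noise` helper; the mask `m` is 1 where new content should appear):

```python
def inpaint_step(x_t, x_orig, m, t, alphas_cumprod):
    """Force the unmasked region back to the (re-noised) original after each denoising step."""
    x_orig_t, _ = forward_noise(x_orig, t, alphas_cumprod)  # original image noised to level t
    return m * x_t + (1 - m) * x_orig_t
```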


Text-Conditional Image-to-image Translation

Afterwards, I followed the same procedure as above but guided the projection using a text prompt. This means it is no longer a pure "projection to the natural image manifold" but also adds control using language. I changed the prompt from a high quality photo to a rocket ship for the Campanile image, an oil painting of a snowy mountain village for the Doe Library image, and finally a lithograph of waterfalls for the Moffitt Library image. Here are the results:

i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original
i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original
i_start=1
i_start=3
i_start=5
i_start=7
i_start=10
i_start=20
Original


Section VIII: Visual Anagrams

Visual Anagrams

In this section, I created an image that looks like one prompt but, when flipped upside down, looks like a completely different prompt. To achieve this, I took in two different prompts and calculated their respective estimated noises. These were the equations I used: $$ \begin{align*} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\ \epsilon &= (\epsilon_1 + \epsilon_2) / 2 \end{align*} $$ By flipping the image, running the UNet on it, and flipping the resulting noise estimate back, the reverse diffusion step simultaneously pushes the upright image toward one prompt and the flipped image toward the other. Here are the results:

"An Oil Painting of an Old Man"
"An Oil Painting of People Around a Campfire
"An Oil Painting of a Snowy Mountain Village
"A Photo of the Amalfi Coast"
"An Photo of a Man"
"A Photo of a Dog"


Section IX: Hybrid Images

Hybrid Images

In this section, I implemented factorized diffusion to create hybrid images similar to project 2. I took in two different prompts, computed a noise estimate for each, passed one through a low-pass filter and the other through a high-pass filter, and used their sum as the final noise estimate. For the low-pass and high-pass filters, I simply used a Gaussian filter with a kernel size of 33 and a sigma of 2. These were the equations I used: $$ \begin{align*} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{UNet}(x_t, t, p_2) \\ \epsilon &= f_\text{low pass}(\epsilon_1) + f_\text{high pass}(\epsilon_2) \end{align*} $$ Here are the results:

"Hybrid of Skull and Waterfall"
"Hybrid of Snowy Mountain and Old Man
"Hybrid of Skull and Dog


Section X: Conclusion

Learnings

The coolest thing I learned from this project was how to compute noise estimates, build sampling loops, and then use them to edit images. I loved using them for the inpainting and visual anagrams. It was also very cool to generate an image from pure random noise.



Project 5B: Diffusion Models from Scratch

Overview

The second part of this project focuses on building diffusion models from scratch to use on the MNIST dataset. I implemented a single-step denoising UNet and added time and class conditioning to iteratively denoise an image and get better results.



Section I: Training a Single-Step Denoising UNet

Implementing the UNet

In part A, I experimented with a pretrained diffusion model to implement sampling loops, inpainting, and hybrid images. In this part, I implemented a single-step denoising UNet to denoise images. I used the following architecture for the UNet:

UNet Architecture
Standard Tensor Operations

I defined the standard tensor operations within the notebook and then used them in the UNet architecture to create the necessary layers, with downsampling and upsampling blocks connected by skip connections.


Using the UNet to Train a Denoiser

In this section, I am trying to solve the following denoising problem: given a noisy image $z$, train a denoiser $D_\theta$ that maps $z$ back to a clean image $x$. I used the following loss function to train the denoiser: $$ L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2.$$ For each training batch, I generate $z$ with the following process: $$ z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim N(0, I)$$ To train this model, I took the clean images, added noise to them, and then passed them into the UNet, which tries to return the denoised images. Afterwards, I calculated the MSE loss between the denoised images and the original clean images. By minimizing this loss, I was able to train the model to denoise images. The hyperparameters and other architecture that I used were as follows:

Here is the visualization of the noising process where $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$

Noising Processes for Varying $\sigma$
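A minimal sketch of one training step under this setup (the model, optimizer, and $\sigma$ value here are placeholders, not my exact hyperparameters):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x, sigma=0.5):
    """One optimization step: noise a clean batch, denoise it, and regress to the clean images."""
    z = x + sigma * torch.randn_like(x)   # z = x + sigma * eps
    x_hat = model(z)                      # D_theta(z)
    loss = F.mse_loss(x_hat, x)           # ||D_theta(z) - x||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```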

I trained the model for 5 epochs over the entire MNIST training dataset. Here is a training loss curve plot during the entire process:


Here are sample results after the 1st and 5th epoch:


Out-of-Distribution Testing

Once the model was trained, I tested the denoising UNet on noisy samples from the test dataset. I kept the same image but varied the amount of noise added to it. Here are the results:

Denoising Results for Varying $\sigma$


Section II: Training a Diffusion Model

Adding Time Conditioning to UNet

In this section, I added time conditioning to the previous UNet so that it can iteratively denoise an image. The small change to the problem is that I now want the UNet to predict the added noise $\epsilon$ instead of the clean image $x$. The loss function becomes: $$L = \mathbb{E}_{\epsilon,z} \|\epsilon_{\theta}(z, t) - \epsilon\|^2.$$ To iteratively denoise an image, I needed to generate noisy images $x_t$ from $x_0$ using the following equation: $$ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim N(0, I)$$ Intuitively, when $t = 0$ I should get back the clean image, while when $t = T$ it should be pure noise. I also used the DDPM variance schedule to build the list of $\bar{\alpha}$ values used during training.
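A minimal sketch of that schedule and the noisy-image construction (the $T = 300$ and $\beta$ range follow the standard DDPM recipe and should be treated as assumptions):

```python
import torch

# Standard DDPM variance schedule: betas increase linearly, alpha-bars accumulate multiplicatively
T = 300
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t

def noise_image(x0, t):
    """Generate x_t from x_0 for a batch of timesteps t, returning the sampled noise as well."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    eps = torch.randn_like(x0)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps, eps
```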

I replaced the unflatten and upsample blocks with the following, where t1 and t2 are the results of passing the timestep through FCBlocks. Finally, I conditioned a single UNet on the timestep $t$. I used the following architecture for the time-conditioned UNet:

Time Conditioned UNet Architecture
FCBlock

I defined the standard tensor operations within the notebook and then used them in the UNet architecture to create the necessary layers, with downsampling and upsampling blocks connected by skip connections. I embedded the time conditioning by normalizing $t$ and adding the resulting embeddings to the unflatten and upsample blocks.


Training the UNet

To train this model, I took the clean images, uniformly sampled timesteps $t$, used the corresponding $\bar{\alpha}_t$ values to generate the noisy images, and then trained the UNet on them. The model tries to predict the noise that was added to each image. Afterwards, I calculated the MSE loss between the predicted noise and the random noise I had actually added. By minimizing this loss, I was able to train the model to iteratively denoise images. The hyperparameters and other architecture that I used were as follows:

Time Conditioned UNet Architecture
FCBlock

I trained the time-conditioned UNet directly by following along with the algorithm below:
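Roughly, that training loop looks like the following sketch (the UNet call signature and the `noise_image` helper from the earlier sketch are assumptions, not the exact notebook code):

```python
import torch
import torch.nn.functional as F

def train_time_conditioned(unet, optimizer, loader, T=300):
    """One epoch: predict the noise added at a randomly sampled timestep for each batch."""
    for x0, _ in loader:
        t = torch.randint(0, T, (x0.shape[0],))   # uniformly sampled timesteps
        x_t, eps = noise_image(x0, t)             # forward process from the sketch above
        eps_hat = unet(x_t, t.float() / T)        # condition on the normalized timestep
        loss = F.mse_loss(eps_hat, eps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```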

Here is a training loss curve plot during the entire process:


Sampling from the UNet

I also sampled directly from the time-conditioned UNet by following along with the algorithm below:
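Roughly, that sampling loop looks like the following sketch, reusing the schedule tensors from the earlier sketch (the UNet call signature is an assumption):

```python
import math
import torch

@torch.no_grad()
def sample(unet, T=300, shape=(40, 1, 28, 28)):
    """Start from pure noise and iteratively denoise down to t = 0."""
    x = torch.randn(shape)
    for t in range(T - 1, -1, -1):
        eps_hat = unet(x, torch.full((shape[0],), t / T))   # assumed signature: (image, normalized t)
        a, b = alphas[t].item(), betas[t].item()
        a_bar = alphas_cumprod[t].item()
        a_bar_prev = alphas_cumprod[t - 1].item() if t > 0 else 1.0
        # Estimate the clean image, then take one step toward the previous timestep
        x0_hat = (x - math.sqrt(1 - a_bar) * eps_hat) / math.sqrt(a_bar)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (math.sqrt(a_bar_prev) * b / (1 - a_bar)) * x0_hat \
            + (math.sqrt(a) * (1 - a_bar_prev) / (1 - a_bar)) * x \
            + math.sqrt(b) * z
    return x
```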

Here are some sampling results for the time conditioned UNet model:


Adding Class-Conditioning to UNet

To improve results and allow for more control over the image generation, I added class conditioning on the digit classes 0-9 to the previous UNet. This involved adding 2 more FCBlocks and a one-hot encoded vector $c$ for each datapoint instead of a single scalar. Since I don't want the model to overfit on the classes, I made sure to drop the one-hot encoded vector with a probability of 0.1. I replaced the unflatten and upsample blocks with the following:

where c1, c2 are the results from passing the one-hot encoded vector through the FCBlocks and t1, t2 are the results from passing the timestep through the FCBlocks. Finally, I conditioned a single UNet on both the timestep $t$ and the class $c$.

I defined the standard tensor operations within the notebook and then used them in the UNet architecture to create the necessary layers, with downsampling and upsampling blocks connected by skip connections. I embedded the time conditioning by normalizing $t$ and adding it to the unflatten and upsample blocks, and the class conditioning by adding the one-hot encoded vector to the same blocks.

Compared to the previous time-conditioned UNet's training algorithm, the only major difference is the addition of the conditioning vector $c$ and periodically performing unconditional generation by dropping it.
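A minimal sketch of how the class vector can be built and dropped during training (names and the exact dropout mechanics are illustrative):

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels, then randomly zero whole vectors so the model also learns unconditionally."""
    c = F.one_hot(labels, num_classes).float()                                  # (B, 10)
    keep = (torch.rand(labels.shape[0], device=labels.device) > p_uncond).float()
    return c * keep.unsqueeze(1)                                                # zeroed rows act as "no class"
```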

Here is a training loss curve plot during the entire process:


Sampling from the Class-Conditioned UNet

I also sampled directly from the class-conditioned UNet by following along with the algorithm below:

Here are some sampling results for the class conditioned UNet model with classifier-free guidance of $\gamma = 5.0$:
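Inside the sampling loop, the only extra step compared to the time-conditioned version is the CFG combination of a conditional and an unconditional noise estimate. A minimal sketch (the three-argument UNet call is an assumption):

```python
import torch

def cfg_eps(unet, x, t_norm, c, gamma=5.0):
    """Classifier-free guidance: extrapolate from the unconditional toward the class-conditional estimate."""
    eps_cond = unet(x, t_norm, c)                      # conditioned on the one-hot class vector
    eps_uncond = unet(x, t_norm, torch.zeros_like(c))  # zero vector plays the role of "no class"
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```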



Section III: Conclusion

Learnings

The coolest thing I learned from this project was how to build my own diffusion model with each of the components and how to connect them properly. I learned the importance of having correct tensor shapes so that the model can properly train on the images.