Diffusion model


In machine learning, diffusion models, also known as diffusion-based generative models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of two major components: the forward diffusion process and the reverse sampling process. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.
There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. They are typically trained using variational inference. The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but it is typically a U-Net or a transformer.
Diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise. The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise and applying the network iteratively to denoise the image.
Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation.
Beyond computer vision, diffusion models have also found applications in natural language processing (such as text generation and summarization), sound generation, and reinforcement learning.

Denoising diffusion model

Non-equilibrium thermodynamics

Diffusion models were introduced in 2015 as a method to train a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion.
Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.
The equilibrium distribution is the Gaussian distribution $\mathcal{N}(0, I)$, with pdf $\rho(x) \propto e^{-\frac{1}{2}\|x\|^2}$. This is just the Maxwell–Boltzmann distribution of particles in a potential well $V(x) = \frac{1}{2}\|x\|^2$ at temperature 1. The initial distribution, being very much out of equilibrium, diffuses towards the equilibrium distribution, making biased random steps that are a sum of pure randomness and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.

Denoising Diffusion Probabilistic Model (DDPM)

A 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method via variational inference.

Forward diffusion

To present the model, some notation is required. Fix a sequence of noise levels $\beta_1, \dots, \beta_T \in (0, 1)$, and define $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \alpha_1 \cdots \alpha_t$. We write $\mathcal{N}(\mu, \Sigma)$ for the Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
A forward diffusion process starts at some starting point $x_0 \sim q$, where $q$ is the probability distribution to be learned, then repeatedly adds noise to it by
$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t$$
where $z_1, \dots, z_T$ are IID samples from $\mathcal{N}(0, I)$. The coefficients $\sqrt{1-\beta_t}$ and $\sqrt{\beta_t}$ ensure that $x_t$ has unit variance, assuming that $x_{t-1}$ does. The values of $\beta_t$ are chosen such that for any starting distribution of $x_0$, if it has finite second moment, then $x_t$ converges in distribution to $\mathcal{N}(0, I)$.
The entire diffusion process then satisfies
$$q(x_{0:T}) = q(x_0)\, q(x_1 \mid x_0) \cdots q(x_T \mid x_{T-1})$$
or
$$\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t} \left\| x_t - \sqrt{1-\beta_t}\, x_{t-1} \right\|^2 + C$$
where $C$ is a normalization constant and often omitted. In particular, we note that $x_{1:T} \mid x_0$ is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with the Gaussian process,
$$x_t \mid x_0 \sim \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t) I\right)$$
In particular, notice that for large $t$, the variable $x_t \mid x_0$ converges to $\mathcal{N}(0, I)$. That is, after a long enough diffusion process, we end up with some $x_T$ that is very close to $\mathcal{N}(0, I)$, with all traces of the original $x_0$ gone.
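To see where this closed form comes from, one can unroll the recursion: each step mixes in fresh, independent Gaussian noise, and a sum of independent Gaussians is again Gaussian, so
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t \beta_{t-1}}\, z_{t-1} + \sqrt{\beta_t}\, z_t = \cdots = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, z, \qquad z \sim \mathcal{N}(0, I),$$
since the accumulated noise terms are independent and their variances sum to $\beta_t + \alpha_t \beta_{t-1} + \cdots + \alpha_t \cdots \alpha_2 \beta_1 = 1 - \bar\alpha_t$.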
For example, since $x_t \mid x_0 \sim \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I\right)$, we can sample $x_t \mid x_0$ directly "in one step", instead of going through all the intermediate steps $x_1, x_2, \dots, x_{t-1}$.
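The following is a minimal NumPy sketch of such one-step sampling. The linear noise schedule and the names used here (`q_sample`, `alpha_bars`) are illustrative choices, not prescribed by the model.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # illustrative linear schedule beta_1 ... beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # alpha-bar_t = alpha_1 * ... * alpha_t

rng = np.random.default_rng(0)

def q_sample(x0, t):
    """Sample x_t ~ N(sqrt(alpha-bar_t) x_0, (1 - alpha-bar_t) I) in one step."""
    z = rng.standard_normal(x0.shape)    # z ~ N(0, I)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * z

x0 = np.ones((8, 8))                     # toy "image"
x_mid = q_sample(x0, t=250)              # partially noised
x_last = q_sample(x0, t=T - 1)           # essentially indistinguishable from pure noise
```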

Backward diffusion

The key idea of DDPM is to use a neural network parametrized by $\theta$. The network takes in two arguments $x_t, t$, and outputs a vector $\mu_\theta(x_t, t)$ and a matrix $\Sigma_\theta(x_t, t)$, such that each step in the forward diffusion process can be approximately undone by $x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$. This then gives us a backward diffusion process defined by
$$p_\theta(x_T) = \mathcal{N}(0, I), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)$$
The goal now is to learn the parameters $\theta$ such that $p_\theta(x_0)$ is as close to $q(x_0)$ as possible. To do that, we use maximum likelihood estimation with variational inference.

Variational inference

The ELBO inequality states that
$$\ln p_\theta(x_0) \ge E_{x_{1:T} \sim q(\cdot \mid x_0)}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$$
and taking one more expectation, we get
$$E_{x_0 \sim q}\!\left[\ln p_\theta(x_0)\right] \ge E_{x_{0:T} \sim q}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$$
We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.
Define the loss function
$$L(\theta) := -E_{x_{0:T} \sim q}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$$
and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to
$$L(\theta) = \sum_{t=1}^T E_{x_{t-1}, x_t \sim q}\!\left[-\ln p_\theta(x_{t-1} \mid x_t)\right] + E_{x_0 \sim q}\!\left[D_{KL}\!\left(q(x_T \mid x_0) \,\|\, p_\theta(x_T)\right)\right] + C$$
where $C$ does not depend on the parameter $\theta$, and thus can be ignored. Since $p_\theta(x_T) = \mathcal{N}(0, I)$ also does not depend on the parameter, the term $E_{x_0 \sim q}\!\left[D_{KL}\!\left(q(x_T \mid x_0) \,\|\, p_\theta(x_T)\right)\right]$ can also be ignored. This leaves just $L_t(\theta) = E_{x_{t-1}, x_t \sim q}\!\left[-\ln p_\theta(x_{t-1} \mid x_t)\right]$ for $t = 1, \dots, T$ to be minimized.
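The simplification follows from factoring both processes as Markov chains and dropping terms that do not involve $\theta$:
$$\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0) = \ln p_\theta(x_T) + \sum_{t=1}^T \ln p_\theta(x_{t-1} \mid x_t) - \sum_{t=1}^T \ln q(x_t \mid x_{t-1})$$
Taking the negative expectation over $x_{0:T} \sim q$, the $\ln q(x_t \mid x_{t-1})$ terms do not involve $\theta$ and are absorbed into $C$; adding and subtracting $E_{x_0 \sim q}[\ln q(x_T \mid x_0)]$ (also independent of $\theta$) turns the remaining $-\ln p_\theta(x_T)$ term into the KL term above.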

Noise prediction network

Since $x_{t-1} \mid x_t, x_0 \sim \mathcal{N}\!\left(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I\right)$, where
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}\, \beta_t\, x_0}{1 - \bar\alpha_t}, \qquad \tilde\sigma_t^2 = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t,$$
this suggests that we should use $\mu_\theta(x_t, t) \approx \tilde\mu_t(x_t, x_0)$; however, the network does not have access to $x_0$, and so it has to estimate it instead. Now, since $x_t \mid x_0 \sim \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I\right)$, we may write $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, z$, where $z$ is some unknown Gaussian noise. Now we see that estimating $x_0$ is equivalent to estimating $z$.
Therefore, let the network output a noise vector $\epsilon_\theta(x_t, t)$, and let it predict
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right)$$
It remains to design $\Sigma_\theta(x_t, t)$. The DDPM paper suggested not learning it, but fixing it at some value $\Sigma_\theta(x_t, t) = \sigma_t^2 I$, where either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\sigma_t^2$ yielded similar performance.
With this, the loss simplifies to
$$L_t(\theta) = E_{x_0 \sim q,\, z \sim \mathcal{N}(0, I)}\!\left[\frac{\beta_t^2}{2\sigma_t^2\, \alpha_t\, (1 - \bar\alpha_t)} \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right] + C$$
where $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, z$, which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function
$$L_{\text{simple}}(\theta) = E_{t,\, x_0,\, z}\!\left[\left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]$$
resulted in better models.
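As a concrete illustration, the following PyTorch sketch performs one stochastic-gradient step on $L_{\text{simple}}$. The tiny fully connected `eps_model` and the crude time conditioning are stand-in assumptions; in practice the backbone is a U-Net or transformer.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# Stand-in noise predictor on flattened 64-dimensional data.
eps_model = torch.nn.Sequential(
    torch.nn.Linear(64 + 1, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64)
)
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

def training_step(x0):                          # x0: (batch, 64) data samples
    t = torch.randint(0, T, (x0.shape[0],))     # a random timestep per example
    z = torch.randn_like(x0)                    # the noise the network must predict
    a = alpha_bars[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * z    # one-step forward sample
    t_in = t.float().unsqueeze(1) / T           # crude time conditioning
    pred = eps_model(torch.cat([x_t, t_in], dim=1))
    loss = ((pred - z) ** 2).mean()             # the simplified DDPM loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = training_step(torch.randn(32, 64))       # one step on a toy batch
```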

Backward diffusion process

After a noise prediction network $\epsilon_\theta$ is trained, it can be used for generating data points in the original distribution by starting from pure noise $x_T \sim \mathcal{N}(0, I)$ and running the following loop for $t = T, \dots, 1$ (a code sketch follows the list):
  1. Compute the noise estimate $\epsilon \leftarrow \epsilon_\theta(x_t, t)$
  2. Compute the original data estimate $\tilde x_0 \leftarrow \left(x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon\right) / \sqrt{\bar\alpha_t}$
  3. Sample the previous data $x_{t-1} \sim \mathcal{N}\!\left(\tilde\mu_t(x_t, \tilde x_0), \sigma_t^2 I\right)$
  4. Change time $t \leftarrow t - 1$
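A PyTorch sketch of this loop is below, reusing the `eps_model`, `betas`, and `alpha_bars` from the training sketch above. The variance choice $\sigma_t^2 = \beta_t$ is one of the two options mentioned earlier, and the mean is written directly in terms of the noise estimate, which is equivalent to plugging $\tilde x_0$ into $\tilde\mu_t$.

```python
import torch

@torch.no_grad()
def sample(eps_model, n=16, dim=64):
    alphas = 1.0 - betas
    x = torch.randn(n, dim)                            # start from x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):                     # step 4: decrement time each pass
        t_in = torch.full((n, 1), t / T)
        eps = eps_model(torch.cat([x, t_in], dim=1))   # step 1: noise estimate
        x0_hat = (x - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()  # step 2 (for reference)
        # step 3: sample x_{t-1}; mean written in terms of eps, with sigma_t^2 = beta_t
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x                                           # approximate samples from q(x_0)

samples = sample(eps_model)
```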

Score-based generative model

Score-based generative models are another formulation of diffusion modelling. They are also called noise conditioned score networks, or score matching with Langevin dynamics.

Score matching

The idea of score functions

Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general.
Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors. For example, how much more likely is an image of a cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?
Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in its gradient $\nabla_x \ln q(x)$. This has two major effects:
  • One, we no longer need to normalize $q(x)$, but can use any $\tilde q(x) = C q(x)$, where $C = \int \tilde q(x)\, dx > 0$ is any unknown constant that is of no concern to us.
  • Two, we are comparing $q(x)$ for neighbors $x, x + dx$, by $\frac{q(x)}{q(x + dx)} = e^{-\langle \nabla_x \ln q(x),\, dx \rangle}$.
Let the score function be $s(x) := \nabla_x \ln q(x)$; then consider what we can do with $s(x)$.
As it turns out, $s(x)$ allows us to sample from $q(x)$ using thermodynamics. Specifically, if we have a potential energy function $U(x) = -\ln q(x)$, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the Boltzmann distribution $q_U(x) \propto e^{-U(x)/k_B T}$. At temperature $k_B T = 1$, the Boltzmann distribution is exactly $q(x)$.
Therefore, to model $q(x)$, we may start with a particle $x_0$ sampled from any convenient distribution, then simulate the motion of the particle forwards according to the Langevin equation
$$dx_t = -\nabla_{x_t} U(x_t)\, dt + \sqrt{2}\, dW_t = \nabla_{x_t} \ln q(x_t)\, dt + \sqrt{2}\, dW_t$$
and the Boltzmann distribution is, by the Fokker–Planck equation, the unique thermodynamic equilibrium. So no matter what distribution $x_0$ has, the distribution of $x_t$ converges in distribution to $q$ as $t \to \infty$.
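As a small illustration, the NumPy sketch below simulates the Langevin equation with the Euler–Maruyama scheme for a distribution whose score is known in closed form (a standard Gaussian, purely as a placeholder for a learned score network).

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # Score of a standard Gaussian: grad log q(x) = -x. A learned network in practice.
    return -x

def langevin_sample(x0, step=1e-2, n_steps=10_000):
    """Euler-Maruyama discretization of dx = score(x) dt + sqrt(2) dW."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Start far from equilibrium; after many steps the particles are distributed as q.
samples = langevin_sample(np.full((1000, 2), 10.0))
```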

Learning the score function

Given a density $q$, we wish to learn a score function approximation $f_\theta \approx \nabla \ln q$. This is score matching. Typically, score matching is formalized as minimizing the Fisher divergence
$$L(\theta) = E_{x \sim q}\!\left[\left\| f_\theta(x) - \nabla \ln q(x) \right\|^2\right]$$
By expanding the integral and performing an integration by parts, this becomes
$$L(\theta) = E_{x \sim q}\!\left[\left\| f_\theta(x) \right\|^2 + 2\, \nabla \cdot f_\theta(x)\right] + C$$
giving us a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent, since it no longer involves the unknown $\nabla \ln q$.
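A PyTorch sketch of this objective is below, with the divergence $\nabla \cdot f_\theta$ computed exactly via automatic differentiation; this exact computation is only practical in low dimension, and the tiny `score_model` is an illustrative assumption.

```python
import torch

def hyvarinen_loss(score_model, x):
    """Estimate E_q[ ||f(x)||^2 + 2 div f(x) ] on a batch x ~ q."""
    x = x.clone().requires_grad_(True)
    f = score_model(x)                                   # (batch, d) score estimates
    sq_norm = f.pow(2).sum(dim=1)
    div = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):                          # exact divergence: d backward passes
        grad_i = torch.autograd.grad(f[:, i].sum(), x, create_graph=True)[0]
        div = div + grad_i[:, i]
    return (sq_norm + 2.0 * div).mean()

score_model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)
loss = hyvarinen_loss(score_model, torch.randn(128, 2))  # batch of samples from q
loss.backward()                                          # gradients for an optimizer step
```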

Annealing the score function

Suppose we need to model the distribution of images, and we want to start from $x_0 \sim \mathcal{N}(0, I)$, a white-noise image, and let Langevin dynamics carry it towards a realistic image. Now, most white-noise images do not look like real images, so $q(x) \approx 0$ for large swaths of the region where $x_0$ is likely to land. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function $\nabla_x \ln q(x)$ at that point, then we cannot impose the time-evolution equation on a particle:
$$dx_t = \nabla_{x_t} \ln q(x_t)\, dt + \sqrt{2}\, dW_t$$
To deal with this problem, we perform annealing. If $q$ is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.
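The NumPy sketch below illustrates the annealing idea: Langevin dynamics is run at a sequence of decreasing noise levels. The signature `score(x, sigma)` for a noise-conditional score, the geometric noise schedule, and the step-size rule are illustrative assumptions, and the toy analytic score stands in for a learned network.

```python
import numpy as np

rng = np.random.default_rng(0)

def annealed_langevin(score, shape, sigmas, steps_per_level=100, eps=2e-5):
    """Run Langevin dynamics at each noise level sigma, from largest to smallest."""
    x = rng.standard_normal(shape)                    # start from white noise
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2        # smaller steps at smaller noise levels
        for _ in range(steps_per_level):
            x = x + step * score(x, sigma) + np.sqrt(2 * step) * rng.standard_normal(shape)
    return x

def toy_score(x, sigma):
    # Score of N(0, (1 + sigma^2) I): toy data N(0, I) convolved with noise of scale sigma.
    return -x / (1.0 + sigma ** 2)

sigmas = np.geomspace(10.0, 0.01, num=10)             # decreasing noise levels
samples = annealed_langevin(toy_score, (1000, 2), sigmas)
```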