Stochastic Differential Equations and Fokker–Planck Equations


In this post, we explore how stochastic differential equations (SDEs) give rise to the forward Kolmogorov (Fokker–Planck) equation and its backward counterpart. These equations describe the time evolution of probability densities and conditional expectations associated with stochastic processes, which is central to the theory of diffusion models in machine learning.

1. Stochastic Differential Equations (SDEs)

Consider a stochastic process \( X_t \in \mathbb{R}^d \) governed by the Itô SDE:

\[ dX_t = f(X_t, t)\,dt + g(X_t, t)\,dW_t \]

where \( f : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d \) is the drift vector, \( g : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^{d \times m} \) is the diffusion matrix, and \( W_t \) is an \( m \)-dimensional standard Wiener process.
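To make the setup concrete, here is a minimal Euler–Maruyama sketch for simulating sample paths of such an SDE. It assumes a scalar or diagonal \( g \) (so the noise term is an elementwise product); the Ornstein–Uhlenbeck-style drift and unit diffusion at the bottom are illustrative choices, not anything fixed by the discussion above.

```python
import numpy as np

def euler_maruyama(f, g, x0, T, n_steps, rng=None):
    """Simulate one path of dX_t = f(X_t, t) dt + g(X_t, t) dW_t (scalar or diagonal g)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for k in range(n_steps):
        t = k * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)  # independent Wiener increments
        x = x + f(x, t) * dt + g(x, t) * dW              # elementwise: g assumed scalar/diagonal
        path.append(x.copy())
    return np.array(path)

# Illustrative example: Ornstein–Uhlenbeck drift with unit diffusion.
drift = lambda x, t: -x
diffusion = lambda x, t: 1.0
path = euler_maruyama(drift, diffusion, x0=np.zeros(2), T=1.0, n_steps=1000)
```

A general matrix-valued \( g \) would instead require an \( m \)-dimensional increment and a matrix-vector product \( g(x, t)\, dW \).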

2. Fokker–Planck Equation (Forward)

The Fokker–Planck (forward Kolmogorov) equation describes the evolution of the probability density \( p(x, t) \) of the process \( X_t \):

\[ \frac{\partial p}{\partial t} = -\nabla \cdot (f p) + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left( D_{ij}\, p \right) \]

where \( D(x, t) = g(x, t)g(x, t)^\top \) is the diffusion tensor.
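As a quick sanity check, consider the scalar Ornstein–Uhlenbeck process, a standard textbook example rather than anything specific to this post. For \( dX_t = -\theta X_t\,dt + \sigma\,dW_t \), the Fokker–Planck equation reads

\[ \frac{\partial p}{\partial t} = \frac{\partial}{\partial x}\left( \theta x\, p \right) + \frac{\sigma^2}{2} \frac{\partial^2 p}{\partial x^2} \]

Setting \( \partial p / \partial t = 0 \) and integrating once (with vanishing flux at infinity) gives \( p_\infty(x) \propto \exp(-\theta x^2 / \sigma^2) \), the Gaussian stationary density with variance \( \sigma^2 / (2\theta) \).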

Proof Sketch (Itô to Fokker–Planck)

We derive the Fokker–Planck equation by applying Itô's lemma to a smooth, time-independent test function \( \phi(x) \) and computing the time derivative of its expectation. The Itô integral term is a mean-zero martingale and drops out, leaving

\[ \frac{d}{dt} \mathbb{E}[\phi(X_t)] = \mathbb{E} \left[ \sum_i f_i \frac{\partial \phi}{\partial x_i} + \frac{1}{2} \sum_{i,j} D_{ij} \frac{\partial^2 \phi}{\partial x_i \partial x_j} \right] \]

Using the identity \( \mathbb{E}[\phi(X_t)] = \int \phi(x)\, p(x, t)\, dx \) and integrating by parts to move the derivatives from \( \phi \) onto \( p \), we match both sides and deduce the Fokker–Planck PDE for \( p(x, t) \).
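Spelling out that step, and assuming \( p \) and its derivatives decay fast enough that boundary terms vanish, integration by parts gives

\[ \int \phi \, \frac{\partial p}{\partial t} \, dx = \int \left( \sum_i f_i \frac{\partial \phi}{\partial x_i} + \frac{1}{2} \sum_{i,j} D_{ij} \frac{\partial^2 \phi}{\partial x_i \partial x_j} \right) p \, dx = \int \phi \left( -\nabla \cdot (f p) + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left( D_{ij}\, p \right) \right) dx \]

Since this holds for every smooth, compactly supported \( \phi \), the two integrands multiplying \( \phi \) must agree pointwise, which is exactly the Fokker–Planck equation.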

3. Backward Kolmogorov Equation

Instead of tracking how \( p(x, t) \) evolves forward in time, the backward Kolmogorov equation describes how the conditional expectation of a terminal observable evolves. For \( u(x, t) = \mathbb{E}[\phi(X_T) \mid X_t = x] \), with terminal condition \( u(x, T) = \phi(x) \), it satisfies:

\[ \frac{\partial u}{\partial t} + f \cdot \nabla u + \frac{1}{2} \text{Tr}(D \nabla^2 u) = 0 \]

This PDE is used heavily in control theory and score-based diffusion models.
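To make the probabilistic meaning concrete, here is a minimal Monte Carlo sketch that estimates \( u(x, t) \) by simulating forward paths started at \( (x, t) \) and averaging \( \phi(X_T) \). The scalar drift, diffusion, and terminal observable below are illustrative assumptions.

```python
import numpy as np

def expected_terminal(phi, f, g, x, t, T, n_steps=500, n_paths=2000, rng=None):
    """Monte Carlo estimate of u(x, t) = E[phi(X_T) | X_t = x] via Euler–Maruyama."""
    rng = np.random.default_rng() if rng is None else rng
    dt = (T - t) / n_steps
    X = np.full(n_paths, float(x))   # n_paths independent copies of a scalar state
    s = t
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X = X + f(X, s) * dt + g(X, s) * dW
        s += dt
    return phi(X).mean()

# Illustrative choices: OU drift, unit diffusion, terminal observable phi(x) = x^2.
u_est = expected_terminal(phi=lambda x: x**2,
                          f=lambda x, s: -x,
                          g=lambda x, s: 1.0,
                          x=0.5, t=0.0, T=1.0)
```

For this OU example, \( u(x, t) = x^2 e^{-2(T - t)} + \tfrac{1}{2}\left(1 - e^{-2(T - t)}\right) \) in closed form, so the Monte Carlo estimate can be checked directly.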

4. Anderson’s Theorem: Time Reversal of Diffusions

Suppose we observe a diffusion process \( X_t \) defined by:

\[ dX_t = f(X_t, t)\,dt + g\,dW_t \]

with constant \( g \in \mathbb{R}^{d \times d} \), and \( p(x, t) \) the marginal density at time \( t \). Then the time-reversed process \( \tilde{X}_t := X_{T - t} \) also satisfies an SDE:

\[ d\tilde{X}_t = \left( g g^\top \nabla_x \log p(\tilde{X}_t, T - t) - f(\tilde{X}_t, T - t) \right) dt + g\,d\bar{W}_t \]

where \( \bar{W}_t \) is a standard Wiener process with respect to the time-reversed filtration. Equivalently, if the original time variable is kept and run backward from \( T \) to \( 0 \), the reverse-time drift is \( f - g g^\top \nabla_x \log p \), the form usually quoted in the diffusion-model literature.
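As a concrete check (an illustrative setup, not part of the theorem), take the stationary Ornstein–Uhlenbeck process \( dX_t = -X_t\,dt + \sqrt{2}\,dW_t \) with \( X_0 \sim \mathcal{N}(0, 1) \). The marginal stays \( \mathcal{N}(0, 1) \) for every \( t \), so the score is simply \( -x \) and the reversed SDE can be simulated directly.

```python
import numpy as np

# Forward process: dX_t = -X_t dt + sqrt(2) dW_t with X_0 ~ N(0, 1) (stationary),
# so p(x, t) = N(0, 1) for all t and the score is -x in closed form.
def reverse_sde_samples(n_paths=10_000, T=1.0, n_steps=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    g = np.sqrt(2.0)
    x = rng.normal(size=n_paths)          # initialize the reversed process from p(., T)
    for _ in range(n_steps):
        score = -x                        # closed-form score of the N(0, 1) marginal
        drift = g**2 * score - (-x)       # reversed drift: g g^T score - f
        x = x + drift * dt + g * rng.normal(0.0, np.sqrt(dt), size=n_paths)
    return x

samples = reverse_sde_samples()
# The output should again be approximately N(0, 1), matching the marginal p(., 0).
```

Because the forward process is stationary here, the reversed process is again the same Ornstein–Uhlenbeck SDE, so the check is simply that the samples remain standard normal.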

Proof Sketch (Anderson's Theorem)

The proof hinges on showing that the marginal of the reversed process, \( \tilde{p}(x, t) := p(x, T - t) \), satisfies a forward Fokker–Planck equation whose drift differs from \( -f \) by a score term. This result is key in score-based diffusion models, where the score \( \nabla_x \log p(x, t) \) is learned so that the reversed SDE can be simulated to sample from a data distribution.
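In slightly more detail, for constant \( g \) write \( D = g g^\top \) and abbreviate \( f = f(x, T - t) \). Differentiating \( \tilde{p}(x, t) = p(x, T - t) \) in \( t \) flips the sign of the time derivative, so substituting the forward Fokker–Planck equation (with constant \( D \)) gives

\[ \frac{\partial \tilde{p}}{\partial t} = \nabla \cdot (f \tilde{p}) - \frac{1}{2} \nabla \cdot (D \nabla \tilde{p}) = -\nabla \cdot \Big( \left( D \nabla \log \tilde{p} - f \right) \tilde{p} \Big) + \frac{1}{2} \nabla \cdot (D \nabla \tilde{p}) \]

where the second equality uses \( \tilde{p}\, \nabla \log \tilde{p} = \nabla \tilde{p} \). The right-hand side is the forward Fokker–Planck equation of an SDE with drift \( D \nabla \log \tilde{p} - f \) and diffusion \( g \), exactly the reversed SDE stated above. Matching the marginals is the main computation; upgrading it to a statement about the reversed process itself requires a more careful argument at the level of transition densities, which is what Anderson's original proof supplies.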

5. Applications in Diffusion Models

Thanks to Anderson's theorem, we can sample from complex data distributions by running the learned time-reversed SDE from noise to data. This is the foundation of diffusion-based generative models such as DDPMs and score-based SDE models.