Score Matching in Diffusion Models
Score matching is a key technique in training diffusion-based generative models. These models learn to reverse a diffusion (noise-injection) process by estimating the gradient of the log-density of the data, a quantity known as the score function. This post covers the theory and practice of score matching: rigorous definitions, key results such as Tweedie’s formula and the reverse-time SDE, and comparisons with related approaches such as DDPM/DDIM and flow matching.
1. Score Function and Score Matching
Let \( p(x) \) be a differentiable probability density function on \( \mathbb{R}^d \). The score function of \( p \) is defined as:
\[ \nabla_x \log p(x) \]
The idea of score matching is to learn a model \( s_\theta(x) \) to approximate this score function. The classical objective for score matching, introduced by Hyvärinen (2005), minimizes:
\[ \mathbb{E}_{p(x)} \left[ \frac{1}{2} \left\| s_\theta(x) - \nabla_x \log p(x) \right\|^2 \right] \] However, since \( \nabla_x \log p(x) \) is typically unknown, Hyvärinen showed via integration by parts that, up to an additive constant independent of \( \theta \), this objective equals a reformulation that does not require access to the true score:
\[ \mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{p(x)} \left[ \frac{1}{2} \| s_\theta(x) \|^2 + \nabla \cdot s_\theta(x) \right] \]
where \( \nabla \cdot s_\theta(x) = \sum_{i=1}^d \frac{\partial s_\theta^{(i)}(x)}{\partial x_i} \) is the divergence of the vector field.
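To make the objective concrete, here is a minimal sketch of \( \mathcal{L}_{\text{SM}} \) in PyTorch; the network, dimension, and data below are illustrative assumptions, not part of a standard recipe. The exact divergence costs \( d \) backward passes, which is why sliced or Hutchinson-style estimators are used in practice:

```python
# A minimal sketch of Hyvärinen's implicit score matching loss in PyTorch.
# `score_net`, `d`, and the data are illustrative assumptions. The exact
# divergence below needs d backward passes, so it only suits small d.
import torch
import torch.nn as nn

d = 2
score_net = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, d))

def ism_loss(x):
    """L_SM = E[ 0.5 * ||s(x)||^2 + div s(x) ], estimated on a batch x."""
    x = x.requires_grad_(True)
    s = score_net(x)                      # (batch, d) score estimate
    # Divergence: sum of diagonal Jacobian entries ds_i / dx_i.
    div = torch.zeros(x.shape[0])
    for i in range(d):
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        div = div + grad_i[:, i]
    return (0.5 * s.pow(2).sum(dim=1) + div).mean()

x = torch.randn(128, d)                   # stand-in for data samples
loss = ism_loss(x)
loss.backward()
```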
2. Score Matching in Diffusion Models
In denoising score matching (DSM), instead of matching the score of the data distribution directly, one matches the score of perturbed data:
\[ \mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{p(x)} \mathbb{E}_{q_t(\tilde{x}|x)} \left[ \left\| s_\theta(\tilde{x}, t) - \nabla_{\tilde{x}} \log q_t(\tilde{x} | x) \right\|^2 \right] \] where \( q_t(\tilde{x}|x) \) is the transition (perturbation) kernel of the forward diffusion process, typically Gaussian:
\[ \tilde{x} \sim \mathcal{N}\left(\sqrt{\alpha(t)}\,x, (1 - \alpha(t)) I\right) \]
in which case the conditional score is available in closed form:
\[ \nabla_{\tilde{x}} \log q_t(\tilde{x} | x) = -\frac{\tilde{x} - \sqrt{\alpha(t)}\,x}{1 - \alpha(t)}. \]
Although the regression target is the conditional score, Vincent (2011) showed that the minimizer of \( \mathcal{L}_{\text{DSM}} \) is the score of the perturbed marginal \( q_t(\tilde{x}) = \int q_t(\tilde{x}|x)\, p(x)\, dx \).
This forms the training objective in many score-based generative models, such as those using stochastic differential equations (SDEs).
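As a concrete sketch, the objective above can be written in a few lines of PyTorch; the toy network, the schedule \( \alpha(t) = e^{-5t} \), and uniform sampling of \( t \) are illustrative assumptions (practical implementations also add a time-dependent weighting \( \lambda(t) \)):

```python
# Minimal denoising score matching loss for the VP-style perturbation
# q_t(x_tilde | x) = N(sqrt(alpha(t)) x, (1 - alpha(t)) I).
# The network, schedule, and constants are illustrative assumptions.
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Toy score model conditioned on t by concatenation (an assumption)."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, 64), nn.Tanh(), nn.Linear(64, d))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=1))

def alpha(t):
    return torch.exp(-5.0 * t)   # toy schedule: alpha(0) = 1, decays with t

def dsm_loss(score_net, x):
    t = torch.rand(x.shape[0], 1)                    # t ~ Uniform(0, 1)
    a = alpha(t)
    eps = torch.randn_like(x)
    x_tilde = a.sqrt() * x + (1 - a).sqrt() * eps    # sample from q_t(. | x)
    target = -eps / (1 - a).sqrt()                   # conditional score
    return ((score_net(x_tilde, t) - target) ** 2).sum(dim=1).mean()

score_net = ScoreNet(d=2)
loss = dsm_loss(score_net, torch.randn(128, 2))      # stand-in for data
loss.backward()
```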
3. Tweedie’s Formula
Tweedie’s formula provides a powerful connection between the posterior mean and the score function. If \( x \sim p(x) \) and \( y = x + \varepsilon \) with \( \varepsilon \sim \mathcal{N}(0, \sigma^2 I) \), and \( p_y \) denotes the marginal density of \( y \), then:
\[ \mathbb{E}[x | y] = y + \sigma^2 \nabla_y \log p_y(y) \] or, rearranged:
\[ \nabla_y \log p_y(y) = \frac{1}{\sigma^2} \left( \mathbb{E}[x | y] - y \right). \] This formula underpins the equivalence between denoising and score estimation in DSM: a model that predicts the clean \( x \) from the noisy \( y \) implicitly estimates the score of the noisy marginal.
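A quick numerical sanity check, for a one-dimensional Gaussian prior where everything is available in closed form (the specific constants are arbitrary choices for illustration):

```python
# Numerical check of Tweedie's formula for x ~ N(0, 1), y = x + sigma * eps.
# The marginal of y is N(0, 1 + sigma^2), so its score is -y / (1 + sigma^2)
# and Tweedie predicts E[x | y] = y + sigma^2 * score(y) = y / (1 + sigma^2).
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(1_000_000)
y = x + sigma * rng.standard_normal(1_000_000)

# For jointly Gaussian (x, y), E[x | y] is linear in y, so the least-squares
# slope of x on y estimates the posterior-mean coefficient.
slope = np.sum(x * y) / np.sum(y * y)
print(slope, 1 / (1 + sigma ** 2))   # both ~ 0.8
```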
4. DDPM vs. Score SDEs vs. Flow Matching
- DDPM (Denoising Diffusion Probabilistic Models): parameterize a noise-prediction model \( \epsilon_\theta \) and train it to reverse a discrete-time Markov chain by optimizing a variational bound; the resulting \( \epsilon_\theta \) objective is a reweighted DSM loss.
- Score-based SDEs: the continuous-time analog of DDPMs, trained via score matching, with sampling done by solving the reverse-time SDE or the corresponding probability-flow ODE (e.g., with Predictor-Corrector samplers).
- Flow Matching: trains a vector field \( v_\theta(x, t) \) to match the velocity of a prescribed probability path between a prior and the data, typically via conditional (per-sample) paths; optimal-transport couplings are a popular choice. (A one-line conversion relating the first two views appears below.)
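These parameterizations are closely linked. For instance, DDPM’s noise prediction differs from the score only by a scale factor; here is a sketch of the conversion, using DDPM’s cumulative-product schedule \( \bar{\alpha}_t \) (the variable names are assumptions):

```python
# Relation between DDPM's noise prediction and the score (a sketch; the
# name `alpha_bar_t` follows DDPM's cumulative-product convention).
# From x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps, the score
# of the perturbed marginal satisfies score(x_t, t) = -eps / sqrt(1 - alpha_bar_t).
import torch

def score_from_eps(eps_pred: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    return -eps_pred / torch.sqrt(1.0 - alpha_bar_t)
```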
5. Lemma: Connection to Reverse-Time SDE
The reverse-time SDE associated with a forward Itô process
\[ dX_t = f(X_t, t)\,dt + g(t)\, dW_t \] is, by Anderson’s (1982) time-reversal result:
\[ dX_t = \left[f(X_t, t) - g(t)^2 \nabla_x \log p_t(X_t) \right]dt + g(t)\, d\bar{W}_t \] where \( p_t \) is the marginal density of \( X_t \) and \( \bar{W}_t \) is a standard Wiener process with time flowing backward. This is why a model trained to match \( \nabla_x \log p_t(x) \) enables generation: one samples from the terminal distribution and integrates the reverse SDE back to \( t = 0 \).
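The lemma can be exercised end to end on a toy example. The sketch below integrates the reverse-time SDE with Euler–Maruyama for the simple forward process \( dX_t = \sigma\, dW_t \) and standard normal data, where the true score is known in closed form (all names and constants are illustrative):

```python
# Euler-Maruyama sampler for the reverse-time SDE, sketched for the forward
# process dX_t = sigma * dW_t (f = 0, g(t) = sigma) with N(0, 1) data, so the
# exact score -x / (1 + sigma^2 * t) is available in closed form.
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n_steps, n_samples = 1.0, 1.0, 1000, 10_000
dt = T / n_steps

def score(x, t):
    # p_t = N(0, 1 + sigma^2 * t) when the data are N(0, 1).
    return -x / (1.0 + sigma ** 2 * t)

# Start from the forward process's terminal marginal N(0, 1 + sigma^2 * T).
x = rng.standard_normal(n_samples) * np.sqrt(1.0 + sigma ** 2 * T)
for k in range(n_steps, 0, -1):
    t = k * dt
    drift = -sigma ** 2 * score(x, t)   # f - g^2 * score, with f = 0
    # Step backward in time: x_{t-dt} = x_t - drift * dt + g * sqrt(dt) * z.
    x = x - drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_samples)

print(x.mean(), x.var())                # should approach 0 and 1
```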
6. Example: Score Matching for a Gaussian
Let \( p(x) = \mathcal{N}(0, I) \). Then:
\[ \nabla_x \log p(x) = -x \] So the ideal score model is \( s^*(x) = -x \). If we add Gaussian noise \( \tilde{x} = x + \sigma \varepsilon \) with \( \varepsilon \sim \mathcal{N}(0, I) \), the perturbed marginal is \( q(\tilde{x}) = \mathcal{N}(0, (1 + \sigma^2) I) \), so the perturbed score becomes:
\[ \nabla_{\tilde{x}} \log q(\tilde{x}) = - \frac{\tilde{x}}{1 + \sigma^2}. \]
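Tying this back to Section 2, a small Monte Carlo check (the linear parameterization \( s_a(\tilde{x}) = a\tilde{x} \) is an illustrative assumption) confirms that minimizing the denoising objective recovers exactly this perturbed score:

```python
# DSM on a 1-D standard normal with the linear model s_a(x) = a * x.
# The DSM target is grad log q(x_tilde | x) = -(x_tilde - x) / sigma^2,
# and the least-squares minimizer should satisfy a ~ -1 / (1 + sigma^2).
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.7
x = rng.standard_normal(1_000_000)
x_tilde = x + sigma * rng.standard_normal(1_000_000)
target = -(x_tilde - x) / sigma ** 2

# Closed-form least squares for a in E[(a * x_tilde - target)^2].
a_hat = np.sum(x_tilde * target) / np.sum(x_tilde ** 2)
print(a_hat, -1 / (1 + sigma ** 2))   # both ~ -0.671
```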
7. Summary and Future Directions
Score matching plays a foundational role in generative modeling via diffusion. Through tools like Tweedie’s formula and the reverse-time SDE, it connects denoising, density modeling, and transport-based views. Comparisons with DDPM/DDIM and flow matching reveal the unifying structure of modern generative models through the lens of learning the gradient of the data log-density.
Future directions include improved score estimators (e.g., variance reduction), extensions to discrete domains, and hybrid models combining flow matching and score-based ideas.