Density Estimation: The Dirichlet Process

Dec 10, 2025

Statistical inference often begins with a model. In the classical setting, we might look at a histogram of data and say, "This looks like a bell curve." We then assume the data comes from a Normal distribution and focus our efforts on finding the best mean ($\mu$) and variance ($\sigma^2$).

But what if we don't want to make that assumption? What if the data has two peaks, or a skew that no standard distribution captures perfectly?

This is the domain of Density Estimation, where we are concerned with making inferences about an unknown distribution on the basis of an observed sample. Instead of fitting parameters to a fixed curve, we want the data to tell us what the curve should look like.

In this post, we will explore the Dirichlet Process (DP), the most popular prior model used for this task in the Bayesian framework.

The Bayesian Nonparametric Approach

In Bayesian inference, if we want to estimate an unknown parameter, we must place a prior on it.

  • If the unknown is a number (like a coin bias), we might use a Beta distribution.
  • If the unknown is a vector, we might use a Multivariate Normal.

In density estimation, our unknown parameter is the distribution itself. Since a distribution is a function, it is an infinite-dimensional object. To perform Bayesian inference here, we need a probability model defined over the space of all possible probability measures.

This is called a Bayesian Nonparametric (BNP) prior. The Dirichlet Process, introduced by Ferguson in 1973, is the fundamental building block of BNP.

Defining the Dirichlet Process

Let's denote our unknown random probability measure as $\alpha$.

We say that $\alpha$ follows a Dirichlet Process with precision parameter $M$ and base measure $\alpha_0$, denoted as:

$$\alpha \sim \mathrm{DP}(M, \alpha_0)$$

But what does it mean for a "measure" to be random? Ferguson provided a definition based on finite partitions of the space.

Definition: Let $S$ be our sample space (e.g., the real line). A random probability measure $\alpha$ is a DP if, for every finite partition of the space into sets $\{B_1, ..., B_k\}$, the vector of probabilities assigned to these sets follows a finite Dirichlet distribution:

$$(\alpha(B_1), ..., \alpha(B_k)) \sim \mathrm{Dir}(M \alpha_0(B_1), ..., M \alpha_0(B_k))$$

Here:

  • $\alpha_0$ (Base Measure): This is our "best guess" or centering distribution. It determines where the mass is located on average.
  • $M$ (Precision Parameter): This controls how tightly the random measure $\alpha$ concentrates around $\alpha_0$.

This definition is powerful because it reduces an infinite-dimensional problem back to a familiar finite-dimensional distribution—the Dirichlet distribution.
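Ferguson's partition property is easy to check by simulation. The sketch below (assuming NumPy is available; the three-cell partition at $\pm 1$ and the value $M = 5$ are arbitrary choices) samples the vector $(\alpha(B_1), \alpha(B_2), \alpha(B_3))$ many times and confirms that its average matches the base-measure probabilities $\alpha_0(B_i)$:

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """CDF of the standard Normal base measure alpha_0 = N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

M = 5.0  # precision parameter
# Partition of the real line: (-inf, -1], (-1, 1], (1, inf)
p0 = np.array([phi(-1.0), phi(1.0) - phi(-1.0), 1.0 - phi(1.0)])

rng = np.random.default_rng(0)
# Ferguson's definition: (alpha(B_1), ..., alpha(B_k)) ~ Dir(M * alpha_0(B_i))
draws = rng.dirichlet(M * p0, size=100_000)

print(p0)                   # base-measure masses of the three cells
print(draws.mean(axis=0))   # close to p0, as the mean property predicts
```

Each row of `draws` is one random assignment of probability to the three cells; averaging the rows recovers the centering distribution.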

Key Properties

To build intuition, let's look at the mean and variance of this process. For any specific set $B$:

$$E[\alpha(B)] = \alpha_0(B) \qquad \mathrm{Var}[\alpha(B)] = \frac{\alpha_0(B)\,(1-\alpha_0(B))}{1+M}$$

These equations reveal the roles of our parameters:

  1. The Mean: On average, the random measure $\alpha$ looks exactly like the base measure $\alpha_0$.
  2. The Variance: The variance decreases as $M$ increases.
    • If $M \to \infty$, the variance goes to zero. The random measure $\alpha$ becomes identical to $\alpha_0$.
    • If $M$ is small, $\alpha$ can deviate significantly from $\alpha_0$, allowing the model to adapt more freely to the data.
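Both formulas follow from the Dirichlet marginals: for a single set $B$ with $p = \alpha_0(B)$, the partition property gives $\alpha(B) \sim \text{Beta}(Mp,\, M(1-p))$. A quick Monte Carlo check (NumPy assumed; $p = 0.3$ and $M = 10$ are arbitrary illustrative values):

```python
import numpy as np

p = 0.3   # alpha_0(B), the base-measure mass of some set B
M = 10.0  # precision parameter

rng = np.random.default_rng(1)
# Marginally, alpha(B) ~ Beta(M * p, M * (1 - p))
samples = rng.beta(M * p, M * (1.0 - p), size=500_000)

print(samples.mean())           # ~ p: the mean formula
print(samples.var())            # ~ p(1-p)/(1+M): the variance formula
print(p * (1.0 - p) / (1.0 + M))
```

Raising `M` shrinks the Beta variance toward zero, which is exactly the concentration behavior described above.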

A Surprising Discreteness

Perhaps the most important property of the DP is the nature of the measures it generates. Even if our base measure $\alpha_0$ is smooth and continuous (like a Normal distribution), any realization $\alpha$ drawn from a DP is discrete.

Specifically, with probability 1, $\alpha$ can be written as an infinite weighted sum of point masses:

$$\alpha(\cdot) = \sum_{h=1}^{\infty} w_h \delta_{m_h}(\cdot)$$

  • The locations $m_h$ are points drawn from the base measure $\alpha_0$.
  • The weights $w_h$ sum to 1.
  • $\delta_x$ represents a "Dirac mass" or a spike of probability at $x$.

This means a distribution drawn from a DP doesn't look like a smooth curve; it looks like a staircase (a cumulative distribution function of discrete points). This discreteness is actually a feature, not a bug—it naturally leads to clustering, making the DP excellent for mixture models.
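The clustering effect is easy to see by drawing observations from a single realization of $\alpha$: because the measure is a sum of point masses, draws land on the same atoms again and again. A minimal sketch (NumPy assumed; it uses a truncated version of the stick-breaking recipe described in the next section, with $M = 1$ and 50 atoms as arbitrary choices):

```python
import numpy as np

M, H = 1.0, 50          # small precision; truncate the infinite sum at H atoms
rng = np.random.default_rng(2)

# Truncated stick-breaking weights (see the construction in the next section)
v = rng.beta(1.0, M, size=H)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
atoms = rng.normal(0.0, 1.0, size=H)   # locations m_h drawn from alpha_0 = N(0,1)

# Draw 200 observations from this one discrete realization of alpha
obs = rng.choice(atoms, size=200, p=w / w.sum())
print(len(np.unique(obs)))  # far fewer than 200 distinct values: ties = clusters
```

The 200 observations share only a handful of distinct values, which is precisely why the DP is a natural prior for mixture models.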

Constructing a DP: The Stick-Breaking Process

We know $\alpha$ is discrete, but how do we actually generate it? How do we determine those infinitely many weights $w_h$?

Sethuraman (1994) provided a constructive definition known as Stick-Breaking. It gives us an explicit recipe to simulate $\alpha$.

Imagine a stick of unit length (representing total probability 1). We want to break it into infinitely many pieces to get our weights $w_1, w_2, ...$

  1. Break the first piece: Generate a random fraction $v_1$ from a Beta distribution, $v_1 \sim \text{Beta}(1, M)$.

    • The first weight is $w_1 = v_1$.
    • The remaining stick has length $(1 - v_1)$.
  2. Break the second piece: Generate another fraction $v_2 \sim \text{Beta}(1, M)$.

    • The second weight is a fraction of what was left: $w_2 = v_2 (1 - v_1)$.
    • The remaining stick is now $(1 - v_1)(1 - v_2)$.
  3. Repeat infinitely: In general, the $h$-th weight is the fraction $v_h$ of the remainder left by the previous $h-1$ breaks:

    $$w_h = v_h \prod_{l<h} (1 - v_l)$$

Finally, we assign each weight $w_h$ to a random location $m_h$ drawn independently from $\alpha_0$.

$$\alpha = \sum_{h=1}^{\infty} w_h \delta_{m_h}$$
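The recipe above translates directly into code. A minimal sketch (NumPy assumed; the infinite sum is truncated at $H$ terms, a standard device in simulation, since the leftover stick $(1-v_1)\cdots(1-v_H)$ shrinks geometrically):

```python
import numpy as np

def sample_dp(M, H, base_sampler, rng):
    """Draw one truncated realization of DP(M, alpha_0) by stick-breaking."""
    v = rng.beta(1.0, M, size=H)                       # v_h ~ Beta(1, M)
    remainder = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w = v * remainder                                  # w_h = v_h * prod_{l<h}(1 - v_l)
    m = base_sampler(size=H)                           # m_h ~ alpha_0, i.i.d.
    return w, m

rng = np.random.default_rng(3)
# Base measure alpha_0 = N(0, 1), precision M = 5, truncated at 1000 atoms
w, m = sample_dp(M=5.0, H=1000, base_sampler=rng.standard_normal, rng=rng)
print(w.sum())  # close to 1: only the tiny leftover stick is unaccounted for
```

The pair `(w, m)` is one complete (truncated) draw of $\alpha$; its CDF is the staircase $\sum_h w_h \mathbf{1}[m_h \le x]$.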

Interactive Simulation

Below is a simulation of the Dirichlet Process using the Stick-Breaking construction.

We assume the base measure $\alpha_0$ is a Standard Normal distribution, $N(0,1)$.

  • The White Line represents the CDF of the base measure $\alpha_0$.
  • The Grey Lines are 15 different realizations of the random measure $\alpha$.

Figure: 15 samples from DP(M, α₀) with α₀ = N(0,1).

What to Look For

Try changing the precision parameter M:

  • M = 1 (Low Precision): The stick is broken into large chunks almost immediately. Often, one or two weights $w_h$ will be massive. This results in a "blocky" CDF that looks nothing like the smooth normal curve. The variance is high.
  • M = 500 (High Precision): The stick is broken into tiny splinters. It takes many, many atoms to sum to 1. Because the atoms are drawn from $\alpha_0$, their average behavior creates a very smooth curve that hugs the white line tightly.

This visualization demonstrates the "large weak support" of the DP: while every individual sample is jagged and discrete, the process places probability mass everywhere that $\alpha_0$ does. Given enough data, a DP model can approximate (in the weak sense) any distribution whose support lies within that of $\alpha_0$.
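The effect of $M$ on how tightly realizations hug the base CDF can also be quantified instead of eyeballed. The sketch below (NumPy assumed; grid resolution, truncation level, and the two $M$ values mirror the simulation above but are otherwise arbitrary) measures the largest gap between a truncated stick-breaking CDF and the $N(0,1)$ CDF, averaged over a few realizations:

```python
import numpy as np
from math import erf, sqrt

def dp_cdf_deviation(M, H, grid, rng):
    """Max gap between one truncated-DP realization's CDF and the N(0,1) CDF."""
    v = rng.beta(1.0, M, size=H)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    m = rng.normal(size=H)                              # atoms from alpha_0
    dp_cdf = np.array([w[m <= x].sum() for x in grid])  # staircase CDF
    base_cdf = np.array([0.5 * (1.0 + erf(x / sqrt(2.0))) for x in grid])
    return np.max(np.abs(dp_cdf - base_cdf))

rng = np.random.default_rng(4)
grid = np.linspace(-4.0, 4.0, 81)
# Average the deviation over several realizations at each precision level
dev_low  = np.mean([dp_cdf_deviation(1.0,   5000, grid, rng) for _ in range(20)])
dev_high = np.mean([dp_cdf_deviation(500.0, 5000, grid, rng) for _ in range(20)])
print(dev_low, dev_high)  # low precision strays far from the base CDF
```

With $M = 1$ the typical gap is large (a single weight can swallow most of the stick), while with $M = 500$ the realizations stay close to the white line, matching what the slider shows.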
