Density Estimation: The Dirichlet Process

Dec 10, 2025

Statistical inference often begins with a model. In the classical setting, we might look at a histogram of data and say, "This looks like a bell curve." We then assume the data comes from a Normal distribution and focus our efforts on finding the best mean ($\mu$) and variance ($\sigma^2$).

But what if we don't want to make that assumption? What if the data has two peaks, or a skew that no standard distribution captures perfectly?

This is the domain of Density Estimation, where we are concerned with making inferences about an unknown distribution on the basis of an observed sample. Instead of fitting parameters to a fixed curve, we want the data to tell us what the curve should look like.

In this post, we will explore the Dirichlet Process (DP), the most popular prior model used for this task in the Bayesian framework.

The Bayesian Nonparametric Approach

In Bayesian inference, if we want to estimate an unknown parameter, we must place a prior on it.

  • If the unknown is a number (like a coin bias), we might use a Beta distribution.
  • If the unknown is a vector, we might use a Multivariate Normal.

In density estimation, our unknown parameter is the distribution itself. Since a distribution is a function, it is an infinite-dimensional object. To perform Bayesian inference here, we need a probability model defined over the space of all possible probability measures.

This is called a Bayesian Nonparametric (BNP) prior. The Dirichlet Process, introduced by Ferguson in 1973, is the fundamental building block of BNP.

Defining the Dirichlet Process

Let's denote our unknown random probability measure as $\alpha$.

We say that $\alpha$ follows a Dirichlet Process with precision parameter $M$ and base measure $\alpha_0$, denoted as:

$$\alpha \sim \mathrm{DP}(M, \alpha_0)$$

But what does it mean for a "measure" to be random? Ferguson provided a definition based on finite partitions of the space.

Definition: Let $S$ be our sample space (e.g., the real line). A random probability measure $\alpha$ is a DP if, for every finite partition of the space into sets $\{B_1, ..., B_k\}$, the vector of probabilities assigned to these sets follows a finite Dirichlet distribution:

$$(\alpha(B_1), ..., \alpha(B_k)) \sim \mathrm{Dir}(M \alpha_0(B_1), ..., M \alpha_0(B_k))$$

Here:

  • $\alpha_0$ (Base Measure): This is our "best guess" or centering distribution. It determines where the mass is located on average.
  • $M$ (Precision Parameter): This controls how tightly the random measure $\alpha$ concentrates around $\alpha_0$.

This definition is powerful because it reduces an infinite-dimensional problem back to a familiar finite-dimensional distribution—the Dirichlet distribution.
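Ferguson's partition property is easy to check by simulation. The sketch below (assuming NumPy is available; the three-cell partition at $\pm 1$ and the value $M = 5$ are arbitrary choices) samples the vector $(\alpha(B_1), \alpha(B_2), \alpha(B_3))$ many times and confirms that its average matches the base-measure probabilities $\alpha_0(B_i)$:

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """CDF of the standard Normal base measure alpha_0 = N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

M = 5.0  # precision parameter
# Partition of the real line: (-inf, -1], (-1, 1], (1, inf)
p0 = np.array([phi(-1.0), phi(1.0) - phi(-1.0), 1.0 - phi(1.0)])

rng = np.random.default_rng(0)
# Ferguson's definition: (alpha(B_1), ..., alpha(B_k)) ~ Dir(M * alpha_0(B_i))
draws = rng.dirichlet(M * p0, size=100_000)

print(p0)                   # base-measure masses of the three cells
print(draws.mean(axis=0))   # close to p0, as the mean property predicts
```

Each row of `draws` is one random assignment of probability to the three cells; averaging the rows recovers the centering distribution.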

Key Properties

To build intuition, let's look at the mean and variance of this process. For any specific set $B$:

$$E[\alpha(B)] = \alpha_0(B) \qquad \mathrm{Var}[\alpha(B)] = \frac{\alpha_0(B)\,(1-\alpha_0(B))}{1+M}$$

These equations reveal the roles of our parameters:

  1. The Mean: On average, the random measure $\alpha$ looks exactly like the base measure $\alpha_0$.
  2. The Variance: The variance decreases as $M$ increases.
    • If $M \to \infty$, the variance goes to zero. The random measure $\alpha$ becomes identical to $\alpha_0$.
    • If $M$ is small, $\alpha$ can deviate significantly from $\alpha_0$, allowing the model to adapt more freely to the data.
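Both formulas follow from the Dirichlet marginals: for a single set $B$ with $p = \alpha_0(B)$, the partition property gives $\alpha(B) \sim \text{Beta}(Mp,\, M(1-p))$. A quick Monte Carlo check (NumPy assumed; $p = 0.3$ and $M = 10$ are arbitrary illustrative values):

```python
import numpy as np

p = 0.3   # alpha_0(B), the base-measure mass of some set B
M = 10.0  # precision parameter

rng = np.random.default_rng(1)
# Marginally, alpha(B) ~ Beta(M * p, M * (1 - p))
samples = rng.beta(M * p, M * (1.0 - p), size=500_000)

print(samples.mean())           # ~ p: the mean formula
print(samples.var())            # ~ p(1-p)/(1+M): the variance formula
print(p * (1.0 - p) / (1.0 + M))
```

Raising `M` shrinks the Beta variance toward zero, which is exactly the concentration behavior described above.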

A Surprising Discreteness

Perhaps the most important property of the DP is the nature of the measures it generates. Even if our base measure $\alpha_0$ is smooth and continuous (like a Normal distribution), any realization $\alpha$ drawn from a DP is discrete.

Specifically, with probability 1, $\alpha$ can be written as an infinite weighted sum of point masses:

$$\alpha(\cdot) = \sum_{h=1}^{\infty} w_h \delta_{m_h}(\cdot)$$

  • The locations $m_h$ are points drawn from the base measure $\alpha_0$.
  • The weights $w_h$ sum to 1.
  • $\delta_x$ represents a "Dirac mass" or a spike of probability at $x$.

This means a distribution drawn from a DP doesn't look like a smooth curve; it looks like a staircase (a cumulative distribution function of discrete points). This discreteness is actually a feature, not a bug—it naturally leads to clustering, making the DP excellent for mixture models.
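The clustering effect is easy to see by drawing observations from a single realization of $\alpha$: because the measure is a sum of point masses, draws land on the same atoms again and again. A minimal sketch (NumPy assumed; it uses a truncated version of the stick-breaking recipe described in the next section, with $M = 1$ and 50 atoms as arbitrary choices):

```python
import numpy as np

M, H = 1.0, 50          # small precision; truncate the infinite sum at H atoms
rng = np.random.default_rng(2)

# Truncated stick-breaking weights (see the construction in the next section)
v = rng.beta(1.0, M, size=H)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
atoms = rng.normal(0.0, 1.0, size=H)   # locations m_h drawn from alpha_0 = N(0,1)

# Draw 200 observations from this one discrete realization of alpha
obs = rng.choice(atoms, size=200, p=w / w.sum())
print(len(np.unique(obs)))  # far fewer than 200 distinct values: ties = clusters
```

The 200 observations share only a handful of distinct values, which is precisely why the DP is a natural prior for mixture models.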

Constructing a DP: The Stick-Breaking Process

We know $\alpha$ is discrete, but how do we actually generate it? How do we determine those infinitely many weights $w_h$?

Sethuraman (1994) provided a constructive definition known as Stick-Breaking. It gives us an explicit recipe to simulate $\alpha$.

Imagine a stick of unit length (representing total probability 1). We want to break it into infinitely many pieces to get our weights $w_1, w_2, ...$

  1. Break the first piece: Generate a random fraction $v_1$ from a Beta distribution, $v_1 \sim \text{Beta}(1, M)$.

    • The first weight is $w_1 = v_1$.
    • The remaining stick has length $(1 - v_1)$.
  2. Break the second piece: Generate another fraction $v_2 \sim \text{Beta}(1, M)$.

    • The second weight is a fraction of what was left: $w_2 = v_2 (1 - v_1)$.
    • The remaining stick is now $(1 - v_1)(1 - v_2)$.
  3. Repeat infinitely: In general, the $h$-th weight is the fraction $v_h$ of the remainder left by the previous $h-1$ breaks:

    $$w_h = v_h \prod_{l<h} (1 - v_l)$$

Finally, we assign each weight $w_h$ to a random location $m_h$ drawn independently from $\alpha_0$.

$$\alpha = \sum_{h=1}^{\infty} w_h \delta_{m_h}$$
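The recipe above translates directly into code. A minimal sketch (NumPy assumed; the infinite sum is truncated at $H$ terms, a standard device in simulation, since the leftover stick $(1-v_1)\cdots(1-v_H)$ shrinks geometrically):

```python
import numpy as np

def sample_dp(M, H, base_sampler, rng):
    """Draw one truncated realization of DP(M, alpha_0) by stick-breaking."""
    v = rng.beta(1.0, M, size=H)                       # v_h ~ Beta(1, M)
    remainder = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w = v * remainder                                  # w_h = v_h * prod_{l<h}(1 - v_l)
    m = base_sampler(size=H)                           # m_h ~ alpha_0, i.i.d.
    return w, m

rng = np.random.default_rng(3)
# Base measure alpha_0 = N(0, 1), precision M = 5, truncated at 1000 atoms
w, m = sample_dp(M=5.0, H=1000, base_sampler=rng.standard_normal, rng=rng)
print(w.sum())  # close to 1: only the tiny leftover stick is unaccounted for
```

The pair `(w, m)` is one complete (truncated) draw of $\alpha$; its CDF is the staircase $\sum_h w_h \mathbf{1}[m_h \le x]$.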

Interactive Simulation

Below is a simulation of the Dirichlet Process using the Stick-Breaking construction.

We assume the base measure $\alpha_0$ is a Standard Normal distribution, $N(0,1)$.

  • The White Line represents the CDF of the base measure $\alpha_0$.
  • The Grey Lines are 15 different realizations of the random measure $\alpha$.

Figure: 15 samples from DP(M, α₀) with α₀ = N(0,1).

What to Look For

Try changing the precision parameter M:

  • M = 1 (Low Precision): The stick is broken into large chunks almost immediately. Often, one or two weights $w_h$ will be massive. This results in a "blocky" CDF that looks nothing like the smooth normal curve. The variance is high.
  • M = 500 (High Precision): The stick is broken into tiny splinters. It takes many, many atoms to sum to 1. Because the atoms are drawn from $\alpha_0$, their average behavior creates a very smooth curve that hugs the white line tightly.

This visualization demonstrates the "large weak support" of the DP: while every individual sample is jagged and discrete, the process places probability mass everywhere that $\alpha_0$ does. Given enough data, a DP model can approximate (in the weak sense) any distribution whose support lies within that of $\alpha_0$.
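The effect of $M$ on how tightly realizations hug the base CDF can also be quantified instead of eyeballed. The sketch below (NumPy assumed; grid resolution, truncation level, and the two $M$ values mirror the simulation above but are otherwise arbitrary) measures the largest gap between a truncated stick-breaking CDF and the $N(0,1)$ CDF, averaged over a few realizations:

```python
import numpy as np
from math import erf, sqrt

def dp_cdf_deviation(M, H, grid, rng):
    """Max gap between one truncated-DP realization's CDF and the N(0,1) CDF."""
    v = rng.beta(1.0, M, size=H)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    m = rng.normal(size=H)                              # atoms from alpha_0
    dp_cdf = np.array([w[m <= x].sum() for x in grid])  # staircase CDF
    base_cdf = np.array([0.5 * (1.0 + erf(x / sqrt(2.0))) for x in grid])
    return np.max(np.abs(dp_cdf - base_cdf))

rng = np.random.default_rng(4)
grid = np.linspace(-4.0, 4.0, 81)
# Average the deviation over several realizations at each precision level
dev_low  = np.mean([dp_cdf_deviation(1.0,   5000, grid, rng) for _ in range(20)])
dev_high = np.mean([dp_cdf_deviation(500.0, 5000, grid, rng) for _ in range(20)])
print(dev_low, dev_high)  # low precision strays far from the base CDF
```

With $M = 1$ the typical gap is large (a single weight can swallow most of the stick), while with $M = 500$ the realizations stay close to the white line, matching what the slider shows.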
