Tuesday, April 6, 2010

Sampling Dirichlet Distributions (chapter 2-b)

In the previous post, I talked about how sampling in soft-max space can make Metropolis algorithms work better for Dirichlet distributions. Here are some pictures that show why. These pictures are of three-parameter Dirichlet distributions with the simplex projected into two dimensions. Dirichlet distributions with different parameters can look like this:

[Figure: samples from a three-parameter Dirichlet distribution projected onto the simplex]

or this

[Figure: samples from a Dirichlet with a second choice of parameters]

or this

[Figure: samples from a Dirichlet with a third choice of parameters]
These were all generated using the standard algorithm for sampling from a Dirichlet distribution.


In R, that algorithm looks like this:

rdirichlet = function(n, alpha) {
  # draw n samples from a Dirichlet(alpha) distribution
  k = length(alpha)
  r = matrix(0, nrow=n, ncol=k)
  # each column is an independent Gamma(alpha[i], 1) sample
  for (i in 1:k) {
    r[,i] = rgamma(n, alpha[i], 1)
  }
  # normalize each row to sum to 1
  return (r / rowSums(r))
}
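
A quick sanity check of this sampler, using the rdirichlet function above: each row should be a proper probability vector summing to 1.

samples = rdirichlet(30000, c(1, 1, 1))
range(rowSums(samples))   # should be very close to (1, 1)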

That is, first sample $(y_1 \ldots y_k)$ from gamma distributions
\[y_i \sim \mathrm {Gamma} (\alpha_i, 1)\]
Then normalize to get the desired sample
\[\pi_i = {y_i \over \sum_j y_j}\]
In contrast, here are 30,000 samples drawn from a Dirichlet distribution with $\alpha = (1,1,1)$ using a simple Metropolis algorithm. These should give a completely uniform impression, but there is noticeable thinning in the corners, where the acceptance rate of the sampling process drops.

[Figure: 30,000 Metropolis samples from a Dirichlet with $\alpha = (1,1,1)$, showing apparent thinning in the corners]

The thinning is only apparent. What actually happens is that samples in the corners are repeated, and the repetitions do not show up in this visual presentation. The algorithm is therefore correct, but it exhibits very different mixing behavior depending on where you look.

A particularly unfortunate aspect of this problem is that shrinking the step distribution to increase the acceptance rate in the corners makes the sampler take much longer to traverse the region of interest, so it only partially helps. Much of the difficulty is that once we have reached a point near an edge of the simplex, and especially near a corner, many candidate steps fall off the simplex entirely and are rejected outright. When many parameters are small, this is a particular problem because steps that do stay on the simplex are still likely to be rejected, since the probability density drops dramatically as we move away from the edge. Both of these problems become much, much worse in higher dimensions.
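
To make the failure mode concrete, here is a minimal sketch of such a naive sampler, assuming a symmetric Gaussian step taken directly in the sum-to-one plane (this is my reconstruction, not necessarily the code used for the figure). Proposals that land outside the simplex are rejected outright, which is exactly what starves the corners of accepted moves:

rdirichlet.naive = function(n, alpha, step = 0.05) {
  k = length(alpha)
  # unnormalized log density of Dirichlet(alpha)
  log.density = function(x) sum((alpha - 1) * log(x))
  x = rep(1/k, k)                    # start at the center of the simplex
  samples = matrix(0, nrow=n, ncol=k)
  for (i in 1:n) {
    e = rnorm(k, 0, step)
    x.new = x + e - mean(e)          # symmetric step within the sum-to-one plane
    # reject steps that leave the simplex, then apply the usual Metropolis test
    if (all(x.new > 0) && log(runif(1)) < log.density(x.new) - log.density(x)) {
      x = x.new
    }
    samples[i,] = x                  # rejected steps repeat the current point
  }
  return (samples)
}

With $\alpha = (1,1,1)$ the density ratio is always 1, so every rejection comes from stepping off the simplex, and those rejections pile up in the corners.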

A much better alternative is to generate candidate points in soft-max space using a Gaussian step distribution.  This gives us a symmetric candidate distribution that takes small steps near the edges and corners of the simplex and larger steps in the middle of the range.  The sensitivity to the average step size (in soft-max space) is noticeably decreased and the chance of stepping out of the simplex is eliminated entirely.
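
Here is a minimal sketch of that approach, assuming the usual soft-max parameterization with the last coordinate of $z$ pinned at zero. After including the Jacobian of the soft-max transform, the log density of a Dirichlet with parameter $\alpha$ in soft-max coordinates reduces to $\sum_i \alpha_i \log \pi_i$ plus a constant, so the acceptance test is very simple:

rdirichlet.softmax = function(n, alpha, step = 0.5) {
  k = length(alpha)
  softmax = function(z) {
    e = exp(z - max(z))              # subtract max for numerical stability
    return (e / sum(e))
  }
  # Dirichlet(alpha) log density in soft-max coordinates, Jacobian included
  log.target = function(z) sum(alpha * log(softmax(z)))
  z = rep(0, k)                      # start at the center of the simplex
  samples = matrix(0, nrow=n, ncol=k)
  for (i in 1:n) {
    z.new = z
    # symmetric Gaussian step in soft-max space; z[k] stays pinned at 0
    z.new[1:(k-1)] = z[1:(k-1)] + rnorm(k-1, 0, step)
    if (log(runif(1)) < log.target(z.new) - log.target(z)) {
      z = z.new
    }
    samples[i,] = softmax(z)         # rejected steps repeat the current point
  }
  return (samples)
}

Because the candidate distribution is symmetric in $z$, the acceptance ratio involves only the target density, and no candidate can ever fall outside the simplex.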

This small cleverness leads to a solution of the question posed in a comment some time ago by Andrei: how to sample from the posterior distribution when we observe counts drawn from a multinomial
\[\vec k \sim \mathrm {Multi} (\vec x)\]
where that multinomial has a prior distribution defined by
\[\begin{aligned}
\vec x &= \alpha \vec m \\
\alpha &\sim \mathrm{Exp} (\beta_0) \\
\vec m &\sim \mathrm{Dir} (\beta_1 \ldots \beta_n)
\end{aligned}\]
More about that in my next post.
