In theory, the Metropolis algorithm (with an arbitrary proposal) works for any distribution in any dimension.
In practice, the Metropolis algorithm is inefficient for high-dimensional distributions.
Current research in MCMC focuses on finding good proposals that speed up the algorithm, together with theoretical guarantees, e.g., ensuring that the chain still converges to the target distribution.
In this lecture, we will introduce two popular MCMC algorithms: Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS).
These two algorithms are the ones used in the Stan software.
A few words about the acceptance rate
Ideally, we want the acceptance rate to be bounded away from both 0 and 1.
If the acceptance rate is too low, the Markov chain rarely moves.
If the acceptance rate is too high, the chain is likely taking very small steps and exploring only a small high-density region.
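As a quick diagnostic, the realized acceptance rate can be read off a stored chain by counting how often the state changed. A minimal sketch, assuming the chain is stored as rows of a NumPy array (for a continuous proposal, consecutive states coincide only after a rejection):

```python
import numpy as np

def acceptance_rate(chain):
    """Fraction of iterations in which the chain moved to a new state."""
    moved = np.any(chain[1:] != chain[:-1], axis=1)
    return np.mean(moved)
```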
Adaptive MCMC
In the original Metropolis algorithm, the proposal distribution is fixed.
This can be inefficient: after some exploration, we gain knowledge about the target distribution that could be used to improve the proposal distribution.
This idea is easy to implement, e.g., by reducing the variance of the Gaussian proposal every 100 iterations.
However, the theoretical analysis is complicated since the Markov chain is no longer time-homogeneous.
Can we ensure that the adaptive Markov chain still converges to the target distribution?
Optimal Gaussian proposal
Let \(\pi\) be a distribution supported on \(\Theta \subseteq \mathbb{R}^d\).
We want to sample from \(\pi\) using the Metropolis algorithm with a Gaussian proposal \(N_d(\theta^{(t-1)}, \Xi)\).
What is the optimal choice of \(\Xi\)?
What do we mean by optimal in MCMC? We want the fastest convergence of the Markov chain.
Roberts & Rosenthal (2001) showed that, under some conditions on \(\pi\), the optimal choice is \[
\Xi = \frac{2.38^2}{d} \Sigma_{\pi},
\] where \(\Sigma_{\pi}\) is the covariance matrix of \(\pi\). But we don’t know \(\Sigma_{\pi}\)!
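For example, in \(d = 4\) dimensions with \(\Sigma_{\pi} = I_4\), the rule gives \(\Xi = \frac{2.38^2}{4} I_4 \approx 1.42\, I_4\). The same scaling corresponds to the well-known optimal acceptance rate of approximately 0.234 in high dimensions.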
Adaptive Random Walk Metropolis
Algorithm:
At the \(t\)th iteration, compute \(\hat{\Sigma}_t\) using \(\theta^{(1)}, \ldots, \theta^{(t-1)}\).
Sample \(\theta^*\) from \(N_d\left(\theta^{(t-1)}, \frac{2.38^2}{d}\hat{\Sigma}_t\right)\).
Compute the acceptance ratio \(\rho(\theta^{(t-1)}, \theta^{*})=\min\left\{\frac{\pi(\theta^*)}{\pi(\theta^{(t-1)})}, 1\right\}\).
Set \[
\theta^{(t)}= \begin{cases}\theta^* & \text { with prob. } \rho(\theta^{(t-1)}, \theta^{*}) \\ \theta^{(t-1)} & \text { with prob. } 1-\rho(\theta^{(t-1)}, \theta^{*}).\end{cases}
\]
The simplest choice of \(\hat{\Sigma}_t\) is the sample covariance matrix of the past draws.
However, it is not a good choice in high dimensions.
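As a concrete illustration, here is a minimal sketch of adaptive random walk Metropolis for a toy two-dimensional Gaussian target; the target, the 200-iteration non-adaptive warm-up, and the small jitter added to \(\hat{\Sigma}_t\) are illustrative assumptions, not part of the algorithm above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: a correlated bivariate Gaussian, known only up to a constant.
d = 2
Sigma_target = np.array([[1.0, 0.8], [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma_target)

def log_target(theta):
    return -0.5 * theta @ Sigma_inv @ theta

T = 20_000
theta = np.zeros(d)
chain = np.empty((T, d))
accepted = 0

for t in range(T):
    if t < 200:
        # Warm-up phase: fixed isotropic proposal.
        Xi = 0.1 * np.eye(d)
    else:
        # Adaptive phase: plug the sample covariance of the past draws into
        # the Roberts--Rosenthal scaling; jitter keeps Xi positive definite.
        Xi = (2.38**2 / d) * np.cov(chain[:t].T) + 1e-6 * np.eye(d)
    proposal = rng.multivariate_normal(theta, Xi)
    # Accept/reject on the log scale for numerical stability.
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal
        accepted += 1
    chain[t] = theta

print("acceptance rate:", accepted / T)
```

In practice, \(\hat{\Sigma}_t\) is updated recursively rather than recomputed from scratch at every iteration, and the adaptation should be made diminishing, as discussed below.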
The adaptive RWM works well in practice, but the theoretical analysis is difficult.
Adaptation: the change of the proposal distribution over time.
For an adaptive MCMC algorithm to work, the adaptation must satisfy two requirements:
Diminishing adaptation: the adaptation must diminish over time.
Containment: the time to stationarity from \(X_t\) remains bounded if we fix the adaptation at time \(t\), i.e., stop changing the proposal distribution after time \(t\).
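For example, diminishing adaptation holds if we adapt at iteration \(t\) only with probability \(p_t \to 0\), or if the change made to the proposal parameters at iteration \(t\) shrinks to zero as \(t\) grows.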
Random walk Metropolis can fail badly when the target distribution is multi-modal because it uses proposals that are unrelated to the target distribution.
HMC makes better use of what is known about the target distribution by using the gradient of the log density.
HMC can only be used to sample from continuous distributions whose density can be evaluated (up to a normalizing constant) and whose gradient can be computed.
In particular, HMC cannot be used to sample from discrete distributions.
Although we did not show it in class, HMC indeed has \(\pi\) as its stationary distribution.
Betancourt (2017) also provides a nice introduction to HMC and NUTS.
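To make the idea concrete, here is a minimal sketch of one HMC transition using the leapfrog integrator; the standard Gaussian target, the step size `eps`, and the number of leapfrog steps are illustrative choices, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: standard Gaussian, so grad log pi(theta) = -theta.
def log_target(theta):
    return -0.5 * theta @ theta

def grad_log_target(theta):
    return -theta

def hmc_step(theta, eps=0.1, n_leapfrog=20):
    """One HMC transition: sample a momentum, simulate Hamiltonian
    dynamics with the leapfrog integrator, then accept/reject."""
    r = rng.standard_normal(theta.shape)             # momentum ~ N(0, I)
    theta_new, r_new = theta.copy(), r.copy()
    r_new += 0.5 * eps * grad_log_target(theta_new)  # half step for momentum
    for step in range(n_leapfrog):
        theta_new += eps * r_new                     # full step for position
        if step < n_leapfrog - 1:
            r_new += eps * grad_log_target(theta_new)
    r_new += 0.5 * eps * grad_log_target(theta_new)  # final half step
    # Metropolis correction for the discretization error of leapfrog.
    log_ratio = (log_target(theta_new) - 0.5 * r_new @ r_new
                 - log_target(theta) + 0.5 * r @ r)
    return theta_new if np.log(rng.uniform()) < log_ratio else theta

theta = np.zeros(5)
draws = np.empty((1000, 5))
for i in range(1000):
    theta = hmc_step(theta)
    draws[i] = theta
```

NUTS removes the need to hand-tune the number of leapfrog steps by building the trajectory adaptively and stopping when it starts to double back (make a "U-turn").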
Distributional Approximation
Motivation
We have been focusing on simulating samples from the posterior distribution to make inferences.
However, sampling-based approaches can be very inefficient for large problems, e.g., problems with hundreds or thousands of parameters.
In such cases, we might want to approximate the posterior distribution with a simpler distribution.
That is, we trade accuracy for efficiency.
The simplest approach is the modal approximation, which approximates the posterior by a normal distribution centered at its mode.
However, this approach is not very flexible and does not scale well to high-dimensional problems.
There are many other approaches, and we focus on one of them: variational inference.
Variational inference
Variational Bayes (VB) is an algorithmic framework for approximating joint distributions.
Suppose we want to approximate the posterior distribution \(\pi(\theta \mid y)\) with a simpler distribution \(q(\theta)\) by minimizing the Kullback-Leibler divergence \[
\text{KL}(q\|\pi) = \int \log\frac{q(\theta)}{\pi(\theta \mid y)} q(\theta)d\theta = \E_q\left[\log\frac{q(\theta)}{\pi(\theta \mid y)}\right].
\]
The difficulties are:
\(\pi(\theta \mid y)\) is available only up to a normalizing constant;
we cannot easily take posterior draws from \(\pi(\theta \mid y)\);
we cannot easily compute expectations of interest, \(\E(h(\theta) \mid y)\).
Hence we are not able to minimize the KL divergence directly.
Evidence Lower Bound (ELBO)
The marginal distribution of \(y\) is given by \[
p(y) = \int p(y, \theta)\, d\theta = \int p(y \mid \theta)\, \pi(\theta)\, d\theta.
\]
Then the evidence (log of marginal) is \[\begin{align*}
\log p(y) & = \log \int p(y, \theta) d\theta
= \log \int \frac{p(y, \theta)}{q(\theta)} q(\theta) d\theta\\
& = \log\left(\E_{q}\left[\frac{p(y, \theta)}{q(\theta)}\right]\right)
\geq \E_q\left[\log p(y, \theta)\right] - \E_q\left[\log q(\theta)\right].
\end{align*}\]
The last inequality follows from Jensen’s inequality, since the logarithm is concave.
The quantity \(\E_q\left[\log p(y, \theta)\right] - \E_q\left[\log q(\theta)\right]\) is called the evidence lower bound (ELBO).
Evidence Lower Bound (ELBO)
The KL divergence can be written as \[\begin{align*}
\text{KL}(q \| \pi) & = \E_q\left[\log\frac{q(\theta)}{\pi(\theta \mid y)}\right] \\
& = \E_q\left[\log q(\theta)\right] - \E_q\left[\log \pi(\theta \mid y)\right] \\
& = \E_q\left[\log q(\theta)\right] - \E_q\left[\log p(y \mid \theta)\pi(\theta)\right] + \log p(y)\\
& = - \text{ELBO}(q) + \log p(y).
\end{align*}\]
Hence minimizing the KL divergence is equivalent to maximizing the ELBO.
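Since the ELBO involves only expectations under \(q\) of the (unnormalized) joint \(p(y, \theta)\) and of \(\log q\), it can be estimated by Monte Carlo using draws from \(q\). A minimal sketch, assuming a one-dimensional Gaussian variational family \(q = N(m, s^2)\); the toy model in the usage example is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def elbo_estimate(log_joint, m, s, num_draws=10_000):
    """Monte Carlo estimate of ELBO(q) = E_q[log p(y, theta)] - E_q[log q(theta)]
    for q = N(m, s^2)."""
    theta = m + s * rng.standard_normal(num_draws)   # draws from q
    log_q = -0.5 * np.log(2 * np.pi * s**2) - (theta - m)**2 / (2 * s**2)
    return np.mean(log_joint(theta) - log_q)

# Toy model: y_i | theta ~ N(theta, 1), theta ~ N(0, 1).
# Constants in log p(y, theta) are dropped; they shift the ELBO
# by a constant and do not affect its maximizer.
y = np.array([0.5, 1.2, -0.3])
log_joint = lambda th: -0.5 * th**2 - 0.5 * ((y[:, None] - th)**2).sum(axis=0)
print(elbo_estimate(log_joint, m=0.35, s=0.5))
```

Maximizing such an estimate over \((m, s)\), e.g., with stochastic gradients, is the basic idea behind black-box variational inference.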
Class of approximate distributions
Now we need to choose a class \(\mc{Q}\) of distributions \(q(\theta)\) to approximate the posterior distribution \(\pi(\theta \mid y)\).
We want \(\mc{Q}\) to be:
flexible enough to approximate \(\pi(\theta \mid y)\) well;
simple enough to be tractable;
such that the ELBO can be computed easily.
The class \(\mc{Q}\) is often a parametric family, i.e., \[
\mc{Q} = \{q(\theta \mid \phi): \phi \in \Phi\},
\] where \(\phi\) is called the variational parameter.
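For example, \(\mc{Q}\) could be the family of multivariate Gaussian distributions \(q(\theta \mid \phi) = N_d(\theta \mid m, \Sigma)\) with variational parameter \(\phi = (m, \Sigma)\).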
Mean-field variational family
The simplest choice of \(\mc{Q}\) is the mean-field variational family: \[
\mc{Q}_{\text{MF}} = \left\{q(\theta \mid \phi) = \prod_{j=1}^d q_j(\theta_j \mid \phi_j): \phi = (\phi_1, \ldots, \phi_d) \in \Phi\right\}.
\]
The mean-field variational family assumes that the parameters are independent.
Unless the posterior actually factorizes, this family does not contain the true posterior distribution, since it cannot capture posterior dependence between parameters.
However, it is often used in practice because it is simple and easy to compute.
We can also partition \(\theta\) into \(K\) blocks and assume that parameters in different blocks are independent.
Maximizing ELBO
The ELBO is given by \(\text{ELBO}(q) = \E_q\left[\log p(y, \theta)\right] - \E_q\left[\log q(\theta)\right]\).
For \(q \in \mc{Q}_{\text{MF}}\), the (negative) entropy term is \[
\E_q\left[\log q(\theta)\right] = \E_q\left[\sum_{j=1}^d \log q_j(\theta_j \mid \phi_j)\right] = \sum_{j=1}^d \E_{j}\left[\log q_j(\theta_j \mid \phi_j)\right]
\] where \(\E_j\) is the expectation with respect to \(q_j\).
The first term is \[\begin{align*}
\E_q\left[\log p(y, \theta)\right] &= \int \log p(y, \theta) \prod_{j=1}^d q_j(\theta_j)\, d\theta_1\cdots d\theta_d \\
& = \int \E_{-j}\left[\log p(y, \theta)\right] q_j(\theta_j)\, d\theta_j,
\end{align*}\] where \(\E_{-j}\) denotes the expectation with respect to all \(q_k\) with \(k \neq j\).
We need to find \(q_1, \ldots, q_d\) that maximize the ELBO, viewed as a functional \(\mc{L}(q_1, \ldots, q_d)\), subject to the constraints \(\int q_j(\theta_j)\, d\theta_j = 1\).
Using Lagrange multipliers, taking the "functional derivative" with respect to \(q_j\) and setting it to zero, we obtain \[
\log q_j(\theta_j) = \E_{-j}\left[\log p(y, \theta)\right] + \text{const}.
\]
Hence the optimal \(q_j\) is \[
q^{*}_j(\theta_j) \propto \exp\left(\E_{-j}\left[\log p(y, \theta)\right]\right).
\]
Example: 8-school problem
The model in the eight-school problem is \[\begin{align*}
Y_i & \ind N(\alpha_i, \sigma_i^2), \quad i = 1, \ldots, 8 \\
\alpha_i & \iid N(\mu, \tau^2)\\
\pi(\mu, \tau) & \propto 1.
\end{align*}\] (It can be shown that the posterior is proper.)
Let \(\theta = (\alpha_1, \ldots, \alpha_8, \mu, \tau)\).
We consider the mean-field approximation \(q(\theta) = q(\alpha_1) \cdots q(\alpha_8) q(\mu) q(\tau)\).
Mean-field approximation: \(\alpha_j\)
Averaging over all the parameters other than \(\alpha_j\), we have \[
\E_{-\alpha_j} \log \pi(\theta \mid y)=-\frac{1}{2} \frac{\left(y_j-\alpha_j\right)^2}{\sigma_j^2}-\frac{1}{2} \E_{-\alpha_j}\left(\frac{1}{\tau^2}\right) \E_{-\alpha_j}\left(\left(\alpha_j-\mu\right)^2\right)+\text{const.}
\] which is a quadratic function of \(\alpha_j\).
The optimal \(q(\alpha_j)\) is a Gaussian distribution \[
q\left(\alpha_j\right)=N\left(\alpha_j \middle| \frac{\frac{1}{\sigma_j^2} y_j+\E_{-\alpha_j}\left(\frac{1}{\tau^2}\right) \E_{-\alpha_j}(\mu)}{\frac{1}{\sigma_j^2}+\E_{-\alpha_j}\left(\frac{1}{\tau^2}\right)}, \frac{1}{\frac{1}{\sigma_j^2}+\E_{-\alpha_j}\left(\frac{1}{\tau^2}\right)}\right).
\]
Mean-field approximation: \(\mu\)
Averaging over all the parameters other than \(\mu\), we have \[
\E_{-\mu} \log \pi(\theta \mid y) = -\frac{1}{2} \E_{-\mu}\left(\frac{1}{\tau^2}\right) \sum_{j=1}^8\left(\mu - \E_{-\mu} \left(\alpha_j\right)\right)^2+\text{const.}
\]
The optimal \(q(\mu)\) is also a Gaussian distribution \[
q(\mu)=N\left(\mu \middle| \frac{1}{8} \sum_{j=1}^8 \E_{-\mu}\left(\alpha_j\right), \frac{1}{8} \frac{1}{\E_{-\mu}\left(\frac{1}{\tau^2}\right)}\right).
\]
Mean-field approximation: \(\tau\)
Averaging over all the parameters other than \(\tau\), we have \[
\E_{-\tau} \log \pi(\theta \mid y) = -8 \log \tau -\frac{1}{2} \frac{1}{\tau^2} \sum_{j=1}^8 \E_{-\tau}\left(\left(\alpha_j-\mu\right)^2\right)+\text{const.}
\]
The optimal \(q(\tau^2)\) is an Inverse-Gamma distribution \[
q(\tau^2) = \text{Inv-Gamma}\left(\tau^2 \middle| \frac{7}{2}, \frac{1}{2} \sum_{j=1}^8 \E_{-\tau}\left(\left(\alpha_j-\mu\right)^2\right)\right).
\]
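Putting the three updates together yields a coordinate ascent variational inference (CAVI) algorithm. Below is a minimal sketch using the classic eight-school data; the initialization and the fixed number of iterations are arbitrary illustrative choices.

```python
import numpy as np

# Classic eight-school data: estimated effects and standard errors.
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
J = len(y)

# Variational parameters: q(alpha_j) = N(a_mean[j], a_var[j]),
# q(mu) = N(mu_mean, mu_var), q(tau^2) = Inv-Gamma(7/2, b).
a_mean, a_var = y.copy(), sigma**2
mu_mean, mu_var = y.mean(), 1.0
b = 1.0

for _ in range(100):
    # E[1/tau^2] under Inv-Gamma(7/2, b).
    e_inv_tau2 = 3.5 / b
    # Update q(alpha_j): precision-weighted combination of y_j and E(mu).
    prec = 1 / sigma**2 + e_inv_tau2
    a_mean = (y / sigma**2 + e_inv_tau2 * mu_mean) / prec
    a_var = 1 / prec
    # Update q(mu).
    mu_mean = a_mean.mean()
    mu_var = 1 / (J * e_inv_tau2)
    # Update q(tau^2): b = (1/2) sum_j E[(alpha_j - mu)^2], where
    # E[(alpha_j - mu)^2] = (E alpha_j - E mu)^2 + Var(alpha_j) + Var(mu).
    b = 0.5 * np.sum((a_mean - mu_mean)**2 + a_var + mu_var)

print("E[alpha]:", np.round(a_mean, 2))
print("E[mu] =", round(mu_mean, 2), "  E[tau^2] =", round(b / 2.5, 2))
```

Each update increases the ELBO, so the iterations converge to a local optimum; in practice one monitors the ELBO (or the variational parameters) and stops when the changes become negligible.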