In theory, the Metropolis algorithm (with an arbitrary proposal) works for any distribution in any dimension.
In practice, the Metropolis algorithm is inefficient for high-dimensional distributions.
Current research in MCMC focuses on finding good proposals that speed up the algorithm, together with theoretical guarantees, e.g., ensuring that the chain still converges to the target distribution.
In this lecture, we will introduce two popular MCMC algorithms: Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS).
These two algorithms are the ones used in the Stan software.
A few words about the acceptance rate
Ideally, we want the acceptance rate to be bounded away from both 0 and 1.
If the acceptance rate is too low, the Markov chain rarely moves.
If the acceptance rate is too high, the chain is likely taking very small steps and exploring only a small high-density region.
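As a quick diagnostic, the realized acceptance rate can be read off a stored chain by counting how often the state changed. A minimal sketch, assuming the chain is stored as rows of a NumPy array (for a continuous proposal, consecutive states coincide only after a rejection):

```python
import numpy as np

def acceptance_rate(chain):
    """Fraction of iterations in which the chain moved to a new state."""
    moved = np.any(chain[1:] != chain[:-1], axis=1)
    return np.mean(moved)
```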
Adaptive MCMC
In the original Metropolis algorithm, the proposal distribution is fixed.
This can be inefficient: after some exploration, we gain knowledge about the target distribution that could be used to improve the proposal distribution.
This idea is easy to implement, e.g., by reducing the variance of the Gaussian proposal every 100 iterations.
However, the theoretical analysis is complicated since the Markov chain is no longer time-homogeneous.
Can we ensure that the adaptive Markov chain still converges to the target distribution?
Optimal Gaussian proposal
Let \(\pi\) be a distribution supported on \(\Theta \subseteq \mathbb{R}^d\).
We want to sample from \(\pi\) using the Metropolis algorithm with a Gaussian proposal \(N_d(\theta^{(t-1)}, \Xi)\).
What is the optimal choice of \(\Xi\)?
What do we mean by optimal in MCMC? We want the fastest convergence of the Markov chain.
Roberts & Rosenthal (2001) showed that, under some conditions on \(\pi\), the optimal choice is \[
\Xi = \frac{2.38^2}{d} \Sigma_{\pi},
\] where \(\Sigma_{\pi}\) is the covariance matrix of \(\pi\). But we don’t know \(\Sigma_{\pi}\)!
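For example, in \(d = 4\) dimensions with \(\Sigma_{\pi} = I_4\), the rule gives \(\Xi = \frac{2.38^2}{4} I_4 \approx 1.42\, I_4\). The same scaling corresponds to the well-known optimal acceptance rate of approximately 0.234 in high dimensions.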
Adaptive Random Walk Metropolis
Algorithm:
At the \(t\)th iteration, compute \(\hat{\Sigma}_t\) using \(\theta^{(1)}, \ldots, \theta^{(t-1)}\).
Sample \(\theta^*\) from \(N_d\left(\theta^{(t-1)}, \frac{2.38^2}{d}\hat{\Sigma}_t\right)\).
Compute the acceptance ratio \(\rho(\theta^{(t-1)}, \theta^{*})=\min\left\{\frac{\pi(\theta^*)}{\pi(\theta^{(t-1)})}, 1\right\}\).
Set \[
\theta^{(t)}= \begin{cases}\theta^* & \text { with prob. } \rho(\theta^{(t-1)}, \theta^{*}) \\ \theta^{(t-1)} & \text { with prob. } 1-\rho(\theta^{(t-1)}, \theta^{*}).\end{cases}
\]
The simplest choice of \(\hat{\Sigma}_t\) is the sample covariance matrix of the past draws.
However, it is not a good choice in high dimensions.
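As a concrete illustration, here is a minimal sketch of adaptive random walk Metropolis for a toy two-dimensional Gaussian target; the target, the 200-iteration non-adaptive warm-up, and the small jitter added to \(\hat{\Sigma}_t\) are illustrative assumptions, not part of the algorithm above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: a correlated bivariate Gaussian, known only up to a constant.
d = 2
Sigma_target = np.array([[1.0, 0.8], [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma_target)

def log_target(theta):
    return -0.5 * theta @ Sigma_inv @ theta

T = 20_000
theta = np.zeros(d)
chain = np.empty((T, d))
accepted = 0

for t in range(T):
    if t < 200:
        # Warm-up phase: fixed isotropic proposal.
        Xi = 0.1 * np.eye(d)
    else:
        # Adaptive phase: plug the sample covariance of the past draws into
        # the Roberts--Rosenthal scaling; jitter keeps Xi positive definite.
        Xi = (2.38**2 / d) * np.cov(chain[:t].T) + 1e-6 * np.eye(d)
    proposal = rng.multivariate_normal(theta, Xi)
    # Accept/reject on the log scale for numerical stability.
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal
        accepted += 1
    chain[t] = theta

print("acceptance rate:", accepted / T)
```

In practice, \(\hat{\Sigma}_t\) is updated recursively rather than recomputed from scratch at every iteration, and the adaptation should be made diminishing, as discussed below.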
The adaptive RWM works well in practice, but the theoretical analysis is difficult.
Adaptation: the change of the proposal distribution over time.
For an adaptive MCMC algorithm to work, the adaptation must satisfy two requirements:
Diminishing adaptation: the adaptation must diminish over time.
Containment: the time to stationarity from \(X_t\) remains bounded if we fix the adaptation at time \(t\), i.e., stop changing the proposal distribution after time \(t\).
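For example, diminishing adaptation holds if we adapt at iteration \(t\) only with probability \(p_t \to 0\), or if the change made to the proposal parameters at iteration \(t\) shrinks to zero as \(t\) grows.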
Random walk Metropolis can fail badly when the target distribution is multi-modal because it uses proposals that are unrelated to the target distribution.
HMC makes better use of what is known about the target distribution by using the gradient of the log density.
HMC can only be used to sample from continuous distributions whose density can be evaluated (up to a normalizing constant) and whose gradient can be computed.
In particular, HMC cannot be used to sample from discrete distributions.
Although we did not show it in class, HMC indeed has \(\pi\) as its stationary distribution.
Betancourt (2017) also provides a nice introduction to HMC and NUTS.
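To make the idea concrete, here is a minimal sketch of one HMC transition using the leapfrog integrator; the standard Gaussian target, the step size `eps`, and the number of leapfrog steps are illustrative choices, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: standard Gaussian, so grad log pi(theta) = -theta.
def log_target(theta):
    return -0.5 * theta @ theta

def grad_log_target(theta):
    return -theta

def hmc_step(theta, eps=0.1, n_leapfrog=20):
    """One HMC transition: sample a momentum, simulate Hamiltonian
    dynamics with the leapfrog integrator, then accept/reject."""
    r = rng.standard_normal(theta.shape)             # momentum ~ N(0, I)
    theta_new, r_new = theta.copy(), r.copy()
    r_new += 0.5 * eps * grad_log_target(theta_new)  # half step for momentum
    for step in range(n_leapfrog):
        theta_new += eps * r_new                     # full step for position
        if step < n_leapfrog - 1:
            r_new += eps * grad_log_target(theta_new)
    r_new += 0.5 * eps * grad_log_target(theta_new)  # final half step
    # Metropolis correction for the discretization error of leapfrog.
    log_ratio = (log_target(theta_new) - 0.5 * r_new @ r_new
                 - log_target(theta) + 0.5 * r @ r)
    return theta_new if np.log(rng.uniform()) < log_ratio else theta

theta = np.zeros(5)
draws = np.empty((1000, 5))
for i in range(1000):
    theta = hmc_step(theta)
    draws[i] = theta
```

NUTS removes the need to hand-tune the number of leapfrog steps by building the trajectory adaptively and stopping when it starts to double back (make a "U-turn").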
Distributional Approximation
Motivation
We have been focusing on simulating samples from the posterior distribution to make inferences.
However, sampling-based approaches can be very inefficient for large problems, e.g., problems with hundreds or thousands of parameters.
In such cases, we might want to approximate the posterior distribution with a simpler distribution.
That is, we trade accuracy for efficiency.
The simplest approach is the modal approximation, which approximates the posterior by a normal distribution centered at its mode.
However, this approach is not very flexible and does not scale well to high-dimensional problems.
There are many other approaches, and we focus on one of them: variational inference.
Variational inference
Variational Bayes (VB) is an algorithmic framework for approximating joint distributions.
Suppose we want to approximate the posterior distribution \(\pi(\theta \mid y)\) with a simpler distribution \(q(\theta)\) by minimizing the Kullback-Leibler divergence \[
\text{KL}(q\|\pi) = \int \log\frac{q(\theta)}{\pi(\theta \mid y)} q(\theta)d\theta = \E_q\left[\log\frac{q(\theta)}{\pi(\theta \mid y)}\right].
\]
The difficulties are:
\(\pi(\theta \mid y)\) is available only up to a normalizing constant;
we cannot easily take posterior draws from \(\pi(\theta \mid y)\);
we cannot easily compute expectations of interest, \(\E(h(\theta) \mid y)\).
Hence we are not able to minimize the KL divergence directly.
Evidence Lower Bound (ELBO)
The marginal distribution of \(y\) is given by \[
p(y) = \int p(y, \theta)\, d\theta = \int p(y \mid \theta)\, \pi(\theta)\, d\theta.
\]
Then the evidence (log of marginal) is \[\begin{align*}
\log p(y) & = \log \int p(y, \theta) d\theta
= \log \int \frac{p(y, \theta)}{q(\theta)} q(\theta) d\theta\\
& = \log\left(\E_{q}\left[\frac{p(y, \theta)}{q(\theta)}\right]\right)
\geq \E_q\left[\log p(y, \theta)\right] - \E_q\left[\log q(\theta)\right].
\end{align*}\]
The last inequality follows from Jensen’s inequality, since the logarithm is concave.
The quantity \(\E_q\left[\log p(y, \theta)\right] - \E_q\left[\log q(\theta)\right]\) is called the evidence lower bound (ELBO).
Evidence Lower Bound (ELBO)
The KL divergence can be written as \[\begin{align*}
\text{KL}(q \| \pi) & = \E_q\left[\log\frac{q(\theta)}{\pi(\theta \mid y)}\right] \\
& = \E_q\left[\log q(\theta)\right] - \E_q\left[\log \pi(\theta \mid y)\right] \\
& = \E_q\left[\log q(\theta)\right] - \E_q\left[\log p(y \mid \theta)\pi(\theta)\right] + \log p(y)\\
& = - \text{ELBO}(q) + \log p(y).
\end{align*}\]
Hence minimizing the KL divergence is equivalent to maximizing the ELBO.
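Since the ELBO involves only expectations under \(q\) of the (unnormalized) joint \(p(y, \theta)\) and of \(\log q\), it can be estimated by Monte Carlo using draws from \(q\). A minimal sketch, assuming a one-dimensional Gaussian variational family \(q = N(m, s^2)\); the toy model in the usage example is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def elbo_estimate(log_joint, m, s, num_draws=10_000):
    """Monte Carlo estimate of ELBO(q) = E_q[log p(y, theta)] - E_q[log q(theta)]
    for q = N(m, s^2)."""
    theta = m + s * rng.standard_normal(num_draws)   # draws from q
    log_q = -0.5 * np.log(2 * np.pi * s**2) - (theta - m)**2 / (2 * s**2)
    return np.mean(log_joint(theta) - log_q)

# Toy model: y_i | theta ~ N(theta, 1), theta ~ N(0, 1).
# Constants in log p(y, theta) are dropped; they shift the ELBO
# by a constant and do not affect its maximizer.
y = np.array([0.5, 1.2, -0.3])
log_joint = lambda th: -0.5 * th**2 - 0.5 * ((y[:, None] - th)**2).sum(axis=0)
print(elbo_estimate(log_joint, m=0.35, s=0.5))
```

Maximizing such an estimate over \((m, s)\), e.g., with stochastic gradients, is the basic idea behind black-box variational inference.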
Class of approximate distributions
Now we need to choose a class \(\mc{Q}\) of distributions \(q(\theta)\) to approximate the posterior distribution \(\pi(\theta \mid y)\).
We want \(\mc{Q}\) to be:
flexible enough to approximate \(\pi(\theta \mid y)\) well;
simple enough to be tractable;
such that the ELBO can be computed easily.
The class \(\mc{Q}\) is often a parametric family, i.e., \[
\mc{Q} = \{q(\theta \mid \phi): \phi \in \Phi\},
\] where \(\phi\) is called the variational parameter.
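For example, \(\mc{Q}\) could be the family of multivariate Gaussian distributions \(q(\theta \mid \phi) = N_d(\theta \mid m, \Sigma)\) with variational parameter \(\phi = (m, \Sigma)\).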
Mean-field variational family
The simplest choice of \(\mc{Q}\) is the mean-field variational family: \[
\mc{Q}_{\text{MF}} = \left\{q(\theta \mid \phi) = \prod_{j=1}^d q_j(\theta_j \mid \phi_j): \phi = (\phi_1, \ldots, \phi_d) \in \Phi\right\}.
\]
The mean-field variational family assumes that the parameters are independent.
Unless the posterior actually factorizes, this family does not contain the true posterior distribution, since it cannot capture posterior dependence between parameters.
However, it is often used in practice because it is simple and easy to compute.
We can also partition \(\theta\) into \(K\) blocks and assume that parameters in different blocks are independent.
Maximizing ELBO
The ELBO is given by \(\text{ELBO}(q) = \E_q\left[\log p(y, \theta)\right] - \E_q\left[\log q(\theta)\right]\).
For \(q \in \mc{Q}_{\text{MF}}\), the (negative) entropy term is \[
\E_q\left[\log q(\theta)\right] = \E_q\left[\sum_{j=1}^d \log q_j(\theta_j \mid \phi_j)\right] = \sum_{j=1}^d \E_{j}\left[\log q_j(\theta_j \mid \phi_j)\right]
\] where \(\E_j\) is the expectation with respect to \(q_j\).
The first term is \[\begin{align*}
\E_q\left[\log p(y, \theta)\right] &= \int \log p(y, \theta) \prod_{j=1}^d q_j(\theta_j)\, d\theta_1\cdots d\theta_d \\
& = \int \E_{-j}\left[\log p(y, \theta)\right] q_j(\theta_j)\, d\theta_j,
\end{align*}\] where \(\E_{-j}\) denotes the expectation with respect to all \(q_k\) with \(k \neq j\).
We need to find \(q_1, \ldots, q_d\) that maximize the ELBO, viewed as a functional \(\mc{L}(q_1, \ldots, q_d)\), subject to the constraints \(\int q_j(\theta_j)\, d\theta_j = 1\).
Using Lagrange multipliers, taking the "functional derivative" with respect to \(q_j\) and setting it to zero, we obtain \[
\log q_j(\theta_j) = \E_{-j}\left[\log p(y, \theta)\right] + \text{const}.
\]
Hence the optimal \(q_j\) is \[
q^{*}_j(\theta_j) \propto \exp\left(\E_{-j}\left[\log p(y, \theta)\right]\right).
\]
Example: 8-school problem
The model in the eight-school problem is \[\begin{align*}
Y_i & \ind N(\alpha_i, \sigma_i^2), \quad i = 1, \ldots, 8 \\
\alpha_i & \iid N(\mu, \tau^2)\\
\pi(\mu, \tau) & \propto 1.
\end{align*}\] (It can be shown that the posterior is proper.)
Let \(\theta = (\alpha_1, \ldots, \alpha_8, \mu, \tau)\).
We consider the mean-field approximation \(q(\theta) = q(\alpha_1) \cdots q(\alpha_8) q(\mu) q(\tau)\).
Mean-field approximation: \(\alpha_j\)
Averaging over all the parameters other than \(\alpha_j\), we have \[
\E_{-\alpha_j} \log \pi(\theta \mid y)=-\frac{1}{2} \frac{\left(y_j-\alpha_j\right)^2}{\sigma_j^2}-\frac{1}{2} \E_{-\alpha_j}\left(\frac{1}{\tau^2}\right) \E_{-\alpha_j}\left(\left(\alpha_j-\mu\right)^2\right)+\text{const.}
\] which is a quadratic function of \(\alpha_j\).
The optimal \(q(\alpha_j)\) is a Gaussian distribution \[
q\left(\alpha_j\right)=N\left(\alpha_j \middle| \frac{\frac{1}{\sigma_j^2} y_j+\E_{-\alpha_j}\left(\frac{1}{\tau^2}\right) \E_{-\alpha_j}(\mu)}{\frac{1}{\sigma_j^2}+\E_{-\alpha_j}\left(\frac{1}{\tau^2}\right)}, \frac{1}{\frac{1}{\sigma_j^2}+\E_{-\alpha_j}\left(\frac{1}{\tau^2}\right)}\right).
\]
Mean-field approximation: \(\mu\)
Averaging over all the parameters other than \(\mu\), we have \[
\E_{-\mu} \log \pi(\theta \mid y) = -\frac{1}{2} \E_{-\mu}\left(\frac{1}{\tau^2}\right) \sum_{j=1}^8\left(\mu - \E_{-\mu} \left(\alpha_j\right)\right)^2+\text{const.}
\]
The optimal \(q(\mu)\) is also a Gaussian distribution \[
q(\mu)=N\left(\mu \middle| \frac{1}{8} \sum_{j=1}^8 \E_{-\mu}\left(\alpha_j\right), \frac{1}{8} \frac{1}{\E_{-\mu}\left(\frac{1}{\tau^2}\right)}\right).
\]
Mean-field approximation: \(\tau\)
Averaging over all the parameters other than \(\tau\), we have \[
\E_{-\tau} \log \pi(\theta \mid y) = -8 \log \tau -\frac{1}{2} \frac{1}{\tau^2} \sum_{j=1}^8 \E_{-\tau}\left(\left(\alpha_j-\mu\right)^2\right)+\text{const.}
\]
The optimal \(q(\tau^2)\) is an Inverse-Gamma distribution \[
q(\tau^2) = \text{Inv-Gamma}\left(\tau^2 \middle| \frac{7}{2}, \frac{1}{2} \sum_{j=1}^8 \E_{-\tau}\left(\left(\alpha_j-\mu\right)^2\right)\right).
\]
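Putting the three updates together yields a coordinate ascent variational inference (CAVI) algorithm. Below is a minimal sketch using the classic eight-school data; the initialization and the fixed number of iterations are arbitrary illustrative choices.

```python
import numpy as np

# Classic eight-school data: estimated effects and standard errors.
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
J = len(y)

# Variational parameters: q(alpha_j) = N(a_mean[j], a_var[j]),
# q(mu) = N(mu_mean, mu_var), q(tau^2) = Inv-Gamma(7/2, b).
a_mean, a_var = y.copy(), sigma**2
mu_mean, mu_var = y.mean(), 1.0
b = 1.0

for _ in range(100):
    # E[1/tau^2] under Inv-Gamma(7/2, b).
    e_inv_tau2 = 3.5 / b
    # Update q(alpha_j): precision-weighted combination of y_j and E(mu).
    prec = 1 / sigma**2 + e_inv_tau2
    a_mean = (y / sigma**2 + e_inv_tau2 * mu_mean) / prec
    a_var = 1 / prec
    # Update q(mu).
    mu_mean = a_mean.mean()
    mu_var = 1 / (J * e_inv_tau2)
    # Update q(tau^2): b = (1/2) sum_j E[(alpha_j - mu)^2], where
    # E[(alpha_j - mu)^2] = (E alpha_j - E mu)^2 + Var(alpha_j) + Var(mu).
    b = 0.5 * np.sum((a_mean - mu_mean)**2 + a_var + mu_var)

print("E[alpha]:", np.round(a_mean, 2))
print("E[mu] =", round(mu_mean, 2), "  E[tau^2] =", round(b / 2.5, 2))
```

Each update increases the ELBO, so the iterations converge to a local optimum; in practice one monitors the ELBO (or the variational parameters) and stops when the changes become negligible.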