\[
\newcommand{\mc}[1]{\mathcal{#1}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\renewcommand{\P}{\mathbb{P}}
\newcommand{\var}{{\rm Var}}  % Variance
\newcommand{\mse}{{\rm MSE}}  % MSE
\newcommand{\bias}{{\rm Bias}}  % Bias
\newcommand{\cov}{{\rm Cov}}  % Covariance
\newcommand{\iid}{\stackrel{\rm iid}{\sim}}
\newcommand{\ind}{\stackrel{\rm ind}{\sim}}
\renewcommand{\choose}[2]{\binom{#1}{#2}}  % Choose
\newcommand{\chooses}[2]{{}_{#1}C_{#2}}  % Small choose
\newcommand{\cd}{\stackrel{d}{\rightarrow}}
\newcommand{\cas}{\stackrel{a.s.}{\rightarrow}}
\newcommand{\cp}{\stackrel{p}{\rightarrow}}
\newcommand{\bin}{{\rm Bin}}
\newcommand{\ber}{{\rm Ber}}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\]
Let \(\pi\) be a discrete distribution. The Shannon entropy of \(\pi\) is \[ \mc{E}(\pi) = -\E_{\pi}[\log \pi(\theta)]. \]
Entropy is a measure of randomness: high entropy = high randomness
Entropy is the expectation of information: \(\mc{E}(\pi) = \E_{\pi}(I(X))\), where \(I(x)\) is the information of the event \(\{X = x\}\).
Roughly speaking, the information of an event \(E\) is the knowledge gained upon learning that \(E\) has occurred.
That is, the information \(I(A)\) of an event \(A\) should satisfy: it depends on \(A\) only through \(\P(A)\), it is decreasing in \(\P(A)\) (rarer events are more informative), and it is additive over independent events, i.e., \(I(A \cap B) = I(A) + I(B)\) whenever \(A\) and \(B\) are independent.
Shannon’s solution: \(I(A) = -\log \P(A)\) for \(\P(A) > 0\)
Let \(X\) be a discrete random variable with probability mass function \(\pi(x)\). The Shannon information of \(\pi\) is \[ I(x) \coloneqq I(\{X = x\}) = -\log \P(X = x) = -\log \pi(x). \]
The Shannon entropy of \(\pi\) is \(\mc{E}(\pi) = \E_{\pi}(I(X)) = -\E_{\pi}(\log \pi(X))\).
If the support of \(\pi(x)\) is a finite set, say \(\{x_1,\ldots, x_m\}\), then \(\mc{E}(\pi) \leq \log m\), with equality if and only if \(\pi\) is uniform, i.e., \(\pi(x_i) = \frac{1}{m}\) for all \(i\).
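As a quick numerical illustration of this bound, here is a minimal Python sketch; the helper `entropy` and the example pmfs are illustrative choices.

```python
import numpy as np

def entropy(pmf):
    """Shannon entropy -sum_x pi(x) log pi(x) of a discrete pmf (natural logarithm)."""
    pmf = np.asarray(pmf, dtype=float)
    nz = pmf > 0                         # convention: 0 * log 0 = 0
    return -np.sum(pmf[nz] * np.log(pmf[nz]))

m = 4
uniform = np.full(m, 1 / m)              # pi(x_i) = 1/m
skewed  = np.array([0.7, 0.2, 0.05, 0.05])

print(entropy(uniform), np.log(m))       # equal: the uniform pmf attains the bound log m
print(entropy(skewed))                   # strictly smaller than log m
```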
What if \(X\) is continuous? If \(\pi(x)\) is a probability density, the information \(-\log \pi(x)\) can be negative; for instance, if \(X \sim \text{Uniform}(0, 1/2)\), then \(\pi(x) = 2\) on the support and \(-\log \pi(x) = -\log 2 < 0\).
Cross entropy: \(\mc{E}(\pi, \pi_0) = - \E_{\pi_0}(\log \pi(X))\)
Relative entropy: \[ \mc{E}(\pi_0\| \pi) = \mc{E}(\pi, \pi_0) - \mc{E}(\pi_0) = \E_{\pi_0}\left[\log \frac{\pi_0(X)}{\pi(X)}\right] = D_{KL}(\pi_0\| \pi) \geq 0 \]
It is also called the Kullback–Leibler divergence.
A common choice of the reference distribution \(\pi_0\) is a non-informative distribution, e.g., the Lebesgue measure.
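The identities above can be checked numerically. The sketch below uses two arbitrary pmfs on a three-point support, chosen purely for illustration:

```python
import numpy as np

pi0 = np.array([0.5, 0.3, 0.2])   # reference ("true") distribution
pi  = np.array([0.4, 0.4, 0.2])   # candidate distribution on the same support

entropy_pi0   = -np.sum(pi0 * np.log(pi0))        # E(pi0)
cross_entropy = -np.sum(pi0 * np.log(pi))         # E(pi, pi0) = -E_{pi0}[log pi(X)]
kl            =  np.sum(pi0 * np.log(pi0 / pi))   # D_KL(pi0 || pi)

print(kl, cross_entropy - entropy_pi0)  # equal up to floating-point rounding
print(kl >= 0)                          # True (Gibbs' inequality)
```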
Some take-home messages: entropy is the expected information of an observation and quantifies its randomness; for continuous distributions the pointwise information \(-\log \pi(x)\) can be negative, so entropy is best measured relative to a reference distribution \(\pi_0\); and the relative entropy \(D_{KL}(\pi_0 \| \pi)\) is always nonnegative, vanishing only when \(\pi = \pi_0\).
The maximum entropy prior (MEP) (with respect to a reference distribution \(\pi_0\)) is the solution to the optimization problem \[ \max_{\pi \in \Gamma} \; -\mc{E}(\pi \| \pi_0) \qquad s.t. \quad \E_{\pi}\left[g_k(\theta)\right]=\omega_k, \quad k = 1, \ldots, K \tag{1}\] where \(\Gamma\) is a class of candidate priors; equivalently, the MEP minimizes the relative entropy \(D_{KL}(\pi \| \pi_0)\) subject to the constraints.
The existence of the solution depends on \(g_k\)’s, \(\Gamma\), and \(\pi_0\).
In this information-theoretic sense, the prior \(\pi\) maximizing the entropy is the one that brings the least prior information about \(\theta\).
When it exists, the solution to (1) under the moment constraints \[ \E_{\pi}\left[g_k(\theta)\right]=\omega_k, \quad k = 1, \ldots, K, \] is an exponential tilting of the reference distribution, with multipliers \(\lambda_k\) chosen so that the constraints hold:
\[ \pi^*(\theta)=\frac{\exp \left\{\sum_{k=1}^K \lambda_k g_k(\theta)\right\} \pi_0(\theta)}{\int \exp \left\{\sum_{k=1}^K \lambda_k g_k(\eta)\right\} \pi_0(\eta)d\eta}. \]
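To make the construction concrete, here is a small numerical sketch on a finite parameter space, assuming a uniform reference \(\pi_0\) and a single mean constraint \(\E_{\pi}[\theta] = 3\); the grid, the constraint value, and the function names are illustrative. The multiplier \(\lambda\) is found by solving the moment equation.

```python
import numpy as np
from scipy.optimize import brentq

theta = np.arange(11)            # finite parameter space {0, 1, ..., 10}
pi0   = np.full(11, 1 / 11)      # reference distribution: uniform
g     = theta.astype(float)      # single constraint function g(theta) = theta
omega = 3.0                      # required prior mean E_pi[theta]

def tilted(lam):
    """Exponentially tilted reference: pi(theta) proportional to exp(lam * g(theta)) * pi0(theta)."""
    w = pi0 * np.exp(lam * g)
    return w / w.sum()

# Choose lambda so that the moment constraint E_pi[g(theta)] = omega holds.
lam = brentq(lambda l: tilted(l) @ g - omega, -10, 10)
pi_star = tilted(lam)

print(lam, pi_star @ g)          # the constraint E_pi*[theta] = 3 is satisfied
```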
Definition 1 A class \(\mc{P}\) of prior distributions for \(\theta\) is called conjugate for a sampling model \(p(x \mid \theta)\) if \[\begin{align*} \pi(\theta) \in \mc{P} \Rightarrow \pi(\theta \mid x) \in \mc{P} . \end{align*}\]
Consider the exponential family \(f(x\mid\theta) = h(x)\exp(\theta^Tx - \psi(\theta))\).
Proposition 1 A conjugate family for \(f(x \mid \theta)\) is given by \[\begin{align*} \pi(\theta \mid \mu, \lambda)=K(\mu, \lambda) e^{\theta \cdot \mu-\lambda \psi(\theta)}, \end{align*}\] where \(K(\mu, \lambda)\) is the normalizing constant of the density. The corresponding posterior distribution is \(\pi(\theta \mid \mu+x, \lambda+1)\).
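A grid-based sanity check of Proposition 1 for the Poisson model written in natural form, \(\psi(\theta) = e^{\theta}\); the hyperparameters, the observation, and the grid are illustrative choices.

```python
import numpy as np

# Poisson in natural form: f(x | theta) = h(x) exp(theta * x - psi(theta)) with psi(theta) = exp(theta).
theta  = np.linspace(-5.0, 5.0, 2001)     # grid over the natural parameter
dtheta = theta[1] - theta[0]
mu, lam, x = 2.0, 1.0, 4                  # hypothetical hyperparameters and a single observation

def conj_density(mu, lam):
    """Conjugate prior pi(theta | mu, lam) proportional to exp(theta*mu - lam*psi(theta)), normalized on the grid."""
    d = np.exp(theta * mu - lam * np.exp(theta))
    return d / (d.sum() * dtheta)

prior      = conj_density(mu, lam)
likelihood = np.exp(theta * x - np.exp(theta))       # h(x) is constant in theta and cancels
posterior  = prior * likelihood
posterior /= posterior.sum() * dtheta

# Proposition 1: the posterior equals pi(theta | mu + x, lam + 1).
print(np.max(np.abs(posterior - conj_density(mu + x, lam + 1))))   # numerically zero
```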
For example, in logistic regression with binary response \(Y \in \{0,1\}\) and covariate vector \(x\), \[ \P_\beta(Y=1\mid X = x)=1-\P_\beta(Y=0 \mid X = x)=\frac{\exp \left(\beta^t x\right)}{1+\exp \left(\beta^t x\right)}. \]
Given independent observations \((x_1, y_1), \ldots, (x_n, y_n)\), the likelihood of \(\beta\) is \[ f\left(y_1, \ldots, y_n \mid x_1, \ldots, x_n, \beta\right)=\exp \left(\beta^t \sum_{i=1}^n y_i x_i\right) \prod_{i=1}^n\left(1+e^{\beta^t x_i}\right)^{-1}. \]
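This form of the likelihood can be verified against the product of Bernoulli probabilities on synthetic data; the data-generating values below are arbitrary.

```python
import numpy as np

rng  = np.random.default_rng(0)
n, d = 20, 3
X    = rng.normal(size=(n, d))                        # covariates x_1, ..., x_n
beta = np.array([0.5, -1.0, 2.0])
p    = 1 / (1 + np.exp(-X @ beta))                    # P(Y = 1 | x) = exp(b'x) / (1 + exp(b'x))
y    = rng.binomial(1, p)

# Likelihood written as in the display above ...
lik_expfam    = np.exp(beta @ (y[:, None] * X).sum(axis=0)) * np.prod(1 / (1 + np.exp(X @ beta)))
# ... and as a product of Bernoulli probabilities.
lik_bernoulli = np.prod(p**y * (1 - p)**(1 - y))

print(np.isclose(lik_expfam, lik_bernoulli))          # True
```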
Let \(f(x \mid \theta)\) be an exponential family and \(\pi(\theta \mid \lambda, \mu)\) be its conjugate prior.
Consider \(\pi(\theta)=\sum_{i=1}^N w_i \pi\left(\theta \mid \lambda_i, \mu_i\right)\).
Then the posterior is \[ \pi(\theta \mid x)=\sum_{i=1}^N w_i^{\prime}(x) \pi\left(\theta \mid \lambda_i+1, \mu_i+x\right), \] where the updated weights depend on the \(w_i\)'s and on the normalizing constant of \(\pi(\theta\mid\lambda,\mu)\): \(w_i^{\prime}(x) \propto w_i \, K(\mu_i, \lambda_i) / K(\mu_i+x, \lambda_i+1)\).
Mixtures can then be used as a basis to approximate any prior distribution.
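As an illustration with the familiar Beta-Binomial special case, the sketch below updates the mixture weights after observing \(x\) successes in \(n\) trials; the prior weights and hyperparameters are arbitrary. Each updated weight is proportional to the prior weight times that component's marginal likelihood of \(x\).

```python
import numpy as np
from scipy.special import betaln, comb

# Mixture-of-Betas prior for p in a Binomial(n, p) model:
# pi(p) = sum_i w_i Beta(p | a_i, b_i); each component is conjugate.
w = np.array([0.5, 0.5])          # prior mixture weights
a = np.array([2.0, 20.0])         # component shape parameters
b = np.array([20.0, 2.0])

n, x = 10, 7                      # observe x successes out of n trials

# Posterior components are Beta(a_i + x, b_i + n - x); the new weights are
# proportional to w_i times the Beta-Binomial marginal likelihood of x under component i.
log_marg = np.log(comb(n, x)) + betaln(a + x, b + n - x) - betaln(a, b)
w_post   = w * np.exp(log_marg)
w_post  /= w_post.sum()

print(w_post)                     # the component centered near p = 0.9 dominates
```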
If we are ignorant of the ways an event can occur, the event will occur equally likely in any way.
The Fisher information of the model \(f(x \mid \theta)\) is \[ I(\theta)=\E_\theta\left[\left(\frac{\partial \log f(X \mid \theta)}{\partial \theta}\right)^2\right] \stackrel{(*)}{=} -\E_\theta\left[\frac{\partial^2 \log f(X \mid \theta)}{\partial \theta^2}\right], \] where \((*)\) holds under standard regularity conditions (differentiation under the integral sign). For a scalar parameter, the Jeffreys prior is defined by \(\pi_J(\theta) \propto I(\theta)^{1/2}\).
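The identity \((*)\) can be checked by Monte Carlo; the sketch below uses a Poisson(\(\lambda\)) model, for which \(I(\lambda) = 1/\lambda\), with an arbitrary \(\lambda\) and sample size.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 3.0
x   = rng.poisson(lam, size=200_000)        # X ~ Poisson(lambda)

score   = x / lam - 1                       # d/d lambda of log f(x | lambda)
hessian = -x / lam**2                       # d^2/d lambda^2 of log f(x | lambda)

# Both sides of (*) estimate I(lambda) = 1 / lambda.
print(np.mean(score**2), -np.mean(hessian), 1 / lam)
```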
If \(X \sim \bin(n, p)\),
\[ \begin{aligned} f(x \mid p) & =\left(\begin{array}{c} n \\ x \end{array}\right) p^x(1-p)^{n-x}, \\ \frac{\partial^2 \log f(x \mid p)}{\partial p^2} & = - \frac{x}{p^2} - \frac{n-x}{(1-p)^2}, \\ I(p) & =n\left[\frac{1}{p}+\frac{1}{1-p}\right]=\frac{n}{p(1-p)} . \end{aligned} \]
Therefore, the Jeffreys prior for this model is \[ \pi_J(p) \propto[p(1-p)]^{-1 / 2} \] and is thus proper, since it is a \(\text{Beta}(1/2, 1/2)\) distribution.
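A one-line numerical check that the normalized Jeffreys prior coincides with the \(\text{Beta}(1/2, 1/2)\) density, using the fact that the normalizing constant is \(B(1/2, 1/2) = \pi\):

```python
import numpy as np
from scipy.stats import beta

p = np.linspace(0.001, 0.999, 999)
jeffreys_unnorm = (p * (1 - p)) ** -0.5                 # [p(1 - p)]^{-1/2}

# Dividing by B(1/2, 1/2) = pi gives exactly the Beta(1/2, 1/2) density.
print(np.max(np.abs(jeffreys_unnorm / np.pi - beta.pdf(p, 0.5, 0.5))))   # ~ 0
```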
For a multidimensional parameter \(\theta = (\theta_1, \ldots, \theta_k)\), the Fisher information matrix has entries \[ I_{i j}(\theta)=-\mathbb{E}_\theta\left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f(X \mid \theta)\right] \quad(i, j=1, \ldots, k). \]
The Jeffreys noninformative prior is then defined by \(\pi_J(\theta) \propto \sqrt{\det(I(\theta))}\).
If \(f(x \mid \theta)\) belongs to an exponential family, \(f(x \mid \theta)=h(x) \exp (\theta \cdot x-\psi(\theta))\), the Fisher information matrix is given by \(I(\theta)=\nabla \nabla^t \psi(\theta)\) and \[ \pi_J(\theta) \propto \det\left(\nabla \nabla^t \psi(\theta)\right)^{1 / 2}, \] which reduces to \(\left(\prod_{i=1}^k \psi_{i i}^{\prime \prime}(\theta)\right)^{1 / 2}\) when the information matrix is diagonal, where \(\psi_{i i}^{\prime \prime}(\theta)=\frac{\partial^2}{\partial \theta_i^2} \psi(\theta)\).
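A sketch of this computation for a toy two-parameter family, namely two independent Poisson counts in natural form with \(\psi(\theta)=e^{\theta_1}+e^{\theta_2}\); here the information matrix is diagonal, so the determinant and diagonal-product formulas agree. The finite-difference Hessian and step size are illustrative choices.

```python
import numpy as np

def psi(theta):
    """Cumulant function of two independent Poisson counts in natural form: psi(theta) = e^th1 + e^th2."""
    return np.exp(theta).sum()

def jeffreys(theta, h=1e-4):
    """Jeffreys prior up to proportionality: sqrt(det Hessian of psi), via central finite differences."""
    k = len(theta)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (psi(theta + e_i + e_j) - psi(theta + e_i - e_j)
                       - psi(theta - e_i + e_j) + psi(theta - e_i - e_j)) / (4 * h**2)
    return np.sqrt(np.linalg.det(H))

theta = np.array([0.3, -1.2])
# The Hessian here is diag(e^th1, e^th2), so sqrt(det) = exp((th1 + th2) / 2).
print(jeffreys(theta), np.exp(theta.sum() / 2))
```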
Jeffreys prior has several advantages: it is invariant under one-to-one reparameterizations of \(\theta\); it is derived automatically from the sampling model \(f(x \mid \theta)\), requiring no subjective input; and in many standard one-parameter models it yields posteriors with good frequentist properties.
Let \(X \sim f(x \mid \theta)\) and \(\theta=\left(\theta_1, \theta_2\right)\). Suppose \(\theta_1\) is the parameter of interest, and \(\theta_2\) is the nuisance parameter.
Let \(X_1, X_2 \iid N(\mu, \sigma^2)\) where \(\sigma\) is the parameter of interest and \(\mu\) is the unknown nuisance parameter.