Lecture 02: Introduction to Bayesian Statistics

Chun-Hao Yang

Beta-Binomial Model

Beta-Binomial model

  • Let \(X_1, \ldots, X_n \mid p \iid \ber(p)\).
  • Consider the prior \(p \sim \text{Beta}(\alpha, \beta)\) where \(\alpha\) and \(\beta\) are known.
  • The posterior distribution of \(p\) given \(X_1, \ldots, X_n\) is \[ p \mid X_1, \ldots, X_n \sim \text{Beta}\left(\alpha + \sum_{i=1}^n X_i, \beta + n - \sum_{i=1}^n X_i \right). \]

Derivation

  • Bayes Theorem: \[\begin{align*} \pi(p \mid x) & = \frac{f(x \mid p) \pi(p)}{\int f(x \mid p) \pi(p) dp} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal}}\\ & \propto f(x \mid p) \pi(p) = \text{likelihood} \times \text{prior} \end{align*}\]

  • The marginal (and other normalizing constants) can be ignored.

  • The likelihood is \[ f(x_1, \ldots, x_n \mid p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}. \]

  • The prior is \[ \pi(p) = \frac{1}{B(\alpha, \beta)} p^{\alpha-1} (1-p)^{\beta-1} \propto p^{\alpha-1}(1-p)^{\beta - 1}. \]

Derivation

  • Hence the posterior is \[\begin{align*} \pi(p \mid X_1, \ldots, X_n) & \propto p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i} \times p^{\alpha-1}(1-p)^{\beta - 1}\\ & = p^{\alpha + \sum_{i=1}^n x_i - 1}(1-p)^{\beta + n - \sum_{i=1}^n x_i - 1}. \end{align*}\]

  • Recognizing that this is the kernel of a Beta distribution, the posterior is \[ p \mid X_1, \ldots, X_n \sim \text{Beta}\left(\alpha + \sum_{i=1}^n X_i, \beta + n - \sum_{i=1}^n X_i \right). \]

  • It is called the Beta-Binomial model because the posterior depends on the data only through \(\sum X_i\), and \(\sum X_i \sim \bin(n,p)\).

Kernel of a pdf/pmf

  • The kernel is the form of the pdf or pmf in which any factors that are not functions of any of the variables in the domain are omitted.

  • Examples:

    • Normal: \(\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\)
    • Gamma: \(x^{\alpha-1}\exp(-\beta x)\)
    • Beta: \(x^{\alpha-1}(1-x)^{\beta-1}\)
    • Poisson: \(\frac{\lambda^x}{x!}\)
  • Working with kernels only (dropping normalizing constants) simplifies posterior computations.

Posterior Distribution

set.seed(2023)
n <- 15
p <- 0.3
X <- rbinom(n, 1, p)
s <- sum(X)
p_mle <- s/n # MLE
print(X)
 [1] 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
library(ggplot2)
# prior: Uniform(0, 1) = Beta(1, 1); posterior: Beta(s + 1, n - s + 1)
ggplot(data = data.frame(p = c(0, 1)), aes(p)) + 
    lims(x = c(0, 1), y = c(0, 5)) + 
    labs(x = "p", y = "Density") +
    geom_function(fun = dunif, aes(col = "blue")) +   # prior density
    geom_function(fun = dbeta, aes(col = "red"),      # posterior density
                  args = list(shape1 = s + 1, shape2 = n - s + 1)) + 
    geom_vline(xintercept = p_mle, linetype = "dashed", col = "darkgrey") +  # MLE
    geom_vline(xintercept = p, linetype = "dashed", col = "darkgreen") +     # true p
    scale_colour_manual(name = "Distribution", 
                        values = c("blue", "red"),
                        labels = c("Prior", "Posterior"))

Posterior Distribution

Posterior Inference

  • The posterior distribution of \(p\) contains much more information than the MLE.
  • We can use the posterior for:
    • Estimation: \(\hat{p} = \E(p \mid X_1, \ldots, X_n)\) (posterior mean)
    • Prediction: \(\P(X_{n+1} = 1 \mid X_1, \ldots, X_n)\)
    • Interval estimation: Find \((L, U)\) such that \(\P(L \leq p \leq U \mid X_1, \ldots, X_n) = 0.95\)
  • In the Beta-Binomial model, an estimate for \(p\) is \[ \hat{p} = \E(p \mid X_1, \ldots, X_n) = \frac{\alpha + \sum_{i=1}^n X_i}{\alpha + \beta + n} \] whereas the MLE is \(\hat{p}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^nX_i\).
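For the simulated data above (n = 15 with s = 3 successes), a quick sketch comparing the posterior mean under the uniform Beta(1, 1) prior with the MLE:

# posterior mean vs. MLE for the simulated data above (n = 15, s = 3 successes)
n <- 15; s <- 3
alpha <- 1; beta <- 1                         # uniform Beta(1, 1) prior
(alpha + s) / (alpha + beta + n)              # posterior mean = 4/17, about 0.235
s / n                                         # MLE = 0.2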

Maximum-a-posteriori Estimate

  • The posterior mean is not the only estimate we can obtain from the posterior.
  • Another commonly used estimator is the maximum a posteriori (MAP) estimate \[ \hat{p}_{\text{MAP}} = \argmax_{0 < p < 1} \pi(p \mid X_1, \ldots, X_n). \]
  • Since the mode of \(\text{Beta}(\alpha, \beta)\) is \(\frac{\alpha-1}{\alpha+\beta-2}\) when \(\alpha, \beta > 1\), \[ \hat{p}_{\text{MAP}} = \frac{\alpha -1 + \sum_{i=1}^n X_i}{\alpha + \beta + n - 2}. \]
  • If \(\alpha = \beta = 1\) and \(0 < \sum X_i < n\), then \(\hat{p}_{\text{MAP}} = \hat{p}_{\text{MLE}} = \bar{X}\).
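A short sketch for the same data, using an illustrative Beta(2, 2) prior (so that \(\alpha, \beta > 1\) and the mode formula applies), with a numerical check via optimize():

# MAP, posterior mean, and MLE for n = 15, s = 3 under an illustrative Beta(2, 2) prior
n <- 15; s <- 3
alpha <- 2; beta <- 2
(alpha + s - 1) / (alpha + beta + n - 2)      # MAP = 4/17, about 0.235
(alpha + s) / (alpha + beta + n)              # posterior mean = 5/19, about 0.263
s / n                                         # MLE = 0.2
# numerical check: maximize the Beta(alpha + s, beta + n - s) posterior density
optimize(function(p) dbeta(p, alpha + s, beta + n - s), c(0, 1), maximum = TRUE)$maximum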

Recall: Laplace’s Rule of Succession

Given binary iid random variables \(X_1, \ldots, X_n\) with \(\sum_{i=1}^n X_i = s\), then \[\P\left(X_{n+1}=1 \mid X_1+\cdots+X_n=s\right)=\frac{s+1}{n+2}.\]

Derivation:

  • Let \(X_1, \ldots, X_n \iid \ber(p)\) and \(p \sim \text{Beta}(1,1)\) (the uniform prior).
  • The posterior is \(p \mid X_1 + \cdots + X_n = s \sim \text{Beta}(s + 1, n-s + 1)\).
  • Assuming \(X_{n+1} \sim \ber(p)\) and \(X_i\)’s are iid conditioned on \(p\), \[\begin{align*} \P\left(X_{n+1}=1 \mid \sum X_i=s\right) & = \int \P(X_{n+1}=1 \mid p)\pi\left(p \mid \sum X_i = s\right) dp\\ & = \int p \pi\left(p \mid \sum X_i = s\right) dp = \frac{s+1}{n+2}. \end{align*}\]
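A quick numerical check of this identity, with illustrative values n = 10 and s = 7:

# posterior mean of Beta(s + 1, n - s + 1) should equal (s + 1)/(n + 2)
n <- 10; s <- 7
integrate(function(p) p * dbeta(p, s + 1, n - s + 1), 0, 1)$value   # 0.6666667
(s + 1) / (n + 2)                                                   # 0.6666667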

Interval Estimation: Credible Interval/Region

  • The frequentist counterpart is the confidence interval/region.
  • Let \([L(X), U(X)]\) be an interval for \(\theta\) based on sample \(X\).
  • \(100\times(1-\alpha)\%\) Bayesian Coverage: \(\P(L(x) \leq \theta \leq U(x) \mid {\color{magenta}X = x}) = 1-\alpha\).
    • describes your information about the location of the true value of \(\theta\) after you have observed \(X = x\)
  • \(100\times(1-\alpha)\%\) Frequentist Coverage: \(\P(L(X) \leq \theta \leq U(X) \mid {\color{magenta}\theta}) = 1-\alpha\)
    • describes the probability that the interval will cover the true value before the data are observed
  • There are many ways to construct a credible interval:
    • Quantile-based method
    • Highest posterior density (HPD) region

Quantile-based method

To find a \(100\times(1-\alpha)\%\) credible interval for \(\theta\):

  • Find numbers \(\theta_{\alpha / 2} < \theta_{1-\alpha / 2}\) such that

    1. \(\P\left(\theta<\theta_{\alpha/2} \mid X = x\right)=\alpha / 2\);
    2. \(\P\left(\theta>\theta_{1-\alpha / 2} \mid X = x\right)=\alpha / 2\).
  • The numbers \(\theta_{\alpha / 2}, \theta_{1-\alpha / 2}\) are the \(\alpha / 2\) and \(1-\alpha / 2\) posterior quantiles of \(\theta\), and so \[\begin{align*} \P\left(\theta \in\left[\theta_{\alpha / 2}, \theta_{1-\alpha / 2}\right] \mid X = x\right) & =1-\P\left(\theta \notin\left[\theta_{\alpha / 2}, \theta_{1-\alpha / 2}\right] \mid X = x\right) \\ & =1-\left[\P\left(\theta<\theta_{\alpha / 2} \mid X=x\right)\right.\\ & \qquad \left.+\P\left(\theta>\theta_{1-\alpha / 2} \mid X = x\right)\right] \\ & =1-\alpha. \end{align*}\]

Binomial Example

Suppose we observed \(X=2\) from a \(\bin(10, p)\) model. Assume a uniform prior on \(p\).

alpha <- 1; beta <- 1 # uniform prior
n <- 10; X <- 2 # data

qbeta(c(0.025, 0.975), alpha + X, beta + n - X)
[1] 0.06021773 0.51775585

Highest posterior density (HPD) region

Definition 1 A \(100 \times(1-\alpha) \%\) HPD region consists of a subset of the parameter space, \(s(x) \subset \Theta\) such that

  1. \(\P(\theta \in s(x) \mid X = x)=1-\alpha\);
  2. If \(\theta_a \in s(x)\), and \(\theta_b \notin s(x)\), then \(\pi\left(\theta_a \mid X=x\right)>\pi\left(\theta_b \mid X=x\right)\).
  • An HPD region might not be an interval if the posterior density is multimodal (having multiple peaks).

Binomial Example

Let’s compute the HPD region for the previous example: \(p \mid X = 2 \sim \text{Beta}(3, 9)\).

library(HDInterval)
hdi(qbeta, 0.95, shape1 = 3, shape2 = 9)
     lower      upper 
0.04055517 0.48372366 
attr(,"credMass")
[1] 0.95

The HPD is narrower than the quantile-based interval.

Fig. 3.6 in Hoff’s book

Wrap-up for the Beta-Binomial model

  • The Beta-Binomial model: \[\begin{align*} X \mid p & \sim \bin(n,p)\\ p & \sim \text{Beta}(\alpha,\beta)\\ p \mid X & \sim \text{Beta}(\alpha + X, \beta + n - X) \end{align*}\]
  • Note that the posterior is in the same family as the prior: both are in the Beta family.
  • In this case, the Beta prior is called a conjugate prior for the Binomial model.

Catalog for conjugate priors

| Sampling model | Parameter | Prior | Posterior |
|---|---|---|---|
| \(X \sim \bin(n, p)\) | \(p\) | \(\text{Beta}(\alpha_0, \beta_0)\) | \(\text{Beta}(\alpha_0 + X, \beta_0+n-X)\) |
| \(X \sim N(\mu, \sigma^2)\) | \(\mu\) | \(N(\mu_0, \sigma_0^2)\) | \(N\left(\frac{1}{\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}}\left(\frac{\mu_0}{\sigma_0^2}+\frac{X}{\sigma^2}\right),\left(\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}\right)^{-1}\right)\) |
| \(X \sim \text{Poisson}(\lambda)\) | \(\lambda\) | \(\text{Gamma}(\alpha_0, \beta_0)\) | \(\text{Gamma}(\alpha_0 + X, \beta_0+1)\) |
| \(X \sim \text{Gamma}(\alpha, \beta)\) | \(\beta\) | \(\text{Gamma}(\alpha_0, \beta_0)\) | \(\text{Gamma}(\alpha_0 + \alpha, \beta_0+X)\) |
| \(X \sim \text{NB}(r, p)\) | \(p\) | \(\text{Beta}(\alpha_0, \beta_0)\) | \(\text{Beta}(\alpha_0 + r, \beta_0+X)\) |
  • More can be found on Wikipedia.
  • Exercise: Derive the posterior for the Normal-Normal model.
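As a quick sanity check of one row of the catalog (the Poisson-Gamma pair, with illustrative values X = 4 and a Gamma(2, 1) prior in the shape-rate parameterization), a grid approximation of likelihood times prior should match the claimed posterior:

# grid approximation of the Poisson-Gamma posterior vs. the claimed Gamma(2 + 4, 1 + 1)
lambda <- seq(0.01, 20, by = 0.01)
unnorm <- dpois(4, lambda) * dgamma(lambda, shape = 2, rate = 1)   # likelihood x prior
post_grid <- unnorm / sum(unnorm * 0.01)                           # normalize on the grid
max(abs(post_grid - dgamma(lambda, shape = 6, rate = 2)))          # close to 0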

Real Data Example

  • In 2022, Keelung City recorded 856 male and 731 female births, a gender ratio of 1.1711 (probability of a male birth: 0.5394).
  • In the same year, Taiwan as a whole recorded 71,208 male and 66,205 female births, a gender ratio of 1.076 (probability of a male birth: 0.5183).
  • According to statistics, the natural gender ratio at birth worldwide is 1.052 (probability of a male birth: 0.5122).
  • Is Keelung's natural gender ratio at birth higher than the Taiwan average?

Real Data Example

library(tidyverse)
library(knitr)

m <- 856; f <- 731   # Keelung 2022: male and female births
# six priors: (nearly) the same prior mean, but increasing strength alpha + beta
prior_mean <- c(0.5, 0.5122, 0.5122, 0.5122, 0.5122, 0.5122)
alpha_plus_beta <- c(2, 2, 10, 100, 1000, 10000)
alpha <- prior_mean * alpha_plus_beta
beta <- alpha_plus_beta - alpha
data.frame(alpha, beta, prior_mean) |> 
    mutate(alpha_plus_beta = alpha + beta,
           post_mean = (alpha+m)/(alpha+beta+m+f),
           ratio = post_mean/(1-post_mean),
           post_int = paste0("[", round(qbeta(0.025, alpha + m, beta + f), 3),
                             ", ", round(qbeta(0.975, alpha + m, beta + f), 3), "]")) |>
    select(c(prior_mean, alpha_plus_beta, post_mean, ratio, post_int)) |>
    kable(format = "markdown", digits = 4,
          col.names = c("Prior mean $\\frac{\\alpha}{\\alpha+\\beta}$",
                        "$\\alpha + \\beta$",
                        "Post. mean", "Gender ratio",
                        "Post. 95% Interval"))

Real Data Example

| Prior mean \(\frac{\alpha}{\alpha+\beta}\) | \(\alpha + \beta\) | Post. mean | Gender ratio | Post. 95% Interval |
|---|---|---|---|---|
| 0.5000 | 2 | 0.5393 | 1.1708 | [0.515, 0.564] |
| 0.5122 | 2 | 0.5393 | 1.1708 | [0.515, 0.564] |
| 0.5122 | 10 | 0.5392 | 1.1702 | [0.515, 0.564] |
| 0.5122 | 100 | 0.5378 | 1.1634 | [0.514, 0.562] |
| 0.5122 | 1000 | 0.5289 | 1.1226 | [0.51, 0.548] |
| 0.5122 | 10000 | 0.5159 | 1.0658 | [0.507, 0.525] |

Some follow-up questions

  • Is the sample size (856+731=1587) large enough?
  • Which prior? It depends on how strong your prior belief is.
    • No prior knowledge \(\Rightarrow\) Uniform prior
    • Reliable prior information \(\Rightarrow\) prior with larger \(\alpha+\beta\)
  • With prior information, we can obtain better estimates when the sample size is small:
    • For example, think of a really small village in which only two boys and one girl were born last year.
    • If we don't use any prior information, the estimated gender ratio is 2 (see the short example after this list).
  • What if our prior information cannot be described by a Beta distribution?
    • Ex: our prior information indicates a multimodal distribution
    • The posterior will be complicated and we need some other methods for posterior inference.
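Here is the small-village example as a short sketch, comparing the no-prior (MLE) estimate with the posterior mean under an informative prior centred at the global value 0.5122 (the prior strength \(\alpha + \beta = 100\) is only illustrative):

m <- 2; f <- 1                                    # a very small village: 2 boys, 1 girl
p_mle <- m / (m + f)
p_mle / (1 - p_mle)                               # no prior: estimated gender ratio = 2
alpha <- 100 * 0.5122; beta <- 100 - alpha        # informative prior with mean 0.5122
p_post <- (alpha + m) / (alpha + beta + m + f)    # posterior mean
p_post / (1 - p_post)                             # about 1.07, pulled toward the global ratio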

Small Area Estimation (SAE)

  • The previous example is an instance of an SAE problem.

Small area estimation is any of several statistical techniques involving the estimation of parameters for small sub-populations, generally used when the sub-population of interest is included in a larger survey.

  • There is a hierarchical structure in SAE problems:

\[ \text{Keelung} \subset \text{Taiwan} \subset \text{East Asia} \subset \text{Asia} \subset \text{World} \]

  • One of the common frequentist models for SAE problems is the random effect model.

Small Area Estimation

  • Bayesians typically use a hierarchical model to handle an SAE problem.

  • In this example, (conceptually) \[\begin{align*} X \mid p & \sim \bin\left(n, p\right)\\ p \mid p_{\text{TW}} & \sim \pi_1\\ p_{\text{TW}} \mid p_{\text{World}} & \sim \pi_2 \end{align*}\]

  • Notations:

    • \(R = \frac{p}{1-p}\) is Keelung's gender ratio (a parameter);
    • \(R_{\text{TW}} = \frac{p_{\text{TW}}}{1-p_{\text{TW}}}\) is Taiwan's gender ratio (a parameter);
    • \(p_{\text{World}} = 0.5122\) is treated as known.
  • This model is more complicated than the previous one, but describes the data more accurately (a small simulation sketch follows).
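The slides leave \(\pi_1\) and \(\pi_2\) unspecified; the following is only a simulation sketch of the hierarchy under one hypothetical choice (Beta distributions centred at the level above, with made-up concentration parameters 2000 and 500):

# hypothetical hierarchy: p_TW ~ Beta centred at p_World, p ~ Beta centred at p_TW
set.seed(2023)
p_world   <- 0.5122
p_tw      <- rbeta(1e5, 2000 * p_world, 2000 * (1 - p_world))   # pi_2 (assumed form)
p_keelung <- rbeta(1e5, 500 * p_tw, 500 * (1 - p_tw))           # pi_1 (assumed form)
# induced prior on Keelung's gender ratio R = p / (1 - p), before seeing the 2022 data
quantile(p_keelung / (1 - p_keelung), c(0.025, 0.5, 0.975))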

Why Bayesian

What is good and bad about the Bayesian approach?

  • Good:
    • consistent and coherent: everything (sample or parameter) is a random variable
    • everything is conditional: we only care about conditional independence rather than marginal independence
    • stopping rule does not matter
  • Bad:
    • the choice of prior is subjective
    • computationally challenging

Exchangeability

Definition 2 Let \(p\left(x_1, \ldots, x_n\right)\) be the joint density of \(X_1\), \(\ldots, X_n\). If \(p\left(x_1, \ldots, x_n\right)=p\left(x_{\pi(1)}, \ldots, x_{\pi(n)}\right)\) for all permutations \(\pi\) of \(\{1, \ldots, n\}\), then \(X_1, \ldots, X_n\) are exchangeable.

  • Roughly speaking, \(X_1, \ldots, X_n\) are exchangeable if the subscript labels convey no information about the outcomes.
  • Clearly, i.i.d. random variables are exchangeable, but the converse is false.
  • What is the relationship between conditional independence and exchangeability?

Conditional independence and exchangeability

Proposition 1 If \(\theta \sim p(\theta)\) and \(X_1, \ldots, X_n\) are conditionally i.i.d. given \(\theta\), then marginally (unconditionally on \(\theta\)), \(X_1, \ldots, X_n\) are exchangeable.

Proof. Suppose \(X_1, \ldots, X_n\) are conditionally iid given some unknown parameter \(\theta\). Then for any permutation \(\pi\) of \(\{1, \ldots, n\}\) and any set of values \(\left(x_1, \ldots, x_n\right) \in\) \(\mc{X}^n\) \[\begin{align*} p\left(x_1, \ldots, x_n\right) & =\int p\left(x_1, \ldots, x_n \mid \theta\right) p(\theta) d \theta & & \text { (definition of marginal probability) } \\ & =\int\left\{\prod_{i=1}^n p\left(x_i \mid \theta\right)\right\} p(\theta) d \theta & & \text { ($X$'s are conditionally i.i.d.) } \\ & =\int\left\{\prod_{i=1}^n p\left(x_{\pi(i)} \mid \theta\right)\right\} p(\theta) d \theta & & \text { (product does not depend on order) } \\ & =p\left(x_{\pi(1)}, \ldots x_{\pi(n)}\right) & & \text { (definition of marginal probability) } . \end{align*}\]
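A small numerical illustration of the proposition, using the Beta-Bernoulli marginal \(p(x_1, \ldots, x_n) = B(\alpha + s, \beta + n - s)/B(\alpha, \beta)\) with \(s = \sum x_i\) (the prior Beta(2, 3) is chosen only for illustration):

# marginal probability of a binary sequence under a Beta(2, 3) prior:
# it depends only on s = sum(x), so permuted sequences get the same probability
marg <- function(x, a = 2, b = 3) {
  s <- sum(x); n <- length(x)
  beta(a + s, b + n - s) / beta(a, b)
}
marg(c(1, 1, 0, 0))    # same value ...
marg(c(0, 1, 0, 1))    # ... for any rearrangement of two 1s and two 0s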

de Finetti’s Theorem

  • We have seen that
\[\begin{align*} \left.\begin{array}{l} X_1, \ldots, X_n \mid \theta \text { i.i.d } \\ \theta \sim p(\theta) \end{array}\right\} \Rightarrow X_1, \ldots, X_n \text { are exchangeable. } \end{align*}\]
  • What about an arrow in the other direction?

Theorem 1 (de Finetti) Let \(X_i \in \mc{X}\) for all \(i \in\{1,2, \ldots\}\). Suppose that, for any \(n\), \(X_1, \ldots, X_n\) are exchangeable. Then our model can be written as \[\begin{align*} p\left(x_1, \ldots, x_n\right)=\int\left\{\prod_{i=1}^n p\left(x_i \mid \theta\right)\right\} p(\theta) d \theta \end{align*}\] for some parameter \(\theta\), some prior distribution on \(\theta\), and some sampling model \(p(x \mid \theta)\). The prior and sampling model depend on \(p\left(x_1, \ldots, x_n\right)\).

de Finetti’s Theorem

  • The conclusion is
\[\begin{align*} \left.\begin{array}{l} X_1, \ldots, X_n \mid \theta \text { are i.i.d. } \\ \theta \sim p(\theta) \end{array}\right\} \Leftrightarrow X_1, \ldots, X_n \text { are exchangeable for all } n \text {. } \end{align*}\]
  • This justifies the use of prior distributions when samples are exchangeable.

  • When is the condition “\(X_1, \ldots, X_n\) are exchangeable for all \(n\)” reasonable?

    • \(X_1, \ldots, X_n\) are outcomes of a repeatable experiment;
    • \(X_1, \ldots, X_n\) are sampled from a finite population with replacement;
    • \(X_1, \ldots, X_n\) are sampled from an infinite population without replacement.

Stopping Rule

Let \(\theta\) be the probability of a particular coin landing on heads, and suppose we want to test the hypotheses \[\begin{align*} H_0: \theta=1 / 2, \quad H_1: \theta>1 / 2 \end{align*}\] at a significance level of \(\alpha=0.05\). Suppose we observe the following sequence of flips: \[ \text{heads, heads, heads, heads, heads, tails (5 heads, 1 tail)} \]

  • To perform a frequentist hypothesis test, we must define a random variable to describe the data.
  • The proper way to do this depends on exactly which of the following two experiments was actually performed:

Stopping Rule

Suppose the experiment is “Flip six times and record the results.”

  • \(X\) counts the number of heads, and \(X \sim \bin(6, \theta)\).
  • The observed data was \(x=5\), and the \(p\)-value of our hypothesis test is \[\begin{align*} p\text{-value} & =\P_{\theta=1 / 2}(X \geq 5) \\ & =\P_{\theta=1 / 2}(X=5)+\P_{\theta=1 / 2}(X=6) \\ & =\frac{6}{64}+\frac{1}{64}=\frac{7}{64}=0.109375>0.05 . \end{align*}\] So we fail to reject \(H_0\) at \(\alpha=0.05\).

Stopping Rule

Suppose now the experiment is “Flip until we get tails.”

  • \(X\) counts the number of the flip on which the first tails occurs, and \(X \sim \text{Geometric}(1-\theta)\).
  • The observed data was \(x=6\), and the p-value of our hypothesis test is \[\begin{align*} p \text{-value} & = \P_{\theta=1 / 2}(X \geq 6) \\ & =1-\P_{\theta=1 / 2}(X<6) \\ & =1-\sum_{x=1}^5 \P_{\theta=1 / 2}(X=x) \\ & =1-\left(\frac{1}{2}+\frac{1}{4}+\frac{1}{8}+\frac{1}{16}+\frac{1}{32}\right)=\frac{1}{32}=0.03125<0.05 \end{align*}\] So we reject \(H_0\) at \(\alpha=0.05\).
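Both p-values can be computed directly in R (note that R's pgeom counts the number of failures before the first success, so x = 6 flips corresponds to 5 heads before the first tail):

# Experiment 1: flip 6 times, X = number of heads ~ Bin(6, 1/2), observed x = 5
1 - pbinom(4, size = 6, prob = 0.5)   # P(X >= 5) = 7/64, about 0.109
# Experiment 2: flip until the first tail; 5 heads before the first tail
1 - pgeom(4, prob = 0.5)              # P(X >= 6) = 1/32 = 0.03125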

Stopping Rule

  • The result of our hypothesis test depends on whether we would have stopped flipping had we gotten tails sooner.
  • The tests are dependent on what we call the stopping rule.
  • The likelihood for the actual value of \(x\) that was observed is the same for both experiments (up to a constant): \[\begin{align*} p(x \mid \theta) \propto \theta^5(1-\theta) . \end{align*}\]
  • A Bayesian approach would take the data into account only through this likelihood.
  • Homework: Show that under a Beta prior, the posteriors under the two stopping rules are the same.

More importantly …