Lecture 01: Course Introduction and Review

Chun-Hao Yang

Syllabus

Course Description

  • Statistical analysis is an inversion process: retrieving the causes (models/parameters) from the effects (observations/sample/data)
  • Bayes’ Theorem: If \(A\) and \(E\) are events such that \(\P(E) \neq 0\), \[\P(A \mid E)=\frac{\P(E \mid A) \P(A)}{\P(E \mid A) \P(A)+\P\left(E \mid A^c\right) \P\left(A^c\right)}.\]
  • Let \(A\) be causes and \(E\) be effects. Bayes’ Theorem gives the relationship between \(\P(A\mid E)\) (statistical inference) and \(\P(E\mid A)\) (phenomenon).
  • This course will focus on statistical analysis under the Bayesian paradigm.

Course Description

  • Adapting Bayes’ Theorem to a parametric problem,

\[\pi(\theta|x) = \frac{p(x|\theta)\pi(\theta)}{\int p(x|\theta^{\prime})\pi(\theta^{\prime})d\theta^{\prime}}.\]

\[\begin{align*} \theta & \longleftrightarrow \text{parameter}\\ x & \longleftrightarrow \text{data}\\ \pi(\theta) & \longleftrightarrow \textcolor{magenta}{\text{prior distribution}}\\ p(x|\theta) & \longleftrightarrow \text{likelihood/model}\\ \pi(\theta|x) & \longleftrightarrow \textcolor{magenta}{\text{posterior distribution}}\\ \end{align*}\]

  • Key components: priors and posteriors

Course Description

A general procedure for Bayesian statistical analysis:

  1. Choose a class \(\mc{M}\) of models for the sample \(x\);
  2. Choose a prior distribution over \(\mc{M}\);
  3. Use Bayes’ Theorem to compute the posterior distribution of models given the sample;
  4. Pick a “good” model based on the posterior distribution.
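
As a toy numerical illustration of these four steps (a coin-flipping sketch of my own in R, not from the lecture), one can place a discrete prior on a grid of parameter values and apply Bayes’ Theorem directly:

# Step 1: model class -- X ~ Binomial(10, theta), with theta on a grid in (0, 1)
theta <- seq(0.01, 0.99, by = 0.01)
# Step 2: a (discrete uniform) prior over the grid
prior <- rep(1 / length(theta), length(theta))
# Step 3: Bayes' Theorem -- posterior proportional to likelihood * prior
x <- 7                                      # observed number of successes
lik <- dbinom(x, size = 10, prob = theta)
posterior <- lik * prior / sum(lik * prior)
# Step 4: pick a "good" model, e.g., the posterior mode or the posterior mean
theta[which.max(posterior)]                 # posterior mode
sum(theta * posterior)                      # posterior mean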

Topics

  1. Principles for the choice of priors
  2. Inference based on the posterior
  3. Bayesian analysis for different problems:
    • estimation
    • hypothesis testing
    • regression
    • prediction
    • nonparametric models
    • model comparison

Course Materials

Reading abbreviations used in the schedule:

  • Hoff: P. D. Hoff, A First Course in Bayesian Statistical Methods
  • BC: C. P. Robert, The Bayesian Choice
  • BDA: A. Gelman et al., Bayesian Data Analysis

Course Schedule

Week | Date  | Topics                                                     | Reading
1    | 9/5   | Introduction & History of Bayes’ Theorem                   |
2    | 9/12  | One-parameter Models; Conjugate Priors                     | Hoff Ch. 1-3, BC Ch. 1
3    | 9/19  | Prior Information and Prior Distribution                   | BC Ch. 3
4    | 9/26  | Decision Theory and Bayesian Estimation                    | BC Ch. 2, 4
5    | 10/3  | Connections to non-Bayesian Analysis; Hierarchical Models  | BDA Ch. 4, 5
6    | 10/10 | No class (National Holiday)                                |
7    | 10/17 | Testing and Model Comparison                               | BC Ch. 5, 7, BDA Ch. 6, 7
8    | 10/24 | Project Proposal                                           |
9    | 10/31 | Metropolis-Hastings algorithms; Gibbs sampler              | BDA Ch. 10-11
10   | 11/7  | Hamiltonian Monte Carlo; Variational Inference             | BDA Ch. 12-13
11   | 11/14 | Bayesian regression                                        | BDA Ch. 14
12   | 11/21 | Generalized Linear Models; Latent Variable Model           | BDA Ch. 16, 18
13   | 11/28 | Empirical Bayes                                            | BC Ch. 10
14   | 12/5  | Bayesian Nonparametrics                                    | BDA Ch. 21, 23
15   | 12/12 | Final Project Presentation                                 |
16   | 12/19 | Final Project Presentation                                 |

Grading

  • Homework (3-4 assignments): 30%
  • Oral Presentation (project proposal and final presentation): 40%
  • Written Report: 30%

Contact information

  • Office: Room 602, Cosmology Hall
  • E-mail:
  • Office Hours: Tue. 3-5pm or by appointment

Final Project

  • A group of 2-3 students
  • Choose a dataset and find a problem that you can answer with it.
  • Figure out how to perform a legitimate statistical analysis based on (but not limited to) what we learn in this course.
  • An oral presentation of your results (20 mins per group)
  • A written report (8 pages) including:
    • description of the dataset
    • statistical problem
    • model and inference
    • results and conclusion
    • references
  • More details will be announced after the midterm exam.

Prerequisites

  • Basic calculus and linear algebra
  • Mathematical statistics (undergrad level is fine)
  • Familiarity with some data analytic programming language (R/Python/MATLAB)
  • Experience with data analysis is not required but will be helpful

Example: 8-school dataset

A study was performed for the ETS to analyze the effects of special coaching programs on SAT scores. Eight schools with their own coaching programs were involved in this study.

School | Effect | Std. Error
A      |  28.39 | 14.9
B      |   7.94 | 10.2
C      |  -2.75 | 16.3
D      |   6.82 | 11.0
E      |  -0.64 |  9.4
F      |   0.63 | 11.4
G      |  18.01 | 10.4
H      |  12.16 | 17.6

Example: 8-school dataset

What can we ask about this dataset?

  • Is there any difference between the coaching programs?
  • Which program is considered effective?
  • Which program is the most effective?
  • How much improvement in SAT scores can we expect from taking a coaching program?

Step 1 & 2: Determine the likelihood and prior

  • Consider \[\begin{align*} Y_{i} \mid \theta_i, \sigma_i^2 & \ind N(\theta_i, \sigma_i^2), i = 1,\ldots, 8\\ \theta_i \mid \mu, \tau & \ind N(\mu, \tau^2) \end{align*}\] where the \(Y_i\)’s are the observed average treatment effects and the \(\sigma_i^2\)’s are the (known) variances of these estimated effects.
  • We are interested in estimating the \(\theta_i\)’s, which are the effects of the coaching programs.

Step 4: Inference based on the posterior

data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // standard errors of the effects
}
parameters {
  real mu;                 // population mean effect
  real<lower=0> tau;       // between-school standard deviation
  real theta[J];           // school-level effects
}
model {
  theta ~ normal(mu, tau); // prior
  y ~ normal(theta, sigma); // likelihood
  // mu and tau get Stan's default (improper) uniform priors on their supports
}

Step 4: Inference based on the posterior

library(rstan)
# compile the Stan program above (assuming it is saved as "schools.stan")
school_model <- rstan::stan_model("schools.stan")
schools_dat <- list(J = 8,
    y = c(28, 8, -3, 7, -1, 1, 18, 12),
    sigma = c(15, 10, 16, 11, 9, 11, 10, 18))
fit <- rstan::sampling(school_model, data = schools_dat,
    chains = 4, iter = 11000, warmup = 1000, thin = 10)

Step 4: Inference based on the posterior
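
A minimal sketch of how the resulting posterior draws can be summarized; print and extract are standard rstan functions, but the particular summaries below are my own illustration:

print(fit, pars = c("mu", "tau", "theta"), probs = c(0.025, 0.5, 0.975))
draws <- rstan::extract(fit)                # posterior draws as a named list
colMeans(draws$theta)                       # posterior means of the school effects
mean(draws$theta[, 1] > draws$theta[, 3])   # Pr(theta_A > theta_C | data)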

Stan: Statistical Computing Platform

  • Provides automated Bayesian computation
  • Interfaces with R/Python/MATLAB/…

A brief history of Bayesian statistics

Who is the father of statistics?

Some important statisticians:

  • Carl Friedrich Gauss (1777 - 1855): Gaussian distribution, …
  • Ronald A. Fisher (1890 - 1962): likelihood based inference, …
  • Karl Pearson (1857 - 1936): Pearson’s correlation coefficient, …
  • William Sealy Gosset (1876 - 1937): \(t\)-test and \(t\) distribution, …
  • Jerzy Neyman (1894 - 1981): Neyman-Pearson Lemma, …
  • Pierre-Simon Laplace (1749 - 1827): Central Limit Theorem, …
  • Thomas Bayes (1702 - 1761): Bayes’ Theorem, …

Thomas Bayes

  • British Reverend Thomas Bayes (1702 - 1761)
  • Worked on the inverse probability problem
  • Discovered a relationship between causes and observations (1746 - 1749), which was later called Bayes’ Theorem
  • Wrote “An Essay towards solving a Problem in the Doctrine of Chances”
  • His philosopher friend, Richard Price, edited and published his results posthumously.

Bayes’ Thought Experiment

Pierre-Simon Laplace

  • A French mathematician (1749 - 1827)
  • Discovered the same rule in 1774, independently of Bayes
  • Using conditional probability, he wrote down the rule as \[\begin{align*} \P(A \mid E)=\frac{\P(E \mid A) \P(A)}{\P(E \mid A) \P(A)+\P\left(E \mid A^c\right) \P\left(A^c\right)}. \end{align*}\]
  • Proposed to use equal probabilities as the prior, i.e., the uniform prior

Laplace’s Rule of Succession

  • If we independently repeat an experiment that can result in either success or failure \(n\) times and observe \(s\) successes (and \(n-s\) failures), what is the probability that the next repetition succeeds?
  • Laplace’s Answer: \[\P\left(X_{n+1}=1 \mid X_1+\cdots+X_n=s\right)=\frac{s+1}{n+2}\]
  • This answer is obtained from the Beta-Binomial model.
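
Sketch of the derivation (a standard Beta-Binomial calculation): with a uniform prior \(p \sim \text{Beta}(1,1)\), observing \(s\) successes in \(n\) independent trials yields the posterior \(p \mid \text{data} \sim \text{Beta}(s+1, n-s+1)\), and therefore

\[\P\left(X_{n+1}=1 \mid X_1+\cdots+X_n=s\right) = \E\left(p \mid X_1+\cdots+X_n=s\right) = \frac{s+1}{(s+1)+(n-s+1)} = \frac{s+1}{n+2}.\]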

Pierre-Simon Laplace

  • Laplace was not satisfied with the uniform prior, and neither were other mathematicians.
  • However, he still applied Bayes’ Theorem to solve many practical problems:
    • Estimating the French population
    • Birth and census studies
    • Credibility of witnesses in court
  • In 1810, he announced the Central Limit Theorem.
  • At the age of 62, he turned to frequentist-based approaches. Why?

Decline of Bayesian Statistics

  • The subjective component, the prior distribution, was heavily criticized by mathematicians and theorists.
  • For large datasets, Bayesian and frequentist analyses produce almost the same results.
  • Datasets became more reliable, and frequentist approaches were easier to implement.
  • Bayesian approaches were computationally challenging, even for simple models.
  • Researchers tended to design their own experiments to answer industrial/scientific questions.

Sir Ronald A. Fisher

  • British statistician/mathematician (1890 - 1962)
  • Tea and milk experiment (the “lady tasting tea”)
  • Father of modern statistics and experimental design
  • Problem → Experiment → Data → Analysis
  • Developed a collection of statistical methods: MLE, ANOVA, Fisher information, sufficient statistics, etc.

Revival of Bayesian Statistics

  • For clean, reliable data, frequentist methods work well.
  • Through the use of prior information, Bayesian approaches are more powerful and flexible when handling complex datasets.
  • Advances in computing technologies (e.g., MCMC) made Bayesian computation practical.
  • Readings:
    • Lindley, D. V. (1975). The future of statistics: A Bayesian 21st century. Advances in Applied Probability, 7, 106-115.
    • Efron, B. (1986). Why isn’t everyone a Bayesian? The American Statistician, 40(1), 1-5.
    • Efron, B. (1998). R. A. Fisher in the 21st century. Statistical Science, 95-114.

What is probability?

  • Frequentist: Probability is the limit of relative frequency as the experiment repeats infinitely.
  • Bayesian: Probability reflects one’s belief.
  • However, it doesn’t matter how you interpret probability, because both interpretations obey the same mathematical rules (the probability axioms reviewed below).

The Theory That Would Not Die

  • For more interesting stories about Bayesian statistics, check the book The Theory That Would Not Die by Sharon Bertsch McGrayne.

Review

Probability

  • The value \(\P(E)\) is the probability of the occurrence of event \(E\).
  • Let \(\Omega\) be the sample space, i.e., the space containing all possible outcomes.
  • A probability \(\P\) on \(\Omega\) satisfies:
    • \(\P(\Omega) = 1\)
    • \(\P(E) \geq 0\) for any \(E \subset \Omega\)
    • (countable additivity) \(\P(\cup_{i=1}^{\infty} E_i) = \sum_{i=1}^\infty \P(E_i)\) for pairwise disjoint \(E_i\)’s

Conditional Probability

  • The conditional probability of \(A\) given \(B\) is

\[\P(A \mid B) \coloneqq \frac{\P(A \cap B)}{\P(B)}, \quad \text{if}\;\;\P(B) > 0.\]

  • One-line proof of Bayes’ Theorem: If \(\P(A)\) and \(\P(B)\) are both non-zero,

\[\P(A \mid B) = \frac{\P(A \cap B)}{\P(B)} = \frac{\P(B \mid A)\P(A)}{\P(B)}.\]

Random Variables

  • A random variable is
    • a measurable function
    • an outcome from a random experiment
  • The information about a random variable is given by its probability density function (pdf) or probability mass function (pmf).
  • pdf: \(f(x) \geq 0\) and \(\int f(x) dx = 1\)
  • pmf: \(f(x) = \P(X = x)\) and \(\sum f(x) = 1\)
  • Expectation: \(\E(X) = \int xf(x)dx\) or \(\E(X) = \sum x\P(X = x)\)
  • Variance: \(\var(X) = \E[(X - \mu)^2] = \E(X^2) - \mu^2\) where \(\mu = \E(X)\).
  • Covariance: \(\cov(X, Y) = \E[(X - \mu_X)(Y - \mu_Y)] = \E(XY) - \mu_X\mu_Y\)

Independence

  • Independence of events: \(\P(A \cap B) = \P(A)\P(B)\)
  • Independent random variables:
    • \(\Leftrightarrow\) \(f_{X,Y}(x, y) = f_X(x)f_Y(y)\)
    • \(\Rightarrow\) \(\E(XY) = \E(X) \E(Y)\) (the converse is not true)
    • \(\Leftrightarrow\) \(f_{X|Y}(x|y) = f_X(x)\)
  • Conditional independence
    • \(X \perp Y \mid Z\): given \(Z\), \(X\) and \(Y\) are independent
    • Equivalently, \(X \mid Z\) and \(Y \mid Z\) are independent.
  • Remark: (mutual) independence DOES NOT imply conditional independence
  • Example: \(X, Z \iid \text{Ber}(1/2)\) and \(Y = I(X \neq Z)\). Check that \(X\) and \(Y\) are independent, but they are not conditionally independent given \(Z\).
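
A quick simulation check of this last example (an R sketch of my own; the counterexample itself is the one stated above):

set.seed(1)
n <- 1e5
x <- rbinom(n, 1, 0.5)                       # X ~ Ber(1/2)
z <- rbinom(n, 1, 0.5)                       # Z ~ Ber(1/2), independent of X
y <- as.integer(x != z)                      # Y = I(X != Z)
# Marginally, X and Y behave as independent Ber(1/2) variables:
mean(x == 1 & y == 1)                        # approx 0.25
mean(x == 1) * mean(y == 1)                  # approx 0.25
# But given Z = 0 we have Y = X, so X and Y are not conditionally independent:
mean(x[z == 0] == 1 & y[z == 0] == 1)        # approx 0.5
mean(x[z == 0] == 1) * mean(y[z == 0] == 1)  # approx 0.25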

Law of Total Probability/Expectation/Variance

  • Law of Total Probability: For \(B_i \cap B_j = \emptyset\) (\(i \neq j\)) and \(\cup_{i=1}^k B_i = \Omega\),

\[\P(A) = \sum_{i=1}^k \P(A \mid B_i)\P(B_i).\]

  • Law of Total Expectation (tower property/double expectation):

\[\E(X) = \E_Y(\E_{X|Y}(X|Y)).\]

  • Law of Total Variance:

\[\var(X) = \var(\E(X|Y)) + \E(\var(X|Y)).\]
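
A worked example of the last two laws (my own addition): if \(X \mid \Lambda = \lambda \sim \text{Poi}(\lambda)\) and \(\Lambda \sim \text{Gamma}(\alpha, \beta)\), then, using \(\E(\Lambda) = \alpha/\beta\) and \(\var(\Lambda) = \alpha/\beta^2\),

\[\E(X) = \E(\E(X \mid \Lambda)) = \E(\Lambda) = \frac{\alpha}{\beta}, \qquad \var(X) = \E(\var(X \mid \Lambda)) + \var(\E(X \mid \Lambda)) = \E(\Lambda) + \var(\Lambda) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2}.\]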

Important univariate distributions

  • Continuous univariate distribution:
    • Normal distribution (on \(\R\))
    • Gamma distribution (on \(\R_+\))
    • Beta distribution (on \([0, 1]\))
  • Discrete univariate distribution:
    • Binomial distribution (on \(\{0, 1, \ldots, n\}\))
    • Poisson distribution (on \(\{0, 1, 2, \ldots\}\))
    • Negative Binomial distribution (on \(\{0, 1, 2, \ldots\}\))

Normal Distribution

  • \(N(\mu, \sigma^2)\), \(\mu \in \R\), \(\sigma > 0\)
  • Density function:

\[f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \R\]

  • \(\E(X) = \mu\) and \(\var(X) = \sigma^2\)

Gamma Distribution

  • \(\text{Gamma}(\alpha, \beta)\), \(\alpha, \beta > 0\)
  • Density function:

\[f(x;\alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x), \quad x > 0\]

  • \(\E(X) = \frac{\alpha}{\beta}\) and \(\var(X) = \frac{\alpha}{\beta^2}\).
  • Special cases: Exponential distribution (\(\alpha = 1\)) and \(\chi^2\) distribution (\(\alpha = k/2\) and \(\beta = \frac{1}{2}\), where \(k\) is the degrees of freedom)

Beta Distribution

  • \(\text{Beta}(\alpha, \beta)\), \(\alpha, \beta > 0\)
  • Density function:

\[f(x;\alpha, \beta) = \frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 \leq x \leq 1\]

  • \(B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\) is the Beta function
  • \(\E(X) = \frac{\alpha}{\alpha+\beta}\)
  • Special case: Uniform distribution (\(\alpha = \beta = 1\))
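
A quick numerical check of the Beta function identity and the stated mean, using base R (an illustrative sketch):

a <- 3; b <- 2
beta(a, b); gamma(a) * gamma(b) / gamma(a + b)                          # both 1/12
integrate(function(x) x * dbeta(x, a, b), lower = 0, upper = 1)$value   # approx 0.6
a / (a + b)                                                             # exact: 0.6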

Binomial Distribution

  • \(\text{Bin}(n, p)\), \(n \in \mathbb{N}\), \(0 < p < 1\)
  • Number of positive outcomes out of \(n\) binary trials
  • Mass function:

\[f(x; n, p) = \binom{n}{x}p^x(1-p)^{n-x}, \quad x = 0, 1,\ldots, n\]

  • \(\E(X) = np\) and \(\var(X) = np(1-p)\)
  • Special case: Bernoulli distribution (\(n = 1\))

Poisson Distribution

  • \(\text{Poi}(\lambda)\), \(\lambda > 0\)
  • Mass function:

\[f(x;\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots\]

  • \(\E(X) = \var(X) = \lambda\).

Negative Binomial Distribution

  • \(\text{NB}(r, p)\), \(r = 1,2,\ldots\), \(0 < p < 1\)
  • Number of failures before the \(r\)th success
  • Mass function:

\[f(x;r,p) = \binom{x+r-1}{x}p^r(1-p)^x, \quad x = 0, 1, \ldots\]

  • \(\E(X) = \frac{r(1-p)}{p}\) and \(\var(X) = \frac{r(1-p)}{p^2}\).
  • Special case: Geometric distribution (\(r=1\), number of failures before the first success)
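
The special cases noted on the last few slides can be verified numerically with base R’s density functions (a sketch; note that dgamma is parameterized by shape and rate, and dnbinom/dgeom count failures, matching the conventions above):

x <- seq(0.1, 5, by = 0.1)
all.equal(dgamma(x, shape = 1, rate = 2), dexp(x, rate = 2))             # Exponential
all.equal(dgamma(x, shape = 3/2, rate = 1/2), dchisq(x, df = 3))         # Chi-square, k = 3
u <- seq(0.05, 0.95, by = 0.05)
all.equal(dbeta(u, 1, 1), dunif(u))                                      # Uniform
all.equal(dbinom(0:1, size = 1, prob = 0.3), c(0.7, 0.3))                # Bernoulli
all.equal(dnbinom(0:10, size = 1, prob = 0.3), dgeom(0:10, prob = 0.3))  # Geometric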

Maximum Likelihood Estimation

  • Suppose \(X_1, \ldots, X_n \iid f(x|\theta)\).
  • The likelihood function is \[L(\theta) = \prod_{i=1}^n f(x_i|\theta).\]
  • The MLE of \(\theta\) is \(\hat{\theta} = \argmax_{\theta}L(\theta)\).
  • Under some regularity conditions on \(f(x|\theta)\), the MLE is asymptotically efficient (attains the smallest possible asymptotic variance) and its distribution is approximately normal when \(n\) is large enough.
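
A small numerical illustration (my own sketch, using simulated exponential data, for which the MLE has the closed form \(\hat{\lambda} = 1/\bar{X}_n\)):

set.seed(1)
x <- rexp(200, rate = 2)                              # iid sample from Exp(rate = 2)
negloglik <- function(lambda) -sum(dexp(x, rate = lambda, log = TRUE))
optimize(negloglik, interval = c(0.01, 10))$minimum   # numerical MLE
1 / mean(x)                                           # closed-form MLE; the two agree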

Law of Large Numbers & Central Limit Theorem

Suppose \(X_1, \ldots, X_n\) are iid (independent and identically distributed) from some distribution \(F\) with \(\E(X) = \mu\) and \(\var(X) = \sigma^2 < \infty\). Let \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\). Then

  • Law of Large Numbers (LLN): \(\bar{X}_n\) will be very close to \(\mu\) for large \(n\) \[ \bar{X}_n \cas \mu\]
  • Central Limit Theorem (CLT): the distribution of \(\bar{X}_n\) will be approximately normal for large \(n\) \[\sqrt{n}(\bar{X}_n - \mu) \cd N(0, \sigma^2)\]
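
A short simulation illustrating both statements with \(\text{Exp}(1)\) data, so that \(\mu = \sigma^2 = 1\) (an illustrative sketch):

set.seed(1)
# LLN: the running sample mean approaches mu = 1
xbar <- cumsum(rexp(10000, rate = 1)) / (1:10000)
tail(xbar, 1)                                           # close to 1
# CLT: sqrt(n) * (Xbar_n - mu) is approximately N(0, sigma^2 = 1) for n = 50
z <- replicate(2000, sqrt(50) * (mean(rexp(50, rate = 1)) - 1))
c(mean(z), var(z))                                      # approx 0 and 1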

Monte Carlo Approximation/Estimation

  • Suppose we have a function \(f: [a, b] \to \R\) and we want to compute \(I = \int_a^b f(x)dx\).
  • Write \[I = (b-a)\int_a^b f(x)\frac{1}{b-a}dx = (b-a)\E(f(X)), \quad X \sim \text{Unif}(a,b).\]
  • Monte Carlo approximation:
    • Generate \(X_1, \ldots, X_n \iid \text{Unif}(a,b)\).
    • Compute \(\hat{I}_n = \frac{b-a}{n}\sum_{i=1}^n f(X_i)\).
    • By LLN and CLT, \[\hat{I}_n \cas I \quad \text{and} \quad \sqrt{n}(\hat{I}_n - I) \cd N(0, (b-a)^2\sigma^2)\] if \(\sigma^2 = \var(f(X)) < \infty\).
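
For example, to approximate \(I = \int_0^2 e^{-x^2}\,dx\) (an R sketch; the choice of integrand is my own):

set.seed(1)
a <- 0; b <- 2
f <- function(x) exp(-x^2)
X <- runif(1e5, min = a, max = b)        # X_1, ..., X_n iid Unif(a, b)
(b - a) * mean(f(X))                     # Monte Carlo estimate of I
integrate(f, a, b)$value                 # deterministic quadrature, for comparison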