Lecture 06: Testing and Model Selection

Chun-Hao Yang

Project Proposal

  • A 10-15 min presentation describing your final project, including
    • the dataset
    • the scientific questions
    • what statistical methods you think can help answer the questions
  • Upload your slides to NTU cool by 10/23.
  • This presentation is worth 20% of your final grade.
  • You can work in groups of 2-3 people.
  • You can choose a topic related to your own research.

Introduction

  • In the last lecture, we saw that hierarchical structures allow us to construct models with high flexibility.
  • In practice, there is more than one way to build a hierarchical model, especially when latent variables are involved.
  • Is a model with more hierarchy always “better” than one with less hierarchy?
  • What does it mean to say that one model is “better” than another?
  • You can evaluate models by
    • how well they fit the data,
    • how well they predict,
    • how easily they can be interpreted,
    • how robust they are to the presence of outliers, etc.

Hypothesis Testing

Hypothesis testing

  • The goal of a hypothesis test is to decide whether you can reject the null hypothesis \(H_0\) and accept the alternative hypothesis \(H_1\) or not.
  • Usually, the hypotheses are of the form \[ H_0: \theta \in \Theta_0\quad \text{vs.} \quad H_1: \theta \in \Theta_1 = \Theta_0^c. \]
  • The philosophy is that we first assume \(H_0\) to be true and then see how strong the evidence in the data is against \(H_0\). If the evidence is strong enough, we reject \(H_0\) and accept \(H_1\).

Likelihood Ratio Test

  • One of the frequentist approaches to hypothesis testing is the likelihood ratio test (LRT).
  • Suppose \(X_1, \ldots, X_n \iid f(x\mid \theta)\) and \(L(\theta \mid x) = \prod_{i=1}^n f(x_i \mid \theta)\) is the likelihood function.
  • The LRT statistic is \[ \lambda = \frac{\max_{\theta \in \Theta_0} L(\theta \mid x)}{\max_{\theta \in \Theta_0 \cup \Theta_1} L(\theta \mid x)}. \]
  • By construction, \(0 \leq \lambda \leq 1\), and small values of \(\lambda\) provide evidence against \(H_0\).
  • When should we reject \(H_0\)?
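
One standard frequentist answer is to reject \(H_0\) when \(-2\log\lambda\) is large, calibrated by its approximate \(\chi^2\) null distribution (Wilks’ theorem). Below is a minimal R sketch (not part of the lecture code), assuming a \(N(\mu, \sigma^2)\) sample with \(\sigma\) known, \(H_0: \mu = 0\), and simulated data.

# Minimal sketch: LRT for H0: mu = 0 vs H1: mu != 0, with sigma known (simulated data)
set.seed(1)
sigma <- 1
x <- rnorm(20, mean = 0.3, sd = sigma)

loglik <- function(mu) sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
log_lambda <- loglik(0) - loglik(mean(x))            # log LRT statistic; the MLE is the sample mean
pchisq(-2 * log_lambda, df = 1, lower.tail = FALSE)  # approximate p-value via Wilks' theorem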

Two types of error

  • False rejection (Type I error) and False acceptance (Type II error)

  • The decision space is \(\mc{D} = \{\text{accept}, \text{reject}\} = \{0, 1\}\).

  • A typical loss is the 0-1 loss \[ L(\theta, d)= \begin{cases}1 & \text { if } d = \mathbb{I}_{\Theta_0}(\theta) \\ 0 & \text { otherwise }\end{cases}. \]

  • The frequentist risk is \[ R(\theta, \delta) = \E L(\theta, \delta(X)) = \begin{cases} \P(\delta(X) = 1 \mid \theta) = \text{Type I error} & \text{if } \theta \in \Theta_0, \\ \P(\delta(X) = 0 \mid \theta) = \text{Type II error} & \text{if } \theta \in \Theta_1. \end{cases} \]

  • Under this loss, the Bayesian solution is \[ \varphi^\pi(x)= \begin{cases}1 & \text { if } \P^\pi\left(\theta \in \Theta_1 \mid x\right) > \P^\pi\left(\theta \in \Theta_0 \mid x\right) \\ 0 & \text { otherwise }\end{cases}. \]

Bayesian test

  • A Bayesian test with prior \(\pi\) rejects \(H_0: \theta \in \Theta_0\) if \[ \P^\pi\left(\theta \in \Theta_1 \mid x\right) > \P^\pi\left(\theta \in \Theta_0 \mid x\right). \]
  • What are the Type I and Type II errors?
  • What is the power of the test?
  • What if we have a point null hypothesis \(H_0: \theta = \theta_0\)? Under a continuous prior, \(\P^\pi(\theta = \theta_0 \mid x) = 0\), so a Bayesian test will always reject \(H_0\).
  • Is this rejection due to the prior or the observation?
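
As a concrete sketch (illustrative numbers, not from the lecture), suppose \(x \sim \text{Binomial}(n, \theta)\) with a \(\text{Beta}(1, 1)\) prior and we test \(H_0: \theta \leq 0.5\) against \(H_1: \theta > 0.5\). The posterior is \(\text{Beta}(x + 1, n - x + 1)\), so both posterior probabilities are available in closed form.

# Minimal sketch: Bayesian test of H0: theta <= 0.5 vs H1: theta > 0.5
# x ~ Binomial(n, theta) with a Beta(1, 1) prior (illustrative numbers)
n <- 20; x <- 14
a <- 1 + x; b <- 1 + n - x        # posterior is Beta(a, b)
p0 <- pbeta(0.5, a, b)            # P(theta in Theta_0 | x)
p1 <- 1 - p0                      # P(theta in Theta_1 | x)
c(p0 = p0, p1 = p1)               # the Bayesian test rejects H0 if p1 > p0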

Bayes Factor

  • The Bayes factor is the ratio of posterior odds to the prior odds \[ B_{01}^\pi(x)=\left.\frac{P\left(\theta \in \Theta_0 \mid x\right)}{P\left(\theta \in \Theta_1 \mid x\right)} \middle/ \frac{\pi\left(\theta \in \Theta_0\right)}{\pi\left(\theta \in \Theta_1\right)}\right.. \]
  • It is an “objective” answer, since it partly eliminates the influence of the prior modeling and emphasizes the role of the observations.
  • If the Bayes factor is 1, the observation does not provide information in favor of either hypothesis.
  • If the Bayes factor is greater than 1, the observation favors \(H_0\); we accept \(H_0\) when the Bayes factor is sufficiently large.
  • The Bayes factor offers a continuum of evidence, rather than a dichotomous decision.
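
Continuing the beta-binomial sketch above, the Bayes factor \(B_{01}\) is the posterior odds of \(H_0\) divided by its prior odds (here the prior puts probability 1/2 on each hypothesis).

# Minimal sketch: Bayes factor B_01 for the beta-binomial test above
n <- 20; x <- 14
a <- 1 + x; b <- 1 + n - x        # Beta(a, b) posterior under a Beta(1, 1) prior
prior0 <- pbeta(0.5, 1, 1)        # prior probability of H0 (= 0.5)
post0  <- pbeta(0.5, a, b)        # posterior probability of H0
(post0 / (1 - post0)) / (prior0 / (1 - prior0))   # B_01: > 1 favors H0, < 1 favors H1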

Differences between Bayes factor and \(p\)-value

  • \(p\)-value is the probability of observing a test statistic at least as extreme as the one observed, given that \(H_0\) is true.
  • So small \(p\)-values provide evidence against \(H_0\), but large \(p\)-values do not provide evidence in favor of \(H_0\) (or against \(H_1\)).
  • In contrast, large Bayes factors provide evidence in favor of \(H_0\), and small Bayes factors provide evidence in favor of \(H_1\) (or against \(H_0\)).
  • The \(p\)-value bases rejection of \(H_0\) on unlikely events that did not occur.

Type I and Type II errors for Bayesian test

  • Although it is possible to define Type I and Type II errors for Bayesian tests, the concept is not compatible with the Bayesian paradigm.
  • In Bayesian inference, we tend not to give a simple, dichotomous answer to the question of whether a hypothesis is true or false.
  • Instead, we compute the posterior probabilities of the two hypotheses and choose one only when the evidence (e.g., the Bayes factor) is strong enough.

Improper prior for testing

  • For Bayesian estimation, improper priors (e.g., noninformative priors) sometimes give us good estimates, although we are required to check the propriety of the posterior.
  • For testing, it is not recommended to use improper priors, for a couple of reasons.
    1. Improper priors leave the Bayes factor defined only up to arbitrary normalizing constants.
    2. The testing setting is not coherent with an absolute lack of prior information, since specifying the hypotheses already carries information about \(\theta\).
  • There are still some solutions proposed to overcome the difficulties related to the use of improper priors.
  • Most of them are based on the idea of using part of the data to turn the improper priors into proper distributions.
  • For example, the partial Bayes factor uses a subset of the sample (a training sample) to construct a proper prior, and uses the rest of the data to compute the Bayes factor.

Bayesian one sample t-test

  • Suppose \(X_1, \ldots, X_n \iid N(\mu, \sigma^2)\) with \(\sigma^2\) unknown.
  • The null hypothesis is \(H_0: \mu = \mu_0\) and the alternative hypothesis is \(H_1: \mu \neq \mu_0\).
  • We can use a normal prior or a Cauchy prior centered at \(\mu_0\) (the Cauchy has heavier tails) for \(\mu\).
  • For the variance \(\sigma^2\), it is okay to use the Jeffreys prior.
  • Then we can use sampling techniques (e.g., MCMC) to sample from the posterior and compute the Bayes factor.

Example

This dataset shows the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.

library(BayesFactor)
library(ggplot2)
library(tidyverse)

data(sleep)
ggplot(sleep, aes(x = group, y = extra)) + geom_boxplot()

Example

We want to see the difference in efficacy of the drugs.

ttestBF(x = sleep$extra[sleep$group==1], 
        y = sleep$extra[sleep$group==2], 
        paired=TRUE)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 17.25888 ±0%

Against denominator:
  Null, mu = 0 
---
Bayes factor type: BFoneSample, JZS

Priors:

  • Jeffreys prior for the variance
  • Cauchy prior for the mean (effect size)

You can change the priors and see how the results change. This might require you to implement the sampling process yourself.
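
In ttestBF, the main prior choice you can change is the scale of the Cauchy prior on the effect size, through the rscale argument (e.g., "medium", "wide", "ultrawide", or a number); for example:

ttestBF(x = sleep$extra[sleep$group==1], 
        y = sleep$extra[sleep$group==2], 
        paired=TRUE, rscale = "wide")

Other prior families (e.g., a normal prior on the mean) are not built into ttestBF, which is why changing them may require writing your own sampler.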

Example

To check the quality of the posterior samples, you can ask the function to return posterior samples rather than the Bayes factor by setting posterior = TRUE.

samples <- ttestBF(x = sleep$extra[sleep$group==1],
                   y = sleep$extra[sleep$group==2], 
                   paired=TRUE, 
                   posterior = TRUE, iterations = 1000)
plot(samples[, "mu"])

Model Selection

Model selection

  • Hypothesis testing is a special case of model selection.
  • A general model selection problem is to choose between a collection of models \(M_1, \ldots, M_k\).
  • For example, we can choose whether we want to use a linear regression model or a quadratic regression model.
  • The choice can be based on the \(R^2\) or the adjusted \(R^2\), i.e., choose the one with highest adjusted \(R^2\).
  • For general models (e.g., non-linear regression), we can use
    • information criterion (e.g., AIC, BIC)
    • Bayes factor

Information criterion

  • The general form of an information criterion is \[ -\text{Model fit} + \text{Penalty}. \]
  • We choose the model with the smallest value of the information criterion.
  • The ‘Model fit’ term is typically chosen as \(2\log L(\hat{\theta})\), twice the maximized log-likelihood of the model.
  • The ‘Penalty’ is typically chosen as a function of the number of parameters in the model.
  • For Bayesian information criterion (BIC), \(\text{Penalty} = p\log n\).
  • For Akaike information criterion (AIC), \(\text{Penalty} = 2p\).
  • However, a small value of the information criterion only indicates a good fit to the observed dataset; it does not directly measure predictive performance.
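
As a minimal sketch (simulated data, not from the lecture), here is a comparison of a linear and a quadratic regression using AIC, BIC, and adjusted \(R^2\):

# Minimal sketch: compare a linear and a quadratic regression (simulated data)
set.seed(1)
x <- runif(50)
y <- 1 + 2 * x - 3 * x^2 + rnorm(50, sd = 0.3)
m1 <- lm(y ~ x)                   # linear model
m2 <- lm(y ~ x + I(x^2))          # quadratic model
AIC(m1, m2)                       # smaller is better
BIC(m1, m2)                       # smaller is better
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)  # larger is better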

Cross-validation

  • Cross-validation is a way to evaluate the predictive performance of a model.
  • The idea is to split the data into two parts: a training set and a validation set.
  • The model is fit to the training set, and then the predictive performance is evaluated on the validation set.
  • Commonly used approaches are \(k\)-fold cross-validation and leave-one-out cross-validation.
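
A minimal 5-fold cross-validation sketch (same simulated setup as above), comparing the two regression models by their mean squared prediction error on the held-out folds:

# Minimal sketch: 5-fold cross-validation for the linear vs. quadratic models
set.seed(1)
dat <- data.frame(x = runif(50))
dat$y <- 1 + 2 * dat$x - 3 * dat$x^2 + rnorm(50, sd = 0.3)
fold <- sample(rep(1:5, length.out = nrow(dat)))   # random fold assignment
cv_mse <- function(form) {
  mean(sapply(1:5, function(k) {
    fit <- lm(form, data = dat[fold != k, ])                       # fit on training folds
    mean((dat$y[fold == k] - predict(fit, newdata = dat[fold == k, ]))^2)  # error on held-out fold
  }))
}
c(linear = cv_mse(y ~ x), quadratic = cv_mse(y ~ x + I(x^2)))      # smaller is better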

Bayes Factor

  • Although BIC has “Bayesian” in its name, it is not technically a Bayesian method. It just has a Bayesian interpretation.
  • The Bayes factor can be defined in the general model selection setting: \[ \text{BF}_{ij} = \left.\frac{\P(M_i \mid x)}{\P(M_j \mid x)}\middle/\frac{\P(M_i)}{\P(M_j)}\right. \] where \(\P(M_i)\) is the prior belief of the model \(M_i\).
  • If \(\text{BF}_{ij}\) is large enough, then we prefer model \(M_i\) over model \(M_j\).
  • If we don’t have any preference between models, \(\P(M_i) = 1/k\) and comparing Bayes factors is equivalent to comparing the posterior probabilities of the models.
  • If you prefer simpler models, you can assign higher prior preference to simpler models and see if the posterior outweighs your preference.

Bayes Factor

  • Usually we have not only prior probabilities for the different models but also priors for the parameters within each model.
  • For example, if model \(M_i\) is indexed by the parameter \(\theta_i\) and \(\theta_i\) has prior \(\pi_i(\theta_i)\), then \[ \P(M_i \mid x) = \frac{\P(M_i) \P(x \mid M_i)}{\sum_{j=1}^k \P(M_j) \P(x \mid M_j)} \] where \[ \P(x \mid M_i) = \int \P(x \mid \theta_i, M_i) \pi_i(\theta_i) \, d\theta_i. \]
  • If we assume \(\P(M_i) = 1/k\), then the Bayes factor is \[ \text{BF}_{ij} = \frac{\P(x \mid M_i)}{\P(x \mid M_j)} = \frac{\int \P(x \mid \theta_i, M_i) \pi_i(\theta_i) \, d\theta_i}{\int \P(x \mid \theta_j, M_j) \pi_j(\theta_j) \, d\theta_j}. \]
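
The marginal likelihoods \(\P(x \mid M_i)\) are available in closed form only for simple conjugate models; otherwise they can be approximated, e.g., by averaging the likelihood over draws from the prior (a crude but simple Monte Carlo estimate). A minimal sketch comparing \(M_1: \mu = 0\) against \(M_2: \mu \sim N(0, 1)\) for \(N(\mu, 1)\) data (simulated, purely illustrative):

# Minimal sketch: Bayes factor via Monte Carlo estimates of the marginal likelihoods
set.seed(1)
x <- rnorm(30, mean = 0.4, sd = 1)            # simulated data
lik <- function(mu) prod(dnorm(x, mean = mu, sd = 1))

marg1 <- lik(0)                               # M1: mu fixed at 0, so P(x | M1) is the likelihood at 0
mu_draws <- rnorm(10000, mean = 0, sd = 1)    # M2: draws from the prior mu ~ N(0, 1)
marg2 <- mean(sapply(mu_draws, lik))          # Monte Carlo estimate of P(x | M2)
marg2 / marg1                                 # BF_21: > 1 favors M2

This naive estimator becomes unreliable beyond low-dimensional parameters; more refined methods (e.g., importance sampling or bridge sampling) are usually preferred.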

Model averaging

  • Model averaging is a way to combine the results from multiple models.
  • The idea is to average the posterior distributions over all models, weighting model \(M_i\) by its posterior probability \(\P(M_i \mid x)\).
  • For prediction, we can use the posterior predictive distribution \[ \P(x_{\text{new}} \mid x) = \sum_{i=1}^k \P(x_{\text{new}} \mid x, M_i) \P(M_i \mid x). \]
  • A commonly used trick in machine learning, called ensemble, is to use a weighted average of the predictions from different models: \(x_{\text{new}} = \sum_{i=1}^k w_i x_{\text{new},i}\) where \(x_{\text{new},i}\) is the prediction from model \(M_i\) and \(w_i\) is the weight for model \(M_i\).
  • Usually the weights are chosen to be \(1/k\) or through cross-validation.
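
A minimal sketch of the ensemble idea, using equal weights to average the predictions of the linear and quadratic models from the earlier example (data and weights are purely illustrative):

# Minimal sketch: equal-weight ensemble of two regression models
set.seed(1)
dat <- data.frame(x = runif(50))
dat$y <- 1 + 2 * dat$x - 3 * dat$x^2 + rnorm(50, sd = 0.3)
m1 <- lm(y ~ x, data = dat)
m2 <- lm(y ~ x + I(x^2), data = dat)
x_new <- data.frame(x = c(0.2, 0.5, 0.8))
w <- c(0.5, 0.5)                  # weights; could instead come from CV or P(M_i | x)
w[1] * predict(m1, newdata = x_new) + w[2] * predict(m2, newdata = x_new)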

Model diagnostics

  • Model diagnostics is a way to check whether the model is appropriate for the data.
  • If we have a collection of candidate models, we can choose the best model based on the techniques we discussed earlier.
  • However, the selected model may not be a good model.
  • We have to check whether the model is adequate in its own right, not just in comparison with other models.

What do we need to check?

  • A ‘model’ encompasses the sampling distribution, the prior distribution, any hierarchical structure, and issues such as which explanatory variables have been included in a regression.
  • Sensitivity analysis: how much do posterior inferences change when other reasonable probability models are used in place of the present model?
  • External validation: using the model to make predictions about future data, then collecting those data and comparing them to the predictions.
  • Robustness: how does the model perform when the assumptions we made are wrong?
  • There is no general recipe for model checking. It depends on the model and the purpose of the model.
  • However, we can use simulation, for example feeding the model the worst data you can possibly imagine, to diagnose it (see the sketch below).
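
As one such sketch (under an assumed conjugate normal model with known variance and a \(N(0, 1)\) prior on the mean, not from the lecture), we can contaminate the data with extreme outliers and see how much the posterior mean moves:

# Minimal sketch: sensitivity of a conjugate normal-mean posterior to extreme outliers
set.seed(1)
x_clean <- rnorm(30, mean = 1, sd = 1)
x_bad   <- c(x_clean, 15, -20)    # "worst case": append extreme outliers

post_mean <- function(x, prior_mean = 0, prior_var = 1, sigma2 = 1) {
  n <- length(x)
  # conjugate posterior mean for a normal mean with known variance sigma2
  (prior_mean / prior_var + sum(x) / sigma2) / (1 / prior_var + n / sigma2)
}
c(clean = post_mean(x_clean), contaminated = post_mean(x_bad))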