\[ \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \renewcommand{\P}{\mathbb{P}} \newcommand{\var}{{\rm Var}} % Variance \newcommand{\mse}{{\rm MSE}} % MSE \newcommand{\bias}{{\rm Bias}} % MSE \newcommand{\cov}{{\rm Cov}} % Covariance \newcommand{\iid}{\stackrel{\rm iid}{\sim}} \newcommand{\ind}{\stackrel{\rm ind}{\sim}} \renewcommand{\choose}[2]{\binom{#1}{#2}} % Choose \newcommand{\chooses}[2]{{}_{#1}C_{#2}} % Small choose \newcommand{\cd}{\stackrel{d}{\rightarrow}} \newcommand{\cas}{\stackrel{a.s.}{\rightarrow}} \newcommand{\cp}{\stackrel{p}{\rightarrow}} \newcommand{\bin}{{\rm Bin}} \newcommand{\ber}{{\rm Ber}} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \]
Definition 1 A loss function is any function \(L\) from \(\Theta \times \mc{D}\) to \([0,+\infty)\).
Bayesian statistical inference should start with the rigorous determination of three factors:
the distribution of the observations, \(f(x \mid \theta)\);
the prior distribution of the parameters, \(\pi(\theta)\);
the loss associated with the decisions, \(L(\theta, d)\).
Bad news: the determination of the loss is as complicated as the determination of the prior.
Actually, Lindley (1985) states that loss and prior are difficult to separate and should be analyzed simultaneously.
To derive an effective comparison criterion from the loss function, we consider three notions of risk:
Frequentist Risk: average over unknown \(x\) (samples) \[\begin{align*} R(\theta, \delta) =\E_\theta[L(\theta, \delta(X))] =\int_{\mc{X}} L(\theta, \delta(x)) f(x \mid \theta) d x \end{align*}\]
Posterior Risk: average over unknown \(\theta\) (parameters) \[\begin{align*} \rho(\pi, d \mid x) =\E^\pi[L(\theta, d) \mid x] =\int_{\Theta} L(\theta, d) \pi(\theta \mid x) d \theta \end{align*}\]
Integrated Risk: average over both \(\theta\) and \(x\) \[\begin{align*} r(\pi, \delta) =\E^\pi[R(\theta, \delta)] =\int_{\Theta} \int_{\mc{X}} L(\theta, \delta(x)) f(x \mid \theta) d x \, \pi(\theta) d \theta \end{align*}\]
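To make the three averages concrete, here is a minimal numerical sketch (not part of the original notes), under an assumed toy model \(X \sim \text{Bin}(n, \theta)\) with a \(\text{Beta}(a, b)\) prior, squared-error loss, and the candidate rule \(\delta(x) = x/n\); all specific values below are illustrative choices.

```python
import numpy as np
from scipy import stats

# Toy setup (illustrative assumptions): X ~ Bin(n, theta), Beta(a, b) prior on theta,
# squared-error loss, and the candidate rule delta(x) = x / n.
n, a, b = 10, 2, 2
delta = lambda x: x / n                      # decision rule delta(x)
loss = lambda theta, d: (theta - d) ** 2     # L(theta, d)

xs = np.arange(n + 1)                        # sample space of X
thetas = np.linspace(1e-3, 1 - 1e-3, 2000)   # grid over Theta for numerical integration
dtheta = thetas[1] - thetas[0]
prior = stats.beta(a, b).pdf(thetas)

def frequentist_risk(theta):
    """R(theta, delta) = E_theta[ L(theta, delta(X)) ]: a sum over the sample space."""
    return np.sum(loss(theta, delta(xs)) * stats.binom(n, theta).pmf(xs))

def posterior_risk(d, x):
    """rho(pi, d | x) = E[ L(theta, d) | x ]: an integral over Theta."""
    post = stats.beta(a + x, b + n - x).pdf(thetas)   # conjugate posterior Beta(a+x, b+n-x)
    return np.sum(loss(thetas, d) * post) * dtheta

def integrated_risk():
    """r(pi, delta) = E^pi[ R(theta, delta) ]: the frequentist risk averaged over the prior."""
    R = np.array([frequentist_risk(t) for t in thetas])
    return np.sum(R * prior) * dtheta

print(frequentist_risk(0.3))      # risk of delta at theta = 0.3
print(posterior_risk(0.5, x=4))   # posterior risk of the fixed decision d = 0.5 given x = 4
print(integrated_risk())          # integrated risk of delta under the Beta(2, 2) prior
```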
Student | Chinese | English | Math | Science | Social Studies |
---|---|---|---|---|---|
A | 80 | 90 | 95 | 85 | 85 |
B | 70 | 80 | 85 | 75 | 75 |
C | 85 | 90 | 95 | 90 | 60 |
D | 90 | 80 | 50 | 60 | 80 |
E | 80 | 80 | 80 | 80 | 80 |
Among these 5 students:
A and E are minimax, since they have the highest worst score, 80.
B and E are inadmissible, since A dominates B and E; A, C, and D are admissible.
The Bayes estimator under the prior (1,1,1,1,1) is A; the Bayes estimator under the prior (1,0,0,0,0) is D.
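These claims are easy to verify numerically; the following sketch (ours, not part of the original example) checks the minimax, admissibility, and prior-weighted comparisons directly from the score table.

```python
import numpy as np

# Scores from the table above; rows are students A-E, columns are the five subjects.
names = ["A", "B", "C", "D", "E"]
S = np.array([[80, 90, 95, 85, 85],
              [70, 80, 85, 75, 75],
              [85, 90, 95, 90, 60],
              [90, 80, 50, 60, 80],
              [80, 80, 80, 80, 80]])

# Minimax (scores play the role of negative loss): maximize the worst-case score.
worst = S.min(axis=1)
print("minimax:", [nm for nm, w in zip(names, worst) if w == worst.max()])    # ['A', 'E']

# Admissibility: student i is inadmissible if some j scores >= i everywhere and > somewhere.
def dominated(i):
    return any(np.all(S[j] >= S[i]) and np.any(S[j] > S[i])
               for j in range(len(names)) if j != i)

print("admissible:", [nm for i, nm in enumerate(names) if not dominated(i)])  # ['A', 'C', 'D']

# "Bayes" choice: maximize the prior-weighted average score.
for prior in ([1, 1, 1, 1, 1], [1, 0, 0, 0, 0]):
    print(prior, "->", names[int(np.argmax(S @ np.array(prior)))])            # A, then D
```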
Conclusion
Bayes estimators are admissible. They can also be minimax under a particular prior.
There is no way of selecting a best estimator without using a loss criterion.
Nonetheless, a possible estimator of \(\theta\) based on \(\pi(\theta \mid x)\) is the maximum a posteriori (MAP) estimator, defined as the posterior mode: \[ \hat{\theta}_{\text{MAP}} = \argmax_\theta \pi(\theta \mid x) = \argmax_{\theta} \ell(\theta\mid x) \pi(\theta) \] where \(\ell(\theta\mid x)\) is the likelihood function.
Note that the MAP estimator also bypasses the computation of the marginal \(m(x) = \int \ell(\theta\mid x)\pi(\theta)d\theta\).
Recall that the MAP estimator is a Bayes estimator (in the decision-theoretic sense) under the 0-1 loss when \(\theta\) is discrete and is the limit of Bayes estimators when \(\theta\) is continuous.
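As an illustration (not tied to a specific example in the notes), here is a minimal MAP sketch for an assumed Binomial model with a \(\text{Beta}(2,2)\) prior; the grid search maximizes the unnormalized posterior \(\ell(\theta \mid x)\pi(\theta)\), so \(m(x)\) indeed never appears.

```python
import numpy as np
from scipy import stats

# Assumed toy model for illustration: X ~ Bin(n, theta) with a Beta(2, 2) prior.
n, x = 20, 6
a, b = 2, 2

thetas = np.linspace(1e-4, 1 - 1e-4, 10_000)   # grid over Theta

# Maximize the (log) unnormalized posterior: log-likelihood + log-prior.
# The marginal m(x) is never computed, since it does not depend on theta.
log_post = stats.binom.logpmf(x, n, thetas) + stats.beta.logpdf(thetas, a, b)
theta_map = thetas[np.argmax(log_post)]

print(theta_map)                        # grid-based MAP estimate
print((x + a - 1) / (n + a + b - 2))    # closed-form posterior mode of Beta(x + a, n - x + b)
```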
Sample 1/2 | Captured | Missed |
---|---|---|
Captured | \(n_{11}\) | \(n_{12}\) |
Missed | \(n_{21}\) | \(n_{22}\) |
\(N \backslash n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
36 | 0.0580 | 0.0725 | 0.0886 | 0.1061 | 0.1246 | 0.1438 |
37 | 0.0594 | 0.0716 | 0.0846 | 0.0979 | 0.1112 | 0.1244 |
38 | 0.0608 | 0.0708 | 0.0808 | 0.0905 | 0.0996 | 0.1080 |
39 | 0.0621 | 0.0700 | 0.0772 | 0.0838 | 0.0895 | 0.0942 |
40 | 0.0634 | 0.0691 | 0.0739 | 0.0778 | 0.0806 | 0.0824 |
41 | 0.0647 | 0.0683 | 0.0708 | 0.0723 | 0.0728 | 0.0723 |
42 | 0.0659 | 0.0674 | 0.0679 | 0.0673 | 0.0659 | 0.0637 |
43 | 0.0670 | 0.0666 | 0.0651 | 0.0628 | 0.0598 | 0.0563 |
44 | 0.0682 | 0.0658 | 0.0625 | 0.0587 | 0.0544 | 0.0499 |
45 | 0.0692 | 0.0650 | 0.0601 | 0.0549 | 0.0496 | 0.0444 |
46 | 0.0703 | 0.0642 | 0.0578 | 0.0515 | 0.0453 | 0.0396 |
47 | 0.0713 | 0.0634 | 0.0556 | 0.0483 | 0.0415 | 0.0353 |
48 | 0.0723 | 0.0626 | 0.0536 | 0.0454 | 0.0380 | 0.0317 |
49 | 0.0732 | 0.0618 | 0.0516 | 0.0427 | 0.0350 | 0.0284 |
50 | 0.0741 | 0.0611 | 0.0498 | 0.0402 | 0.0322 | 0.0256 |
\(n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\mathbb{E}(N\mid n_{11})\) | 43.32 | 42.77 | 42.23 | 41.72 | 41.23 | 40.78 |
\(n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\text{argmax }\pi(N\mid n_{11})\) | 50 | 36 | 36 | 36 | 36 | 36 |
If, instead of the squared error, we use the loss \[\begin{align*} L(N, \delta)= \begin{cases}10(\delta-N) & \text { if } \delta>N, \\ N-\delta & \text { otherwise }\end{cases} \end{align*}\] in order to avoid an overestimation of the population size, the Bayes estimator is the \((1 / 11)\)-quantile of \(\pi\left(N \mid n_{11}\right)\).
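To see where the \(1/11\) comes from, recall the standard result for piecewise-linear losses: if \[\begin{align*} L(\theta, d)= \begin{cases}k_2(d-\theta) & \text { if } d>\theta, \\ k_1(\theta-d) & \text { otherwise, }\end{cases} \end{align*}\] then the posterior risk is minimized by any \(k_1/(k_1+k_2)\)-quantile of \(\pi(\theta \mid x)\); here \(k_2=10\) and \(k_1=1\), giving the \(1/11\)-quantile.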
\(n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\delta(n_{11})\) | 37 | 37 | 37 | 36 | 36 | 36 |
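As a small check of how the three point estimates above (posterior mean, posterior mode, and the \(1/11\)-quantile) are read off a tabulated posterior, here is a sketch in Python using the \(n_{11}=0\) column of the table of \(\pi(N \mid n_{11})\); the function name is ours.

```python
import numpy as np

def posterior_summaries(N_grid, post, q=1/11):
    """Posterior mean, mode (MAP), and q-quantile from a tabulated pmf pi(N | n11)."""
    post = np.asarray(post, dtype=float)
    post = post / post.sum()                     # normalize (harmless if it already sums to one)
    mean = float(np.sum(N_grid * post))
    mode = int(N_grid[np.argmax(post)])
    quantile = int(N_grid[np.searchsorted(np.cumsum(post), q)])  # smallest N with F(N) >= q
    return mean, mode, quantile

# The n11 = 0 column of the table of pi(N | n11), for N = 36, ..., 50.
N_grid = np.arange(36, 51)
post_0 = [0.0580, 0.0594, 0.0608, 0.0621, 0.0634, 0.0647, 0.0659, 0.0670,
          0.0682, 0.0692, 0.0703, 0.0713, 0.0723, 0.0732, 0.0741]

print(posterior_summaries(N_grid, post_0))   # ~ (43.32, 50, 37), matching the three tables above
```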