\[ \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \renewcommand{\P}{\mathbb{P}} \newcommand{\var}{{\rm Var}} % Variance \newcommand{\mse}{{\rm MSE}} % MSE \newcommand{\bias}{{\rm Bias}} % MSE \newcommand{\cov}{{\rm Cov}} % Covariance \newcommand{\iid}{\stackrel{\rm iid}{\sim}} \newcommand{\ind}{\stackrel{\rm ind}{\sim}} \renewcommand{\choose}[2]{\binom{#1}{#2}} % Choose \newcommand{\chooses}[2]{{}_{#1}C_{#2}} % Small choose \newcommand{\cd}{\stackrel{d}{\rightarrow}} \newcommand{\cas}{\stackrel{a.s.}{\rightarrow}} \newcommand{\cp}{\stackrel{p}{\rightarrow}} \newcommand{\bin}{{\rm Bin}} \newcommand{\ber}{{\rm Ber}} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \]
Definition 1 A loss function is any function \(L\) from \(\Theta \times \mc{D}\) to \([0,+\infty)\).
Bayesian statistical inference should start with the rigorous determination of three factors:
the distribution of the observations, \(f(x \mid \theta)\);
the prior distribution of the parameters, \(\pi(\theta)\);
the loss associated with the decisions, \(L(\theta, d)\).
Bad news: the determination of the loss is as complicated as the determination of the prior.
Actually, Lindley (1985) states that loss and prior are difficult to separate and should be analyzed simultaneously.
To derive an effective comparison criterion from the loss function, we consider three notions of risk:
Frequentist Risk: average over unknown \(x\) (samples) \[\begin{align*} R(\theta, \delta) =\E_\theta[L(\theta, \delta(X))] =\int_{\mc{X}} L(\theta, \delta(x)) f(x \mid \theta) d x \end{align*}\]
Posterior Risk: average over unknown \(\theta\) (parameters) \[\begin{align*} \rho(\pi, d \mid x) =\E^\pi[L(\theta, d) \mid x] =\int_{\Theta} L(\theta, d) \pi(\theta \mid x) d \theta \end{align*}\]
Integrated Risk: average over both \(\theta\) and \(x\) \[\begin{align*} r(\pi, \delta) =\E^\pi[R(\theta, \delta)] =\int_{\Theta} \int_{\mc{X}} L(\theta, \delta(x)) f(x \mid \theta) d x \, \pi(\theta) d \theta \end{align*}\]
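To make the three averages concrete, here is a minimal numerical sketch (not part of the original notes), under an assumed toy model \(X \sim \text{Bin}(n, \theta)\) with a \(\text{Beta}(a, b)\) prior, squared-error loss, and the candidate rule \(\delta(x) = x/n\); all specific values below are illustrative choices.

```python
import numpy as np
from scipy import stats

# Toy setup (illustrative assumptions): X ~ Bin(n, theta), Beta(a, b) prior on theta,
# squared-error loss, and the candidate rule delta(x) = x / n.
n, a, b = 10, 2, 2
delta = lambda x: x / n                      # decision rule delta(x)
loss = lambda theta, d: (theta - d) ** 2     # L(theta, d)

xs = np.arange(n + 1)                        # sample space of X
thetas = np.linspace(1e-3, 1 - 1e-3, 2000)   # grid over Theta for numerical integration
dtheta = thetas[1] - thetas[0]
prior = stats.beta(a, b).pdf(thetas)

def frequentist_risk(theta):
    """R(theta, delta) = E_theta[ L(theta, delta(X)) ]: a sum over the sample space."""
    return np.sum(loss(theta, delta(xs)) * stats.binom(n, theta).pmf(xs))

def posterior_risk(d, x):
    """rho(pi, d | x) = E[ L(theta, d) | x ]: an integral over Theta."""
    post = stats.beta(a + x, b + n - x).pdf(thetas)   # conjugate posterior Beta(a+x, b+n-x)
    return np.sum(loss(thetas, d) * post) * dtheta

def integrated_risk():
    """r(pi, delta) = E^pi[ R(theta, delta) ]: the frequentist risk averaged over the prior."""
    R = np.array([frequentist_risk(t) for t in thetas])
    return np.sum(R * prior) * dtheta

print(frequentist_risk(0.3))      # risk of delta at theta = 0.3
print(posterior_risk(0.5, x=4))   # posterior risk of the fixed decision d = 0.5 given x = 4
print(integrated_risk())          # integrated risk of delta under the Beta(2, 2) prior
```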
Student | Chinese | English | Math | Science | Social Studies |
---|---|---|---|---|---|
A | 80 | 90 | 95 | 85 | 85 |
B | 70 | 80 | 85 | 75 | 75 |
C | 85 | 90 | 95 | 90 | 60 |
D | 90 | 80 | 50 | 60 | 80 |
E | 80 | 80 | 80 | 80 | 80 |
Among these 5 students:
A and E are minimax, since they have the highest worst score, 80.
B and E are inadmissible, since A dominates B and E; A, C, and D are admissible.
The Bayes estimator under the prior (1,1,1,1,1) is A; the Bayes estimator under the prior (1,0,0,0,0) is D.
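These claims are easy to verify numerically; the following sketch (ours, not part of the original example) checks the minimax, admissibility, and prior-weighted comparisons directly from the score table.

```python
import numpy as np

# Scores from the table above; rows are students A-E, columns are the five subjects.
names = ["A", "B", "C", "D", "E"]
S = np.array([[80, 90, 95, 85, 85],
              [70, 80, 85, 75, 75],
              [85, 90, 95, 90, 60],
              [90, 80, 50, 60, 80],
              [80, 80, 80, 80, 80]])

# Minimax (scores play the role of negative loss): maximize the worst-case score.
worst = S.min(axis=1)
print("minimax:", [nm for nm, w in zip(names, worst) if w == worst.max()])    # ['A', 'E']

# Admissibility: student i is inadmissible if some j scores >= i everywhere and > somewhere.
def dominated(i):
    return any(np.all(S[j] >= S[i]) and np.any(S[j] > S[i])
               for j in range(len(names)) if j != i)

print("admissible:", [nm for i, nm in enumerate(names) if not dominated(i)])  # ['A', 'C', 'D']

# "Bayes" choice: maximize the prior-weighted average score.
for prior in ([1, 1, 1, 1, 1], [1, 0, 0, 0, 0]):
    print(prior, "->", names[int(np.argmax(S @ np.array(prior)))])            # A, then D
```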
Conclusion
Bayes estimators are admissible. They can also be minimax under a particular prior.
There is no way of selecting a best estimator without using a loss criterion.
Nonetheless, a possible estimator of \(\theta\) based on \(\pi(\theta \mid x)\) is the maximum a posteriori (MAP) estimator, defined as the posterior mode: \[ \hat{\theta}_{\text{MAP}} = \argmax_\theta \pi(\theta \mid x) = \argmax_{\theta} \ell(\theta\mid x) \pi(\theta) \] where \(\ell(\theta\mid x)\) is the likelihood function.
Note that the MAP estimator also bypasses the computation of the marginal \(m(x) = \int \ell(\theta\mid x)\pi(\theta)d\theta\).
Recall that the MAP estimator is a Bayes estimator (in the decision-theoretic sense) under the 0-1 loss when \(\theta\) is discrete and is the limit of Bayes estimators when \(\theta\) is continuous.
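As an illustration (not tied to a specific example in the notes), here is a minimal MAP sketch for an assumed Binomial model with a \(\text{Beta}(2,2)\) prior; the grid search maximizes the unnormalized posterior \(\ell(\theta \mid x)\pi(\theta)\), so \(m(x)\) indeed never appears.

```python
import numpy as np
from scipy import stats

# Assumed toy model for illustration: X ~ Bin(n, theta) with a Beta(2, 2) prior.
n, x = 20, 6
a, b = 2, 2

thetas = np.linspace(1e-4, 1 - 1e-4, 10_000)   # grid over Theta

# Maximize the (log) unnormalized posterior: log-likelihood + log-prior.
# The marginal m(x) is never computed, since it does not depend on theta.
log_post = stats.binom.logpmf(x, n, thetas) + stats.beta.logpdf(thetas, a, b)
theta_map = thetas[np.argmax(log_post)]

print(theta_map)                        # grid-based MAP estimate
print((x + a - 1) / (n + a + b - 2))    # closed-form posterior mode of Beta(x + a, n - x + b)
```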
Sample 1/2 | Captured | Missed |
---|---|---|
Captured | \(n_{11}\) | \(n_{12}\) |
Missed | \(n_{21}\) | \(n_{22}\) |
\(N \backslash n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
36 | 0.0580 | 0.0725 | 0.0886 | 0.1061 | 0.1246 | 0.1438 |
37 | 0.0594 | 0.0716 | 0.0846 | 0.0979 | 0.1112 | 0.1244 |
38 | 0.0608 | 0.0708 | 0.0808 | 0.0905 | 0.0996 | 0.1080 |
39 | 0.0621 | 0.0700 | 0.0772 | 0.0838 | 0.0895 | 0.0942 |
40 | 0.0634 | 0.0691 | 0.0739 | 0.0778 | 0.0806 | 0.0824 |
41 | 0.0647 | 0.0683 | 0.0708 | 0.0723 | 0.0728 | 0.0723 |
42 | 0.0659 | 0.0674 | 0.0679 | 0.0673 | 0.0659 | 0.0637 |
43 | 0.0670 | 0.0666 | 0.0651 | 0.0628 | 0.0598 | 0.0563 |
44 | 0.0682 | 0.0658 | 0.0625 | 0.0587 | 0.0544 | 0.0499 |
45 | 0.0692 | 0.0650 | 0.0601 | 0.0549 | 0.0496 | 0.0444 |
46 | 0.0703 | 0.0642 | 0.0578 | 0.0515 | 0.0453 | 0.0396 |
47 | 0.0713 | 0.0634 | 0.0556 | 0.0483 | 0.0415 | 0.0353 |
48 | 0.0723 | 0.0626 | 0.0536 | 0.0454 | 0.0380 | 0.0317 |
49 | 0.0732 | 0.0618 | 0.0516 | 0.0427 | 0.0350 | 0.0284 |
50 | 0.0741 | 0.0611 | 0.0498 | 0.0402 | 0.0322 | 0.0256 |
\(n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\mathbb{E}(N\mid n_{11})\) | 43.32 | 42.77 | 42.23 | 41.72 | 41.23 | 40.78 |
\(n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\text{argmax }\pi(N\mid n_{11})\) | 50 | 36 | 36 | 36 | 36 | 36 |
If, instead of the squared error, we use the loss \[\begin{align*} L(N, \delta)= \begin{cases}10(\delta-N) & \text { if } \delta>N, \\ N-\delta & \text { otherwise }\end{cases} \end{align*}\] in order to avoid an overestimation of the population size, the Bayes estimator is the \((1 / 11)\)-quantile of \(\pi\left(N \mid n_{11}\right)\).
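To see where the \(1/11\) comes from, recall the standard result for piecewise-linear losses: if \[\begin{align*} L(\theta, d)= \begin{cases}k_2(d-\theta) & \text { if } d>\theta, \\ k_1(\theta-d) & \text { otherwise, }\end{cases} \end{align*}\] then the posterior risk is minimized by any \(k_1/(k_1+k_2)\)-quantile of \(\pi(\theta \mid x)\); here \(k_2=10\) and \(k_1=1\), giving the \(1/11\)-quantile.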
\(n_{11}\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\delta(n_{11})\) | 37 | 37 | 37 | 36 | 36 | 36 |
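As a small check of how the three point estimates above (posterior mean, posterior mode, and the \(1/11\)-quantile) are read off a tabulated posterior, here is a sketch in Python using the \(n_{11}=0\) column of the table of \(\pi(N \mid n_{11})\); the function name is ours.

```python
import numpy as np

def posterior_summaries(N_grid, post, q=1/11):
    """Posterior mean, mode (MAP), and q-quantile from a tabulated pmf pi(N | n11)."""
    post = np.asarray(post, dtype=float)
    post = post / post.sum()                     # normalize (harmless if it already sums to one)
    mean = float(np.sum(N_grid * post))
    mode = int(N_grid[np.argmax(post)])
    quantile = int(N_grid[np.searchsorted(np.cumsum(post), q)])  # smallest N with F(N) >= q
    return mean, mode, quantile

# The n11 = 0 column of the table of pi(N | n11), for N = 36, ..., 50.
N_grid = np.arange(36, 51)
post_0 = [0.0580, 0.0594, 0.0608, 0.0621, 0.0634, 0.0647, 0.0659, 0.0670,
          0.0682, 0.0692, 0.0703, 0.0713, 0.0723, 0.0732, 0.0741]

print(posterior_summaries(N_grid, post_0))   # ~ (43.32, 50, 37), matching the three tables above
```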