An example survey

Chun-Hao Yang

An example survey

  • There are 6 surveys and each survey has two questions:
    • prefer A or B, or neither?
    • prefer A’ or B, or neither?
  • Notations:
    • \(\hat{p}_1\): proportion of people who prefer A over B.
    • \(\hat{p}_2\): proportion of people who prefer B over A.
    • \(\hat{q}_1\): proportion of people who prefer A’ over B.
    • \(\hat{q}_2\): proportion of people who prefer B over A’.
    • \(n\) is the sample size in each survey

Survey Results

study p_1 p_2 q_1 q_2 n
1 0.4830 0.3920 0.4610 0.4160 2046
2 0.4100 0.3500 0.4200 0.3600 1149
3 0.4660 0.3310 0.4650 0.3490 1112
4 0.4601 0.3222 0.4082 0.3586 1112
5 0.4400 0.3200 0.3970 0.3300 1082
6 0.3880 0.2930 0.3820 0.3060 1484
  • Two possible statistics:
    • difference in proportion: \(T_1 = \hat{p}_1 - \hat{q}_1\).
    • difference in difference: \(T_2 = (\hat{p}_1 - \hat{p}_2) - (\hat{q}_1 - \hat{q}_2)\)

Assumptions

  • The surveys are independent.
  • The responses to the two questions in each survey are also independent.
  • The surveys are conducted based on simple random samples.

Parametric Bootstrap

B <- 10^4
T1_boot <- matrix(0, ncol = B, nrow = 6)
T2_boot <- matrix(0, ncol = B, nrow = 6)

set.seed(2023)

for (i in 1:B) {
  for (j in 1:6) {
    prop_boot_p <- rmultinom(1, n[j], 
        prob = c(p_1[j], p_2[j], 1-p_1[j]-p_2[j]))/n[j]
    prop_boot_q <- rmultinom(1, n[j], 
        prob = c(q_1[j], q_2[j], 1-q_1[j]-q_2[j]))/n[j]
    T1_boot[j, i] <- prop_boot_p[1] - prop_boot_q[1]
    T2_boot[j, i] <- (prop_boot_p[1] - prop_boot_p[2]) - 
        (prop_boot_q[1] - prop_boot_q[2])
  }
}

Difference in Proportions

survey_dp_boot <- tibble(study = 1:6, 
                      DiP = 100 * rowMeans(T1_boot),
                      se = 100 * apply(T1_boot, 1, sd),
                      CI.lower = 100 * apply(T1_boot, 1, quantile, 
                                       probs = 0.025),
                      CI.upper = 100 * apply(T1_boot, 1, quantile, 
                                       probs = 0.975))
survey_dp_boot |>  kable(digits = 2)
study DiP se CI.lower CI.upper
1 2.18 1.54 -0.83 5.23
2 -0.97 2.05 -4.96 3.05
3 0.11 2.12 -4.05 4.23
4 5.14 2.09 0.99 9.26
5 4.30 2.14 0.09 8.50
6 0.63 1.79 -2.90 4.11

Difference in Differences

survey_did_boot <- tibble(study = 1:6, 
                      DiD = 100 * rowMeans(T2_boot),
                      se = 100 * apply(T2_boot, 1, sd),
                      CI.lower = 100 * apply(T2_boot, 1, quantile, 
                                       probs = 0.025),
                      CI.upper = 100 * apply(T2_boot, 1, quantile, 
                                       probs = 0.975))
survey_did_boot |>  kable(digits = 2)
study DiD se CI.lower CI.upper
1 4.57 2.88 -1.03 10.22
2 0.05 3.63 -6.96 7.14
3 1.89 3.78 -5.58 9.35
4 8.74 3.69 1.44 15.92
5 5.29 3.70 -1.85 12.66
6 1.95 3.03 -4.04 7.88

Bayesian Analysis

Joint distribution of the two questions

  • Apparently, the responses to the two question are correlated.

  • However we only the marginal distributions for each question.

  • We need to model the joint distribution directly.

  • Let Q1 be the first question (A vs B) and Q2 be the second question (A’ vs B).

  • Suppose the joint distribution is

    Q1\Q2 A’ B no response
    A \(\rho_1\) \(\rho_{12}\) \(\gamma\)
    B \(\rho_{21}\) \(\rho_2\) \(\gamma\)
    no response \(\gamma\) \(\gamma\) \(\rho_3\)
  • The parameter is \(\theta = (\rho_1, \rho_2, \rho_3, \rho_{12}, \rho_{21}, \gamma)\), where \(\rho_1 + \rho_2 + \rho_3 + \rho_{12} + \rho_{21} + 4\gamma = 1\).

  • Denote the counts by \(n_{ij}\), where \(i,j \in \{1,2,3\} = \{A, B, \text{no response}\}\).

Interpretation of the parameters

  • \(\rho_1\): the proportion of people that always prefer A type.
  • \(\rho_2\): the proportion of people that always prefer B type.
  • \(\rho_3\): the proportion of people that always give no response.
  • \(\rho_{12}\): the proportion of people that prefer \(A > B > A'\).
  • \(\rho_{21}\): the proportion of people that prefer \(A' > B > A\).
  • \(\gamma\): the proportion of people that answer only one of the two questions.

Missing Data

  • We have observed \(Y = [n_{1\cdot}, n_{2\cdot}, n_{3\cdot},n_{\cdot 1}, n_{\cdot 2}]^T\), where \(n_{i\cdot} = \sum_j n_{ij}\) and \(n_{\cdot j} = \sum_i n_{ij}\).
  • Let \(U = [n_{11}, n_{12}, n_{13}, n_{21}, n_{22}, n_{23}, n_{31}, n_{32}, n_{33}]^T\) be the complete data.
  • If we also observe \(Z = [n_{11}, n_{12}, n_{21}, n_{32}]^T\), then \(U\) can be recovered from \(Y\) and \(Z\).
  • \(Z\) is called the missing data.
  • Let \(M\) be a matrix such that \[ U = M\left[\begin{array}{c} Y \\ Z \end{array}\right]. \]
  • The distribution of \(U\) multinomial with parameter \((\rho_1, \rho_{12}, \gamma, \rho_{21}, \rho_2, \gamma, \gamma, \gamma, \rho_3)\).

Intuition

  • If we observe \(U\), we can find the posterior \(\pi(\theta \mid U)\).
  • Observing \(U\) is equivalent to observing \(Y\) and \(Z\) and hence the posterior can be written as \(\pi(\theta \mid Y, Z)\).
  • However, \(Z\) is missing, so we can not condition on \(Z\).
  • The most we can do is \(\pi(\theta, Z \mid Y)\).
  • By Bayes Theorem, \[ \pi(\theta, Z \mid Y) = \frac{p(Y, Z \mid \theta)\pi(\theta)}{p(Y)}. \]
  • The distribution \(p(Y, Z \mid \theta)\) is available since \([Y^T, Z^T]^T = M^{-1}U\) and \(U\) is multinomial.
  • We can also consider the marginal posterior \(\pi(\theta \mid Y) = \sum_{Z} \pi(\theta, Z \mid Y)\).

Data (Survey 1)

Q1\Q2 A’ B No response Total
A \(\textcolor{red}{437}\) \(\textcolor{red}{409}\) \(\textcolor{red}{120}\) 988
B \(\textcolor{red}{358}\) \(\textcolor{red}{327}\) \(\textcolor{red}{99}\) 802
No response \(\textcolor{red}{127}\) \(\textcolor{red}{96}\) \(\textcolor{red}{27}\) 256
Total 943 851 \(\textcolor{blue}{252}\) \(\textcolor{blue}{2046}\)
  • The numbers in black are observed.
  • The numbers in red are my random guesses.
  • The numbers in blue are redundant as they can be computed from the others.

Sampling Algorithm

  • We need to sample from \(\pi(\theta, Z \mid Y)\).
  • The sampling is achieved by Gibbs sampler:
    • sample \(Z\) from \(\pi(Z \mid \theta, Y)\)
    • sample \(\theta\) from \(\pi(\theta \mid Z, Y)\).
  • In our case, \(\pi(Z \mid \theta, Y)\) is independent of \(\theta\) and the full conditionals \(\pi(z_i \mid z_{-i}, Y)\) are hypergeometric distributions.
  • You fill find that the derivation is the same as Fisher’s exact test.
  • The distribution \(\pi(\theta \mid Z, Y)\) is exactly \(\pi(\theta \mid U)\), i.e., the posterior given the complete data.

Prior and posterior

  • We need to specify a prior for \(\theta = (\rho_1, \rho_2, \rho_3, \rho_{12}, \rho_{21}, 4\gamma)\).
  • For convenience, we choose \((\rho_1, \rho_2, \rho_3, \rho_{12}, \rho_{21}, 4\gamma) \sim \text{Dir}(\alpha_1, \ldots, \alpha_6)\).
  • The posterior given the complete data is \[\begin{multline*} (\rho_1, \rho_2, \rho_3, \rho_{12}, \rho_{21}, 4\gamma) \mid U \sim \text{Dir}(\alpha_1 + n_{11}, \alpha_2 + n_{22}, \alpha_3 + n_{33}, \alpha_4 + n_{12},\\ \alpha_5 + n_{21}, \alpha_6 + n_{13} + n_{23} + n_{31} + n_{32}). \end{multline*}\]

Analysis for Survey 1

source("../dataset/survey/survey_Missing.R")

Y <- c(988, 802, 256, 943, 851)
theta_sample <- single_survey_sampler(3000, Y)
Q_2.A' Q_2.B Q_2.No response total
Q_1.A 22.23 20.07 5.43 47.73
Q_1.B 18.09 16.30 5.43 39.82
Q_1.No response 5.43 5.43 1.59 12.45
total 45.75 41.80 12.45 100.00

Analysis for Survey 2

Y <- c(471, 402, 276, 483, 414)
theta_sample <- single_survey_sampler(3000, Y)
Q_2.A' Q_2.B Q_2.No response total
Q_1.A 17.19 14.81 8.83 40.83
Q_1.B 14.75 12.58 8.83 36.17
Q_1.No response 8.83 8.83 5.33 23.00
total 40.77 36.23 23.00 100.00

Meta analysis

  • We can also perform meta analysis by combining the 6 surveys.
  • Let \(U_i\) be the complete data for survey \(i\).
  • A simple hierarchical model is \[\begin{align*} U_i \mid \theta_i &\ind \text{Multinomial}(\theta_i),\\ \theta_i \mid \alpha, \nu &\iid \text{Dir}(\alpha\nu),\\ \nu &\sim \text{Dir}(1, 1, 1, 1, 1, 1),\\ \alpha &\sim \text{Exp}(0.001) \end{align*}\]
  • The complete data \(U_i\) is obtained by generating the missing data \(Z_i\) from \(\pi(Z_i \mid Y_i)\) and then combining with the observed data \(Y_i\).
  • I wrote a function multi_survey_sampler to perform the meta analysis.

What is the question?

  • Suppose you have obtained posterior samples of \(\theta_i = (\rho_{1,i}, \rho_{2,i}, \rho_{3,i}, \rho_{12,i}, \rho_{21,i}, \gamma_i)\) for \(i = 1, \ldots, 6\).
  • What do you want to know?
  • Probably, you want to compare \(\rho_{12,i}\) and \(\rho_{21,i}\) for each \(i = 1, \ldots, 6\).
  • You have several choices:
    • use \(\eta_{1, i} = \rho_{12,i} - \rho_{21,i} \in [-1, 1]\);
    • use \(\eta_{2, i} = \log\frac{\rho_{12,i}}{\rho_{21,i}} \in (-\infty, \infty)\);
  • Since we are using the joint distribution, it is okay to subtract probabilities (although it is not recommended).
  • You cannot arbitrarily subtract conditional probabilities or marginal probabilities.
  • For example, it is meaningless to compute \[\begin{align*} & \P(X = 1) - \P(Y = 0) \quad \text{or}\\ & \P(X = 1 \mid Y = 1) - \P(X = 1 \mid Y = 0). \end{align*}\]

Analysis

source("../dataset/survey/survey_Missing.R")
survey_data <- read.csv("../dataset/survey/survey_data.csv",header = TRUE)
attach(survey_data)
N1 <- cbind(round(n*p_1), round(n*p_2))
N1 <- cbind(N1, n - N1[,1] - N1[,2])
N2 <- cbind(round(n*q_1), round(n*q_2))
N2 <- cbind(N2, n - N2[,1] - N2[,2])
detach(survey_data)
Y <- cbind(N1, N2[,1:2])
Y |> `colnames<-`(c("n_1.", "n_2.", "n_3.", "n_.1", "n_.2")) |> kable()
n_1. n_2. n_3. n_.1 n_.2
988 802 256 943 851
471 402 276 483 414
518 368 226 517 388
512 358 242 454 399
476 346 260 430 357
576 435 473 567 454

Analysis

model <- stan_model("../dataset/survey/multi_survey.stan")
samples <- multi_survey_sampler(Y, model)
eta1 <- samples$theta[,4,] - samples$theta[,5,]
eta2 <- log(samples$theta[,4,]) - log(samples$theta[,5,])
eta1_overall <- samples$nu[,4] - samples$nu[,5]
eta2_overall <- log(samples$nu[,4]) - log(samples$nu[,5])

Results

  • \(\eta_1 = \rho_{12} - \rho_{21} =\) (prefer A > B > A’) - (prefer A’ > B > A)
  • \(\eta_2 = \log\frac{\rho_{12}}{\rho_{21}} =\) log(prefer A > B > A’) - log(prefer A’ > B > A)
  • The red dashed line is the overall estimate.