Generalized Linear Models

Chun-Hao Yang

Outline

  • Classical Linear Model
    • Ordinary Least Squares (OLS) Estimation
    • Maximum Likelihood (ML) Estimation
    • Penalty and Regularization
  • Generalized Linear Models
    • Logistic Regression
    • Multinomial Regression
  • Non-linear Models
    • Generalized Additive Models (GAM)
    • Projection Pursuit Regression (PPR)

Classical Linear Model

  • Given \(p\) covariates \(x_1, \ldots, x_p\) and a response variable \(y\), the classical linear model assumes that the relationship between the \(x_i\)’s and \(y\) is linear: \[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \]

  • Denoting \(\boldsymbol{x} = (1, x_{1}, \ldots, x_{p})^T\) and \(\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T\), the model can be written as \[ y = \boldsymbol{x}^T\boldsymbol{\beta}. \]

  • Suppose now we have \(n\) samples \((\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\) and we believe that the linear model above is a reasonable approximation of the relationship between the \(\boldsymbol{x}_i\)’s and \(y_i\).

  • The goal is to estimate the model parameter \(\boldsymbol{\beta}\).

Ordinary Least Squares (OLS) Estimation

  • The most common method to estimate \(\boldsymbol{\beta}\) is the ordinary least squares (OLS) estimation, that is, we find \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals: \[ \hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \boldsymbol{x}_i^T\boldsymbol{\beta})^2. \]
  • Denoting \[ \boldsymbol{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad \boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_n^T \end{bmatrix}, \] the minimization problem can be written as \[ \hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2. \]

Ordinary Least Squares (OLS) Estimation

  • Let \(L(\boldsymbol{\beta}) = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2\). This is often called the loss function.
  • Taking the gradient of \(L(\boldsymbol{\beta})\) with respect to \(\boldsymbol{\beta}\) and setting it to zero, we have \[ \nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta}) = -2\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) = 0. \]
  • The OLS estimation has a closed-form solution: \[ \hat{\boldsymbol{\beta}}^{\text{OLS}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}. \]
  • To ensure the existence of the inverse, we need to assume that \(\boldsymbol{X}^T\boldsymbol{X}\) is invertible, that is, the columns of \(\boldsymbol{X}\) are linearly independent.
  • To verify that \(\hat{\boldsymbol{\beta}}^{\text{OLS}}\) is indeed the minimizer, we need to show that the Hessian of \(L(\boldsymbol{\beta})\) is positive definite: \[ \nabla^2_{\boldsymbol{\beta}} L(\boldsymbol{\beta}) = 2\boldsymbol{X}^T\boldsymbol{X} \succ 0. \]
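  • As a quick numerical illustration, the sketch below computes the closed-form OLS solution with NumPy and checks it against np.linalg.lstsq (the simulated data and all variable names are illustrative, not part of the lecture).

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: (X^T X)^{-1} X^T y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq (SVD-based) is the numerically safer route in practice
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)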

Maximum Likelihood (ML) Estimation

  • Another way to estimate \(\boldsymbol{\beta}\) is the maximum likelihood (ML) estimation.
  • Suppose that the response variable \(y\) is normally distributed with mean \(\mu(\boldsymbol{x}) = \boldsymbol{x}^T\boldsymbol{\beta}\) and variance \(\sigma^2\): \[ y \mid \boldsymbol{x} \sim \mathcal{N}(\boldsymbol{x}^T\boldsymbol{\beta}, \sigma^2). \]
  • Assuming the samples are i.i.d., the likelihood function is \[ L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \boldsymbol{x}_i^T\boldsymbol{\beta})^2}{2\sigma^2}\right). \]
  • The log-likelihood function is \[ \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \boldsymbol{x}_i^T\boldsymbol{\beta})^2 = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2. \]
  • The ML estimates are \((\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) = \argmax_{\boldsymbol{\beta}, \sigma^2} \ell(\boldsymbol{\beta}, \sigma^2)\).

Maximum Likelihood (ML) Estimation

  • Taking the gradient of \(\ell(\boldsymbol{\beta}, \sigma^2)\) with respect to \(\boldsymbol{\beta}\) and \(\sigma^2\) and setting them to zero, we have \[ \nabla_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}, \sigma^2) = \frac{1}{\sigma^2} \boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) = 0, \] and \[ \frac{\partial}{\partial \sigma^2} \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 = 0. \]
  • The ML estimation has a closed-form solution: \[ \hat{\boldsymbol{\beta}}^{\text{MLE}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}, \quad \hat{\sigma}^2 = \frac{1}{n}\|\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}}^{\text{MLE}}\|_2^2. \]
  • The MLE of \(\boldsymbol{\beta}\) coincides with the OLS estimator.

OLS vs. ML Estimation

  • Compared to the OLS estimation, the ML estimation requires an additional assumption on the distribution of \(y\).
  • In the case of linear regression, the normality assumption is the most common one.
  • An equivalent way to express the linear regression under the normality assumption is \[ y = \boldsymbol{x}^T\boldsymbol{\beta} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2). \]
  • One of the advantages of the ML estimation is that it provides a way to estimate the variance of the estimated parameter \(\hat{\boldsymbol{\beta}}\): \[\begin{align*} \var(\hat{\boldsymbol{\beta}}^{\text{MLE}}) & = \var((\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}) \\ & = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\var(\boldsymbol{y})\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1} \\ & = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T(\sigma^2 I)\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1} \\ & = \sigma^2 (\boldsymbol{X}^T\boldsymbol{X})^{-1}. \end{align*}\]

Useful Properties of MLE

  • Under the normality assumption, the MLE has the following properties:
    • Unbiasedness: \(\E(\hat{\boldsymbol{\beta}}^{\text{MLE}}) = \boldsymbol{\beta}\).
    • Normality: \(\hat{\boldsymbol{\beta}}^{\text{MLE}} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2 (\boldsymbol{X}^T\boldsymbol{X})^{-1})\).
    • Prediction Intervals: for a given \(\boldsymbol{x}^{\star}\), the predicted value is \(y^{\star} = \boldsymbol{x}^{\star T}\hat{\boldsymbol{\beta}}^{\text{MLE}}\) and the prediction interval is \[ y^{\star} \pm t_{n-p-1, 1-\alpha/2} \hat{\sigma} \sqrt{1 + \boldsymbol{x}^{\star T}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{x}^{\star}}. \] (A computational sketch follows this list.)
    • We can also derive the confidence intervals and hypothesis tests for \(c^T\boldsymbol{\beta}\) for any \(c\).
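  • A minimal NumPy/SciPy sketch of the prediction interval above (the function name and its interface are illustrative; note that the \(t\) quantile presumes the unbiased variance estimate with \(n-p-1\) degrees of freedom, which is what is used here in place of the MLE \(\hat{\sigma}^2\)):

import numpy as np
from scipy import stats

def prediction_interval(X, y, x_star, alpha=0.05):
    """(1 - alpha) prediction interval at x_star; X and x_star include the intercept entry."""
    n, d = X.shape                                  # d = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - d))    # unbiased estimate of sigma
    y_star = x_star @ beta_hat
    t = stats.t.ppf(1 - alpha / 2, df=n - d)
    half = t * sigma_hat * np.sqrt(1 + x_star @ XtX_inv @ x_star)
    return y_star - half, y_star + half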

What if the samples are not i.i.d.?

  • If the samples are not i.i.d., we can model their joint distribution: \[ \boldsymbol{y} \mid \boldsymbol{X} \sim \mathcal{N}(\boldsymbol{X}\boldsymbol{\beta}, \sigma^2 W^{-1}) \] where \(\sigma^2 W^{-1}\) is an \(n\times n\) covariance matrix describing the dependence between the samples.
  • Consider the transformation \(\widetilde{\boldsymbol{y}} = W^{1/2}\boldsymbol{y}\) and \(\widetilde{\boldsymbol{X}} = W^{1/2}\boldsymbol{X}\). The model becomes \[ \widetilde{\boldsymbol{y}} \mid \widetilde{\boldsymbol{X}} \sim \mathcal{N}(\widetilde{\boldsymbol{X}}\boldsymbol{\beta}, \sigma^2 I). \]
  • Therefore the MLE for \(\boldsymbol{\beta}\) is \[ \hat{\boldsymbol{\beta}}^{\text{MLE}} = (\widetilde{\boldsymbol{X}}^T\widetilde{\boldsymbol{X}})^{-1}\widetilde{\boldsymbol{X}}^T\widetilde{\boldsymbol{y}} = (\boldsymbol{X}^TW\boldsymbol{X})^{-1}\boldsymbol{X}^TW\boldsymbol{y}, \] which is called the weighted least squares estimation.
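  • A minimal sketch of the weighted least squares estimator, assuming \(W\) is known (the diagonal \(W\) and all names here are illustrative); it checks the direct formula against the whitened-OLS route.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
W = np.diag(rng.uniform(0.5, 2.0, size=n))            # illustrative (diagonal) weight matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.multivariate_normal(np.zeros(n), np.linalg.inv(W))

# Direct formula: (X^T W X)^{-1} X^T W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Whitened OLS: regress W^{1/2} y on W^{1/2} X
W_half = sqrtm(W).real
beta_white, *_ = np.linalg.lstsq(W_half @ X, W_half @ y, rcond=None)
assert np.allclose(beta_wls, beta_white)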

Penalized Likelihood Estimation

  • However, in practice, the MLE might not be the best choice.
  • For example, when \(\boldsymbol{X}\) contains columns that are nearly collinear or when the number of covariates \(p\) is large, computing \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\) becomes numerically unstable.
  • One of the most common ways to address this issue is to add penalization or regularization.
  • The idea is to add a penalty term to the negative log-likelihood function, i.e., \[ -\ell(\boldsymbol{\beta}, \sigma^2) + \lambda \cdot \text{pen}(\boldsymbol{\beta}). \]
  • That is, we are looking for the \(\boldsymbol{\beta}\) that minimizes the negative log-likelihood plus the penalty.
  • The multiplier \(\lambda\) is a hyperparameter that controls the trade-off between the likelihood and the penalty.

Ridge Regression

  • One of the most common penalization methods is the Ridge regression.
  • The penalty term is the squared \(L_2\) norm of \(\boldsymbol{\beta}\): \[ \text{pen}(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2 = \sum_{j=0}^p \beta_j^2. \]
  • The Ridge estimator is \[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2. \]
  • Let \(L_{\lambda}(\boldsymbol{\beta}) = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2\) and set the gradient to zero \[ \nabla_{\boldsymbol{\beta}} L_{\lambda}(\boldsymbol{\beta}) = -2\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) + 2\lambda \boldsymbol{\beta} = 0. \]
  • Hence the Ridge estimator is \[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda I)^{-1}\boldsymbol{X}^T\boldsymbol{y}. \]

Ridge Regression

  • Typically, penalizing the intercept is not desired in Ridge regression, so \(\beta_0\) should be excluded from the penalty term.
  • A simple way to achieve this is to center all covariates and the responses so that \(\bar{y} = 0\) and \(\bar{\boldsymbol{x}} = 0\), which automatically results in \(\hat{\beta}_0 = 0\).
  • A second approach is to use the following penalty term: \[ \text{pen}(\boldsymbol{\beta}) = \sum_{j=1}^p \beta_j^2 = \boldsymbol{\beta}^TK\boldsymbol{\beta}, \] where \(K = \text{diag}(0, 1, \ldots, 1)\).
  • The Ridge estimator is \[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda K)^{-1}\boldsymbol{X}^T\boldsymbol{y}. \]
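  • A minimal sketch of this estimator (an illustrative function; \(\boldsymbol{X}\) is assumed to carry the intercept in its first column):

import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * K)^{-1} X^T y with an unpenalized intercept."""
    d = X.shape[1]
    K = np.eye(d)
    K[0, 0] = 0.0          # K = diag(0, 1, ..., 1): leave beta_0 out of the penalty
    return np.linalg.solve(X.T @ X + lam * K, X.T @ y)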

Ridge vs. OLS

  • The Ridge estimator is biased and the OLS is unbiased.
  • However, one can show that the Ridge estimator has a smaller variance than the OLS estimator.
  • When choosing an appropriate hyperparameter \(\lambda\), the Ridge estimator can have a smaller mean squared error (MSE) than the OLS estimator.
  • The Ridge estimator is a shrinkage estimator that shrinks the coefficients towards zero (a larger value of \(\lambda\) yields stronger shrinkage).
  • The Ridge estimator is particularly useful when the covariates are collinear or when the number of covariates is large.
  • The choice of \(\lambda\) is often done using cross-validation.
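  • In scikit-learn (which calls \(\lambda\) alpha), this cross-validation is available out of the box; a minimal sketch on simulated data:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=5, random_state=0)
alphas = np.logspace(-3, 3, 25)          # grid of candidate penalty strengths
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(ridge_cv.alpha_)                   # lambda selected by 5-fold CV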

Least Absolute Shrinkage and Selection Operator (LASSO)

  • Another common penalization method is the least absolute shrinkage and selection operator (LASSO).
  • The penalty term is the \(L_1\) norm of \(\boldsymbol{\beta}\): \[ \text{pen}(\boldsymbol{\beta}) = \sum_{j=1}^p |\beta_j|. \]
  • The LASSO estimator is \[ \hat{\boldsymbol{\beta}}^{\text{LASSO}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|. \]

LASSO Estimation

  • Note that the objective function of the LASSO estimator is not differentiable, due to the absolute value term.
  • No closed-form solution for the LASSO estimator is available. However, it can be solved using the Least Angle Regression or the coordinate descent algorithm (a coordinate descent sketch follows this list).
  • One of the key properties of the LASSO estimator is that it produces sparse solutions, i.e., some of the estimated coefficients are exactly zero.
  • This is particularly useful for variable selection, i.e., to identify the important covariates.
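  • A bare-bones coordinate descent sketch for the LASSO objective above (didactic only: cyclic updates with soft-thresholding, without the convergence checks or active-set tricks used by production solvers):

import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for ||y - X b||_2^2 + lam * ||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                    # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual with feature j removed
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2) / col_sq[j]
    return beta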

Ridge vs. LASSO (\(L_2\) penalty vs. \(L_1\) penalty)

Elastic-Net

  • The Elastic-Net is a combination of the Ridge and the LASSO.
  • The penalty term is a combination of the \(L_1\) and \(L_2\) norm of \(\boldsymbol{\beta}\): \[ \text{pen}(\boldsymbol{\beta}) = \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2. \]
  • The Elastic-Net estimator is \[ \hat{\boldsymbol{\beta}}^{\text{EN}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2. \]
  • Through the choice of \(\lambda_1\) and \(\lambda_2\), the Elastic-Net can be used to achieve the benefits of both the ridge and the LASSO.

Example - Sparse features

  • We generate a synthetic dataset with 50 samples and 10 features.
  • Only 5 out of the 10 features are informative.
  • We fit the linear regression, Ridge, LASSO, and Elastic-Net models to the data.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.datasets import make_regression

# 50 samples, 10 features, only 5 of which are informative
X, y, true_coef = make_regression(n_samples = 50, n_features = 10, 
                                  n_informative = 5, noise = 5,
                                  coef = True, random_state = 42)

# Fit OLS and the three penalized estimators with fixed penalty strengths
lm = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

Example - Sparse features

True Coef    Linear     Ridge     LASSO   Elastic-Net
   57.078    56.306    55.626    55.459        40.091
    0.000     0.173     0.471     0.000         1.685
    0.000    -0.185     0.073    -0.000         1.556
   35.610    33.877    33.447    33.189        25.617
    0.000     0.702     1.655     0.000         8.187
   60.577    60.568    59.158    59.634        38.185
    0.000     1.586     1.838     0.650         4.251
   64.592    64.964    63.242    63.704        37.886
    0.000    -0.440    -0.428    -0.000        -1.682
   98.652    99.563    96.871    98.682        61.042

Generalized Linear Models

Beyond Normality

  • When the response variable is not a continuous, real-valued quantity, the classical linear model is not appropriate.
  • For example:
    • Binary responses: \(y \in \{0, 1\}\).
    • Count data: \(y \in \{0, 1, 2, \ldots\}\).
    • Multinomial responses: \(y \in \{1, 2, \ldots, K\}\).
  • In these cases, neither the OLS estimation nor the normality assumption is appropriate.
  • The Generalized Linear Model (GLM) is a generalization of the classical linear model that allows for non-normal responses.
  • The key is to find a reasonable distribution to model \(y\).

Binary Responses: Logistic Regression

  • When \(y\) is binary, we can use the Bernoulli distribution \[ Y \mid \boldsymbol{x} \sim \text{Ber}(p(\boldsymbol{x})), \]
  • That is, \(\P(Y = 1 \mid \boldsymbol{x}) = p(\boldsymbol{x})\) and \(\P(Y = 0 \mid \boldsymbol{x}) = 1 - p(\boldsymbol{x})\) and the expectation is \(\E(Y \mid \boldsymbol{x}) = p(\boldsymbol{x})\).
  • The logistic regression model assumes \[ p(\boldsymbol{x}) = \frac{1}{1 + \exp(-\boldsymbol{x}^T\boldsymbol{\beta})}. \]
  • Equivalently, we can write \[ \log\left(\frac{p(\boldsymbol{x})}{1 - p(\boldsymbol{x})}\right) = \boldsymbol{x}^T\boldsymbol{\beta}. \]
  • That is, the log-odds of the event \(\{Y = 1\}\) is linear in \(\boldsymbol{x}\).

ML Estimation for Logistic Regression

  • Given \(n\) samples \((\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\), the likelihood function is \[ L(\boldsymbol{\beta}) = \prod_{i=1}^n p(\boldsymbol{x}_i)^{y_i} (1 - p(\boldsymbol{x}_i))^{1 - y_i}. \]
  • The log-likelihood function is \[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i \log(p(\boldsymbol{x}_i)) + (1 - y_i)\log(1 - p(\boldsymbol{x}_i))\right]. \]
  • Hence the MLE of \(\boldsymbol{\beta}\) is \[ \hat{\boldsymbol{\beta}}^{\text{MLE}} = \argmax_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}). \]
  • The negative log-likelihood is also called the cross-entropy loss, i.e., maximizing the likelihood is equivalent to minimizing the cross-entropy loss.
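  • In practice this maximization is carried out numerically (see the IRWLS derivation in the appendix); a minimal scikit-learn sketch on simulated data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
# penalty=None requests the plain MLE (scikit-learn regularizes by default;
# on versions before 1.2 use penalty='none' instead)
clf = LogisticRegression(penalty=None).fit(X, y)
print(clf.intercept_, clf.coef_)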

Cross-Entropy Loss

  • In Information Theory, the cross-entropy is defined as \[ H(p, q) = -\sum_{x} p(x) \log(q(x)) = -\E_p[\log(q(X))], \] where \(p(x)\) and \(q(x)\) are two discrete probability distributions.
  • A large cross-entropy indicates that the two distributions are very different.
  • In the case of logistic regression, we want to measure the discrepancy between the data \([y_i, 1-y_i]\) and the model \([p(\boldsymbol{x}_i), 1 - p(\boldsymbol{x}_i)]\).
  • Hence the cross-entropy is \[ -\sum_{i=1}^n \left[y_i \log(p(\boldsymbol{x}_i)) + (1 - y_i)\log(1 - p(\boldsymbol{x}_i))\right]. \]

Exponential family

  • Recall that an exponential family is a family of distributions \[ f(x \mid \theta) = h(x)\exp(\theta^T T(x) - \psi(\theta)) \] where \(\theta \in \R^k\) and \(T(x) = [T_1(x), \ldots, T_k(x)]^T\).
  • The parameter \(\theta\) is called the natural parameter or the canonical parameter and \(T(x)\) is the sufficient statistic.
  • Two useful properties (from Bartlett’s identities):
    • \(\E(T(X)) = \nabla_{\theta}\psi(\theta)\)
    • \(\var(T(X)) = \text{Hess}(\psi(\theta)) = \nabla^2_{\theta} \psi(\theta)\).
  • That is, the relationship between the parameter \(\theta\) and the expectation \(\E(T(X))\) is determined by \(\nabla \psi\).

Examples

  • Normal distribution: \[ f(x \mid \mu, \sigma^2) = \exp\left(-\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right), \quad x \in \R \]
    • \(\theta = \left(-\frac{1}{2\sigma^2}, \frac{\mu}{\sigma^2}\right)\), \(T(x) = (x^2, x)\), \(\psi(\theta) = -\frac{\theta_2^2}{4\theta_1} - \frac{1}{2}\log\left(-\frac{\theta_1}{\pi}\right)= \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\)
  • Bernoulli distribution: \[ f(x \mid p) = p^x(1-p)^{1-x} = \exp\left(x\log\frac{p}{1-p} + \log(1-p)\right), \quad x \in \{0, 1\} \]
    • \(\theta = \log\frac{p}{1-p}\), \(T(x) = x\), \(\psi(\theta) = -\log(1-p) = \log(1 + e^{\theta})\)
  • Poisson distribution: \[ f(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}= \frac{1}{x!}\exp(x\log\lambda - \lambda), \quad x = 0, 1, 2, \ldots \]
    • \(\theta = \log\lambda\), \(T(x) = x\), \(\psi(\theta) = \exp(\theta) = \lambda\)
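  • A quick numerical check of the identity \(\E(T(X)) = \nabla_{\theta}\psi(\theta)\) for the Bernoulli and Poisson families, using a central difference (the parameter values are arbitrary):

import numpy as np

def num_deriv(psi, theta, h=1e-6):
    return (psi(theta + h) - psi(theta - h)) / (2 * h)

# Bernoulli: psi(theta) = log(1 + e^theta); the mean is p
p = 0.3
theta = np.log(p / (1 - p))
print(num_deriv(lambda t: np.log1p(np.exp(t)), theta), p)   # both ~ 0.3

# Poisson: psi(theta) = e^theta; the mean is lambda
lam = 2.5
print(num_deriv(np.exp, np.log(lam)), lam)                  # both ~ 2.5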

Generalized Linear Model (GLM)

  • Let \(Y\) be univariate, \(\boldsymbol{x} \in \R^p\), and \(\boldsymbol{\beta} \in \R^p\).
  • A GLM assumes \(Y \mid \boldsymbol{x} \sim F_{\theta}\), where \(\theta = \boldsymbol{x}^T\boldsymbol{\beta}\) and \(F_\theta\) has the density function \[ f(y \mid \theta) = h(y)\exp(\theta\cdot y - \psi(\theta)). \]
  • Therefore \[\begin{align*} \E(Y \mid \boldsymbol{x}) & = \frac{d}{d\theta}\psi(\theta) = \psi^{\prime}(\boldsymbol{x}^T\boldsymbol{\beta}). \end{align*}\]
  • Equivalently, \[ g(\E(Y \mid \boldsymbol{x})) = \boldsymbol{x}^T\boldsymbol{\beta} \] where \(g\) is the inverse of \(\psi^{\prime}\).
  • The function \(g\) is called the link function.

Logistic Regression

  • For Bernoulli distributions, we have \[ \theta = \log\frac{p}{1-p}, \quad \psi(\theta) = -\log(1-p) = \log(1 + e^{\theta}). \]
  • Thus, \(\psi^{\prime}(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}\) and \(g(p) = (\psi^{\prime})^{-1}(p) = \log\frac{p}{1-p}\). \(\psi^{\prime}\) is called the logistic function and \(g\) is called the logit function.
  • Putting it all together, we have \[ g(\E(Y \mid \boldsymbol{x})) = \log\left(\frac{p(\boldsymbol{x})}{1 - p(\boldsymbol{x})}\right) = \boldsymbol{x}^T\boldsymbol{\beta} \] or equivalently \[ \P(Y = 1 \mid \boldsymbol{x}) = \psi^{\prime}(\boldsymbol{x}^T\boldsymbol{\beta}) = \frac{\exp(\boldsymbol{x}^T\boldsymbol{\beta})}{1 + \exp(\boldsymbol{x}^T\boldsymbol{\beta})}. \]

Remarks

  • The link function \(g = (\psi^{\prime})^{-1}\) is sometimes called the canonical link function, since it is derived from the canonical representation of an exponential family.
  • All we need from a link function is that it is invertible and maps the range of \(\E(Y \mid \boldsymbol{x})\) onto the range of \(\boldsymbol{x}^T\boldsymbol{\beta}\).
  • For example, in the Bernoulli linear model, we could have used the probit link function \[ g(u) = \Phi^{-1}(u): [0, 1] \to \R \] where \(\Phi\) is the CDF of the standard normal distribution.
  • This is called the probit regression.

Multinomial Regression

  • Multinomial regression is a generalization of Logistic regression to categorical variables with more than two categories.
  • Suppose \(Y\) is a categorical variable with \(K\) categories, \(Y \in \{1, 2, \ldots, K\}\).
  • A more useful representation is to use the one-hot encoding: \[ Y = [0, 0, \ldots, 1, \ldots, 0]^T \] where the \(k\)-th element is 1 and the rest are 0.
  • The multinomial regression model assumes \[ Y \mid \boldsymbol{x} \sim \text{Multi}(1, [p_1(\boldsymbol{x}), p_2(\boldsymbol{x}), \ldots, p_K(\boldsymbol{x})]^T) \] where \(p_k(\boldsymbol{x}) = \P(Y = 1_k \mid \boldsymbol{x})\) and \(1_k\) is the one-hot encoding of the \(k\)-th category.

Multinomial Distribution

  • The probability mass function of the \(\text{Multi}(m,p)\) is \[ f(x \mid p) = \frac{m!}{x_1!\cdots x_K!}\prod_{k=1}^K p_k^{x_k} = \frac{m!}{x_1!\cdots x_K!}\exp\left(\sum_{k=1}^K x_k\log p_k\right) \] where \(x = [x_1, x_2, \ldots, x_K]^T\), \(\sum_{k=1}^K x_k = m\), and \(\sum_{k=1}^K p_k = 1\).
  • Note that \(p_K = 1 - p_1 - \ldots - p_{K-1}\) and therefore \[\begin{align*} f(x \mid p) & = \frac{m!}{x_1!\cdots x_K!}\exp\left(\sum_{k=1}^{K-1} x_k\log p_k + \left(m - \sum_{k=1}^{K-1} x_k\right)\log(1 - p_1 - \ldots - p_{K-1})\right)\\ & = \frac{m!}{x_1!\cdots x_K!}\exp\left(\sum_{k=1}^{K-1} x_k\log\frac{p_k}{p_K} + m\log(1 - p_1 - \ldots - p_{K-1})\right). \end{align*}\]

Softmax Function

  • The canonical parameter is \(\theta = [\log\frac{p_1}{p_K}, \ldots, \log\frac{p_{K-1}}{p_K}]^T\) and therefore \(p_i = p_K\exp(\theta_i)\).
  • Using the relationship \(p_K = 1 - \sum_{k=1}^{K-1} p_k\), we have \[ p_K = 1 - p_K\sum_{k=1}^{K-1} \exp(\theta_k) \quad \Rightarrow \quad p_K = \frac{1}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}. \]
  • Hence (assume \(m = 1\) for simplicity) \[\begin{align*} \psi(\theta) & = - \log(1 - p_1 - \ldots - p_{K-1}) = - \log(1 - p_Ke^{\theta_1} - \ldots - p_Ke^{\theta_{K-1}})\\ & = - \log\left(1 - \frac{\sum_{k=1}^{K-1} \exp(\theta_k)}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}\right) = \log\left(1 + \sum_{k=1}^{K-1} \exp(\theta_k)\right). \end{align*}\]
  • Taking the derivative, we have the softmax function: \[ \nabla_{\theta}\psi(\theta) = \left[\frac{\exp(\theta_1)}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}, \ldots, \frac{\exp(\theta_{K-1})}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}\right]. \]

Multinomial Regression

  • The multinomial regression model is given by \[\begin{align*} \theta_i & = \boldsymbol{x}^T\boldsymbol{\beta}_i, \\ p_i(\boldsymbol{x}) & = \frac{\exp(\theta_i)}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}, \quad i = 1, 2, \ldots, K-1, \end{align*}\] where \(\boldsymbol{\beta}_i \in \R^p\).
  • In fact, a more common representation is \[\begin{align*} \tilde{\theta}_i & = \boldsymbol{x}^T\tilde{\boldsymbol{\beta}}_i, \\ p_i(\boldsymbol{x}) & = \frac{\exp(\tilde{\theta}_i)}{\sum_{k=1}^{K} \exp(\tilde{\theta}_k)}, \quad i = 1, 2, \ldots, K. \end{align*}\]
  • The equivalence is due to the transformation \(\theta_i = \tilde{\theta}_i - \tilde{\theta}_K\) and \(\boldsymbol{\beta}_i = \tilde{\boldsymbol{\beta}}_i - \tilde{\boldsymbol{\beta}}_K\). We can also write \[ [p_1(\boldsymbol{x}), \ldots, p_K(\boldsymbol{x})] = \texttt{softmax}(\boldsymbol{x}^T\boldsymbol{\beta}_1, \ldots, \boldsymbol{x}^T\boldsymbol{\beta}_K). \]
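  • A quick NumPy check that the \((K-1)\)-parameter canonical form and the full softmax representation give identical probabilities (the numbers are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # subtracting the max is a standard stability trick
    return e / e.sum()

theta_tilde = np.array([0.2, -1.0, 0.7, 0.0])   # K = 4 unconstrained scores
theta = theta_tilde[:-1] - theta_tilde[-1]      # K - 1 canonical parameters

p_full = softmax(theta_tilde)
p_canonical = np.append(np.exp(theta), 1.0) / (1.0 + np.exp(theta).sum())
assert np.allclose(p_full, p_canonical)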

Quick Summary

  • A GLM is \[ g(\E(Y \mid X = x)) = x^T\beta \Leftrightarrow \E(Y \mid X = x) = g^{-1}(x^T\beta). \]
  • The link function \(g\) connects the conditional expectation and the linear predictor and is chosen based on the distribution of \(Y\).
  • Examples:
    • Logistic regression: \(g(p) = \log\left(\frac{p}{1-p}\right)\), \(g^{-1}(x) = \frac{1}{1+e^{-x}}\).
    • Linear regression: \(g(\mu) = \mu\).
    • Multinomial regression: softmax function.
    • There are other choices and the above are called the canonical link functions.

Beyond Linearity

  • Up to now, we have assumed a linear relationship between the features and the (transformed) conditional expectation: \[ g(\E(Y \mid \boldsymbol{x})) = \boldsymbol{x}^T\boldsymbol{\beta} \]
  • However, this is a strong assumption and may not be appropriate in many cases.
  • To remove this assumption, we can consider \[ g(\E(Y \mid \boldsymbol{x})) = f(\boldsymbol{x}) \] where \(f: \R^p \to \R\) is an unknown function.
  • The problem is now to estimate the function \(f\).
  • Depending on the restrictions on \(f\), we can use different methods to estimate \(f\).

Generalized Additive Models (GAM)

  • An additive model assumes that the unknown function \(f\) is a sum of univariate functions: \[ f(\boldsymbol{x}) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p). \]
  • Therefore, the model is \[ g(\E(Y \mid \boldsymbol{x})) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p). \]
  • The functions \(f_j: \R \to \R\) are unknown and need to be estimated.
  • Nonparametric methods can be used to estimate the functions \(f_j\), for example, kernel smoothing, splines, etc. (a backfitting sketch follows this list).
  • The GAM is a generalization of the linear model that allows for non-linear relationships between the features and the conditional expectation.
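  • A toy backfitting sketch for the identity-link additive model, with a crude running-mean smoother standing in for the kernel or spline smoothers mentioned above (everything here is illustrative, not a production GAM fitter):

import numpy as np

def smooth(x, r, frac=0.2):
    """Nearest-neighbor running-mean smoother of r against x (deliberately crude)."""
    k = max(3, int(frac * len(x)))
    order = np.argsort(x)
    r_sorted = r[order]
    out = np.empty_like(r)
    for i in range(len(x)):
        lo, hi = max(0, i - k // 2), min(len(x), i + k // 2 + 1)
        out[order[i]] = r_sorted[lo:hi].mean()
    return out

def backfit(X, y, n_iter=20):
    """Backfitting for y ~ beta_0 + sum_j f_j(x_j) with the identity link."""
    n, p = X.shape
    beta0 = y.mean()
    F = np.zeros((n, p))                   # column j holds f_j evaluated at the data
    for _ in range(n_iter):
        for j in range(p):
            partial = y - beta0 - F.sum(axis=1) + F[:, j]   # residual without f_j
            F[:, j] = smooth(X[:, j], partial)
            F[:, j] -= F[:, j].mean()      # center each f_j for identifiability
    return beta0, F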

Projection Pursuit Regression (PPR)

  • The projection pursuit regression (PPR) model assumes: \[ f(\boldsymbol{x}) = \sum_{m=1}^M f_m(\boldsymbol{x}^T\boldsymbol{\omega}_m), \] where \(\boldsymbol{\omega}_m \in \R^p\) are unknown unit vectors and \(f_m: \R \to \R\) are unknown functions.
  • The scalar variable \(V_m = \boldsymbol{x}^T\boldsymbol{\omega}_m\) is the projection of \(\boldsymbol{x}\) onto the unit vector \(\boldsymbol{\omega}_m\), and we seek \(\boldsymbol{\omega}_m\) so that the model fits well, hence the name “projection pursuit.”
  • If \(M\) is taken arbitrarily large, for appropriate choice of \(f_m\) the PPR model can approximate any continuous function in \(\R^p\) arbitrarily well, i.e., the PPR is a universal approximator.
  • However, this model is not widely used due to the difficulty in estimating the functions \(f_m\).

MLE for Logistic Regression

The log-likelihood function

In order to find the MLE, we need to simplify the log-likelihood function: \[\begin{align*} \ell(\boldsymbol{\beta}) & = \log \left(\prod_{i=1}^n p(\boldsymbol{x}_i)^{y_i}(1-p(\boldsymbol{x}_i))^{1-y_i}\right) \\ & = \sum_{i=1}^n y_i \log p(\boldsymbol{x}_i) + (1-y_i)\log(1-p(\boldsymbol{x}_i))\\ & = \sum_{i=1}^n \left[y_i \log \left(\frac{p(\boldsymbol{x}_i)}{1-p(\boldsymbol{x}_i)}\right) + \log(1-p(\boldsymbol{x}_i))\right]\\ & = \sum_{i=1}^n \left[y_i\boldsymbol{x}_i^T\boldsymbol{\beta} - \log(1+\exp(\boldsymbol{x}_i^T\boldsymbol{\beta}))\right]. \end{align*}\]

Gradient and Hessian

Now we compute the gradient and the Hessian of the log-likelihood function:

\[\begin{align*} \nabla \ell(\boldsymbol{\beta}) & = \sum_{i=1}^n \left[y_i \boldsymbol{x}_i - \frac{\exp(\boldsymbol{x}_i^T\boldsymbol{\beta})}{1+\exp(\boldsymbol{x}_i^T\boldsymbol{\beta})}\boldsymbol{x}_i\right] = \sum_{i=1}^n (y_i - p(\boldsymbol{x}_i))\boldsymbol{x}_i\\ & = X^T(\boldsymbol{y}-\mathbf{p})\\ \nabla^2 \ell(\boldsymbol{\beta}) & = -\sum_{i=1}^n p(\boldsymbol{x}_i)(1-p(\boldsymbol{x}_i))\boldsymbol{x}_i\boldsymbol{x}_i^T = -X^TWX \end{align*}\] where \[ \mathbf{p} = [p(\boldsymbol{x}_1), \ldots, p(\boldsymbol{x}_n)]^T, \quad W= \diag(\mathbf{p})\diag(1-\mathbf{p}), \quad X = \begin{bmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_n^T \end{bmatrix}. \]

Iteratively Re-Weighted Least Squares (IRWLS)

There is no analytic solution for the MLE of the logistic regression. However, the Newton-Raphson method can be used to find the MLE. The Newton-Raphson method is an iterative method that updates the parameter \(\boldsymbol{\beta}\) as follows:

\[\begin{align*} \boldsymbol{\beta}^{(t+1)} & = \boldsymbol{\beta}^{(t)} - \left[\nabla^2 \ell(\boldsymbol{\beta}^{(t)})\right]^{-1} \nabla \ell\left(\boldsymbol{\beta}^{(t)}\right) \\ & = \boldsymbol{\beta}^{(t)}+\left(X^T W^{(t)}X\right)^{-1} X^T\left(\boldsymbol{y}-\mathbf{p}^{(t)}\right) \\ & = \left(X^T W^{(t)} X\right)^{-1} X^T W^{(t)}\left[X \boldsymbol{\beta}^{(t)}+\left(W^{(t)}\right)^{-1}\left(\boldsymbol{y}-\mathbf{p}^{(t)}\right)\right] \\ & = \left(X^T W^{(t)} X\right)^{-1} X^T W^{(t)} \mathbf{z}^{(t)} \end{align*}\] where \[\begin{align*} \mathbf{z}^{(t)} & =X \boldsymbol{\beta}^{(t)}+\left(W^{(t)}\right)^{-1}\left(\boldsymbol{y}-\mathbf{p}^{(t)}\right), \quad \mathbf{p}^{(t)} = [p^{(t)}(\boldsymbol{x}_1), \ldots, p^{(t)}(\boldsymbol{x}_n)]^T\\ p^{(t)}(\boldsymbol{x}_i) & = \frac{\exp(\boldsymbol{x}_i^T\boldsymbol{\beta}^{(t)})}{1+\exp(\boldsymbol{x}_i^T\boldsymbol{\beta}^{(t)})}, \quad W^{(t)} = \diag(\mathbf{p}^{(t)})\diag(1-\mathbf{p}^{(t)}). \end{align*}\]
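A direct NumPy transcription of the update above, as a sketch (no step-size control or convergence check, and \(W^{(t)}\) is assumed invertible):

import numpy as np

def irwls_logistic(X, y, n_iter=25):
    """Newton-Raphson / IRWLS for logistic regression, following the update above."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # p^(t)
        w = p * (1.0 - p)                              # diagonal of W^(t)
        z = X @ beta + (y - p) / w                     # working response z^(t)
        # weighted least squares step: (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta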