Generalized Linear Models

Chun-Hao Yang

Outline

  • Classical Linear Model
    • Ordinary Least Squares (OLS) Estimation
    • Maximum Likelihood (ML) Estimation
    • Penalty and Regularization
  • Generalized Linear Models
    • Logistic Regression
    • Multinomial Regression
  • Non-linear Models
    • Generalized Additive Models (GAM)
    • Projection Pursuit Regression (PPR)

Classical Linear Model

  • Given \(p\) covariates \(x_1, \ldots, x_p\) and a response variable \(y\), the classical linear model assumes that the relationship between the \(x_i\)’s and \(y\) is linear: \[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \]

  • Denoting \(\boldsymbol{x} = (1, x_{1}, \ldots, x_{p})^T\) and \(\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T\), the model can be written as \[ y = \boldsymbol{x}^T\boldsymbol{\beta}. \]

  • Suppose now we have \(n\) samples \((\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\) and we believe that the linear model above is a reasonable approximation of the relationship between the \(\boldsymbol{x}_i\)’s and \(y_i\).

  • The goal is to estimate the model parameter \(\boldsymbol{\beta}\).

Ordinary Least Squares (OLS) Estimation

  • The most common method to estimate \(\boldsymbol{\beta}\) is the ordinary least squares (OLS) estimation, that is, we find \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals: \[ \hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \boldsymbol{x}_i^T\boldsymbol{\beta})^2. \]
  • Denoting \[ \boldsymbol{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad \boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_n^T \end{bmatrix}, \] the minimization problem can be written as \[ \hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2. \]

Ordinary Least Squares (OLS) Estimation

  • Let \(L(\boldsymbol{\beta}) = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2\). This is often called the loss function.
  • Taking the gradient of \(L(\boldsymbol{\beta})\) with respect to \(\boldsymbol{\beta}\) and setting it to zero, we have \[ \nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta}) = -2\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) = 0. \]
  • The OLS estimation has a closed-form solution: \[ \hat{\boldsymbol{\beta}}^{\text{OLS}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}. \]
  • To ensure the existence of the inverse, we need to assume that \(\boldsymbol{X}^T\boldsymbol{X}\) is invertible, that is, the columns of \(\boldsymbol{X}\) are linearly independent.
  • To verify that \(\hat{\boldsymbol{\beta}}^{\text{OLS}}\) is indeed the minimizer, we need to show that the Hessian of \(L(\boldsymbol{\beta})\) is positive definite: \[ \nabla^2_{\boldsymbol{\beta}} L(\boldsymbol{\beta}) = 2\boldsymbol{X}^T\boldsymbol{X} \succ 0. \]
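  • As a quick numerical illustration, the sketch below computes the closed-form OLS solution with NumPy and checks it against np.linalg.lstsq (the simulated data and all variable names are illustrative, not part of the lecture).

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: (X^T X)^{-1} X^T y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq (SVD-based) is the numerically safer route in practice
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)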

Maximum Likelihood (ML) Estimation

  • Another way to estimate \(\boldsymbol{\beta}\) is the maximum likelihood (ML) estimation.
  • Suppose that the response variable \(y\) is normally distributed with mean \(\mu(\boldsymbol{x}) = \boldsymbol{x}^T\boldsymbol{\beta}\) and variance \(\sigma^2\): \[ y \mid \boldsymbol{x} \sim \mathcal{N}(\boldsymbol{x}^T\boldsymbol{\beta}, \sigma^2). \]
  • Assuming the samples are i.i.d., the likelihood function is \[ L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \boldsymbol{x}_i^T\boldsymbol{\beta})^2}{2\sigma^2}\right). \]
  • The log-likelihood function is \[ \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \boldsymbol{x}_i^T\boldsymbol{\beta})^2 = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2. \]
  • The ML estimates are \((\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) = \argmax_{\boldsymbol{\beta}, \sigma^2} \ell(\boldsymbol{\beta}, \sigma^2)\).

Maximum Likelihood (ML) Estimation

  • Taking the gradient of \(\ell(\boldsymbol{\beta}, \sigma^2)\) with respect to \(\boldsymbol{\beta}\) and \(\sigma^2\) and setting them to zero, we have \[ \nabla_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}, \sigma^2) = \frac{1}{\sigma^2} \boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) = 0, \] and \[ \frac{\partial}{\partial \sigma^2} \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 = 0. \]
  • The ML estimation has a closed-form solution: \[ \hat{\boldsymbol{\beta}}^{\text{MLE}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}, \quad \hat{\sigma}^2 = \frac{1}{n}\|\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}}^{\text{MLE}}\|_2^2. \]
  • The MLE of \(\boldsymbol{\beta}\) coincides with the OLS estimator.

OLS vs. ML Estimation

  • Compared to the OLS estimation, the ML estimation requires an additional assumption on the distribution of \(y\).
  • In the case of linear regression, the normality assumption is the most common one.
  • An equivalent way to express the linear regression under the normality assumption is \[ y = \boldsymbol{x}^T\boldsymbol{\beta} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2). \]
  • One of the advantages of the ML estimation is that it provides a way to estimate the variance of the estimated parameter \(\hat{\boldsymbol{\beta}}\): \[\begin{align*} \var(\hat{\boldsymbol{\beta}}^{\text{MLE}}) & = \var((\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}) \\ & = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\var(\boldsymbol{y})\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1} \\ & = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T(\sigma^2 I)\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1} \\ & = \sigma^2 (\boldsymbol{X}^T\boldsymbol{X})^{-1}. \end{align*}\]

Useful Properties of MLE

  • Under the normality assumption, the MLE has the following properties:
    • Unbiasedness: \(\E(\hat{\boldsymbol{\beta}}^{\text{MLE}}) = \boldsymbol{\beta}\).
    • Normality: \(\hat{\boldsymbol{\beta}}^{\text{MLE}} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2 (\boldsymbol{X}^T\boldsymbol{X})^{-1})\).
    • Prediction Intervals: for a given \(\boldsymbol{x}^{\star}\), the predicted value is \(y^{\star} = \boldsymbol{x}^{\star T}\hat{\boldsymbol{\beta}}^{\text{MLE}}\) and the prediction interval is \[ y^{\star} \pm t_{n-p-1, 1-\alpha/2} \hat{\sigma} \sqrt{1 + \boldsymbol{x}^{\star T}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{x}^{\star}}. \] (A computational sketch follows this list.)
    • We can also derive the confidence intervals and hypothesis tests for \(c^T\boldsymbol{\beta}\) for any \(c\).
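  • A minimal NumPy/SciPy sketch of the prediction interval above (the function name and its interface are illustrative; note that the \(t\) quantile presumes the unbiased variance estimate with \(n-p-1\) degrees of freedom, which is what is used here in place of the MLE \(\hat{\sigma}^2\)):

import numpy as np
from scipy import stats

def prediction_interval(X, y, x_star, alpha=0.05):
    """(1 - alpha) prediction interval at x_star; X and x_star include the intercept entry."""
    n, d = X.shape                                  # d = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - d))    # unbiased estimate of sigma
    y_star = x_star @ beta_hat
    t = stats.t.ppf(1 - alpha / 2, df=n - d)
    half = t * sigma_hat * np.sqrt(1 + x_star @ XtX_inv @ x_star)
    return y_star - half, y_star + half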

What if the samples are not i.i.d.?

  • If the samples are not i.i.d., we can model their joint distribution: \[ \boldsymbol{y} \mid \boldsymbol{X} \sim \mathcal{N}(\boldsymbol{X}\boldsymbol{\beta}, \sigma^2 W^{-1}) \] where \(\sigma^2 W^{-1}\) is an \(n\times n\) covariance matrix describing the dependence between the samples.
  • Consider the transformation \(\widetilde{\boldsymbol{y}} = W^{1/2}\boldsymbol{y}\) and \(\widetilde{\boldsymbol{X}} = W^{1/2}\boldsymbol{X}\). The model becomes \[ \widetilde{\boldsymbol{y}} \mid \widetilde{\boldsymbol{X}} \sim \mathcal{N}(\widetilde{\boldsymbol{X}}\boldsymbol{\beta}, \sigma^2 I). \]
  • Therefore the MLE for \(\boldsymbol{\beta}\) is \[ \hat{\boldsymbol{\beta}}^{\text{MLE}} = (\widetilde{\boldsymbol{X}}^T\widetilde{\boldsymbol{X}})^{-1}\widetilde{\boldsymbol{X}}^T\widetilde{\boldsymbol{y}} = (\boldsymbol{X}^TW\boldsymbol{X})^{-1}\boldsymbol{X}^TW\boldsymbol{y}, \] which is called the weighted least squares estimation.
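  • A minimal sketch of the weighted least squares estimator, assuming \(W\) is known (the diagonal \(W\) and all names here are illustrative); it checks the direct formula against the whitened-OLS route.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
W = np.diag(rng.uniform(0.5, 2.0, size=n))            # illustrative (diagonal) weight matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.multivariate_normal(np.zeros(n), np.linalg.inv(W))

# Direct formula: (X^T W X)^{-1} X^T W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Whitened OLS: regress W^{1/2} y on W^{1/2} X
W_half = sqrtm(W).real
beta_white, *_ = np.linalg.lstsq(W_half @ X, W_half @ y, rcond=None)
assert np.allclose(beta_wls, beta_white)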

Penalized Likelihood Estimation

  • However, in practice, the MLE might not be the best choice.
  • For example, when \(\boldsymbol{X}\) contains columns that are nearly collinear or when the number of covariates \(p\) is large, computing \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\) becomes numerically unstable.
  • One of the most common ways to address this issue is to add penalization or regularization.
  • The idea is to add a penalty term to the negative log-likelihood function, i.e., \[ -\ell(\boldsymbol{\beta}, \sigma^2) + \lambda \cdot \text{pen}(\boldsymbol{\beta}). \]
  • That is, we are looking for the \(\boldsymbol{\beta}\) that minimizes the negative log-likelihood plus the penalty.
  • The multiplier \(\lambda\) is a hyperparameter that controls the trade-off between the likelihood and the penalty.

Ridge Regression

  • One of the most common penalization methods is the Ridge regression.
  • The penalty term is the squared \(L_2\) norm of \(\boldsymbol{\beta}\): \[ \text{pen}(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2 = \sum_{j=0}^p \beta_j^2. \]
  • The Ridge estimator is \[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2. \]
  • Let \(L_{\lambda}(\boldsymbol{\beta}) = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2\) and set the gradient to zero \[ \nabla_{\boldsymbol{\beta}} L_{\lambda}(\boldsymbol{\beta}) = -2\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) + 2\lambda \boldsymbol{\beta} = 0. \]
  • Hence the Ridge estimator is \[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda I)^{-1}\boldsymbol{X}^T\boldsymbol{y}. \]

Ridge Regression

  • Typically, penalizing the intercept is not desired in Ridge regression, so \(\beta_0\) should be excluded from the penalty term.
  • A simple way to achieve this is to center all covariates and the responses so that \(\bar{y} = 0\) and \(\bar{\boldsymbol{x}} = 0\), which automatically results in \(\hat{\beta}_0 = 0\).
  • A second approach is to use the following penalty term: \[ \text{pen}(\boldsymbol{\beta}) = \sum_{j=1}^p \beta_j^2 = \boldsymbol{\beta}^TK\boldsymbol{\beta}, \] where \(K = \text{diag}(0, 1, \ldots, 1)\).
  • The Ridge estimator is \[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda K)^{-1}\boldsymbol{X}^T\boldsymbol{y}. \]
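  • A minimal sketch of this estimator (an illustrative function; \(\boldsymbol{X}\) is assumed to carry the intercept in its first column):

import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * K)^{-1} X^T y with an unpenalized intercept."""
    d = X.shape[1]
    K = np.eye(d)
    K[0, 0] = 0.0          # K = diag(0, 1, ..., 1): leave beta_0 out of the penalty
    return np.linalg.solve(X.T @ X + lam * K, X.T @ y)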

Ridge vs. OLS

  • The Ridge estimator is biased and the OLS is unbiased.
  • However, one can show that the Ridge estimator has a smaller variance than the OLS estimator.
  • When choosing an appropriate hyperparameter \(\lambda\), the Ridge estimator can have a smaller mean squared error (MSE) than the OLS estimator.
  • The Ridge estimator is a shrinkage estimator that shrinks the coefficients towards zero (a larger value of \(\lambda\) yields stronger shrinkage).
  • The Ridge estimator is particularly useful when the covariates are collinear or when the number of covariates is large.
  • The choice of \(\lambda\) is often done using cross-validation.
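  • In scikit-learn (which calls \(\lambda\) alpha), this cross-validation is available out of the box; a minimal sketch on simulated data:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=5, random_state=0)
alphas = np.logspace(-3, 3, 25)          # grid of candidate penalty strengths
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(ridge_cv.alpha_)                   # lambda selected by 5-fold CV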

Least Absolute Shrinkage and Selection Operator (LASSO)

  • Another common penalization method is the least absolute shrinkage and selection operator (LASSO).
  • The penalty term is the \(L_1\) norm of \(\boldsymbol{\beta}\): \[ \text{pen}(\boldsymbol{\beta}) = \sum_{j=1}^p |\beta_j|. \]
  • The LASSO estimator is \[ \hat{\boldsymbol{\beta}}^{\text{LASSO}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|. \]

LASSO Estimation

  • Note that the objective function of the LASSO estimator is not differentiable, due to the absolute value term.
  • No closed-form solution for the LASSO estimator is available. However, it can be solved using the Least Angle Regression or the coordinate descent algorithm (a coordinate descent sketch follows this list).
  • One of the key properties of the LASSO estimator is that it produces sparse solutions, i.e., some of the estimated coefficients are exactly zero.
  • This is particularly useful for variable selection, i.e., to identify the important covariates.
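  • A bare-bones coordinate descent sketch for the LASSO objective above (didactic only: cyclic updates with soft-thresholding, without the convergence checks or active-set tricks used by production solvers):

import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for ||y - X b||_2^2 + lam * ||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                    # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual with feature j removed
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2) / col_sq[j]
    return beta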

Ridge vs. LASSO (\(L_2\) penalty vs. \(L_1\) penalty)

Elastic-Net

  • The Elastic-Net is a combination of the Ridge and the LASSO.
  • The penalty term is a combination of the \(L_1\) and \(L_2\) norm of \(\boldsymbol{\beta}\): \[ \text{pen}(\boldsymbol{\beta}) = \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2. \]
  • The Elastic-Net estimator is \[ \hat{\boldsymbol{\beta}}^{\text{EN}} = \argmin_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2. \]
  • Through the choice of \(\lambda_1\) and \(\lambda_2\), the Elastic-Net can be used to achieve the benefits of both the ridge and the LASSO.

Example - Sparse features

  • We generate a synthetic dataset with 50 samples and 10 features.
  • Only 5 out of the 10 features are informative.
  • We fit the linear regression, Ridge, LASSO, and Elastic-Net models to the data.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.datasets import make_regression

# 50 samples, 10 features, only 5 of which are informative
X, y, true_coef = make_regression(n_samples = 50, n_features = 10, 
                                  n_informative = 5, noise = 5,
                                  coef = True, random_state = 42)

# Fit OLS and the three penalized estimators with fixed penalty strengths
lm = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

Example - Sparse features

True Coef    Linear     Ridge     LASSO   Elastic-Net
   57.078    56.306    55.626    55.459        40.091
    0.000     0.173     0.471     0.000         1.685
    0.000    -0.185     0.073    -0.000         1.556
   35.610    33.877    33.447    33.189        25.617
    0.000     0.702     1.655     0.000         8.187
   60.577    60.568    59.158    59.634        38.185
    0.000     1.586     1.838     0.650         4.251
   64.592    64.964    63.242    63.704        37.886
    0.000    -0.440    -0.428    -0.000        -1.682
   98.652    99.563    96.871    98.682        61.042

Generalized Linear Models

Beyond Normality

  • When the response variable is not a continuous, real-valued quantity, the classical linear model is not appropriate.
  • For example:
    • Binary responses: \(y \in \{0, 1\}\).
    • Count data: \(y \in \{0, 1, 2, \ldots\}\).
    • Multinomial responses: \(y \in \{1, 2, \ldots, K\}\).
  • In these cases, neither the OLS estimation nor the normality assumption is appropriate.
  • The Generalized Linear Model (GLM) is a generalization of the classical linear model that allows for non-normal responses.
  • The key is to find a reasonable distribution to model \(y\).

Binary Responses: Logistic Regression

  • When \(y\) is binary, we can use the Bernoulli distribution \[ Y \mid \boldsymbol{x} \sim \text{Ber}(p(\boldsymbol{x})), \]
  • That is, \(\P(Y = 1 \mid \boldsymbol{x}) = p(\boldsymbol{x})\) and \(\P(Y = 0 \mid \boldsymbol{x}) = 1 - p(\boldsymbol{x})\) and the expectation is \(\E(Y \mid \boldsymbol{x}) = p(\boldsymbol{x})\).
  • The logistic regression model assumes \[ p(\boldsymbol{x}) = \frac{1}{1 + \exp(-\boldsymbol{x}^T\boldsymbol{\beta})}. \]
  • Equivalently, we can write \[ \log\left(\frac{p(\boldsymbol{x})}{1 - p(\boldsymbol{x})}\right) = \boldsymbol{x}^T\boldsymbol{\beta}. \]
  • That is, the log-odds of the event \(\{Y = 1\}\) is linear in \(\boldsymbol{x}\).

ML Estimation for Logistic Regression

  • Given \(n\) samples \((\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\), the likelihood function is \[ L(\boldsymbol{\beta}) = \prod_{i=1}^n p(\boldsymbol{x}_i)^{y_i} (1 - p(\boldsymbol{x}_i))^{1 - y_i}. \]
  • The log-likelihood function is \[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i \log(p(\boldsymbol{x}_i)) + (1 - y_i)\log(1 - p(\boldsymbol{x}_i))\right]. \]
  • Hence the MLE of \(\boldsymbol{\beta}\) is \[ \hat{\boldsymbol{\beta}}^{\text{MLE}} = \argmax_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}). \]
  • The negative log-likelihood is also called the cross-entropy loss, i.e., maximizing the likelihood is equivalent to minimizing the cross-entropy loss.
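  • In practice this maximization is carried out numerically (see the IRWLS derivation in the appendix); a minimal scikit-learn sketch on simulated data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
# penalty=None requests the plain MLE (scikit-learn regularizes by default;
# on versions before 1.2 use penalty='none' instead)
clf = LogisticRegression(penalty=None).fit(X, y)
print(clf.intercept_, clf.coef_)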

Cross-Entropy Loss

  • In Information Theory, the cross-entropy is defined as \[ H(p, q) = -\sum_{x} p(x) \log(q(x)) = -\E_p[\log(q(X))], \] where \(p(x)\) and \(q(x)\) are two discrete probability distributions.
  • A large cross-entropy indicates that the two distributions are very different.
  • In the case of logistic regression, we want to measure the discrepancy between the data \([y_i, 1-y_i]\) and the model \([p(\boldsymbol{x}_i), 1 - p(\boldsymbol{x}_i)]\).
  • Hence the cross-entropy is \[ -\sum_{i=1}^n \left[y_i \log(p(\boldsymbol{x}_i)) + (1 - y_i)\log(1 - p(\boldsymbol{x}_i))\right]. \]

Exponential family

  • Recall that an exponential family is a family of distributions \[ f(x \mid \theta) = h(x)\exp(\theta^T T(x) - \psi(\theta)) \] where \(\theta \in \R^k\) and \(T(x) = [T_1(x), \ldots, T_k(x)]^T\).
  • The parameter \(\theta\) is called the natural parameter or the canonical parameter and \(T(x)\) is the sufficient statistic.
  • Two useful properties (from Bartlett’s identities):
    • \(\E(T(X)) = \nabla_{\theta}\psi(\theta)\)
    • \(\var(T(X)) = \text{Hess}(\psi(\theta)) = \nabla^2_{\theta} \psi(\theta)\).
  • That is, the relationship between the parameter \(\theta\) and the expectation \(\E(T(X))\) is determined by \(\nabla \psi\).

Examples

  • Normal distribution: \[ f(x \mid \mu, \sigma^2) = \exp\left(-\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right), \quad x \in \R \]
    • \(\theta = \left(-\frac{1}{2\sigma^2}, \frac{\mu}{\sigma^2}\right)\), \(T(x) = (x^2, x)\), \(\psi(\theta) = -\frac{\theta_2^2}{4\theta_1} - \frac{1}{2}\log\left(-\frac{\theta_1}{\pi}\right)= \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\)
  • Bernoulli distribution: \[ f(x \mid p) = p^x(1-p)^{1-x} = \exp\left(x\log\frac{p}{1-p} + \log(1-p)\right), \quad x \in \{0, 1\} \]
    • \(\theta = \log\frac{p}{1-p}\), \(T(x) = x\), \(\psi(\theta) = -\log(1-p) = \log(1 + e^{\theta})\)
  • Poisson distribution: \[ f(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}= \frac{1}{x!}\exp(x\log\lambda - \lambda), \quad x = 0, 1, 2, \ldots \]
    • \(\theta = \log\lambda\), \(T(x) = x\), \(\psi(\theta) = \exp(\theta) = \lambda\)
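  • A quick numerical check of the identity \(\E(T(X)) = \nabla_{\theta}\psi(\theta)\) for the Bernoulli and Poisson families, using a central difference (the parameter values are arbitrary):

import numpy as np

def num_deriv(psi, theta, h=1e-6):
    return (psi(theta + h) - psi(theta - h)) / (2 * h)

# Bernoulli: psi(theta) = log(1 + e^theta); the mean is p
p = 0.3
theta = np.log(p / (1 - p))
print(num_deriv(lambda t: np.log1p(np.exp(t)), theta), p)   # both ~ 0.3

# Poisson: psi(theta) = e^theta; the mean is lambda
lam = 2.5
print(num_deriv(np.exp, np.log(lam)), lam)                  # both ~ 2.5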

Generalized Linear Model (GLM)

  • Let \(Y\) be univariate, \(\boldsymbol{x} \in \R^p\), and \(\boldsymbol{\beta} \in \R^p\).
  • A GLM assumes \(Y \mid \boldsymbol{x} \sim F_{\theta}\), where \(\theta = \boldsymbol{x}^T\boldsymbol{\beta}\) and \(F_\theta\) has the density function \[ f(y \mid \theta) = h(y)\exp(\theta\cdot y - \psi(\theta)). \]
  • Therefore \[\begin{align*} \E(Y \mid \boldsymbol{x}) & = \frac{d}{d\theta}\psi(\theta) = \psi^{\prime}(\boldsymbol{x}^T\boldsymbol{\beta}). \end{align*}\]
  • Equivalently, \[ g(\E(Y \mid \boldsymbol{x})) = \boldsymbol{x}^T\boldsymbol{\beta} \] where \(g\) is the inverse of \(\psi^{\prime}\).
  • The function \(g\) is called the link function.

Logistic Regression

  • For Bernoulli distributions, we have \[ \theta = \log\frac{p}{1-p}, \quad \psi(\theta) = -\log(1-p) = \log(1 + e^{\theta}). \]
  • Thus, \(\psi^{\prime}(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}\) and \(g(p) = (\psi^{\prime})^{-1}(p) = \log\frac{p}{1-p}\). \(\psi^{\prime}\) is called the logistic function and \(g\) is called the logit function.
  • Putting it all together, we have \[ g(\E(Y \mid \boldsymbol{x})) = \log\left(\frac{p(\boldsymbol{x})}{1 - p(\boldsymbol{x})}\right) = \boldsymbol{x}^T\boldsymbol{\beta} \] or equivalently \[ \P(Y = 1 \mid \boldsymbol{x}) = \psi^{\prime}(\boldsymbol{x}^T\boldsymbol{\beta}) = \frac{\exp(\boldsymbol{x}^T\boldsymbol{\beta})}{1 + \exp(\boldsymbol{x}^T\boldsymbol{\beta})}. \]

Remarks

  • The link function \(g = (\psi^{\prime})^{-1}\) is sometimes called the canonical link function, since it is derived from the canonical representation of an exponential family.
  • All we need from a link function is that it is invertible and maps the range of \(\E(Y \mid \boldsymbol{x})\) onto the range of \(\boldsymbol{x}^T\boldsymbol{\beta}\).
  • For example, in the Bernoulli linear model, we could have used the probit link function \[ g(u) = \Phi^{-1}(u): [0, 1] \to \R \] where \(\Phi\) is the CDF of the standard normal distribution.
  • This is called the probit regression.

Multinomial Regression

  • Multinomial regression is a generalization of Logistic regression to categorical variables with more than two categories.
  • Suppose \(Y\) is a categorical variable with \(K\) categories, \(Y \in \{1, 2, \ldots, K\}\).
  • A more useful representation is to use the one-hot encoding: \[ Y = [0, 0, \ldots, 1, \ldots, 0]^T \] where the \(k\)-th element is 1 and the rest are 0.
  • The multinomial regression model assumes \[ Y \mid \boldsymbol{x} \sim \text{Multi}(1, [p_1(\boldsymbol{x}), p_2(\boldsymbol{x}), \ldots, p_K(\boldsymbol{x})]^T) \] where \(p_k(\boldsymbol{x}) = \P(Y = 1_k \mid \boldsymbol{x})\) and \(1_k\) is the one-hot encoding of the \(k\)-th category.

Multinomial Distribution

  • The probability mass function of the \(\text{Multi}(m,p)\) is \[ f(x \mid p) = \frac{m!}{x_1!\cdots x_K!}\prod_{k=1}^K p_k^{x_k} = \frac{m!}{x_1!\cdots x_K!}\exp\left(\sum_{k=1}^K x_k\log p_k\right) \] where \(x = [x_1, x_2, \ldots, x_K]^T\), \(\sum_{k=1}^K x_k = m\), and \(\sum_{k=1}^K p_k = 1\).
  • Note that \(p_K = 1 - p_1 - \ldots - p_{K-1}\) and therefore \[\begin{align*} f(x \mid p) & = \frac{m!}{x_1!\cdots x_K!}\exp\left(\sum_{k=1}^{K-1} x_k\log p_k + \left(m - \sum_{k=1}^{K-1} x_k\right)\log(1 - p_1 - \ldots - p_{K-1})\right)\\ & = \frac{m!}{x_1!\cdots x_K!}\exp\left(\sum_{k=1}^{K-1} x_k\log\frac{p_k}{p_K} + m\log(1 - p_1 - \ldots - p_{K-1})\right). \end{align*}\]

Softmax Function

  • The canonical parameter is \(\theta = [\log\frac{p_1}{p_K}, \ldots, \log\frac{p_{K-1}}{p_K}]^T\) and therefore \(p_i = p_K\exp(\theta_i)\).
  • Using the relationship \(p_K = 1 - \sum_{k=1}^{K-1} p_k\), we have \[ p_K = 1 - p_K\sum_{k=1}^{K-1} \exp(\theta_k) \quad \Rightarrow \quad p_K = \frac{1}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}. \]
  • Hence (assume \(m = 1\) for simplicity) \[\begin{align*} \psi(\theta) & = - \log(1 - p_1 - \ldots - p_{K-1}) = - \log(1 - p_Ke^{\theta_1} - \ldots - p_Ke^{\theta_{K-1}})\\ & = - \log\left(1 - \frac{\sum_{k=1}^{K-1} \exp(\theta_k)}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}\right) = \log\left(1 + \sum_{k=1}^{K-1} \exp(\theta_k)\right). \end{align*}\]
  • Taking the derivative, we have the softmax function: \[ \nabla_{\theta}\psi(\theta) = \left[\frac{\exp(\theta_1)}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}, \ldots, \frac{\exp(\theta_{K-1})}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}\right]. \]

Multinomial Regression

  • The multinomial regression model is given by \[\begin{align*} \theta_i & = \boldsymbol{x}^T\boldsymbol{\beta}_i, \\ p_i(\boldsymbol{x}) & = \frac{\exp(\theta_i)}{1 + \sum_{k=1}^{K-1} \exp(\theta_k)}, \quad i = 1, 2, \ldots, K-1, \end{align*}\] where \(\boldsymbol{\beta}_i \in \R^p\).
  • In fact, a more common representation is \[\begin{align*} \tilde{\theta}_i & = \boldsymbol{x}^T\tilde{\boldsymbol{\beta}}_i, \\ p_i(\boldsymbol{x}) & = \frac{\exp(\tilde{\theta}_i)}{\sum_{k=1}^{K} \exp(\tilde{\theta}_k)}, \quad i = 1, 2, \ldots, K. \end{align*}\]
  • The equivalence is due to the transformation \(\theta_i = \tilde{\theta}_i - \tilde{\theta}_K\) and \(\boldsymbol{\beta}_i = \tilde{\boldsymbol{\beta}}_i - \tilde{\boldsymbol{\beta}}_K\). We can also write \[ [p_1(\boldsymbol{x}), \ldots, p_K(\boldsymbol{x})] = \texttt{softmax}(\boldsymbol{x}^T\boldsymbol{\beta}_1, \ldots, \boldsymbol{x}^T\boldsymbol{\beta}_K). \]
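  • A quick NumPy check that the \((K-1)\)-parameter canonical form and the full softmax representation give identical probabilities (the numbers are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # subtracting the max is a standard stability trick
    return e / e.sum()

theta_tilde = np.array([0.2, -1.0, 0.7, 0.0])   # K = 4 unconstrained scores
theta = theta_tilde[:-1] - theta_tilde[-1]      # K - 1 canonical parameters

p_full = softmax(theta_tilde)
p_canonical = np.append(np.exp(theta), 1.0) / (1.0 + np.exp(theta).sum())
assert np.allclose(p_full, p_canonical)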

Quick Summary

  • A GLM is \[ g(\E(Y \mid X = x)) = x^T\beta \Leftrightarrow \E(Y \mid X = x) = g^{-1}(x^T\beta). \]
  • The link function \(g\) connects the conditional expectation and the linear predictor and is chosen based on the distribution of \(Y\).
  • Examples:
    • Logistic regression: \(g(p) = \log\left(\frac{p}{1-p}\right)\), \(g^{-1}(x) = \frac{1}{1+e^{-x}}\).
    • Linear regression: \(g(\mu) = \mu\).
    • Multinomial regression: softmax function.
    • There are other choices and the above are called the canonical link functions.

Beyond Linearity

  • Up to now, we have assumed a linear relationship between the features and the (transformed) conditional expectation: \[ g(\E(Y \mid \boldsymbol{x})) = \boldsymbol{x}^T\boldsymbol{\beta} \]
  • However, this is a strong assumption and may not be appropriate in many cases.
  • To remove this assumption, we can consider \[ g(\E(Y \mid \boldsymbol{x})) = f(\boldsymbol{x}) \] where \(f: \R^p \to \R\) is an unknown function.
  • The problem is now to estimate the function \(f\).
  • Depending on the restrictions on \(f\), we can use different methods to estimate \(f\).

Generalized Additive Models (GAM)

  • An additive model assumes that the unknown function \(f\) is a sum of univariate functions: \[ f(\boldsymbol{x}) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p). \]
  • Therefore, the model is \[ g(\E(Y \mid \boldsymbol{x})) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p). \]
  • The functions \(f_j: \R \to \R\) are unknown and need to be estimated.
  • Nonparametric methods can be used to estimate the functions \(f_j\), for example, kernel smoothing, splines, etc. (a backfitting sketch follows this list).
  • The GAM is a generalization of the linear model that allows for non-linear relationships between the features and the conditional expectation.
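  • A toy backfitting sketch for the identity-link additive model, with a crude running-mean smoother standing in for the kernel or spline smoothers mentioned above (everything here is illustrative, not a production GAM fitter):

import numpy as np

def smooth(x, r, frac=0.2):
    """Nearest-neighbor running-mean smoother of r against x (deliberately crude)."""
    k = max(3, int(frac * len(x)))
    order = np.argsort(x)
    r_sorted = r[order]
    out = np.empty_like(r)
    for i in range(len(x)):
        lo, hi = max(0, i - k // 2), min(len(x), i + k // 2 + 1)
        out[order[i]] = r_sorted[lo:hi].mean()
    return out

def backfit(X, y, n_iter=20):
    """Backfitting for y ~ beta_0 + sum_j f_j(x_j) with the identity link."""
    n, p = X.shape
    beta0 = y.mean()
    F = np.zeros((n, p))                   # column j holds f_j evaluated at the data
    for _ in range(n_iter):
        for j in range(p):
            partial = y - beta0 - F.sum(axis=1) + F[:, j]   # residual without f_j
            F[:, j] = smooth(X[:, j], partial)
            F[:, j] -= F[:, j].mean()      # center each f_j for identifiability
    return beta0, F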

Projection Pursuit Regression (PPR)

  • The projection pursuit regression (PPR) model assumes: \[ f(\boldsymbol{x}) = \sum_{m=1}^M f_m(\boldsymbol{x}^T\boldsymbol{\omega}_m), \] where \(\boldsymbol{\omega}_m \in \R^p\) are unknown unit vectors and \(f_m: \R \to \R\) are unknown functions.
  • The scalar variable \(V_m = \boldsymbol{x}^T\boldsymbol{\omega}_m\) is the projection of \(\boldsymbol{x}\) onto the unit vector \(\boldsymbol{\omega}_m\), and we seek \(\boldsymbol{\omega}_m\) so that the model fits well, hence the name “projection pursuit.”
  • If \(M\) is taken arbitrarily large, for appropriate choice of \(f_m\) the PPR model can approximate any continuous function in \(\R^p\) arbitrarily well, i.e., the PPR is a universal approximator.
  • However, this model is not widely used due to the difficulty in estimating the functions \(f_m\).

MLE for Logistic Regression

The log-likelihood function

In order to find the MLE, we need to simplify the log-likelihood function: \[\begin{align*} \ell(\boldsymbol{\beta}) & = \log \left(\prod_{i=1}^n p(\boldsymbol{x}_i)^{y_i}(1-p(\boldsymbol{x}_i))^{1-y_i}\right) \\ & = \sum_{i=1}^n y_i \log p(\boldsymbol{x}_i) + (1-y_i)\log(1-p(\boldsymbol{x}_i))\\ & = \sum_{i=1}^n \left[y_i \log \left(\frac{p(\boldsymbol{x}_i)}{1-p(\boldsymbol{x}_i)}\right) + \log(1-p(\boldsymbol{x}_i))\right]\\ & = \sum_{i=1}^n \left[y_i\boldsymbol{x}_i^T\boldsymbol{\beta} - \log(1+\exp(\boldsymbol{x}_i^T\boldsymbol{\beta}))\right]. \end{align*}\]

Gradient and Hessian

Now we compute the gradient and the Hessian of the log-likelihood function:

\[\begin{align*} \nabla \ell(\boldsymbol{\beta}) & = \sum_{i=1}^n \left[y_i \boldsymbol{x}_i - \frac{\exp(\boldsymbol{x}_i^T\boldsymbol{\beta})}{1+\exp(\boldsymbol{x}_i^T\boldsymbol{\beta})}\boldsymbol{x}_i\right] = \sum_{i=1}^n (y_i - p(\boldsymbol{x}_i))\boldsymbol{x}_i\\ & = X^T(\boldsymbol{y}-\mathbf{p})\\ \nabla^2 \ell(\boldsymbol{\beta}) & = -\sum_{i=1}^n p(\boldsymbol{x}_i)(1-p(\boldsymbol{x}_i))\boldsymbol{x}_i\boldsymbol{x}_i^T = -X^TWX \end{align*}\] where \[ \mathbf{p} = [p(\boldsymbol{x}_1), \ldots, p(\boldsymbol{x}_n)]^T, \quad W= \diag(\mathbf{p})\diag(1-\mathbf{p}), \quad X = \begin{bmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_n^T \end{bmatrix}. \]

Iteratively Re-Weighted Least Squares (IRWLS)

There is no analytic solution for the MLE of the logistic regression. However, the Newton-Raphson method can be used to find the MLE. The Newton-Raphson method is an iterative method that updates the parameter \(\boldsymbol{\beta}\) as follows:

\[\begin{align*} \boldsymbol{\beta}^{(t+1)} & = \boldsymbol{\beta}^{(t)} - \left[\nabla^2 \ell(\boldsymbol{\beta}^{(t)})\right]^{-1} \nabla \ell\left(\boldsymbol{\beta}^{(t)}\right) \\ & = \boldsymbol{\beta}^{(t)}+\left(X^T W^{(t)}X\right)^{-1} X^T\left(\boldsymbol{y}-\mathbf{p}^{(t)}\right) \\ & = \left(X^T W^{(t)} X\right)^{-1} X^T W^{(t)}\left[X \boldsymbol{\beta}^{(t)}+\left(W^{(t)}\right)^{-1}\left(\boldsymbol{y}-\mathbf{p}^{(t)}\right)\right] \\ & = \left(X^T W^{(t)} X\right)^{-1} X^T W^{(t)} \mathbf{z}^{(t)} \end{align*}\] where \[\begin{align*} \mathbf{z}^{(t)} & =X \boldsymbol{\beta}^{(t)}+\left(W^{(t)}\right)^{-1}\left(\boldsymbol{y}-\mathbf{p}^{(t)}\right), \quad \mathbf{p}^{(t)} = [p^{(t)}(\boldsymbol{x}_1), \ldots, p^{(t)}(\boldsymbol{x}_n)]^T\\ p^{(t)}(\boldsymbol{x}_i) & = \frac{\exp(\boldsymbol{x}_i^T\boldsymbol{\beta}^{(t)})}{1+\exp(\boldsymbol{x}_i^T\boldsymbol{\beta}^{(t)})}, \quad W^{(t)} = \diag(\mathbf{p}^{(t)})\diag(1-\mathbf{p}^{(t)}). \end{align*}\]
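A direct NumPy transcription of the update above, as a sketch (no step-size control or convergence check, and \(W^{(t)}\) is assumed invertible):

import numpy as np

def irwls_logistic(X, y, n_iter=25):
    """Newton-Raphson / IRWLS for logistic regression, following the update above."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # p^(t)
        w = p * (1.0 - p)                              # diagonal of W^(t)
        z = X @ beta + (y - p) / w                     # working response z^(t)
        # weighted least squares step: (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta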