import numpy as np
import matplotlib.pyplot as plt
# Simulate n = 100 points from y = sin(2*pi*x) + x + eps, eps ~ N(0, 0.5^2)
rng = np.random.default_rng(100)
n = 100
X = rng.uniform(low=0, high=3, size=n)
y = np.sin(2 * np.pi * X) + X + rng.normal(loc=0, scale=0.5, size=n)
plt.scatter(X, y, c="black", s=20)
plt.show()
\[ \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \renewcommand{\P}{\mathbb{P}} \newcommand{\var}{{\rm Var}} % Variance \newcommand{\mse}{{\rm MSE}} % MSE \newcommand{\bias}{{\rm Bias}} % Bias \newcommand{\cov}{{\rm Cov}} % Covariance \newcommand{\iid}{\stackrel{\rm iid}{\sim}} \newcommand{\ind}{\stackrel{\rm ind}{\sim}} \renewcommand{\choose}[2]{\binom{#1}{#2}} % Choose \newcommand{\chooses}[2]{{}_{#1}C_{#2}} % Small choose \newcommand{\cd}{\stackrel{d}{\rightarrow}} \newcommand{\cas}{\stackrel{a.s.}{\rightarrow}} \newcommand{\cp}{\stackrel{p}{\rightarrow}} \newcommand{\bin}{{\rm Bin}} \newcommand{\ber}{{\rm Ber}} \newcommand{\diag}{{\rm diag}} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \]
A general procedure for supervised learning:
There are three common ways to choose the feature map \(\phi\):
Generate a dataset from the model \(y = \sin(2\pi x) + x + \epsilon\), where \(\epsilon \sim N(0, 0.5^2)\).
Compare four different models:
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
# Ridge regression
mod1 = KernelRidge(alpha = 1, kernel='linear').fit(X.reshape(-1, 1), y)
# Kernel ridge regression with RBF kernel and different gamma (bandwidth)
mod2 = KernelRidge(alpha = 1, kernel='rbf', gamma=1).fit(X.reshape(-1, 1), y)
mod3 = KernelRidge(alpha = 1, kernel='rbf', gamma=100).fit(X.reshape(-1, 1), y)
# Use cross-validation to select the hyperparameters
mod4 = GridSearchCV(
    KernelRidge(kernel="rbf", gamma=0.1),
    param_grid={"alpha": [1, 0.1, 0.01, 0.001], "gamma": np.logspace(-2, 2, 5)},
).fit(X.reshape(-1, 1), y)
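To see how the four fits differ, one can predict on a fine grid and overlay the fitted curves on the data. The sketch below assumes the objects X, y, and mod1 through mod4 from above; the grid x_grid and the curve labels are illustrative choices, not part of the original code.
# Sketch: overlay the four fitted curves on the training data
x_grid = np.linspace(0, 3, 300).reshape(-1, 1)
fits = {"linear kernel": mod1, "RBF, gamma=1": mod2,
        "RBF, gamma=100": mod3, "RBF, tuned by CV": mod4}
plt.scatter(X, y, c="black", s=20)
for label, mod in fits.items():
    plt.plot(x_grid, mod.predict(x_grid), label=label)
plt.legend()
plt.show()
# Hyperparameters selected by cross-validation
print(mod4.best_params_)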
The relationship between the input \(\boldsymbol{x}\) and the hidden unit \(\boldsymbol{h}\) is \[ \boldsymbol{h} = \sigma\left(\boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}\right), \] where \[ \boldsymbol{W} = \begin{bmatrix} \boldsymbol{w}_1^T \\ \boldsymbol{w}_2^T \\ \vdots \\ \boldsymbol{w}_m^T \end{bmatrix} \in \R^{m \times p}, \quad \boldsymbol{b} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} \in \R^m. \]
The function \(\sigma: \R \to \R\) is called an activation function and is applied element-wise to the vector \(\boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}\).
The map \(\boldsymbol{x} \mapsto \sigma\left(\boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}\right)\) can be viewed as a feature map parametrized by \(\boldsymbol{W}\) and \(\boldsymbol{b}\).
That is, we replace the linear predictor \(\beta_0 + \boldsymbol{x}^T \boldsymbol{\beta}\) with \(\beta_0 + \boldsymbol{\beta}^T \boldsymbol{h} = \beta_0 + \boldsymbol{\beta}^T \sigma\left(\boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}\right)\).
Finally, we can link the predictor \(\beta_0 + \boldsymbol{\beta}^T \sigma\left(\boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}\right)\) to the output \(y\) (or more specifically \(\E(Y \mid \boldsymbol{x})\)) using a link function, e.g., identity, logit, or softmax.
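As a concrete illustration, the following minimal NumPy sketch computes the hidden units \(\boldsymbol{h} = \sigma(\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b})\) and the predictor \(\beta_0 + \boldsymbol{\beta}^T\boldsymbol{h}\) for a single input; the dimensions (p = 2, m = 3) and all parameter values are arbitrary placeholders.
# Sketch: one hidden layer as a parametrized feature map (p = 2, m = 3 arbitrary)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng_nn = np.random.default_rng(0)
p, m = 2, 3
W = rng_nn.normal(size=(m, p))        # W in R^{m x p}
b = rng_nn.normal(size=m)             # b in R^m
beta0, beta = 0.5, rng_nn.normal(size=m)

x = np.array([1.0, -2.0])             # a single input in R^p
h = sigmoid(W @ x + b)                # hidden units: the learned features
f = beta0 + beta @ h                  # linear predictor built on the features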
Write \[\begin{align*} f_{\theta}(\boldsymbol{x}) & = \boldsymbol{W}^{(2)}\sigma\left(\boldsymbol{W}^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)}\right) + \boldsymbol{b}^{(2)} = \boldsymbol{W}^{(2)}\boldsymbol{h}^{(1)} + \boldsymbol{b}^{(2)}\\ \boldsymbol{h}^{(1)} & = \sigma\left(\boldsymbol{W}^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)}\right), \end{align*}\] and we have \[\begin{align*} \nabla_{\boldsymbol{b}^{(2)}} f_{\theta}(\boldsymbol{x}) & = 1\\ \nabla_{\boldsymbol{W}^{(2)}} f_{\theta}(\boldsymbol{x}) & = \boldsymbol{h}^{(1)} = \sigma\left(\boldsymbol{W}^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)}\right) \in \R^m\\ \nabla_{\boldsymbol{b}^{(1)}} f_{\theta}(\boldsymbol{x}) & = \left[\nabla_{\boldsymbol{b}^{(1)}} \boldsymbol{h}^{(1)}\right] \left(\boldsymbol{W}^{(2)}\right)^T \in \R^m\\ \nabla_{\boldsymbol{W}^{(1)}} f_{\theta}(\boldsymbol{x}) & = \boldsymbol{W}^{(2)}\nabla_{\boldsymbol{W}^{(1)}} \boldsymbol{h}^{(1)} = \sum_{k=1}^m W_k^{(2)} \nabla_{\boldsymbol{W}^{(1)}} h_k^{(1)}\in \R^{m \times p}\\ \nabla_{\boldsymbol{b}^{(1)}} \boldsymbol{h}^{(1)} & = \text{diag}\left(\sigma^{\prime}(\boldsymbol{W}^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)})\right) \in \R^{m \times m}\\ \nabla_{\boldsymbol{W}^{(1)}} h_k^{(1)} & \in \R^{m \times p}. \end{align*}\]
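The gradient formulas above can be verified numerically. The sketch below compares the analytic gradient of \(f_{\theta}(\boldsymbol{x})\) with respect to \(\boldsymbol{b}^{(1)}\), namely \(\text{diag}\left(\sigma^{\prime}(\boldsymbol{W}^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)})\right)(\boldsymbol{W}^{(2)})^T\), with a central finite-difference approximation; the network sizes and parameter values are arbitrary.
# Sketch: finite-difference check of the gradient w.r.t. b^(1) (arbitrary sizes/values)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng_check = np.random.default_rng(0)
p, m = 2, 3
W1, b1 = rng_check.normal(size=(m, p)), rng_check.normal(size=m)
W2, b2 = rng_check.normal(size=m), rng_check.normal()   # scalar output: W2 plays the role of W^(2)
x = rng_check.normal(size=p)

def f(b1_):
    return W2 @ sigmoid(W1 @ x + b1_) + b2

# Analytic gradient: diag(sigma'(W1 x + b1)) W2^T
s = sigmoid(W1 @ x + b1)
grad_analytic = s * (1 - s) * W2

# Central finite differences
eps = 1e-6
grad_fd = np.array([(f(b1 + eps * np.eye(m)[k]) - f(b1 - eps * np.eye(m)[k])) / (2 * eps)
                    for k in range(m)])
print(np.allclose(grad_analytic, grad_fd))   # should print True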
A graphical representation of the computation:
Consider the same dataset as in the previous example. We want to build a regression model using neural networks.
Set up an MLP with one hidden layer of 10 units and the logistic (sigmoid) activation function.
import keras
from keras import layers
model = keras.Sequential(
    [
        layers.Dense(units = 10, activation = "sigmoid",
                     input_shape = (1,), name = "hidden_layer"),
        layers.Dense(units = 1, activation = None,
                     name = "output_layer")
    ],
    name="Shallow_MLP"
)
model.summary()
Model: "Shallow_MLP"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ hidden_layer (Dense) │ (None, 10) │ 20 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ output_layer (Dense) │ (None, 1) │ 11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 31 (124.00 B)
Trainable params: 31 (124.00 B)
Non-trainable params: 0 (0.00 B)
Use the Adam optimizer with a learning rate of 0.3 and train the model for 1000 epochs (iterations).
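A sketch of the corresponding compile-and-fit step is below. The optimizer, learning rate, and number of epochs are as stated above; the mean squared error loss and verbose=0 are assumptions, since the text does not specify them.
# Sketch: compile and train (loss choice is an assumption)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.3), loss="mse")
history = model.fit(X.reshape(-1, 1), y, epochs=1000, verbose=0)
print(history.history["loss"][-1])    # training loss after the final epoch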
Show that for any \(\boldsymbol{Z} \in \R^{n \times d}\), \(\boldsymbol{y} \in \R^n\), and \(\alpha > 0\), we have \[ (\boldsymbol{Z}^T \boldsymbol{Z} + \alpha I_d)^{-1} \boldsymbol{Z}^T \boldsymbol{y} = \boldsymbol{Z}^T (\boldsymbol{Z} \boldsymbol{Z}^T + \alpha I_n)^{-1} \boldsymbol{y}. \]
Proof: Observe that \[ \boldsymbol{Z}^T\left(\boldsymbol{Z}\boldsymbol{Z}^T + \alpha I_n\right) = \boldsymbol{Z}^T\boldsymbol{Z}\boldsymbol{Z}^T + \alpha \boldsymbol{Z}^T = \left(\boldsymbol{Z}^T\boldsymbol{Z} + \alpha I_d\right)\boldsymbol{Z}^T. \] Since \(\alpha > 0\), both \(\boldsymbol{Z}^T\boldsymbol{Z} + \alpha I_d\) and \(\boldsymbol{Z}\boldsymbol{Z}^T + \alpha I_n\) are positive definite and hence invertible. Multiplying the displayed equation on the left by \((\boldsymbol{Z}^T\boldsymbol{Z} + \alpha I_d)^{-1}\) and on the right by \((\boldsymbol{Z}\boldsymbol{Z}^T + \alpha I_n)^{-1}\) gives \[ \left(\boldsymbol{Z}^T\boldsymbol{Z} + \alpha I_d\right)^{-1}\boldsymbol{Z}^T = \boldsymbol{Z}^T\left(\boldsymbol{Z}\boldsymbol{Z}^T + \alpha I_n\right)^{-1}, \] and applying both sides to \(\boldsymbol{y}\) yields the claimed identity. \(\square\)
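This identity (sometimes called the push-through identity) can also be checked numerically. The sketch below uses random inputs with arbitrary sizes; the names n_, d_, and yv are chosen to avoid clobbering the n and y defined earlier.
# Sketch: numerical check of the identity with random inputs (sizes arbitrary)
import numpy as np

rng_id = np.random.default_rng(1)
n_, d_, alpha = 6, 3, 0.5
Z = rng_id.normal(size=(n_, d_))
yv = rng_id.normal(size=n_)
lhs = np.linalg.solve(Z.T @ Z + alpha * np.eye(d_), Z.T @ yv)
rhs = Z.T @ np.linalg.solve(Z @ Z.T + alpha * np.eye(n_), yv)
print(np.allclose(lhs, rhs))   # should print True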
Consider the univariate RBF kernel \(k(x, y) = \exp\left(-(x - y)^2\right)\). We have \[\begin{align*} k(x, y) & = \exp\left(-(x - y)^2\right) = \exp\left(-x^2 + 2xy - y^2\right) = \exp(-x^2)\exp(2xy)\exp(-y^2)\\ & = \exp(-x^2)\left(\sum_{m=0}^{\infty} \frac{2^mx^my^m}{m!}\right)\exp(-y^2)\\ & = \exp(-x^2) \left(1, \sqrt{\frac{2^1}{1!}}x, \sqrt{\frac{2^2}{2!}}x^2, \sqrt{\frac{2^3}{3!}}x^3, \ldots \right)\\ & \qquad \left(1, \sqrt{\frac{2^1}{1!}}y, \sqrt{\frac{2^2}{2!}}y^2, \sqrt{\frac{2^3}{3!}}y^3, \ldots \right)^T\exp(-y^2). \end{align*}\]
Hence the feature map corresponding to the RBF kernel is \[ \phi(x) = \exp(-x^2)\left(1, \sqrt{\frac{2^1}{1!}}x, \sqrt{\frac{2^2}{2!}}x^2, \sqrt{\frac{2^3}{3!}}x^3, \ldots \right)^T. \]
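The expansion can be sanity-checked by truncating \(\phi\) after a moderate number of terms and comparing the inner product with the exact kernel value; the truncation order and the test points below are arbitrary.
# Sketch: truncated feature map vs. the exact RBF kernel value (truncation order arbitrary)
import numpy as np
from math import factorial

def phi(x, order=20):
    # phi(x) = exp(-x^2) * (sqrt(2^m / m!) x^m, m = 0, 1, 2, ...), truncated
    return np.exp(-x**2) * np.array([np.sqrt(2.0**m / factorial(m)) * x**m
                                     for m in range(order)])

x0, y0 = 0.7, 1.3
print(np.exp(-(x0 - y0) ** 2))   # exact kernel value
print(phi(x0) @ phi(y0))         # truncated phi(x)^T phi(y), agrees to high precision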