The main advantage of generative models is that they can be very flexible.
However, this flexibility comes at a cost: inference is computationally expensive.
In contrast, descriptive models are usually simpler and computationally efficient, but require strong assumptions.
Most generative models face at least one of the following issues:
Given a matrix \(\boldsymbol{W} \in \mathbb{R}^{p \times q}\) with orthonormal columns, we can obtain
There are many variants of PCA/PPCA:
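As a concrete point of reference, the NumPy sketch below illustrates plain PCA: it obtains a matrix \(\boldsymbol{W} \in \mathbb{R}^{p \times q}\) with orthonormal columns as the top-\(q\) principal directions of a centered data matrix via the SVD, then projects data to the latent space and reconstructs it. This is only an illustrative sketch; the function and variable names are made up for this example.

```python
import numpy as np

def pca_fit(X, q):
    """Top-q principal directions of X (n x p) as a p x q matrix W
    with orthonormal columns (W^T W = I_q), plus the data mean."""
    mu = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return Vt[:q].T, mu

def pca_project(X, W, mu):
    """Latent representation z = W^T (x - mu) for each row of X."""
    return (X - mu) @ W

def pca_reconstruct(Z, W, mu):
    """Map latent codes back to data space: x_hat = W z + mu."""
    return Z @ W.T + mu

# Usage on synthetic data with p = 5 features and q = 2 latent dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W, mu = pca_fit(X, q=2)
X_hat = pca_reconstruct(pca_project(X, W, mu), W, mu)
```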
As in other variational methods, the actual optimization objective of the variational autoencoder is the evidence lower bound (ELBO), which is derived as follows:
\[ \begin{aligned} \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) & =\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right] \\ & =\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})}\right]\right] \\ & =\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})} \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}{p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})}\right]\right] \\ & =\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\right]\right]}_{=\mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\boldsymbol{x})\; (\text{ELBO})} +\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \left[\frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}{p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})}\right]\right]}_{=D_{\text{KL}}\left(q_{\boldsymbol{\phi}} \| p_{\boldsymbol{\theta}}\right)} \end{aligned} \]
Due to the non-negativity of the KL divergence, the ELBO is a lower bound on the log-likelihood of the data \[ \begin{aligned} \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\boldsymbol{x}) & =\log p_{\boldsymbol{\theta}}(\boldsymbol{x})-D_{\text{KL}}\left(q_{\boldsymbol{\phi}} \| p_{\boldsymbol{\theta}}\right) \\ & \leq \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) \end{aligned} \]
Hence, maximizing the ELBO \(\mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\boldsymbol{x})\) w.r.t. the parameters \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\) concurrently optimizes two things:

1. it approximately maximizes the marginal log-likelihood \(\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\), i.e. the generative model improves;
2. it minimizes the KL divergence \(D_{\text{KL}}\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\right)\) between the approximate and the true posterior, i.e. the inference model improves.
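To make the objective concrete, the sketch below computes a single-sample Monte Carlo estimate of the ELBO for a VAE with a diagonal-Gaussian encoder \(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})\), a standard normal prior, and a Bernoulli decoder (a common choice for binarized MNIST). This is only an illustrative PyTorch sketch under those assumptions; the layer sizes and all names are arbitrary.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)      # mean of q_phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # log-variance of q_phi(z|x)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim),               # Bernoulli logits of p_theta(x|z)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
        # so gradients can flow through the sampling step.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo(x, logits, mu, logvar):
    # E_q[log p_theta(x|z)], estimated with the single sample drawn in forward().
    rec = -nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="none").sum(dim=1)
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian encoder.
    kl = 0.5 * (torch.exp(logvar) + mu**2 - 1.0 - logvar).sum(dim=1)
    return (rec - kl).mean()
```

Training then amounts to gradient ascent on this quantity (in practice, gradient descent on its negative) jointly in \(\boldsymbol{\theta}\) (the decoder) and \(\boldsymbol{\phi}\) (the encoder).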
Generated samples from a VAE trained on the MNIST dataset:
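Such samples are obtained by drawing latent codes from the prior and decoding them; continuing the illustrative sketch above:

```python
model = VAE()  # in practice, a model whose parameters have been trained on MNIST
with torch.no_grad():
    z = torch.randn(64, 20)                # z ~ N(0, I), the prior over latents
    samples = torch.sigmoid(model.dec(z))  # Bernoulli means: 64 images of 784 pixels each
```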