Recent advances in VI¶
For a survey of recent advances in Variational Inference, I highly recommend Zhang et al. 2018. Some of the topics presented in that survey are discussed here.
Recap from previous lectures¶
We are interested in the posterior

\[
p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{p(\mathcal{D})},
\]

which may be intractable. If that is the case we perform approximate inference, either through sampling (MCMC) or through optimization (VI).
In the latter we select a (simple) approximate posterior \(q_\nu(\theta)\) and we optimize the parameters \(\nu\) by maximizing the evidence lower bound (ELBO)

\[
\mathcal{L}(\nu) = \mathbb{E}_{q_\nu(\theta)}\left[\log p(\mathcal{D}|\theta) + \log p(\theta) - \log q_\nu(\theta)\right] \leq \log p(\mathcal{D}),
\]

which makes \(q_\nu(\theta)\) close to \(p(\theta|\mathcal{D})\) in the Kullback-Leibler sense.
Note
There is a trade-off between how flexible/expressive the approximate posterior is and how easy it is to evaluate and optimize this bound.
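To make the recap concrete, here is a minimal Monte Carlo sketch of the ELBO (not from the lecture) for an assumed toy model: a standard Gaussian prior on \(\theta\), a unit-variance Gaussian likelihood, and a Gaussian \(q_\nu(\theta)\) parameterized by its mean and log standard deviation.

```python
import numpy as np
from scipy.stats import norm

# Toy model (assumed for illustration): theta ~ N(0, 1), D_i | theta ~ N(theta, 1)
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=20)

def elbo_estimate(m, log_s, n_samples=1000):
    """Monte Carlo estimate of the ELBO for q_nu(theta) = N(m, exp(log_s)^2)."""
    s = np.exp(log_s)
    theta = rng.normal(m, s, size=n_samples)                       # theta ~ q_nu
    log_prior = norm.logpdf(theta, 0.0, 1.0)                       # log p(theta)
    log_lik = norm.logpdf(data[:, None], theta, 1.0).sum(axis=0)   # log p(D|theta)
    log_q = norm.logpdf(theta, m, s)                               # log q_nu(theta)
    return np.mean(log_lik + log_prior - log_q)

print(elbo_estimate(m=0.9, log_s=np.log(0.2)))
```

In practice \(\nu = (m, \log s)\) would be adjusted by (stochastic) gradient ascent on this estimate rather than evaluated at a fixed point.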
In what follows we review different ways to improve VI
More flexible approximate posteriors for VI¶
Normalizing flows¶
One way to obtain a more flexible and still tractable posterior is to start with a simple distribution and apply a sequence of invertible transformations. This is the key idea behind normalizing flows.
Let’s say that \(z\sim q(z)\) where \(q\) is simple, e.g. a standard Gaussian, and that there is a smooth and invertible transformation \(f\) such that \(f^{-1}(f(z)) = z\)

Then \(z' = f(z)\) is a random variable too, and its distribution is

\[
q(z') = q(z) \left|\det \frac{\partial f}{\partial z}\right|^{-1},
\]

which is the original density times the inverse of the absolute determinant of the Jacobian of the transformation
And we can apply a chain of transformations \(f_1, f_2, \ldots, f_K\), obtaining \(z_K = f_K \circ \cdots \circ f_1(z_0)\) with

\[
\log q_K(z_K) = \log q_0(z_0) - \sum_{k=1}^{K} \log \left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|
\]
With this we can go from a simple Gaussian to more expressive/complex/multi-modal distributions
Nowadays several types of flows exist in the literature, e.g. planar, radial and autoregressive flows; a minimal sketch of a planar flow step is shown below. As an example, see this work in which normalizing flows were used to make the approximate posterior of a VAE more expressive.
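As an illustration of the change-of-variables formula, here is a sketch of one planar flow step, \(f(z) = z + u\,\tanh(w^\top z + b)\), applied to a sample from a standard Gaussian base density. The parameters \(u\), \(w\), \(b\) are arbitrary values chosen for the example, not a trained flow.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """One planar flow step f(z) = z + u * tanh(w.z + b) and its log|det Jacobian|."""
    a = np.tanh(w @ z + b)
    z_new = z + u * a
    psi = (1.0 - a**2) * w                       # h'(w.z + b) * w
    log_det = np.log(np.abs(1.0 + u @ psi))      # log|det df/dz|
    return z_new, log_det

rng = np.random.default_rng(0)
d = 2
z = rng.standard_normal(d)                       # z ~ N(0, I), the simple base density
log_q0 = -0.5 * (z @ z) - 0.5 * d * np.log(2 * np.pi)

# Arbitrary illustrative parameters (w.u >= -1 keeps the step invertible)
u, w, b = np.array([0.5, -0.3]), np.array([1.0, 1.0]), 0.1

z_new, log_det = planar_flow(z, u, w, b)
log_q1 = log_q0 - log_det                        # log q1(z') = log q0(z) - log|det|
print(z_new, log_q1)
```

Stacking several such steps (each with its own parameters) and accumulating the log-determinants gives the chained formula above.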
Key references:
Dinh, Krueger and Bengio, “NICE: Non-linear Independent Components Estimation”
Rezende and Mohamed, “Variational Inference with Normalizing Flows”
Kingma and Dhariwal, “Glow: Generative Flow with Invertible 1x1 Convolutions”
Adding more structure¶
Another way to improve the variational approximation is to include auxiliary variables. For example in Auxiliary Deep Generative Models the VAE was extended by introducing a variable \(a\) that does not modify the generative process but makes the approximate posterior more expressive
In this case the graphical model of the approximate posterior is \(q(a, z |x) = q(z|a,x)q(a|x)\), so that the marginal \(q(z|x)\) can fit more complicated posteriors. The graphical model of the generative process is \(p(a,x,z) = p(a|x,z)p(x,z)\), i.e. under marginalization of \(a\), \(p(x,z)\) is recovered
The ELBO in this case is

\[
\mathcal{L} = \mathbb{E}_{q(a,z|x)}\left[\log \frac{p(a|x,z)\,p(x,z)}{q(z|a,x)\,q(a|x)}\right]
\]
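As an illustration of how this bound is estimated, the sketch below draws a single sample from \(q(a|x)\) and then \(q(z|a,x)\) and evaluates the log-ratio above. All the Gaussian densities here are placeholders chosen for the example, not the parameterization of the ADGM paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.5  # a single observation

# Single-sample estimate of the auxiliary-variable ELBO.
# Every density below is an illustrative Gaussian, NOT the ADGM networks.
a = rng.normal(0.5 * x, 1.0)             # a ~ q(a|x)    = N(0.5 x, 1)
z = rng.normal(0.3 * (a + x), 1.0)       # z ~ q(z|a,x)  = N(0.3 (a + x), 1)

log_q = (norm.logpdf(a, 0.5 * x, 1.0)            # log q(a|x)
         + norm.logpdf(z, 0.3 * (a + x), 1.0))   # + log q(z|a,x)
log_p = (norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)   # log p(x, z)
         + norm.logpdf(a, x + z, 1.0))                       # + log p(a|x,z)

elbo_estimate = log_p - log_q
print(elbo_estimate)
```

Averaging this quantity over many samples of \((a, z)\) gives an unbiased estimate of the bound.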
Tighter bounds for the KL divergence¶
Importance weighting¶
This is an idea based on importance sampling. A tighter lower bound on the evidence can be obtained by sampling several \(z\) for a given \(x\). This was explored for autoencoders in Importance Weighted Autoencoders
Let’s say we sample independently \(K\) times from the approximate posterior; this yields progressively tighter lower bounds for the evidence:

\[
\log p_\theta(x) \geq \mathcal{L}_K = \mathbb{E}_{z_1, \ldots, z_K \sim q_\phi(z|x)}\left[\log \frac{1}{K}\sum_{k=1}^K w_k\right],
\]

where \(w_k = \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)}\) are called the importance weights. Note that for \(K=1\) we recover the VAE bound.
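A numerically stable way to estimate \(\log \frac{1}{K}\sum_k w_k\) is via the log-sum-exp trick. The sketch below assumes simple illustrative Gaussian densities for \(p_\theta(x,z)\) and \(q_\phi(z|x)\); it is not the IWAE model itself.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
x, K = 1.5, 50

# Illustrative densities: p(x, z) = N(z; 0, 1) N(x; z, 1), q(z|x) = N(z; x/2, 1)
z = rng.normal(x / 2, 1.0, size=K)                           # z_1..z_K ~ q(z|x)
log_w = (norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)         # log p(x, z_k)
         - norm.logpdf(z, x / 2, 1.0))                       # - log q(z_k|x)

iwae_bound = logsumexp(log_w) - np.log(K)                    # log (1/K sum_k w_k)
print(iwae_bound)
```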
This tighter bound has been shown to be equivalent to using the regular bound with a more complex posterior. Recent discussion can be found in Debiasing Evidence Approximations: On Importance-weighted Autoencoders and Jackknife Variational Inference and Tighter Variational Bounds are Not Necessarily Better.
Other divergence measures¶
\(\alpha\) divergence¶
The KL divergence is computationally convenient, but there are other options to measure how far apart two distributions are. For example, the family of \(\alpha\) divergences (Rényi’s formulation) is defined as

\[
D_\alpha(p\|q) = \frac{1}{\alpha-1} \log \int p(x)^\alpha q(x)^{1-\alpha} \,dx,
\]

where \(\alpha\) represents a trade-off between the mass-covering and zero-forcing effects. The KL corresponds to the special case \(\alpha \to 1\)
Note
The \(\alpha\) divergence has been explored for VI recently and is implemented in numpyro
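As a sanity check (an illustrative sketch, not tied to any particular library), a Monte Carlo estimate of the Rényi \(\alpha\) divergence between two arbitrary Gaussians should approach their analytical KL as \(\alpha \to 1\):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
p, q = norm(0.0, 1.0), norm(1.0, 1.5)   # two illustrative Gaussians

def renyi_divergence(alpha, n_samples=200_000):
    """MC estimate of D_alpha(p || q) = 1/(alpha-1) log E_q[(p/q)^alpha]."""
    x = q.rvs(size=n_samples, random_state=rng)
    log_ratio = p.logpdf(x) - q.logpdf(x)
    return (logsumexp(alpha * log_ratio) - np.log(n_samples)) / (alpha - 1.0)

# Closed-form KL(p || q) for Gaussians, to compare against
kl_exact = np.log(1.5 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 1.5**2) - 0.5
print(renyi_divergence(0.99), kl_exact)   # should be close for alpha near 1
```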
f-divergence¶
The \(\alpha\) divergence is a particular case of the f-divergence

\[
D_f(p\|q) = \mathbb{E}_{q(x)}\left[f\!\left(\frac{p(x)}{q(x)}\right)\right],
\]

where \(f\) is a convex function with \(f(1) = 0\). The KL is recovered for \(f(z) = z \log(z)\)
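The same Monte Carlo recipe gives a generic (sketch) estimator of the f-divergence between the two illustrative Gaussians used above; plugging in \(f(z) = z\log z\) recovers the KL, while the choice \(f(z) = \tfrac{1}{2}|z-1|\) gives the total variation distance.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, q = norm(0.0, 1.0), norm(1.0, 1.5)   # illustrative Gaussians

def f_divergence(f, n_samples=200_000):
    """MC estimate of D_f(p || q) = E_q[f(p(x)/q(x))]."""
    x = q.rvs(size=n_samples, random_state=rng)
    ratio = np.exp(p.logpdf(x) - q.logpdf(x))
    return np.mean(f(ratio))

print(f_divergence(lambda z: z * np.log(z)))        # recovers KL(p || q)
print(f_divergence(lambda z: 0.5 * np.abs(z - 1)))  # total variation distance
```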
In general \(f\) should be defined such that the resulting bound does not depend on the marginal likelihood. Wang, Liu and Liu, 2018 proposed the tail-adaptive f-divergence.
Stein variational gradient descent (SVGD)¶
Another, totally different, approach is based on the Stein operator

\[
\mathcal{A}_p \phi(x) = \phi(x) \nabla_x \log p(x)^T + \nabla_x \phi(x),
\]

where \(p(x)\) is a distribution and \(\phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_d(x)]\) is a smooth vector function

Under mild conditions the following, known as the Stein identity, holds

\[
\mathbb{E}_{x\sim p}\left[\mathcal{A}_p \phi(x)\right] = 0
\]
Now, for another distribution \(q(x)\) with the same support as \(p\), we can write

\[
\mathbb{E}_{x\sim q}\left[\mathcal{A}_p \phi(x)\right] = \mathbb{E}_{x\sim q}\left[\phi(x)\left(\nabla_x \log p(x) - \nabla_x \log q(x)\right)^T\right],
\]

from which the Stein discrepancy between the two distributions is defined as

\[
\mathbb{S}(q, p) = \max_{\phi \in \mathcal{F}} \left(\mathbb{E}_{x\sim q}\left[\text{trace}\left(\mathcal{A}_p \phi(x)\right)\right]\right)^2
\]

For this to be useful in practice, the family \(\mathcal{F}\) needs to be broad enough
This is where kernels can be used. By taking \(\mathcal{F}\) to be the unit ball of a reproducing kernel Hilbert space, it can be shown that the optimization defining the Stein discrepancy is solved by

\[
\phi^*(\cdot) \propto \mathbb{E}_{x\sim q}\left[\kappa(x, \cdot)\nabla_x \log p(x) + \nabla_x \kappa(x, \cdot)\right],
\]

where \(\kappa\) is a kernel function, e.g. RBF or rational quadratic. From this, one can update a set of particles with gradient-descent-like steps in the direction of \(\phi^*\), as sketched below.
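Here is a minimal sketch of SVGD with an RBF kernel (fixed bandwidth, no median heuristic) for an assumed 2D Gaussian target; it is meant only to illustrate the update above, not as a reference implementation.

```python
import numpy as np

def rbf_kernel(X, h=1.0):
    """RBF kernel K[i, j] = exp(-||x_i - x_j||^2 / (2 h^2)) and its gradients."""
    diff = X[:, None, :] - X[None, :, :]                 # (n, n, d): x_i - x_j
    sq_dist = np.sum(diff**2, axis=-1)
    K = np.exp(-sq_dist / (2 * h**2))
    grad_K = K[:, :, None] * diff / h**2                 # grad_{x_j} K[i, j]
    return K, grad_K

def svgd_step(X, score, eps=0.1, h=1.0):
    """One SVGD update: x_i += eps/n * sum_j [K[j, i] score(x_j) + grad_{x_j} K[i, j]]."""
    n = X.shape[0]
    K, grad_K = rbf_kernel(X, h)
    phi = (K.T @ score(X) + grad_K.sum(axis=1)) / n      # attraction + repulsion terms
    return X + eps * phi

# Illustrative target: p(x) = N(mu, I), whose score is mu - x
mu = np.array([2.0, -1.0])
score = lambda X: mu - X

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                        # initial particles
for _ in range(500):
    X = svgd_step(X, score)
print(X.mean(axis=0))                                    # should approach mu
```

The first term in \(\phi^*\) pulls particles towards high-density regions of \(p\), while the kernel-gradient term keeps them spread out, which is what distinguishes SVGD from running independent gradient ascent on \(\log p\).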
See also
For a more in-depth treatment see this list of papers related to SVGD
Operator VI¶
Ranganath et al. 2016 proposed to replace the KL divergence objective of VI with objectives built from operators, such as the Langevin-Stein operator

\[
(\mathcal{O}^{p}\,f)(z) = \nabla_z \log p(x, z)^T f(z) + \nabla_z^T f(z),
\]

which, like the Stein operator above, has zero expectation under the exact posterior \(p(z|x)\)