More on Bayesian Neural Networks

Bayes by backprop

In 2015, Blundell et al. proposed Bayes by Backprop, which consists of replacing the ELBO

\[ \mathcal{L}(\nu) = \mathbb{E}_{\theta \sim q_\nu(\theta)} \left[\log p(\mathcal{D}|\theta)\right] - D_{KL}[q_\nu(\theta) || p(\theta)] \]

with a Monte Carlo estimate

\[ \mathcal{L}(\nu) \approx \frac{1}{K} \sum_{k=1}^K \left[ \sum_{i=1}^N \log p(x_i|\theta_k) - \log q_\nu(\theta_k) + \log p(\theta_k) \right] \]

where \(N\) is the number of data samples in the minibatch and \(K\) is the number of times we sample the parameters \(\theta_k \sim q_\nu(\theta)\). This formulation is more general because it does not require a closed-form expression for the KL divergence.
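As an illustration of how this estimate can be computed, here is a minimal PyTorch sketch of a Bayesian linear layer with a factorized Gaussian variational posterior and the corresponding Monte Carlo negative ELBO. The class and function names are illustrative, not from the paper or from any particular library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as dist

class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian variational posterior q_nu over its weights."""
    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))
        self.prior = dist.Normal(0.0, prior_std)

    def forward(self, x):
        # Reparameterization: theta = mu + softplus(rho) * eps, eps ~ N(0, I)
        w_sigma, b_sigma = F.softplus(self.w_rho), F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        # Store log q_nu(theta) and log p(theta) for this particular sample
        self.log_q = (dist.Normal(self.w_mu, w_sigma).log_prob(w).sum()
                      + dist.Normal(self.b_mu, b_sigma).log_prob(b).sum())
        self.log_p = self.prior.log_prob(w).sum() + self.prior.log_prob(b).sum()
        return F.linear(x, w, b)

def negative_elbo(layer, x, y, noise_std=0.1, n_mc_samples=3):
    """Monte Carlo estimate of -L(nu) for a Gaussian likelihood, averaged over K samples."""
    loss = 0.0
    for _ in range(n_mc_samples):
        log_lik = dist.Normal(layer(x), noise_std).log_prob(y).sum()
        loss = loss + (layer.log_q - layer.log_p - log_lik)
    return loss / n_mc_samples
```

In practice the \(\log q_\nu - \log p\) part is typically down-weighted per minibatch (e.g. divided by the number of minibatches), so that the KL contribution is counted only once per epoch, as the original paper discusses under minibatch KL re-weighting.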

Thanks to this flexibility, more complex priors can be used. In the original Bayes by Backprop paper, the following scale mixture of Gaussians is considered

\[ p(\theta) = \pi_1 \mathcal{N}(0, \sigma_1^2) + \pi_2 \mathcal{N}(0, \sigma_2^2) \]

with \(\sigma_1 \ll \sigma_2\) and \(\pi_1 + \pi_2 = 1\). The component with smaller variance allows weights to be automatically "shut down" (pruned), i.e. it encourages sparsification.
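A minimal sketch of how the log-density of this scale mixture prior could be evaluated; the default values of \(\pi_1, \sigma_1, \sigma_2\) below are placeholders, not the values used in the paper:

```python
import math
import torch
import torch.distributions as dist

def scale_mixture_log_prob(theta, pi1=0.5, sigma1=1e-3, sigma2=1.0):
    """Log-density of the two-component scale mixture prior, summed over all parameters."""
    # log( pi1 * N(theta; 0, sigma1^2) + (1 - pi1) * N(theta; 0, sigma2^2) ), computed stably
    log_c1 = dist.Normal(0.0, sigma1).log_prob(theta) + math.log(pi1)
    log_c2 = dist.Normal(0.0, sigma2).log_prob(theta) + math.log(1.0 - pi1)
    return torch.logsumexp(torch.stack([log_c1, log_c2]), dim=0).sum()
```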

See also

BLiTZ is a PyTorch-based library that implements Bayes by Backprop to train BNNs

The local reparametrization trick

In a BNN every weight is sampled as

\[ w_{ji}\sim \mathcal{N}(\mu_{ji}, \sigma_{ji}^2) \]

using the reparameterization trick to reduce variance

\[ w_{ji} = \mu_{ji} + \epsilon_{ji} \cdot \sigma_{ji}, \quad \epsilon_{ji} \sim \mathcal{N}(0, 1) \]

The idea behind the [local reparameterization trick (Kingma, Salimans and Welling, 2015)](http://papers.nips.cc/paper/5666-variational-dropout-and-the-local-reparameterization-trick) is that instead of sampling every weight we sample the pre-activations

\[ Z = WX + B \]

then

\[ z_i = \nu_i + \eta_i \cdot \epsilon_{i} \]

where \(\epsilon_i\) is still a standard normal, \(\nu_i = \sum_j x_j \mu_{ji}\), and \(\eta_i = \sqrt{\sum_j x_j^2 \sigma_{ji}^2}\).

This reduces the number of samples we need to draw by orders of magnitude and further reduces the variance of the gradient estimator.
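A minimal sketch of a linear layer using the local reparameterization trick, assuming a factorized Gaussian posterior over the weights; the class name and the softplus parameterization of \(\sigma\) are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReparamLinear(nn.Module):
    """Linear layer that samples pre-activations instead of individual weights."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w_var = F.softplus(self.w_rho) ** 2
        # nu_i = sum_j x_j mu_ji  and  eta_i = sqrt(sum_j x_j^2 sigma_ji^2)
        z_mean = F.linear(x, self.w_mu, self.b_mu)
        z_std = torch.sqrt(F.linear(x ** 2, w_var) + 1e-8)
        # One epsilon per pre-activation and per example in the batch
        return z_mean + z_std * torch.randn_like(z_mean)
```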

Flipout

Flipout (Wen et al., 2018) decorrelates the gradients within a minibatch, speeding up Bayesian neural networks trained with Gaussian weight perturbations.
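A minimal sketch of the idea for a linear layer: one Gaussian perturbation is shared across the minibatch and decorrelated across examples with random sign flips. The class name and parameterization are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlipoutLinear(nn.Module):
    """Linear layer with flipout: shared Gaussian perturbation, per-example sign flips."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)
        # One shared zero-mean perturbation sample for the whole minibatch
        delta_w = w_sigma * torch.randn_like(w_sigma)
        # Per-example random +-1 vectors decorrelate the perturbations across the batch
        sign_in = torch.bernoulli(torch.full_like(x, 0.5)) * 2 - 1
        sign_out = torch.bernoulli(
            torch.full((x.shape[0], self.w_mu.shape[0]), 0.5, device=x.device)) * 2 - 1
        perturbation = sign_out * F.linear(x * sign_in, delta_w)
        return F.linear(x, self.w_mu, self.b_mu) + perturbation
```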

Dropout as a Bayesian approximation

This is an alternative take on BNNs in which uncertainty is represented through the dropout technique (Gal and Ghahramani, 2015)

Dropout turns off neurons following a certain distribution. The authors argue that this is equivalent to having an ensemble of neural networks, and hence uncertainties can be computed. This is done by applying dropout not only during training but also at prediction time (test set) to estimate uncertainty.
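A minimal sketch of prediction with MC-dropout, assuming `model` is any classifier containing dropout layers and that we average the softmax outputs over stochastic forward passes (function name and number of samples are placeholders):

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=50):
    """Average softmax predictions over stochastic forward passes with dropout kept active."""
    model.eval()
    for m in model.modules():          # re-enable only the dropout layers at test time
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)   # predictive mean and a simple spread measure
```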

This short letter critiques this application of dropout, showing that the uncertainty obtained with this approach (fixed dropout probability) does not decrease as new data points arrive. A solution to this?

Deep ensembles

Another alternative take on BNNs, based on ensembles of deterministic neural networks trained using MAP (Lakshminarayanan, Pritzel and Blundell, 2016).

Predicting with an ensemble of deterministic neural networks returns a collection of predictions, which can then be used as a sort of posterior predictive distribution. The key is how to introduce randomness so that there is diversity in the ensemble.

One way to do this is by using bagging (bootstrap resampling), i.e. training the deterministic NNs with subsamples of the training data drawn with replacement. But this has been shown to be worse than using the full dataset for all the individual classifiers (Nixon, Lakshminarayanan and Tran, 2020).

In the original paper the randomization comes only from

  • The initial values of the parameters of the neural networks (default PyTorch initialization)

  • The shuffling of training data points

One key aspect of this work is that adversarial examples are used to smooth the predictive distributions. They also highlight the use of predicted variances in the regression case, where each network outputs a mean and a variance and is trained with the Gaussian negative log-likelihood. The full algorithm goes as follows

[Figure: the deep ensembles algorithm, ../../_images/deep-ensembles.png]
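For the regression case, a minimal sketch of how the per-network Gaussian predictions can be combined into the moments of an equally weighted mixture. The assumption that each model returns a `(mean, variance)` pair is an illustrative API choice:

```python
import torch

def ensemble_predict(models, x):
    """Combine per-network Gaussian predictions (mu_m, sigma2_m) into mixture moments."""
    means, variances = zip(*[m(x) for m in models])
    means = torch.stack(means)          # shape (M, batch)
    variances = torch.stack(variances)  # shape (M, batch)
    mix_mean = means.mean(dim=0)
    # Variance of an equally weighted Gaussian mixture
    mix_var = (variances + means ** 2).mean(dim=0) - mix_mean ** 2
    return mix_mean, mix_var
```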

The paper compares ensembles with MC-dropout (which can also be interpreted as an ensemble method), showing that ensembles are much better at detecting out-of-distribution samples. Gustafsson et al. (2020) obtain a similar result when comparing ensembles and MC-dropout for computer vision architectures. A more thorough comparison (including SVI and other alternatives) is given in Ovadia et al. (2019).

What are Bayesian Neural Network Posteriors Really Like?

In this work (Izmailov et al. 2021), deep neural networks are trained using Hamiltonian Monte Carlo (HMC). HMC (and MCMC methods in general) guarantees asymptotically exact samples from the true posterior.

The authors recognize that training deep nets with MCMC is computationally expensive to implement in practice compared to SVI. The focus of the paper is on evaluating how good the approximate posteriors and deterministic approximations used in SVI are. They show that

  • BNNs can perform better than regular training and deep ensembles

  • A single long HMC chain provides a posterior comparable to that obtained from several shorter chains

  • Posterior tempering (temperature scaling) is actually not needed; see the expression after this list for what tempering means

  • High-variance Gaussian priors lead to strong performance, and results are robust to the prior scale. Performance using Gaussian, mixture-of-Gaussians (MoG) and logistic priors is not too different. A vague prior in parameter space is not necessarily a vague prior in function space. This result conflicts strongly with (Fortuin et al. 2021)!

  • BNNs have good performance on out-of-distribution samples but perform poorly under domain shift (ensembles are better in this case)

  • The predictive distributions of the compared methods differ from that of HMC. Ensembles seem to be closer to HMC than mean-field VI (MFVI), but in terms of entropies HMC is more overconfident than MFVI
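For reference, posterior tempering means sampling from a sharpened (or flattened) version of the posterior. A standard way to write it, not specific to this paper, is

\[ p_T(\theta|\mathcal{D}) \propto \exp\left(-\frac{U(\theta)}{T}\right), \qquad U(\theta) = -\log p(\mathcal{D}|\theta) - \log p(\theta) \]

where \(T = 1\) recovers the true posterior and \(T < 1\) gives the "cold posteriors" often reported to help in practice.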

Assorted list of interesting discussions