Statistical Inference¶

Inference: Drawing conclusions from facts/evidence through reasoning and scientific premises

In statistical inference

Our facts/evidence comes from the data (observations)
Our scientific premises and assumptions are encapsulated in the model

The main tasks in statistical inference can be hierarchically sorted as

Level 1: Fit a model to the data
Level 2: Validate and compare between models
Level 3: Answer questions with our model: Hypothesis testing

We will review the first two levels in what follows

Level 1: Fitting the parameters of a model¶

We have a model \(\mathcal{M}_j\) with parameters \(\theta\)

We want to estimate \(\theta~\) that best fit the data \(\mathcal{D}\)

We start by writing Bayes Theorem

\[ p(\theta|\mathcal{D}, \mathcal{M}_j) = \frac{p(\mathcal{D}| \theta, \mathcal{M}_j) p(\theta|\mathcal{M}_j)}{p(\mathcal{D}|\mathcal{M}_j)} \]

Let’s name these terms for the future

\[ \text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{evidence}} \]

In the bayesian approach we want to find the posterior of \(\theta~\), but first let’s review the frequentist solution

Maximum likelihood Estimation¶

Consider the following simplifications

The prior on \(\theta~\) is uniform (uninformative)
We only care for most likely value of \(\theta~\) (point estimate)

Applying these to Bayes theorem we get

\[\begin{split} \begin{align} \hat \theta &= \text{arg} \max_\theta p(\theta|\mathcal{D}, \mathcal{M}_j) \nonumber \\ &= \text{arg} \max_\theta p(\mathcal{D}| \theta, \mathcal{M}_j) \nonumber \end{align} \end{split}\]

which is known as the Maximum likelihood estimator (MLE) of \(\theta~\)

Important

Likelihood is not the same as probability

If \(\theta~\) is fixed \(p(\mathcal{D}| \theta, \mathcal{M}_j)\) defines a probability over \(\mathcal{D}\)
If \(\mathcal{D}\) is fixed \(p(\mathcal{D}| \theta, \mathcal{M}_j)\) defines the likelihood of \(\theta\)

See also

I suggest reading D. Mackays’ book Section 2.3 on the difference between probability and likelihood (inverse probability)

Example: MLE for a coin

https://c.tenor.com/bd3puNXKLwUAAAAd/coin-toss.gif

Let the following be observations from a “coin toss” process

\[ \mathcal{D} = [x_1, x_2, \ldots, x_N] \]

where \(x_i \in \{0: \text{tail}, 1: \text{head}\}\)

Assumption 1: Observations are independent and identically distributed (iid)

\[ p(\mathcal{D}|\theta, \mathcal{M}_j) = \prod_{i=1}^N p(x_i|\theta, \mathcal{M}_j) \]

Assumption 2: Bernoulli distribution model with parameter \(\theta \in [0, 1]\) for the observations

\[ p(x_i|\theta, \mathcal{M}_j) = \theta^{x_i} (1- \theta)^{1-x_i} \]

This is a distribution for binary outcomes \(x\in \{0, 1\}\)

Explore how the distribution change with \(\theta\):

hmap = hv.HoloMap(kdims='Parameter')
x = [0, 1]
for theta in np.linspace(0, 1, num=8):
    dist = scipy.stats.bernoulli(p=theta)
    hmap[theta] = hv.Bars((x, [dist.pmf(k) for k in x]), 'x', 'PMF', label='Bernoulli distribution,')

hmap.opts(hv.opts.Bars(width=450, height=300, tools=['hover']))

What is the MLE of \(\theta~\)?

Note

The arg maximum of \(p(x)\) is the same as \(\log p(x)\)

Noting this we can write

\[\begin{split} \begin{align} \hat \theta &= \text{arg} \max_\theta p(\theta|\mathcal{D}, \mathcal{M}_j) \nonumber \\ &= \text{arg} \max_\theta \log p(\theta|\mathcal{D}, \mathcal{M}_j) \nonumber \\ &= \text{arg} \max_\theta \log p(\mathcal{D}| \theta, \mathcal{M}_j) \nonumber \\ &= \text{arg} \max_\theta \sum_{i=1}^N \log p(x_i| \theta, \mathcal{M}_j) \nonumber \\ &= \text{arg} \max_\theta \sum_{i=1}^N x_i \log (\theta) + (1 -x_i) \log(1-\theta) \nonumber \end{align} \end{split}\]

From here we can take the derivate, set it to zero, and get the MLE of \(\theta\)

\[ \hat \theta = \frac{1}{N} \sum_{i=1}^N x_i \]

See also

For more detailes, properties and examples on MLE see Barber’s chapter 8 and this course Statistics (in spanish)

Priors and Maximum a Posteriori¶

Let’s lift the assumption that the prior is uniform

We are still looking for a point estimate of \(\theta~\)
We keep the iid assumption and we consider the “log trick”

From Bayes theorem we can write

\[\begin{split} \begin{align} \hat \theta &= \text{arg} \max_\theta \log p(\theta|\mathcal{D}, \mathcal{M}_j) p(\theta|\mathcal{M}_j) \nonumber \\ &= \text{arg} \max_\theta \sum_{i=1}^N \log p(x_i| \theta, \mathcal{M}_j) + \log p(\theta|\mathcal{M}_j) \nonumber \end{align} \end{split}\]

which is called the Maximum a posteriori (MAP) estimate of \(\theta~ \)

Note

The MAP estimate corresponds to the mode of \(p(\theta|\mathcal{D}, \mathcal{M}_j)\) (assuming it is unimodal)

Now, in addition to the model (likelihood), we have to set the prior \(p(\theta)\). Note that this can be a sensible choice

Example: MAP for the coin

We will use a Beta distribution prior for the Bernoulli parameter \(\theta ~\)

\[ p(\theta|\mathcal{M}_j) = \text{Beta}(\theta| \alpha, \beta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)} \]

where \(B(x,y) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}\) and \(\Gamma(x)\) is the Gamma function

This is a distribution for \(x \in [0, 1]\), e.g probabilities. For \(\alpha=\beta=1\) we get the Uniform distribution in \([0, 1]\).

Explore how the beta distribution change with \(\alpha\) and \(\beta\)

x = np.linspace(0, 1, num=100)
hmap = hv.HoloMap(kdims=['alpha', 'beta'])
for a in [0.5, 1, 2, 3, 10]:
    for b in [0.5, 1, 2, 3, 10]:
        dist = scipy.stats.beta(a, b)
        hmap[(a,b)] = hv.Curve((x, dist.pdf(x)), 'x', 'PDF', label='Beta distribution,')

hmap.opts(hv.opts.Curve(width=450, height=300, ylim=(0, 5), tools=['hover']))

Let’s replace the chosen distributions in the MAP omitting the terms that do not depend on \(\theta~\)

\[ \hat \theta= \text{arg} \max_\theta \sum_{i=1}^N x_i \log (\theta) + (1 -x_i) \log(1-\theta) +(\alpha -1) \log(\theta) + ( \beta -1) \log(1-\theta) \]

if we set the derivative to zero we obtain

\[ \hat \theta = \frac{1}{\alpha + \beta - 2 + N} \left(\alpha -1 + \sum_{i=1}^N x_i\right) \]

Note

This solution reduces to the MLE for \(\alpha=\beta=1\) (uniform)

Important

If we know something about the coin before observing the data, we add it through \(\alpha\) and \(\beta\)

Bayesian inference¶

With MLE we get point estimates

How good are these estimates? Can we trust them? What is their uncertainty?

In the frequentist approach we answer these questions through confidence intervals and/or bootstrap

In a “full” bayesian approach we select likelihood/prior and aim for the posterior of \(\theta~\),

\[ p(\theta|\mathcal{D}, \mathcal{M}_j) = \frac{p(\mathcal{D}| \theta, \mathcal{M}_j) p(\theta|\mathcal{M}_j)}{p(\mathcal{D}|\mathcal{M}_j)} \]

If we have the posterior we know everything about \(\theta~\). But, how do we get the posterior?

Simple case: Analytical posterior

In some “very special cases” the posterior is analytically tractable. Let’s go back to our example

Example: Posterior for the coin

The likelihood of the coin (Bernoulli) is

\[\begin{split} \begin{align} p(\mathcal{D}|\theta, \mathcal{M}_j) &= \prod_{i=1}^N p(x_i|\theta, \mathcal{M}_j) \nonumber \\ &= \prod_{i=1}^N \theta^{x_i} (1-\theta)^{1-x_i} \nonumber \\ &= \theta^{\sum_i x_i}(1-\theta)^{N-\sum_i x_i} \nonumber \end{align} \end{split}\]

The prior is Beta

\[ p(\theta|\mathcal{M}_j) = \text{Beta}(\theta| \alpha , \beta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)} \]

Then the posterior is

\[ p(\theta|\mathcal{D}, \mathcal{M}_j) = \frac{1}{Z} \theta^{\alpha +\sum_i x_i - 1}(1-\theta)^{\beta +N-\sum_i x_i-1}, \]

where \(Z\) is a normalizing constant

We can recognize that the posterior is also Beta:

\[ p(\theta|\mathcal{D}, \mathcal{M}_j) = \text{Beta}(\theta| \hat \alpha , \hat \beta), \]

with parameters \(\hat \alpha= \alpha +\sum_i x_i\) and \(\hat \beta= \beta +N-\sum_i x_i\)

Note

In this case we say that Beta is conjugate to the Bernoulli distribution: It produces a Beta posterior

See also

The following article shows a table of conjugate pairs

Interactive posterior: Move the sliders to change \(\alpha\), \(\beta\) and the number of coins observed

How many coins do we need to observe so that the posterior gets close to \(p=.7\)?
What is the influence of \(\alpha\) and \(\beta\) when \(N\) is low? And when \(N\) is high?

coins = scipy.stats.bernoulli.rvs(p=0.7, size=1000, random_state=1234) # True parameter is 0.7
x = np.linspace(0, 1, num=100)
hmap1 = hv.HoloMap(kdims=['N'])
hmap2 = hv.HoloMap(kdims=['N', 'alpha', 'beta'])
for N in [1, 2, 5, 10, 20, 50, 100, 200, 500]:
    heads = np.sum(coins[:N]) 
    hmap1[N] = hv.Bars((['head','tail'], [heads, N-heads]), 'x', 'Histogram', label='Coins')
    for a in [0.5, 2, 10]:
        for b in [0.5, 2, 10]:
            prior = scipy.stats.beta(a, b)
            posterior = scipy.stats.beta(a + heads, b + N - heads)
            hprior = hv.Curve((x, prior.pdf(x)), 'x', 'PDF', group='Distribution', label='prior')
            hposterior = hv.Curve((x, posterior.pdf(x)), 'x', 'PDF', group='Distribution', label='posterior')
            hmap2[(N, a, b)] = hprior*hposterior

#hv.output(widget_location='bottom')
(hmap1+hmap2).opts(hv.opts.Bars(width=150, height=300),
                   hv.opts.Curve(width=300, height=300, ylim=(0, 5), tools=['hover']))

Note that in the Bayesian approach online problems are just updates to the posterior

The following shows how the posterior is updated one coin at a time

x = np.linspace(0, 1, num=100)
hmap1 = hv.HoloMap(kdims=['Iteration'])
hmap2 = hv.HoloMap(kdims=['Iteration'])

a = b = 1 # prior belief
for k, coin in enumerate(coins[:100]):
    hmap1[k] = hv.Bars((['head','tail'], [a-1, b-1]), 'x', 'Histogram', label='Coins')
    hmap2[k] = hv.Curve((x, scipy.stats.beta(a, b).pdf(x)), 'x', 'PDF', label='Posterior')
    # posterior updates
    a += coin
    b += 1 - coin
    
(hmap1+hmap2).opts(hv.opts.Bars(width=150, height=300),
                   hv.opts.Curve(width=400, height=300, ylim=(0, 5)))
#hv.output((hmap1+hmap2), holomap='gif', fps=5)

What if I can’t get an analytical posterior?¶

In many cases the denominator in the posterior

\[ p(\mathcal{D} | \mathcal{M_j}) = \int p(\mathcal{D} | \theta, \mathcal{M_j}) p(\theta|\mathcal{M_j}) d \theta \]

is intractable

The options are to

Approximation based: Variational Inference (VI)
Sampling based: Markov Chain Monte Carlo (MCMC)

We will review how to implement both using in future lessons

Level 2 Inference: Comparing models¶

Using Bayes theorem we can express the posterior probability of a given model as

\[ p(\mathcal{M_j} | \mathcal{D}) = \frac{p(\mathcal{D} | \mathcal{M_j}) p(\mathcal{M_j})}{p(\mathcal{D})} \]

where \(p(\mathcal{D} | \mathcal{M_j})\) is called the evidence or the marginal likelihood for \(\mathcal{M_j}\)

The evidence was ignored in level 1 inference (normalizing constant) but in this level is key!

If we want to compare two models we can compute the ratio between posteriors

\[ \frac{p(\mathcal{M_j} | \mathcal{D})}{p(\mathcal{M_k} | \mathcal{D})} = \frac{p(\mathcal{D} | \mathcal{M_j}) p(\mathcal{M_j})}{p(\mathcal{D} | \mathcal{M_k}) p(\mathcal{M_k})} \]

For example we may ignore the priors if we consider the models to be equally probable or we can use the priors to favor simpler models and implement Occam’s Razor (principle of parsinomy)

If there are many models and we care only for the most probable we can write

\[ \mathcal{M}^* = \text{arg} \max_j p(\mathcal{M_j} | \mathcal{D}) \]

Using Bayes Theorem and assuming that all models are equally probable (uniform prior)

\[\begin{split} \begin{align} \mathcal{M}^* &= \text{arg} \max_j p(\mathcal{D} | \mathcal{M_j}) \nonumber \\ &= \text{arg} \max_j \int p(\mathcal{D} | \theta, \mathcal{M_j}) p(\theta|\mathcal{M_j}) d \theta \end{align} \end{split}\]

See also

For more on model comparison see Chapter 28 of D. Mackay’s book

Bayesian Learning and Neural Networks

Statistical Inference

Contents

Statistical Inference¶

Level 1: Fitting the parameters of a model¶

Maximum likelihood Estimation¶

Priors and Maximum a Posteriori¶

Bayesian inference¶

What if I can’t get an analytical posterior?¶

Level 2 Inference: Comparing models¶