import holoviews as hv
hv.extension('bokeh')
import numpy as np
import scipy.stats
Statistical Inference¶
- Inference: drawing conclusions from facts/evidence through reasoning and scientific premises
- In statistical inference:
  - Our facts/evidence come from the data (observations)
  - Our scientific premises and assumptions are encapsulated in the model
- The main tasks in statistical inference can be organized hierarchically as
  - Level 1: Fit a model to the data
  - Level 2: Validate and compare models
  - Level 3: Answer questions with our model (hypothesis testing)
We will review the first two levels in what follows
Level 1: Fitting the parameters of a model¶
We have a model \(\mathcal{M}_j\) with parameters \(\theta\)
We want to estimate the \(\theta~\) that best fits the data \(\mathcal{D}\)
We start by writing Bayes theorem
\[
p(\theta | \mathcal{D}, \mathcal{M}_j) = \frac{p(\mathcal{D}| \theta, \mathcal{M}_j)\, p(\theta| \mathcal{M}_j)}{p(\mathcal{D}| \mathcal{M}_j)}
\]
Let’s name these terms for the future: \(p(\theta | \mathcal{D}, \mathcal{M}_j)\) is the posterior, \(p(\mathcal{D}| \theta, \mathcal{M}_j)\) is the likelihood, \(p(\theta| \mathcal{M}_j)\) is the prior and \(p(\mathcal{D}| \mathcal{M}_j)\) is the evidence (marginal likelihood)
In the Bayesian approach we want to find the posterior of \(\theta~\), but first let’s review the frequentist solution
Maximum likelihood Estimation¶
Consider the following simplifications
- The prior on \(\theta~\) is uniform (uninformative)
- We only care about the most likely value of \(\theta~\) (point estimate)
Applying these to Bayes theorem we get
\[
\hat \theta_{\text{MLE}} = \arg\max_\theta \, p(\mathcal{D}| \theta, \mathcal{M}_j),
\]
which is known as the Maximum Likelihood Estimator (MLE) of \(\theta~\)
Important
Likelihood is not the same as probability
If \(\theta~\) is fixed, \(p(\mathcal{D}| \theta, \mathcal{M}_j)\) defines a probability distribution over \(\mathcal{D}\)
If \(\mathcal{D}\) is fixed, \(p(\mathcal{D}| \theta, \mathcal{M}_j)\) defines the likelihood of \(\theta\)
See also
I suggest reading Section 2.3 of D. MacKay’s book on the difference between probability and likelihood (inverse probability)
Example: MLE for a coin
Let the following be observations from a “coin toss” process
\[
\mathcal{D} = \{x_1, x_2, \ldots, x_N\},
\]
where \(x_i \in \{0: \text{tail}, 1: \text{head}\}\)
Assumption 1: Observations are independent and identically distributed (iid)
Assumption 2: Bernoulli distribution model with parameter \(\theta \in [0, 1]\) for the observations
This is a distribution for binary outcomes \(x\in \{0, 1\}\), with probability mass function
\[
p(x|\theta) = \theta^{x} (1-\theta)^{1-x}
\]
Explore how the distribution changes with \(\theta\):
hmap = hv.HoloMap(kdims='Parameter')
x = [0, 1]
for theta in np.linspace(0, 1, num=8):
    dist = scipy.stats.bernoulli(p=theta)
    hmap[theta] = hv.Bars((x, [dist.pmf(k) for k in x]), 'x', 'PMF', label='Bernoulli distribution,')
hmap.opts(hv.opts.Bars(width=450, height=300, tools=['hover']))
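To make the distinction between probability and likelihood concrete, here is a small numerical sketch (the three observations below are hypothetical): with \(\theta\) fixed the Bernoulli pmf sums to one over \(x\), while with the data fixed the same expression, seen as a function of \(\theta\), is the likelihood and need not sum to one.

import numpy as np
import scipy.stats

data = np.array([1, 0, 1])  # hypothetical coin tosses

# Fixed theta: p(x|theta) is a probability distribution over x (sums to 1)
theta = 0.6
print(sum(scipy.stats.bernoulli(p=theta).pmf(x) for x in [0, 1]))  # 1.0

# Fixed data: p(D|theta) as a function of theta is the likelihood (need not sum to 1)
thetas = np.linspace(0, 1, num=11)
likelihood = np.array([np.prod(scipy.stats.bernoulli(p=t).pmf(data)) for t in thetas])
print(likelihood)  # largest near theta = 2/3, the sample mean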
What is the MLE of \(\theta~\)?
Note
The argument that maximizes \(p(x)\) is the same one that maximizes \(\log p(x)\), because the logarithm is monotonically increasing
Noting this, and using the iid assumption, we can write the log likelihood as
\[
\log p(\mathcal{D}|\theta) = \sum_{i=1}^N \log p(x_i|\theta) = \sum_{i=1}^N \left[ x_i \log \theta + (1-x_i) \log(1-\theta) \right]
\]
From here we can take the derivative with respect to \(\theta\), set it to zero, and get the MLE of \(\theta\)
\[
\hat \theta_{\text{MLE}} = \frac{1}{N} \sum_{i=1}^N x_i,
\]
i.e. the proportion of heads in the data
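A quick numerical check of this result (a sketch; the simulated coin below assumes a true parameter of 0.7, as in the interactive example further down):

import numpy as np
import scipy.stats

# Simulate coin tosses from a Bernoulli with an assumed "true" parameter of 0.7
coins = scipy.stats.bernoulli.rvs(p=0.7, size=1000, random_state=1234)

# The MLE of theta is the sample mean (proportion of heads)
theta_mle = np.mean(coins)
print(theta_mle)  # close to 0.7

# Sanity check: the sample mean also maximizes the log likelihood on a grid
thetas = np.linspace(0.01, 0.99, num=99)
log_lik = np.sum(coins) * np.log(thetas) + (len(coins) - np.sum(coins)) * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])  # approximately equal to theta_mle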
See also
For more details, properties and examples of MLE see Barber’s Chapter 8 and this Statistics course (in Spanish)
Priors and Maximum a Posteriori¶
Let’s lift the assumption that the prior is uniform
We are still looking for a point estimate of \(\theta~\)
We keep the iid assumption and we consider the “log trick”
From Bayes theorem, dropping the terms that do not depend on \(\theta~\), we can write
\[
\hat \theta_{\text{MAP}} = \arg\max_\theta \, p(\theta| \mathcal{D}, \mathcal{M}_j) = \arg\max_\theta \left[ \log p(\mathcal{D}| \theta, \mathcal{M}_j) + \log p(\theta| \mathcal{M}_j) \right],
\]
which is called the Maximum a Posteriori (MAP) estimate of \(\theta~\)
Note
The MAP estimate corresponds to the mode of \(p(\theta|\mathcal{D}, \mathcal{M}_j)\) (assuming it is unimodal)
Now, in addition to the model (likelihood), we have to set the prior \(p(\theta)\). Note that this choice can be a sensitive one
Example: MAP for the coin
We will use a Beta distribution prior for the Bernoulli parameter \(\theta~\)
\[
p(\theta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)},
\]
where \(B(\alpha,\beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}\) and \(\Gamma(x)\) is the Gamma function
This is a distribution for \(x \in [0, 1]\), e.g. probabilities. For \(\alpha=\beta=1\) we get the Uniform distribution in \([0, 1]\).
Explore how the Beta distribution changes with \(\alpha\) and \(\beta\)
x = np.linspace(0, 1, num=100)
hmap = hv.HoloMap(kdims=['alpha', 'beta'])
for a in [0.5, 1, 2, 3, 10]:
    for b in [0.5, 1, 2, 3, 10]:
        dist = scipy.stats.beta(a, b)
        hmap[(a,b)] = hv.Curve((x, dist.pdf(x)), 'x', 'PDF', label='Beta distribution,')
hmap.opts(hv.opts.Curve(width=450, height=300, ylim=(0, 5), tools=['hover']))
Let’s replace the chosen distributions in the MAP, omitting the terms that do not depend on \(\theta~\)
\[
\hat \theta_{\text{MAP}} = \arg\max_\theta \left[ \sum_{i=1}^N x_i \log \theta + \left(N - \sum_{i=1}^N x_i\right) \log(1-\theta) + (\alpha-1)\log\theta + (\beta-1)\log(1-\theta) \right]
\]
If we set the derivative to zero we obtain
\[
\hat \theta_{\text{MAP}} = \frac{\sum_{i=1}^N x_i + \alpha - 1}{N + \alpha + \beta - 2}
\]
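A short sketch comparing MAP and MLE on a small sample (the prior hyperparameters below are hypothetical, chosen only for illustration):

import numpy as np
import scipy.stats

coins = scipy.stats.bernoulli.rvs(p=0.7, size=10, random_state=1234)  # small sample
N, heads = len(coins), np.sum(coins)

theta_mle = heads / N
# MAP with a Beta(alpha, beta) prior, using the closed-form expression above
alpha, beta = 2, 2  # hypothetical prior: mild preference for a fair coin
theta_map = (heads + alpha - 1) / (N + alpha + beta - 2)

print(theta_mle, theta_map)  # the MAP estimate is pulled towards 0.5 by the prior
# With alpha = beta = 1 (uniform prior) the MAP reduces to the MLE
print((heads + 1 - 1) / (N + 1 + 1 - 2) == theta_mle)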
Note
This solution reduces to the MLE for \(\alpha=\beta=1\) (uniform)
Important
If we know something about the coin before observing the data, we add it through \(\alpha\) and \(\beta\)
Bayesian inference¶
With MLE and MAP we obtain point estimates of \(\theta~\)
How good are these estimates? Can we trust them? What is their uncertainty?
In the frequentist approach we answer these questions through confidence intervals and/or the bootstrap
In a “full” Bayesian approach we select a likelihood and a prior and aim for the posterior of \(\theta~\), i.e. \(p(\theta|\mathcal{D}, \mathcal{M}_j)\)
If we have the posterior we know everything about \(\theta~\). But, how do we get the posterior?
Simple case: Analytical posterior
In some “very special cases” the posterior is analytically tractable. Let’s go back to our example
Example: Posterior for the coin
The likelihood of the coin (iid Bernoulli) is
\[
p(\mathcal{D}|\theta) = \prod_{i=1}^N \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i} (1-\theta)^{N - \sum_i x_i}
\]
The prior is Beta
\[
p(\theta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)}
\]
Then the posterior is
\[
p(\theta|\mathcal{D}) = \frac{1}{Z}\, \theta^{\sum_i x_i + \alpha - 1} (1-\theta)^{N - \sum_i x_i + \beta - 1},
\]
where \(Z\) is a normalizing constant
We can recognize that the posterior is also Beta:
\[
p(\theta|\mathcal{D}) = \text{Beta}(\theta; \hat \alpha, \hat \beta),
\]
with parameters \(\hat \alpha= \alpha +\sum_i x_i\) and \(\hat \beta= \beta +N-\sum_i x_i\)
Note
In this case we say that the Beta prior is conjugate to the Bernoulli likelihood: it produces a posterior that is also Beta
See also
The following article shows a table of conjugate pairs
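A quick numerical verification of the conjugacy (a sketch; the data and Beta(2, 2) prior below are hypothetical): we normalize likelihood times prior on a grid and compare it with the analytical Beta posterior.

import numpy as np
import scipy.stats

coins = scipy.stats.bernoulli.rvs(p=0.7, size=50, random_state=1234)
N, heads = len(coins), np.sum(coins)
alpha, beta = 2, 2  # hypothetical prior

# Numerical posterior: normalize likelihood x prior on a grid
theta = np.linspace(0.001, 0.999, num=999)
unnorm = theta**heads * (1 - theta)**(N - heads) * scipy.stats.beta(alpha, beta).pdf(theta)
numeric_posterior = unnorm / (np.sum(unnorm) * (theta[1] - theta[0]))

# Analytical posterior given by conjugacy: Beta(alpha + heads, beta + N - heads)
analytic_posterior = scipy.stats.beta(alpha + heads, beta + N - heads).pdf(theta)

print(np.max(np.abs(numeric_posterior - analytic_posterior)))  # small: discretization error only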
Interactive posterior: Move the sliders to change \(\alpha\), \(\beta\) and the number of coins observed
How many coins do we need to observe for the posterior to concentrate around the true value \(p=0.7\)?
What is the influence of \(\alpha\) and \(\beta\) when \(N\) is low? And when \(N\) is high?
coins = scipy.stats.bernoulli.rvs(p=0.7, size=1000, random_state=1234) # True parameter is 0.7
x = np.linspace(0, 1, num=100)
hmap1 = hv.HoloMap(kdims=['N'])
hmap2 = hv.HoloMap(kdims=['N', 'alpha', 'beta'])
for N in [1, 2, 5, 10, 20, 50, 100, 200, 500]:
    heads = np.sum(coins[:N])
    hmap1[N] = hv.Bars((['head','tail'], [heads, N-heads]), 'x', 'Histogram', label='Coins')
    for a in [0.5, 2, 10]:
        for b in [0.5, 2, 10]:
            prior = scipy.stats.beta(a, b)
            posterior = scipy.stats.beta(a + heads, b + N - heads)
            hprior = hv.Curve((x, prior.pdf(x)), 'x', 'PDF', group='Distribution', label='prior')
            hposterior = hv.Curve((x, posterior.pdf(x)), 'x', 'PDF', group='Distribution', label='posterior')
            hmap2[(N, a, b)] = hprior*hposterior
#hv.output(widget_location='bottom')
(hmap1+hmap2).opts(hv.opts.Bars(width=150, height=300),
                   hv.opts.Curve(width=300, height=300, ylim=(0, 5), tools=['hover']))
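Once we have the analytical posterior we can also summarize it directly, for example with its mean and a credible interval (a sketch, using the same simulation setup as above and a hypothetical Beta(2, 2) prior):

import numpy as np
import scipy.stats

coins = scipy.stats.bernoulli.rvs(p=0.7, size=500, random_state=1234)  # true parameter 0.7
alpha, beta = 2, 2  # hypothetical prior

for N in [10, 100, 500]:
    heads = np.sum(coins[:N])
    posterior = scipy.stats.beta(alpha + heads, beta + N - heads)
    low, high = posterior.interval(0.95)  # central 95% credible interval
    print(f"N={N}: posterior mean = {posterior.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")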
Note that in the Bayesian approach online problems are handled naturally: the posterior is simply updated as each new observation arrives, with the current posterior acting as the prior for the next observation
The following shows how the posterior is updated one coin at a time
x = np.linspace(0, 1, num=100)
hmap1 = hv.HoloMap(kdims=['Iteration'])
hmap2 = hv.HoloMap(kdims=['Iteration'])
a = b = 1 # prior belief
for k, coin in enumerate(coins[:100]):
    hmap1[k] = hv.Bars((['head','tail'], [a-1, b-1]), 'x', 'Histogram', label='Coins')
    hmap2[k] = hv.Curve((x, scipy.stats.beta(a, b).pdf(x)), 'x', 'PDF', label='Posterior')
    # posterior updates
    a += coin
    b += 1 - coin
(hmap1+hmap2).opts(hv.opts.Bars(width=150, height=300),
                   hv.opts.Curve(width=400, height=300, ylim=(0, 5)))
#hv.output((hmap1+hmap2), holomap='gif', fps=5)
What if I can’t get an analytical posterior?¶
In many cases the denominator of the posterior, the evidence
\[
p(\mathcal{D}|\mathcal{M}_j) = \int p(\mathcal{D}| \theta, \mathcal{M}_j)\, p(\theta| \mathcal{M}_j)\, d\theta,
\]
is intractable
The main options to approximate the posterior are
- Approximation based: Variational Inference (VI)
- Sampling based: Markov Chain Monte Carlo (MCMC)
We will review how to implement both in future lessons; a minimal sampling sketch is shown below
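As a preview of the sampling-based option, here is a minimal Metropolis-Hastings sketch for the coin posterior (an illustration only, with a hypothetical Beta(2, 2) prior and a hand-tuned proposal width; it does not replace the implementations we will see in future lessons). Note that it only evaluates the unnormalized posterior, so the intractable evidence is never needed.

import numpy as np
import scipy.stats

# Hypothetical setup: 100 simulated tosses (true parameter 0.7), Beta(2, 2) prior
coins = scipy.stats.bernoulli.rvs(p=0.7, size=100, random_state=1234)
N, heads = len(coins), np.sum(coins)
alpha, beta = 2, 2

def log_unnormalized_posterior(theta):
    # log likelihood + log prior; the evidence (a constant) is never computed
    if theta <= 0 or theta >= 1:
        return -np.inf
    log_lik = heads * np.log(theta) + (N - heads) * np.log(1 - theta)
    return log_lik + scipy.stats.beta(alpha, beta).logpdf(theta)

rng = np.random.default_rng(0)
theta, samples = 0.5, []  # start from a fair coin
for _ in range(5000):
    proposal = theta + rng.normal(scale=0.05)  # symmetric random-walk proposal
    log_ratio = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:  # Metropolis acceptance rule
        theta = proposal
    samples.append(theta)

samples = np.array(samples[1000:])  # discard burn-in
# Compare with the analytical posterior mean (alpha + heads) / (alpha + beta + N)
print(samples.mean(), (alpha + heads) / (alpha + beta + N))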
Level 2 Inference: Comparing models¶
Using Bayes theorem we can express the posterior probability of a given model as
\[
p(\mathcal{M}_j | \mathcal{D}) = \frac{p(\mathcal{D}|\mathcal{M}_j)\, p(\mathcal{M}_j)}{p(\mathcal{D})},
\]
where \(p(\mathcal{D} | \mathcal{M}_j)\) is called the evidence or the marginal likelihood of \(\mathcal{M}_j\)
The evidence was ignored in level 1 inference (it was a normalizing constant), but at this level it is key!
If we want to compare two models we can compute the ratio between their posteriors
\[
\frac{p(\mathcal{M}_i | \mathcal{D})}{p(\mathcal{M}_j | \mathcal{D})} = \frac{p(\mathcal{D}|\mathcal{M}_i)}{p(\mathcal{D}|\mathcal{M}_j)} \cdot \frac{p(\mathcal{M}_i)}{p(\mathcal{M}_j)}
\]
For example, we may ignore the priors if we consider the models to be equally probable, or we can use the priors to favor simpler models and implement Occam’s Razor (principle of parsimony)
If there are many models and we only care about the most probable one, we can write
\[
\hat{\mathcal{M}} = \arg\max_{j} \, p(\mathcal{M}_j | \mathcal{D})
\]
Using Bayes theorem and assuming that all models are equally probable (uniform prior), this reduces to
\[
\hat{\mathcal{M}} = \arg\max_{j} \, p(\mathcal{D} | \mathcal{M}_j)
\]
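A small sketch of this level of inference for the coin (hypothetical setup): in the Beta-Bernoulli case the evidence has the closed form \(p(\mathcal{D}|\mathcal{M}) = B(\alpha + h, \beta + N - h)/B(\alpha, \beta)\), with \(h\) the number of heads, so we can compare a “fair coin” model (\(\theta\) fixed at 0.5) against a “biased coin” model with a uniform Beta(1, 1) prior through their log evidences.

import numpy as np
import scipy.stats
from scipy.special import betaln

coins = scipy.stats.bernoulli.rvs(p=0.7, size=500, random_state=1234)  # biased coin
N, heads = len(coins), np.sum(coins)

# Model 1: fair coin, theta fixed at 0.5 (no free parameters)
log_evidence_fair = N * np.log(0.5)

# Model 2: biased coin, Beta(1, 1) prior on theta; closed-form Beta-Bernoulli evidence
alpha, beta = 1, 1
log_evidence_biased = betaln(alpha + heads, beta + N - heads) - betaln(alpha, beta)

# Log ratio of evidences (log Bayes factor), assuming equal model priors;
# a positive value favors the biased-coin model
print(log_evidence_biased - log_evidence_fair)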
See also
For more on model comparison see Chapter 28 of D. MacKay’s book