import holoviews as hv
hv.extension('bokeh')
import numpy as np
import scipy.stats
Probability Theory
Disclaimer: This chapter is not comprehensive. It serves only as a summary of the fundamental concepts from probability theory and statistical inference needed in this course.
Basic concepts
- Random Variable (RV)
A variable whose value is the outcome of a random phenomenon
Examples of RV:
the result of flipping a coin or rolling a die
tomorrow’s weather
the number of minutes I will spend in traffic on my way to work
We don’t know the value of an RV until we sample from it. Sampling is equivalent to observing the RV.
Notation: RVs are denoted by capital letters while observations are denoted by lowercase letters, and
\[x \sim X\]
means that \(x\) was sampled from \(X\)
Important
We describe an RV by its domain and its probability density (or mass) function
Example: Fair six-sided die
Domain (possible outcomes): \(\{1, 2, 3, 4, 5, 6\}\)
Probability mass function: \(\left[\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}\right]\)
The probability of drawing a \(1\) is \(P(X=1) = P(1) = \frac{1}{6}\)
The probability of drawing a number greater than or equal to \(5\) is \(P(X\geq 5) = \frac{1}{3}\)
The probability of drawing an odd number is \(P(\text{odd}) = \frac{1}{2}\)
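As a quick check, these probabilities can be computed with scipy.stats.randint, which implements a discrete uniform distribution (the variable name die below is just illustrative):
die = scipy.stats.randint(low=1, high=7)  # fair six-sided die: uniform on {1, ..., 6}
print(die.pmf(1))                  # P(X=1) = 1/6
print(1.0 - die.cdf(4))            # P(X>=5) = 1 - P(X<=4) = 1/3
print(die.pmf([1, 3, 5]).sum())    # P(odd) = 1/2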
Joint, Marginal and Conditional probabilities
If we have two or more random variables we can define their joint pmf: \(P(X,Y)\)
From the joint we sum (integrate) to obtain the marginal distribution of \(X\) or \(Y\)
Law of total probability (sum rule):
\[P(X=x) = \sum_y P(X=x, Y=y)\]
If we sum (marginalize) over \(X\) we obtain the marginal for \(Y\)
\[P(Y=y) = \sum_x P(X=x, Y=y) = \sum_x P(Y=y|X=x)P(X=x),\]
where \(P(Y=y|X=x) = \frac{P(X=x, Y=y)}{P(X=x)}\) is the conditional probability of \(y\) given \(x\) (defined only when \(P(X=x) \neq 0\))
Example: The following code builds and plots the joint PMF of a two-dimensional discrete RV
If we sum along either axis we obtain the marginals
x = np.arange(-4, 5, step=1)
y = np.arange(-4, 5, step=1)
# Build a joint PMF on the grid: uniform mass on an H-shaped support, then normalize
X, Y = np.meshgrid(x, y)
XY = np.zeros_like(X, dtype=np.float64)
XY[2:-2, -3] = 1; XY[2:-2, 2] = 1; XY[4, 2:-2] = 1
XY = XY/np.sum(XY)
# Wrap in a holoviews Dataset to plot the joint together with its marginals
data = hv.Dataset((x, y, XY), kdims=['x', 'y'], vdims='xy')
joint = data.to(hv.Image).opts(cmap='Blues', width=300, height=300)
marg_y = joint.reduce(x=np.sum).to(hv.Bars)  # sum over x -> marginal P(y)
marg_x = joint.reduce(y=np.sum).to(hv.Bars)  # sum over y -> marginal P(x)
# Adjoint layout: marginal P(y) on the right, marginal P(x) on top
(joint << marg_y.opts(width=150) << marg_x.opts(height=150))
If we take a horizontal slice of the joint (fixing \(y\)) we obtain, up to normalization by \(p(y)\), the conditional \(p(x|y)\)
data.to(hv.Bars, 'x').opts(title='p(x|y)', width=350, height=200)
And a vertical slice (fixing \(x\)) gives, up to normalization by \(p(x)\), the conditional \(p(y|x)\)
data.to(hv.Bars, 'y').opts(title='p(y|x)', width=350, height=200)
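As a numerical sketch of this (reusing the XY and y arrays defined above; the variable names below are illustrative), the conditional \(p(x|y=0)\) is the corresponding row of the joint divided by the marginal \(p(y=0)\):
p_y = XY.sum(axis=1)               # marginal P(y): sum over x (columns)
row = np.argwhere(y == 0)[0, 0]    # row index where y = 0
p_x_given_y0 = XY[row, :]/p_y[row] # conditional p(x|y=0)
print(p_x_given_y0.sum())          # 1.0: a conditional is a valid PMF
print(np.allclose(p_x_given_y0*p_y[row], XY[row, :]))  # product rule: p(x|y)p(y) = p(x,y)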
Chain rule of probabilities (product rule):
\[P(X, Y) = P(Y|X)P(X) = P(X|Y)P(Y)\]
In general, a joint distribution factorizes into a product of conditionals. For example, if we have four variables:
\[P(X_1, X_2, X_3, X_4) = P(X_4|X_3, X_2, X_1)\,P(X_3|X_2, X_1)\,P(X_2|X_1)\,P(X_1)\]
Bayes' theorem:
Combining the product and sum rules for two random variables we can write
\[P(y|x) = \frac{P(x|y)P(y)}{P(x)} = \frac{P(x|y)P(y)}{\sum_{y'} P(x|y')P(y')}\]
We call \(P(y|x)\) the posterior distribution of \(y\):
What we know of \(y\) after we observe \(x\)
We call \(P(y)\) the prior distribution of \(y\)
What we know of \(y\) before observing \(x\)
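As a small numerical sketch of Bayes' theorem (all numbers below are made up for illustration): suppose a binary \(y\) with prior \(P(y=1)=0.1\) and a binary observation \(x\) with likelihoods \(P(x=1|y=0)=0.2\) and \(P(x=1|y=1)=0.9\); after observing \(x=1\) the posterior is
prior = np.array([0.9, 0.1])           # P(y=0), P(y=1)
likelihood = np.array([0.2, 0.9])      # P(x=1|y=0), P(x=1|y=1)
evidence = np.sum(likelihood*prior)    # P(x=1), by the sum rule
posterior = likelihood*prior/evidence  # P(y|x=1), by the product rule
print(posterior)                       # [0.667, 0.333]: updated belief in y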
Independence:
If two RVs are independent then
\[P(X, Y) = P(X)\,P(Y), \quad \text{or equivalently} \quad P(y|x) = P(y)\]
Knowing that \(x\) happened does not help me to know if \(y\) happened
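A quick numerical check of independence on the joint XY defined above: X and Y are independent exactly when the joint equals the outer product of its marginals (here it does not).
p_x = XY.sum(axis=0)                        # marginal P(x): sum over y (rows)
p_y = XY.sum(axis=1)                        # marginal P(y): sum over x (columns)
print(np.allclose(XY, np.outer(p_y, p_x)))  # False: this X and Y are not independent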
Conditional independence:
If two RVs are conditionally independent given a third one then
\[P(X, Y|Z) = P(X|Z)\,P(Y|Z)\]
The meaning of probability
Meaning 1: We observe the outcome of a random experiment (event) several times and count how often each outcome occurs
We flip a coin 5 times and get [x, x, o, x, o]
The probability of x is 3/5
The probability of o is 2/5
We have estimated the probability from the frequency of x and o
This is called the Frequentist interpretation of probability
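A sketch of this frequency-based estimate using simulated coin flips (the seed and sample sizes below are arbitrary):
rng = np.random.default_rng(seed=12345)
for n_flips in [5, 100, 10000]:
    flips = rng.integers(low=0, high=2, size=n_flips)  # 1 = "x", 0 = "o"
    print(n_flips, flips.mean())  # empirical frequency of "x", approaches 0.5 as n grows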
Meaning 2: Probability is the degree of belief of an event
Probabilities describe assumptions and also describe inference given those assumptions
This is called the Bayesian interpretation of probability
Enough philosophy. What is the difference for us?
Interpretation of uncertainty
Incorporation of prior information
Model evaluation
Handling of nuisance parameters
Specifically, on inference:
Frequentist: the parameters of frequentist models are point estimates
Bayesian: the parameters of Bayesian models can be uncertain and have distributions too; these distributions are called priors
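To make the contrast concrete, here is a minimal sketch for estimating the bias of a coin from five flips, assuming a Beta(2, 2) prior (the data and the prior are made up for illustration): the frequentist answer is a point estimate, while the Bayesian answer is a posterior distribution over the bias.
flips = np.array([1, 1, 0, 1, 0])              # five observed flips: 3 heads, 2 tails
heads, tails = flips.sum(), len(flips) - flips.sum()
print(heads/len(flips))                        # frequentist point estimate (maximum likelihood): 0.6
posterior = scipy.stats.beta(a=2 + heads, b=2 + tails)  # Beta prior -> Beta posterior (conjugacy)
print(posterior.mean())                        # posterior mean of the bias
print(posterior.interval(0.95))                # 95% credible interval for the bias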
See also
More on Frequentism vs Bayesianism by Jake Vanderplas