import holoviews as hv
hv.extension('bokeh')
import numpy as np
import scipy.stats

Probability Theory

Disclaimer: This chapter is not comprehensive. It serves only as a summary of the fundamental concepts from probability theory and statistical inference needed in this course

Basic concepts

Random Variable (RV)

A variable that we assign to the outcome of a random phenomenon

Examples of RV:

  • result of flipping a coin or rolling a die

  • tomorrow’s weather

  • number of minutes I will spend stuck in traffic on my way to work

We don’t know the value of an RV until we sample from it. Sampling is equivalent to observing the RV.

Notation: RVs are denoted by capital letters while observations are denoted by lowercase letters, and

\[ x \sim X, \]

means that \(x\) was sampled from \(X\)

Important

We describe an RV by its domain and its probability density or mass function

Example: Fair six-faced dice

Domain (possible outcomes): \(\{1, 2, 3, 4, 5, 6\}\)

Probability mass function: \(\left[\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}\right]\)

  • The probability of drawing a \(1\) is \(P(X=1) = P(1) = \frac{1}{6}\)

  • The probability of drawing a number greater than or equal to \(5\) is \(P(X\geq 5) = \frac{1}{3}\)

  • The probability of drawing an odd number is \(P(\text{odd}) = \frac{1}{2}\)
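A minimal sketch computing these probabilities numerically from the PMF above (using the numpy import at the top of the chapter):

outcomes = np.arange(1, 7)  # domain of the fair die
pmf = np.full(6, 1/6)       # probability mass function

print(pmf[outcomes == 1].sum())      # P(X=1) = 1/6
print(pmf[outcomes >= 5].sum())      # P(X>=5) = 1/3
print(pmf[outcomes % 2 == 1].sum())  # P(odd) = 1/2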

Joint, Marginal and Conditional probabilities

If we have two or more random variables we can define their joint PMF: \(P(X,Y)\)

From the joint we sum (integrate) to obtain the marginal distribution of \(X\) or \(Y\)

Law of total probability (sum rule):

If we sum (marginalize) on X we obtain the marginal for \(Y\)

\[\begin{split} \begin{align} P(Y=y) &= \sum_{x \in \mathcal{X}} P(X=x, Y=y) \nonumber \\ &= \sum_{x \in \mathcal{X}} P(Y=y|X=x) P(X=x), \end{align} \end{split}\]

where \(P(Y=y|X=x)\) is the conditional probability of \(y\) given \(x\)

\[ P(Y=y|X=x) = \frac{P(X=x, Y=y)}{P(X=x)} \]

(provided \(P(X=x) \neq 0\))

Example: The following example shows the joint PMF of a two-dimensional discrete RV

If we sum in either axis we obtain the marginals

# Toy joint PMF on a 9 x 9 grid: place mass on three segments and normalize
x = np.arange(-4, 5, 1)
y = np.arange(-4, 5, 1)
X, Y = np.meshgrid(x, y)
XY = np.zeros_like(X)
XY[2:-2, -3] = 1; XY[2:-2, 2] = 1; XY[4, 2:-2] = 1
XY = XY/np.sum(XY)

# Joint PMF as an image, with the marginals (sums over each axis) adjoined as bar plots
data = hv.Dataset((x, y, XY), kdims=['x', 'y'], vdims='xy')
joint = data.to(hv.Image).opts(cmap='Blues', width=300, height=300)
margx = joint.reduce(x=np.sum).opts(interpolation='steps-mid').to(hv.Bars)
margy = joint.reduce(y=np.sum).opts(interpolation='steps-mid').to(hv.Bars)
(joint << margx.opts(width=150) << margy.opts(height=150))

If we take a horizontal slice of the joint and normalize it by \(p(y)\) we obtain the conditional \(p(x|y)\)

data.to(hv.Bars, 'x').opts(title='p(x|y)', width=350, height=200) 

And a vertical slice, normalized by \(p(x)\), gives the conditional \(p(y|x)\)

data.to(hv.Bars, 'y').opts(title='p(y|x)', width=350, height=200) 
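The same quantities can be computed directly from the array XY defined above. A minimal sketch, assuming the gridded-data convention that rows of XY index y and columns index x:

# Sum rule: marginals are obtained by summing the joint over the other variable
px = XY.sum(axis=0)  # P(x), summing over y (rows)
py = XY.sum(axis=1)  # P(y), summing over x (columns)
print(px.sum(), py.sum())  # both equal 1

# Product rule: a conditional is a slice of the joint divided by the
# marginal of the conditioning variable (wherever that marginal is nonzero)
j = 4                          # index of the slice y = y[j]
px_given_y = XY[j, :] / py[j]  # P(x | y = y[j])
print(px_given_y.sum())        # equals 1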

Chain rule of probabilities (product rule):

Any joint distribution can be factorized into a product of conditionals. For example, if we have four variables:

\[\begin{split} \begin{align} P(x_1, x_2, x_3, x_4) &= P(x_4|x_3, x_2, x_1) P(x_3, x_2, x_1) \nonumber \\ &= P(x_4|x_3, x_2, x_1) P(x_3| x_2, x_1) P(x_2, x_1) \nonumber \\ &= P(x_4|x_3, x_2, x_1) P(x_3| x_2, x_1) P(x_2 |x_1) P(x_1) \nonumber \\ \end{align} \end{split}\]
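A sketch verifying this factorization numerically on an arbitrary (made-up) joint PMF over three binary variables:

# Hypothetical joint PMF over three binary variables x1, x2, x3
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P = P / P.sum()

P1 = P.sum(axis=(1, 2))            # P(x1), by the sum rule
P12 = P.sum(axis=2)                # P(x1, x2)
P2_given_1 = P12 / P1[:, None]     # P(x2 | x1)
P3_given_12 = P / P12[:, :, None]  # P(x3 | x1, x2)

# Chain rule: P(x1, x2, x3) = P(x3|x2, x1) P(x2|x1) P(x1)
chain = P3_given_12 * P2_given_1[:, :, None] * P1[:, None, None]
print(np.allclose(P, chain))  # True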

Bayes Theorem:

Combining the product and sum rule for two random variables we can write

\[ P(y | x) = \frac{P(x|y) P(y)}{P(x)} = \frac{P(x|y) P(y)}{\sum_{y\in\mathcal{Y}} P(x|y) P(y)} \]

We call \(P(y|x)\) the posterior distribution of \(y\):

What we know of \(y\) after we observe \(x\)

We call \(P(y)\) the prior distribution of \(y\)

What we know of \(y\) before observing \(x\)
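A minimal sketch of Bayes' theorem on a small discrete example (the prior and likelihood values below are made up for illustration):

prior = np.array([0.7, 0.3])       # P(y) for two hypotheses y in {0, 1}
likelihood = np.array([0.1, 0.8])  # P(x|y): likelihood of the observation x under each y

evidence = np.sum(likelihood * prior)      # P(x), by the sum rule
posterior = likelihood * prior / evidence  # P(y|x), by Bayes theorem
print(posterior, posterior.sum())          # the posterior sums to 1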

Independence:

If two RVs are independent then

\[\begin{split} \begin{align} P(x, y) &= P(y|x) P(x)\nonumber \\ &= P(y) P(x)\nonumber \end{align} \end{split}\]

Knowing that \(x\) happened gives us no information about whether \(y\) happened
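Equivalently, the joint of two independent RVs is the outer product of its marginals. A quick check on the joint XY from the example above, which does not factorize:

# If x and y were independent, XY would equal the outer product of its marginals
px = XY.sum(axis=0)  # P(x)
py = XY.sum(axis=1)  # P(y)
print(np.allclose(XY, np.outer(py, px)))  # False: x and y are not independent here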

Conditional independence:

If two RVs are conditionally independent given a third one then

\[ P(x, y|z) = P(x|z)P(y|z) \]

The meaning of probability

Meaning 1: We observe the outcome of a random experiment (event) several times and we count how often each outcome occurs

We flip a coin 5 times and get [x, x, o, x, o]

  • The probability of x is 3/5

  • The probability of o is 2/5

We have estimated the probabilities from the relative frequencies of x and o

This is called the Frequentist interpretation of probability
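A minimal sketch of this idea, simulating a fair coin with numpy (the 0.5 bias and the number of flips are assumptions for illustration):

# Estimate P(x) by the relative frequency of x in simulated flips
rng = np.random.default_rng(1234)
flips = rng.choice(['x', 'o'], size=1000, p=[0.5, 0.5])
print(np.mean(flips == 'x'))  # close to 0.5; the estimate improves with more flips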

Meaning 2: Probability is the degree of belief in an event

Probabilities describe assumptions and also describe inference given those assumptions

This is called the Bayesian interpretation of probability

Enough philosophy. What is the difference for us?

  • Interpretation of uncertainty

  • Incorporation of prior information

  • Model evaluation

  • Handling of nuisance parameters

Specifically on inference

  • Frequentist: Parameters of frequentist models are fixed unknown quantities summarized by point estimates

  • Bayesian: Parameters of Bayesian models are themselves uncertain and have distributions; these distributions are called priors
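A hedged sketch of this distinction using the coin-flip data above (3 of 5 flips were x) and scipy.stats; the uniform Beta(1, 1) prior is an assumption made here for illustration:

# Frequentist: a single point estimate of the probability of x
p_hat = 3/5

# Bayesian: the probability of x is itself uncertain; with a uniform Beta(1, 1)
# prior and 3 x's out of 5 flips, the posterior is the Beta(4, 3) distribution
posterior = scipy.stats.beta(a=1 + 3, b=1 + 2)
print(p_hat, posterior.mean(), posterior.interval(0.95))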

See also

More on Frequentism vs Bayesianism by Jake Vanderplas