Information Theoretic Quantities

The following presents key concepts of Information Theory that will be used later to train generative models

How can we measure information?

Information Theory is the mathematical study of the quantification and transmission of information, proposed by Claude Shannon in his seminal work A Mathematical Theory of Communication (1948)

Shannon considered the output of a noisy source as a random variable \(X\)

  • The RV takes \(M\) possible values \(\mathcal{A} = \{x_1, x_2, x_3, \ldots, x_M\}\)

  • Each value \(x_i\) has an associated probability \(P(X=x_i) = p_i\)

Consider the following question: What is the amount of information carried by \(x_i\)?

Shannon defined the amount of information as

\[ I(x_i) = \log_2 \frac{1}{p_i}, \]

which is measured in bits

Note

One bit is the amount of information needed to choose between two equiprobable states
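To make the definition concrete, here is a minimal Python sketch (the function name `information_bits` is only for illustration):

```python
import numpy as np

def information_bits(p):
    """Amount of information, in bits, of an outcome with probability p."""
    return np.log2(1.0 / p)

# Choosing between two equiprobable states carries exactly one bit
print(information_bits(0.5))  # 1.0
```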

Example: A meteorological station in 1920 that sends tomorrow’s weather prediction from Niebla to Valdivia via telegraph

[Figure: telegraph between Niebla and Valdivia]

Tomorrow's weather is a random variable

  • The dictionary of messages: (1) Rainy, (2) Cloudy, (3) Partially cloudy, (4) Sunny

  • Assume that their probabilities are: \(p_1=1/2\), \(p_2=1/4\), \(p_3=1/8\), \(p_4=1/8\)

What is the minimum number of yes/no questions (each with equiprobable answers) needed to guess tomorrow's weather?

  • Is it going to rain?

  • No: Is it going to be cloudy?

  • No: Is it going to be sunny?

What then is the amount of information of each message? (verified numerically after the list)

  • Rainy: \(\log_2 \frac{1}{p_1} = \log_2 2 = 1\) bit

  • Cloudy: \(2\) bits

  • Partially cloudy and Sunny: \(3\) bits
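A quick numeric check of these values, assuming the probabilities given above:

```python
import numpy as np

# Message probabilities from the example
probs = {"rainy": 1/2, "cloudy": 1/4, "partially cloudy": 1/8, "sunny": 1/8}

for message, p in probs.items():
    print(f"{message}: {np.log2(1.0 / p):.0f} bits")
# rainy: 1, cloudy: 2, partially cloudy: 3, sunny: 3
```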

Important

The larger the probability of a message, the smaller the amount of information it carries

Shannon’s entropy

After defining the amount of information of a single state, Shannon defined the average information of the source \(X\) as

\[\begin{split} \begin{align} H(X) &= \mathbb{E}_{x\sim X}\left [\log_2 \frac{1}{P(x)} \right] \nonumber \\ &= - \sum_{x\in \mathcal{A}} P(x) \log_2 P(x) \nonumber \\ &= - \sum_{i=1}^M p_i \log_2 p_i ~ \text{[bits]} \nonumber \end{align} \end{split}\]

and called it the entropy of the source

Note

Entropy is the “average information of the source”

Properties:

  • Entropy is nonnegative: \(H(X) \geq 0\)

  • Entropy is zero when the source is deterministic, i.e., \(p_j = 1\) and \(p_i = 0\) for all \(i \neq j\)

  • Entropy is maximum when \(X\) is uniformly distributed, \(p_i = \frac{1}{M}\), in which case \(H(X) = \log_2(M)\) (verified numerically below)

Note

The more random the source is, the larger its entropy
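These properties can be checked numerically with the weather source from the example above (a minimal sketch; the helper `entropy_bits` is only for illustration):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits: the weather source
print(entropy_bits([1/4, 1/4, 1/4, 1/4]))  # 2.0 bits = log2(4): maximum for M = 4
print(entropy_bits([1, 0, 0, 0]))          # 0 bits: a deterministic source
```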

For continuous variables the Differential entropy is defined as

\[ H(p) = - \int p(x) \log p(x) \,dx ~ \text{[nats]} \]

where \(p(x)\) is the probability density function (pdf) of \(X\)
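For example, the differential entropy of a Gaussian with standard deviation \(\sigma\) has the closed form \(\frac{1}{2}\log(2\pi e \sigma^2)\) nats; the sketch below checks this against `scipy` (assuming `scipy` is available):

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # differential entropy in nats
print(closed_form)                  # ≈ 2.112
print(norm(scale=sigma).entropy())  # same value, computed by scipy
```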

Relative Entropy: Kullback-Leibler (KL) divergence

Consider a continuous random variable \(X\) and two distributions \(q(x)\) and \(p(x)\) defined on its probability space

The relative entropy between these distributions is

\[\begin{split} \begin{align} D_{\text{KL}} \left [ p(x) || q(x) \right] &= \mathbb{E}_{x \sim p(x)} \left [ \log \frac{p(x)}{q(x)} \right ] \nonumber \\ &= \mathbb{E}_{x \sim p(x)} \left [ \log p(x) \right ] - \mathbb{E}_{x \sim p(x)} \left [ \log q(x) \right ], \nonumber \\ &= \int p(x) \log p(x) \,dx - \int p(x) \log q(x) \,dx \nonumber \end{align} \end{split}\]

which is also known as the Kullback-Leibler divergence

  • The first term, \(\mathbb{E}_{x \sim p(x)} \left [ \log p(x) \right ]\), is the negative entropy of \(p(x)\)

  • The second term, \(-\mathbb{E}_{x \sim p(x)} \left [ \log q(x) \right ]\), is the cross-entropy of \(q(x)\) relative to \(p(x)\)
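This decomposition can be checked numerically for two discrete distributions (values chosen arbitrarily for illustration):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

entropy_p = -np.sum(p * np.log(p))       # entropy of p, in nats
cross_entropy = -np.sum(p * np.log(q))   # cross-entropy of q relative to p, in nats
kl = np.sum(p * np.log(p / q))           # KL divergence

print(kl, cross_entropy - entropy_p)     # both ≈ 0.173 nats
```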

Interpretations of the KL divergence

  • Coding: Expected number of “extra bits” needed to encode samples from \(p(x)\) using a code that is optimal for \(q(x)\)

  • Bayesian modeling: Amount of information lost when \(q(x)\) is used as a model for \(p(x)\)

Property: Non-negativity

The KL divergence is non-negative

\[ D_{\text{KL}} \left [ p(x) || q(x) \right] \geq 0 \]

with equality holding if and only if \(p(x) \equiv q(x)\)

This is a consequence of Gibbs' inequality

\[ - \int p(x) \log p(x) \,dx \leq - \int p(x) \log q(x) \,dx \]

Note

The entropy of \(p(x)\) is less than or equal to the cross-entropy of \(q(x)\) relative to \(p(x)\)

Property: Asymmetry

The KL divergence is asymmetric

\[ D_{\text{KL}} \left [ p(x) || q(x) \right] \neq D_{\text{KL}} \left [ q(x) || p(x) \right] \]

  • The KL divergence is not a proper distance (it is not symmetric and does not satisfy the triangle inequality)

  • Forward and Reverse KL have different meanings (we will explore them soon)
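A minimal numeric illustration of the asymmetry with two arbitrary discrete distributions, using `scipy.stats.entropy`, which returns the KL divergence (in nats) when a second distribution is passed:

```python
from scipy.stats import entropy

p = [0.5, 0.25, 0.125, 0.125]
q = [0.1, 0.2, 0.3, 0.4]

print(entropy(p, q))  # D_KL(p||q) ≈ 0.606 nats
print(entropy(q, p))  # D_KL(q||p) ≈ 0.522 nats, a different value
```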

Property: Relation with mutual information

The KL is related to the mutual information between random variables as

\[ \text{MI}(X, Y) = D_{\text{KL}} \left [ p(x, y) || p(x)p(y) \right] \]
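As a small sketch, the mutual information of two binary variables can be computed directly from this identity (joint probabilities chosen arbitrarily for illustration):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) of two binary random variables
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal P(X), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal P(Y), shape (1, 2)

# MI(X, Y) = D_KL[ P(X, Y) || P(X) P(Y) ]
mi = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))
print(mi)  # ≈ 0.125 bits; it would be exactly 0 if X and Y were independent
```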

See also

See Chapter 1 of D. MacKay's book, Information Theory, Inference, and Learning Algorithms, for more details on Information Theory