Saturday, November 17, 2018

Entropy of a Conditional Distribution

Consider a set of $N$ objects, and let $g$ be the number of configurations indistinguishable under distribution $\rho$, that is, the multiplicity. Define the entropy $\sigma$ so that $g = e^{N\sigma}$. If $\rho$ is a multinomial distribution, the entropy is asymptotically (for large $N$)

$\sigma = -\sum_i { p_i \ln{p_i}}$
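As a quick numerical illustration (a sketch, not part of the original argument; the probabilities below are arbitrary), $\ln g / N$ computed from the multinomial coefficient $N!/\prod_i (Np_i)!$ approaches $-\sum_i p_i \ln p_i$ as $N$ grows:

```python
# Sketch: ln(multiplicity)/N -> -sum_i p_i ln p_i for a multinomial.
# The multiplicity g is taken to be the multinomial coefficient N!/prod(n_i!).
import math

def log_multiplicity(counts):
    """ln( N! / prod(n_i!) ), computed via log-gamma; N = sum(counts)."""
    return math.lgamma(sum(counts) + 1) - sum(math.lgamma(n + 1) for n in counts)

p = [0.5, 0.3, 0.2]                      # arbitrary illustrative distribution
sigma = -sum(pi * math.log(pi) for pi in p)

for N in (100, 10_000, 1_000_000):
    counts = [round(N * pi) for pi in p]
    print(N, log_multiplicity(counts) / sum(counts), sigma)
# The middle column converges to sigma as N increases.
```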

If $\rho$ is a conditional distribution $\rho(x|y)$, the entropy looks like

$-\sum_i { p(x_i|y_j) \ln{p(x_i|y_j)}}$, which of course is a function of $y_j$.

The expectation value for this over all $y$ is
\begin{align}
\left< \sigma \right> &= -\left <\sum_i p(x_i|y_j) \ln{p(x_i|y_j)}\right > \\
 &= -\sum_j p(y_j) \sum_i  p(x_i|y_j) \ln{p(x_i|y_j)} \\
 &= \sum_j  \sum_i  p(y_j) p(x_i|y_j) [-\ln p(x_i, y_j) + \ln p(y_j)] \\
 &= \sum_j  \sum_i  (-p(x_i, y_j) \ln p(x_i, y_j) + p(x_i, y_j) \ln p(y_j)) \\
 &=  \sum_j  \sum_i -p(x_i, y_j) \ln p(x_i, y_j) + \sum_j p( y_j) \ln p(y_j) \\
&= \sigma(x,y) - \sigma(y)
\end{align}

In words: the entropy of $x$ conditional on $y$ is the joint entropy less the $y$-marginal entropy; equivalently, the conditional entropy plus the marginal entropy is the joint entropy. Writing the averaged conditional entropy $\left<\sigma\right>$ as $\sigma(x|y)$,

$$
\sigma(x,y) =  \sigma(x|y) + \sigma(y)
$$

For a pair of variables, the two conditional entropies fall short of the joint entropy by a quantity known as the mutual information $I$

$$
\sigma(x,y) =  \sigma(y|x) + \sigma(x|y) + I(x,y)
$$
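Both decompositions are easy to check numerically. A minimal Python sketch (the joint distribution below is arbitrary, and $I$ is computed from the marginals as $\sigma(x) + \sigma(y) - \sigma(x,y)$, which the KL calculation at the end of the post shows is the same quantity):

```python
# Sketch: check sigma(x,y) = sigma(x|y) + sigma(y) and
# sigma(x,y) = sigma(y|x) + sigma(x|y) + I on an arbitrary joint distribution.
import numpy as np

def H(p):
    """Entropy -sum p ln p of an array of probabilities (zero entries skipped)."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()                      # joint p(x_i, y_j)

p_x = p_xy.sum(axis=1)                  # marginal p(x_i)
p_y = p_xy.sum(axis=0)                  # marginal p(y_j)

# Conditional entropies as expectations over the conditioning variable.
sigma_x_given_y = sum(p_y[j] * H(p_xy[:, j] / p_y[j]) for j in range(len(p_y)))
sigma_y_given_x = sum(p_x[i] * H(p_xy[i, :] / p_x[i]) for i in range(len(p_x)))

I = H(p_x) + H(p_y) - H(p_xy)           # mutual information from the marginals

assert np.isclose(H(p_xy), sigma_x_given_y + H(p_y))
assert np.isclose(H(p_xy), sigma_y_given_x + sigma_x_given_y + I)
```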

Entropy can be understood as the optimal average message length in the following sense. Consider a finite set of words, the $i$-th of which has length $l_i$ and appears with probability $p_i$. The average word length is then $\sum_i p_i l_i$. Subject to the Kraft-type constraint $\sum_i {e^{-l_i}} - 1 = 0$ (written here in base $e$, so lengths are measured in nats, matching the natural logarithms used throughout), the minimum average length is achieved when $l_i = -\ln p_i$.
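Why this choice minimizes the average length is a one-line Lagrange-multiplier computation (a sketch; this step is not spelled out above):

$$
\frac{\partial}{\partial l_i}\left[\sum_k p_k l_k + \lambda\left(\sum_k e^{-l_k} - 1\right)\right] = p_i - \lambda e^{-l_i} = 0
\quad \Rightarrow \quad e^{-l_i} = \frac{p_i}{\lambda},
$$

and the constraint $\sum_i e^{-l_i} = 1$ forces $\lambda = 1$, giving $l_i = -\ln p_i$.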

We can ask how much worse the average length gets with some other choice of $l_i$. Pick a non-optimal set of lengths $l_i = -\ln q_i$, where $q$ is some other distribution over the same words. The difference between the average of these lengths and the optimal average, known as the Kullback-Leibler divergence, is

$$ \langle l \rangle - \sigma(\rho) = \sum_i p_i l_i - (- \sum_i p_i \ln p_i ) = \sum_i p_i \ln \frac{p_i}{q_i} $$
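A minimal numerical sketch of this statement (the distributions $p$ and $q$ below are arbitrary illustrations):

```python
# Sketch: KL(p||q) equals the excess average message length incurred by
# coding with lengths l_i = -ln q_i instead of the optimal l_i = -ln p_i.
import numpy as np

p = np.array([0.5, 0.3, 0.2])            # true word probabilities (illustrative)
q = np.array([0.6, 0.2, 0.2])            # mismatched coding distribution

avg_len_q = np.sum(p * -np.log(q))       # <l> with l_i = -ln q_i
sigma_p   = np.sum(p * -np.log(p))       # optimal average length = sigma(p)
kl        = np.sum(p * np.log(p / q))

assert np.isclose(avg_len_q - sigma_p, kl)
print(avg_len_q - sigma_p, kl)
```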

Mutual information can be viewed as the KL divergence of the joint distribution from the product of the marginals, like so

\begin{align}
D_{KL}(\rho(x,y)&||\rho(x)\rho(y)) = \sum_i \sum_j p(x_i, y_j) \ln \frac{p(x_i, y_j)}{p(x_i) p(y_j)} \\
&=  \sum_i \sum_j p(x_i, y_j) \ln p(x_i, y_j) -  \sum_i  p(x_i)  \ln p(x_i) - \sum_j p(y_j)  \ln p(y_j) \\
&= -\sigma(x,y) + \sigma(x) + \sigma(y) \\
&= -\frac12 (\sigma(x|y) + \sigma(y|x) + \sigma(x) + \sigma(y) ) + \sigma(x) +\sigma(y) \\
 &= -\frac12 \big( [\sigma(x|y) - \sigma(x)] + [\sigma(y|x) - \sigma(y)] \big)  \\
 &=  -\frac12 (-I - I) = I
\end{align}

where the last step uses $\sigma(x|y) - \sigma(x) = -I$ and $\sigma(y|x) - \sigma(y) = -I$, which follow from combining the definition of $I$ above with the two chain-rule decompositions of $\sigma(x,y)$.
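This, too, is easy to check numerically. A self-contained sketch using scipy.stats.entropy, which with two arguments computes $\sum_k p_k \ln(p_k/q_k)$ (the joint distribution below is arbitrary):

```python
# Sketch: D_KL( p(x,y) || p(x) p(y) ) equals sigma(x) + sigma(y) - sigma(x,y) = I.
import numpy as np
from scipy.stats import entropy          # entropy(pk, qk) = sum pk ln(pk/qk)

rng = np.random.default_rng(1)
p_xy = rng.random((4, 5))
p_xy /= p_xy.sum()                       # joint distribution p(x_i, y_j)

p_x = p_xy.sum(axis=1)                   # marginal p(x_i)
p_y = p_xy.sum(axis=0)                   # marginal p(y_j)
product = np.outer(p_x, p_y)             # independent product distribution

d_kl = entropy(p_xy.ravel(), product.ravel())
mutual_info = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

assert np.isclose(d_kl, mutual_info)
print(d_kl, mutual_info)
```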