Consider a set of $N$ objects, and let $g$ be the number of configurations indistinguishable under distribution $\rho$; that is, the multiplicity. Define $\sigma$, the entropy, so that $g = e^{N\sigma}$. If $\rho$ is a multinomial distribution, the entropy is asymptotic to
$$\sigma = -\sum_i p_i \ln p_i$$
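As a quick numerical illustration (the distribution below is an assumed example, not from the text), one can compare $\ln(g)/N$ for a multinomial multiplicity against $-\sum_i p_i \ln p_i$ and watch them converge as $N$ grows:

```python
from math import lgamma, log

def log_multiplicity(counts):
    """ln(g) for a multinomial configuration with the given counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

p = [0.5, 0.3, 0.2]                       # assumed example distribution
sigma = -sum(pi * log(pi) for pi in p)
for N in (10, 100, 10_000, 1_000_000):
    counts = [round(N * pi) for pi in p]  # n_i = N * p_i
    print(f"N={N:>9,}: ln(g)/N = {log_multiplicity(counts)/sum(counts):.6f}  vs  sigma = {sigma:.6f}")
```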
If $\rho$ is a conditional distribution, $\rho(x|y)$, the entropy looks like
$$-\sum_i p(x_i|y_j)\ln p(x_i|y_j),$$
which of course is a function of $y_j$.
The expectation value of this over all $y$ is
$$\begin{aligned}
\langle\sigma\rangle &= -\left\langle \sum_i p(x_i|y_j)\ln p(x_i|y_j) \right\rangle \\
&= -\sum_j p(y_j) \sum_i p(x_i|y_j)\ln p(x_i|y_j) \\
&= \sum_j \sum_i p(y_j)\,p(x_i|y_j)\left[-\ln p(x_i,y_j) + \ln p(y_j)\right] \\
&= \sum_j \sum_i \left(-p(x_i,y_j)\ln p(x_i,y_j) + p(x_i,y_j)\ln p(y_j)\right) \\
&= -\sum_j \sum_i p(x_i,y_j)\ln p(x_i,y_j) + \sum_j p(y_j)\ln p(y_j) \\
&= \sigma(x,y) - \sigma(y)
\end{aligned}$$
In words: the entropy conditional on $y$ is the joint entropy less the $y$-marginal entropy; equivalently, the conditional entropy plus the marginal entropy is the joint entropy.
$$\sigma(x,y) = \sigma(x|y) + \sigma(y)$$
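A minimal numerical sketch of this identity, using an arbitrary $2\times 3$ joint table chosen purely for illustration:

```python
from math import log

# joint table p(x_i, y_j): rows are x, columns are y (values assumed for illustration)
p_xy = [[0.10, 0.20, 0.05],
        [0.30, 0.15, 0.20]]

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)

p_y = [sum(row[j] for row in p_xy) for j in range(3)]
sigma_joint = entropy([p for row in p_xy for p in row])
sigma_y = entropy(p_y)

# <sigma(x|y)>: entropy of each column's conditional p(x|y_j), weighted by p(y_j)
sigma_x_given_y = sum(
    p_y[j] * entropy([row[j] / p_y[j] for row in p_xy])
    for j in range(3)
)
print(sigma_joint, sigma_x_given_y + sigma_y)  # the two values agree
```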
For a bivariate distribution, the sum of the two conditional entropies falls short of the joint entropy by a quantity known as the mutual information $I$:
$$\sigma(x,y) = \sigma(y|x) + \sigma(x|y) + I(x,y)$$
Substituting $\sigma(x|y) = \sigma(x,y) - \sigma(y)$ and $\sigma(y|x) = \sigma(x,y) - \sigma(x)$ and solving gives the equivalent form $I(x,y) = \sigma(x) + \sigma(y) - \sigma(x,y)$.
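The same kind of numerical check applies to this decomposition (again the joint table is an assumption); here the conditional entropies are computed directly from the conditional distributions rather than via the identities above:

```python
from math import log

# assumed joint table p(x_i, y_j): rows are x, columns are y
p_xy = [[0.10, 0.20, 0.05],
        [0.30, 0.15, 0.20]]

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)

p_x = [sum(row) for row in p_xy]
p_y = [sum(row[j] for row in p_xy) for j in range(3)]

# conditional entropies, averaged over the conditioning variable
sigma_y_given_x = sum(p_x[i] * entropy([v / p_x[i] for v in p_xy[i]]) for i in range(2))
sigma_x_given_y = sum(p_y[j] * entropy([row[j] / p_y[j] for row in p_xy]) for j in range(3))

sigma_xy = entropy([v for row in p_xy for v in row])
I = entropy(p_x) + entropy(p_y) - sigma_xy   # equivalent form of I(x,y)

print(sigma_xy, sigma_y_given_x + sigma_x_given_y + I)  # equal
```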
Entropy can be understood as the optimal average message length in the following sense. Consider a finite set of words, each with length $l_i$, where word $i$ appears with probability $p_i$. The average word length, then, is $\sum_i p_i l_i$. Subject to the constraint $\sum_i e^{-l_i} - 1 = 0$ (lengths measured in nats, to match the natural logarithms used throughout), the minimum average length is achieved when $l_i = -\ln p_i$.
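One way to see where $l_i = -\ln p_i$ comes from, sketched with a Lagrange multiplier (a standard argument, not spelled out in the original): treat the $l_i$ as continuous and minimize the average length subject to the constraint,
$$\begin{aligned}
&\frac{\partial}{\partial l_i}\left[\sum_k p_k l_k + \lambda\left(\sum_k e^{-l_k} - 1\right)\right] = p_i - \lambda e^{-l_i} = 0
\quad\Rightarrow\quad e^{-l_i} = \frac{p_i}{\lambda}, \\
&\sum_i e^{-l_i} = 1 \;\Rightarrow\; \lambda = 1 \;\Rightarrow\; l_i = -\ln p_i .
\end{aligned}$$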
We can ask how much worse the average gets with some other choice of $l_i$. Pick a non-optimal set of lengths $l_i = -\ln q_i$. The difference between the average of these lengths and the optimal average, known as the Kullback–Leibler divergence, is
$$\langle l \rangle - \sigma(\rho) = \sum_i p_i l_i - \left(-\sum_i p_i \ln p_i\right) = -\sum_i p_i \ln q_i + \sum_i p_i \ln p_i = \sum_i p_i \ln\frac{p_i}{q_i}$$
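A small numerical sketch (both distributions are assumed for illustration) confirming that the excess average length and the directly computed KL divergence coincide:

```python
from math import log

# p is the true distribution; q defines the non-optimal lengths
p = [0.5, 0.25, 0.15, 0.10]
q = [0.25, 0.25, 0.25, 0.25]

lengths = [-log(qi) for qi in q]                     # l_i = -ln q_i
avg_len = sum(pi * li for pi, li in zip(p, lengths))
sigma_p = -sum(pi * log(pi) for pi in p)
d_kl = sum(pi * log(pi / qi) for pi, qi in zip(p, q))

print(avg_len - sigma_p, d_kl)  # both give D_KL(p || q) >= 0
```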
Mutual information can be viewed as the KL divergence between the joint distribution and the product of the marginals, like so
$$\begin{aligned}
D_{KL}\!\left(\rho(x,y)\,\|\,\rho(x)\rho(y)\right)
&= \sum_i \sum_j p(x_i,y_j) \ln\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)} \\
&= \sum_i \sum_j p(x_i,y_j)\ln p(x_i,y_j) - \sum_i p(x_i)\ln p(x_i) - \sum_j p(y_j)\ln p(y_j) \\
&= -\sigma(x,y) + \sigma(x) + \sigma(y) \\
&= -\tfrac{1}{2}\left(\sigma(x|y) + \sigma(y|x) + \sigma(x) + \sigma(y)\right) + \sigma(x) + \sigma(y) \\
&= -\tfrac{1}{2}\left(\sigma(x|y) + \sigma(y|x) - \sigma(x) - \sigma(y)\right) \\
&= -\tfrac{1}{2}(-2I) \\
&= I
\end{aligned}$$
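And a final sketch, on the same kind of assumed joint table, showing that $D_{KL}(\rho(x,y)\,\|\,\rho(x)\rho(y))$ matches $I$ computed from the entropies:

```python
from math import log

# assumed joint table p(x_i, y_j): rows are x, columns are y
p_xy = [[0.10, 0.20, 0.05],
        [0.30, 0.15, 0.20]]

p_x = [sum(row) for row in p_xy]
p_y = [sum(row[j] for row in p_xy) for j in range(3)]

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)

# KL divergence of the joint from the product of the marginals
d_kl = sum(
    p_xy[i][j] * log(p_xy[i][j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(3)
)
# I(x,y) = sigma(x) + sigma(y) - sigma(x,y)
I = entropy(p_x) + entropy(p_y) - entropy([v for row in p_xy for v in row])
print(d_kl, I)  # the two agree
```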