
Saturday, November 17, 2018

Entropy of a Conditional Distribution

Consider a set of $N$ objects, and let $g$ be the number of configurations indistinguishable under distribution $\rho$; that is, the multiplicity. Define $\sigma$, the entropy, so that $g = e^{N\sigma}$. If $\rho$ is a multinomial distribution, the entropy is asymptotically

$$\sigma = -\sum_i p_i \ln p_i$$
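As a quick numerical illustration (a sketch; the distribution $p$ and the helper log_multiplicity are made-up choices), the log-multiplicity per object of a multinomial count vector approaches this sum as $N$ grows:

import math

# Sketch: compare (1/N) ln g against -sum_i p_i ln p_i for growing N.
def log_multiplicity(counts):
    # ln of the multinomial coefficient N! / (n_1! n_2! ... n_k!)
    N = sum(counts)
    return math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)

p = [0.5, 0.3, 0.2]
sigma = -sum(pi * math.log(pi) for pi in p)
for N in (100, 10_000, 1_000_000):
    counts = [round(pi * N) for pi in p]
    print(N, log_multiplicity(counts) / N, sigma)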

If $\rho$ is a conditional distribution, $\rho(x|y)$, the entropy looks like

$$-\sum_i p(x_i|y_j) \ln p(x_i|y_j),$$

which of course is a function of $y_j$.

The expectation value of this over all $y$ is

\begin{align}
\sigma(x|y) &= \Bigl\langle -\sum_i p(x_i|y_j) \ln p(x_i|y_j) \Bigr\rangle \\
&= -\sum_j p(y_j) \sum_i p(x_i|y_j) \ln p(x_i|y_j) \\
&= -\sum_j \sum_i p(y_j)\, p(x_i|y_j) \bigl[ \ln p(x_i,y_j) - \ln p(y_j) \bigr] \\
&= -\sum_j \sum_i \bigl( p(x_i,y_j) \ln p(x_i,y_j) - p(x_i,y_j) \ln p(y_j) \bigr) \\
&= -\sum_j \sum_i p(x_i,y_j) \ln p(x_i,y_j) + \sum_j p(y_j) \ln p(y_j) \\
&= \sigma(x,y) - \sigma(y)
\end{align}
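A small numerical check of this identity (the 2×2 joint distribution below is made up for illustration):

import math

# Sketch: verify sigma(x|y) = sigma(x,y) - sigma(y) on a made-up joint distribution.
joint = [[0.10, 0.30],    # rows index x_i, columns index y_j
         [0.25, 0.35]]
p_y = [sum(row[j] for row in joint) for j in range(2)]

sigma_xy = -sum(p * math.log(p) for row in joint for p in row)
sigma_y = -sum(p * math.log(p) for p in p_y)
sigma_x_given_y = -sum(joint[i][j] * math.log(joint[i][j] / p_y[j])
                       for i in range(2) for j in range(2))

print(sigma_x_given_y, sigma_xy - sigma_y)   # the two values agree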

In words: the conditional (on $y$) entropy is the joint entropy less the $y$-marginal entropy; equivalently, the conditional entropy plus the marginal entropy gives the joint entropy.

$$\sigma(x,y) = \sigma(x|y) + \sigma(y)$$

For a joint distribution of two variables, the two conditional entropies together fall short of the joint entropy by a quantity known as the mutual information $I$:

$$\sigma(x,y) = \sigma(y|x) + \sigma(x|y) + I(x,y)$$
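One way to unpack this definition is to apply the chain rule in both orders, which yields the symmetric form that reappears in the KL-divergence calculation below:

\begin{align}
I(x,y) &= \sigma(x,y) - \sigma(x|y) - \sigma(y|x) \\
&= \bigl[ \sigma(x,y) - \sigma(x|y) \bigr] + \bigl[ \sigma(x,y) - \sigma(y|x) \bigr] - \sigma(x,y) \\
&= \sigma(x) + \sigma(y) - \sigma(x,y)
\end{align}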

Entropy can be understood as the optimum average message length in the following sense. Consider a finite set of words, the $i$-th having length $l_i$ and appearing with probability $p_i$. The average word length is then $\sum_i p_i l_i$. Subject to the constraint $\sum_i e^{-l_i} - 1 = 0$ (the Kraft condition, with lengths measured in nats to match the natural logarithms used throughout), the minimum average length is achieved when $l_i = -\ln p_i$.
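A sketch of that minimization, using a Lagrange multiplier $\lambda$ and ignoring the requirement that lengths be integers:

$$\frac{\partial}{\partial l_i}\Bigl[ \sum_k p_k l_k + \lambda \Bigl( \sum_k e^{-l_k} - 1 \Bigr) \Bigr] = p_i - \lambda e^{-l_i} = 0 \quad\Rightarrow\quad e^{-l_i} = p_i/\lambda$$

The constraint $\sum_i e^{-l_i} = 1$ then forces $\lambda = 1$, so $l_i = -\ln p_i$, and the minimum average length is $\sum_i p_i l_i = -\sum_i p_i \ln p_i = \sigma$.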

We can ask what the difference is between this optimal average length and an average achieved with some other choice of $l_i$. Pick some non-optimal set of lengths $l_i = -\ln q_i$. The difference between the average of these lengths and the optimal average, known as the Kullback-Leibler divergence, is

$$\langle l \rangle - \sigma(\rho) = \sum_i p_i l_i - \Bigl( -\sum_i p_i \ln p_i \Bigr) = \sum_i p_i \ln \frac{p_i}{q_i}$$
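A quick numerical sketch in Python (both $p$ and the non-optimal $q$ below are arbitrary choices for illustration):

import math

# Sketch: the excess average length from using l_i = -ln q_i instead of -ln p_i
# equals the KL divergence sum_i p_i ln(p_i / q_i).  p and q are made up.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

avg_len = sum(pi * -math.log(qi) for pi, qi in zip(p, q))
sigma_p = -sum(pi * math.log(pi) for pi in p)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(avg_len - sigma_p, kl)   # the two values agree, and are non-negative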

Mutual information can be viewed as the KL divergence of the joint distribution from the product of the marginals, like so

\begin{align}
D_{KL}\bigl(\rho(x,y)\,\|\,\rho(x)\rho(y)\bigr) &= \sum_i \sum_j p(x_i,y_j) \ln \frac{p(x_i,y_j)}{p(x_i)\,p(y_j)} \\
&= \sum_i \sum_j p(x_i,y_j) \ln p(x_i,y_j) - \sum_i p(x_i) \ln p(x_i) - \sum_j p(y_j) \ln p(y_j) \\
&= -\sigma(x,y) + \sigma(x) + \sigma(y) \\
&= -\tfrac{1}{2}\bigl( \sigma(x|y) + \sigma(y|x) + \sigma(x) + \sigma(y) \bigr) + \sigma(x) + \sigma(y) \\
&= -\tfrac{1}{2}\bigl( \sigma(x|y) + \sigma(y|x) - \sigma(x) - \sigma(y) \bigr) \\
&= -\tfrac{1}{2}(-2I) \\
&= I
\end{align}
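As a final numerical check (another made-up 2×2 joint distribution, in the same Python sketch style as above), the KL divergence of the joint from the product of marginals matches $\sigma(x) + \sigma(y) - \sigma(x,y)$:

import math

# Sketch: compute the mutual information two ways on a made-up joint distribution
# and confirm they agree.
joint = [[0.10, 0.30],    # rows index x_i, columns index y_j
         [0.25, 0.35]]
p_x = [sum(row) for row in joint]
p_y = [sum(row[j] for row in joint) for j in range(2)]

def sigma(dist):
    return -sum(p * math.log(p) for p in dist)

sigma_xy = -sum(p * math.log(p) for row in joint for p in row)

kl = sum(joint[i][j] * math.log(joint[i][j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2))

print(kl)                                    # D_KL(joint || product of marginals)
print(sigma(p_x) + sigma(p_y) - sigma_xy)    # sigma(x) + sigma(y) - sigma(x,y)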