Consider a set of $N$ objects, and let $g$ be the multiplicity, that is, the number of configurations that are indistinguishable under the distribution $\rho$. Define the entropy $\sigma$ so that $g = e^{N\sigma}$. If $\rho$ is a multinomial distribution, the entropy for large $N$ is asymptotic to
$\sigma = -\sum_i { p_i \ln{p_i}}$.
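As a sketch of where this limit comes from, assume the multiplicity is the multinomial coefficient for occupation numbers $n_i = N p_i$ (the symbols $n_i$ are introduced here for illustration) and apply Stirling's approximation $\ln n! \approx n \ln n - n$:
\begin{align}
\ln g &= \ln \frac{N!}{\prod_i n_i!} \\
&\approx N \ln N - N - \sum_i (n_i \ln n_i - n_i) \\
&= -\sum_i n_i \ln \frac{n_i}{N} \\
&= -N \sum_i p_i \ln p_i
\end{align}
The third line uses $\sum_i n_i = N$. Dividing by $N$ gives $\sigma = \frac{1}{N} \ln g \to -\sum_i p_i \ln p_i$.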
If $\rho$ is a conditional distribution $\rho(x|y)$, the entropy becomes
$-\sum_i { p(x_i|y_j) \ln{p(x_i|y_j)}}$, which is a function of the conditioning value $y_j$.
The expectation value of this quantity over all $y$, which defines the conditional entropy $\sigma(x|y)$, is
\begin{align}
\left< \sigma \right> &= -\left <\sum_i p(x_i|y_j) \ln{p(x_i|y_j)}\right > \\
&= -\sum_j p(y_j) \sum_i p(x_i|y_j) \ln{p(x_i|y_j)} \\
&= \sum_j \sum_i p(y_j) p(x_i|y_j) [-\ln p(x_i, y_j) + \ln p(y_j)] \\
&= \sum_j \sum_i (-p(x_i, y_j) \ln p(x_i, y_j) + p(x_i, y_j) \ln p(y_j)) \\
&= \sum_j \sum_i -p(x_i, y_j) \ln p(x_i, y_j) + \sum_j p( y_j) \ln p(y_j) \\
&= \sigma(x,y) - \sigma(y)
\end{align}
In words: the entropy of $x$ conditional on $y$ is the joint entropy less the marginal entropy of $y$. Equivalently, the conditional entropy plus the marginal entropy is the joint entropy:
$$
\sigma(x,y) = \sigma(x|y) + \sigma(y)
$$
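As a worked check of this identity, consider a hypothetical two-by-two joint distribution (the numbers are chosen purely for illustration, and $0 \ln 0$ is taken to be $0$):
$$
p(x_1,y_1) = \tfrac12, \quad p(x_2,y_1) = 0, \quad p(x_1,y_2) = \tfrac14, \quad p(x_2,y_2) = \tfrac14
$$
The marginal is $p(y_1) = p(y_2) = \tfrac12$, so $x = x_1$ with certainty given $y_1$, while $x$ is uniform over its two outcomes given $y_2$. Then
\begin{align}
\sigma(y) &= \ln 2 \approx 0.693 \\
\sigma(x|y) &= \tfrac12 \cdot 0 + \tfrac12 \ln 2 \approx 0.347 \\
\sigma(x,y) &= \tfrac12 \ln 2 + \tfrac14 \ln 4 + \tfrac14 \ln 4 = \tfrac32 \ln 2 \approx 1.040
\end{align}
and indeed $\sigma(x|y) + \sigma(y) = \tfrac32 \ln 2 = \sigma(x,y)$.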
For two variables, the two conditional entropies together fall short of the joint entropy by a quantity known as the mutual information $I$:
$$
\sigma(x,y) = \sigma(y|x) + \sigma(x|y) + I(x,y)
$$
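Two limiting cases may help fix the meaning of $I$; both follow from the relations above and are offered as a sketch rather than a proof. If $x$ and $y$ are independent, then $p(x_i|y_j) = p(x_i)$, so $\sigma(x|y) = \sigma(x)$ and $\sigma(y|x) = \sigma(y)$, while the joint entropy is $\sigma(x,y) = \sigma(x|y) + \sigma(y) = \sigma(x) + \sigma(y)$. Hence
$$
I(x,y) = \sigma(x,y) - \sigma(y|x) - \sigma(x|y) = 0
$$
If instead each of $x$ and $y$ determines the other, every conditional distribution puts all its weight on a single outcome, so $\sigma(x|y) = \sigma(y|x) = 0$ and $I(x,y) = \sigma(x,y) = \sigma(x) = \sigma(y)$.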
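Entropy can be understood as the optimum average message length in the following sense. Consider a finite set of words, where word $i$ has length $l_i$ and appears with probability $p_i$. The average word length is then $\sum_i p_i l_i$. Subject to the constraint $\sum_i {e^{-l_i}} -1 =0$ (lengths measured in nats; for a binary code the constraint is $\sum_i 2^{-l_i} - 1 = 0$ and the logarithm below becomes $\log_2$), the minimum average length is achieved when $l_i = -\ln p_i$, and that minimum is the entropy $-\sum_i p_i \ln p_i$.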
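A sketch of where the optimal lengths come from, using a Lagrange multiplier and a general code base $b$ (the symbols $b$, $\lambda$, and $\mathcal{L}$ are introduced here for illustration; $b = e$ gives lengths in nats and $b = 2$ gives bits). Treating the $l_i$ as continuous, minimize
$$
\mathcal{L} = \sum_i p_i l_i + \lambda \left( \sum_i b^{-l_i} - 1 \right)
$$
Setting the derivative with respect to each $l_i$ to zero gives
\begin{align}
\frac{\partial \mathcal{L}}{\partial l_i} &= p_i - \lambda \, b^{-l_i} \ln b = 0 \\
b^{-l_i} &= \frac{p_i}{\lambda \ln b}
\end{align}
Summing over $i$ and using the constraint together with $\sum_i p_i = 1$ fixes $\lambda \ln b = 1$, so $b^{-l_i} = p_i$ and $l_i = -\log_b p_i$. The corresponding average length is $-\sum_i p_i \log_b p_i$, which is the entropy expressed in base $b$.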
We can ask how this optimal average length differs from the average achieved with some other choice of $l_i$. Pick a non-optimal set of lengths $l_i = -\ln q_i$, where the $q_i$ form some other probability distribution over the same words. The difference between the average of these lengths and the optimum average, known as the Kullback-Leibler divergence, is
$$ \langle l \rangle - \sigma(\rho) = \sum_i p_i l_i - (- \sum_i p_i \ln p_i ) = \sum_i p_i \ln \frac{p_i}{q_i} $$
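For a concrete sense of scale, take a hypothetical two-word example (values chosen only for illustration) with $p = (\tfrac12, \tfrac12)$ and $q = (\tfrac14, \tfrac34)$:
\begin{align}
\langle l \rangle &= \tfrac12 \ln 4 + \tfrac12 \ln \tfrac43 \approx 0.837 \\
\sigma(\rho) &= \ln 2 \approx 0.693 \\
\langle l \rangle - \sigma(\rho) &= \tfrac12 \ln 2 + \tfrac12 \ln \tfrac23 = \tfrac12 \ln \tfrac43 \approx 0.144
\end{align}
The mismatched lengths cost about $0.144$ nats per word on average; the difference vanishes when $q_i = p_i$.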
Mutual information can be written as the KL divergence between the joint distribution and the product of its marginals:
\begin{align}
D_{KL}(\rho(x,y)&\|\rho(x)\rho(y)) = \sum_i \sum_j p(x_i, y_j) \ln \frac{p(x_i, y_j)}{p(x_i) p(y_j)} \\
&= \sum_i \sum_j p(x_i, y_j) \ln p(x_i, y_j) - \sum_i p(x_i) \ln p(x_i) - \sum_j p(y_j) \ln p(y_j) \\
&= -\sigma(x,y) + \sigma(x) + \sigma(y) \\
&= -\frac12 (\sigma(x|y) + \sigma(y|x) + \sigma(x) + \sigma(y) ) + \sigma(x) +\sigma(y) \\
&= -\frac12 (\sigma(x|y) + \sigma(y|x) - \sigma(x) - \sigma(y) ) \\
&= -\frac12 (-2I) = I
\end{align}
The fourth line averages the two forms of the chain rule, $\sigma(x,y) = \sigma(x|y) + \sigma(y)$ and $\sigma(x,y) = \sigma(y|x) + \sigma(x)$. The last line uses $\sigma(x|y) - \sigma(x) = \sigma(y|x) - \sigma(y) = -I$, which follows from combining the chain rule with $\sigma(x,y) = \sigma(y|x) + \sigma(x|y) + I(x,y)$.