Thursday, August 25, 2016

Bayes Thinking

Preamble

Consider a set $\{x_i\}$ and a map $f$ of that set: $y_i = f(x_i)$. Further, consider a set of measurements $\{d_i\}$ of that mapping. The measurements may differ from the mapped values $\{y_i\}$ like so: $ d_i - \epsilon_i = y_i $. We don't know much about $\{\epsilon_i \}$, just that each is drawn independently from a Gaussian distribution with mean $0$ and variance $\sigma^2$. So for a given $f(x_i)$, we see $d_i$ with probability density $Normal(\epsilon_i)$.
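As a concrete sketch of this setup (the linear $f$, the noise level, and the grid of $x_i$ below are all arbitrary choices for illustration), the data might be generated like so:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                  # a hypothetical "true" mapping; any f would do
    return 2.0 * x + 1.0

sigma = 0.5                             # noise scale; variance is sigma**2
x = np.linspace(0.0, 1.0, 20)           # the set {x_i}
y = f(x)                                # the mapped values {y_i}
eps = rng.normal(0.0, sigma, x.shape)   # Gaussian errors {epsilon_i}
d = y + eps                             # measurements: d_i = y_i + epsilon_i
```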

Perhaps this is labeled data from a supervised learning scenario, or a regression problem. We wish to know which function, from the space of all possibilities, gives the best mapping. Bayes Theorem helps us understand how to think about a problem like this.

Bayes Theorem

\[ P(h \vert d) = \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \]

Bayes theorem relates the probability of hypothesis $h$ given data $d$ to the probability of data $d$ given hypothesis $h$. The hypotheses here are our candidate functions introduced above. These terms are sometimes described as $posterior = \left( {likelihood \over evidence} \right) {prior}$.  This is the math that tells you how to update your understanding from prior to posterior, given the likelihood of the data and the background evidence. Take a moment to appreciate what a bold statement that is. This tells you, essentially, how to think. Or at least how to learn.
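As a toy illustration with made-up numbers, the update is mechanical once the terms are tabulated, say for two rival hypotheses:

```python
import numpy as np

prior = np.array([0.5, 0.5])               # P(h) for two rival hypotheses
likelihood = np.array([0.8, 0.2])          # P(d|h) for the observed d
evidence = np.sum(likelihood * prior)      # P(d) = sum_h P(d|h) P(h)
posterior = likelihood * prior / evidence  # Bayes Theorem
print(posterior)                           # [0.8 0.2]
```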

So back to our problem: which is the best function? We want to choose the $h$ for which the posterior probability $P(h \vert d)$ is highest. This is called the maximum a posteriori (MAP) hypothesis. We might write this as

\[ argmax \left\{ \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \right\} \]

The normalizing factor $P(d)$ does not depend on the choice of hypothesis, and the same can be said of the prior $P(h)$ if the prior is uniform.  With both of these assumptions in place, maximum a posteriori is the same as maximum likelihood (ML):

\[ argmax \; P(h \vert d)  = argmax \; P(d \vert h) \]
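A quick numerical check of that claim, again with made-up numbers: under a uniform prior, the posterior and the likelihood peak at the same hypothesis.

```python
import numpy as np

prior = np.ones(3) / 3                     # uniform prior over 3 hypotheses
likelihood = np.array([0.1, 0.6, 0.3])     # P(d|h), toy numbers
posterior = likelihood * prior / np.sum(likelihood * prior)
print(np.argmax(posterior) == np.argmax(likelihood))  # True: MAP == ML
```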

Bayes Learning

So, assuming the measurements are independent so that the likelihood factors over the data points, let's find

\begin{align*}
 h_{ML} &= argmax \; P(d \vert h)  \\
 &=   argmax \prod_i P(d_i \vert h) \\
 &=   argmax \prod_i Normal(\epsilon_i) \\
 &=   argmax \prod_i C \, \exp \left(-\frac{\epsilon_i^2}{2 \sigma^2} \right)
\end{align*}

We can take the log of the argmax expression, since the extremum is maintained: the product becomes a sum of logs, the constant $C$ contributes nothing to the argmax, and maximizing $-\sum_i \epsilon_i^2 / (2 \sigma^2)$ is the same as minimizing $\sum_i \epsilon_i^2$.
\[ \bbox[5px,border:2px solid red] {
 h_{ML} =   argmin \sum_i \epsilon_i^2
} \tag{minimum squared error} \]

Whoever first said that we should minimize squared error knew what they were talking about. Bayes agrees.
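To see this numerically, here is a minimal sketch that fits synthetic data like the preamble's by least squares; under the Gaussian noise assumption, this argmin is exactly the ML hypothesis:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.shape)  # noisy data as before

# Least squares over the family of linear hypotheses h(x) = a*x + b.
A = np.vstack([x, np.ones_like(x)]).T
(a_hat, b_hat), *_ = np.linalg.lstsq(A, d, rcond=None)
print(a_hat, b_hat)  # close to the true slope 2.0 and intercept 1.0
```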

Let's back up a little and think about this another way. Consider the MAP hypothesis

\[ h_{MAP} = argmax \left\{ \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \right\} \]

Again, $P(d)$ does not distinguish hypotheses, so we can drop that term

\[ h_{MAP} = argmax \;  P(d \vert h)  P(h)   \]

The maximum is maintained through the log function

\[ h_{MAP} = argmax [  \log P(d \vert h) + \log P(h) ]  \]

\[ h_{MAP} = argmin [  -\log P(d \vert h) -\log P(h) ]  \tag{Occam's razor}\]

With this relation in view, consider the first term, $-\log P(d \vert h)$. We showed earlier that it is tied to our measurement or calculation error (or both). The other term, $-\log P(h)$, can be read as a description (code) length if we're thinking like C. Shannon: an optimal code assigns a hypothesis of probability $P(h)$ a codeword whose length is $-\log P(h)$. So we can paraphrase:

\[
\bbox[5px,border:2px solid red]
{
 h_{MAP} = \text{minimum error} + \text{minimum description length}
}
\]

Or, the best hypothesis is the simplest one that gets the job done.  Sounds like Occam's advice.
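To make the trade-off concrete, here is a minimal model-selection sketch: fit polynomials of increasing degree to the synthetic data from earlier, and score each by squared error plus a per-parameter cost standing in for $-\log P(h)$. The BIC-style penalty and the known noise variance are assumptions of this illustration, not part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.shape)

sigma2 = 0.5 ** 2                       # assume the noise variance is known
penalty = 0.5 * np.log(len(x))          # BIC-style cost per parameter
for degree in range(6):
    coeffs = np.polyfit(x, d, degree)
    resid = d - np.polyval(coeffs, x)
    error_term = np.sum(resid**2) / (2 * sigma2)   # ~ -log P(d|h)
    length_term = penalty * (degree + 1)           # ~ -log P(h)
    print(degree, round(error_term + length_term, 2))
# The score bottoms out at a low degree: enough parameters to fit
# the data, and no more.
```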