Processing math: 75%

Thursday, August 25, 2016

Bayes Thinking

Preamble

Consider a set {xi}, and map f(x) of that set: yi=f(xi). Further, consider a set of measurements {di} of that mapping. The measurements possibly differ from the mapped values {yi} like so:  diϵi=yi. We don't know much about {ϵi}, just that they are randomly selected from a gaussian distribution with center 0 and width σ2. So for a given f(xi), we see di with probability Normal(ϵi).

Perhaps this is labeled data from a supervised learning scenario, or a regression problem. We wish to know which function, from the space of all possibilities, gives the best mapping. Bayes Theorem helps us understand how to think about a problem like this.

Bayes Theorem

P(h|d)=(P(d|h)P(d))P(h)

Bayes theorem relates the probability of hypothesis h given data d to the probability of data d given hypothesis h. The hypotheses are our candidate functions introduced above. These terms are sometimes described as posterior=(likelihoodevidence)prior.  This is the math that tells you how to update your understanding from prior to posterior, given information likelihood, and  background evidence. Take a moment to appreciate what a bold statement that is. This tells you, essentially, how to think. Or at least how to learn.

So back to our problem: which is the best function? Clearly we want to choose the h such that the probability for the winning h is the highest in the sense of P(h|d). This is called maximum a-posteriori (MAP). We might write this as

argmax{(P(d|h)P(d))P(h)}

The normalizing factor P(d) does not distinguish choice of hypothesis, and the same can be said of the prior if the prior is uniform.  With both of these assumptions in place, maximum a-posteriori is the same as maximum likelihood (ML):

argmaxP(h|d)=argmaxP(d|h)

Bayes Learning

So let's find

hML=argmaxP(d|h)=argmaxiP(di|h)=argmaxiNormal(ϵi)=argmaxiCexp(ϵ2i2σ2)

We can take the log of the argmax expression, as the extremum is maintained.
\bbox[5px,border:2px solid red] {  h_{ML} =   argmin \sum_i \epsilon_i^2 } \tag{minimum squared error}

Whoever said that we should minimize squared error knew what he was talking about. Bayes agrees.

Let's back up a little and think about this another way. Consider the MAP hypothesis

h_{MAP} = argmax \left\{ \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \right\}

Again, P(d) does not distinguish hypotheses, so we can drop that term

h_{MAP} = argmax \;  P(d \vert h)  P(h)  

The maximum is maintained through the log function

h_{MAP} = argmax [  \log P(d \vert h) + \log P(h) ]  

h_{MAP} = argmin [  -\log P(d \vert h) -\log P(h) ]  \tag{Occam's razor}

With this relation in view, consider the first term. We decided earlier that was related to our measurement or calculation error (or both). The other term, -\log P(h) can be considered a description (code) length, if we're thinking like C. Shannon. So we can paraphrase:

\bbox[5px,border:2px solid red] {  h_{MAP} = \text{minimum error} + \text{minimum description length} }

Or, the best hypothsis is the simplest one that gets the job done.  Sounds like Occam's advice.