
Thursday, August 25, 2016

Bayes Thinking

Preamble

Consider a set $\{x_i\}$, and a map $f(x)$ of that set: $y_i = f(x_i)$. Further, consider a set of measurements $\{d_i\}$ of that mapping. The measurements may differ from the mapped values $\{y_i\}$ like so: $ d_i - \epsilon_i = y_i $. We don't know much about $\{\epsilon_i \}$, just that they are drawn independently from a Gaussian distribution with mean $0$ and variance $\sigma^2$. So for a given $f(x_i)$, we see $d_i$ with probability density $Normal(\epsilon_i)$.

Perhaps this is labeled data from a supervised learning scenario, or a regression problem. We wish to know which function, from the space of all possibilities, gives the best mapping. Bayes Theorem helps us understand how to think about a problem like this.
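As a concrete sketch of this setup, here is one way such data might be generated. The particular $f$, noise level, and sample size are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented ground truth and noise level, just for illustration.
def f(x):
    return 2.0 * x + 1.0     # the map f(x)

sigma = 0.5                  # eps_i ~ Normal(0, sigma^2)
x = np.linspace(0.0, 1.0, 50)             # the set {x_i}
y = f(x)                                  # the mapped values {y_i}
d = y + rng.normal(0.0, sigma, x.shape)   # the measurements {d_i}
```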

Bayes Theorem

\[ P(h \vert d) = \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \]

Bayes theorem relates the probability of hypothesis $h$ given data $d$ to the probability of data $d$ given hypothesis $h$. The hypotheses here are our candidate functions introduced above. The terms are sometimes described as $posterior = \left( {likelihood \over evidence} \right) {prior}$.  This is the math that tells you how to update your understanding from prior to posterior, given the likelihood of the data and the background evidence. Take a moment to appreciate what a bold statement that is. It tells you, essentially, how to think. Or at least how to learn.
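As a minimal numeric sketch of the update itself (the prior and likelihood values below are made up):

```python
# Bayes update over two made-up hypotheses.
prior = {"h1": 0.5, "h2": 0.5}         # P(h)
likelihood = {"h1": 0.8, "h2": 0.2}    # P(d | h) for an observed datum d

evidence = sum(likelihood[h] * prior[h] for h in prior)  # P(d)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)   # {'h1': 0.8, 'h2': 0.2}
```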

So back to our problem: which is the best function? We want to choose the $h$ whose posterior probability $P(h \vert d)$ is highest. This is called the maximum a-posteriori (MAP) hypothesis. We might write this as

\[ argmax \left\{ \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \right\} \]

The normalizing factor $P(d)$ does not depend on the choice of hypothesis, and the same can be said of the prior if the prior is uniform. With a uniform prior, then, maximum a-posteriori is the same as maximum likelihood (ML):

\[ argmax \; P(h \vert d)  = argmax \; P(d \vert h) \]

Bayes Learning

So let's find

\begin{align*}
 h_{ML} &= argmax \; P(d \vert h)  \\
 &=   argmax \prod_i P(d_i \vert h) \\
 &=   argmax \prod_i Normal(\epsilon_i) \\
 &=   argmax \prod_i C \, \exp \left(-\frac{\epsilon_i^2}{2 \sigma^2} \right)
\end{align*}

We can take the log of the expression being maximized, as the extremum is maintained. The log turns the product into $\sum_i \left( \log C - \epsilon_i^2 / 2\sigma^2 \right)$; dropping the constants and flipping the sign turns the maximization into a minimization:
\[ \bbox[5px,border:2px solid red] {
 h_{ML} =   argmin \sum_i \epsilon_i^2
} \tag{minimum squared error} \]

Whoever said that we should minimize squared error knew what he was talking about. Bayes agrees.
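To see this in action, here is a sketch that continues the invented setup from the preamble. Under Gaussian noise, the maximum-likelihood hypothesis is just the least-squares fit; restricting the hypothesis space to lines $h(x) = ax + b$ is itself an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same invented data as the preamble sketch: a linear truth plus noise.
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.shape)

# The least-squares fit is the maximum-likelihood hypothesis.
a, b = np.polyfit(x, d, deg=1)
print(a, b)                              # close to the true 2.0 and 1.0
print(np.sum((d - (a * x + b)) ** 2))    # the minimized sum of squared errors
```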

Let's back up a little and think about this another way. Consider the MAP hypothesis

\[ h_{MAP} = argmax \left\{ \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \right\} \]

Again, $P(d)$ does not distinguish hypotheses, so we can drop that term

\[ h_{MAP} = argmax \;  P(d \vert h)  P(h)   \]

The maximum is maintained through the log function

\[ h_{MAP} = argmax [  \log P(d \vert h) + \log P(h) ]  \]

\[ h_{MAP} = argmin [  -\log P(d \vert h) -\log P(h) ]  \tag{Occam's razor}\]

With this relation in view, consider the first term. We decided earlier that it was related to our measurement or calculation error (or both). The other term, $-\log P(h)$, can be read as a description (code) length, if we're thinking like C. Shannon. So we can paraphrase:

\[
\bbox[5px,border:2px solid red]
{
 h_{MAP} = argmin \left[ \text{error} + \text{description length} \right]
}
\]

Or, the best hypothesis is the simplest one that gets the job done.  Sounds like Occam's advice.
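One way to act on this is to score each candidate hypothesis by a data-fit term plus a complexity penalty. The sketch below, on the same invented data, uses polynomial degree as a crude stand-in for description length; the BIC-like penalty is one common heuristic, not the only choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same invented data as before.
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.shape)
n = len(x)

def score(deg):
    coeffs = np.polyfit(x, d, deg)
    rss = np.sum((d - np.polyval(coeffs, x)) ** 2)
    neg_log_like = n * np.log(rss / n)   # -2 log likelihood, up to constants
    penalty = (deg + 1) * np.log(n)      # "description length" of the model
    return neg_log_like + penalty

print(min(range(6), key=score))          # expect the simplest adequate degree: 1
```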

Friday, April 16, 2010

Bias and Variance

Consider $\langle (y-\theta)^2\rangle $, the expectation of the squared error between an estimator $y$, a random variable, and a target $\theta$ (deterministic, not random).

After a small manipulation, adding and subtracting $\langle y \rangle$ to get $ \langle (y-\langle y \rangle +\langle y\rangle-\theta)^2\rangle $, we can write

\[\langle(y-\langle y\rangle)^2+(\langle y\rangle-\theta)^2+2(y-\langle y\rangle)(\langle y\rangle-\theta)\rangle\]

Distributing the expectation, and noting that $\langle y\rangle-\theta$ is a constant, gives

\begin{equation*}
\langle (y-\langle y\rangle)^2\rangle+(\langle y\rangle -\theta)^2+2 \langle y-\langle y\rangle \rangle (\langle y\rangle-\theta)
\end{equation*}


But since $\langle y- \langle y\rangle \rangle$ is $0$, we can write

\begin{equation} \label{bv1}
\langle (y-\langle y\rangle)^2\rangle+ (\langle y\rangle-\theta)^2
\end{equation}

That is, the variance plus the bias squared. An immediate consequence is that if your estimator is unbiased, the mean squared error of your estimator is the same as your estimator's variance. That's a nice result.
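A quick Monte Carlo check of \eqref{bv1}, with an invented biased estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: fixed target theta, and a biased, noisy estimator y
# with mean 3.2 (bias 0.2) and standard deviation 0.5.
theta = 3.0
y = rng.normal(3.2, 0.5, 1_000_000)

mse = np.mean((y - theta) ** 2)
decomp = y.var() + (y.mean() - theta) ** 2   # variance + bias^2
print(mse, decomp)   # both close to 0.5**2 + 0.2**2 = 0.29
```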

Let's try to generalize a little. What if $\theta$ is a random variable? Say $\theta = \hat{f} + \epsilon$, where $\epsilon$ is a Gaussian random variable with mean zero and variance $\sigma^2$.

Beginning with \eqref{bv1}, we have

\[ \langle (y-\langle y\rangle)^2\rangle+\langle (\langle y\rangle -(\hat{f} + \epsilon))^2\rangle\]

When expanding the second term, the cross term with $\epsilon$ vanishes (its mean is zero), so we have

\[ \langle (y-\langle y\rangle)^2\rangle+(\langle y\rangle - \hat{f})^2 + \langle\epsilon^2\rangle \]

Or: the variance, plus the squared bias (now measured against $\hat{f}$), plus an unavoidable noise term $\langle\epsilon^2\rangle = \sigma^2$.
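The same check for the noisy-target version (numbers again invented) shows the extra $\sigma^2$ term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy target: theta = f_hat + eps, with eps ~ Normal(0, sigma^2).
f_hat, sigma = 3.0, 0.3
theta = f_hat + rng.normal(0.0, sigma, 1_000_000)
y = rng.normal(3.2, 0.5, 1_000_000)      # same invented biased estimator

mse = np.mean((y - theta) ** 2)
decomp = y.var() + (y.mean() - f_hat) ** 2 + sigma ** 2
print(mse, decomp)   # both close to 0.25 + 0.04 + 0.09 = 0.38
```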