Preamble
Consider a set $\{x_i\}$ and a map $f(x)$ of that set: $y_i = f(x_i)$. Further, consider a set of measurements $\{d_i\}$ of that mapping. The measurements may differ from the mapped values $\{y_i\}$ like so: $ d_i - \epsilon_i = y_i $. We don't know much about the errors $\{\epsilon_i\}$, only that they are drawn independently from a Gaussian distribution with mean $0$ and variance $\sigma^2$. So for a given $f(x_i)$, we see $d_i$ with probability $Normal(\epsilon_i)$. Perhaps this is labeled data from a supervised learning scenario, or a regression problem. We wish to know which function, from the space of all possibilities, gives the best mapping. Bayes Theorem helps us understand how to think about a problem like this.
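To make the setup concrete, here is a minimal sketch in Python. The particular choices $f(x) = \sin x$, $\sigma = 0.3$, and fifty sample points are illustrative assumptions, not anything fixed by the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(0.0, 2.0 * np.pi, 50)     # the set {x_i}
y = np.sin(x)                             # y_i = f(x_i), with f assumed for illustration
eps = rng.normal(0.0, 0.3, size=x.shape)  # eps_i ~ Normal(0, sigma^2), sigma = 0.3 assumed
d = y + eps                               # measurements d_i = y_i + eps_i
```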
Bayes Theorem
\[ P(h \vert d) = \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \]Bayes theorem relates the probability of hypothesis $h$ given data $d$ to the probability of data $d$ given hypothesis $h$. The hypotheses are our candidate functions introduced above. These terms are sometimes described as $posterior = \left( {likelihood \over evidence} \right) {prior}$. This is the math that tells you how to update your understanding from prior to posterior, given the likelihood of new information against the background evidence. Take a moment to appreciate what a bold statement that is. This tells you, essentially, how to think. Or at least how to learn.
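As a toy illustration of the update, here is a sketch with three discrete hypotheses; the prior and likelihood numbers are invented purely for illustration.

```python
import numpy as np

prior = np.array([1/3, 1/3, 1/3])       # P(h): uniform over three hypotheses
likelihood = np.array([0.8, 0.3, 0.1])  # P(d | h) for one observed datum d (assumed values)

evidence = np.sum(likelihood * prior)      # P(d) = sum over h of P(d|h) P(h)
posterior = likelihood * prior / evidence  # Bayes theorem

print(posterior)  # [0.667 0.25  0.083] -- belief shifts toward the first hypothesis
```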
So back to our problem: which is the best function? We want to choose the $h$ for which the posterior $P(h \vert d)$ is highest. This is called maximum a posteriori (MAP). We might write this as
\[ h_{MAP} = argmax \left\{ \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \right\} \]
The normalizing factor $P(d)$ does not depend on the choice of hypothesis, and the same can be said of the prior $P(h)$ if it is uniform. With both of these assumptions in place, maximum a posteriori is the same as maximum likelihood (ML):
\[ argmax \; P(h \vert d) = argmax \; P(d \vert h) \]
Bayes Learning
So let's find the maximum-likelihood hypothesis\begin{align*}
h_{ML} &= argmax \; P(d \vert h) \\
&= argmax \prod_i P(d_i \vert h) \\
&= argmax \prod_i Normal(\epsilon_i) \\
&= argmax \prod_i C \, \exp \left(-\frac{\epsilon_i^2}{2 \sigma^2} \right)
\end{align*}
We can take the log of the argmax expression, since the log is monotonic and the extremum is maintained. The product becomes a sum, the constants $C$ and $2\sigma^2$ drop out because they don't depend on $h$, and flipping the sign turns the argmax into an argmin:
\[ \bbox[5px,border:2px solid red] {
h_{ML} = argmin \sum_i \epsilon_i^2
} \tag{minimum squared error} \]
Whoever said that we should minimize squared error knew what he was talking about. Bayes agrees.
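Here is a small numerical check of that claim, reusing the synthetic data from the first sketch. The one-parameter hypothesis space $h_a(x) = a \sin x$ is an assumption made just so we can scan candidates on a grid; the $a$ that maximizes the Gaussian log-likelihood is exactly the $a$ that minimizes the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 50)
d = np.sin(x) + rng.normal(0.0, 0.3, size=x.shape)

sigma = 0.3
a_grid = np.linspace(0.0, 2.0, 401)                            # candidate hypotheses h_a
residuals = d[None, :] - a_grid[:, None] * np.sin(x)[None, :]  # eps_i for each candidate

log_lik = -0.5 * np.sum(residuals**2, axis=1) / sigma**2  # log-likelihood, up to a constant
sse = np.sum(residuals**2, axis=1)                        # sum_i eps_i^2

assert a_grid[np.argmax(log_lik)] == a_grid[np.argmin(sse)]  # same winner either way
```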
Let's back up a little and think about this another way. Consider the MAP hypothesis
\[ h_{MAP} = argmax \left\{ \left( \frac{P(d \vert h)}{P(d)} \right) P(h) \right\} \]
Again, $P(d)$ does not distinguish hypotheses, so we can drop that term
\[ h_{MAP} = argmax \; P(d \vert h) P(h) \]
The maximum is maintained through the log function
\[ h_{MAP} = argmax [ \log P(d \vert h) + \log P(h) ] \]
\[ h_{MAP} = argmin [ -\log P(d \vert h) -\log P(h) ] \tag{Occam's razor}\]
With this relation in view, consider the first term. We decided earlier that $-\log P(d \vert h)$ was related to our measurement or calculation error (or both). The other term, $-\log P(h)$, can be considered a description (code) length, if we're thinking like C. Shannon: an outcome with probability $P$ is optimally encoded in $-\log P$ bits, so improbable hypotheses cost more bits to describe. So we can paraphrase:
\[
\bbox[5px,border:2px solid red]
{
h_{MAP} = argmin \left[ \text{error} + \text{description length} \right]
}
\]
Or, the best hypothesis is the simplest one that gets the job done. Sounds like Occam's advice.
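To see the error-plus-description-length tradeoff in code, here is a sketch of MAP estimation over a polynomial hypothesis space: a zero-mean Gaussian prior on the coefficients turns $-\log P(h)$ into an $L_2$ penalty on the weights, which is ridge regression. The polynomial degree, noise width $\sigma$, and prior width $\tau$ are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 50)
d = np.sin(x) + rng.normal(0.0, 0.3, size=x.shape)

degree, sigma, tau = 9, 0.3, 1.0
X = np.vander(x, degree + 1)  # polynomial hypothesis space (assumed)

# -log P(d|h) ~ ||d - Xw||^2 / (2 sigma^2)   (the error term)
# -log P(h)   ~ ||w||^2 / (2 tau^2)          (the description-length term)
# Minimizing their sum gives the ridge normal equations:
lam = sigma**2 / tau**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ d)
```

Shrinking $\tau$ expresses a stronger preference for simple (small-coefficient) hypotheses; letting $\tau \to \infty$ removes the penalty and recovers the plain maximum-likelihood least-squares fit.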