Short Guide Through Appendix A.1. We introduce a novel kind of generalization error that can be split into an overfitting error and an underfitting error. To find hypotheses causing low generalization error, we first select a subset of hypotheses causing low underfitting error. We are interested in those of its elements causing low overfitting error.
More Detailed Guide Through Appendix A.1.
Definitions.
Let $A$ be the set of all possible input/output pairs (pairs of vectors). Let $NET$ be the set of functions that can be implemented by the network. For every net function $g \in NET$ we have $g \subset A$ (a function is viewed as a set of input/output pairs). Elements of $NET$ are parameterized with a parameter vector $w$ from the set of possible parameters $W$. $net(\cdot)$ is a function which maps a parameter vector $w$ onto a net function $net(w)$ ($net(\cdot)$ is surjective). Let $T$ be the set of target functions $f$, where $T \subseteq NET$. Let $H$ be the set of hypothesis functions $h$, where $H \subseteq NET$. For simplicity, take all sets to be finite, and let all functions map each input $x$ to some output $y$. Values of functions with argument $x$ are denoted by $f(x)$, $h(x)$, and $net(w)(x)$; we have $(x, g(x)) \in g$ for every such function $g$.
Let $D$ be the data, where $D \subset A$. $D$ is divided into a training set $D_0$ and a test set $D \setminus D_0$. For the moment, we are not interested in how $D$ was obtained. We use the squared error $E(h, D) := \sum_{(x, y) \in D} \| y - h(x) \|^2$, where $\| \cdot \|$ is the Euclidean norm. The training set error is $E(h, D_0)$ and the test set error is $E(h, D \setminus D_0)$; $E(h, D) = E(h, D_0) + E(h, D \setminus D_0)$ holds.
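As a small illustration of these definitions, here is a Python sketch of the squared error and of the training/test split identity (function names and data layout are my own, chosen only for illustration):

```python
import numpy as np

def E(h, data):
    """Squared error E(h, D) = sum over (x, y) in D of ||y - h(x)||^2."""
    return sum(np.sum((np.asarray(y) - np.asarray(h(x))) ** 2) for x, y in data)

# D is divided into a training set D0 and a test set D \ D0,
# so E(h, D) = E(h, D0) + E(h, D \ D0) holds by definition of the sum.
def check_split(h, D0, D_test):
    return np.isclose(E(h, D0 + D_test), E(h, D0) + E(h, D_test))
```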
Learning.
We use a variant of the Gibbs formalism (see Opper & Haussler, 1991, or Levin et al., 1990). Consider a stochastic learning algorithm (random weight initialization, random learning rate). The learning algorithm attempts to reduce training set error by randomly selecting a hypothesis with low $E(h, D_0)$, according to some conditional distribution $p_{D_0}(\cdot)$ over $H$. $p_{D_0}(\cdot)$ is chosen in advance, but in contrast to traditional Gibbs (which deals with unconditional distributions on $H$), we may take a look at the training set before selecting $p_{D_0}(\cdot)$. For instance, one training set may suggest linear functions as being more probable than others, another one splines, etc. This unconventional Gibbs variant is appropriate because FMS uses only $X_0$ (the set of first components of $D_0$'s elements, see section 3) to compute the flatness of $net(w)$. The trade-off between the desire for low $E(h, D_0)$ and the a priori belief in a hypothesis $h$ according to $p_{D_0}(h)$ is governed by a positive constant $\beta$ (interpretable as the inverse temperature from statistical mechanics, or the amount of stochasticity in the training algorithm). We obtain $p(\cdot \mid D_0)$, the distribution of hypotheses selected by the learning algorithm applied to data $D_0$:

$$p(h \mid D_0) = \frac{p_{D_0}(h)\, \exp(-\beta\, E(h, D_0))}{Z(D_0, \beta)}, \qquad (6)$$

$$Z(D_0, \beta) = \sum_{h \in H} p_{D_0}(h)\, \exp(-\beta\, E(h, D_0)). \qquad (7)$$
For theoretical purposes, assume we know $D$ and may use it for learning. To learn, we use the same distribution $p_{D_0}(\cdot)$ as above (prior belief in some hypotheses $h$ is based exclusively on the training set). There is a reason why we do not use a prior belief conditioned on all of $D$ instead: such a prior would not allow for making a distinction between a better prior belief in hypotheses and a better approximation of the test set data. However, we are interested in how learning based on $p_{D_0}(\cdot)$ performs on the test set data $D \setminus D_0$. We obtain

$$p(h \mid D) = \frac{p_{D_0}(h)\, \exp(-\beta\, E(h, D))}{Z(D, \beta)}, \qquad (8)$$

$$Z(D, \beta) = \sum_{h \in H} p_{D_0}(h)\, \exp(-\beta\, E(h, D)). \qquad (9)$$
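For a finite hypothesis set, the two Gibbs posteriors above can be computed directly. The following Python sketch illustrates equations (6) through (9) under the notation used here; the toy hypotheses, prior, and data are invented for illustration only.

```python
import numpy as np

def squared_error(h, data):
    """E(h, D): summed squared error of hypothesis h on a list of (x, y) pairs."""
    return sum(np.sum((np.asarray(y) - np.asarray(h(x))) ** 2) for x, y in data)

def gibbs_posterior(hypotheses, prior, data, beta):
    """Equations (6)-(9): p(h | data) proportional to prior(h) * exp(-beta * E(h, data))."""
    errors = np.array([squared_error(h, data) for h in hypotheses])
    unnormalized = np.asarray(prior) * np.exp(-beta * errors)
    return unnormalized / unnormalized.sum()   # division by Z(data, beta)

# Toy example: hypotheses are simple scalar functions, prior belief p_{D_0}(h) is uniform.
hypotheses = [lambda x: 0.0, lambda x: x, lambda x: 2 * x]
prior = np.ones(len(hypotheses)) / len(hypotheses)

D0 = [(1.0, 1.1), (2.0, 1.9)]            # training set
D_test = [(3.0, 3.2)]                    # test set D \ D0
D = D0 + D_test                          # all data

beta = 2.0
p_given_D0 = gibbs_posterior(hypotheses, prior, D0, beta)   # eq. (6)/(7)
p_given_D = gibbs_posterior(hypotheses, prior, D, beta)     # eq. (8)/(9)
print(p_given_D0, p_given_D)
```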
Expected extended generalization error. We define the expected extended generalization error on the unseen test exemplars $D \setminus D_0$:

$$E_G(D, D_0) := \sum_{h \in H} p(h \mid D_0)\, E(h, D \setminus D_0) \;-\; \sum_{h \in H} p(h \mid D)\, E(h, D \setminus D_0). \qquad (10)$$

The first term is the expected test set error after learning on the training set $D_0$ alone; the subtracted term (the ``extension'') is the expected test set error after learning on all data $D$.
Overfitting and underfitting error. Let us separate the generalization error into an overfitting error $E_o$ and an underfitting error $E_u$ (in analogy to Wang et al., 1994, and Guyon et al., 1992). We will see that overfitting and underfitting error correspond to the two different error terms in our algorithm: decreasing one term is equivalent to decreasing the training error $E(net(w), D_0)$, decreasing the other is equivalent to decreasing the flatness term $B(w, X_0)$.
Using the Kullback-Leibler distance (Kullback, 1959), we measure the information conveyed by $p(\cdot \mid D_0)$, but not by $p(\cdot \mid D)$ (see figure 4). We may view this as information about the training set $D_0$: since there are more hypotheses compatible with $D_0$ than there are hypotheses compatible with $D$, $D_0$'s influence on $p(\cdot \mid D_0)$ is stronger than its influence on $p(\cdot \mid D)$. To get the non-stochastic bias (see the definition of $\beta$ above), we divide this information by $\beta$ and obtain the overfitting error:

$$E_o(D, D_0) := \frac{1}{\beta} \sum_{h \in H} p(h \mid D_0)\, \log \frac{p(h \mid D_0)}{p(h \mid D)}. \qquad (11)$$

[Figure 4: o.ps -- overfitting error.]

[Figure 5: u.ps -- underfitting error.]

Analogously, we measure the information conveyed by $p(\cdot \mid D)$, but not by $p(\cdot \mid D_0)$ (see figure 5). This information is about the test set $D \setminus D_0$. To get the non-stochastic bias (see the definition of $\beta$ above), we divide this information by $\beta$ and obtain the underfitting error:

$$E_u(D, D_0) := \frac{1}{\beta} \sum_{h \in H} p(h \mid D)\, \log \frac{p(h \mid D)}{p(h \mid D_0)}. \qquad (12)$$
Peaks in $p(\cdot \mid D_0)$ which do not match peaks of $p(\cdot \mid D)$ produced by $D$ lead to overfitting error. Peaks of $p(\cdot \mid D)$ produced by $D$ which do not match peaks of $p(\cdot \mid D_0)$ lead to underfitting error. Overfitting and underfitting error tell us something about the shape of $p(\cdot \mid D_0)$ with respect to $p(\cdot \mid D)$, i.e., to what degree the prior belief in hypotheses is compatible with the data.
Why are they called ``overfitting'' and ``underfitting'' error? Positive contributions to the overfitting error are obtained where peaks of $p(\cdot \mid D_0)$ do not match (or are higher than) peaks of $p(\cdot \mid D)$: there, some $h$ will have large probability after training on $D_0$ but lower probability after training on all data $D$. This is either because $D_0$ has been approximated too closely, or because of sharp peaks in $p_{D_0}(\cdot)$ -- the learning algorithm specializes either on $D_0$ or on the prior belief (``overfitting''). The specialization on $D_0$ will become even worse if $D_0$ is corrupted by noise -- the case of noisy $D_0$ will be treated later. Positive contributions to the underfitting error are obtained where peaks of $p(\cdot \mid D)$ do not match (or are higher than) peaks of $p(\cdot \mid D_0)$: there, some $h$ will have large probability after training on all data $D$, but lower probability after training on $D_0$. This is either due to a poor approximation of $D_0$ (note that $p(\cdot \mid D_0)$ is almost fully determined by $D_0$), or to insufficient information about $D \setminus D_0$ conveyed by $D_0$ (``underfitting''). Either the algorithm did not learn ``enough'' of $D_0$, or $D_0$ does not tell us anything about $D \setminus D_0$. In the latter case, there is nothing we can do -- we have to focus on the case where we did not learn enough about $D_0$.
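As a concrete illustration of equations (10) through (12), here is a minimal Python sketch (finite hypothesis set, made-up prior belief and error values; all helper names are mine): it computes the two Gibbs posteriors, the overfitting and underfitting errors as scaled Kullback-Leibler distances, and checks that their sum equals the extended generalization error. Sweeping $\beta$ in this sketch exhibits the trade-off discussed below.

```python
import numpy as np

def gibbs(prior, errors, beta):
    """Gibbs posterior (eqs. 6-9): proportional to prior * exp(-beta * error)."""
    unnorm = prior * np.exp(-beta * errors)
    return unnorm / unnorm.sum()

# Toy setup (illustrative numbers only): per-hypothesis training and test errors.
prior = np.array([0.25, 0.25, 0.25, 0.25])   # p_{D_0}(h)
E_train = np.array([0.1, 0.2, 1.0, 2.0])     # E(h, D_0)
E_test = np.array([1.5, 0.3, 0.4, 2.5])      # E(h, D \ D_0)
E_all = E_train + E_test                     # E(h, D)

beta = 3.0
p_D0 = gibbs(prior, E_train, beta)           # p(h | D_0)
p_D = gibbs(prior, E_all, beta)              # p(h | D)

E_o = np.sum(p_D0 * np.log(p_D0 / p_D)) / beta      # eq. (11), overfitting error
E_u = np.sum(p_D * np.log(p_D / p_D0)) / beta       # eq. (12), underfitting error
E_G = np.sum(p_D0 * E_test) - np.sum(p_D * E_test)  # eq. (10)

print(E_o, E_u, E_G, np.isclose(E_G, E_o + E_u))    # E_G = E_o + E_u
```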
Analysis of overfitting and underfitting error.
$E_G(D, D_0) = E_o(D, D_0) + E_u(D, D_0)$ holds. Note: for the zero temperature limit $\beta \to \infty$ we obtain

$$E_o(D, D_0) \to \sum_{h \in H} p(h \mid D_0)\, E(h, D \setminus D_0) \;-\; \Big(\min_{h \in H} E(h, D) - \min_{h \in H} E(h, D_0)\Big)$$

and

$$E_u(D, D_0) \to \Big(\min_{h \in H} E(h, D) - \min_{h \in H} E(h, D_0)\Big) \;-\; \sum_{h \in H} p(h \mid D)\, E(h, D \setminus D_0).$$

If, in this limit, every hypothesis selected after learning on $D$ also minimizes $E(\cdot, D_0)$, then $E_u(D, D_0) = 0$, i.e., there is no underfitting error. For $\beta \to 0$ (full stochasticity) we get $E_o(D, D_0) = 0$ and $E_u(D, D_0) = 0$ (recall that $E_G$ is not the conventional but the extended expected generalization error). Since the Kullback-Leibler distance is non-negative, $E_o(D, D_0) \ge 0$ and $E_u(D, D_0) \ge 0$ hold.
In what follows, averages after learning on $D_0$ are denoted by $\langle \cdot \rangle_{D_0}$, and averages after learning on $D$ are denoted by $\langle \cdot \rangle_{D}$. Since $\log \frac{p(h \mid D_0)}{p(h \mid D)} = \beta\, E(h, D \setminus D_0) + \log \frac{Z(D, \beta)}{Z(D_0, \beta)}$, we have $E_o(D, D_0) = \langle E(h, D \setminus D_0) \rangle_{D_0} + \frac{1}{\beta} \log \frac{Z(D, \beta)}{Z(D_0, \beta)}$. Analogously, we have $E_u(D, D_0) = -\langle E(h, D \setminus D_0) \rangle_{D} - \frac{1}{\beta} \log \frac{Z(D, \beta)}{Z(D_0, \beta)}$. Thus, with large $\beta$, after learning on $D_0$, $E_o$ measures the difference between the average test set error and a minimal test set error. With large $\beta$, after learning on $D$, $E_u$ measures the difference between the average test set error and a maximal test set error.
So assume we do have a large $\beta$ (large enough to exceed the value at which $E_G(D, D_0)$, viewed as a function of $\beta$, takes its minimum). We have to assume that $D_0$ indeed conveys information about the test set: preferring hypotheses $h$ with small $E(h, D_0)$ by using a larger $\beta$ leads to a smaller test set error (without this assumption, no error-decreasing algorithm would make sense). $E_u$ can be decreased by enforcing less stochasticity (by further increasing $\beta$), but this will increase $E_o$. Likewise, decreasing $\beta$ (enforcing more stochasticity) will decrease $E_o$ but increase $E_u$. Increasing $\beta$ decreases the maximal test set error after learning more than it decreases the average test set error, thus decreasing $E_u$, and vice versa. Decreasing $\beta$ increases the minimal test set error after learning more than it increases the average test set error, thus decreasing $E_o$, and vice versa. This is the above-mentioned trade-off between stochasticity and fitting the training set, governed by $\beta$.
Tolerable error level / Set of acceptable minima.
Let us implicitly define a tolerable error level $E_{tol}$ which, with confidence $1 - \delta$, is an upper bound of the training set error after learning:

$$\sum_{h \in M(D_0)} p(h \mid D_0) \ge 1 - \delta, \quad \text{where } M(D_0) := \{ h \in H : E(h, D_0) \le E_{tol} \} \qquad (13)$$

is the set of acceptable minima. We would like to have an algorithm decreasing (1) the training set error (this corresponds to decreasing the underfitting error), and (2) an additional error term, which should be designed to ensure a low overfitting error, given a fixed small $E_{tol}$. The remainder of this section will lead to an answer to the question: how should this additional error term be designed? Since low underfitting is obtained by selecting a hypothesis from $M(D_0)$, in what follows we will focus on $M(D_0)$ only. Using an appropriate choice of prior belief, at the end of this section we will finally see that the overfitting error can be reduced by an error term expressing a preference for flat nets.
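Under this reading of equation (13), $E_{tol}$ is the smallest error level that the training set error stays below with probability $1 - \delta$ after learning, i.e., the $(1-\delta)$-quantile of $E(h, D_0)$ under $p(h \mid D_0)$. A minimal Python sketch (toy numbers and helper names of my own choosing):

```python
import numpy as np

def gibbs(prior, errors, beta):
    unnorm = prior * np.exp(-beta * errors)
    return unnorm / unnorm.sum()

def tolerable_error_level(prior, E_train, beta, delta):
    """Smallest E_tol such that p(E(h, D_0) <= E_tol | D_0) >= 1 - delta,
    i.e., the (1 - delta)-quantile of the training error under p(h | D_0)."""
    p_D0 = gibbs(prior, E_train, beta)
    order = np.argsort(E_train)                   # sort hypotheses by training error
    cumulative = np.cumsum(p_D0[order])
    k = np.searchsorted(cumulative, 1.0 - delta)  # first index reaching the confidence
    E_tol = E_train[order][k]
    M_D0 = np.flatnonzero(E_train <= E_tol)       # set of acceptable minima
    return E_tol, M_D0

prior = np.array([0.4, 0.3, 0.2, 0.1])            # p_{D_0}(h), illustrative
E_train = np.array([0.05, 0.3, 0.8, 2.0])         # E(h, D_0), illustrative
print(tolerable_error_level(prior, E_train, beta=2.0, delta=0.05))
```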
Relative overfitting error.
Let us formally define the relative overfitting error $E_{ro}(D, D_0, h)$, which is the relative contribution of some $h$ to the mean overfitting error of the hypothesis set $M(D_0)$:

$$E_{ro}(D, D_0, h) := \frac{p(h \mid D_0)\, \log \frac{p(h \mid D_0)}{p(h \mid D)}}{\sum_{h' \in M(D_0)} p(h' \mid D_0)\, \log \frac{p(h' \mid D_0)}{p(h' \mid D)}} \qquad (14)$$

$$= \frac{p(h \mid D_0) \left( \beta\, E(h, D \setminus D_0) + \log \frac{Z(D, \beta)}{Z(D_0, \beta)} \right)}{\sum_{h' \in M(D_0)} p(h' \mid D_0) \left( \beta\, E(h', D \setminus D_0) + \log \frac{Z(D, \beta)}{Z(D_0, \beta)} \right)}. \qquad (15)$$

For $h \in M(D_0)$, we approximate $E_{ro}$ as follows. We assume that $p(h \mid D_0)$ is large where the prior belief $p_{D_0}(h)$ is large (trade-off between low $E(h, D_0)$ and high prior belief). Then $p(h \mid D_0)\, \log \frac{p(h \mid D_0)}{p(h \mid D)}$ has large values (due to large $p(h \mid D_0)$) where $E(h, D \setminus D_0)$ is large (assuming $E(h, D_0)$ is small, as it is for all acceptable minima). We get

$$\log \frac{p(h \mid D_0)}{p(h \mid D)} \approx \beta\, E(h, D \setminus D_0). \qquad (16)$$

The relative overfitting error can now be approximated by

$$E_{ro}(D, D_0, h) \approx \frac{p(h \mid D_0)\, E(h, D \setminus D_0)}{\sum_{h' \in M(D_0)} p(h' \mid D_0)\, E(h', D \setminus D_0)}. \qquad (17)$$
Prior belief in $f$ and $D$. Assume $D$ was obtained from a target function $f$. Let $p(f)$ be the prior on targets and $p(D \mid f)$ the probability of obtaining $D$ with a given $f$. We have, by Bayes' rule,

$$p(f \mid D_0) = \frac{p(f)\, p(D_0 \mid f)}{\sum_{\bar f \in T} p(\bar f)\, p(D_0 \mid \bar f)}. \qquad (18)$$

The data is drawn from a target function with added noise (the noise-free case is treated below). We do not make any assumptions about the nature of the noise -- it does not have to be Gaussian (as it does, e.g., in MacKay's work, 1992b).
We want to select a $p_{D_0}(\cdot)$ which makes $E_{ro}$ small, i.e., those $h$ with small expected test set error should have high probabilities $p_{D_0}(h)$. We do not know $D \setminus D_0$ during learning. The test set $D \setminus D_0$ is assumed to be drawn from a target $f$. We compute the expectation of $E_{ro}$, given $D_0$. The probability of the test set $D \setminus D_0$, given $f$, is

$$p(D \setminus D_0 \mid f) = \prod_{(x, y) \in D \setminus D_0} p\big(y \mid f(x)\big), \qquad (19)$$

where $p(y \mid f(x))$ is determined by the (unspecified) noise model. The expected test set error of a hypothesis $h$, given $D_0$, is then

$$\big\langle E(h, D \setminus D_0) \big\rangle = \sum_{f \in T} p(f \mid D_0) \sum_{D \setminus D_0} p(D \setminus D_0 \mid f)\, E(h, D \setminus D_0). \qquad (20)$$

The expected relative overfitting error is obtained by inserting equation (20) into equation (17):

$$\big\langle E_{ro}(D, D_0, h) \big\rangle \approx \frac{p(h \mid D_0)\, \langle E(h, D \setminus D_0) \rangle}{\sum_{h' \in M(D_0)} p(h' \mid D_0)\, \langle E(h', D \setminus D_0) \rangle}. \qquad (21)$$
Minimizing expected relative overfitting error.
We define a $p_{D_0}(\cdot)$ such that it has its largest value near hypotheses with small expected test set error (see (17) and (20)). This definition leads to a low expectation of $E_{ro}$ (see equation (21)). Define

$$h^{*}(D_0) := \arg\min_{h \in M(D_0)} \sum_{f \in T} p(f \mid D_0) \sum_{D \setminus D_0} p(D \setminus D_0 \mid f)\, E(h, D \setminus D_0). \qquad (22)$$

Using equation (20) we get

$$h^{*}(D_0) = \arg\min_{h \in M(D_0)} \big\langle E(h, D \setminus D_0) \big\rangle. \qquad (23)$$

$h^{*}(D_0)$ determines the hypothesis from $M(D_0)$ that leads to the lowest expected test set error. Consequently, we achieve the lowest expected relative overfitting error. $h^{*}(D_0)$ helps us to define $p_{D_0}(\cdot)$: we let $p_{D_0}(\cdot)$ peak at (hypotheses close to) $h^{*}(D_0)$.
To appreciate the importance of the prior $p(f)$ in the definition of $p_{D_0}(\cdot)$ (see also equation (29)), in what follows we will focus on the noise-free case.
The special case of noise-free data.
Let $p(D_0 \mid f)$ be equal (up to a normalizing constant) to 1 if $D_0$ is consistent with $f$, and to 0 otherwise:

$$p(D_0 \mid f) \propto \begin{cases} 1 & \text{if } D_0 \subset f \\ 0 & \text{otherwise.} \end{cases} \qquad (27)$$

From (27) and (17), we obtain the expected $E_{ro}$: the expectation now runs only over those targets $f$ that are consistent with $D_0$, weighted by their prior probabilities $p(f)$. In this noise-free case, the hypothesis with the lowest expected test set error is therefore the one that best approximates the targets consistent with $D_0$ that have large prior $p(f)$ (see equation (27)).
Noisy data and noise-free data: conclusion.
For both the noise-free and the noisy case, equation (18) shows that, given $D_0$ and $h$, the expected test set error depends on the prior target probability $p(f)$.
Choice of prior belief.
Now we select some $p(f)$, our prior belief in target $f$. We introduce a formalism similar to Wolpert's (Wolpert, 1994a). $p(f)$ is defined as the probability of obtaining $f = net(w)$ by choosing a $w$ randomly according to a given distribution $p(w)$ over $W$.
Let us first have a look at Wolpert's formalism. By restricting $W$ to a subset $W_1 \subset W$, he obtains an injective function, namely $net(\cdot)$ restricted to $W_1$. This restricted function is still surjective (because $net(\cdot)$ is surjective):

$$p(f) := p(w) \quad \text{for the single } w \in W_1 \text{ with } net(w) = f. \qquad (30)$$

However, we prefer to follow another path. Our algorithm (flat minimum search) tends to prune a weight $w_i$ if $net(w)$ is very flat in $w_i$'s direction. It prefers regions in weight space where many weight vectors lead to the same net function. Unlike Wolpert's approach, ours distinguishes the probabilities of targets $f$ whose sets of representing weight vectors $\{w : net(w) = f\}$ have different sizes. The advantage is: we do not only search for $net(w)$ which are flat in one direction but for $net(w)$ which are flat in many directions (this corresponds to a higher probability of the corresponding targets).
Define

$$w \sim \bar w \;:\Longleftrightarrow\; net(w) = net(\bar w), \qquad (31)$$

$$[w] := \{ \bar w \in W : \bar w \sim w \}, \qquad (32)$$

$$p(f) := \sum_{w:\, net(w) = f} p(w). \qquad (33)$$

$net(\cdot)$ partitions $W$ into equivalence classes. To obtain $p(f)$, we compute the probability of $w$ being in the equivalence class $[w_f]$ (where $net(w_f) = f$), if $w$ is randomly chosen according to $p(w)$. An equivalence class corresponds to a net function, i.e., $net(\cdot)$ maps all $w$ of an equivalence class to the same net function.
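A minimal sketch of this construction for a finite weight set (the toy weight grid and the toy $net(\cdot)$ map below are my own, purely to illustrate equations (31) through (33)): the prior of a target $f$ is the total probability mass of the equivalence class of weight vectors representing it, so net functions reachable from many weight vectors -- for example, flat ones -- receive a higher prior.

```python
from collections import defaultdict

# Toy finite weight set W with a probability p(w) for each w (uniform here),
# and a toy net(.) that maps weight vectors onto net functions (here: labels).
W = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]
p_w = {w: 1.0 / len(W) for w in W}

def net(w):
    # Illustrative only: several weight vectors map onto the same net function.
    return "g_flat" if w[0] + w[1] <= 1 else "g_sharp"

# Equivalence classes [w] = {w' : net(w') = net(w)} partition W (eqs. 31-32).
classes = defaultdict(list)
for w in W:
    classes[net(w)].append(w)

# p(f) = sum of p(w) over the class of weight vectors representing f (eq. 33).
p_f = {f: sum(p_w[w] for w in ws) for f, ws in classes.items()}
print(classes)
print(p_f)   # the function with the larger equivalence class gets the larger prior
```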
Relation to FMS algorithm.
FMS (from section 3) works locally in weight space $W$. Let $w_0$ be the actual weight vector found by FMS (with $h = net(w_0)$). Recall the definition of $h^{*}(D_0)$ (see (22) and (23)): we want to find a hypothesis $h$ which best approximates those $f$ with large $p(f)\, p(D_0 \mid f)$ (the test data has a high probability of being drawn from such targets). We will see that those $f = net(w)$ with locally flat $net(w)$ have high probability $p(f)$. Furthermore, we will see that a $w_0$ close to a $w$ with flat $net(w)$ has flat $net(w_0)$, too. To approximate such targets $f$, the only thing we can do is find a $w_0$ close to many $w$ with $net(w) \in T$ and large $p(net(w))$. To justify this approximation (see the definition of $p(D_0 \mid f)$, while recalling that $D_0$ is drawn from a target with added noise), we assume (1) that the noise has mean $0$, and (2) that small noise is more likely than large noise (e.g., Gaussian, Laplace, Cauchy distributions).
To restrict $net(\cdot)$ to a local range in $W$, we define regions of equal net functions: for a given $w$, the region is the equivalence class $[w] = \{\bar w \in W : net(\bar w) = net(w)\}$. Note: $p(net(w)) = p([w])$. If $net(w)$ is flat along long distances in many directions of weight space, then $[w]$ has many elements. Locally in weight space, at $w_0$ with $h = net(w_0)$, we define for each $w$: if the minimum distance from $w$ at which the net function changes exists, then the local region around $w$ extends up to a constant $c$ times this distance; if this minimum does not exist, the local region is unbounded. This local region locally approximates $[w]$.
During the search for a $w_0$ (corresponding to a hypothesis $h = net(w_0)$), to locally decrease the expected test set error (see equation (20)), we want to enter areas of weight space where many $w$ with large $p(net(w))$ are near $w_0$. We wish to decrease the test set error, which is caused by drawing data from highly probable targets $f$ (those with large $p(f)$). We do not know, however, which $w$ are mapped to targets $f$ by $net(\cdot)$. Therefore, we focus on all $w$ near $w_0$ in weight space, instead of only on those $w$ with $net(w) \in T$.
Assume $\|w - w_0\|$ is small enough to allow for a Taylor expansion, and that $net(w_0)$ is flat in direction $v$: for every input $x$,

$$\nabla_w net(w)(x)\big|_{w} \cdot v \;\approx\; \nabla_w net(w)(x)\big|_{w_0} \cdot v \;+\; (w - w_0)^{\top} H(x)\, v \;=\; (w - w_0)^{\top} H(x)\, v,$$

where $H(x)$ is the Hessian of $net(\cdot)(x)$ evaluated at $w_0$, the flatness at $w_0$ means $\nabla_w net(w)(x)|_{w_0} \cdot v = 0$, and the omitted terms involve higher order derivatives (analogously for higher order flatness). We see: in a small environment of $w_0$, there is flatness in direction $v$, too. Likewise, if $net(w_0)$ is not flat in any direction, this property also holds within a small environment of $w_0$.
Only near $w$ with flat $net(w)$ can there exist $\bar w$ with large $p(net(\bar w))$. Therefore, it is reasonable to search for a $w_0$ with $h = net(w_0)$, where $net(w_0)$ is flat within a large region. This means to search for the $h^{*}(D_0)$ determined by the minimization in equation (22). Since every hypothesis is a net function ($H \subseteq NET$), this amounts to the following: we search for a $w_0$ living within a large connected region where, for all $w$ within this region, the outputs of $net(w)$ (and hence the training error) stay within the tolerance $\epsilon$ defined in section 2. To conclude: we decrease the relative overfitting error and the underfitting error by searching for a flat minimum (see the definition of flat minima in section 2).
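The flatness argument can be checked numerically. The sketch below is not the FMS algorithm of section 3; it is a toy finite-difference test (the toy net function, tolerance values, and helper names are my own) that flags weight-space directions in which the net function's outputs barely change around $w_0$:

```python
import numpy as np

def net_out(w, x):
    """Toy net function: only w[0] matters, so the output is flat in w[1]'s direction."""
    return np.tanh(w[0] * x)

def flat_directions(w0, xs, eps=1e-3, delta=1e-4):
    """Finite-difference directional derivatives of the outputs w.r.t. the weights at w0;
    a direction counts as (locally) flat if the outputs barely change along it."""
    w0 = np.asarray(w0, dtype=float)
    flat = []
    for i in range(len(w0)):
        dw = np.zeros_like(w0)
        dw[i] = delta
        change = max(abs(net_out(w0 + dw, x) - net_out(w0, x)) for x in xs)
        flat.append(change / delta < eps)
    return flat

w0 = np.array([0.8, 5.0])
xs = [-1.0, 0.0, 1.0, 2.0]
print(flat_directions(w0, xs))   # expect [False, True]: flat along w[1], not along w[0]
```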
Practical realization of the Gibbs variant.
(1) Select $E_{tol}$ and the confidence parameter $\delta$, thus implicitly choosing $\beta$.
(2) Compute the set $M(D_0)$.
(3) Assume we know how the data is obtained from the target $f$, i.e., we know $p(D_0 \mid f)$ and $p(D \setminus D_0 \mid f)$, and assume we know the prior $p(f)$. Then we can compute the expected test set error of each $h \in M(D_0)$ and the prior belief $p_{D_0}(\cdot)$.
(4) Start with $\beta = 0$ and increase $\beta$ until equation (13) holds. Now we know the $\beta$ from the implicit choice above.
(5) Since we know all we need to compute $p(\cdot \mid D_0)$, select some $h$ according to this distribution.
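Put together, the five steps can be sketched as follows for a finite hypothesis set (all numbers, distributions, and helper names below are illustrative assumptions, not the paper's):

```python
import numpy as np

def gibbs(prior, errors, beta):
    unnorm = prior * np.exp(-beta * errors)
    return unnorm / unnorm.sum()

def gibbs_variant(prior_belief, E_train, E_tol, delta, rng, beta_step=0.05, beta_max=1e3):
    # (1) E_tol and delta are selected by the caller, implicitly choosing beta.
    # (2) Compute the set of acceptable minima M(D_0).
    M_D0 = np.flatnonzero(E_train <= E_tol)
    if M_D0.size == 0:
        raise ValueError("no acceptable minima for this E_tol")
    # (3) The prior belief p_{D_0}(h) is assumed to be computable from p(D_0 | f),
    #     p(D \ D_0 | f) and the prior p(f); here it is simply passed in.
    # (4) Increase beta until M(D_0) carries probability mass >= 1 - delta (eq. 13).
    beta = 0.0
    p_D0 = gibbs(prior_belief, E_train, beta)
    while beta <= beta_max and p_D0[M_D0].sum() < 1.0 - delta:
        beta += beta_step
        p_D0 = gibbs(prior_belief, E_train, beta)
    # (5) Select a hypothesis according to p(h | D_0).
    return rng.choice(len(prior_belief), p=p_D0), beta

rng = np.random.default_rng(0)
prior_belief = np.array([0.5, 0.3, 0.2])
E_train = np.array([0.1, 0.4, 1.5])
h_index, beta = gibbs_variant(prior_belief, E_train, E_tol=0.5, delta=0.1, rng=rng)
print(h_index, beta)
```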
Three comments on certain FMS limitations.
1. FMS only approximates the Gibbs variant given by the definition of $p_{D_0}(\cdot)$ (see (22) and (23)). We only locally approximate $p(f)$ in weight space. If $net(w)$ is locally flat around $w_0$, then there exist units or weights which can be given with low precision (or can be removed). If there are other weight vectors $\bar w$ with $net(\bar w) = net(w)$, then one may assume that there are also points in weight space near such $\bar w$ where weights can be given with low precision (think of, e.g., a symmetrical exchange of weights and units). We assume that this local approximation of $p(f)$ is good. The most probable targets, represented by flat $net(w)$, are approximated by a hypothesis $h = net(w_0)$ which is also represented by a flat $net(w_0)$ (where $w_0$ is near $w$ in weight space). To allow for an approximation of $T$ by $H$, we have to assume that the hypothesis set $H$ is dense in the target set $T$. If $net(w)$ is flat in many directions, then there are many targets $f$ that share this flatness and are well approximated by $h$. The only reasonable thing FMS can do is to make $net(w_0)$ as flat as possible in a large region around $w_0$, to approximate the $f$ with large prior probability $p(f)$ (recall that flat regions are approximated by axis-aligned boxes, as discussed in section 7, paragraph entitled ``Generalized boxes?''). This approximation is fine if $net(\cdot)$ is smooth enough in ``unflat'' directions (small changes in $w$ should not result in quite different net functions).
2. Concerning point (3) above: $p(D_0 \mid f)$ depends on the noise (it describes how the training data is drawn from the target, see (18)). $p(D \setminus D_0 \mid f)$ depends on $f$ and on the noise as well (how the test data is drawn from the target). Since we do not know how the data is obtained, the quality of the approximation of the Gibbs algorithm may suffer from noise whose mean is not $0$, or from large noise being more probable than small noise. Of course, if the choice of prior belief does not match the true target distribution, the quality of FMS's approximation will suffer as well.
3. Concerning point (5) above: FMS outputs only a single $h$ instead of the distribution $p(\cdot \mid D_0)$. This issue is discussed in section 7 (paragraph entitled ``multiple initializations?'').
To conclude: our FMS algorithm from section 3 only approximates the Gibbs algorithm variant. Two important assumptions are made. The first is that an appropriate choice of prior belief has been made. The second is that the noise on the data is not too ``weird'' (mean $0$, small noise more likely than large noise). These two assumptions are necessary for any algorithm based on an additional error term besides the training error. The approximations are: $p(f)$ is approximated locally in weight space, and flat $net(w)$ are approximated by flat $net(w_0)$ with $w_0$ near $w$. Our Gibbs variant takes into account that FMS uses only $X_0$ for computing flatness.