Architecture.
For simplicity, we will focus on
a standard feedforward net (but in the experiments, we will use
recurrent nets as well). The net has
$N$ input units,
$K$ output units,
$L$ weights, and differentiable activation functions. It
maps input vectors $x \in \mathbf{R}^N$
to output vectors $o(x) \in \mathbf{R}^K$.
The weight from unit $j$ to unit $i$ is denoted by $w_{ij}$.
The $L$-dimensional weight vector is denoted by $w$.
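As a concrete illustration (not part of the paper), such a net can be sketched in a few lines of NumPy; the single hidden layer, the $\tanh$ activation, the layer sizes, and all variable names are assumptions chosen for this sketch:

```python
import numpy as np

def forward(w1, w2, x):
    """Map an input vector x to an output vector o(x).

    w1: hidden-layer weights (H x N), w2: output weights (K x H).
    tanh is one choice of differentiable activation function.
    """
    h = np.tanh(w1 @ x)      # hidden activations
    return np.tanh(w2 @ h)   # output vector o(x) in R^K

# Toy dimensions: N = 3 inputs, H = 4 hidden units, K = 2 outputs,
# so the net has L = H*N + K*H = 20 weights in total.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 3)) * 0.1
w2 = rng.standard_normal((2, 4)) * 0.1
o = forward(w1, w2, np.ones(3))
```

Here the full weight vector $w$ would simply be the concatenation of the entries of `w1` and `w2`.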
Training error.
Mean squared error
$E(w, D_0) := \frac{1}{|D_0|} \sum_{(x_p, y_p) \in D_0} \| y_p - o(x_p) \|^2$
is used,
where $\| \cdot \|$ denotes the Euclidean norm, and
$| \cdot |$ denotes the cardinality of a set.
To define regions in weight space with the property
that each weight vector from that region has ``similar small error'',
we introduce the tolerable error
$E_{tol}$, a positive constant.
``Small'' error is defined as being smaller than $E_{tol}$.
$E(w, D_0) > E_{tol}$ implies ``underfitting''.
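The training error and the tolerability check can be sketched as follows; the particular model $o$, the training set `D0`, and the value of `E_tol` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def mse(o, D0):
    """E(w, D_0) := (1/|D_0|) * sum over (x_p, y_p) in D_0 of ||y_p - o(x_p)||^2."""
    return sum(np.sum((y - o(x)) ** 2) for x, y in D0) / len(D0)

# Illustrative linear model and two-example training set.
W = np.array([[0.5, 0.0],
              [0.0, 0.5]])
o = lambda x: W @ x
D0 = [(np.array([1.0, 0.0]), np.array([0.5, 0.0])),
      (np.array([0.0, 2.0]), np.array([0.0, 1.0]))]

E = mse(o, D0)           # this model fits D0 exactly, so E = 0
E_tol = 0.1
acceptable = E <= E_tol  # True: error is "small" relative to E_tol
```

A weight vector with `E > E_tol` would, in the paper's terminology, underfit.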
Boxes.
Each weight vector $w$
satisfying $E(w, D_0) \leq E_{tol}$
defines an ``acceptable minimum''.
We are interested in large regions of connected acceptable minima.
Such regions are called flat minima. They are associated with
low expected generalization error (see
[4]).
To simplify the algorithm for finding large connected
regions (see below), we do not consider
maximal connected regions but focus on so-called ``boxes'' within
regions: for each acceptable minimum $w$,
its box $M_w$ in weight space
is an $L$-dimensional hypercuboid
with center $w$.
For simplicity, each edge of the box is taken to be parallel
to one weight axis.
Half the length of the box edge in direction of the axis
corresponding to weight $w_{ij}$
is denoted by $\Delta w_{ij}$,
which is the maximal (positive) value such
that for all $i, j$, all positive $\kappa_{ij} \leq \Delta w_{ij}$ can be
added to or subtracted from
the corresponding component of $w$
simultaneously
without violating
$E(w, D_0) \leq E_{tol}$
($\Delta w_{ij}$ gives the
precision of $w_{ij}$).
$M_w$'s box volume is defined by
$\Delta V(w) := 2^L \prod_{i,j} \Delta w_{ij}$.
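A numerical sketch of such a box and its volume, for a toy convex error surface (the quadratic error, the shared half-length, and the bisection search are all simplifying assumptions made for this sketch; the paper allows a different half-length per weight axis):

```python
import numpy as np
from itertools import product

def box_half_length(E, w, E_tol, hi=10.0, iters=60):
    """Bisect for the largest common half-length d such that every corner
    of the hypercuboid with center w and edge 2*d satisfies E <= E_tol.
    Checking corners suffices here because the toy E below is convex."""
    L = len(w)
    corners = [np.array(s) for s in product((-1.0, 1.0), repeat=L)]
    ok = lambda d: all(E(w + d * s) <= E_tol for s in corners)
    lo_d, hi_d = 0.0, hi
    for _ in range(iters):
        mid = 0.5 * (lo_d + hi_d)
        if ok(mid):
            lo_d = mid
        else:
            hi_d = mid
    return lo_d

E = lambda w: float(np.sum(w ** 2))  # toy error surface, minimum at w = 0
w = np.zeros(2)                      # an acceptable minimum: E(w) = 0
E_tol = 1.0
d = box_half_length(E, w, E_tol)     # worst corner has error L*d^2,
                                     # so d converges to sqrt(E_tol / L)
volume = (2.0 * d) ** len(w)         # Delta V(w) = 2^L * prod of half-lengths
```

With $L = 2$ and $E_{tol} = 1$, the half-length converges to $\sqrt{1/2}$ and the box volume to $2$.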