In this section we propose a combination of principles of the bucket brigade algorithm with principles of neural networks. Competition can be introduced naturally into neural networks by a mechanism of lateral inhibition. What we still need is a mechanism analogous to the process of bidding and paying in classifier systems. This mechanism must establish recursive dependencies `through time'. We introduce a local method for shifting `weight substance' (initially provided by the environment) from weights that are allowed to transport activation information at a certain time to those weights that were `setting the stage' one time tick earlier.
The basic network structure is an arbitrary (possibly cyclic) directed graph, where the nodes are familiar processing units. Some units are used for input purposes, others serve as outputs and may be coupled with effectors that may change the environment, which in turn may change the current input. Thus we have external and internal feedback.
The set of non-input units is partitioned into predefined `competitive subsets'. All non-input units synchronously try to get activated by summing their weighted inputs at each time tick. All members of a predefined competitive subset laterally inhibit each other (by some `winner-take-all' mechanism), thus competing for being active. Unlike with most other approaches to goal-directed learning, the basic building blocks of the network are not simple units but winner-take-all subsets, each of which should have at least two members.
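As an illustration, the structure just described can be written down as plain data. The concrete unit indices, subset sizes, and edge list below are my own choices for a toy instance, not prescribed by the text:

```python
# Hypothetical instance of the network structure described above: a
# directed graph over units, with the non-input units partitioned into
# winner-take-all subsets of at least two members each.
input_units = [0, 1]
competitive_subsets = [[2, 3], [4, 5]]    # partition of the non-input units
edges = [(0, 2), (0, 3), (1, 4), (1, 5),  # (from, to); the graph may be cyclic,
         (2, 4), (4, 2)]                  # e.g. this internal feedback loop

non_input = [u for s in competitive_subsets for u in s]
assert all(len(s) >= 2 for s in competitive_subsets)  # WTA needs competitors
assert not set(input_units) & set(non_input)          # inputs join no subset
```

Input units take part in no competition; only the non-input units are grouped into subsets.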
All weights are randomly initialized with a positive real value, and are modifiable. Initially we will assume that there is instant decay: A unit active at time $t$ manages to send its contributions to connected units that try to get activated at time $t+1$; then the sender is switched off instantly.
All units active at time $t$ take away a fraction of the positive weights of their outgoing connections (if there are any) that lead to winners active at time $t+1$, and distribute this `weight-substance' proportionally to the respective contributions among the incoming connections (if there are any) coming from winners (or input units) active at time $t-1$. Since the weights determine the context-dependent strength of a unit, winners `get paid' for setting the stage for their successors.
Input units do not have any incoming connections that they could strengthen; they get activated by the environment, thus representing holes through which the weight-substance of the system is leaking. The environment's influence is completed by sometimes rewarding (or punishing) the connections to currently active units in the case of useful output behavior. (An external critic decides what kind of behavior is useful.) The sum of all positive weights in the system remains constant, except for the weight-substance that is leaking through the input units and the new substance that is entering the system in the case of payoff. Thus we have a dissipative system which is consuming weight-substance provided by the environment.
More formally, at time $t$ we denote the activation of the $i$th unit by $o_i(t)$, the weight on the directed connection between units $j$ and $i$ by $w_{ij}$, and the contribution of some connection by $w_{ij}o_j(t)$.
The activation rule works as follows: Unit $i$ gets activated at time $t+1$ if it is an input unit and receives a perception, or if it wins the competition between the units in the competitive subset it belongs to by having the largest positive net input $\sum_j w_{ij}o_j(t)$. We assume the simplest case: $o_i(t)$ equals $1$ if unit $i$ is active, and $0$ otherwise. (For instance, a conventional boolean unit with two possible activation states may be implemented by a competitive subset with two members.)
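A minimal sketch of one synchronous time tick under this activation rule. The 0/1 encoding of activations follows the text; the function name, the dictionary encoding of weights, the clamping of input units, and the tie-breaking by lowest unit index are my own assumptions:

```python
# Sketch of one synchronous time tick: every non-input unit computes its
# net input from the activations at time t, and in each competitive
# subset only the unit with the largest positive net input wins.

def step(w, subsets, o_prev, input_pattern):
    """Return activations o(t+1). w[i][j] is the weight on j -> i;
    o_prev maps unit -> 0/1 activation at time t; input_pattern
    clamps the input units (set by the environment)."""
    o_next = dict(input_pattern)
    for subset in subsets:
        # net input of each competitor, using activations from time t
        net = {i: sum(wij * o_prev[j] for j, wij in w[i].items())
               for i in subset}
        winner = max(sorted(net), key=net.get)   # lateral inhibition;
        for i in subset:                         # ties: lowest index wins
            o_next[i] = 1 if (i == winner and net[i] > 0) else 0
    return o_next
```

Note that a subset may have no active member at all: if no competitor has a positive net input, all of them stay switched off.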
If non-input unit $i$ is active at time $t$ then its weights change according to
$$\Delta w_{ij} = - \lambda w_{ij} o_j(t-1) + \frac{w_{ij} o_j(t-1)}{\sum_k w_{ik} o_k(t-1)} \left( R_i(t) + \lambda \sum_l w_{li} o_l(t+1) \right),$$
where $\lambda$ determines how much of its weight some particular connection has to pay to those connections that were responsible for setting the stage at the previous time step. $R_i(t)$ is the `external payoff' that the environment gives to unit $i$ at time $t$, and may be computed like this: If the external critic does not know at time $t$ whether useful behavior took place, then $R_i(t) = 0$. Else, if the critic notices a useful action, and if unit $i$ was active at time $t$, then $R_i(t) = c$, with $c$ being a proportionality factor. As will be demonstrated in the section describing the experiments, there is much room for more or less supervised strategies to determine $R_i(t)$: every unit might get instructed at every time step, or just a few units at certain isolated time steps, etc.
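The take-and-distribute step can be sketched directly from the verbal description above. The function name, the dictionary encoding of weights and activations, and the per-unit processing order are my own assumptions, not part of the method's specification:

```python
# Sketch of the weight shift: each unit active at time t takes a
# fraction lam of its outgoing weights leading to winners at t+1
# (plus any external payoff R) and distributes the collected
# weight-substance among its incoming connections from winners at t-1,
# proportionally to their contributions.

def bucket_brigade_step(w, o_m1, o_0, o_p1, lam, R):
    """w[i][j]: weight on connection j -> i (non-input units i only).
    o_m1, o_0, o_p1: 0/1 activations at times t-1, t, t+1.
    R[i]: external payoff given to unit i at time t."""
    for i in [u for u in o_0 if o_0[u]]:
        pool = R.get(i, 0.0)
        # take a fraction lam of outgoing weights to winners at t+1
        for l in w:
            if i in w[l] and o_p1.get(l, 0):
                pool += lam * w[l][i]
                w[l][i] *= (1 - lam)
        # distribute proportionally to contributions from winners at t-1
        contrib = ({j: w[i][j] * o_m1.get(j, 0) for j in w[i]}
                   if i in w else {})
        total = sum(contrib.values())
        if total > 0:
            for j in w[i]:
                w[i][j] += pool * contrib[j] / total
        # else: i is an input unit (or has no active predecessor), so
        # the taken substance leaks out of the system
```

Apart from payoff entering and substance leaking through units without strengthenable incoming connections, the total weight in the system is conserved by this step, as the text demands of the dissipative system.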
The weights of the system (as opposed to the activations in Hopfield-networks or feedback-BP) have reached a stable state when every connection at any time is giving back as much weight-substance as it is receiving during the next time step. This means that (parallel) chains of units and connections cooperating in time have evolved.
It is important to note the local character of this method. No book-keeping of past activations is required, not even the accumulative computation of, say, a weighted sum of past activations. Each weight and each unit in principle performs the same operation at each time tick. No such things as `epoch boundaries' are required during training.