Assume that the alphabet contains possible characters . The (local) representation of is a binary -dimensional vector with exactly one non-zero component (at the -th position). has input units and output units. is called the ``time-window'' size. We insert default characters at the beginning of each file. The representation of the default character, , is the -dimensional zero-vector. The -th character of file (starting from the first default character) is called .
For all and all possible ,
receives as an input
For instance, assume that a given ``context string'' of size is followed by a certain character in one third of all training exemplars involving this string. Then, given the context, the predictor's corresponding output unit will tend to predict a value of 0.3333.
In practical applications, the
will not always sum up to 1.
To obtain outputs satisfying the properties of
a proper probability distribution,
we normalize by defining