With the *offline* variant of the approach,
$P$'s training phase
is based on a set $F$ of training files.
Assume that the alphabet contains
$k$ possible characters
$z_1, z_2, \ldots, z_k$.
The (local) representation of $z_i$ is a binary $k$-dimensional
vector $r(z_i)$ with exactly one non-zero component (at the $i$-th position).
$P$ has $nk$ input units and $k$ output units;
$n$ is called the ``time-window size''.
We insert $n$ default characters $z_0$ at the beginning of each file.
The representation of the
default character, $r(z_0)$, is the $k$-dimensional zero-vector.
The $m$-th character of file $f$ (starting
from the first default character) is called $c^f_m$.
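The local representation and the default-character padding can be illustrated with a minimal Python sketch. The function names `one_hot` and `pad_with_defaults`, and the convention of coding the default character as index 0 and the alphabet characters as indices 1 to k, are assumptions made for this illustration only.

```python
def one_hot(index, k):
    """Local representation of a character as a list of k zeros and ones.

    Assumed convention: index 0 is the default character (zero vector),
    indices 1..k are the alphabet characters.
    """
    if not 0 <= index <= k:
        raise ValueError("index must lie in 0..k")
    vec = [0] * k
    if index > 0:
        vec[index - 1] = 1  # exactly one non-zero component
    return vec


def pad_with_defaults(char_indices, n):
    """Prepend n default characters (index 0) to a file given as indices 1..k."""
    return [0] * n + list(char_indices)


# Toy alphabet with k = 3 characters and time-window size n = 2.
padded = pad_with_defaults([2, 1, 3], n=2)
vectors = [one_hot(c, k=3) for c in padded]
```

With this convention, the first n entries of `vectors` are zero vectors, and every later entry has a single 1 at the position of its character.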

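For all $f \in F$ and all possible $m > n$,
$P$ receives as an input
\[
r(c^f_{m-n}) \circ r(c^f_{m-n+1}) \circ \cdots \circ r(c^f_{m-1}),
\]
where $\circ$ is the concatenation operator for vectors. $P$ produces as an output $P^f_m$, a $k$-dimensional output vector. Using back-propagation [36][9][16][19], $P$ is trained to minimize
\[
\frac{1}{2} \sum_{f \in F} \sum_{m > n} \left\| r(c^f_m) - P^f_m \right\|^2 .
\]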
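The construction of the training examples can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names `training_pairs` and `squared_error` are hypothetical, characters are assumed to be coded as indices 1 to k with 0 for the default character, and the error is assumed to be half the squared distance between the target representation and the output vector.

```python
def one_hot(index, k):
    """Local representation; index 0 (default character) gives the zero vector."""
    vec = [0] * k
    if index > 0:
        vec[index - 1] = 1
    return vec


def training_pairs(char_indices, n, k):
    """Yield (input, target) pairs for one file.

    The input concatenates the representations of the n preceding
    characters (n * k components); the target is the representation
    of the character to be predicted (k components).
    """
    padded = [0] * n + list(char_indices)  # n default characters first
    for t in range(n, len(padded)):
        window = padded[t - n:t]
        x = [bit for c in window for bit in one_hot(c, k)]
        yield x, one_hot(padded[t], k)


def squared_error(target, output):
    """Assumed per-example error: half the squared Euclidean distance."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, output))


# Toy file with k = 3 characters and time-window size n = 2.
pairs = list(training_pairs([2, 1, 3], n=2, k=3))
```

Each real character of the file yields exactly one training pair, and the earliest pairs see default characters in their window. Summing `squared_error` over all pairs of all training files gives the quantity a trainer would minimize under these assumptions.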

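Let $(P^f_m)_i$ denote the $i$-th component of the vector $P^f_m$. Due to the local character representation, this error function is minimized if, for all $f$ and for all appropriate $m > n$, $(P^f_m)_i$ is equal to the conditional probability
\[
\Pr\left(c^f_m = z_i \mid c^f_{m-n}, \ldots, c^f_{m-1}\right).
\]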
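A short derivation makes this plausible. It is a sketch under two assumptions: the error is a sum of squared differences between one-hot targets and outputs, and the network is flexible enough to choose its output independently for each context. The symbols $x$, $N_x$, $p_i$, $o_i$ and $E_x$ are introduced here for illustration only. Fix a context $x$ of $n$ characters that occurs $N_x$ times in the training files, let $p_i$ be the relative frequency with which $z_i$ follows $x$, and let $o_i$ be the $i$-th output component for $x$. The contribution of this context to the error is
\[
E_x = \frac{N_x}{2} \sum_{i=1}^{k} \left[ p_i (1 - o_i)^2 + (1 - p_i)\, o_i^2 \right],
\]
and setting its derivative to zero gives
\[
\frac{\partial E_x}{\partial o_i} = N_x \left[ -p_i (1 - o_i) + (1 - p_i)\, o_i \right] = N_x (o_i - p_i) = 0
\quad\Longrightarrow\quad o_i = p_i .
\]
Under these assumptions, the minimizing output for each context is the relative frequency of each successor character, which is the empirical counterpart of the conditional probability.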

For normalization purposes, we define