Sepp Hochreiter's Fundamental Deep Learning Problem (1991)
Jürgen Schmidhuber, 2013
Two decades later, everybody is talking about Deep Learning! A first milestone of Deep Learning research was the 1991 diploma thesis of Sepp Hochreiter, my very first student, who is now a professor in Linz. His work formally showed that deep neural networks are hard to train because they suffer from the now famous problem of vanishing or exploding gradients: in typical deep or recurrent networks, back-propagated error signals [3a,3] either shrink rapidly or grow out of bounds. In fact, they decay exponentially in the number of layers, or they explode. All our subsequent Deep Learning research of the 1990s and 2000s was motivated by this insight.
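The exponential decay can be demonstrated numerically. The sketch below (my illustration, not from the thesis; all names and parameter values are made up) backpropagates an error signal through a stack of random sigmoid layers and records its norm after each step; since the sigmoid's derivative is at most 0.25, the signal shrinks roughly geometrically with depth.

```python
import numpy as np

def backprop_signal_norms(depth, width, seed=0):
    """Push an error signal backward through `depth` random sigmoid
    layers and return its norm after each backward step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    Ws = [rng.standard_normal((width, width)) / np.sqrt(width)
          for _ in range(depth)]
    # Forward pass, remembering activations for the backward pass.
    acts = []
    for W in Ws:
        x = 1.0 / (1.0 + np.exp(-(W @ x)))  # sigmoid
        acts.append(x)
    # Backward pass: delta <- W^T (delta * sigma'(z)), with
    # sigma'(z) = a * (1 - a) for post-activation a.
    delta = np.ones(width)
    norms = []
    for W, a in zip(reversed(Ws), reversed(acts)):
        delta = W.T @ (delta * a * (1.0 - a))
        norms.append(np.linalg.norm(delta))
    return norms

norms = backprop_signal_norms(depth=20, width=50)
# The error norm shrinks by many orders of magnitude over 20 layers.
print(norms[0], norms[-1])
```

With exploding rather than vanishing gradients, the same product of Jacobians grows out of bounds instead; which regime occurs depends on the scale of the weights.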
The thesis is in German, but don't worry: all basic results are documented in the universal language of mathematics. (Google Translate does a reasonable job on it.)
Ten years later, an additional survey came out in English.
We have found three ways of partially overcoming the Fundamental Deep Learning Problem:
I. My first Deep Learner of 1991 overcame it through unsupervised pre-training for a hierarchy of (recurrent) neural networks. This greatly facilitated subsequent supervised credit assignment through back-propagation.
II. LSTM-like networks (since 1997) avoid the problem through a special architecture unaffected by it.
III. Today, a million times faster GPU-based computers allow for propagating errors a few layers further down within reasonable time, even in traditional NNs; that's basically what's winning many of the image competitions now. (Although this does not really overcome the problem in a fundamental way.)
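Remedy II can be illustrated with a minimal scalar sketch (my illustration, with made-up weight values, not the actual LSTM equations): in a plain recurrent unit, the gradient through time is a product of terms smaller than one and vanishes, whereas an LSTM-style cell with an additive state and a forget gate near 1 keeps the gradient almost constant — the "constant error carousel."

```python
import math

T = 100  # number of time steps

# Plain recurrent tanh unit: dh_T/dh_0 is a product of terms
# w * tanh'(z_t), each below 1 here, so it vanishes over time.
w = 0.9   # hypothetical recurrent weight
h = 0.5   # hypothetical initial state
plain_grad = 1.0
for _ in range(T):
    h = math.tanh(w * h)
    plain_grad *= w * (1.0 - h * h)  # chain rule through tanh

# LSTM-style cell: c_t = f * c_{t-1} + input. With forget gate
# f near 1, dc_T/dc_0 = f**T stays close to 1.
f = 0.999  # hypothetical forget-gate activation
lstm_grad = f ** T

print(plain_grad, lstm_grad)
```

After 100 steps, the plain unit's gradient is negligible while the additive cell's remains near 1, which is why LSTM can learn dependencies spanning very long time lags.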
[1] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, 1991.
[2] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[3] Paul J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, eds., System Modeling and Optimization: Proc. IFIP (1981). Springer, 1982.
[3a] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60.
J. Schmidhuber. My first Deep Learner of 1991 + Deep Learning timeline 1962-2013
1997: Deep Learner based on LSTM RNN - first international pattern recognition competitions won in 2009
2011: Superhuman Visual Pattern Recognition through GPU-based deep feedforward NN