Although gradient methods based on predictability minimization cannot always be expected to find factorial codes (because of local minima and the possibility that the problem of finding factorial codes may be NP-hard), they have the potential to remove kinds of redundancy that previous linear methods were not able to remove. This holds even if the conjecture in section 5 ultimately proves to be false.
In many realistic cases, however, approximations of non-redundant codes should be satisfactory. It remains to be seen whether predictability minimization can be useful for finding nearly non-redundant representations of real-world inputs. Ongoing research aims to apply the methods described herein to problems of unsupervised image segmentation (in the case of multiple objects), as well as to unsupervised sequence segmentation.
Predictability minimization is related to more conventional `competitive' learning schemes: in a certain sense, units compete for representing certain `abstract' transformations of the environmental input. The competition is not based on a physical `neighbourhood' criterion but on mutual predictability. Unlike most previous schemes based on `winner-take-all' networks, output representations formed by predictability minimization may have multiple `winners', as long as they stand for independent features extracted from the environment.
One might speculate about whether the brain uses a similar principle based on `representational neurons' trying to escape the predictions of `predictor neurons'. Since the principle allows for sequence learning algorithms that are entirely local in space and time, it might be biologically more plausible than methods such as `back-propagation through time'.
Predictability minimization might also be useful in cases where different representational modules see different inputs. For instance, if a binary feature of one input `patch' is predictable from features extracted from neighbouring `patches', then representations formed by predictability minimization would tend not to use additional storage cells for representing the feature.
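The notion of one binary feature being predictable from a neighbouring one can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the training procedure of this paper: the function names and the toy data are hypothetical, and the predictor is simply the empirical conditional mean, which is the predictor that minimizes mean squared error on the given samples.

```python
from collections import defaultdict


def conditional_mean_predictor(targets, contexts):
    """Map each observed context to the mean target value seen with it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, c in zip(targets, contexts):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in counts}


def prediction_error(targets, contexts):
    """Mean squared error of the conditional-mean predictor on the same data."""
    table = conditional_mean_predictor(targets, contexts)
    total = sum((t - table[c]) ** 2 for t, c in zip(targets, contexts))
    return total / len(targets)


# Hypothetical toy data: one binary feature per patch, eight input patterns.
patch_a = [0, 0, 1, 1, 0, 1, 0, 1]
redundant_b = [1 - a for a in patch_a]    # fully determined by patch_a
independent_b = [0, 1, 0, 1, 1, 0, 0, 1]  # balanced within each value of patch_a

err_redundant = prediction_error(redundant_b, patch_a)      # 0.0
err_independent = prediction_error(independent_b, patch_a)  # 0.25
```

An error of zero marks the second feature as fully redundant given the first, so a separate storage cell for it carries no additional information. An error of 0.25, the variance of a fair binary variable, indicates that in this toy sample the neighbouring feature does not help to predict it.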
The paper at hand adopts a general viewpoint on predictability minimization by focussing on the general case of non-linear nets. In some cases, however, it might be desirable to restrict the computational power of the representational modules and/or the predictors by making them linear or semi-linear. For instance, a hierarchical system with successive stages of computationally limited modules may be useful for reflecting the hierarchical structure of certain environments.
Among the additional topics covered by the longer version of this report (Schmidhuber, 1991) are: general remarks on unsupervised learning and information-theoretic aspects, a `neural' approach to finding binary factorial codes without using predictors, implementations of predictability minimization using binary stochastic units, the relationship of predictability minimization to recent sequence chunking methods, and combinations of goal-directed learning and unsupervised predictability minimization.