Predictability minimization (PM). The simple approach in this paper is based on the recent principle of predictability minimization (PM) [Schmidhuber, 1992]. A feedforward network with output units (or code units) sees input patterns with redundant components. Its goal is to respond with informative but less redundant output patterns, ideally by creating a factorial (statistically nonredundant) code of the input ensemble [Barlow et al., 1989]. The central idea of PM is: for each code unit, there is a predictor network that tries to predict the code unit from the remaining code units. But each code unit tries to become as unpredictable as possible, by representing environmental properties that are independent of those represented by the other code units. Predictors and code units co-evolve by fighting each other. See Section 2 for details.
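The opposing updates can be illustrated with a deliberately tiny sketch (not the system of this paper): two linear code units on redundant two-dimensional inputs, each with a linear predictor that guesses it from the other unit. The predictors do gradient descent on the squared prediction error; a code unit does gradient ascent on the very same error. All names and learning rates below are illustrative assumptions.

```python
import random

random.seed(0)

# Toy input ensemble with redundant components: both components carry
# (mostly) the same signal s, plus a little independent noise.
def sample():
    s = random.gauss(0.0, 1.0)
    return (s + 0.1 * random.gauss(0.0, 1.0),
            s + 0.1 * random.gauss(0.0, 1.0))

def dot(w, x):
    return w[0] * x[0] + w[1] * x[1]

# Two linear code units y1 = w1.x, y2 = w2.x, and two linear predictors:
# predictor 1 guesses y1 from y2 (p1 = a1*y2), predictor 2 guesses y2 from y1.
w1, w2 = (1.0, 0.0), (0.0, 1.0)
a1, a2 = 0.0, 0.0
lr_pred = 0.02

# Phase 1: with the code fixed, the predictors do gradient DESCENT on the
# squared prediction error and learn to exploit the redundancy.
errors = []
for _ in range(2000):
    x = sample()
    y1, y2 = dot(w1, x), dot(w2, x)
    e1, e2 = a1 * y2 - y1, a2 * y1 - y2
    errors.append(e1 * e1 + e2 * e2)
    a1 -= lr_pred * 2.0 * e1 * y2
    a2 -= lr_pred * 2.0 * e2 * y1

early_mse = sum(errors[:100]) / 100.0
late_mse = sum(errors[-100:]) / 100.0   # much smaller: redundancy was learned

# Phase 2: a code unit fights back. A single gradient ASCENT step on the
# same error makes unit 1's response harder to predict.
x = sample()
y1, y2 = dot(w1, x), dot(w2, x)
e1, e2 = a1 * y2 - y1, a2 * y1 - y2
err_before = e1 * e1 + e2 * e2
# dE/dw1 (only y1 depends on w1): 2*e1*(-x) + 2*e2*a2*x
g = ((-2.0 * e1 + 2.0 * e2 * a2) * x[0],
     (-2.0 * e1 + 2.0 * e2 * a2) * x[1])
lr_code = 0.01
w1 = (w1[0] + lr_code * g[0], w1[1] + lr_code * g[1])
y1 = dot(w1, x)
e1, e2 = a1 * y2 - y1, a2 * y1 - y2
err_after = e1 * e1 + e2 * e2   # larger: unit 1 became less predictable
```

In the full method the two phases run concurrently rather than sequentially, and the code units carry constraints (e.g. bounded activations) that prevent trivial "solutions" such as unbounded growth.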
Potential advantages of PM over other methods are: (1) Unlike certain inherently sequential methods (e.g. Rubner and Schulten, 1990), PM can be implemented in parallel. (2) Unlike e.g. with Barrow's model (1987), there may be many simultaneously active code units (multiple ``winners'' instead of a single ``winner''), as long as they represent different aspects of the environment (distributed coding instead of local coding). (3) Unlike e.g. with Linsker's INFOMAX (1988), there is no need to compute derivatives of determinants of covariance matrices. (4) Unlike e.g. Deco and Obradovic's system (1996), Földiák's system (1990), Rubner and Tavan's system (1989), and anti-Hebbian systems in general, PM requires neither time-consuming settling phases (due to recurrent connections) nor analytic computation of weight vectors. (5) Unlike almost all other methods, PM has the potential to discover nonlinear redundancy in the input data, and to generate appropriate redundancy-free codes. (6) Unlike most other ``neural'' methods (see references above), existing variants of PM create binary codes as opposed to continuous codes. This (a) allows for easier post-training analysis and (b) facilitates the creation of statistically independent code components as opposed to merely decorrelated code components. Note that statistical independence implies decorrelation, but decorrelation does not imply statistical independence. Why are statistically independent code components of interest? One of many important reasons is this: for efficiency reasons, most statistical classifiers (e.g. Bayesian pattern classifiers) assume statistical independence of their input variables (corresponding to the pattern components). If we had a method that takes an arbitrary pattern ensemble and generates an equivalent factorial code, the latter could be fed into an efficient conventional classifier, which in turn could achieve its theoretically optimal performance.
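That decorrelation is strictly weaker than statistical independence is easy to check numerically. In the standard textbook example below (not from this paper), y is a deterministic function of x, so the two are maximally dependent, yet their covariance vanishes by symmetry:

```python
import random

random.seed(1)

# x takes values -1, 0, 1 with equal probability; y = x*x is a
# deterministic function of x, hence strongly dependent on it.
n = 100000
xs = [random.choice([-1.0, 0.0, 1.0]) for _ in range(n)]
ys = [x * x for x in xs]

mean = lambda v: sum(v) / len(v)

# Covariance ~ 0, since E[x*y] = E[x^3] = 0 by symmetry:
# x and y are DECORRELATED.
cov = mean([a * b for a, b in zip(xs, ys)]) - mean(xs) * mean(ys)

# Yet they are far from INDEPENDENT: P(x=1, y=1) != P(x=1) * P(y=1).
p_x1 = mean([1.0 if x == 1.0 else 0.0 for x in xs])      # ~ 1/3
p_y1 = mean([1.0 if y == 1.0 else 0.0 for y in ys])      # ~ 2/3
p_joint = mean([1.0 if (x, y) == (1.0, 1.0) else 0.0
                for x, y in zip(xs, ys)])                # ~ 1/3, not 2/9
```

A code with components like (x, y) would pass any decorrelation test while still being redundant; a factorial code, by definition, would not.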
Purpose of paper. Despite its potential advantages, PM has been tested on artificial data only (Lindstädt, 1993; Schmidhuber, 1993, 1994). To start a more thorough experimental analysis, in this paper we study the question: what happens if we apply a computationally simple, entirely local, highly parallel, and even biologically plausible variant of PM to real world images? An intuitively reasonable first step towards representing images in a less redundant way (one adopted by standard image processing techniques, but apparently also by early visual processing stages of biological systems) is to build compact representations based on information about boundaries (edges) between areas with nonvarying, redundant pixel activations. Since PM aims at generating codes with reduced redundancy, we may expect it to discover related ways of coding visual scenes, by creating feature detectors responsive to edges or similar informative features in the input scenes. Moreover, since edge detectors (as well as other, related useful feature detectors such as on-center-off-surround detectors) can be implemented with a single layer of neuronal units, we may already expect a single layer system to come up with such detectors. This paper reports a confirmation of this expectation, thus demonstrating that PM makes sense not only intuitively and in theory, but also in practical applications. The results encourage us to expect that the method will also be beneficial for large scale applications, by extracting more sophisticated, nonlinear, useful features in deeper layers. Due to our current hardware limitations, however, a test of this hypothesis is left for future research.
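Why a single layer suffices for edge detection is easy to see in a hand-built miniature (the weights here are fixed by hand, not learned by PM as in the experiments): each unit computes a weighted sum of neighboring pixel intensities with a difference kernel, staying silent in uniform (redundant) regions and responding only at the boundary.

```python
# An 8x8 "image" with a vertical luminance step: left half 0, right half 1.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]

# A single layer of linear units, one per horizontal pixel pair, each with
# the 1x2 difference kernel (-1, +1): the simplest oriented edge detector.
def responses(img):
    return [[img[r][c + 1] - img[r][c] for c in range(len(img[0]) - 1)]
            for r in range(len(img))]

out = responses(image)
# Units inside the two uniform regions output 0; only the units straddling
# the boundary (column index 3) output 1.
```

The layer's output is nonzero only where the input stops being redundant, which is exactly the kind of code a redundancy-reducing principle like PM should favor.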
Outline. Section 2 reviews the principle of PM in more detail. Section 3 applies the technique to real world images and presents results.