Consider a perceptual system being exposed to an unknown environment. The system has some kind of internal `state' to represent external events. We consider the general case where the state is an -dimensional distributed representation (a vector of real-valued or binary code symbols) created in response to the -th input vector .
An ambitious and potentially powerful objective of unsupervised learning is to represent the environment such that the various parts of the representation are statistically independent of each other. In other words, we would like to have methods for decomposing the environment into entities that belong together and do not have much to do with other entities1 (`learning to divide and conquer'). This notion is captured by the concept of `discovering factorial codes' (Barlow et al., 1989).
The aim of `factorial coding' is the following: Given the statistical properties of the inputs from the environment, find invertible internal representations such that the occurrence of the -th code symbol is independent of any of the others. Such representations are called factorial because they have a remarkable and unique property: The probability of the occurrence of a particular input is simply the product of the probabilities of the corresponding code symbols.
Among the advantages of factorial codes are as follows:
(1) `Optimal' input segmentation. An efficient method for discovering mutually independent features would have consequences for many segmentation tasks. For instance, consider the case where the inputs are given by retinal images of various objects with mutually independent positions. At any given time, the activations of nearby `pixels' caused by the same object are statistically correlated. Therefore a factorial code would not represent them by different parts of the internal storage. Instead, a factorial code could be created by finding input transformations corresponding to the abstract concept of `object position': Positions of different objects should be represented by different code symbols.
(2) Speeding up supervised learning. As Becker (1991) observes, if a representation with uncorrelated components is used as the input to a higher-level linear supervised learning network, then the Hessian of the supervised network's error function is diagonal, thus allowing efficient methods for speeding up learning (note that statistical independence is a stronger criterion than the mere absence of statistical correlation). Non-linear networks ought to profit as well.
(3) Occam's razor. Any method for finding factorial codes automatically implements Occam's razor which prefers simpler models over more complex ones, where simplicity is defined as the number of storage cells necessary to represent the environment in a factorial fashion. If there are more storage cells than necessary to implement a factorial code, then the independence criterion is met by letting all superfluous units emit constant values in response to all inputs. This implies storage efficiency, as well as a potential for better generalization capabilities.
(4) Novelty detection. As Barlow, Kaushal, and Mitchison (1989) point out , with factorial codes the detection of dependence between two symbols indicates hitherto undiscovered associations.
Barlow et al. do not present an efficient general method for finding factorial codes (but they propose a few sequential `non-neural' heuristic methods). Existing `neural' methods for decreasing output redundancy (e.g. Linsker (1988), Zemel and Hinton (1991), Oja (1989), Sanger (1989), Földiák (1990), Rubner and Schulten (1990), Silva and Almeida (1991)) are restricted to the linear case and do not aim at the ambitious general goal of statistical independence. In addition, some of these methods require Gaussian assumptions about the input and output signals, as well as the explicit calculation of the derivative of the determinant of the output covariance matrix (Shannon, 1948).
The main contribution of this paper is a simple but general `neural' architecture (plus the appropriate objective functions) for finding factorial codes.
I would not be surprised, however, if the general problem of finding factorial codes turned out to be NP-hard. In that case, gradient-based procedures as described herein could not be expected to always find factorial codes. The paper at hand focuses on the novel basic principle without trying to provide solutions for the old problem of local maxima. Also, the purpose of this report is not to compare the performance of algorithms based on the novel principle to the performance of existing sequential `non-neural' heuristic methods (Barlow et. al, 1989). The toy-experiments described below are merely for illustrative purposes.