Consider a perceptual system being exposed to an unknown environment. The system has some kind of internal `state' to represent external events. We consider the general case where the state is an *n*-dimensional distributed representation *y^p* (a vector of real-valued or binary code symbols *y^p_i*) created in response to the *p*-th input vector *x^p*.

An ambitious and potentially powerful objective of unsupervised
learning is
to represent the environment such that
the various parts of the representation are *statistically independent*
of each other. In other words, we would like to have methods for
decomposing
the environment into entities that belong together
and do not
have much to do with other entities (`learning to divide and conquer').
This notion is captured by the concept of `discovering factorial
codes' (Barlow et al., 1989).

The aim of `factorial coding' is the following: Given the statistical properties of the inputs from the environment, find *invertible* internal representations such that the occurrence of the *i*-th code symbol is statistically independent of the occurrences of all the others.
Such
representations are called factorial because they have
a remarkable and unique property:
The probability
of the occurrence of a particular input is simply the product
of the probabilities of the corresponding code symbols.
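Stated symbolically (a sketch in my own notation, consistent with the code symbols introduced above but not a formula taken from the original text):

```latex
% For an invertible code x \mapsto y(x) = (y_1(x), \ldots, y_n(x)),
% `factorial' means that the joint distribution of the code symbols factorizes:
P(y_1, \ldots, y_n) \;=\; \prod_{i=1}^{n} P(y_i) ,
% and invertibility then yields the stated property of input probabilities:
P(x) \;=\; P(y_1(x), \ldots, y_n(x)) \;=\; \prod_{i=1}^{n} P(y_i(x)) .
```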

Among the advantages of factorial codes are the following:

*(1) `Optimal' input segmentation.*
An efficient method for discovering mutually
independent features would have
consequences for many segmentation tasks.
For instance, consider the case where the inputs are given
by retinal images of various objects with mutually independent positions.
At any given time, the activations of nearby
`pixels' caused by the same object are statistically
correlated.
Therefore a factorial code would not represent
them by different parts of the internal storage.
Instead, a factorial code could be created by finding
input transformations corresponding to the abstract concept
of `object position': Positions of different objects
should be represented by different code symbols.
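To make the pixel-correlation argument concrete, here is a minimal toy construction of my own (a one-dimensional `retina' with a single object at a uniformly random position; this is not an experiment from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_pix, width = 20_000, 12, 3  # samples, retina size, object width (assumed)

# One object of fixed width at a uniformly random position per image.
pos = rng.integers(0, n_pix - width + 1, size=N)
images = np.zeros((N, n_pix))
for k in range(width):
    images[np.arange(N), pos + k] = 1.0

# Nearby pixels driven by the same object are statistically correlated ...
corr = np.corrcoef(images, rowvar=False)
print(round(corr[5, 6], 2))  # adjacent interior pixels: clearly positive

# ... so a pixel-wise code is far from factorial, whereas the single
# `position' variable pos is a compact, independent description.
```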

*(2) Speeding up supervised learning.*
As Becker (1991) observes, if a representation
with uncorrelated components
is used as the input to a higher-level linear supervised
learning network,
then the Hessian of the supervised network's
error function is diagonal, thus
allowing efficient methods for speeding up learning
(note that statistical independence is a stronger criterion
than the mere absence of statistical correlation).
Non-linear networks ought to profit as well.
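To see why the Hessian becomes diagonal, consider a single linear unit with squared error; the sketch below (the NumPy setup and names are my own assumptions, not taken from Becker's paper) checks the claim numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean inputs with statistically uncorrelated components
# (independent Gaussians with different scales, as an assumption).
N, n = 10_000, 4
X = rng.standard_normal((N, n)) * np.array([1.0, 0.5, 2.0, 0.1])

# For a linear unit y = w . x with squared error
#   E(w) = (1/2N) * sum_p (w . x^p - t^p)^2,
# the Hessian is d2E/dw dw^T = (1/N) * sum_p x^p x^p^T,
# i.e. the second-moment matrix of the inputs, independent of the targets.
H = X.T @ X / N

print(np.round(H, 2))  # off-diagonal entries vanish up to sampling noise
```

With correlated input components, the off-diagonal entries of H would reappear, which is why decorrelation alone already helps second-order acceleration schemes.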

*(3) Occam's razor.*
Any method for finding factorial codes automatically implements
Occam's razor which prefers simpler models over
more complex ones, where simplicity is defined as the number
of storage cells necessary to represent the environment in a factorial
fashion.
If there are more storage cells than necessary to implement
a factorial code, then the independence criterion
is met by letting all superfluous units emit constant values
in response to all inputs. This implies storage efficiency, as
well as a potential for better generalization capabilities.
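To spell out why constant units do not violate independence (a step the text leaves implicit): a superfluous unit emitting a constant *c* satisfies

```latex
P(y_j = c) = 1
\quad\Longrightarrow\quad
P(y_j = c \mid \{y_i : i \neq j\}) \;=\; 1 \;=\; P(y_j = c) ,
```

so it is statistically independent of every other code symbol and contributes only a trivial factor of 1 to the product of symbol probabilities.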

*(4) Novelty detection.*
As Barlow, Kaushal, and Mitchison (1989) point out, with factorial codes the detection of dependence between two symbols indicates hitherto undiscovered associations.
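A rough sketch of how such a dependence check might look in practice (the histogram-based mutual-information estimate below, and all names in it, are hypothetical illustrations of mine, not a method from Barlow et al.):

```python
import numpy as np

def pairwise_mi(a, b, bins=8):
    """Crude plug-in estimate of the mutual information (in nats)
    between two code symbols, from samples, via a 2-D histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
y1 = rng.standard_normal(50_000)
y2 = rng.standard_normal(50_000)              # independent of y1
y3 = y1 + 0.5 * rng.standard_normal(50_000)   # dependent on y1

print(pairwise_mi(y1, y2))  # ~ 0: what a factorial code should report
print(pairwise_mi(y1, y3))  # clearly > 0: flags a new association
```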

Barlow et al. do not present an efficient general method for finding factorial codes (but they propose a few sequential `non-neural' heuristic methods). Existing `neural' methods for decreasing output redundancy (e.g. Linsker (1988), Zemel and Hinton (1991), Oja (1989), Sanger (1989), Földiák (1990), Rubner and Schulten (1990), Silva and Almeida (1991)) are restricted to the linear case and do not aim at the ambitious general goal of statistical independence. In addition, some of these methods require Gaussian assumptions about the input and output signals, as well as the explicit calculation of the derivative of the determinant of the output covariance matrix (Shannon, 1948).

The main contribution of this paper is a simple but general `neural' architecture (plus the appropriate objective functions) for finding factorial codes.

I would not be surprised, however, if the general problem of finding factorial codes turned out to be NP-hard. In that case, gradient-based procedures as described herein could not be expected to always find factorial codes. The paper at hand focuses on the novel basic principle without trying to provide solutions for the old problem of local maxima. Also, the purpose of this report is not to compare the performance of algorithms based on the novel principle to the performance of existing sequential `non-neural' heuristic methods (Barlow et al., 1989). The toy experiments described below are merely for illustrative purposes.
