One variation of the IMAX approach assumes that
both networks have *single binary* probabilistic
output units. In another variation,
both networks have single *real-valued*
output units. The latter case, however, requires
certain (not always realistic)
Gaussian assumptions about the input and output signals (see also
section 2.3 on Infomax).
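
To illustrate what a Gaussian assumption buys in the real-valued case, the following minimal sketch uses the standard closed form for the mutual information of two jointly Gaussian scalars, I = -0.5 ln(1 - rho^2), where rho is their correlation coefficient. It is not claimed to be the exact IMAX objective; the function name `gaussian_mutual_information` and the synthetic signals are assumptions made for illustration only.

```python
import math
import random


def gaussian_mutual_information(xs, ys):
    """Mutual information (in nats) of two scalar signals.

    Valid only under the assumption that the pair is jointly Gaussian,
    in which case I = -0.5 * ln(1 - rho^2).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    rho_sq = cov_xy ** 2 / (var_x * var_y)
    return -0.5 * math.log(1.0 - rho_sq)


# Hypothetical data: two noisy real-valued outputs driven by a shared signal.
rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(5000)]
out1 = [s + rng.gauss(0.0, 0.5) for s in signal]
out2 = [s + rng.gauss(0.0, 0.5) for s in signal]
unrelated = [rng.gauss(0.0, 1.0) for _ in range(5000)]

mi_shared = gaussian_mutual_information(out1, out2)
mi_unrelated = gaussian_mutual_information(out1, unrelated)
```

Only second-order statistics enter the estimate, which is why it can be misleading when the signals are far from Gaussian.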

In the case of *vector-valued* output
representations, Zemel and Hinton (1991) again make simplifying
Gaussian assumptions and maximize functions of the determinant
of the covariance matrices of the output activations
[Shannon, 1948] (see again section 2.3).
Maximizing such determinants can remove only *linear*
redundancy among the output units. (It should be mentioned, however,
that with Zemel and Hinton's approach the outputs may be
non-linear functions of the inputs.)
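
The limitation to linear redundancy can be made concrete with a small sketch. Under a Gaussian model, the redundancy among a set of units is 0.5 ln(product of variances / determinant of the covariance matrix), which is zero exactly when the units are uncorrelated. The code below is an illustration under that assumption, not Zemel and Hinton's objective; the helper names and the synthetic units are hypothetical.

```python
import math
import random


def covariance_matrix(columns):
    """Sample covariance matrix of equally long columns of values."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / n
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for row in range(col + 1, size):
            factor = m[row][col] / m[col][col]
            for k in range(col, size):
                m[row][k] -= factor * m[col][k]
    return det


def gaussian_redundancy(columns):
    """0.5 * ln(prod of variances / det of covariance), in nats."""
    cov = covariance_matrix(columns)
    log_variances = sum(math.log(cov[i][i]) for i in range(len(cov)))
    return 0.5 * (log_variances - math.log(determinant(cov)))


# Hypothetical output units: one linear and one non-linear copy of unit_a.
rng = random.Random(1)
unit_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
unit_linear = [a + rng.gauss(0.0, 0.5) for a in unit_a]
unit_square = [a * a for a in unit_a]

linear_redundancy = gaussian_redundancy([unit_a, unit_linear])
nonlinear_redundancy = gaussian_redundancy([unit_a, unit_square])
```

Here `unit_square` is a deterministic function of `unit_a`, yet the two are nearly uncorrelated, so the determinant-based measure reports almost no redundancy between them.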

An attractive property of IMAX is that it expresses the goal of finding mutually predictable yet informative input transformations in a principled way, in terms of a single objective function. In contrast, our approach involves two separate objective functions that have to be combined using a relative weight factor. An interesting feature of our approach is that it conceptually separates two issues: (A) the desire for discriminating mappings from input to representation, and (B) the desire for mutually predictable representations. There are many different approaches for satisfying (A), each with its own advantages and disadvantages. In the context of a given problem, the most appropriate one can be `plugged into' our basic architecture.
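
The separation of (A) and (B) can be sketched as a combined objective in which the discrimination term is passed in as a function. This is a minimal illustration under stated assumptions: the names `match_error`, `negative_variance`, and `combined_objective` are hypothetical, and output variance is just one possible stand-in for (A), not necessarily a term used in this paper.

```python
def match_error(outputs1, outputs2):
    """Issue (B): mean squared difference between the two networks' outputs."""
    return sum((a - b) ** 2 for a, b in zip(outputs1, outputs2)) / len(outputs1)


def negative_variance(outputs):
    """One hypothetical choice for issue (A).

    Lower (more negative) when the outputs vary more across patterns.
    """
    mean = sum(outputs) / len(outputs)
    return -sum((o - mean) ** 2 for o in outputs) / len(outputs)


def combined_objective(outputs1, outputs2, discrimination_term, weight):
    """Quantity to minimise: match error plus weighted discrimination terms."""
    discrimination = discrimination_term(outputs1) + discrimination_term(outputs2)
    return match_error(outputs1, outputs2) + weight * discrimination


# Constant outputs agree perfectly but carry no information about the input.
constant = [0.5, 0.5, 0.5, 0.5]
# Outputs that agree and also distinguish between input patterns.
informative = [0.0, 1.0, 0.0, 1.0]

trivial_score = combined_objective(constant, constant, negative_variance, 1.0)
useful_score = combined_objective(informative, informative, negative_variance, 1.0)
```

Both pairs have zero match error, so only the discrimination term favours the informative pair over the constant one. Swapping in a different function for `discrimination_term` leaves the rest unchanged.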

Another difference between IMAX and our approach is that our approach enforces not only mutual predictability but also equality of the two networks' outputs. This does not affect the generality of the approach. Note that one could introduce two additional `predictor networks': one for learning to predict the second network's output from the first network's output, and another for learning to predict the first network's output from the second's. Then one could design error functions enforcing mutual predictability (instead of using the essentially equivalent error function used in this paper). However, this would not increase the power of the approach and would only introduce unnecessary complexity. In fact, one advantage of our simple approach is that it makes it trivial to decide whether the outputs of both networks essentially represent the same thing.
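
Because equality is enforced, checking whether both networks represent the same thing reduces to a direct comparison of their outputs. The sketch below shows one way such a check might look; the function name `agreement_rate`, the tolerance, and the example values are assumptions for illustration.

```python
def agreement_rate(outputs1, outputs2, tolerance=0.1):
    """Fraction of patterns whose two outputs differ by at most `tolerance`."""
    pairs = list(zip(outputs1, outputs2))
    return sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)


# Hypothetical outputs of the two networks on four input patterns.
net1_outputs = [0.05, 0.95, 0.10, 0.90]
net2_outputs = [0.02, 0.97, 0.60, 0.88]

rate = agreement_rate(net1_outputs, net2_outputs)
```

With mere mutual predictability (for example, one output being the complement of the other), no such direct comparison would suffice and a fitted predictor would be needed to judge agreement.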

The following section includes an experiment that compares IMAX to our approach.