Tomaso Poggio (CBCL, McGovern Institute, Massachusetts Institute of Technology, USA)
Abstract: I conjecture that the sample complexity of object recognition is mostly due to geometric image transformations and that the main goal of the ventral stream is to learn-and-discount image transformations. The theory provides a simple, biologically plausible one-layer module that learns generic affine transformations in R2 and becomes invariant to them for any new image. The module's properties can be understood in terms of a slight extension of the Johnson-Lindenstrauss theorem. The theory then argues that although this approach is likely to have been discovered early by evolution and can deal with a few very important objects, it is too expensive in storage to handle a potentially infinite set of objects to be recognized. The ability to deal with an unlimited number of objects is obtained by a hierarchical, multi-layer architecture for which we prove local and global invariance of parts and wholes, respectively, to the affine group on R2. With the additional assumption of online, Hebbian-like learning mechanisms in various visual areas, the theory predicts the tuning of neurons in V1, V2, V4 and IT in terms of the spectral properties of the covariance of the learned transformations at different receptive field sizes. The theory predicts that class-specific transformations are learned and represented at the top of the ventral stream hierarchy; it also explains why a patch of face neurons tuned to mirror-symmetric views should be expected before pose invariance emerges in the face network of visual cortex. If the theory were true, the ventral stream would mirror the symmetry properties of the image transformations induced by changes of viewpoint in the physical world.
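The core idea of learning-and-discounting a transformation can be illustrated with a minimal sketch: store the orbit of a template under a group of transformations, then pool the dot products of a new image with that orbit. Transforming the image only permutes the set of dot products, so any permutation-invariant pooling yields an exactly invariant signature. The sketch below is mine, not the paper's: it uses 1D circular shifts as a stand-in for the affine group on R2, and the names `signature` and `template` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def signature(x, template):
    """Pool dot products of x with every circular shift of the template.

    Shifting x circularly only permutes the set of dot products with the
    template's orbit, so any permutation-invariant pooling (here: the
    sorted values, from which max, mean, and moments follow) gives a
    signature that is exactly invariant to the shift group."""
    n = len(template)
    dots = np.array([x @ np.roll(template, k) for k in range(n)])
    return np.sort(dots)

template = rng.standard_normal(32)   # one stored template and its orbit
image = rng.standard_normal(32)      # a new, never-seen image

s0 = signature(image, template)
s7 = signature(np.roll(image, 7), template)  # shifted version of the image
assert np.allclose(s0, s7)           # signature is invariant to the shift
```

The invariance holds for any new image because the template's orbit, learned once, encodes the transformation itself; this is the sense in which one stored set of transformed templates can "discount" the transformation for all future inputs.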