VSS 2020 Poster


Despite the impressive achievements of supervised deep neural networks, brains must learn to represent the world without access to ground-truth training data. We propose that perception of distal properties arises instead from unsupervised learning objectives, such as temporal prediction, applied to proximal sensory data. To test this, we rendered 10,000 videos of objects moving with random rotational axis, speed, illumination, and reflectance. We trained a four-layer recurrent ‘PredNet’ network to predict the pixels of the next frame in each video. After training, object shape, material, position, and illumination could be decoded for new videos by taking linear combinations of unit activations. Representations were hierarchical, with scene properties better estimated from deep than shallow layers (e.g., material reflectance could be predicted with $R^2 = 0.92$ from layer 4, but only $R^2 = 0.54$ from layer 1). Visualising single ‘neurons’ revealed selectivity for distal features: a ‘shadow unit’ in layer 4 responds exclusively to image locations containing the object’s shadow, while a ‘reflectance edge’ unit in layer 3 tracks image edges caused by reflectance changes. Material decoding accuracy was higher for moving than for static objects, and increased over the first five frames, indicating that the model is sensitive to motion features that disambiguate reflective from textured surfaces. To test whether these features are similar to those used by humans, we rendered test stimuli depicting reflective objects that were either static, moving, or moving with ‘reflections’ fixed to their surface. All conditions had near-identical static image properties, but motion cues in the two moving conditions give rise to glossy and matte percepts, respectively. Model-predicted gloss agreed with human judgements of the relative glossiness of all stimuli.
Our results suggest unsupervised deep learning discovers motion cues to material similar to those represented in human vision, and provides a framework for understanding how brains learn rich scene representations without ground-truth world information.
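
The linear decoding analysis described above (reading out a scene property from linear combinations of unit activations, and comparing layers by held-out $R^2$) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the activations are synthetic stand-ins rather than PredNet outputs, the function names `fit_linear_decoder` and `r_squared` are hypothetical, and ordinary least squares is used because the abstract does not specify the regression method.

```python
import numpy as np

def _with_bias(acts):
    # Append a constant column so the decoder can learn an offset.
    return np.hstack([acts, np.ones((acts.shape[0], 1))])

def fit_linear_decoder(acts, target):
    """Least-squares weights mapping unit activations to a scene property."""
    weights, *_ = np.linalg.lstsq(_with_bias(acts), target, rcond=None)
    return weights

def r_squared(acts, target, weights):
    """Coefficient of determination of the decoder on the given data."""
    pred = _with_bias(acts) @ weights
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (NOT the poster's data): one scalar "reflectance"
# per video, and two fake layers that encode it with different noise levels.
rng = np.random.default_rng(0)
n_videos, n_units = 400, 20
reflectance = rng.uniform(0.0, 1.0, n_videos)
mixing = rng.normal(size=n_units)

def fake_layer(noise_sd):
    signal = np.outer(reflectance, mixing)
    return signal + rng.normal(scale=noise_sd, size=(n_videos, n_units))

layers = {"shallow": fake_layer(2.0), "deep": fake_layer(0.5)}

# Fit on the first 300 videos, evaluate on the remaining 100.
train, test = slice(0, 300), slice(300, None)
scores = {}
for name, acts in layers.items():
    w = fit_linear_decoder(acts[train], reflectance[train])
    scores[name] = r_squared(acts[test], reflectance[test], w)

print(scores)
```

Evaluating on held-out videos matters here: a decoder with many units can fit training data well by chance, so only the test-set $R^2$ says how linearly accessible the property is in a given layer.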

Learning to see gloss by predicting videos
Banyan Breezeway, VSS 2020, St. Pete Beach, Florida