Models of vision have come far in the past decade. Deep neural networks can recognise objects with near-human accuracy and predict brain activity in high-level visual regions. However, most networks require supervised training using ground-truth labels for millions of images, whereas brains must somehow learn from sensory experience alone. We have been using unsupervised deep learning, combined with computer-rendered artificial environments, as a framework to understand how brains learn rich scene representations without ground-truth information about the world. An unsupervised generative neural network spontaneously clustered images according to scene properties like material and illumination, despite receiving no explicit information about them. Strikingly, the resulting representations also predicted specific patterns of ‘successes’ and ‘errors’ in human perception, like the tendency for bumpier surfaces to appear glossier than flatter ones with identical materials. In contrast, a supervised network and a diverse set of alternative models failed to predict these effects. We think that perceptual dimensions like ‘glossiness’, which seem to estimate properties of the physical world, can emerge spontaneously from learning to efficiently encode sensory data – indeed, unsupervised learning principles might underlie many perceptual dimensions, in vision and beyond!
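The core idea – that dimensions tracking physical scene properties can fall out of efficient unsupervised encoding, with no labels – can be illustrated with a toy sketch. This is not the actual model (which was a deep generative network trained on rendered images); the factor names, dimensionalities, and linear "renderer" here are invented for illustration, and PCA stands in as the simplest possible efficient encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generative factors for 500 toy "scenes":
# e.g. gloss and bumpiness, each drawn uniformly (names illustrative).
n = 500
gloss = rng.random(n)
bump = rng.random(n)

# "Render" each scene as a 50-pixel image: each factor drives a fixed
# random direction in pixel space, plus a little pixel noise.
d = 50
dir_gloss = rng.standard_normal(d)
dir_bump = rng.standard_normal(d)
images = np.outer(gloss, dir_gloss) + np.outer(bump, dir_bump)
images += 0.05 * rng.standard_normal((n, d))

# Unsupervised efficient coding: keep the two directions of greatest
# variance (PCA via SVD). The encoder never sees the factor values.
centered = images - images.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
codes = centered @ vt[:2].T  # 2-dim latent code per image

def r2(target, code):
    # Fraction of a generative factor's variance explained by the code.
    coef, *_ = np.linalg.lstsq(code, target - target.mean(), rcond=None)
    resid = target - target.mean() - code @ coef
    return 1 - resid.var() / target.var()

print(r2(gloss, codes), r2(bump, codes))  # both should be near 1 here
```

In this toy setup the learned two-dimensional code spans essentially the same subspace as the generative factors, so each factor can be read out linearly from the unsupervised code – a minimal analogue of scene properties becoming explicit without ever being supervised.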