I will discuss two projects in which we use unsupervised deep learning, combined with large computer-rendered stimulus sets, as a framework for understanding how brains learn rich scene representations without ground-truth world information. By learning to generate novel images, or to predict the next frame in video sequences, our models spontaneously learn to cluster images according to underlying scene properties such as illumination, shape, and material. We probe the models’ representation of material gloss in detail and find that they predict human-perceived glossiness remarkably well, both for novel surfaces and for network-generated images. Strikingly, the networks also correctly predict known failures of gloss perception on an image-by-image basis; for example, bumpier surfaces tend to appear glossier than flatter ones, even when made of identical material. A supervised DNN and several other control models fail to capture these effects. Perceptual dimensions such as ‘glossiness’, which appear to estimate properties of the physical world, can thus emerge spontaneously from learning to efficiently encode sensory data. Indeed, unsupervised learning principles may account for a large number of perceptual dimensions in vision and beyond!