Learning to See

It is an exciting time for computer vision. With the success of new computational architectures for visual processing, such as deep neural networks (e.g., convNets) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. Computer vision is now present among many commercial products, such as digital cameras, web applications, security applications, etc. The performances achieved by convNets are remarkable and constitute the state of the art on many recognition tasks. But why it works so well? what is the nature of the internal representation learned by the network? I will show that the internal representation can be interpretable. In particular, object detectors emerge in a scene classification task. Then, I will show that an ambient audio signal can be used as a supervisory signal for learning visual representations. We do this by taking advantage of the fact that vision and hearing often tell us about similar structures in the world, such as when we see an object and simultaneously hear it make a sound. We train a convNet to predict ambient sound from video frames, and we show that, through this process, the model learns a visual representation that conveys significant information about objects and scenes.

RELATED CATEGORIES

Learning to See

Antonio Torralba

RELATED CATEGORIES

MORE VIDEOS FROM THE EVENT

MORE VIDEOS FROM THE SAME CATEGORIES