Learning How to Move and Where to Look from Unlabeled Video

The status quo in visual recognition is to learn from batches of unrelated Web photos labeled by human annotators. Yet cognitive science tells us that perception develops in the context of acting and moving in the world, and without intensive supervision. How can unlabeled video augment computational visual learning? I'll describe our recent work exploring how a system can learn effective representations by watching unlabeled video. First, we consider how the ego-motion signals accompanying a video provide a valuable cue during learning, allowing the system to internalize the link between “how I move” and “what I see.” Next, I explore how the temporal coherence of video permits new forms of invariant feature learning, whether by capturing how object-centric regions evolve over time or by representing higher-order consistency in visual changes. Incorporating these ideas into various recognition tasks, we demonstrate the power of learning from ongoing, unlabeled visual observations, which in some cases even overtakes traditional heavily supervised approaches. Finally, I examine how, simply by having seen unlabeled human-taken videos, a system can learn to mimic human videographer tendencies, automatically creating normal field-of-view video out of unedited 360-degree panoramas.