Humans demonstrate a remarkable ability to generalize their knowledge and skills to new, unseen scenarios. One of the primary reasons is that they learn continually, acting in the environment and adapting to novel circumstances. This is in sharp contrast to our current machine learning algorithms, which are incredibly narrow and perform only the tasks they are explicitly trained for. The root cause is their reliance on human labels, which forces training to happen once, ahead of time, rather than continuously throughout the life of the agent. In this talk, I will present our initial efforts toward formulating artificial agents that continually learn to perceive and perform complex sensorimotor tasks across a stream of new scenarios. I will focus on two key aspects: representation and action. Our embodied agent can 1) learn to represent its sensory input via self-supervision, and 2) map its learned representation to motor outputs via curiosity. This combination should equip the agent with the skills required to plan actions for solving complex tasks. I will discuss results on self-supervised representation learning for visual input, an agent that learns to play video games purely from curiosity, and a real robot that learns to manipulate ropes, navigates office environments driven by self-supervision, and performs pick-and-place interactions to learn about objects.
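The curiosity signal mentioned above can be illustrated with a toy sketch: reward the agent in proportion to how badly a learned forward model predicts the consequences of its actions, so that well-explored transitions gradually stop being rewarding. Everything below (the random-projection encoder, the linear forward model, the dimensions) is a hypothetical stand-in for the learned networks used in the actual work, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature encoder": a fixed random projection standing in for a
# learned embedding of the agent's observation (hypothetical stand-in).
W_enc = rng.normal(size=(8, 4))

def encode(obs):
    return np.tanh(obs @ W_enc)

class ForwardModel:
    """Linear forward model: predicts next features from current
    features and a one-hot action."""

    def __init__(self, feat_dim=4, n_actions=3, lr=0.1):
        self.W = rng.normal(scale=0.1, size=(feat_dim + n_actions, feat_dim))
        self.n_actions = n_actions
        self.lr = lr

    def predict(self, phi, action):
        a = np.eye(self.n_actions)[action]
        return np.concatenate([phi, a]) @ self.W

    def curiosity_reward(self, phi, action, phi_next):
        # Intrinsic reward = prediction error of the forward model:
        # surprising transitions are rewarding, familiar ones are not.
        err = phi_next - self.predict(phi, action)
        return 0.5 * float(err @ err)

    def update(self, phi, action, phi_next):
        # One SGD step on the squared prediction error.
        a = np.eye(self.n_actions)[action]
        x = np.concatenate([phi, a])
        err = self.predict(phi, action) - phi_next
        self.W -= self.lr * np.outer(x, err)

model = ForwardModel()
obs, obs_next, action = rng.normal(size=8), rng.normal(size=8), 1
phi, phi_next = encode(obs), encode(obs_next)

# Repeatedly experiencing the same transition makes it predictable,
# so the curiosity reward for it decays toward zero.
r_before = model.curiosity_reward(phi, action, phi_next)
for _ in range(200):
    model.update(phi, action, phi_next)
r_after = model.curiosity_reward(phi, action, phi_next)
print(r_after < r_before)  # → True: the familiar transition is less "interesting"
```

The key property the sketch demonstrates is the one the talk relies on: the intrinsic reward is high only where the agent's model of the world is still wrong, which pushes exploration toward novelty without any external reward.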
Deepak Pathak is a Ph.D. candidate in Computer Science at UC Berkeley, advised by Prof. Trevor Darrell and Prof. Alexei A. Efros. His research spans deep learning, computer vision, and robotics. Deepak is a recipient of the NVIDIA Graduate Fellowship, the Snapchat Fellowship, and the Facebook Graduate Fellowship, and his research has been featured in popular press outlets, including MIT Technology Review, The Economist, Quanta Magazine, and The Wall Street Journal. Deepak received his Bachelor's in Computer Science and Engineering from IIT Kanpur, where he was a recipient of the Gold Medal and the best undergraduate thesis award. He has also spent time at Facebook AI Research and Microsoft Research.
Understanding physical interactions in the real world is vital for robotic agents to complete challenging tasks. Through interactions with the environment, we would like the robot to learn about the scene, to improve its policy for different task goals, and to facilitate task execution by utilizing objects from the environment. In this talk, we will explore three directions for designing visual representations for physical interactions: learning transferable representations between different grasping tasks, learning task-oriented grasping for tool-based manipulation, and exploiting visual memory over long time horizons in unseen environments.
Kuan Fang is a PhD student at the Stanford Vision & Learning (SVL) Lab. His research interests span computer vision, machine learning, and robotics. Before coming to Stanford, he received his B.S. in Electrical Engineering from Tsinghua University. He has also worked as a student researcher and intern at Microsoft Research Asia, Google [x] Robotics, and Google Brain.
Existing work on multi-agent reinforcement learning usually considers settings where agents are jointly trained and/or their utilities are known. However, in many real-world problems, we often have little knowledge about other agents' mental states and cannot directly alter their policies either. The ability to model other agents, such as understanding their intentions and skills, is essential to the success of multi-agent reinforcement learning. In this talk, I will discuss my recent work on how to improve and incorporate agent modeling into multi-agent reinforcement learning in several novel settings. The first work proposes a new interactive agent modeling approach that improves imitation learning by learning to probe the target agent. In the second work, we formulate a multi-agent management problem in which a manager is trained to achieve optimal coordination among self-interested worker agents that have their own minds (preferences, intentions, skills, etc.): by initiating contracts that assign suitable tasks and the right bonuses, the manager gets the workers to agree to work together.
Tianmin Shu is currently a Ph.D. candidate in the Department of Statistics at the University of California, Los Angeles, where he studies social scene understanding and human-robot interactions. He has also interned at Facebook AI Research and Salesforce Research, working on multi-agent reinforcement learning. He is the recipient of the 2017 Cognitive Science Society Computational Modeling Prize in Perception/Action. His work on human-robot social interactions has also been featured in the media.
Robots and autonomous systems play a significant role in the modern economy. Custom-built robots have remarkably improved productivity, operational safety, and product quality. These robots are usually programmed for specific tasks in well-controlled environments, but they are unable to quickly adapt to diverse tasks in unstructured environments. To build robots that can perform a wide range of tasks in the real world, we need to endow them with the capacity for learning and perception. In this talk, I will present my work on building general-purpose robots that interact with the unstructured world through their senses, flexibly perform a wide range of tasks, and adaptively learn new skills. I will discuss the importance of closing the perception-action loop for general-purpose robots and demonstrate my approaches in this research direction, in particular on learning primitive sensorimotor skills, building goal-directed behaviors, and programming new tasks from video demonstrations.
Yuke Zhu is a final year Ph.D. student in the Department of Computer Science at Stanford University, advised by Professor Fei-Fei Li and Professor Silvio Savarese.
Today's computer vision algorithms mostly obtain their understanding of the world through large, meticulously labeled datasets. In this talk, I'll discuss two recent methods that, instead, obtain their supervision by detecting anomalies in unlabeled multi-modal data. First, I'll present a self-supervised method for learning an audio-visual video representation. This method works by training a neural network to predict whether video frames and audio are temporally aligned (or synthetically misaligned). I'll then show how this representation can be used for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation.
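The self-supervised objective described here can be sketched in a few lines: positive examples pair video frames with their own audio track, negatives shift the audio in time, and a model learns to tell the two apart. The sketch below is a toy proxy built on assumed synthetic features (a shared latent signal standing in for real frames and spectrograms), and the simple `sync_score` correlation statistic stands in for the trained two-stream network.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clip(T=16, d=4):
    """Toy clip: video and audio features that co-vary over time
    (a stand-in for real frame and spectrogram features)."""
    shared = rng.normal(size=(T, d))
    video = shared + 0.1 * rng.normal(size=(T, d))
    audio = shared + 0.1 * rng.normal(size=(T, d))
    return video, audio

def make_pair(video, audio, aligned, shift=8):
    """Self-supervised labeling: positives keep the audio in sync with
    the frames; negatives circularly shift the audio track in time."""
    if aligned:
        return video, audio, 1
    return video, np.roll(audio, shift, axis=0), 0

def sync_score(video, audio):
    """A simple alignment statistic: cosine similarity of the centered
    feature sequences. (The actual method trains a network instead.)"""
    v = video - video.mean(axis=0)
    a = audio - audio.mean(axis=0)
    return float((v * a).sum() / (np.linalg.norm(v) * np.linalg.norm(a)))

video, audio = make_clip()
v_pos, a_pos, y_pos = make_pair(video, audio, aligned=True)
v_neg, a_neg, y_neg = make_pair(video, audio, aligned=False)

# Aligned pairs score higher than time-shifted ones, which is the
# signal a classifier can learn without any human labels.
print(sync_score(v_pos, a_pos) > sync_score(v_neg, a_neg))  # → True
```

Because the labels come for free from the temporal structure of video, the training set can be arbitrarily large, which is what makes the learned representation useful for the downstream tasks (a)-(c) listed above.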
Second, I'll present a method for detecting image manipulations (i.e., "photoshopped" images) that is trained using only a large dataset of real photographs. The algorithm uses automatically recorded EXIF photo metadata as a supervisory signal for training a model to determine whether an image is self-consistent, that is, whether its content could have been produced by a single imaging pipeline. I'll then show how to apply this self-consistency model to the task of detecting and localizing image splices. The proposed method obtains state-of-the-art performance on several image forensics benchmarks despite never seeing manipulated images during training. That said, it is merely a step in the long quest for a truly general-purpose visual forensics tool.
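The self-consistency idea can be illustrated with a toy sketch: if every patch of a photo carries the signature of one imaging pipeline, a patch spliced in from another photo will disagree with its neighbors. The "fingerprints" and distance threshold below are hypothetical stand-ins; the actual method trains a network to predict whether two patches share EXIF attributes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-camera "fingerprints": each imaging pipeline leaves a
# characteristic signature on its patches (a stand-in for the statistics
# the real model associates with shared EXIF attributes).
fingerprints = {"cam_a": np.zeros(4), "cam_b": np.full(4, 2.0)}

def patch(camera):
    # Toy patch descriptor: camera fingerprint plus small content noise.
    return fingerprints[camera] + 0.1 * rng.normal(size=4)

def consistent(p, q, thresh=1.0):
    # Self-consistency test: could these two patches have come from the
    # same imaging pipeline? Here we simply threshold a distance between
    # descriptors; the real model is learned from EXIF supervision.
    return float(np.linalg.norm(p - q)) < thresh

# A "spliced" image: mostly cam_a patches with one cam_b patch pasted in.
patches = [patch("cam_a"), patch("cam_a"), patch("cam_b"), patch("cam_a")]
reference = patches[0]

# Localize the splice: flag patches inconsistent with the reference.
flags = [not consistent(reference, p) for p in patches]
print(flags)  # → [False, False, True, False]
```

Note that no manipulated image is needed anywhere in this pipeline: only real photographs (with their metadata) are required to learn what "consistent" looks like, mirroring the training setup described above.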
Andrew Owens is a postdoc at U.C. Berkeley. He received a Ph.D. in computer science from MIT in 2016, and a B.A. from Cornell University in 2010.