Understanding physical interactions in the real world is vital for robotic agents to complete challenging tasks. Through interactions with the environment, we would like the robot to learn about the scene, to improve its policy for different task goals, and to facilitate task execution by utilizing objects from the environment. In this talk, we will explore three directions for designing visual representations for physical interactions: learning transferable representations between different grasping tasks, learning task-oriented grasping for tool-based manipulation, and exploiting visual memory over long time horizons in unseen environments.
Kuan Fang is a PhD student in the Stanford Vision & Learning (SVL) Lab. His research interests span computer vision, machine learning, and robotics. Before coming to Stanford, he received his B.S. in Electrical Engineering from Tsinghua University. He has also worked as a student researcher and intern at Microsoft Research Asia, Google [x] Robotics, and Google Brain.
Existing work on multi-agent reinforcement learning usually considers settings where agents are jointly trained and/or their utilities are known. However, in many real-world problems, we often have little knowledge about other agents' mental states and cannot directly alter their policies either. The ability to model other agents, such as understanding their intentions and skills, is essential to the success of multi-agent reinforcement learning. In this talk, I will discuss my recent work on how to improve and incorporate agent modeling into multi-agent reinforcement learning in several novel settings. The first work proposes a new interactive agent modeling approach for improving imitation learning by learning to probe the target agent. In the second work, we formulate a multi-agent management problem in which a manager is trained to achieve optimal coordination among self-interested worker agents that have their own minds (preferences, intentions, skills, etc.) by initiating contracts that assign suitable tasks and the right bonuses to them so that they will agree to work together.
Tianmin Shu is currently a Ph.D. candidate in the Department of Statistics at the University of California, Los Angeles, where he studies social scene understanding and human-robot interactions. He has also interned at Facebook AI Research and Salesforce Research, working on multi-agent reinforcement learning. He is the recipient of the 2017 Cognitive Science Society Computational Modeling Prize in Perception/Action. His work on human-robot social interactions has also been featured in the media.
Robots and autonomous systems have been playing a significant role in the modern economy. Custom-built robots have remarkably improved productivity, operational safety, and product quality. These robots are usually programmed for specific tasks in well-controlled environments, but they are unable to quickly adapt to diverse tasks in unstructured environments. To build robots that can perform a wide range of tasks in the real world, we need to endow them with the capacity for learning and perception. In this talk, I will present my work on building general-purpose robots that interact with the unstructured world through their senses, flexibly perform a wide range of tasks, and adaptively learn new skills. I will discuss the importance of closing the perception-action loop for general-purpose robots and demonstrate my approaches in this research direction, in particular on learning primitive sensorimotor skills, building goal-directed behaviors, and programming new tasks from video demonstrations.
Yuke Zhu is a final year Ph.D. student in the Department of Computer Science at Stanford University, advised by Professor Fei-Fei Li and Professor Silvio Savarese.
Today's computer vision algorithms mostly obtain their understanding of the world through large, meticulously labeled datasets. In this talk, I'll discuss two recent methods that, instead, obtain their supervision by detecting anomalies in unlabeled multi-modal data. First, I'll present a self-supervised method for learning an audio-visual video representation. This method works by training a neural network to predict whether video frames and audio are temporally aligned (or synthetically misaligned). I'll then show how this representation can be used for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation.
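The self-supervision signal described above comes from constructing aligned and synthetically misaligned audio-video pairs. As a minimal illustration (not the speaker's actual implementation), the labeled pairs could be generated along these lines, where `frames` and `audio` stand in for parallel time-indexed sequences and `misalign_prob` and `min_shift` are hypothetical parameters:

```python
import random

def make_training_pair(frames, audio, misalign_prob=0.5, min_shift=10):
    """Return (frames, audio, label): label 1 if the audio track is
    temporally aligned with the frames, 0 if it was synthetically shifted.

    `frames` and `audio` are parallel sequences indexed by time step;
    this is a toy sketch of the pair-construction step, not the full
    network training pipeline.
    """
    assert len(frames) == len(audio)
    if random.random() < misalign_prob:
        # Synthetic misalignment: circularly shift the audio track, so a
        # classifier must learn audio-visual correspondence to detect it.
        shift = random.randint(min_shift, len(audio) - min_shift)
        audio = list(audio[shift:]) + list(audio[:shift])
        return frames, audio, 0
    return frames, list(audio), 1
```

A network trained to predict this binary label from the raw frames and audio is forced to learn which sights and sounds co-occur, which is what makes the representation useful for localization, recognition, and source separation downstream.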
Second, I'll present a method for detecting image manipulations (i.e., "photoshopped" images) that is trained using only a large dataset of real photographs. The algorithm uses automatically recorded EXIF photo metadata as a supervisory signal for training a model to determine whether an image is self-consistent — that is, whether its content could have been produced by a single imaging pipeline. I'll then show how to apply this self-consistency model to the task of detecting and localizing image splices. The proposed method obtains state-of-the-art performance on several image forensics benchmarks, despite never seeing any manipulated images during training. That said, it is merely a step in the long quest for a truly general-purpose visual forensics tool.
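The supervisory signal here is consistency: two patches drawn from the same photo should agree on the EXIF attributes of their source, while patches from different photos often will not. As a hedged sketch of that labeling step (the attribute names below, such as "CameraModel" and "Flash", are illustrative placeholders, not the actual attribute set used in the work):

```python
def exif_consistency_labels(exif_a, exif_b, attributes):
    """For each EXIF attribute, return 1 if the two source photos agree
    on it and 0 otherwise.

    These per-attribute agreement labels can supervise a model that
    predicts, from two image patches alone, whether they came from the
    same imaging pipeline -- no manipulated images are needed.
    """
    return {attr: int(exif_a.get(attr) == exif_b.get(attr))
            for attr in attributes}

# Example: two photos that share a camera model but differ in flash use.
photo_a = {"CameraModel": "X100", "Flash": "On"}
photo_b = {"CameraModel": "X100", "Flash": "Off"}
labels = exif_consistency_labels(photo_a, photo_b, ["CameraModel", "Flash"])
```

At test time, a model trained this way can compare patches within a single suspect image: regions whose predicted pipeline disagrees with the rest of the image are candidate splices.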
Andrew Owens is a postdoc at U.C. Berkeley. He received a Ph.D. in computer science from MIT in 2016, and a B.A. from Cornell University in 2010.