Open-World Object Manipulation using Pre-Trained Vision-Language Models
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g., "can you get me the pink stuffed whale?", to their sensory observations and actions. This poses a notably difficult challenge for robots: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen data of interactions with a stuffed whale before. Fortunately, static data on the internet has vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), which leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities for specifying the object of interest, such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation.
A grand milestone in robotics is to develop robots that can effectively perform various physical tasks for individuals in the real world.
The wide-ranging and diverse assortment of practical skills needed for such tasks presents a considerable challenge in creating an all-encompassing robot system.
Although current robotic systems have accomplished impressive feats, they are typically rigid and function only within a narrow range of behaviors, often only those they were specifically programmed or trained to perform.
Robotic capabilities can be measured along two axes: skills and objects. Skills are specific behaviors, such as "pick up X", "move X near Y", or "take the lid off of X", while objects are the entities on which skills operate, such as the apple in "pick up the apple", the can and the soda dispenser in "move the can near the soda dispenser", or the coffee jar in "take the lid off the coffee jar".
In this work, we focus on extending a limited set of skills to an unlimited set of new objects. We demonstrate the ability to execute manipulation skills on objects that were neither encountered during training nor explicitly programmed into the system in any way. We use natural language as the interface: the robot receives raw text and then executes the skill described by the text command. This raw text input can contain descriptions of any object, such as "grasp the pink stuffed elephant."
We have named our system MOO (Manipulation of Open-World Objects). In the following sections, we detail our methodology and data collection, and show that MOO achieves state-of-the-art generalization to "unseen" object categories.
Like RT-1, our system is learned end-to-end via behavioral cloning from human demonstrations. We utilize OWL-ViT on our robot demonstration training data to generate genuine detections for conditioning the manipulation policy during the training loop.
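As a rough illustration of how a detection might be turned into a conditioning signal, the sketch below converts one bounding box into a binary mask and appends it to the image as an extra channel. The single-pixel-at-box-center encoding, the function names, and the nested-list image layout are all assumptions made for this example; the text above does not specify how detections are represented, and a real system would obtain the box from OWL-ViT rather than hard-coding it.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_to_center_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Return a height x width binary mask with a single 1 at the box center."""
    x_min, y_min, x_max, y_max = box
    # Clamp the center so boxes touching the image border still index validly.
    col = min(max(int((x_min + x_max) / 2), 0), width - 1)
    row = min(max(int((y_min + y_max) / 2), 0), height - 1)
    mask = [[0] * width for _ in range(height)]
    mask[row][col] = 1
    return mask


def add_mask_channel(image, mask):
    """Append the mask as one extra channel to an H x W x C nested-list image."""
    return [
        [pixel + [m] for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Usage: a 4 x 8 black RGB image and one made-up detection.
image = [[[0, 0, 0] for _ in range(8)] for _ in range(4)]
mask = box_to_center_mask((2, 1, 6, 3), height=4, width=8)
conditioned = add_mask_channel(image, mask)  # now H x W x 4
```

Under this assumed encoding, the policy input keeps the full image while the extra channel marks where the object of interest is.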
Experiments and Results
We conduct extensive real-world evaluations of MOO performance on 49 seen objects and 47 unseen objects, where we find that MOO is able to generalize to novel objects. In addition, we ablate MOO's model capacity and training distribution, and find that high-capacity models with large amounts of diverse data are crucial to strong performance.
We explore using a generative VLM such as PaLI to generate an image caption, which is then provided to OWL-ViT to generate a mask usable by MOO. For example, PaLI can interpret human intent by identifying which object a person is pointing to, which MOO then successfully manipulates.
OWL-ViT need not be conditioned on a textual query; it can also be prompted with an image query, such as a stock image downloaded from the internet. Image-based querying is especially relevant when objects are hard to describe in words, such as when there are many visually similar objects in a scene. In these cases, it may be most effective to directly provide an image of the target of interest.
In cases where OWL-ViT or other VLMs fail to produce an accurate detection, we experiment with a GUI through which a human can directly supply the ground-truth mask that is provided to the policy. This is especially useful in cases with clutter or repeated objects, where textual or visual queries may be quite difficult even for state-of-the-art VLMs.
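The different ways of specifying the target can be thought of as interchangeable sources feeding the same policy input. The sketch below shows one possible way to arrange a detector query with a human fallback. Everything in it is hypothetical: the `Detection` type, the score threshold used to decide when a detection is trusted, and the callables standing in for the detector and the GUI are assumptions for illustration, not part of the described system, where a person judges when the detection is inaccurate.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]


@dataclass
class Detection:
    """Hypothetical detector output: a box plus a confidence score."""
    box: Box
    score: float


def select_detection(
    detections: Sequence[Detection], min_score: float
) -> Optional[Detection]:
    """Return the highest-scoring detection, or None if none is confident enough."""
    best = max(detections, key=lambda d: d.score, default=None)
    if best is None or best.score < min_score:
        return None
    return best


def resolve_target(
    detect: Callable[[str], Sequence[Detection]],
    human_fallback: Callable[[], Box],
    query: str,
    min_score: float = 0.5,  # assumed threshold, not taken from the source
) -> Tuple[Box, str]:
    """Use the detector when it is confident; otherwise ask the human."""
    chosen = select_detection(detect(query), min_score)
    if chosen is not None:
        return chosen.box, "detector"
    return human_fallback(), "human"


# Usage with stand-in callables instead of a real detector or GUI.
fake_detector = lambda q: [Detection((0, 0, 4, 4), 0.9), Detection((1, 1, 2, 2), 0.3)]
fake_gui = lambda: (9, 9, 9, 9)
target, source = resolve_target(fake_detector, fake_gui, "pink stuffed whale")
```

Because the policy only consumes the resulting target, the query could equally be a caption from a generative VLM or an image, provided the stand-in detector accepts it.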
Open-world manipulation can be integrated with open-vocabulary object goal navigation. An existing open-vocabulary object navigation algorithm, CLIP on Wheels (CoW), fits this role; we implement a variant of CoW and combine it with MOO, and refer to the combined system as CoW-MOO. CoW handles open-vocabulary navigation to an object of interest, and MOO then manipulates the target object.
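The handoff from navigation to manipulation can be pictured as a simple two-phase loop. The sketch below is only a schematic under stated assumptions: the step callables, the boolean "arrived" and "done" signals, and the step budget are hypothetical placeholders, and the text above does not describe how the actual CoW-MOO handoff is triggered.

```python
from typing import Callable, List


def run_cow_moo(
    navigate_step: Callable[[], bool],    # hypothetical: returns True on arrival
    manipulate_step: Callable[[], bool],  # hypothetical: returns True when done
    max_steps: int = 100,
) -> List[str]:
    """Run navigation until arrival, then manipulation until done.

    Returns the sequence of phases executed, one entry per control step.
    """
    phase = "navigate"
    log: List[str] = []
    for _ in range(max_steps):
        log.append(phase)
        if phase == "navigate":
            if navigate_step():
                phase = "manipulate"  # hand off to the manipulation policy
        elif manipulate_step():
            break
    return log


# Usage with scripted stand-ins: arrive on the third step, finish on the fifth.
nav_signals = iter([False, False, True])
manip_signals = iter([False, True])
phases = run_cow_moo(lambda: next(nav_signals), lambda: next(manip_signals))
```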
The authors would like to thank Alex Irpan, Brian Ichter, Clayton Tan, Grecia Salazar, Kanishka Rao, Nikhil Joshi, Noah Brown and the greater Robotics @ Google team for their feedback and contributions.
The website template was taken from Jon Barron.