Open-World Object Manipulation using Pre-Trained Vision-Language Models

  • Austin Stone*
  • Ted Xiao*
  • Yao Lu*
  • Keerthana Gopalakrishnan
  • Kuang-Huei Lee
  • Quan Vuong
  • Paul Wohlhart
  • Brianna Zitkovich
  • Fei Xia
  • Chelsea Finn
  • Karol Hausman

  • *Denotes equal contribution.


For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g. "can you get me the pink stuffed whale?", to their sensory observations and actions. This brings up a notably difficult challenge for robots: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data interacting with a stuffed whale before. Fortunately, static data on the internet has vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), which leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities to specify the object of interest, such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation.

We train a language-conditioned policy conditioned on object localizations from a frozen vision-language model (VLM). The policy is trained on demonstrations spanning a diverse set of 106 objects utilizing object-centric representations generated by a VLM, enabling the policy to generalize to novel objects and object localizations produced from modalities unseen during training.


A grand milestone in robotics is to develop robots that can effectively perform various physical tasks for individuals in the real world. The wide-ranging and diverse assortment of practical skills needed for such tasks presents a considerable challenge in creating an all-encompassing robot system. Although current robotic systems have accomplished impressive feats, these systems are typically rigid and only function in a narrow range of behaviors, often only those which they were specifically programmed or trained to perform. Robotic capabilities can be measured based on two criteria: skills and objects. Skills refer to specific behaviors, such as "pick up X", "move X near Y", or "take the lid off of X", while objects refer to the entities upon which skills operate, such as the apple in "pick up the apple", the can and the soda dispenser in "move the can near the soda dispenser", or the coffee jar in "take the lid off the coffee jar".

In this work, we focus on extending a limited set of skills to an unlimited set of new objects. We demonstrate the ability to execute manipulation skills on objects that were neither encountered during training nor explicitly programmed into the system in any way. In our work, we use the interface of natural language, where the robot receives raw text and then executes the skill as described by the text command. This raw text input can contain descriptions of any object, such as "grasp the pink stuffed elephant."

We have named our system MOO (Manipulation of Open-World Objects). In the following sections, we detail our methodology and data collection, and show that MOO achieves state-of-the-art generalization for "unseen" object categories.


Our system employs an open-vocabulary object detector, specifically OWL-ViT, to link natural language with objects in a visual image. Given a command for a known skill but an unknown object, we break down the command into the skill and object components and provide the text describing the object to OWL-ViT in order to obtain a bounding box indicating where the object(s) reside in the robot's RGB camera image. We enhance the general-purpose transformer architecture of our previous work, RT-1, by incorporating the object locations in the form of a segmentation mask and excluding the text embedding of the objects. RT-1 tends to be brittle when confronted with previously unseen objects because it has seen data across a limited set of only 16 object types during training, and it relied on natural language embeddings to determine which objects to manipulate. Given a previously unseen object, RT-1 will be confronted with a novel language embedding, making it difficult to generalize. Conversely, in MOO the segmentation mask representation is identical for both previously seen and novel unseen objects, which simplifies generalization. A diagram of our architectural modifications to RT-1 is shown above.
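The two steps around the detector call can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the skill templates, the function names `split_command` and `box_to_mask`, and the filled-rectangle mask encoding are all assumptions made for this example. The OWL-ViT call itself is omitted, and the bounding box is assumed to be already available in pixel coordinates.

```python
import re

# Hypothetical skill templates, modeled on the skill examples in the text.
SKILL_PATTERNS = [
    ("move_near", re.compile(r"^move (?:the )?(.+?) near (?:the )?(.+)$")),
    ("pick", re.compile(r"^pick up (?:the )?(.+)$")),
]


def split_command(command):
    """Split a templated command into (skill, [object phrases])."""
    text = command.strip().lower().rstrip(".")
    for skill, pattern in SKILL_PATTERNS:
        match = pattern.match(text)
        if match:
            return skill, list(match.groups())
    raise ValueError(f"unrecognized command: {command!r}")


def box_to_mask(box, height, width):
    """Rasterize an (x_min, y_min, x_max, y_max) pixel box into a binary mask.

    Assumed encoding: 1 inside the box, 0 elsewhere. The mask format used
    by the actual system may differ.
    """
    x_min, y_min, x_max, y_max = box
    return [
        [1 if x_min <= x < x_max and y_min <= y < y_max else 0
         for x in range(width)]
        for y in range(height)
    ]


skill, objects = split_command("Move the can near the soda dispenser.")
# The object phrases would be sent to the detector; the skill goes to the policy.
mask = box_to_mask((1, 0, 3, 2), height=3, width=4)
```

Because the policy only sees the skill and the mask, an object phrase it has never encountered changes nothing about the form of its input, which is the property the paragraph above relies on.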

Like RT-1, our system is learned end-to-end via behavioral cloning from human demonstrations. We utilize OWL-ViT on our robot demonstration training data to generate genuine detections for conditioning the manipulation policy during the training loop.


We found that we needed to greatly expand the training dataset of RT-1 in order to learn generalizable manipulation skills. The original RT-1 dataset included only 16 object types, and our early attempts struggled to manipulate objects with shapes different from those in the training set. Since most skills require grasping (picking) the object as a component step, we hypothesized that we could learn most of the information needed to generalize skills to unseen objects by only incorporating more grasping demonstrations into the RT-1 dataset. We added grasping demonstrations across 90 diverse object categories, as shown in the figure above. We found that we could transfer knowledge from these grasping, or "pick", episodes to the existing RT-1 skills ("move X near Y", "place X upside down", etc.).

Experiments and Results

Our focus is mainly on the system's ability to generalize and perform skills on objects that it has never encountered during training. To test this, we position our robot beside a tabletop surface that is covered with a variety of objects and then provide a natural language command to execute a specific skill on a designated object, as depicted above. The majority of the objects on the table are distractors and should not be involved in the episode, and the target object has not been seen previously. We repeat this process for each skill multiple times and record the percentage of successful episodes. We compare our results to those of RT-1 and VIMA for baseline evaluation.
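The metric described above is a per-skill success percentage over repeated episodes. A minimal sketch of that bookkeeping follows; the function name `success_rates` and the episode tuples are hypothetical and purely illustrative, not data from the evaluation.

```python
from collections import defaultdict


def success_rates(episodes):
    """Map each skill to its percentage of successful episodes.

    `episodes` is an iterable of (skill, succeeded) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # skill -> [successes, total]
    for skill, succeeded in episodes:
        counts[skill][0] += int(succeeded)
        counts[skill][1] += 1
    return {skill: 100.0 * wins / total
            for skill, (wins, total) in counts.items()}


# Made-up episode log, for illustration only.
log = [("pick", True), ("pick", False), ("move_near", True)]
rates = success_rates(log)
```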

The primary experimental outcomes are presented in the table above. The results reveal that MOO surpasses other methods in performance on both seen and unseen objects. Notably, MOO demonstrates an impressive success rate of approximately 75% for executing skills on objects that it has never encountered before. Through a series of ablations, we have established a strong correlation between increased performance and larger dataset sizes and model capacity. Additionally, we have demonstrated that MOO is more robust than competing methods in managing distracting environments. For further insights, we encourage you to consult our paper.


We conduct extensive real-world evaluations of MOO performance on 13 seen objects and 8 unseen objects, where we find that MOO is able to generalize to novel objects. In addition, we ablate MOO's model capacity and training distribution, and find that high-capacity models with large amounts of diverse data are crucial to strong performance.

We explore using a generative VLM like PaLI to generate an image caption which is then provided to OWL-ViT to generate a mask usable by MOO. For example, PaLI can interpret human intent by identifying which object a person is pointing to, which MOO then successfully manipulates.

OWL-ViT need not be conditioned on a textual query; it can also be prompted with an image query, such as a stock image downloaded from the internet. Image-based querying is especially relevant when objects are hard to describe in words, such as when there are many visually similar objects in a scene. In these cases, it may be most effective to directly provide an image of the target of interest.

In cases where OWL-ViT or other VLMs fail to produce an accurate detection, we experiment with a GUI where humans can directly input the ground-truth mask provided to the policy. This is especially useful in cases with clutter or repeated objects, where textual or visual queries may be quite difficult even for state-of-the-art VLMs.
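What the text, image, and GUI modalities have in common is that each one ends in a mask handed to the policy. One way to picture this, as a sketch under assumptions rather than the system's actual interface, is to treat every modality as a "mask provider" with the same signature. All names below (`mask_from_box`, `mask_from_pixels`, `policy_inputs`) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Mask = List[List[int]]
MaskProvider = Callable[[int, int], Mask]  # (height, width) -> binary mask


def mask_from_box(box: Tuple[int, int, int, int]) -> MaskProvider:
    """Provider for a detector-style box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box

    def provider(height: int, width: int) -> Mask:
        return [[1 if x_min <= x < x_max and y_min <= y < y_max else 0
                 for x in range(width)] for y in range(height)]
    return provider


def mask_from_pixels(pixels: Iterable[Tuple[int, int]]) -> MaskProvider:
    """Provider for a human-drawn mask given as (x, y) pixels from a GUI."""
    marked = set(pixels)

    def provider(height: int, width: int) -> Mask:
        return [[1 if (x, y) in marked else 0 for x in range(width)]
                for y in range(height)]
    return provider


def policy_inputs(skill: str, size: Tuple[int, int],
                  provider: MaskProvider) -> Dict[str, object]:
    """Bundle what the policy is conditioned on; the mask source is opaque."""
    height, width = size
    return {"skill": skill, "mask": provider(height, width)}


from_detector = policy_inputs("pick", (2, 3), mask_from_box((0, 0, 2, 1)))
from_human = policy_inputs("pick", (2, 3), mask_from_pixels([(0, 0), (1, 0)]))
```

In this sketch the two calls yield identical policy inputs, which mirrors why a policy trained only on detector-generated masks can accept masks produced by modalities unseen during training.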

Open-world manipulation can be integrated with open-vocabulary object goal navigation. Coincidentally, there is an open-vocabulary object navigation algorithm called CLIP on Wheels (CoW); we implement a variant of CoW and combine it with MOO, which we refer to as CoW-MOO. CoW handles open-vocabulary navigation to an object of interest, and MOO then takes over to manipulate the target object.



The authors would like to thank Alex Irpan, Brian Ichter, Clayton Tan, Grecia Salazar, Kanishka Rao, Nikhil Joshi, Noah Brown and the greater Robotics @ Google team for their feedback and contributions.

The website template was taken from Jon Barron.