CAREER: Learning Predictive Models for Visual Navigation and Object Interaction

Efficiently moving around and interacting with objects in novel environments requires building expectations about people (e.g., on which side an oncoming person will pass), places (e.g., where car keys are likely to be in a home), and things (e.g., which way a door will open). However, manually building such expectations into decision-making systems is challenging. At the same time, machine learning has been shown to be successful at extracting representative patterns from training datasets in many related application domains. While the use of machine learning to learn predictive models for decision-making seems promising, the design choices (data sources, forms of supervision, architectures for the predictive models, and interaction of the predictive models with decision-making) are deeply intertwined. As part of this project, investigators will identify the precise ways in which machine learning benefits navigation and object interaction, and will co-design datasets, models, and learning algorithms to build systems that realize these benefits. The project will improve the state of the art in predictive reasoning for navigation and object interaction by designing approaches that can leverage large-scale, diverse data sources for training. Models, datasets, and systems developed in this project will advance navigation and mobile manipulation capabilities. These will enable practical downstream applications (e.g., assistive robots, telepresence) and open up avenues for follow-up research (e.g., human-robot interaction). The project will contribute to the education of students and the broader community through curriculum development, engagement in research projects, and accessible dissemination of research.

The project will co-design data collection methods, learning techniques, and policy architectures to enable large-scale learning of predictive models for people, places, and things for problems involving navigation and mobile manipulation. Investigators will tackle the following three research tasks: (1) designing predictive models for people, places, and objects that are necessary for decision making; (2) identifying data sources and generating supervision to learn these predictive models at scale; and (3) designing hierarchical and modular policy architectures that effectively use the learned predictive models. Investigators will reuse existing sense-plan-control components (motion planners, feedback controllers) where applicable (e.g., for motion in free space), and introduce learning in the modules that require speculation, i.e., high-level decision-making modules such as identifying promising directions for exploration, predicting where an oncoming human will go next, or determining a good position from which to open a drawer. Investigators will evaluate the effectiveness of the proposed methods by comparing the efficiency of systems with and without predictive reasoning.
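
The following is a minimal, hypothetical sketch (in Python) of the modular policy structure described above: a learned predictive model handles the speculative high-level decision (e.g., which direction is promising to explore), while classical sense-plan-control components execute the resulting motion. All class names, interfaces, and numbers are illustrative assumptions, not project code.

# Hypothetical sketch of the modular structure described above; not project code.
from dataclasses import dataclass

@dataclass
class Subgoal:
    x: float      # target position in the robot's map frame (meters)
    y: float
    score: float  # predicted value of moving toward this subgoal

class LearnedFrontierPredictor:
    """Stand-in for a learned model that scores candidate exploration directions."""
    def propose(self, occupancy_map, goal_spec):
        # A real model would consume the partial map and the goal description;
        # here we return fixed candidates purely for illustration.
        return [Subgoal(x=2.0, y=1.0, score=0.8), Subgoal(x=-1.0, y=3.0, score=0.4)]

class ClassicalPlanner:
    """Stand-in for an off-the-shelf motion planner / feedback controller."""
    def plan_and_execute(self, subgoal):
        print(f"navigating to ({subgoal.x:.1f}, {subgoal.y:.1f})")

def step(predictor, planner, occupancy_map, goal_spec):
    # The learned module speculates about where to go; the classical stack gets it there.
    candidates = predictor.propose(occupancy_map, goal_spec)
    best = max(candidates, key=lambda s: s.score)
    planner.plan_and_execute(best)

step(LearnedFrontierPredictor(), ClassicalPlanner(), occupancy_map=None, goal_spec="find the car keys")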

Publications

Learning Getting-Up Policies for Real-World Humanoid Robots
Xialin He*, Runpei Dong*, Zixuan Chen, Saurabh Gupta
Robotics: Science and Systems (RSS), 2025
website / video

Abstract: Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of humanoid locomotion learning, the getting-up task involves complex contact patterns, which necessitate accurately modeling the collision geometry, as well as sparser rewards. We address these challenges through a two-phase approach that follows a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed/torque limits. The second stage then refines the discovered motions into deployable (i.e., smooth and slow) motions that are robust to variations in initial configuration and terrain. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, and slippery surfaces and slopes (e.g., slippery grass and snowfield). To the best of our knowledge, this is the first successful demonstration of learned getting-up policies for human-sized humanoid robots in the real world.
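
As a concrete but hypothetical illustration of the two-phase curriculum described in the abstract, the sketch below re-weights a set of reward terms between a discovery stage (task success with loose smoothness and torque constraints) and a refinement stage (smooth, slow, deployable motion that tracks the discovered trajectory). The reward terms, weights, and state representation are assumptions, not the paper's actual training code.

# Illustrative two-stage reward shaping; weights and terms are assumed, not from the paper.
STAGE_I_WEIGHTS = {"upright": 1.0, "smoothness": 0.01, "torque": 0.0, "tracking": 0.0}
STAGE_II_WEIGHTS = {"upright": 1.0, "smoothness": 0.5, "torque": 0.3, "tracking": 1.0}

def reward(state, weights, reference=None):
    # upright: how close the torso height is to an assumed standing height of 1.0 m
    r = weights["upright"] * -abs(state["torso_height"] - 1.0)
    # smoothness: penalize large joint accelerations
    r += weights["smoothness"] * -sum(a * a for a in state["joint_accel"])
    # torque: penalize actuator effort so the refined motion is deployable
    r += weights["torque"] * -sum(t * t for t in state["joint_torque"])
    # tracking: in stage II, stay close to the trajectory discovered in stage I
    if reference is not None:
        r += weights["tracking"] * -sum(
            (q - qr) ** 2 for q, qr in zip(state["joint_pos"], reference))
    return r

example_state = {"torso_height": 0.4, "joint_accel": [0.2, -0.1],
                 "joint_torque": [5.0, 3.0], "joint_pos": [0.1, 0.3]}
print(reward(example_state, STAGE_I_WEIGHTS))
print(reward(example_state, STAGE_II_WEIGHTS, reference=[0.0, 0.25]))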

A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects
Arjun Gupta, Rishik Sathua, Saurabh Gupta
arXiv, 2025
website / video

Abstract: Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop, training-free framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM employs an RGB-D wrist camera and uses visual servoing for control. Our novelty lies in the use of state-of-the-art vision models to reliably compute 3D targets from the wrist image for diverse tasks and under occlusion due to the end-effector. To mitigate occlusion artifacts, we employ vision models to out-paint the end-effector, thereby significantly enhancing target localization. We demonstrate that, aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module to identify semantic targets (e.g., knobs) and point tracking methods can reliably track interaction sites indicated by user clicks. This training-free method obtains an 85% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method and an imitation learning baseline trained on 1000+ demonstrations by an absolute margin of 50% in success rate.
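
A minimal sketch of the closed-loop servoing structure described above, assuming a wrist RGB-D camera with known intrinsics: a detected target pixel (e.g., from an open-vocabulary detector, possibly after out-painting the end-effector) is deprojected to a 3D point in the camera frame, and the end-effector is stepped a fraction of the way toward it on each iteration. The detector and robot interfaces are omitted, and the intrinsics and depth values below are made up for illustration.

# Hypothetical servoing step; detector and robot interfaces are stubs.
import numpy as np

def pixel_to_camera_frame(u, v, z, fx, fy, cx, cy):
    """Deproject pixel (u, v) with depth z (meters) into the camera frame."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def servo_step(target_pixel, depth_image, intrinsics, gain=0.2):
    u, v = target_pixel                       # e.g., output of an open-vocabulary detector
    target = pixel_to_camera_frame(u, v, depth_image[v, u], *intrinsics)
    ee_delta = gain * target                  # command a small motion toward the target
    return ee_delta, np.linalg.norm(target)   # remaining distance drives the loop

depth = np.full((480, 640), 0.5)              # fake 0.5 m depth map for illustration
delta, dist = servo_step((400, 260), depth, intrinsics=(600.0, 600.0, 320.0, 240.0))
print(delta, dist)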

Demonstrating MOSART: Opening Articulated Structures in the Real World
Arjun Gupta, Michelle Zhang*, Rishik Sathua*, Saurabh Gupta
Robotics: Science and Systems (RSS), 2025
website / video

Abstract: What does it take to build mobile manipulation systems that can competently operate on previously unseen objects in previously unseen environments? This work answers this question using the opening of articulated objects as a mobile manipulation testbed. Specifically, our focus is on the end-to-end performance on this task without any privileged information, i.e., the robot starts at a location with the novel target articulated object in view, and has to approach the object and successfully open it. We first develop a system for this task, and then conduct 100+ end-to-end system tests across 13 real-world test sites. Our large-scale study reveals a number of surprising findings: a) modular systems outperform end-to-end learned systems for this task, even when the end-to-end learned systems are trained on 1000+ demonstrations, b) perception, and not precise end-effector control, is the primary bottleneck to task success, and c) state-of-the-art articulation parameter estimation models developed in isolation struggle when faced with robot-centric viewpoints. Overall, our findings highlight the limitations of developing components of the pipeline in isolation and underscore the need for system-level research, providing a pragmatic roadmap for building generalizable mobile manipulation systems.
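
As an illustration of one modular piece of such a pipeline (hypothetical, not the system's actual code): once perception has estimated the articulation parameters, end-effector waypoints for the opening motion can be generated by rotating the grasped handle point about the estimated hinge axis, as sketched below using the standard Rodrigues rotation formula. All poses and parameters are illustrative.

# Hypothetical waypoint generation from estimated articulation parameters.
import numpy as np

def rotate_about_axis(point, axis_point, axis_dir, angle):
    """Rotate `point` about the line through axis_point with direction axis_dir."""
    k = axis_dir / np.linalg.norm(axis_dir)
    p = point - axis_point
    p_rot = (p * np.cos(angle) + np.cross(k, p) * np.sin(angle)
             + k * np.dot(k, p) * (1 - np.cos(angle)))
    return p_rot + axis_point

def opening_waypoints(handle, hinge_point, hinge_axis, max_angle=np.pi / 3, n=10):
    # End-effector positions along the opening arc, from closed to max_angle.
    angles = np.linspace(0.0, max_angle, n)
    return [rotate_about_axis(handle, hinge_point, hinge_axis, a) for a in angles]

wps = opening_waypoints(handle=np.array([0.6, 0.3, 0.9]),
                        hinge_point=np.array([0.6, -0.2, 0.9]),
                        hinge_axis=np.array([0.0, 0.0, 1.0]))
print(wps[0], wps[-1])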

GOAT: GO to Any Thing
Matthew Chang*, Theophile Gervet*, Mukul Khanna*, Sriram Yenamandra*, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, Roozbeh Mottaghi, Jitendra Malik, Devendra Singh Chaplot
Robotics: Science and Systems (RSS), 2024
website

Abstract: In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals specified via category labels, target images, and language descriptions, b) Lifelong: it benefits from its past experience in the same environment, and c) Platform Agnostic: it can be quickly deployed on robots with different embodiments. GOAT is made possible through a modular system design and a continually augmented instance-aware semantic memory that keeps track of the appearance of objects from different viewpoints in addition to category-level semantics. This enables GOAT to distinguish between different instances of the same category and thereby navigate to targets specified by images and language descriptions. In experimental comparisons spanning over 90 hours in 9 different homes, with 675 goals selected across 200+ different object instances, we find GOAT achieves an overall success rate of 83%, surpassing previous methods and ablations by 32% (absolute improvement). GOAT improves with experience in the environment, from a 60% success rate at the first goal to a 90% success rate after exploration. In addition, we demonstrate that GOAT can readily be applied to downstream tasks such as pick and place and social navigation.
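
The sketch below illustrates, with hypothetical data structures, what an instance-aware semantic memory of the kind described above might look like: each instance keeps its category, map location, and appearance features from multiple viewpoints, so goals given as a category, an image, or a language description can all be matched against stored features. The merging radius, feature vectors, and interfaces are assumptions for illustration.

# Hypothetical instance-aware semantic memory; feature extractors are assumed to exist.
import numpy as np

class InstanceMemory:
    def __init__(self):
        self.instances = []  # each: {"category", "location", "features": [np.ndarray]}

    def add_observation(self, category, location, feature):
        # Merge with an existing instance of the same category seen nearby,
        # otherwise start a new instance entry.
        for inst in self.instances:
            if inst["category"] == category and np.linalg.norm(
                    np.array(inst["location"]) - np.array(location)) < 0.5:
                inst["features"].append(feature)
                return
        self.instances.append({"category": category, "location": location,
                               "features": [feature]})

    def query(self, goal_feature=None, category=None):
        """Return the map location of the best-matching stored instance."""
        best, best_score = None, -np.inf
        for inst in self.instances:
            if category is not None and inst["category"] != category:
                continue
            score = 0.0
            if goal_feature is not None:
                feats = np.stack(inst["features"])
                sims = feats @ goal_feature / (
                    np.linalg.norm(feats, axis=1) * np.linalg.norm(goal_feature) + 1e-8)
                score = sims.max()   # best cosine similarity over stored viewpoints
            if score > best_score:
                best, best_score = inst, score
        return None if best is None else best["location"]

mem = InstanceMemory()
mem.add_observation("chair", (1.0, 2.0), np.array([0.1, 0.9, 0.0]))
mem.add_observation("chair", (5.0, 0.0), np.array([0.8, 0.1, 0.1]))
print(mem.query(goal_feature=np.array([0.75, 0.2, 0.1]), category="chair"))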

Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
Aditya Prakash, Arjun Gupta, Saurabh Gupta
European Conference on Computer Vision (ECCV), 2024
website

Abstract: Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and the camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks (depth prediction on NYU, 3D object detection on KITTI and nuScenes, and predicting 3D shapes of articulated objects on ARCTIC) show the benefits of KPE.
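
A small sketch of the kind of intrinsics-aware encoding the abstract describes (the exact formulation of KPE in the paper may differ): each pixel of a crop is tagged with the direction of its viewing ray, computed from the camera intrinsics and the crop's position in the full image, and these channels are fed to the model alongside the crop. All numbers below are illustrative.

# Hypothetical intrinsics-aware encoding of a crop's location in the field of view.
import numpy as np

def ray_encoding(crop_box, crop_size, fx, fy, cx, cy):
    """crop_box = (u0, v0, u1, v1) in full-image pixels; returns an (H, W, 2) array."""
    u0, v0, u1, v1 = crop_box
    us = np.linspace(u0, u1, crop_size)
    vs = np.linspace(v0, v1, crop_size)
    uu, vv = np.meshgrid(us, vs)
    # Normalized ray directions (x/z, y/z) for every pixel in the crop.
    return np.stack([(uu - cx) / fx, (vv - cy) / fy], axis=-1)

enc = ray_encoding((300, 100, 420, 220), crop_size=32, fx=720.0, fy=720.0, cx=320.0, cy=240.0)
print(enc.shape)  # (32, 32, 2); concatenated with the crop as extra input channels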

Push Past Green: Learning to Look Behind Plant Foliage by Moving It
Xiaoyu Zhang, Saurabh Gupta
Conference on Robot Learning (CoRL), 2023
webpage / code+data

Abstract: Autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits) require manipulating the plant foliage to look behind the leaves and the branches. Partial visibility, extreme clutter, thin structures, and unknown geometry and dynamics for plants make such manipulation challenging. We tackle these challenges through data-driven methods. We use self-supervision to train SRPNet, a neural network that predicts what space is revealed on execution of a candidate action on a given plant. We use SRPNet with the cross-entropy method to predict actions that are effective at revealing space beneath plant foliage. Furthermore, as SRPNet does not just predict how much space is revealed but also where it is revealed, we can execute a sequence of actions that incrementally reveal more and more space beneath the plant foliage. We experiment with a synthetic plant (vines) and a real plant (Dracaena) on a physical test-bed across 5 settings, including 2 settings that test generalization to novel plant configurations. Our experiments reveal the effectiveness of our overall method, PPG, over a competitive hand-crafted exploration method, and the effectiveness of SRPNet over a hand-crafted dynamics model and relevant ablations.
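
The sketch below shows the action-selection structure described above: the cross-entropy method (CEM) samples candidate actions and refits its sampling distribution around the ones a learned model (here a toy stand-in for SRPNet) predicts will reveal the most space. Everything except the generic CEM loop is a hypothetical stub.

# Hypothetical CEM action selection with a toy stand-in for a learned revealed-space model.
import numpy as np

def predicted_revealed_space(actions, rng):
    # Stand-in for SRPNet: pretend actions closer to the plant center (origin)
    # reveal more space, plus a little noise.
    return -np.linalg.norm(actions, axis=1) + 0.05 * rng.standard_normal(len(actions))

def cem_select_action(score_fn, dim=4, iters=5, pop=64, elite=8, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, dim))
        scores = score_fn(samples, rng)
        elites = samples[np.argsort(scores)[-elite:]]      # keep the best candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean

print(cem_select_action(predicted_revealed_space))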

Predicting Motion Plans for Articulating Everyday Objects
Arjun Gupta, Max Shepherd, Saurabh Gupta
International Conference on Robotics and Automation (ICRA), 2023
webpage / dataset

Abstract: Mobile manipulation tasks such as opening a door, pulling open a drawer, or lifting a toilet lid require constrained motion of the end-effector under environmental and task constraints. This, coupled with partial information in novel environments, makes it challenging to employ classical motion planning approaches at test time. Our key insight is to cast it as a learning problem to leverage past experience of solving similar planning problems to directly predict motion plans for mobile manipulation tasks in novel situations at test time. To enable this, we develop a simulator, ArtObjSim, that simulates articulated objects placed in real scenes. We then introduce SeqIK+\(\theta_0\), a fast and flexible representation for motion plans. Finally, we learn models that use SeqIK+\(\theta_0\) to quickly predict motion plans for articulating novel objects at test time. Experimental evaluation shows improved speed and accuracy at generating motion plans compared to pure search-based methods and pure learning methods.
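
A minimal sketch of the SeqIK+\(\theta_0\) idea as described above: a predicted initial joint configuration \(\theta_0\) plus a predicted sequence of end-effector waypoints is expanded into a joint-space trajectory by repeatedly calling inverse kinematics, warm-started from the previous configuration. The IK solver here is a stub, and names and dimensions are assumptions; a real system would use the robot's own IK solver.

# Hypothetical expansion of (theta_0, end-effector waypoints) into a joint-space trajectory.
import numpy as np

def ik_stub(ee_pose, seed_q):
    # Placeholder for an IK solver: returns a joint configuration near seed_q.
    return seed_q + 0.01 * np.ones_like(seed_q)

def motion_plan_from_seqik(theta_0, ee_waypoints):
    """Expand a predicted (theta_0, waypoint sequence) into a full joint-space trajectory."""
    traj, q = [np.asarray(theta_0)], np.asarray(theta_0)
    for pose in ee_waypoints:
        q = ik_stub(pose, seed_q=q)   # warm-start IK from the previous configuration
        traj.append(q)
    return traj

plan = motion_plan_from_seqik(theta_0=np.zeros(7),
                              ee_waypoints=[np.eye(4)] * 3)  # e.g., poses along a door arc
print(len(plan), plan[-1][:3])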

Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds
Shaowei Liu, Saurabh Gupta*, Shenlong Wang*
Computer Vision and Pattern Recognition (CVPR), 2023
website / code

Abstract: We build rearticulable models for arbitrary everyday man-made objects containing an arbitrary number of parts that are connected together in arbitrary ways via 1-degree-of-freedom joints. Given point cloud videos of such everyday objects, our method identifies the distinct object parts, which parts are connected to which other parts, and the properties of the joints connecting each part pair. We do this by jointly optimizing the part segmentation, transformation, and kinematics using a novel energy minimization framework. Our inferred animatable models enable retargeting to novel poses with sparse point correspondence guidance. We test our method on a new articulating robot dataset and the Sapien dataset with common daily objects, as well as real-world scans. Experiments show that our method outperforms two leading prior works on various metrics.
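
As a rough illustration of the joint optimization described above (the specific terms used in the paper may differ), the method can be thought of as minimizing an energy over the part segmentation \(S\), per-frame part transformations \(T\), and joint (kinematics) parameters \(K\):

\[
\min_{S,\,T,\,K}\; E_{\mathrm{recon}}(S, T) \;+\; E_{\mathrm{kin}}(T, K) \;+\; E_{\mathrm{reg}}(S, K),
\]

where \(E_{\mathrm{recon}}\) measures how well the per-part rigid transformations explain the observed point cloud motion under segmentation \(S\), \(E_{\mathrm{kin}}\) penalizes relative part motions that are inconsistent with 1-degree-of-freedom (revolute or prismatic) joints \(K\), and \(E_{\mathrm{reg}}\) regularizes the segmentation and kinematic structure.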

People

Contact

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS-2143873 (Project Title: CAREER: Learning Predictive Models for Visual Navigation and Object Interaction, PI: Saurabh Gupta). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.