ICRA 2026

SceneComplete: Open-World 3D Scene Completion in Cluttered Real World Environments for Robot Manipulation

Aditya Agarwal, Gaurav Singh, Bipasha Sen, Tomas Lozano-Perez, Leslie Kaelbling

AI summary

Composing off-the-shelf vision foundation models yields accurate, open-world 3D scene reconstruction from a single RGB-D image, which in turn supports robust robotic grasping in cluttered environments.

3D scene completion, open-world perception, robot manipulation, RGB-D reconstruction, dexterous grasping, vision foundation models

Problem

Robots struggle to perceive and manipulate objects accurately in unstructured, cluttered environments because existing single-view 3D reconstruction methods are closed-set, rely on synthetic data, or fail to individuate objects for reliable grasping.

Approach

SceneComplete chains pretrained open-set models (vision-language prompting, grounded segmentation, image inpainting, image-to-3D generation, and pose estimation) into a pipeline that converts a single RGB-D image into complete, segmented 3D object meshes.
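
The chaining described above can be pictured as a sequence of stages that each read from and add to a shared state. The sketch below is illustrative only and is not the authors' code: every function name, the `State` dictionary layout, and the string placeholders standing in for masks, crops, and meshes are hypothetical, and each stub would be replaced by a call to a real pretrained model.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Stage = Callable[[State], State]


# Each stub below is a hypothetical stand-in for a pretrained model.
def prompt_objects(state: State) -> State:
    # Vision-language prompting: name the objects visible in the image.
    return {"labels": ["mug", "box"]}


def segment(state: State) -> State:
    # Grounded segmentation: one mask per named object.
    return {"masks": {label: f"mask<{label}>" for label in state["labels"]}}


def inpaint(state: State) -> State:
    # Image inpainting: fill in the occluded parts of each object crop.
    return {"crops": {label: f"inpainted<{label}>" for label in state["masks"]}}


def image_to_3d(state: State) -> State:
    # Image-to-3D generation: one complete mesh per inpainted crop.
    return {"meshes": {label: f"mesh<{label}>" for label in state["crops"]}}


def estimate_pose(state: State) -> State:
    # Pose estimation: place each mesh back into the observed scene.
    return {
        "posed_meshes": {
            label: (mesh, "pose") for label, mesh in state["meshes"].items()
        }
    }


# Stage order follows the list in the Approach paragraph.
PIPELINE: List[Tuple[str, Stage]] = [
    ("vision-language prompting", prompt_objects),
    ("grounded segmentation", segment),
    ("image inpainting", inpaint),
    ("image-to-3D generation", image_to_3d),
    ("pose estimation", estimate_pose),
]


def run_pipeline(rgbd: Any, stages: List[Tuple[str, Stage]] = PIPELINE) -> State:
    """Run the stages in order, merging each stage's output into the state."""
    state: State = {"rgbd": rgbd}
    for _name, stage in stages:
        state.update(stage(state))
    return state


if __name__ == "__main__":
    result = run_pipeline(rgbd="single RGB-D frame")
    for label, (mesh, pose) in result["posed_meshes"].items():
        print(label, mesh, pose)
```

Keeping each stage behind the same small interface is one way to get the modularity the summary emphasizes: a newer segmentation or image-to-3D model can replace a stub without touching the rest of the chain.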

Key results

Outperforms OctMAE and ZeroGrasp on 3D reconstruction metrics (Chamfer distance, MIoU) on the GraspNet-1B dataset
Generates accurate parallel-jaw and dexterous grasp proposals from reconstructed meshes
Demonstrates successful real-world pick-and-place manipulation with everyday objects on a physical robot
Provides a modular, scalable pipeline requiring minimal fine-tuning while leveraging rapidly advancing vision models
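
For readers unfamiliar with the reconstruction metrics named above, the following is a minimal sketch of one common way to compute them on point sets. It assumes a symmetric Chamfer distance built from mean nearest-neighbor Euclidean distances and an IoU computed over occupied voxels; the exact variants, sampling, and voxel size used in the paper are not stated in this summary, and the function names and the `voxel_size` default are hypothetical. The brute-force nearest-neighbor search is only practical for small point sets.

```python
import math
from typing import Sequence, Set, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance a->b plus b->a."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def voxelize(points: Sequence[Point], voxel_size: float) -> Set[Tuple[int, ...]]:
    """Map each point to the integer index of the voxel that contains it."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def voxel_iou(a: Sequence[Point], b: Sequence[Point], voxel_size: float = 0.01) -> float:
    """Intersection over union of the voxels occupied by each point set."""
    va, vb = voxelize(a, voxel_size), voxelize(b, voxel_size)
    union = va | vb
    return len(va & vb) / len(union) if union else 1.0


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
    print(chamfer_distance(predicted, ground_truth))
    print(voxel_iou(predicted, ground_truth, voxel_size=0.5))
```

A mean IoU would then be the average of `voxel_iou` over the objects or scenes being evaluated. Lower Chamfer distance and higher IoU both indicate a closer match to the ground-truth geometry.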

Why it matters

Provides a practical, category-agnostic perception foundation that allows manipulation robots to operate reliably in open-world, cluttered spaces without extensive retraining.

Abstract

Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. SceneComplete is a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, visual-descriptors and pose-estimation) to obtain highly accurate results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand. We release the code and additional results on our website.

Index terms

Perception for Grasping and Manipulation, RGB-D Perception, Manipulation Planning