ICRA 2026

Improving Robotic Manipulation Robustness Via NICE Scene Surgery

Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, Amir Rasouli

AI summary

AI-driven scene surgery on existing robot demonstrations significantly boosts manipulation robustness and safety without requiring new data collection or simulators.

Robotic manipulation, Visual distractors, Data augmentation, Diffusion inpainting, Policy robustness, Scene editing

Problem

Robotic policies trained on limited real-world demonstrations suffer performance and safety degradation when encountering visual distractors or scene variations unseen during training, while existing solutions rely on expensive simulators or custom model training.

Approach

NICE automatically edits distractor objects in real demonstration images through removal, restyling, or replacement using off-the-shelf vision-language models and diffusion inpainting, generating diverse, realistic training data while preserving task semantics.
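The constraint that edits touch only distractors and never the target can be illustrated with a small mask-building sketch. This is not the authors' implementation: the box format, the `build_edit_mask` name, and the `margin` parameter are all hypothetical, and the sketch assumes distractor and target bounding boxes have already been produced by some upstream detector or vision-language model. The resulting binary mask is the kind of input a diffusion inpainting model typically expects.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), with max edges exclusive.
Box = Tuple[int, int, int, int]


def build_edit_mask(
    height: int,
    width: int,
    distractors: List[Box],
    target: Box,
    margin: int = 2,
) -> List[List[int]]:
    """Return a height x width mask: 1 = pixel may be edited, 0 = keep as-is.

    Pixels inside any distractor box are editable unless they fall inside the
    target box grown by `margin` pixels, which is always protected.
    """
    # Grow the target box by a safety margin, clamped to the image bounds.
    px0 = max(target[0] - margin, 0)
    py0 = max(target[1] - margin, 0)
    px1 = min(target[2] + margin, width)
    py1 = min(target[3] + margin, height)

    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in distractors:
        # Clamp each distractor box to the image before rasterizing it.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                protected = px0 <= x < px1 and py0 <= y < py1
                if not protected:
                    mask[y][x] = 1
    return mask


# Toy 6x8 scene: two distractors that partially overlap the target box.
mask = build_edit_mask(
    height=6,
    width=8,
    distractors=[(0, 0, 3, 3), (4, 2, 8, 5)],
    target=(2, 2, 5, 4),
    margin=0,
)
editable_pixels = sum(sum(row) for row in mask)
```

Because the target region is carved out of every distractor box, whatever the inpainting model generates cannot cover the object the recorded actions refer to, which is what lets the original action labels be reused unchanged.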

Key results

Over 20% improvement in spatial affordance prediction accuracy for highly cluttered scenes
Average 11% increase in manipulation success rate across varying distractor quantities
6% reduction in target confusion and 7% decrease in collision rates
High background consistency and low FID scores validating photo-realistic scene editing
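The summary does not say how background consistency is measured. As a rough illustration only, one plausible proxy is the mean absolute pixel difference between the original and edited images, restricted to pixels outside the edit mask; lower values mean the background was left more intact. The function name `background_mae` and the grayscale nested-list image format are assumptions made for this sketch, not details from the paper.

```python
from typing import List

Image = List[List[float]]  # assumed grayscale image, row-major
Mask = List[List[int]]     # 1 = edited region, 0 = background


def background_mae(original: Image, edited: Image, mask: Mask) -> float:
    """Mean absolute difference over background pixels (mask == 0)."""
    total = 0.0
    count = 0
    for row_o, row_e, row_m in zip(original, edited, mask):
        for o, e, m in zip(row_o, row_e, row_m):
            if m == 0:  # only compare pixels the edit was not supposed to touch
                total += abs(o - e)
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; no background to compare")
    return total / count


original = [[10.0, 20.0], [30.0, 40.0]]
edited = [[10.0, 25.0], [99.0, 40.0]]
mask = [[0, 0], [1, 0]]  # the bottom-left pixel was inpainted

score = background_mae(original, edited, mask)
```

In this toy example the large change at the inpainted pixel is ignored, and only the small drift in the untouched pixels contributes to the score.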

Why it matters

Provides a scalable, simulator-free data augmentation pipeline that directly improves the real-world robustness and safety of vision-language-action policies for robotic manipulation.

Abstract

Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method minimizes the out-of-distribution (OOD) gap in imitation learning by increasing visual diversity through the construction of new experiences from existing demonstrations. By utilizing image generative frameworks and large language models, NICE performs three editing operations on distracting (non-target) objects: replacement, restyling, and removal. These changes preserve spatial relationships without obstructing target objects and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets. Using real-world scenes, we showcase the capability of our framework in producing photo-realistic scene enhancement. For downstream tasks, we use NICE data to finetune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully minimizes OOD gaps, resulting in over 20% improvement in accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, success rate increases on average by 11% when testing in environments populated with distractors in different quantities. Furthermore, we show that our method improves visual robustness, lowering target confusion by 6%, and enhances safety by reducing collision rate by 7%.

Index terms

Deep Learning in Grasping and Manipulation; Manipulation Planning; Perception for Grasping and Manipulation