ICRA 2026

Beyond the Patch: Exploring Vulnerabilities of Visuomotor Policies via Viewpoint-Consistent 3D Adversarial Object

Chanmi Lee, Minsung Yoon, Woojae Kim, Sebin Lee, Sung-eui Yoon

AI summary

A viewpoint-consistent 3D adversarial object deceives visuomotor policies that rely on wrist-mounted cameras, remaining effective across dynamic viewpoints and outperforming conventional 2D patches.

adversarial attacks, visuomotor policies, 3D adversarial objects, differentiable rendering, robotic security, viewpoint robustness

Problem

Traditional 2D adversarial patches lose effectiveness under dynamic viewpoints and perspective distortions common in real-world robotic setups with moving or wrist-mounted cameras.

Approach

The authors optimize a texture on a 3D mesh using differentiable rendering and Expectation over Transformation, guided by a coarse-to-fine curriculum and saliency maps to maintain attack efficacy across varying distances and angles.
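As a rough illustration of Expectation over Transformation, the sketch below ascends the gradient of a score averaged over randomly sampled transformations instead of a single fixed view. Everything in it is an assumption made for illustration and is not taken from the paper: the 1D "texture", the toy `render` function (a brightness gain plus a circular shift standing in for a differentiable renderer), the linear `score` standing in for the policy response, and all names and hyperparameters are hypothetical.

```python
import random

def sample_transform(rng):
    """Hypothetical stand-in for a random viewpoint: brightness gain and circular shift."""
    return rng.uniform(0.7, 1.3), rng.randint(-2, 2)

def render(texture, transform):
    """Toy 'renderer' (assumption): image[j] = gain * texture[(j - shift) mod n]."""
    gain, shift = transform
    n = len(texture)
    return [gain * texture[(j - shift) % n] for j in range(n)]

def render_grad(grad_image, transform):
    """Back-propagate an image-space gradient to the texture (transpose of render)."""
    gain, shift = transform
    n = len(grad_image)
    return [gain * grad_image[(k + shift) % n] for k in range(n)]

def score(weights, image):
    """Toy linear 'policy' response (assumption) that the attacker wants to increase."""
    return sum(w * p for w, p in zip(weights, image))

def eot_attack(weights, texture, steps=200, samples=8, lr=0.05, seed=0):
    """Gradient ascent on the score averaged over sampled transformations."""
    rng = random.Random(seed)
    texture = list(texture)
    n = len(texture)
    for _ in range(steps):
        grad = [0.0] * n
        for _ in range(samples):
            t = sample_transform(rng)
            # For a linear score, d(score)/d(image) is simply `weights`.
            g = render_grad(weights, t)
            grad = [a + b for a, b in zip(grad, g)]
        # Step along the sample-averaged gradient, then clamp texels to [0, 1].
        texture = [min(1.0, max(0.0, x + lr * g / samples))
                   for x, g in zip(texture, grad)]
    return texture

def expected_score(weights, texture, trials=500, seed=1):
    """Monte Carlo estimate of the score under the transformation distribution."""
    rng = random.Random(seed)
    total = sum(score(weights, render(texture, sample_transform(rng)))
                for _ in range(trials))
    return total / trials
```

Calling `eot_attack(weights, [0.5] * len(weights))` returns a texture whose `expected_score` can be compared against the starting texture. In a realistic setting the toy renderer and linear score would be replaced by a differentiable mesh renderer and the policy network, with gradients obtained by automatic differentiation.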

Key results

Achieves high attack success rates across diverse viewing angles and camera-object distances
Outperforms conventional 2D adversarial patches under dynamic viewpoint conditions
Demonstrates black-box transferability and real-world deployment viability
Coarse-to-fine optimization and saliency guidance significantly boost attack potency

Why it matters

It exposes critical security flaws in real-world robotic manipulation systems, prompting developers to prioritize 3D adversarial robustness for safe deployment.

Abstract

Neural network–based visuomotor policies enable robots to perform manipulation tasks but remain susceptible to perceptual attacks. For example, conventional 2D adversarial patches are effective under fixed-camera setups, where appearance is relatively consistent; however, their efficacy often diminishes under dynamic viewpoints from moving cameras, such as wrist-mounted setups, due to perspective distortions. To proactively investigate potential vulnerabilities beyond 2D patches, this work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects through differentiable rendering. As optimization strategies, we employ Expectation over Transformation (EOT) with a Coarse-to-Fine (C2F) curriculum, exploiting distance-dependent frequency characteristics to induce textures effective across varying camera–object distances. We further integrate saliency-guided perturbations to redirect policy attention and design a targeted loss that persistently drives robots toward adversarial objects. Our comprehensive experiments show that the proposed method is effective under various environmental conditions, while confirming its black-box transferability and real-world applicability.
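One plausible reading of a coarse-to-fine curriculum is to optimize a low-resolution texture first and carry the result into progressively finer grids, so that low-frequency structure (which survives at long camera–object distances) is settled before high-frequency detail is added. The sketch below shows only that scheduling skeleton. The resolution-doubling schedule, the nearest-neighbour upsampling, and the `optimize_stage` callback are all assumptions for illustration; the abstract does not specify how the authors' curriculum is implemented.

```python
def upsample_nearest(grid, factor):
    """Repeat every texel `factor` times along both axes (nearest-neighbour upsampling)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def coarse_to_fine(optimize_stage, base_size=4, stages=3, init=0.5):
    """Run `optimize_stage` at base_size, 2*base_size, 4*base_size, ... resolution.

    `optimize_stage(texture, stage)` is a hypothetical callback that returns an
    updated texture of the same shape, for example after a few EOT gradient steps.
    """
    texture = [[init] * base_size for _ in range(base_size)]
    for stage in range(stages):
        if stage > 0:
            # Carry the coarse solution into the next, finer grid.
            texture = upsample_nearest(texture, 2)
        texture = optimize_stage(texture, stage)
    return texture
```

With a callback that returns its input unchanged, `coarse_to_fine(lambda tex, stage: tex, base_size=2, stages=3)` yields an 8×8 grid, which makes the resolution schedule easy to inspect before plugging in a real optimizer.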

Index terms

Deep Learning for Visual Perception; Deep Learning in Grasping and Manipulation; Deep Learning Methods