ICRA 2026

Actron3D: Learning Actionable Neural Functions from Videos for Transferable Robotic Manipulation

Anran Zhang, Hanzhi Chen, Yannick Burkhardt, Yao Zhong, Johannes Betz, Helen Oleynikova, Stefan Leutenegger

AI summary

Actron3D enables zero-shot robotic manipulation of novel objects by distilling multimodal cues from just 2–3 monocular videos into a continuous 3D neural representation.

Neural Affordance, Robotic Manipulation, Learning from Videos, Zero-Shot Transfer, 3D Representation, Differentiable Optimization

Problem

Existing video-based robot learning methods lack explicit 3D spatial grounding, struggle with viewpoint and object discrepancies, and require extensive demonstrations or manual annotations to generalize.

Approach

The framework distills geometry, visual features, contact priors, and action flows from monocular videos into a continuous 3D Neural Affordance Function, then aligns this function to new objects via differentiable optimization to generate executable 6-DoF policies.
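The alignment step can be illustrated with a toy sketch. This is not the paper's method: it assumes known point correspondences, restricts the transform to a rotation about the z-axis plus a translation, and uses plain gradient descent with hand-derived gradients. The names `rot_z`, `drot_z`, and `align` are hypothetical and exist only for this illustration.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def drot_z(theta):
    """Derivative of rot_z with respect to theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[-s, -c, 0.0], [c, -s, 0.0], [0.0, 0.0, 0.0]])

def align(src, dst, steps=500, lr=0.1):
    """Fit (theta, t) minimizing mean ||R(theta) p + t - q||^2 by gradient descent.

    Assumes src[i] corresponds to dst[i]; a stand-in for feature-based alignment.
    """
    theta, t = 0.0, np.zeros(3)
    for _ in range(steps):
        resid = src @ rot_z(theta).T + t - dst            # (N, 3) residuals
        grad_t = 2.0 * resid.mean(axis=0)                 # dL/dt
        moved = src @ drot_z(theta).T                     # d(R p)/dtheta per point
        grad_theta = 2.0 * np.mean(np.sum(resid * moved, axis=1))
        theta -= lr * grad_theta
        t -= lr * grad_t
    return theta, t

# Synthetic "demonstration" points and a transformed "novel object" copy.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
true_theta, true_t = 0.6, np.array([0.3, -0.2, 0.1])
dst = src @ rot_z(true_theta).T + true_t

theta, t = align(src, dst)
print("recovered angle:", theta, "recovered translation:", t)
```

A full 6-DoF version would optimize all three rotation parameters and would compare queried features rather than known correspondences; the one-angle case keeps the gradient easy to verify by hand.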

Key results

Continuous object-centric action representation encoding geometry, visual features, contact priors, and point flows
Differentiable affordance transfer mechanism aligning neural functions to novel scenes for 6-DoF trajectory generation
14.9 percentage point improvement in average success rate across 13 tasks over data-hungry baselines
Successful zero-shot manipulation using only 2–3 demonstration videos per task in simulation and real-world settings

Why it matters

Enables robots to efficiently learn and transfer complex manipulation skills from casual videos to unseen objects, drastically reducing reliance on expensive data collection and manual annotations.

Abstract

We present ACTRON3D, a framework that enables robots to acquire transferable 6-DoF manipulation skills from monocular, uncalibrated, RGB-only human demonstration videos. Our key idea is to represent manipulation knowledge within a video as a continuous neural function over object space. At the core of ACTRON3D lies the Neural Affordance Function, which distills geometry, visual features, contact priors, and action flows from diverse demonstration videos into a compact 3D neural representation. During deployment, we adopt a hierarchical pipeline that retrieves the matched affordance function and transfers encoded manipulation knowledge to novel objects through coarse-to-fine differentiable optimization. Leveraging the continuous nature of the Neural Affordance Function, the framework performs spatial queries over multimodal features to align demonstrations with observations and generates precise 6-DoF manipulation policies. Experiments in both simulation and the real world demonstrate that ACTRON3D significantly outperforms prior methods, achieving a 14.9 percentage point improvement in the average success rate across 13 tasks while requiring only 2–3 demonstration videos per task.
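To make the idea of a continuous function over object space concrete, here is a minimal sketch of a field that can be queried at arbitrary 3D points and returns the four kinds of output the abstract lists. The architecture is an assumption made for illustration (a single tanh layer with random, untrained weights and four linear heads); the class name `ToyAffordanceField`, the head names, and the choice of a scalar geometry output are hypothetical and are not taken from the paper.

```python
import numpy as np

class ToyAffordanceField:
    """Untrained toy field mapping 3D points to multimodal outputs (illustrative only)."""

    def __init__(self, feat_dim=8, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.5, size=(3, hidden))        # shared trunk
        self.b1 = np.zeros(hidden)
        self.w_geom = rng.normal(scale=0.5, size=(hidden, 1))    # assumed scalar geometry value
        self.w_feat = rng.normal(scale=0.5, size=(hidden, feat_dim))
        self.w_contact = rng.normal(scale=0.5, size=(hidden, 1))
        self.w_flow = rng.normal(scale=0.5, size=(hidden, 3))

    def query(self, xyz):
        """Evaluate the field at points of shape (N, 3) or (3,)."""
        xyz = np.atleast_2d(np.asarray(xyz, dtype=float))
        h = np.tanh(xyz @ self.w1 + self.b1)
        contact_logit = (h @ self.w_contact)[:, 0]
        return {
            "geometry": (h @ self.w_geom)[:, 0],                  # (N,)
            "features": h @ self.w_feat,                          # (N, feat_dim)
            "contact": 1.0 / (1.0 + np.exp(-contact_logit)),      # (N,), in (0, 1)
            "flow": h @ self.w_flow,                              # (N, 3) displacement
        }

# Query the field at arbitrary continuous locations, not only on a fixed grid.
field = ToyAffordanceField()
points = np.random.default_rng(1).uniform(-1.0, 1.0, size=(5, 3))
out = field.query(points)
print({name: value.shape for name, value in out.items()})
```

Because the function is defined everywhere in space, any point on a newly observed object can be evaluated directly, which is what makes the spatial queries described in the abstract possible.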

Index terms

Deep Learning in Grasping and Manipulation; Perception for Grasping and Manipulation; Learning from Demonstration