Actron3D: Learning Actionable Neural Functions from Videos for Transferable Robotic Manipulation
Anran Zhang, Hanzhi Chen, Yannick Burkhardt, Yao Zhong, Johannes Betz, Helen Oleynikova, Stefan Leutenegger
AI summary
Problem
Existing video-based robot learning methods lack explicit 3D spatial grounding, struggle with viewpoint and object discrepancies, and require extensive demonstrations or manual annotations to generalize.
Approach
The framework distills geometry, visual features, contact priors, and action flows from monocular videos into a continuous 3D Neural Affordance Function, then aligns this function to new objects via differentiable optimization to generate executable 6-DoF policies.
Key results
- Continuous object-centric action representation encoding geometry, visual features, contact priors, and point flows
- Differentiable affordance transfer mechanism aligning neural functions to novel scenes for 6-DoF trajectory generation
- 14.9 percentage point improvement in average success rate across 13 tasks over data-hungry baselines
- Successful zero-shot manipulation using only 2–3 demonstration videos per task in simulation and real-world settings
Why it matters
Enables robots to efficiently learn and transfer complex manipulation skills from casual videos to unseen objects, drastically reducing reliance on expensive data collection and manual annotations.
Abstract
We present ACTRON3D, a framework that enables robots to acquire transferable 6-DoF manipulation skills from monocular, uncalibrated, RGB-only human demonstration videos. Our key idea is to represent the manipulation knowledge within a video as a continuous neural function over object space. At the core of ACTRON3D lies the Neural Affordance Function, which distills geometry, visual features, contact priors, and action flows from diverse demonstration videos into a compact 3D neural representation. During deployment, we adopt a hierarchical pipeline that retrieves the matching affordance function and transfers the encoded manipulation knowledge to novel objects through coarse-to-fine differentiable optimization. Leveraging the continuous nature of the Neural Affordance Function, the framework performs spatial queries over multimodal features to align demonstrations with observations and generates precise 6-DoF manipulation policies. Experiments in both simulation and the real world demonstrate that ACTRON3D significantly outperforms prior methods, achieving a 14.9 percentage point improvement in average success rate across 13 tasks while requiring only 2–3 demonstration videos per task.
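To make the idea of aligning a continuous function to a new observation concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes a hand-written scalar feature field (two Gaussian bumps) in place of a learned Neural Affordance Function, a translation-only alignment in place of a full 6-DoF transform, and a grid search followed by gradient descent as a stand-in for the coarse-to-fine differentiable optimization. All names (`feature_field`, `coarse_search`, `refine`, `BUMPS`, `SIGMA`) are hypothetical.

```python
import itertools
import math

# Hypothetical stand-in for a learned network: a smooth scalar feature field
# built from two Gaussian bumps (center, weight) in the demo object's frame.
BUMPS = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.5, 0.0), 0.6)]
SIGMA = 1.0


def feature_field(x):
    """Feature value at any continuous 3D point x."""
    total = 0.0
    for center, weight in BUMPS:
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        total += weight * math.exp(-d2 / (2.0 * SIGMA ** 2))
    return total


def feature_field_grad(x):
    """Analytic gradient of feature_field with respect to x."""
    grad = [0.0, 0.0, 0.0]
    for center, weight in BUMPS:
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        scale = weight * math.exp(-d2 / (2.0 * SIGMA ** 2)) / SIGMA ** 2
        for j in range(3):
            grad[j] -= scale * (x[j] - center[j])
    return grad


def loss(t, points, feats):
    """Mean squared feature mismatch after shifting observed points back by t."""
    total = 0.0
    for p, f in zip(points, feats):
        x = [pi - ti for pi, ti in zip(p, t)]
        total += (feature_field(x) - f) ** 2
    return total / len(points)


def loss_grad(t, points, feats):
    """Gradient of loss with respect to t (chain rule: d(p - t)/dt = -I)."""
    grad = [0.0, 0.0, 0.0]
    for p, f in zip(points, feats):
        x = [pi - ti for pi, ti in zip(p, t)]
        residual = feature_field(x) - f
        gx = feature_field_grad(x)
        for j in range(3):
            grad[j] -= 2.0 * residual * gx[j] / len(points)
    return grad


def coarse_search(points, feats, lo=-1.0, hi=1.0, n=5):
    """Coarse stage: pick the best translation on a regular n x n x n grid."""
    ticks = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(itertools.product(ticks, repeat=3),
               key=lambda t: loss(t, points, feats))


def refine(t0, points, feats, steps=500, lr=1.0):
    """Fine stage: gradient descent with step halving (accept only decreases)."""
    t = list(t0)
    cur = loss(t, points, feats)
    for _ in range(steps):
        g = loss_grad(t, points, feats)
        step = lr
        while step > 1e-8:
            cand = [ti - step * gi for ti, gi in zip(t, g)]
            new = loss(cand, points, feats)
            if new < cur:
                t, cur = cand, new
                break
            step *= 0.5
        else:
            break  # no improving step found, treat as converged
    return t, cur


# Synthetic "novel scene": the demo points shifted by an unknown translation,
# carrying the feature values they had in the demo frame.
demo_points = list(itertools.product((-1.0, 0.0, 1.0), repeat=3))
t_true = (0.3, -0.2, 0.1)
observed = [tuple(q + s for q, s in zip(p, t_true)) for p in demo_points]
observed_feats = [feature_field(p) for p in demo_points]

t_coarse = coarse_search(observed, observed_feats)
t_fine, final_loss = refine(t_coarse, observed, observed_feats)
print("coarse estimate:", t_coarse)
print("refined estimate:", [round(v, 4) for v in t_fine])
```

Because the field can be evaluated at any continuous point, the loss is differentiable in the transform parameters, which is what makes gradient-based refinement possible here. A real system would optimize a full rigid transform over learned multimodal features, typically with automatic differentiation rather than the hand-derived gradient used in this sketch.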