VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang
AI summary
Problem
Multi-camera setups in 3D robotic manipulation introduce substantial redundancy and irrelevant information, increasing computational costs and training time while making it difficult to extract task-relevant features and mitigate occlusion.
Approach
VERM uses a foundation model to predict a task-adaptive virtual camera pose from multi-view observations, projects the 3D point cloud onto this single view, and combines it with a depth-aware module and dynamic coarse-to-fine refinement for precise 3D action planning.
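The projection step in this pipeline can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the predicted virtual camera pose is given as azimuth, elevation, and radius around a target point, and it uses a plain pinhole model with a nearest-point z-buffer. The names `look_at_extrinsics` and `render_virtual_view`, the image size, and the focal length are all hypothetical choices made for illustration.

```python
import numpy as np


def look_at_extrinsics(azimuth_deg, elevation_deg, radius, target):
    """World-to-camera rotation R (3x3) and translation t (3,) for a camera
    placed on a sphere around `target` and looking at it (world +z is up).

    The pose parameterization is an assumption made for this sketch.
    """
    az, el = np.radians([azimuth_deg, elevation_deg])
    target = np.asarray(target, dtype=float)
    eye = target + radius * np.array(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)]
    )
    forward = (target - eye) / radius           # camera +z: viewing direction
    right = np.cross(forward, [0.0, 0.0, 1.0])  # camera +x
    right /= np.linalg.norm(right)              # undefined for a straight top-down view
    down = np.cross(forward, right)             # camera +y: image rows grow downward
    R = np.stack([right, down, forward])
    return R, -R @ eye


def render_virtual_view(points, colors, R, t, size=64, focal=64.0):
    """Project a colored point cloud into an RGB image and a depth map,
    keeping the nearest point per pixel (a simple z-buffer)."""
    cam = points @ R.T + t                      # world -> camera coordinates
    z = cam[:, 2]
    front = z > 1e-6                            # drop points behind the camera
    cam, z, colors = cam[front], z[front], colors[front]

    # Pinhole projection with the principal point at the image center.
    u = np.floor(focal * cam[:, 0] / z + size / 2).astype(int)
    v = np.floor(focal * cam[:, 1] / z + size / 2).astype(int)
    inside = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    depth = np.full((size, size), np.inf)       # inf marks empty pixels
    np.minimum.at(depth, (v, u), z)             # nearest depth per pixel
    nearest = z == depth[v, u]                  # points that won the z-buffer
    rgb = np.zeros((size, size, 3))
    rgb[v[nearest], u[nearest]] = colors[nearest]
    return rgb, depth
```

Calling `look_at_extrinsics(0.0, 0.0, 1.0, [0, 0, 0])` and passing the result to `render_virtual_view` yields one RGB image and one depth map for the chosen viewpoint. The depth channel is what keeps 3D information available after the cloud has been collapsed to a single view.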
Key results
- Achieves 1.89× faster training and 1.54× faster inference than state-of-the-art methods
- Surpasses prior SOTA on RLBench simulation and real-world 3D manipulation tasks
- Validates plug-and-play compatibility across multiple foundation models including GPT-4o, Qwen2.5, and Claude 3.5
- Introduces a dynamic coarse-to-fine refinement mechanism that selectively triggers high-precision views only during critical task phases
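The selective refinement idea in the last bullet can be sketched generically as follows. This is an assumption-laden illustration, not the paper's mechanism: it supposes that a boolean flag (here `needs_refinement`, hypothetical) signals a critical phase, and that refinement means cropping the point cloud to a small cube around the coarse prediction and rescaling it. The function names and the `half_extent` default are likewise invented for the example.

```python
import numpy as np


def zoom_crop(points, center, half_extent):
    """Keep points inside an axis-aligned cube around `center` and rescale
    their offsets to the range [-1, 1] on each axis."""
    center = np.asarray(center, dtype=float)
    offsets = points - center
    keep = np.all(np.abs(offsets) <= half_extent, axis=1)
    return offsets[keep] / half_extent


def select_points_for_step(points, coarse_xyz, needs_refinement, half_extent=0.1):
    """Return the full cloud for ordinary steps, or a zoomed-in crop around
    the coarse prediction when the (hypothetical) refinement flag is set."""
    if not needs_refinement:
        return points
    return zoom_crop(points, coarse_xyz, half_extent)
```

Skipping the crop on ordinary steps is what would save computation in such a scheme: the extra high-precision pass runs only when the flag is raised.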
Why it matters
Provides a computationally efficient, task-adaptive visual perception pipeline that enables robots to perform precise 3D manipulation with reduced hardware dependency and faster learning, benefiting both academic researchers and industrial automation systems.
Abstract
When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both the RLBench simulation benchmark and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving a 1.89× speedup in training time and a 1.54× speedup in inference.