BFA: Best-Feature-Aware Fusion for Multi-View Fine-Grained Manipulation
Zihan Lan, Weixin Mao, Haosheng Li, Le Wang, Tiancai Wang, Haoqiang Fan, Osamu Yoshie
AI summary
Problem
Existing multi-view manipulation policies treat all camera inputs equally, ignoring that different views hold varying importance across manipulation stages. This uniform fusion introduces redundant visual noise, increases computation, and degrades precision in fine-grained tasks.
Approach
The authors introduce a plug-and-play Best-Feature-Aware (BFA) module that uses a lightweight score network to dynamically predict and reweight the importance of each camera view before fusing them for policy learning.
Key results
- 22–46% success rate improvement over ACT and RDT baselines across five real-world tasks
- Automated VLM-based annotation framework for generating view importance ground truth
- Reduced computational load through dynamic, stage-aware view prioritization
- Robust generalization on complex dexterous tasks like bag unzipping and box opening
Why it matters
Offers a lightweight, policy-agnostic visual fusion strategy that significantly enhances precision and efficiency for real-world fine-grained robotic manipulation.
Abstract
In real-world scenarios, multi-view cameras are typically employed for fine-grained manipulation tasks. Existing approaches (e.g., ACT [1]) tend to treat multi-view features equally and directly concatenate them for policy learning. However, this introduces redundant visual information and higher computational costs, leading to ineffective manipulation. Fine-grained manipulation tasks typically consist of multiple stages, and the best view may vary from one stage to the next. This paper proposes a plug-and-play Best-Feature-Aware (BFA) fusion strategy for multi-view manipulation tasks that is adaptable to various policies. Building upon the visual backbone of the policy network, we design a lightweight subnetwork that predicts the importance score of each view. The multi-view features are reweighted by the predicted importance scores, then fused and fed into the end-to-end policy network. Our method demonstrates outstanding performance in fine-grained manipulation: experimental results show that it outperforms multiple baselines by 22–46% in success rate on different tasks. Our work provides new insights and inspiration for tackling key challenges in fine-grained manipulation.
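The score-then-reweight step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the `ScoreNetwork` class, its two-layer MLP shape, the softmax normalization of scores, and the weighted-sum fusion in `fuse_views` are all assumptions chosen for illustration, and the random features stand in for the output of a policy's visual backbone.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ScoreNetwork:
    """Hypothetical lightweight MLP: one scalar importance score per view.

    The layer sizes and the softmax over views are illustrative assumptions,
    not details taken from the paper.
    """

    def __init__(self, feat_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (feat_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, (hidden_dim, 1))
        self.b2 = np.zeros(1)

    def __call__(self, view_feats):
        # view_feats: (num_views, feat_dim), one backbone feature per camera
        hidden = np.maximum(view_feats @ self.w1 + self.b1, 0.0)  # ReLU
        logits = (hidden @ self.w2 + self.b2).squeeze(-1)  # (num_views,)
        return softmax(logits)  # scores sum to 1 across views


def fuse_views(view_feats, scores):
    """Reweight each view by its score, then fuse by summation (assumed)."""
    return (scores[:, None] * view_feats).sum(axis=0)


# Stand-in for backbone features from three cameras (random placeholders).
rng = np.random.default_rng(1)
view_feats = rng.normal(size=(3, 64))

score_net = ScoreNetwork(feat_dim=64)
scores = score_net(view_feats)          # shape (3,)
fused = fuse_views(view_feats, scores)  # shape (64,), input to the policy
```

In this sketch the fused feature has the size of a single view's feature rather than the concatenation of all views, which is one way a reweighting scheme could keep the policy input compact. A trained score network would be expected to shift weight between cameras as the task moves through its stages.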