ICRA 2026

BFA: Best-Feature-Aware Fusion for Multi-View Fine-Grained Manipulation

Zihan Lan, Weixin Mao, Haosheng Li, Le Wang, Tiancai Wang, Haoqiang Fan, Osamu Yoshie

AI summary

Dynamically weighting multi-view camera features by task stage boosts fine-grained robotic manipulation success rates by 22–46% over uniform fusion baselines.

Multi-view fusion, Fine-grained manipulation, Imitation learning, View importance scoring, Robotic control, Vision-language models

Problem

Existing multi-view manipulation policies treat all camera inputs equally, ignoring that different views hold varying importance across manipulation stages. This uniform fusion introduces redundant visual noise, increases computation, and degrades precision in fine-grained tasks.

Approach

The authors introduce a plug-and-play Best-Feature-Aware (BFA) module that uses a lightweight score network to dynamically predict and reweight the importance of each camera view before fusing them for policy learning.
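The reweight-then-fuse idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single linear layer as a stand-in for the score network, softmax normalization of the scores, and concatenation as the fusion step, none of which are specified in this summary. The names `score_view`, `fuse_views`, and `scorer_weights`, and the three toy camera views, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_view(feature: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Hypothetical stand-in for the score network: one linear layer shared by all views."""
    return sum(f * w for f, w in zip(feature, weights)) + bias


def fuse_views(
    view_features: Sequence[Sequence[float]],
    scorer_weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Scale each view's feature vector by its predicted importance, then concatenate.

    Softmax normalization and concatenation are assumptions made for this sketch.
    Returns (fused_vector, importance_weights).
    """
    raw_scores = [score_view(f, scorer_weights) for f in view_features]
    importance = softmax(raw_scores)
    fused: List[float] = []
    for feature, weight in zip(view_features, importance):
        fused.extend(weight * x for x in feature)
    return fused, importance


# Toy example: three camera views, each with a 4-dimensional backbone feature.
views = [
    [0.2, 0.1, 0.0, 0.3],  # e.g. a top camera (hypothetical)
    [1.5, 0.9, 1.1, 0.7],  # e.g. a wrist camera (hypothetical)
    [0.1, 0.0, 0.2, 0.1],  # e.g. a side camera (hypothetical)
]
scorer_weights = [1.0, 1.0, 1.0, 1.0]

fused, importance = fuse_views(views, scorer_weights)
```

In this toy setup the second view receives the largest weight because its features produce the highest raw score. In a real policy the scorer would be trained, for example against the view importance annotations mentioned under Key results, and the fused vector would be passed to the policy network.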

Key results

22–46% success rate improvement over ACT and RDT baselines across five real-world tasks
Automated VLM-based annotation framework for generating view importance ground truth
Reduced computational load through dynamic, stage-aware view prioritization
Robust generalization on complex dexterous tasks like bag unzipping and box opening

Why it matters

Offers a lightweight, policy-agnostic visual fusion strategy that significantly enhances precision and efficiency for real-world fine-grained robotic manipulation.

Abstract

In real-world scenarios, multi-view cameras are typically employed for fine-grained manipulation tasks. Existing approaches (e.g., ACT [1]) tend to treat multi-view features equally and directly concatenate them for policy learning. However, this introduces redundant visual information and incurs higher computational costs, leading to ineffective manipulation. Fine-grained manipulation tasks typically consist of multiple stages, where the best view may vary across different phases. This paper proposes a plug-and-play Best-Feature-Aware (BFA) fusion strategy for multi-view manipulation tasks, which is adaptable to various policies. Building upon the visual backbone of the policy network, we design a lightweight subnetwork to effectively predict the importance score of each view. Based on the predicted importance scores, the reweighted multi-view features are subsequently fused and fed into the end-to-end policy network for seamless integration. Notably, our method demonstrates outstanding performance in fine-grained manipulations. The experimental results show that our approach outperforms multiple baselines by 22–46% in success rate on different tasks. Our work provides new insights and inspiration for tackling key challenges in fine-grained manipulations.

Index terms

Imitation Learning, Sensor Fusion, Bimanual Manipulation