Multi-View Gating Unit with KL-Based Alignment Toward Real-World Robot Control
Kei Igarashi, Shingo Murata
AI summary
Problem
Integrating multi-view camera inputs for robot control is hindered by occlusions and irrelevant features that overwhelm naive concatenation, making precise manipulation difficult.
Approach
A Multi-View Gating Unit (MGU) assigns context-dependent, per-dimension weights to latent representations from different cameras and sums them into a fused representation. A KL-based alignment objective encourages consistency between the individual-view and fused representations.
Key results
- Achieves 84% overall task success rate across five kitchen-like tasks
- Outperforms a modified Action Chunking with Transformers baseline
- Ablation study indicates that both per-dimension gating and KL alignment contribute to performance
- Dynamically adapts feature weights to situational context to mitigate occlusions
Why it matters
Enables robust, context-aware perception for autonomous robots operating in unstructured, real-world environments.
Abstract
This paper proposes a framework for integrating latent representations from multi-view images, using adaptive weighting based on situational context to facilitate the generation of robot actions. Specifically, we introduce the multi-view gating unit (MGU), which assigns context-dependent weights to each dimension of the latent representations extracted from different viewpoints. By summing the corresponding dimensions across all viewpoints, we construct a fused latent representation that serves as input to a policy model. To enhance the effectiveness of the MGU and improve the accuracy of action generation, we incorporate a Kullback–Leibler (KL)-based alignment objective that encourages consistency between individual viewpoint representations and the fused representation. We evaluate the proposed framework through imitation-learning experiments in a kitchen-like real-robot environment across five tasks. The experimental results show that the MGU dynamically adapts to different contexts, thereby enabling successful task execution. Additionally, we compare our approach with a modified Action Chunking with Transformers (ACT) baseline and conduct an ablation study to assess the contribution of each component. The results show that our method achieves a task success rate of 84%, outperforming all baseline methods and validating the effectiveness of both the individual components and their integration within the proposed framework.
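The per-dimension gating and the KL-based alignment described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several details are assumptions made purely for illustration: the gate is a single linear map over the concatenated view latents, the weights are normalized with a softmax across views, each representation is treated as the mean of a unit-variance diagonal Gaussian, and the KL direction is view-to-fused. The names `gated_fusion`, `kl_diag_gaussians`, and `alignment_loss` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_fusion(latents, W, b):
    """Fuse V view latents of size D using per-dimension weights.

    latents: (V, D) array, one latent vector per camera view.
    W, b:    parameters of an assumed linear gate, shapes (V*D, V*D) and (V*D,).
    Returns the fused latent (D,) and the gate weights (V, D).
    """
    V, D = latents.shape
    context = latents.reshape(-1)              # concatenate all views as context
    logits = (context @ W + b).reshape(V, D)   # one logit per view and dimension
    weights = softmax(logits, axis=0)          # assumption: weights sum to 1 across views
    fused = (weights * latents).sum(axis=0)    # sum corresponding dimensions across views
    return fused, weights


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(
        logvar_q - logvar_p
        + (np.exp(logvar_p) + (mu_p - mu_q) ** 2) / np.exp(logvar_q)
        - 1.0
    )


def alignment_loss(latents, fused):
    """Sum of KL(view || fused) over views, assuming unit variance everywhere."""
    zeros = np.zeros_like(fused)
    return sum(kl_diag_gaussians(z, zeros, fused, zeros) for z in latents)


# Small demonstration with random inputs (3 views, 4 latent dimensions).
rng = np.random.default_rng(0)
V, D = 3, 4
latents = rng.normal(size=(V, D))
W = 0.1 * rng.normal(size=(V * D, V * D))
b = np.zeros(V * D)

fused, weights = gated_fusion(latents, W, b)
loss = alignment_loss(latents, fused)
print("gate weights per view and dimension:\n", weights)
print("fused latent:", fused)
print("alignment loss:", loss)
```

Under the unit-variance assumption the KL term reduces to half the squared distance between a view latent and the fused latent, so the alignment loss in this sketch pulls each view toward the fused representation. In a real system the gate parameters and encoders would be trained jointly with the policy; the sketch only shows the forward computation.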