ICRA 2026

Multi-View Gating Unit with KL-Based Alignment Toward Real-World Robot Control

Kei Igarashi, Shingo Murata

AI summary

A context-aware gating mechanism dynamically fuses multi-view camera features and achieves an 84% task success rate across five real-robot manipulation tasks, outperforming a modified Action Chunking with Transformers (ACT) baseline.

Multi-view fusion, Robot control, Gating mechanism, KL alignment, Imitation learning, Real-world robotics

Problem

Integrating multi-view camera inputs for robot control is hindered by occlusions and irrelevant features that overwhelm naive concatenation, making precise manipulation difficult.

Approach

A Multi-View Gating Unit assigns context-dependent, per-dimension weights to latent representations from different cameras, combined with a KL-based alignment objective to enforce consistency between individual and fused features.
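
The per-dimension gating and summation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `softmax` and `gate_and_fuse` are hypothetical, the gate logits are passed in as plain numbers where the paper would use a learned, context-dependent module, and normalizing the weights with a softmax across views is an assumption the summary does not confirm.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gate_and_fuse(latents, gate_logits):
    """Fuse per-view latents with per-dimension weights.

    latents, gate_logits: lists of V views, each a list of D floats.
    ASSUMPTION: weights are normalized across views with a softmax,
    independently for every latent dimension.
    """
    num_views = len(latents)
    dim = len(latents[0])
    weights = [[0.0] * dim for _ in range(num_views)]
    fused = []
    for d in range(dim):
        w = softmax([gate_logits[v][d] for v in range(num_views)])
        for v in range(num_views):
            weights[v][d] = w[v]
        # Sum the weighted d-th dimension across all views.
        fused.append(sum(w[v] * latents[v][d] for v in range(num_views)))
    return fused, weights


# Two views, two latent dimensions (toy numbers).
latents = [[1.0, 4.0], [3.0, 0.0]]
# Equal logits in dimension 0; view 0 strongly preferred in dimension 1,
# e.g. because view 1 is occluded for that feature.
gate_logits = [[0.0, 5.0], [0.0, -5.0]]
fused, weights = gate_and_fuse(latents, gate_logits)
```

In this toy case the first fused dimension is the plain average of the two views, while the second is taken almost entirely from view 0, which is how a per-dimension gate could suppress an occluded camera for some features without discarding it for others.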

Key results

Achieves 84% overall task success rate across five kitchen-like tasks
Outperforms a modified Action Chunking with Transformers baseline
An ablation study indicates that both per-dimension gating and KL alignment contribute to performance
Dynamically adapts feature weights to situational context to mitigate occlusions

Why it matters

Enables robust, context-aware perception for autonomous robots operating in unstructured, real-world environments.

Abstract

This paper proposes a framework for integrating latent representations from multi-view images, using adaptive weighting based on situational context to facilitate the generation of robot actions. Specifically, we introduce the multi-view gating unit (MGU), which assigns context-dependent weights to each dimension of the latent representations extracted from different viewpoints. By summing the corresponding dimensions across all viewpoints, we construct a fused latent representation that serves as input to a policy model. To enhance the effectiveness of the MGU and improve the accuracy of action generation, we incorporate a Kullback–Leibler (KL)-based alignment objective that encourages consistency between individual viewpoint representations and the fused representation. We evaluate the proposed framework through imitation-learning experiments in a kitchen-like real-robot environment across five tasks. The experimental results show that the MGU dynamically adapts to different contexts, thereby enabling successful task execution. Additionally, we compare our approach with a modified Action Chunking with Transformers (ACT) baseline and conduct an ablation study to assess the contribution of each component. The results show that our method achieves a task success rate of 84%, outperforming all baseline methods and validating the effectiveness of both the individual components and their integration within the proposed framework.
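
One way to read the KL-based alignment objective is sketched below. The abstract does not state the distribution family, the direction of the divergence, or how terms are aggregated, so everything here is an assumption made for illustration: each latent is modeled as a diagonal Gaussian, the divergence is taken from each view to the fused representation, and the per-view terms are averaged. The names `kl_diag_gaussian` and `alignment_loss` are hypothetical.

```python
import math


def kl_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


def alignment_loss(view_params, fused_params):
    """Mean KL from each view's latent distribution to the fused one.

    view_params: list of (mu, sigma) pairs, one per view.
    fused_params: a single (mu, sigma) pair.
    ASSUMPTION: direction KL(view || fused) and mean aggregation.
    """
    mu_f, sigma_f = fused_params
    kls = [kl_diag_gaussian(mu_v, sigma_v, mu_f, sigma_f)
           for mu_v, sigma_v in view_params]
    return sum(kls) / len(kls)


# Toy one-dimensional example: view 0 matches the fused latent exactly,
# view 1 has its mean shifted by one standard deviation.
views = [([0.0], [1.0]), ([1.0], [1.0])]
fused = ([0.0], [1.0])
loss = alignment_loss(views, fused)
```

Under these assumptions the loss is zero only when every view's distribution coincides with the fused one, so minimizing it alongside the action loss would pull the individual and fused representations toward consistency.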

Index terms

Cognitive Control Architectures; Machine Learning for Robot Control; Representation Learning