ICRA 2026

Enhancing Vision-Based Policies with Omni-View and Cross-Modality Knowledge Distillation for Mobile Robots

Kai Li, Shiyu Zhao

AI summary

Distilling omni-view depth knowledge into a lightweight monocular policy via contrastive learning significantly boosts navigation performance and scene generalization on low-cost mobile robots.

Knowledge distillation; Visuomotor policy; Monocular navigation; Contrastive learning; Mobile robots; Scene transferability

Problem

Vision-based policies on lightweight mobile robots face a trilemma of poor scene transferability with monocular cameras, limited onboard computation, and the high cost of depth sensors.

Approach

A knowledge distillation framework trains a lightweight single-view RGB student to mimic both expert actions and the latent embeddings of an omni-view depth teacher using contrastive learning.
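The two-part objective described above can be sketched as an action-imitation term plus a contrastive embedding-alignment term. This is only an illustrative sketch, not the authors' implementation: the summary does not state the exact losses, so the mean-squared-error action term, the InfoNCE form of the contrastive term, and the names `action_loss`, `info_nce`, `distillation_loss`, `weight`, and `temperature` are all assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def action_loss(student_actions, expert_actions):
    """Assumed imitation term: mean squared error over all action components."""
    total, count = 0.0, 0
    for s, e in zip(student_actions, expert_actions):
        for si, ei in zip(s, e):
            total += (si - ei) ** 2
            count += 1
    return total / count


def info_nce(student_emb, teacher_emb, temperature=0.1):
    """Assumed contrastive term (InfoNCE).

    For sample i, the teacher embedding of the same sample is the positive;
    the teacher embeddings of the other samples in the batch are negatives.
    """
    s = [l2_normalize(v) for v in student_emb]
    t = [l2_normalize(v) for v in teacher_emb]
    loss = 0.0
    for i, si in enumerate(s):
        # Cosine similarity between student i and every teacher embedding.
        logits = [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denom - logits[i]
    return loss / len(s)


def distillation_loss(student_actions, expert_actions,
                      student_emb, teacher_emb,
                      weight=1.0, temperature=0.1):
    """Combined objective: imitate actions and align latent embeddings."""
    return (action_loss(student_actions, expert_actions)
            + weight * info_nce(student_emb, teacher_emb, temperature))


if __name__ == "__main__":
    # Toy batch: 2 samples, 2-D actions, 2-D embeddings (hypothetical values).
    expert = [[0.5, 0.0], [0.2, -0.1]]
    student = [[0.4, 0.1], [0.2, 0.0]]
    teacher_z = [[1.0, 0.0], [0.0, 1.0]]
    student_z = [[0.9, 0.1], [0.1, 0.9]]
    print(distillation_loss(student, expert, student_z, teacher_z))
```

In this sketch the teacher embeddings are treated as fixed targets. A student whose embeddings pair up with the correct teacher embeddings receives a lower contrastive loss than one whose embeddings are matched to the wrong samples, which is the alignment pressure the approach relies on.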

Key results

Approximately 15% improvement in navigational success rate and 19% increase in collision-free travel distance
Elimination of depth sensors or multi-camera setups with ~20 ms onboard inference latency
Contrastive embedding alignment significantly enhances scene transferability and reduces action errors
Successful validation across simulated and real-world lightweight mobile robot deployments

Why it matters

Enables resource-constrained mobile robots to achieve robust, generalizable navigation without expensive depth sensors or heavy computational overhead.

Abstract

Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance-invariant omni-view depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert’s actions but also to align with the latent embeddings of the omni-view depth teacher. Experiments demonstrate that omni-view and depth inputs improve scene transfer and navigation performance, and that the proposed distillation method enhances the performance of a single-view monocular policy compared with policies that solely imitate actions. Real-world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.

Index terms

RGB-D Perception; Visual Learning; Representation Learning