Enhancing Vision-Based Policies with Omni-View and Cross-Modality Knowledge Distillation for Mobile Robots
Kai Li, Shiyu Zhao
AI summary
Problem
Vision-based policies on lightweight mobile robots face a trilemma of poor scene transferability with monocular cameras, limited onboard computation, and the high cost of depth sensors.
Approach
A knowledge distillation framework trains a lightweight single-view RGB student to mimic both expert actions and the latent embeddings of an omni-view depth teacher using contrastive learning.
Key results
- Approximately 15% improvement in navigational success rate and 19% increase in collision-free travel distance
- Removes the need for depth sensors or multi-camera setups, with ~20 ms onboard inference latency
- Contrastive embedding alignment significantly enhances scene transferability and reduces action errors
- Successful validation across simulated and real-world lightweight mobile robot deployments
Why it matters
Enables resource-constrained mobile robots to achieve robust, generalizable navigation without expensive depth sensors or heavy computational overhead.
Abstract
Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance-invariant omni-view depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert’s actions but also to align with the latent embeddings of the omni-view depth teacher. Experiments demonstrate that omni-view and depth inputs improve scene transfer and navigation performance, and that the proposed distillation method enhances the performance of a single-view monocular policy compared with policies that solely imitate actions. Real-world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.
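The two-part training objective described above (imitate the expert's actions and align the student's latent embedding with the teacher's) can be sketched as a combined loss. This is an illustrative sketch only, not the released implementation: the InfoNCE form of the contrastive term, the mean-squared action error, and the names `info_nce`, `action_mse`, `distillation_loss`, `align_weight`, and `temperature` are all assumptions made here for illustration. It uses plain Python lists so that it runs without any dependencies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(student_emb, teacher_emb, temperature=0.1):
    """Assumed contrastive term: for each sample i, the teacher embedding i is
    the positive and the other teacher embeddings in the batch are negatives."""
    s = [l2_normalize(v) for v in student_emb]
    t = [l2_normalize(v) for v in teacher_emb]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student i against every teacher embedding.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(s)


def action_mse(student_actions, expert_actions):
    """Assumed imitation term: mean squared error over all action components."""
    sq_err = [
        (a - b) ** 2
        for sa, ea in zip(student_actions, expert_actions)
        for a, b in zip(sa, ea)
    ]
    return sum(sq_err) / len(sq_err)


def distillation_loss(student_emb, teacher_emb, student_actions, expert_actions,
                      align_weight=1.0, temperature=0.1):
    """Action imitation plus weighted embedding alignment (hypothetical weighting)."""
    return (action_mse(student_actions, expert_actions)
            + align_weight * info_nce(student_emb, teacher_emb, temperature))


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0]]      # omni-view depth teacher embeddings
    student = [[0.9, 0.1], [0.2, 0.8]]      # single-view RGB student embeddings
    expert_actions = [[0.5, 0.0], [0.3, 0.1]]
    student_actions = [[0.4, 0.0], [0.3, 0.2]]
    print(distillation_loss(student, teacher, student_actions, expert_actions))
```

Setting `align_weight=0.0` reduces the sketch to a policy that solely imitates actions, which is the baseline the abstract compares against.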