Y-MAP-Net: Learning from Foundation Models for Real-Time, Multi-Task Scene Perception
Ammar Qammaz, Nikolaos Vasilikopoulos, Iason Oikonomidis, Antonis Argyros
AI summary
Problem
Large foundation models offer strong multi-task generalization but are too computationally heavy for real-time deployment on resource-constrained robotic platforms, while lightweight models lack broad perceptual capabilities.
Approach
The authors design a Y-shaped convolutional network trained via a multi-teacher, single-student paradigm, where task-specific foundation models supervise the learning process to distill their capabilities into a unified, efficient architecture.
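As a rough illustration of the multi-teacher, single-student idea, the sketch below sums per-task losses between a student's outputs and the pseudo-labels produced by task-specific teachers. The function names, the L1 loss, the task keys, and the weights are all assumptions made for illustration; the summary does not specify the losses or weighting the authors use.

```python
from typing import Dict, List


def l1_loss(student: List[float], teacher: List[float]) -> float:
    """Mean absolute error between a student output and a teacher pseudo-label."""
    assert len(student) == len(teacher), "outputs must have the same length"
    return sum(abs(s - t) for s, t in zip(student, teacher)) / len(student)


def multi_teacher_loss(
    student_out: Dict[str, List[float]],
    teacher_out: Dict[str, List[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task losses (hypothetical formulation).

    Each task has its own teacher; the single student is penalized for
    deviating from every teacher's output at once.
    """
    total = 0.0
    for task, weight in weights.items():
        total += weight * l1_loss(student_out[task], teacher_out[task])
    return total


# Toy flattened outputs for two tasks (values are made up).
student = {"depth": [1.0, 2.0], "normals": [0.0, 1.0]}
teacher = {"depth": [1.5, 2.0], "normals": [0.0, 0.0]}
weights = {"depth": 1.0, "normals": 0.5}

loss = multi_teacher_loss(student, teacher, weights)
print(loss)
```

In a real training loop the per-task loss would typically differ by task (for example, a classification loss for segmentation), and the teachers' outputs would be computed once and cached as supervision targets.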
Key results
- First real-time end-to-end network for simultaneous depth, normal, pose, segmentation, and captioning from monocular RGB
- Novel Y-shaped topology with fully shared weights enabling efficient multitask learning
- Iterative depth refinement using predicted surface normals to sharpen output fidelity
- Demonstrated computational efficiency on commodity hardware suitable for real-world robotic deployment
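The depth refinement result relies on the geometric link between a depth map and its surface normals. The sketch below shows that link only: it derives normals from a depth grid with finite differences, using the common approximation n ∝ (-dz/dx, -dz/dy, 1). This is not the authors' refinement procedure, which the summary does not describe; the function name and the unit pixel spacing are assumptions.

```python
import math
from typing import List, Tuple

Normal = Tuple[float, float, float]


def normals_from_depth(depth: List[List[float]]) -> List[List[Normal]]:
    """Estimate unit surface normals from a depth grid (illustrative only).

    Uses central differences in the interior and one-sided differences at
    the borders, assuming unit pixel spacing and ignoring camera intrinsics.
    """
    h, w = len(depth), len(depth[0])
    normals: List[List[Normal]] = []
    for y in range(h):
        row: List[Normal] = []
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            nx, ny, nz = -dzdx, -dzdy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / norm, ny / norm, nz / norm))
        normals.append(row)
    return normals


# A flat surface faces the camera; a ramp in x tilts the normal in x.
flat = normals_from_depth([[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]])
ramp = normals_from_depth([[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]])
```

A refinement step could, in principle, compare normals derived this way against the network's predicted normals and adjust the depth to reduce the disagreement.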
Why it matters
Provides a practical, unified perception backbone that enables real-time scene understanding and safe human-robot interaction on low-cost robotic platforms.
Abstract
We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, semantic segmentation, and generates multi-label captions in a single forward pass. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the learning of the network, allowing it to distill their capabilities into a unified real-time inference architecture. Y-MAP-Net exhibits strong generalization, architectural simplicity, and computational efficiency, making it well-suited for resource-constrained robotic platforms. By providing rich 3D, semantic, and contextual scene understanding from low-cost RGB cameras, Y-MAP-Net supports key robotic capabilities such as object manipulation and human–robot interaction. To encourage future research and reproducibility, we make our code publicly available [1].