ICRA 2026

Y-MAP-Net: Learning from Foundation Models for Real-Time, Multi-Task Scene Perception

Ammar Qammaz, Nikolaos Vasilikopoulos, Iason Oikonomidis, Antonis Argyros

AI summary

Y-MAP-Net distills multiple foundation models into a compact, Y-shaped convolutional network that achieves real-time, simultaneous multi-task scene perception from a single RGB image.

Multi-task learning, Real-time perception, Foundation model distillation, Robotic vision, Convolutional networks, RGB scene understanding

Problem

Large foundation models offer strong multi-task generalization but are too computationally heavy for real-time deployment on resource-constrained robotic platforms, while lightweight models lack broad perceptual capabilities.

Approach

The authors design a Y-shaped convolutional network trained via a multi-teacher, single-student paradigm, where task-specific foundation models supervise the learning process to distill their capabilities into a unified, efficient architecture.
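
As a rough illustration of the multi-teacher, single-student idea, the sketch below combines per-task losses between a student's outputs and the pseudo-labels produced by task-specific teachers. It is only an assumption about how such supervision could be organized: the function names, the task keys, the per-task weights, and the use of mean squared error for every task are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equally sized flat vectors."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_teacher_loss(
    student_outputs: Dict[str, List[float]],
    teacher_targets: Dict[str, List[float]],
    task_weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task losses against teacher pseudo-labels.

    Each key names a task (hypothetical names); each teacher supervises
    only the student output for its own task. Tasks without an explicit
    weight default to 1.0.
    """
    total = 0.0
    for task, target in teacher_targets.items():
        weight = task_weights.get(task, 1.0)
        total += weight * mse(student_outputs[task], target)
    return total


# Toy example: two tasks, each supervised by its own (imaginary) teacher.
student = {"depth": [0.5, 1.0], "normals": [0.0, 0.0, 1.0]}
teachers = {"depth": [0.5, 2.0], "normals": [0.0, 0.0, 1.0]}
weights = {"depth": 2.0, "normals": 1.0}

loss = multi_teacher_loss(student, teachers, weights)
```

In a real system, each task would likely use a loss suited to its output type (for example, a classification loss for segmentation), and the weights would need tuning so that no single teacher dominates training.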

Key results

First real-time end-to-end network for simultaneous depth, normal, pose, segmentation, and captioning from monocular RGB
Novel Y-shaped topology with fully shared weights enabling efficient multitask learning
Iterative depth refinement using predicted surface normals to sharpen output fidelity
Demonstrated computational efficiency on commodity hardware suitable for real-world robotic deployment
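
The summary does not describe how the normal-guided depth refinement works, so the following is only a generic sketch of the underlying idea, not the authors' algorithm. It assumes a single scanline, an orthographic camera, and a sign convention in which a normal (nx, ny, nz) implies a depth slope dz/dx = -nx / nz. Each iteration pulls a pixel's depth toward the value its neighbours predict from those slopes, while staying anchored to the original estimate. The function name and the parameters `alpha` and `iterations` are hypothetical.

```python
from typing import List


def refine_depth_1d(
    depth: List[float],
    slopes: List[float],
    alpha: float = 0.5,
    iterations: int = 10,
) -> List[float]:
    """Jacobi-style refinement of one scanline of depth values.

    slopes[i] is the depth gradient dz/dx implied by the predicted
    normal at pixel i. alpha blends the original depth (weight 1 - alpha)
    with the neighbour-based prediction (weight alpha).
    """
    if len(depth) != len(slopes):
        raise ValueError("depth and slopes must have the same length")
    n = len(depth)
    current = list(depth)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            estimates = []
            if i > 0:
                # Step forward from the left neighbour along the mean slope.
                estimates.append(current[i - 1] + 0.5 * (slopes[i - 1] + slopes[i]))
            if i < n - 1:
                # Step backward from the right neighbour along the mean slope.
                estimates.append(current[i + 1] - 0.5 * (slopes[i] + slopes[i + 1]))
            if estimates:
                neighbour = sum(estimates) / len(estimates)
                updated.append((1 - alpha) * depth[i] + alpha * neighbour)
            else:
                # Single-pixel scanline: nothing to refine against.
                updated.append(depth[i])
        current = updated
    return current


# Toy example: a planar ramp with one corrupted depth sample.
true_depth = [1.0 + 0.1 * i for i in range(7)]
noisy_depth = list(true_depth)
noisy_depth[3] += 0.5
slopes = [0.1] * 7

refined = refine_depth_1d(noisy_depth, slopes)
```

With consistent slopes, the outlier's error shrinks because its neighbours disagree with it, although part of that error spreads to adjacent pixels. A depth profile that already matches the slopes is left essentially unchanged.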

Why it matters

Provides a practical, unified perception backbone that enables real-time scene understanding and safe human-robot interaction on low-cost robotic platforms.

Abstract

We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, semantic segmentation, and generates multi-label captions in a single forward pass. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the learning of the network, allowing it to distill their capabilities into a unified real-time inference architecture. Y-MAP-Net exhibits strong generalization, architectural simplicity, and computational efficiency, making it well-suited for resource-constrained robotic platforms. By providing rich 3D, semantic, and contextual scene understanding from low-cost RGB cameras, Y-MAP-Net supports key robotic capabilities such as object manipulation and human–robot interaction. To encourage future research and reproducibility, we make our code publicly available [1].

Index terms

Deep Learning for Visual Perception, Recognition, Visual Learning