ICRA 2026

Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

Dingcheng Huang, Xiaotong Zhang, Kamal Youcef-Toumi

AI summary

A relevance-driven scheduling framework selectively activates perception modules in real time, cutting computational latency by up to 27.52% relative to parallel pipelines, improving MMPose activation recall by 72.73%, and reaching up to 98% keyframe accuracy.

multimodal perception, streaming perception, relevance-driven scheduling, human-robot collaboration, computational efficiency, real-time perception

Problem

Running all perception modules frame-by-frame in human-robot collaboration causes high latency and computational waste in streaming scenarios. Existing scheduling methods lack real-time, context-aware utility estimation for individual modules.

Approach

The framework estimates a reward for each perception module by balancing its expected information gain against its computational cost. It uses the outputs of previous frames to decide, in real time, which modules to activate for the current frame.
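The reward-based selection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Module` class, the `estimate_gain` proxy, the linear `gain - cost_weight * cost` reward, the `cost_weight` and `threshold` parameters, and the module names and costs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """A perception module with an assumed per-frame compute cost."""
    name: str
    cost_ms: float


def estimate_gain(module, prev_outputs):
    """Hypothetical gain proxy taken from the previous frame's outputs.

    prev_outputs maps a module name to a score in [0, 1] describing how
    informative that module's output is expected to be. A module with no
    history defaults to 1.0, so it gets run at least once.
    """
    return prev_outputs.get(module.name, 1.0)


def schedule(modules, prev_outputs, cost_weight=0.01, threshold=0.0):
    """Return the names of the modules to activate on the current frame.

    Assumed reward: expected gain minus weighted compute cost. A module is
    activated only when its reward exceeds the threshold.
    """
    active = []
    for module in modules:
        reward = estimate_gain(module, prev_outputs) - cost_weight * module.cost_ms
        if reward > threshold:
            active.append(module.name)
    return active


# Toy example with made-up modules and scores.
modules = [Module("pose", 40.0), Module("speech", 10.0), Module("objects", 25.0)]
prev_outputs = {"pose": 0.9, "speech": 0.05, "objects": 0.3}
print(schedule(modules, prev_outputs))
```

In this toy setup the speech module is skipped because its expected gain does not cover its cost, while a first frame with no history activates every module.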

Key results

Reduces computational latency by up to 27.52% compared to parallel pipelines
Improves MMPose activation recall by 72.73%
Achieves up to 98% keyframe accuracy
Shows potential as a scalable, systematic approach to resource allocation in multimodal streaming perception
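For readers unfamiliar with these metrics, the sketch below shows how a relative latency reduction and an activation recall are commonly computed. The exact definitions used in the paper are not given in this summary, so both formulas are assumptions, and the function names and toy inputs are hypothetical.

```python
def latency_reduction_pct(baseline_ms, scheduled_ms):
    """Relative latency reduction of a scheduled pipeline versus a baseline, in percent."""
    return 100.0 * (baseline_ms - scheduled_ms) / baseline_ms


def activation_recall(needed, activated):
    """Fraction of frames that needed a module on which the scheduler activated it.

    needed and activated are equal-length sequences of booleans, one per frame.
    Returns NaN when no frame needed the module.
    """
    hits = [a for n, a in zip(needed, activated) if n]
    if not hits:
        return float("nan")
    return sum(hits) / len(hits)


# Toy values only; these are not measurements from the paper.
print(latency_reduction_pct(100.0, 75.0))
print(activation_recall([True, True, False, True], [True, False, True, True]))
```

Note that the reported 72.73% figure is an improvement in recall, so it describes a change between two such recall values rather than a recall value itself.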

Why it matters

Enables robots to allocate computational resources dynamically in real time, improving efficiency and responsiveness in human-robot collaboration without significantly compromising accuracy.

Abstract

In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework’s capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.

Index terms

Robotics in Under-Resourced Settings; Computer Vision for Automation; RGB-D Perception