Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception
Dingcheng Huang, Xiaotong Zhang, Kamal Youcef-Toumi
AI summary
Problem
Running all perception modules frame-by-frame in human-robot collaboration causes high latency and computational waste in streaming scenarios. Existing scheduling methods lack real-time, context-aware utility estimation for individual modules.
Approach
The framework estimates a reward for each perception module by balancing expected information gain against computational cost, using previous frame outputs to selectively activate only necessary modules in real-time.
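As a rough illustration of the idea above, the following minimal sketch scores each module as expected information gain minus a weighted computational cost and activates only modules with a positive score. This is not the authors' formulation: the linear reward form, the `cost_weight` parameter, the module names, the cost figures, and the gain heuristics are all hypothetical and chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Outputs of the previous frame, keyed by an illustrative cue name.
PrevOutputs = Dict[str, object]


@dataclass(frozen=True)
class PerceptionModule:
    name: str
    cost_ms: float  # assumed per-frame latency (hypothetical value)
    expected_gain: Callable[[PrevOutputs], float]  # hypothetical gain estimator


def schedule(modules: List[PerceptionModule],
             prev_outputs: PrevOutputs,
             cost_weight: float = 0.01) -> List[str]:
    """Return names of modules whose estimated reward is positive.

    Assumed reward: expected_gain(prev_outputs) - cost_weight * cost_ms.
    """
    active = []
    for module in modules:
        reward = module.expected_gain(prev_outputs) - cost_weight * module.cost_ms
        if reward > 0:
            active.append(module.name)
    return active


def pose_gain(prev: PrevOutputs) -> float:
    # Assumption: pose estimation is only informative when a person was seen.
    return 1.0 if prev.get("person_detected") else 0.05


def speech_gain(prev: PrevOutputs) -> float:
    # Assumption: speech recognition is only informative when audio is loud.
    return 0.8 if prev.get("audio_energy", 0.0) > 0.5 else 0.0


MODULES = [
    PerceptionModule("detector", 10.0, lambda prev: 0.6),  # cheap, always useful
    PerceptionModule("pose", 40.0, pose_gain),
    PerceptionModule("speech", 30.0, speech_gain),
]

if __name__ == "__main__":
    previous = {"person_detected": True, "audio_energy": 0.1}
    print(schedule(MODULES, previous))
```

With a person detected and quiet audio in the previous frame, the sketch activates the detector and pose modules and skips speech; with an empty previous frame it falls back to the detector alone.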
Key results
- Reduces computational latency by up to 27.52% compared to parallel pipelines
- Improves MMPose activation recall by 72.73%
- Achieves up to 98% keyframe accuracy
- Validates scalable resource allocation for multimodal streaming perception
Why it matters
Enables robots to dynamically allocate computational resources in real-time, improving efficiency and responsiveness in human-robot collaboration without sacrificing perception quality.
Abstract
In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.