ICRA 2026

A Lightweight Agentic Multimodal Framework for Scene Understanding in Healthcare Robotics

Saurav Jha, Stefan K. Ehrlich

AI summary

A lightweight agentic framework combining a 3B vision-language model with structured scene graph generation achieves competitive multimodal reasoning accuracy while enabling interpretable, safety-critical decision support for healthcare robotics.

healthcare robotics, multimodal reasoning, scene graphs, agentic frameworks, vision-language models, clinical AI

Problem

Current vision-language models lack the temporal reasoning, uncertainty handling, and structured outputs required for safe robotic planning in dynamic clinical environments, while often being too computationally heavy or opaque for high-stakes medical deployment.

Approach

The framework integrates the Qwen2.5-VL-3B model with a SmolAgent orchestration layer to enable chain-of-thought reasoning, speech-vision fusion, and dynamic generation of interpretable scene graphs for video-based clinical understanding.
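
As a rough illustration of what a time-stamped scene graph for video might look like, here is a minimal sketch. The paper's actual graph schema is not given on this page, so the class names (`Relation`, `SceneGraph`), the field names, and the example entities are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) edge, valid over a time interval in seconds."""
    subject: str
    predicate: str
    obj: str
    start_s: float
    end_s: float


@dataclass
class SceneGraph:
    """Hypothetical container for relations extracted from a video clip."""
    relations: list[Relation] = field(default_factory=list)

    def add(self, subject: str, predicate: str, obj: str,
            start_s: float, end_s: float) -> None:
        self.relations.append(Relation(subject, predicate, obj, start_s, end_s))

    def nodes(self) -> set[str]:
        # Every entity that appears as a subject or an object.
        return {r.subject for r in self.relations} | {r.obj for r in self.relations}

    def active_at(self, t: float) -> list[Relation]:
        # Relations whose interval contains time t (inclusive on both ends).
        return [r for r in self.relations if r.start_s <= t <= r.end_s]


# Illustrative entries; these are not taken from the paper's dataset.
graph = SceneGraph()
graph.add("nurse", "holds", "syringe", 2.0, 6.5)
graph.add("patient", "lies_on", "bed", 0.0, 10.0)

for rel in graph.active_at(3.0):
    print(rel.subject, rel.predicate, rel.obj)
```

Keeping a time interval on each edge is one simple way to support temporal queries such as "what was happening at second 3?"; other designs (per-frame graphs, event nodes) would serve the same purpose.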

Key results

70.5% accuracy on the Video-MME benchmark, outperforming similarly sized open-weight models
78.8% accuracy on a custom clinical dataset with strong temporal and action recognition
Generation of interpretable scene graphs bridging raw video perception and symbolic robotic planning
Competitive performance against larger proprietary models using only 3B parameters
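
To make the idea of bridging perception and symbolic planning concrete, the sketch below renders scene-graph triples as PDDL-style ground facts. This is an assumed illustration, not the authors' pipeline: the function name `to_planner_facts`, the fact syntax, and the example triples are hypothetical.

```python
def to_planner_facts(triples):
    """Render (subject, predicate, object) triples as PDDL-style ground facts.

    Duplicate triples (for example, the same relation detected in several
    frames) collapse to a single fact; the result is sorted for stable output.
    """
    facts = {f"({pred} {subj} {obj})" for subj, pred, obj in triples}
    return sorted(facts)


# Illustrative triples; not taken from the paper.
triples = [
    ("nurse", "holds", "syringe"),
    ("patient", "lies_on", "bed"),
    ("nurse", "holds", "syringe"),  # repeated detection from a later frame
]

facts = to_planner_facts(triples)
print(facts)
```

A planner could consume such facts as its initial state, which is one way the structured output stays inspectable by a human reviewer.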

Why it matters

Provides a resource-efficient, transparent reasoning pipeline essential for deploying safe and auditable multimodal AI in robot-assisted surgery and clinical monitoring.

Abstract

Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech–vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.

Index terms

AI-Based Methods, Computer Vision for Medical Robotics, Medical Robots and Systems