ICRA 2026

Robust Unknown Object Detection and Tracking for Vision-Language-Action Models on Edge Devices

Subin Joo, DEOKGI JEUNG

AI summary

A modular stepwise Vision-Language-Action framework achieves robust, real-time detection and tracking of unknown objects on resource-constrained edge devices.
Vision-Language-Action, Edge Computing, Unknown Object Detection, Real-time Tracking, MobileSAM, Modular Robotics

Problem

End-to-end VLA models demand excessive memory and operate as opaque black boxes, hindering debugging and real-time deployment on edge hardware for robotics.

Approach

The framework decomposes perception into a two-phase pipeline: depth-guided Gaussian sampling with MobileSAM for initializing unknown objects, followed by ORB feature matching and ROI-based localization for continuous 6-DOF tracking.
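The initialization phase can be illustrated with a minimal sketch of depth-guided Gaussian sampling. This is one plausible reading of the idea, not the authors' implementation: candidate point prompts are drawn from a 2D Gaussian around a chosen pixel and kept only if their depth falls inside a working range. The function name, the `depth_range` parameter, and the synthetic depth map are all assumptions made for illustration; in a real pipeline the surviving points would be passed to MobileSAM as point prompts.

```python
import random


def sample_prompt_points(depth, center, sigma, n_points, depth_range,
                         seed=0, max_tries=10_000):
    """Draw up to n_points (x, y) pixel prompts from a 2D Gaussian around
    `center`, keeping only pixels whose depth lies inside depth_range.

    Hypothetical helper: names and parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    height, width = len(depth), len(depth[0])
    near, far = depth_range
    points = []
    for _ in range(max_tries):
        if len(points) == n_points:
            break
        x = int(round(rng.gauss(center[0], sigma)))
        y = int(round(rng.gauss(center[1], sigma)))
        if not (0 <= x < width and 0 <= y < height):
            continue  # sample fell outside the image
        if near <= depth[y][x] <= far:
            points.append((x, y))  # depth is plausible for a foreground object
    return points


# Synthetic 40x40 depth map (metres): background at 2.0 m, one object at 0.6 m.
depth = [[2.0] * 40 for _ in range(40)]
for y in range(10, 30):
    for x in range(10, 30):
        depth[y][x] = 0.6

points = sample_prompt_points(depth, center=(20, 20), sigma=8.0,
                              n_points=16, depth_range=(0.3, 1.0))
```

Rejecting samples by depth keeps the prompts on nearby surfaces, so the segmenter is less likely to be prompted on the background.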

Key results

  • >99% detection success for unlearned objects
  • ~0.5–1.3 s initialization for up to 10 objects
  • >30 FPS real-time 6-DOF pose tracking
  • Seamless integration with a lightweight LLM for language-guided selection

Why it matters

Provides a practical, debuggable alternative for deploying foundation models on edge devices in safety-critical industrial and surgical robotics.

Abstract

This study proposes a Stepwise Vision-Language-Action (VLA) framework for the robust detection and tracking of unknown objects on an edge device (NVIDIA Jetson AGX Orin). Conventional end-to-end VLA models face challenges such as massive memory requirements and a "black-box" nature that complicates debugging. To address these issues, we adopt a modular architecture whose vision module integrates Depth-Guided Gaussian Sampling with MobileSAM. This approach achieves over 99% detection success for unlearned objects. Furthermore, once the initialization phase is complete, we demonstrate real-time 6-DOF pose tracking at over 30 FPS through ORB feature matching and ROI-based localization.
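The tracking phase can be sketched as binary-descriptor matching restricted to a region of interest. The example below is a stdlib-only stand-in under stated assumptions, not the paper's code: descriptors are modelled as 256-bit integers (the size of an ORB descriptor) compared by Hamming distance, and the helper names, the `max_distance` threshold, and the synthetic keypoints are hypothetical. A real system would more likely use OpenCV's `cv2.ORB_create` with a `cv2.BFMatcher(cv2.NORM_HAMMING)` matcher, and would go on to estimate the 6-DOF pose from the matches, which is not shown here.

```python
import random


def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_in_roi(ref_desc, frame_kps, frame_desc, roi, max_distance=64):
    """Match each reference descriptor to its nearest frame descriptor,
    considering only frame keypoints inside roi = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = roi
    candidates = [i for i, (x, y) in enumerate(frame_kps)
                  if x0 <= x <= x1 and y0 <= y <= y1]
    matches = []
    for r_idx, r in enumerate(ref_desc):
        best = min(candidates, key=lambda i: hamming(r, frame_desc[i]),
                   default=None)
        if best is not None and hamming(r, frame_desc[best]) <= max_distance:
            matches.append((r_idx, best))
    return matches


def updated_roi(frame_kps, matches, margin):
    """Bounding box of matched keypoints, padded by margin pixels.
    Returns None when nothing matched (the caller could re-initialize)."""
    if not matches:
        return None
    xs = [frame_kps[i][0] for _, i in matches]
    ys = [frame_kps[i][1] for _, i in matches]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)


rng = random.Random(0)
ref_desc = [rng.getrandbits(256) for _ in range(5)]

# Frame: slightly noisy copies of the reference descriptors (3 bits flipped)
# inside the ROI, an exact copy outside the ROI, and unrelated clutter inside.
frame_kps = [(100 + 10 * i, 120 + 5 * i) for i in range(5)]
frame_desc = [d ^ 0b111 for d in ref_desc]
frame_kps.append((400, 300))          # index 5: outside the ROI
frame_desc.append(ref_desc[0])
for kp in [(105, 150), (135, 110), (150, 145)]:
    frame_kps.append(kp)              # indices 6-8: clutter inside the ROI
    frame_desc.append(rng.getrandbits(256))

roi = (90, 100, 160, 160)
matches = match_in_roi(ref_desc, frame_kps, frame_desc, roi)
new_roi = updated_roi(frame_kps, matches, margin=10)
```

Limiting the search to the ROI shrinks the candidate set per frame, which is one reason a tracker of this kind can be cheaper than re-detecting over the full image.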

Index terms

Humanoid Robot Systems, Vision-Based Navigation, Computer Vision for Manufacturing
