Robust Unknown Object Detection and Tracking for Vision-Language-Action Models on Edge Devices
Subin Joo, DEOKGI JEUNG
AI summary
Problem
End-to-end VLA models demand excessive memory and operate as opaque black boxes, hindering debugging and real-time deployment on edge hardware for robotics.
Approach
The framework decomposes perception into a two-phase pipeline: depth-guided Gaussian sampling with MobileSAM for initializing unknown objects, followed by ORB feature matching and ROI-based localization for continuous 6-DOF tracking.
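To make the initialization phase more concrete, here is a minimal sketch of how depth could guide the sampling of point prompts for a segmenter such as MobileSAM. The summary does not give the sampling details, so everything here is an assumption for illustration: the function name `sample_point_prompts`, the `max_depth` threshold, the choice to center the Gaussian on the centroid of near pixels, and the rejection of samples that fall on far pixels. The call into MobileSAM itself is left out.

```python
import random

def sample_point_prompts(depth, max_depth, n_points=8, sigma=2.0, seed=0):
    """Sample (row, col) point prompts around the centroid of near pixels.

    Hypothetical helper: `depth` is a 2D list of depth values in meters,
    where 0.0 marks an invalid reading.
    """
    rng = random.Random(seed)
    h, w = len(depth), len(depth[0])
    # Foreground candidates: valid (non-zero) depth closer than max_depth.
    near = {(r, c) for r in range(h) for c in range(w)
            if 0.0 < depth[r][c] < max_depth}
    if not near:
        return []
    # Assumption: center the Gaussian on the centroid of the near pixels.
    mu_r = sum(r for r, _ in near) / len(near)
    mu_c = sum(c for _, c in near) / len(near)
    prompts = []
    attempts = 0
    while len(prompts) < n_points and attempts < 100 * n_points:
        attempts += 1
        r = round(rng.gauss(mu_r, sigma))
        c = round(rng.gauss(mu_c, sigma))
        # Depth guidance: keep only samples that land on a near pixel.
        if (r, c) in near:
            prompts.append((r, c))
    return prompts

# Toy scene: background at 5 m with a 4x4 object at 1 m.
depth = [[5.0] * 10 for _ in range(10)]
for r in range(3, 7):
    for c in range(4, 8):
        depth[r][c] = 1.0
prompts = sample_point_prompts(depth, max_depth=2.0)
print(prompts)
```

In a full pipeline, each returned point would be passed to the segmenter as a foreground prompt; with several objects, the sampling would presumably run once per depth cluster rather than once per frame.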
Key results
- >99% detection success for unlearned objects
- ~0.5–1.3 s initialization for up to 10 objects
- >30 FPS real-time 6-DOF pose tracking
- Seamless integration with a lightweight LLM for language-guided selection
Why it matters
Provides a practical, debuggable alternative for deploying foundation models on edge devices in safety-critical industrial and surgical robotics.
Abstract
This study proposes a Stepwise Vision-Language-Action (VLA) framework for robust detection and tracking of unknown objects on edge devices (NVIDIA Jetson AGX Orin). Conventional end-to-end VLA models require large amounts of memory, and their "black-box" nature complicates debugging. To address these issues, we adopt a modular architecture whose vision module integrates Depth-Guided Gaussian Sampling with MobileSAM. This approach achieves a detection success rate above 99% for unlearned objects. Furthermore, after the initialization phase, we demonstrate real-time 6-DOF pose tracking at over 30 FPS using ORB feature matching and ROI-based localization.
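The tracking phase can be illustrated with a small, dependency-free sketch of its two ingredients: matching binary descriptors by Hamming distance (the metric normally used with ORB) and moving the region of interest to follow the matched keypoints. This is not the authors' implementation. The function names, the cross-check filter, the `max_dist` threshold, and the median-shift ROI update are assumptions, descriptors are modeled as plain Python integers, and the 6-DOF pose estimation step is omitted.

```python
from statistics import median

def hamming(a, b):
    """Number of differing bits between two integer-encoded descriptors."""
    return bin(a ^ b).count("1")

def match_descriptors(ref_desc, cur_desc, max_dist=64):
    """Cross-checked nearest-neighbor matching; returns (ref_idx, cur_idx) pairs."""
    if not ref_desc or not cur_desc:
        return []
    matches = []
    for i, d in enumerate(ref_desc):
        # Best candidate in the current frame for reference descriptor i.
        j = min(range(len(cur_desc)), key=lambda k: hamming(d, cur_desc[k]))
        # Cross-check: i must also be the best reference match for j.
        i_back = min(range(len(ref_desc)),
                     key=lambda k: hamming(ref_desc[k], cur_desc[j]))
        if i_back == i and hamming(d, cur_desc[j]) <= max_dist:
            matches.append((i, j))
    return matches

def update_roi(roi, ref_pts, cur_pts, matches):
    """Shift an (x, y, w, h) ROI by the median displacement of matched keypoints."""
    if not matches:
        return roi  # No evidence of motion; keep the previous ROI.
    dx = median(cur_pts[j][0] - ref_pts[i][0] for i, j in matches)
    dy = median(cur_pts[j][1] - ref_pts[i][1] for i, j in matches)
    x, y, w, h = roi
    return (x + dx, y + dy, w, h)

# Toy data: 8-bit descriptors and keypoints that moved by (+3, +2) pixels.
ref_desc = [0b11110000, 0b00001111, 0b10101010]
cur_desc = [0b00001110, 0b11110001, 0b10101010]
ref_pts = [(10, 10), (20, 10), (15, 20)]
cur_pts = [(23, 12), (13, 12), (18, 22)]
matches = match_descriptors(ref_desc, cur_desc)
print(matches, update_roi((5, 5, 20, 20), ref_pts, cur_pts, matches))
```

Restricting feature extraction to the updated ROI on the next frame is what keeps the per-frame cost low. On real images the descriptors would be 256-bit ORB descriptors produced by a vision library rather than hand-written integers.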