Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding
Haodi Liu, Xinhang Yang, Kunda Yan, Sen Cui, Zeyu Zhang, Changshui Zhang
AI summary
Problem
Current action recognition methods are limited to predefined labels, while vision-language models struggle with a trade-off between descriptive richness and factual accuracy, often hallucinating details crucial for safe human-robot interaction.
Approach
The authors propose Gold Points Sniper (GPS), a framework that trains lightweight VLMs to self-guide reasoning by extracting critical action details, validating them through selective self-questioning, and assessing factual consistency via semantic entailment classification.
Key results
- Some GPS-enhanced lightweight VLMs reach performance comparable to GPT-4o on the curated CAP-based dataset while maintaining superior factual accuracy
- Selective self-questioning reduces hallucination and reasoning errors
- Semantic entailment scoring reliably quantifies factual consistency
- New CAP-based instruction-tuning dataset curated for training and evaluation, with source code, training configurations, annotation prompts, and dataset details released
Why it matters
Enables domestic robots to safely and accurately interpret complex human behaviors using efficient, open-source models rather than relying on costly proprietary APIs.
Abstract
Robots operating in everyday environments must understand fine-grained human actions, intentions, and contextual cues from broad views where people occupy only small regions, a capability unmet by current systems. While open-vocabulary action recognition methods remain limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs, neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. We propose Gold Points Sniper (GPS), a novel framework that empowers lightweight VLMs with self-guided multimodal reasoning capabilities for fine-grained human action understanding. Our approach comprises three key modules: Gold Points Extractor trains VLMs to identify critical action-relevant details, Selective Socratic Questioner validates and refines these details through selective self-questioning, and Semantic Entailment Evaluator quantitatively assesses factual consistency using semantic entailment classification. Extensive experiments on our curated instruction-tuning dataset based on the CAP benchmark demonstrate that GPS-enhanced lightweight VLMs achieve substantial performance improvements, with some models reaching performance comparable to proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at https://github.com/Haodi-Liu/GPS-Gold-Point-Sniper.
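To make the three-module flow concrete, the following is a minimal sketch of how extraction, selective questioning, and entailment scoring might be chained. It is not the authors' implementation: every name here (run_pipeline, GPSResult, the toy stand-in functions) is hypothetical, the models are replaced by trivial string functions, and the consistency score (the fraction of details labeled as entailed) is an assumption rather than the metric defined in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GPSResult:
    gold_points: List[str]   # details after selective refinement
    labels: List[str]        # one entailment label per detail
    consistency: float       # assumed score: share of details labeled "entailment"


def run_pipeline(
    scene: str,
    reference: str,
    extract: Callable[[str], List[str]],
    needs_check: Callable[[str], bool],
    refine: Callable[[str, str], str],
    classify: Callable[[str, str], str],
) -> GPSResult:
    """Chain the three stages; each callable stands in for a model call."""
    # Stage 1: extract candidate action-relevant details ("gold points").
    points = extract(scene)
    # Stage 2: question and refine only the details flagged as uncertain.
    refined = [refine(scene, p) if needs_check(p) else p for p in points]
    # Stage 3: label each detail against a reference description.
    labels = [classify(reference, p) for p in refined]
    score = labels.count("entailment") / len(labels) if labels else 0.0
    return GPSResult(refined, labels, score)


# Toy stand-ins (hypothetical): plain string operations instead of a VLM
# and an entailment classifier.
def toy_extract(scene: str) -> List[str]:
    return [part.strip() for part in scene.split(";") if part.strip()]


def toy_needs_check(point: str) -> bool:
    return point.startswith("maybe ")


def toy_refine(scene: str, point: str) -> str:
    return point[len("maybe "):]


def toy_classify(reference: str, claim: str) -> str:
    return "entailment" if claim in reference else "contradiction"


result = run_pipeline(
    scene="a person lifts a cup; maybe the person drinks; the person waves",
    reference="a person lifts a cup and the person drinks",
    extract=toy_extract,
    needs_check=toy_needs_check,
    refine=toy_refine,
    classify=toy_classify,
)
print(result)
```

Passing the stages as callables keeps the sketch independent of any particular model: in a real system each stand-in would wrap a VLM or classifier call, and only the uncertain details would incur the extra questioning step.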