ICRA 2026

Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding

Haodi Liu, Xinhang Yang, Kunda Yan, Sen Cui, Zeyu Zhang, Changshui Zhang

AI summary

When equipped with a self-guided reasoning framework that extracts, validates, and fact-checks visual details, some lightweight vision-language models reach performance comparable to the proprietary GPT-4o on fine-grained action understanding.

vision-language models, fine-grained action understanding, self-guided reasoning, hallucination mitigation, domestic robotics, semantic entailment

Problem

Current action recognition methods are limited to predefined labels, while vision-language models struggle with a trade-off between descriptive richness and factual accuracy, often hallucinating details crucial for safe human-robot interaction.

Approach

The authors propose Gold Points Sniper (GPS), a framework that trains lightweight VLMs to self-guide reasoning by extracting critical action details, validating them through selective self-questioning, and assessing factual consistency via semantic entailment classification.
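The control flow of the first two steps can be illustrated with a minimal sketch. It is not the authors' implementation: the `GoldPoint` type, the `extract` and `verify` callables, the self-reported confidence, and the `question_threshold` rule for deciding which details to question are all assumptions made for illustration, and the toy functions stand in for real VLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldPoint:
    text: str          # one candidate action-relevant detail
    confidence: float  # hypothetical self-reported confidence in [0, 1]


def run_gps_sketch(
    extract: Callable[[str], List[GoldPoint]],
    verify: Callable[[str, str], bool],
    image_id: str,
    question_threshold: float = 0.8,  # assumed selection rule, not from the paper
) -> List[str]:
    """Keep confident details as-is; self-question only the uncertain ones."""
    kept = []
    for point in extract(image_id):
        if point.confidence >= question_threshold:
            kept.append(point.text)          # confident: no question asked
        elif verify(image_id, point.text):   # uncertain: ask, keep if confirmed
            kept.append(point.text)
    return kept


# Toy stand-ins for the VLM calls, used only to exercise the control flow.
def toy_extract(image_id: str) -> List[GoldPoint]:
    return [
        GoldPoint("person holds a kettle", 0.95),
        GoldPoint("kettle is steaming", 0.55),
        GoldPoint("person wears gloves", 0.40),
    ]


def toy_verify(image_id: str, detail: str) -> bool:
    return detail != "person wears gloves"


details = run_gps_sketch(toy_extract, toy_verify, "frame_001")
print(details)
```

With the toy inputs, the first detail passes without a question, the second is questioned and confirmed, and the third is questioned and dropped.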

Key results

Some GPS-enhanced lightweight VLMs reach performance comparable to GPT-4o on held-in benchmarks
Selective self-questioning reduces hallucination and reasoning errors
Semantic entailment scoring reliably quantifies factual consistency
New CAP-based instruction-tuning dataset released for training and evaluation
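One way to turn per-claim entailment labels into a factual-consistency summary is sketched below. The three-way label set (entailment, neutral, contradiction) follows the common natural language inference convention and is an assumption here, as is the simple per-label fraction; the paper's actual scoring may differ. The function name `consistency_summary` is hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Assumed label set, following the usual NLI convention.
LABELS = ("entailment", "neutral", "contradiction")


def consistency_summary(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of claims that received each entailment label."""
    if not labels:
        raise ValueError("need at least one labelled claim")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in LABELS}


# Four claims from one generated description, labelled against a reference.
summary = consistency_summary(
    ["entailment", "entailment", "contradiction", "neutral"]
)
print(summary)
```

Under this reading, a high entailment fraction indicates a grounded description, and a nonzero contradiction fraction flags hallucinated details.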

Why it matters

Enables domestic robots to safely and accurately interpret complex human behaviors using efficient, open-source models rather than relying on costly proprietary APIs.

Abstract

Robots operating in everyday environments must understand fine-grained human actions, intentions, and contextual cues from broad views where people occupy only small regions, a capability unmet by current systems. While open-vocabulary action recognition methods remain limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs, neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. We propose Gold Points Sniper (GPS), a novel framework that empowers lightweight VLMs with self-guided multimodal reasoning capabilities for fine-grained human action understanding. Our approach comprises three key modules: Gold Points Extractor trains VLMs to identify critical action-relevant details, Selective Socratic Questioner validates and refines these details through selective self-questioning, and Semantic Entailment Evaluator quantitatively assesses factual consistency using semantic entailment classification. Extensive experiments on our curated instruction-tuning dataset based on the CAP benchmark demonstrate that GPS-enhanced lightweight VLMs achieve substantial performance improvements, with some models reaching performance comparable to proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at https://github.com/Haodi-Liu/GPS-Gold-Point-Sniper.

Index terms

Domestic Robotics