Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models
Weicong Ni, Tianbao Jiang, Linlin Wang
AI summary
Problem
Vision-Language Models suffer from hallucinations and rigid reasoning strategies that fail to adapt to varying task complexities, hindering their safe deployment in robotic automation.
Approach
The authors introduce PStar, a training-free framework that quantifies question complexity using a Difficulty Feature Vector and dynamically retrieves optimal pseudocode reasoning paths via A* search and hybrid similarity matching.
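As a rough illustration of the retrieval step, the sketch below scores stored reasoning paths against a new question by blending two similarities. It is not the authors' implementation: the three-dimensional DFV, the cosine-plus-Jaccard blend, the weight `alpha`, the library entries, and all function names are assumptions made for this example, and the A* search that builds the paths is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(text_a, text_b):
    """Token-set overlap between two questions (a stand-in for text similarity)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def hybrid_score(query, entry, alpha=0.5):
    """Blend DFV similarity with text similarity; alpha is a hypothetical weight."""
    return (alpha * cosine(query["dfv"], entry["dfv"])
            + (1 - alpha) * jaccard(query["text"], entry["text"]))

def retrieve_path(query, library, alpha=0.5):
    """Return the stored pseudocode path whose entry best matches the query."""
    best = max(library, key=lambda entry: hybrid_score(query, entry, alpha))
    return best["path"]

# Hypothetical library: each entry pairs a seed question and its DFV with a path.
LIBRARY = [
    {"text": "is there a dog in the image",
     "dfv": [0.1, 0.2, 0.0],
     "path": ["detect_objects", "verify_presence", "answer"]},
    {"text": "how many people are left of the red car",
     "dfv": [0.8, 0.7, 0.9],
     "path": ["detect_objects", "locate_relations", "count", "answer"]},
]

query = {"text": "is there a cat in the image", "dfv": [0.15, 0.2, 0.05]}
print(retrieve_path(query, LIBRARY))
```

Here the simple yes/no query matches the first entry on both signals, so the shorter verification path is returned. A real system would presumably use learned embeddings in place of token overlap.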
Key results
- Reduces hallucination rates across multiple multimodal benchmarks
- Achieves state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar
- Outperforms GPT-4V and larger proprietary models without additional training
- Demonstrates high data efficiency using only 500 seed examples for path generation
Why it matters
Enables safer, more reliable deployment of vision-language models in real-world robotic systems by providing a lightweight, training-free mechanism to prevent catastrophic reasoning errors.
Abstract
Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We first design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.
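To make the idea of a pseudocode library built from abstract reasoning functions more concrete, here is a minimal sketch of one possible representation. Everything named in it is hypothetical and chosen for illustration: the function names (`DESCRIBE_SCENE`, `LOCATE`, `VERIFY`, `ANSWER`), the two difficulty levels, the mean-of-DFV threshold rule, and the prompt layout. The abstract does not specify any of these details.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReasoningFunction:
    """One abstract reasoning step: a name plus the instruction given to the VLM."""
    name: str
    instruction: str

# Hypothetical set of abstract reasoning functions.
FUNCTIONS = {
    "DESCRIBE_SCENE": ReasoningFunction(
        "DESCRIBE_SCENE", "List the objects that are clearly visible."),
    "LOCATE": ReasoningFunction(
        "LOCATE", "State where each relevant object is in the image."),
    "VERIFY": ReasoningFunction(
        "VERIFY", "Check every claim against the image and drop unsupported ones."),
    "ANSWER": ReasoningFunction(
        "ANSWER", "Give the final answer using only verified claims."),
}

# Hypothetical pseudocode library: one ordered path per difficulty level.
PSEUDOCODE_LIBRARY = {
    "easy": ["DESCRIBE_SCENE", "ANSWER"],
    "hard": ["DESCRIBE_SCENE", "LOCATE", "VERIFY", "ANSWER"],
}

def difficulty_level(dfv, threshold=0.5):
    """Map a DFV to a level using its mean; the rule and threshold are assumptions."""
    mean = sum(dfv) / len(dfv)
    return "hard" if mean >= threshold else "easy"

def render_prompt(question, dfv):
    """Turn the selected pseudocode path into a numbered, step-by-step prompt."""
    path = PSEUDOCODE_LIBRARY[difficulty_level(dfv)]
    lines = [f"Question: {question}", "Follow these steps in order:"]
    for step, fn_name in enumerate(path, start=1):
        fn = FUNCTIONS[fn_name]
        lines.append(f"{step}. {fn.name}(): {fn.instruction}")
    return "\n".join(lines)

print(render_prompt("How many cups are left of the laptop?", [0.8, 0.6, 0.7]))
```

Because the paths are plain data, harder questions can receive longer paths with an explicit verification step while easy ones stay short, which is one way to read the adaptive selection described above.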