TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
Yi Han, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Lu Sheng, Shanghang Zhang
AI summary
Problem
Current vision-language models lack computational precision for robotics, relying on qualitative spatial reasoning instead of leveraging metric cues from depth sensors and camera calibration.
Approach
TIGeR enables VLMs to detect geometric reasoning needs, generate executable code, and invoke external tools for exact calculations, trained through supervised and reinforcement fine-tuning on a dedicated dataset.
Key results
- Introduces TIGeR framework for tool-integrated geometric computation
- Releases TIGeR-300K dataset with 300K tool-invocation sequences
- Achieves SOTA on geometric benchmarks via two-stage SFT and RFT training
- Delivers centimeter-level precision in real-world robotic manipulation
Why it matters
Bridges the gap between perceptual AI and physical robotics by providing VLMs with the exact metric computation capabilities required for precise robotic control.
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation–oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks. See the project page at https://hany01rye.github.io/TIGeR/.
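To make "metric cues from depth sensors and camera calibration" concrete, the sketch below shows the kind of exact computation a model could delegate to a tool instead of estimating it perceptually: back-projecting two pixels to 3D with the standard pinhole camera model and measuring the distance between them. It is an illustration only, not TIGeR's actual tool interface. The function names `backproject_pixel` and `euclidean_distance` and the intrinsics values are hypothetical, and lens distortion is assumed to be already corrected.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth_m: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Assumes depth_m is the z-distance along the optical axis, in meters.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def euclidean_distance(p: Point3D, q: Point3D) -> float:
    """Straight-line distance between two 3D points, in the same units."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical intrinsics for a 640x480 camera (illustrative values only).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# Two pixels with depth readings in meters (made-up example inputs).
gripper = backproject_pixel(320.0, 240.0, 1.0, FX, FY, CX, CY)
target = backproject_pixel(920.0, 240.0, 2.0, FX, FY, CX, CY)

print(f"gripper: {gripper}, target: {target}")
print(f"distance: {euclidean_distance(gripper, target):.3f} m")
```

The point of the example is the division of labor the abstract describes: the model only has to identify which pixels and which operation are relevant, while the arithmetic itself is carried out exactly by code.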