ICRA 2026

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Yi Han, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Lu Sheng, Shanghang Zhang

AI summary

Transforming VLMs into precise geometric computers via code generation and external tool invocation achieves centimeter-level accuracy for real-world robotic manipulation.

Vision-Language Models, Geometric Reasoning, Tool-Integrated Reasoning, Robotics, Metric Computation, Reinforcement Fine-Tuning

Problem

Current vision-language models lack computational precision for robotics, relying on qualitative spatial reasoning instead of leveraging metric cues from depth sensors and camera calibration.

Approach

TIGeR enables VLMs to detect geometric reasoning needs, generate executable code, and invoke external tools for exact calculations, trained through supervised and reinforcement fine-tuning on a dedicated dataset.
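To make "exact calculations" concrete, here is a minimal sketch of the kind of metric computation a model could delegate to code: back-projecting two pixels to 3D with a pinhole camera model and measuring the distance between them. This is an illustration only, not the paper's tool set; the function names, the intrinsics, and the pixel and depth readings are all hypothetical.

```python
import math


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) with metric depth -> camera-frame point in meters."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def distance(p, q):
    """Euclidean distance between two 3D points, in meters."""
    return math.dist(p, q)


# Hypothetical intrinsics (focal lengths and principal point, in pixels).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# Hypothetical readings: two object centers, both 1.0 m from the camera.
cup = backproject(320, 240, 1.0, FX, FY, CX, CY)
plate = backproject(620, 240, 1.0, FX, FY, CX, CY)

gap_m = distance(cup, plate)
print(f"cup={cup}, plate={plate}, gap={gap_m:.3f} m")
```

The result follows from the depth reading and the calibration rather than from a visual estimate, which is the distinction the summary draws between qualitative and metric reasoning.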

Key results

Introduces TIGeR framework for tool-integrated geometric computation
Releases TIGeR-300K dataset with 300K tool-invocation sequences
Achieves SOTA on geometric benchmarks via two-stage SFT and RFT training
Delivers centimeter-level precision in real-world robotic manipulation

Why it matters

Bridges the gap between perceptual AI and physical robotics by providing VLMs with the exact metric computation capabilities required for precise robotic control.

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation–oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks. See the project page at https://hany01rye.github.io/TIGeR/.
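As an illustration of the "point transformations" category named in the abstract, the sketch below maps a point from a camera frame into a robot base frame with a 4x4 homogeneous transform. It is a generic textbook computation under assumed values, not code from TIGeR or TIGeR-300K; the helper names, the yaw-only rotation, and the extrinsics are hypothetical.

```python
import math


def make_transform(yaw_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about the z-axis by yaw, then translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T, p):
    """Apply transform T to the 3D point p using homogeneous coordinates."""
    h = (p[0], p[1], p[2], 1.0)
    return tuple(sum(T[r][k] * h[k] for k in range(4)) for r in range(3))


# Hypothetical extrinsics: camera rotated 90 degrees about z and
# offset by (0.5, 0.0, 0.2) m relative to the robot base.
T_base_cam = make_transform(math.pi / 2, (0.5, 0.0, 0.2))

p_cam = (1.0, 0.0, 0.0)  # point observed in the camera frame, in meters
p_base = transform_point(T_base_cam, p_cam)
print(p_base)  # approximately (0.5, 1.0, 0.2), up to floating-point rounding
```

In practice the transform would come from camera calibration rather than hand-written constants.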

Index terms

Visual Learning, Deep Learning for Visual Perception