ICRA 2026

LLM-Guided Semantic Stereo Adaptive Visual Servoing for Precise Peg-In-Hole

Xiyue Dong, Guangli Sun, Jinfei Hu, Tianyu Huang, Wei Chen, Yunhui Liu

AI summary

Combining LLM semantic reasoning with adaptive stereo visual servoing enables high-precision, calibration-free peg-in-hole assembly without prior models or training.

Visual servoing, Large language models, Stereo vision, Peg-in-hole, Adaptive control, Calibration-free manipulation

Problem

Traditional visual servoing relies on precise camera calibration and manual feature engineering, while learning-based methods lack the sub-millimeter precision required for tight-tolerance, contact-rich assembly tasks.

Approach

An LLM semantically identifies and corresponds optimal feature points from uncalibrated stereo images using natural language task descriptions, driving a stereo adaptive visual servoing controller that estimates unknown camera parameters online.
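As a rough illustration of calibration-free servoing, the sketch below drives a single stereo-matched point to a goal in image space without using any camera parameters. It is not the paper's controller: the paper estimates unknown camera parameters online, whereas this sketch swaps in a Broyden rank-one update of the image Jacobian, a classical technique for uncalibrated visual servoing. The LLM feature-selection step is not modeled. The names `observe`, `probe_jacobian`, and `servo`, the simulated stereo rig, and all numeric values are assumptions made for the example.

```python
import numpy as np

def observe(p, f=800.0, baseline=0.1, c=(320.0, 240.0)):
    """Simulated stereo rig (hypothetical). The controller never reads f, baseline, or c."""
    x, y, z = p
    return np.array([
        f * x / z + c[0], f * y / z + c[1],               # left image (u, v)
        f * (x - baseline) / z + c[0], f * y / z + c[1],  # right image (u, v)
    ])

def probe_jacobian(p, step=1e-3):
    """Initial Jacobian estimate from three small exploratory motions (finite differences)."""
    s0 = observe(p)
    cols = [(observe(p + step * e) - s0) / step for e in np.eye(3)]
    return np.stack(cols, axis=1)  # shape (4, 3): 4 pixel coordinates, 3 position axes

def servo(p, s_goal, gain=0.5, tol=0.01, max_iters=200):
    """Move p until the stereo feature s matches s_goal, refining J from image data only."""
    J = probe_jacobian(p)
    s = observe(p)
    for _ in range(max_iters):
        err = s - s_goal
        if np.linalg.norm(err) < tol:
            break
        dp = -gain * np.linalg.pinv(J) @ err  # damped least-squares step in position
        p = p + dp
        s_new = observe(p)
        ds = s_new - s
        denom = dp @ dp
        if denom > 1e-18:
            # Broyden rank-one update: make J consistent with the motion just observed.
            J = J + np.outer(ds - J @ dp, dp) / denom
        s = s_new
    return p, np.linalg.norm(s - s_goal)

p_goal = np.array([0.0, 0.0, 0.5])
p_final, pixel_error = servo(np.array([0.05, -0.03, 0.6]), observe(p_goal))
print(f"final pixel error: {pixel_error:.4f}")
```

Stacking the left and right pixel coordinates into one feature vector is what makes depth observable here: the horizontal disparity between the two views changes with distance, so the 4×3 Jacobian has full column rank even though neither camera is calibrated.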

Key results

Average success rate above 90% across cylindrical, square, and hexagonal peg-in-hole tasks
Steady-state error of 1.8–2.8 pixels, close to calibrated methods (1.2–2.5 pixels)
No prior models, calibration, or task-specific training required
Sub-3 mm insertion tolerance achieved
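The summary does not say how the steady-state pixel error is computed. One common convention, assumed here, is to average the norm of the image-space error over the final samples of a servoing run. The function name `steady_state_error`, the window length, and the synthetic error trace are all hypothetical and are not data from the paper.

```python
import numpy as np

def steady_state_error(errors_px, window=20):
    """Mean pixel-error norm over the last `window` samples of a run (assumed definition)."""
    errors_px = np.asarray(errors_px, dtype=float)  # shape (T, 2): per-step (du, dv)
    norms = np.linalg.norm(errors_px, axis=1)
    return float(norms[-window:].mean())

# Synthetic trace: an exponentially decaying transient plus a constant 2-pixel residual.
t = np.arange(100)
trace = np.stack([100.0 * np.exp(-0.2 * t), np.full(t.shape, 2.0)], axis=1)
sse = steady_state_error(trace)
print(f"steady-state error: {sse:.3f} px")
```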

Why it matters

Enables flexible, high-precision robotic assembly in unstructured environments by eliminating the need for laborious calibration and manual feature design.

Abstract

Precision assembly tasks like peg-in-hole remain challenging for robotic manipulation. While visual servoing offers a robust framework, it depends heavily on accurate calibration and manual feature engineering. Learning-based methods, including vision-language models (VLMs), provide strong semantic understanding but often lack the precision needed for high-tolerance, contact-rich insertions. This paper introduces a novel framework that combines the semantic reasoning of large language models (LLMs) with adaptive visual servoing to bridge this gap. Our approach uses an LLM as a semantic feature extractor and correspondence engine for stereo visual servoing. The LLM processes generic point features from uncalibrated stereo images along with a task description in natural language, leveraging its spatial understanding to identify and correspond optimal features across views. These features drive a stereo adaptive visual servoing controller that estimates unknown calibration parameters online, enabling precise, calibration-free positioning. Extensive evaluations on cylindrical, square, and hexagonal peg-in-hole tasks across three trials demonstrate average success rates above 90% with steady-state errors of 1.8–2.8 pixels, closely comparable to calibrated methods (1.2–2.5 pixels). This is achieved without requiring prior models, calibration, or task-specific training, thereby advancing flexible and precise robotic assembly.

Index terms

Robust/Adaptive Control, Visual Servoing, AI-Enabled Robotics