LLM-Guided Semantic Stereo Adaptive Visual Servoing for Precise Peg-In-Hole
Xiyue Dong, Guangli Sun, Jinfei Hu, Tianyu Huang, Wei Chen, Yunhui Liu
AI summary
Problem
Traditional visual servoing relies on precise camera calibration and manual feature engineering, while learning-based methods lack the sub-millimeter precision required for tight-tolerance, contact-rich assembly tasks.
Approach
An LLM semantically identifies and corresponds optimal feature points from uncalibrated stereo images using natural language task descriptions, driving a stereo adaptive visual servoing controller that estimates unknown camera parameters online.
Key results
- Average success rates above 90% across cylindrical, square, and hexagonal peg-in-hole tasks
- 1.8–2.8 pixel steady-state error, closely comparable to calibrated methods (1.2–2.5 pixels)
- No prior models, calibration, or task-specific training required
- Sub-3 mm insertion tolerance achieved
Why it matters
Enables flexible, high-precision robotic assembly in unstructured environments by eliminating the need for laborious calibration and manual feature design.
Abstract
Precision assembly tasks like peg-in-hole remain challenging for robotic manipulation. While visual servoing offers a robust framework, it depends heavily on accurate calibration and manual feature engineering. Learning-based methods, including vision-language models (VLMs), provide strong semantic understanding but often lack the precision needed for high-tolerance, contact-rich insertions. This paper introduces a novel framework that combines the semantic reasoning of large language models (LLMs) with adaptive visual servoing to bridge this gap. Our approach uses an LLM as a semantic feature extractor and correspondence engine for stereo visual servoing. The LLM processes generic point features from uncalibrated stereo images along with a task description in natural language, leveraging its spatial understanding to identify and correspond optimal features across views. These features drive a stereo adaptive visual servoing controller that estimates unknown calibration parameters online, enabling precise, calibration-free positioning. Extensive evaluations on cylindrical, square, and hexagonal peg-in-hole tasks across three trials demonstrate average success rates above 90% with steady-state errors of 1.8–2.8 pixels, closely comparable to calibrated methods (1.2–2.5 pixels). This is achieved without requiring prior models, calibration, or task-specific training, thereby advancing flexible and precise robotic assembly.
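To illustrate the calibration-free idea, the following minimal Python sketch closes a visual servoing loop on a stacked stereo feature vector (left and right pixel coordinates of one point) while estimating the image Jacobian online. It is not the controller of this paper: it substitutes a generic Broyden rank-one update for the adaptive law, and the linear camera model, the matrix J_true, the gain, the 20% initial scale error, and the names broyden_update and run_demo are all hypothetical choices made for the example.

```python
import numpy as np


def broyden_update(J_hat, dq, ds):
    """Rank-one Broyden correction so that J_hat @ dq matches the observed ds."""
    denom = float(dq @ dq)
    if denom < 1e-12:
        # Motion too small to carry information; keep the current estimate.
        return J_hat
    return J_hat + np.outer(ds - J_hat @ dq, dq) / denom


def run_demo(iterations=60, gain=0.5):
    """Servo a stereo point feature to its target with an unknown camera model.

    Returns the feature error norm (pixels) before the first step and after
    each of the `iterations` steps.
    """
    # Hypothetical "true" map from end-effector position (m) to stereo pixel
    # features [u_left, v_left, u_right, v_right]. The controller never reads
    # it directly; it only sees the features returned by observe().
    J_true = np.array([
        [800.0,   0.0, -200.0],
        [  0.0, 800.0, -150.0],
        [800.0,   0.0, -260.0],
        [  0.0, 800.0, -150.0],
    ])
    offset = np.array([320.0, 240.0, 320.0, 240.0])

    def observe(q):
        return J_true @ q + offset

    q = np.zeros(3)                                   # current position
    s_target = observe(np.array([0.05, -0.03, 0.10])) # desired features
    # Assumption for this toy: the initial Jacobian guess is off by 20% in scale.
    J_hat = 1.2 * J_true

    s = observe(q)
    errors = [float(np.linalg.norm(s - s_target))]
    for _ in range(iterations):
        # Proportional control law on the image error using the estimate.
        dq = -gain * np.linalg.pinv(J_hat) @ (s - s_target)
        q = q + dq
        s_new = observe(q)
        # Refine the Jacobian estimate from the motion just executed.
        J_hat = broyden_update(J_hat, dq, s_new - s)
        s = s_new
        errors.append(float(np.linalg.norm(s - s_target)))
    return errors
```

Under these assumptions, run_demo() returns the feature error norm per iteration, starting at roughly 60 pixels and shrinking toward zero even though the controller never receives the true camera model. A real stereo camera is perspective rather than linear, so this toy only conveys the structure of online estimation inside the control loop.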