Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals Via Vision-Language Models
Chenrui Tie, Shengxiang Sun, Yudi Lin, Yanbo Wang, Zhongrui Li, Zhouhan Zhong, Jinxuan Zhu, Yiman Pang, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao
AI summary
Problem
Most robotic assembly methods treat physical connections as secondary or predefined, overlooking how to identify specific connector types, quantities, and placement constraints from instruction manuals. This gap limits real-world applicability where precise connector selection and execution are critical for structural integrity.
Approach
The framework uses a vision-language model to parse assembly manual diagrams and automatically generate a connection-enriched hierarchical graph encoding parts, sub-assemblies, and explicit connector relationships. Geometric constraint optimization then computes precise part poses from these connections to guide robotic execution.
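As an illustration of the representation described above, the following is a minimal sketch of a connection-enriched hierarchical graph. The class names, field names, and the toy stool are assumptions made for this example; they are not taken from the paper's code. Leaves are parts, inner nodes are sub-assemblies, and each sub-assembly stores the connections that join its children.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Connection:
    """Edge joining two components (field names are hypothetical)."""
    part_a: str
    part_b: str
    connector_type: str  # e.g. "screw" or "dowel"
    quantity: int


@dataclass
class Node:
    """A part (leaf) or a sub-assembly (has children plus the connections joining them)."""
    name: str
    children: list[Node] = field(default_factory=list)
    connections: list[Connection] = field(default_factory=list)

    def is_part(self) -> bool:
        return not self.children


def assembly_steps(node: Node) -> list[Connection]:
    """Post-order walk: a sub-assembly's connections come before its parent's."""
    steps: list[Connection] = []
    for child in node.children:
        steps.extend(assembly_steps(child))
    steps.extend(node.connections)
    return steps


def connector_totals(node: Node) -> dict[str, int]:
    """Total number of connectors of each type needed for the whole assembly."""
    totals: dict[str, int] = {}
    for conn in assembly_steps(node):
        totals[conn.connector_type] = totals.get(conn.connector_type, 0) + conn.quantity
    return totals


# Toy example (hypothetical): a stool made of a frame sub-assembly and a seat.
left_leg = Node("left_leg")
right_leg = Node("right_leg")
rail = Node("rail")
seat = Node("seat")
frame = Node(
    "frame",
    children=[left_leg, right_leg, rail],
    connections=[
        Connection("left_leg", "rail", "dowel", 2),
        Connection("right_leg", "rail", "dowel", 2),
    ],
)
stool = Node(
    "stool",
    children=[frame, seat],
    connections=[Connection("frame", "seat", "screw", 4)],
)

for step in assembly_steps(stool):
    print(step.part_a, "+", step.part_b, "via", step.quantity, step.connector_type)
```

Storing connections on the sub-assembly that they form keeps the step order recoverable from a single tree traversal, which is one plausible reading of "hierarchical" here.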
Key results
- Curated dataset of 20+ complex assembly tasks with fine-grained 3D models and connector annotations
- Developed Isaac Lab simulation benchmark with four long-horizon tasks featuring diverse connector mechanics
- Achieved millimeter-level pose alignment accuracy by directly computing relative poses via connection constraints
- Validated end-to-end pipeline across furniture, toy, and manufacturing assembly scenarios
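To make the idea of computing relative poses from connection constraints concrete, here is a simplified sketch. It assumes a single connection that fully constrains the two parts, so the pose has a closed form; the framework itself is described as using geometric constraint optimization, which this example does not reproduce. The function names and the numeric frames are hypothetical. If the connector feature has pose `T_A_c` in part A's frame and the mating feature has pose `T_B_c` in part B's frame, then requiring the two features to coincide gives `T_A_B = T_A_c * inverse(T_B_c)`.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R p; 0 1] as [R^T, -R^T p; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_yaw(yaw_deg, x, y, z):
    """Build a transform with a rotation about z and a translation (metres)."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def relative_pose(t_a_conn, t_b_conn):
    """Pose of part B in part A's frame when the two connector frames coincide."""
    return matmul(t_a_conn, invert_rigid(t_b_conn))


# Hypothetical frames: a hole on part A and a peg on part B, each in its own part frame.
hole_in_a = pose_yaw(0.0, 0.10, 0.05, 0.02)
peg_in_b = pose_yaw(90.0, 0.0, 0.0, -0.03)

t_a_b = relative_pose(hole_in_a, peg_in_b)
print([round(t_a_b[i][3], 4) for i in range(3)])  # translation of B in A's frame
```

With several connections between the same pair of parts, the constraints generally have to be satisfied jointly, which is where an optimization step would replace this closed-form solution.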
Why it matters
Enables robots to reliably execute complex real-world assembly tasks by explicitly modeling physical connector constraints directly from human instruction manuals.
Abstract
Assembly hinges on reliably forming connections between parts; yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the foundational physical constraints of assembly execution; while task planning sequences operations, the precise establishment of these constraints ultimately determines assembly success. In this paper, we treat connections as explicit, primary entities in assembly representation, directly encoding connector types, specifications, and locations for every assembly step. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence. More detailed information can be found at https://nus-lins-lab.github.io/Manual2SkillPP/
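The step where a vision-language model instantiates the graph from a manual page can be pictured as turning structured model output into validated connection edges. The sketch below assumes the model is prompted to return one JSON record per manual step; the JSON schema, the connector vocabulary, and all names are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical connector vocabulary used only for this example.
ALLOWED_TYPES = {"screw", "dowel", "snap_fit"}


@dataclass(frozen=True)
class ConnectionEdge:
    step: int
    part_a: str
    part_b: str
    connector_type: str
    quantity: int


def parse_step(raw: str, known_parts: set[str]) -> list[ConnectionEdge]:
    """Parse one step record and reject edges that reference unknown parts or connectors."""
    record = json.loads(raw)
    edges = []
    for item in record["connections"]:
        part_a, part_b = item["parts"]
        if part_a not in known_parts or part_b not in known_parts:
            raise ValueError(f"unknown part in {item['parts']}")
        if item["connector"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown connector type {item['connector']!r}")
        quantity = int(item["quantity"])
        if quantity < 1:
            raise ValueError("quantity must be at least 1")
        edges.append(
            ConnectionEdge(record["step"], part_a, part_b, item["connector"], quantity)
        )
    return edges


# Example record in the assumed schema.
sample = """
{
  "step": 3,
  "connections": [
    {"parts": ["side_panel", "shelf"], "connector": "dowel", "quantity": 2},
    {"parts": ["side_panel", "back_panel"], "connector": "screw", "quantity": 4}
  ]
}
"""

for edge in parse_step(sample, {"side_panel", "shelf", "back_panel"}):
    print(edge)
```

Validating against the known part list before building edges is a design choice for this sketch: it keeps a misread diagram from silently adding nodes that do not exist in the assembly.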