ICRA 2026

Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals Via Vision-Language Models

Chenrui Tie, Shengxiang Sun, Yudi Lin, Yanbo Wang, Zhongrui Li, Zhouhan Zhong, Jinxuan Zhu, Yiman Pang, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao

AI summary

Manual2Skill++ automatically extracts connector details from instruction manuals using vision-language models, enabling precise, connection-aware robotic assembly through hierarchical graph representations.

Robotic assembly, Vision-language models, Connector-aware planning, Hierarchical graph representation, Instruction manuals, Geometric pose alignment

Problem

Most robotic assembly methods treat physical connections as secondary or predefined, overlooking how to identify specific connector types, quantities, and placement constraints from instruction manuals. This gap limits real-world applicability where precise connector selection and execution are critical for structural integrity.

Approach

The framework uses a vision-language model to parse assembly manual diagrams and automatically generate a connection-enriched hierarchical graph encoding parts, sub-assemblies, and explicit connector relationships. Geometric constraint optimization then computes precise part poses from these connections to guide robotic execution.
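
As a rough illustration of the representation described above, the following Python sketch models a hierarchical graph whose nodes are parts or sub-assemblies and whose edges carry explicit connector information. The class names (`Node`, `Connection`), their fields, the helper `connector_totals`, and the stool example are all hypothetical and chosen for illustration; they are not taken from the paper, and the actual data format produced by the vision-language model may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Connection:
    """Hypothetical edge: two components joined by a given connector."""
    part_a: str
    part_b: str
    connector_type: str  # e.g. "screw", "dowel" (illustrative labels)
    quantity: int


@dataclass
class Node:
    """Hypothetical node: a part (no children) or a sub-assembly."""
    name: str
    children: List["Node"] = field(default_factory=list)
    connections: List[Connection] = field(default_factory=list)

    def is_part(self) -> bool:
        return not self.children

    def leaf_parts(self) -> List[str]:
        """Names of all atomic parts below this node, in depth-first order."""
        if self.is_part():
            return [self.name]
        parts: List[str] = []
        for child in self.children:
            parts.extend(child.leaf_parts())
        return parts


def connector_totals(node: Node) -> Dict[str, int]:
    """Count connectors of each type needed for this node and its sub-assemblies."""
    totals: Dict[str, int] = {}
    for conn in node.connections:
        totals[conn.connector_type] = totals.get(conn.connector_type, 0) + conn.quantity
    for child in node.children:
        for kind, count in connector_totals(child).items():
            totals[kind] = totals.get(kind, 0) + count
    return totals


# Illustrative example (not from the paper's dataset): a simple stool.
frame = Node(
    "frame",
    children=[Node("leg_left"), Node("leg_right"), Node("crossbar")],
    connections=[
        Connection("leg_left", "crossbar", "dowel", 2),
        Connection("leg_right", "crossbar", "dowel", 2),
    ],
)
stool = Node(
    "stool",
    children=[frame, Node("seat")],
    connections=[Connection("frame", "seat", "screw", 4)],
)

print(stool.leaf_parts())
print(connector_totals(stool))
```

Storing connections on the sub-assembly node that introduces them keeps each assembly step self-describing: a planner can read off which connectors, and how many, a given step requires.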

Key results

Curated a dataset of 20+ complex assembly tasks with fine-grained 3D models and connector annotations
Developed an Isaac Lab simulation benchmark with four long-horizon tasks featuring diverse connector mechanics
Achieved millimeter-level pose alignment accuracy by directly computing relative poses via connection constraints
Validated end-to-end pipeline across furniture, toy, and manufacturing assembly scenarios
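
To make the idea of computing relative poses from connection constraints concrete, here is a minimal Python sketch for the simplest case: a single connection whose two connector frames must coincide once assembled. This is a simplifying assumption made for illustration, not the paper's geometric constraint optimization, which is not detailed here. The function names, the frame conventions, and the numeric values are hypothetical. If the hole frame is known in part A's coordinates and the peg frame in part B's coordinates, the pose of B in A's frame is T_A_B = T_A_hole * inverse(T_B_peg).

```python
import math


def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Invert a rigid transform: rotation R^T, translation -R^T p."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    p_inv = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def transform(rot_z_deg, translation):
    """Build a transform from a rotation about z (degrees) and a translation."""
    c = math.cos(math.radians(rot_z_deg))
    s = math.sin(math.radians(rot_z_deg))
    x, y, z = translation
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def relative_pose(hole_in_a, peg_in_b):
    """Pose of part B in part A's frame, assuming the two connector frames coincide."""
    return matmul(hole_in_a, rigid_inverse(peg_in_b))


# Hypothetical connector frames (meters); values are illustrative only.
hole_in_a = transform(0.0, (0.10, 0.05, 0.02))
peg_in_b = transform(90.0, (0.0, 0.03, 0.0))

t_a_b = relative_pose(hole_in_a, peg_in_b)

# Sanity check: mapping the peg frame through T_A_B should land on the hole frame.
reconstructed = matmul(t_a_b, peg_in_b)
max_error = max(abs(reconstructed[i][j] - hole_in_a[i][j])
                for i in range(4) for j in range(4))
print(max_error < 1e-9)
```

With several connections between the same pair of parts, or with connectors that leave some degrees of freedom open (for example, rotation about a dowel axis), a closed-form solution like this no longer suffices and an optimization over the combined constraints would be needed.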

Why it matters

Enables robots to reliably execute complex real-world assembly tasks by explicitly modeling physical connector constraints directly from human instruction manuals.

Abstract

Assembly hinges on reliably forming connections between parts; yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the foundational physical constraints of assembly execution; while task planning sequences operations, the precise establishment of these constraints ultimately determines assembly success. In this paper, we treat connections as explicit, primary entities in assembly representation, directly encoding connector types, specifications, and locations for every assembly step. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence. More detailed information can be found at https://nus-lins-lab.github.io/Manual2SkillPP/

Index terms

Manipulation Planning, AI-Enabled Robotics, Learning Categories and Concepts