ICRA 2026

AssemMate: Graph-Based LLM for Robotic Assembly Assistance

Qi Zheng, Chaoran Zhang, Zijian Liang, Ente Lin, Shubo Cui, Qinghongbing Xie, Zhaobo Xu, Long Zeng

AI summary

AssemMate efficiently injects graph-structured assembly knowledge into LLMs, enabling faster, more accurate human-robot interaction and grasping than text-based approaches.

Graph-based LLM, Robotic Assembly, Knowledge Graph QA, Vision-Enhanced Grasping, Human-Robot Interaction, Domain Knowledge Injection

Problem

Existing LLM-based robotic assembly assistants rely on natural language text for domain knowledge, which creates long contexts, redundancy, and slow reasoning that hinder real-time robotic control.

Approach

The method uses a self-supervised Graph Convolutional Network to encode assembly knowledge graphs into embeddings that align with an LLM, enabling efficient knowledge graph question answering and vision-enhanced grasp execution for cluttered scenes.
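For illustration only, here is a minimal sketch of the encoding step, assuming a standard two-layer GCN (Kipf and Welling's symmetric normalization, ReLU(D^-1/2 (A + I) D^-1/2 H W)) followed by a linear projection into an LLM's hidden size. The page does not describe AssemMate's exact architecture, its self-supervised objective, or how alignment is trained, so the weights below are random placeholders rather than trained parameters. Turning each relation into its own node between head and tail (a Levi-graph transformation) is an assumption made so that a plain GCN can see relations; all function names, part names, and dimensions (`in_dim`, `hid_dim`, `llm_dim`) are hypothetical.

```python
import math
import random

def rand_matrix(rows, cols, seed):
    # Deterministic pseudo-random weights (stand-ins for trained parameters).
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    # Plain (n x k) @ (k x m) matrix product on nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def levi_graph(triples):
    """Turn (head, relation, tail) triples into an undirected graph where each
    relation occurrence becomes its own node: head - relation - tail."""
    index, labels, adj = {}, [], []

    def add_node(key, label):
        if key not in index:
            index[key] = len(labels)
            labels.append(label)
            adj.append(set())
        return index[key]

    for i, (h, r, t) in enumerate(triples):
        hi = add_node(h, h)
        ri = add_node((r, i), r)  # one relation node per triple, labelled by relation type
        ti = add_node(t, t)
        for a, b in ((hi, ri), (ri, ti)):
            adj[a].add(b)
            adj[b].add(a)
    return index, labels, adj

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    deg = [len(neigh) + 1 for neigh in adj]  # +1 accounts for the self-loop
    agg = []
    for i, neigh in enumerate(adj):
        row = [x / deg[i] for x in feats[i]]  # self-loop term: 1 / sqrt(d_i * d_i)
        for j in neigh:
            c = 1.0 / math.sqrt(deg[i] * deg[j])
            row = [r + c * x for r, x in zip(row, feats[j])]
        agg.append(row)
    return [[max(0.0, v) for v in row] for row in matmul(agg, weight)]

def encode_graph(triples, in_dim=8, hid_dim=16, llm_dim=32):
    index, labels, adj = levi_graph(triples)
    # Initial features: a fixed random vector per label, so all nodes of the
    # same relation type start from the same vector.
    feats = [rand_matrix(1, in_dim, seed=label)[0] for label in labels]
    h = gcn_layer(adj, feats, rand_matrix(in_dim, hid_dim, seed=1))
    h = gcn_layer(adj, h, rand_matrix(hid_dim, hid_dim, seed=2))
    # Linear projection into an assumed LLM hidden size: one vector per node.
    tokens = matmul(h, rand_matrix(hid_dim, llm_dim, seed=3))
    return index, tokens

# Usage with a toy (hypothetical) assembly graph.
triples = [("base_plate", "connects_to", "shaft"),
           ("shaft", "connects_to", "gear"),
           ("gear", "fastened_by", "nut")]
index, tokens = encode_graph(triples)
print(len(tokens), len(tokens[0]))  # 7 32
```

The three triples yield four entity nodes and three relation nodes, hence seven 32-dimensional vectors. In a trained system these vectors would be optimized to live in the same space as the LLM's token embeddings; here they only show the data flow and shapes, and they are far more compact than a textual description of the same graph.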

Key results

6.4% higher QA accuracy and 3× faster inference than text-based baselines
28× shorter context length with strong generalization on unseen graphs
71.2% optimal planning rate in simulation and 64.3% in real-world grasping
Accurate single-hop (82.1%) and multi-hop (66.7% nLCS) assembly planning
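The multi-hop result is reported as nLCS, which presumably denotes a normalized longest-common-subsequence score between a predicted and a reference assembly sequence. The page does not state the normalization, so this sketch assumes division by the length of the longer sequence, which penalizes both missing and extra steps. The function names and part names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs(predicted, reference):
    """Normalized LCS in [0, 1]; normalizing by the longer sequence is an assumption."""
    denom = max(len(predicted), len(reference))
    return lcs_length(predicted, reference) / denom if denom else 1.0

reference = ["base", "shaft", "gear", "nut"]
predicted = ["base", "gear", "shaft", "nut"]  # two steps swapped
print(nlcs(predicted, reference))  # 0.75
```

Swapping two adjacent steps keeps three of the four parts in a common order, so the score drops to 0.75 rather than to zero, which is why a subsequence-based metric suits partially correct multi-step plans better than exact-match accuracy.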

Why it matters

It enables real-time, precise human-robot collaboration in industrial assembly by replacing inefficient text prompts with compact, structured graph knowledge.

Abstract

Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention. It requires the injection of domain-specific knowledge to guide the assembly process through natural language interaction with humans. Despite some progress, existing methods represent this knowledge as natural language text. Because of the long context and redundant content, they struggle to meet robots' requirements for real-time, precise reasoning. To bridge this gap, we present a novel graph-based LLM, denoted AssemMate, which consists of two stages: graph-based question answering and vision-enhanced grasp execution. The first stage enables natural language question answering on a knowledge graph, supporting human-robot interaction and assembly task planning for specific products. The second stage then uses the plan generated in the first stage as its target, perceives stacked scenes, and executes grasps to assist with assembly. Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with the LLM's representation, enabling the LLM to understand graph information. In addition, a vision-enhanced strategy is employed to handle stacked scenes during grasping. Through training and evaluation, AssemMate outperforms existing methods, achieving 6.4% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs. Robotic grasping experiments in both simulated and real-world settings further demonstrate the superiority of our approach. More details can be found on the project page https://github.com/cristina304/AssemMate.git.
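To make the difference between single-hop question answering and multi-hop assembly planning concrete, here is a purely symbolic illustration. It is not AssemMate's method: AssemMate answers such questions with an LLM conditioned on graph embeddings, whereas this sketch queries a toy knowledge graph directly. It assumes precedence knowledge is stored as hypothetical (part, "precedes", part) triples; a single-hop question reads one edge, while a multi-hop planning question chains edges into a full order, computed here with Kahn's topological sort.

```python
def single_hop(triples, head, relation):
    """Answer a one-edge question, e.g. 'What may be installed right after X?'"""
    return sorted(t for h, r, t in triples if h == head and r == relation)

def assembly_order(triples, relation="precedes"):
    """Multi-hop planning: chain precedence edges into a full assembly sequence
    with Kahn's topological sort (ties broken alphabetically for determinism)."""
    succ, indeg = {}, {}
    for h, r, t in triples:
        if r != relation:
            continue  # ignore non-precedence facts such as materials
        for n in (h, t):
            succ.setdefault(n, set())
            indeg.setdefault(n, 0)
        if t not in succ[h]:
            succ[h].add(t)
            indeg[t] += 1
    ready = sorted(n for n, d in indeg.items() if d == 0)
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for m in sorted(succ[n]):
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
        ready.sort()
    if len(order) != len(indeg):
        raise ValueError("precedence relations contain a cycle")
    return order

kg = [("base_plate", "precedes", "shaft"),
      ("shaft", "precedes", "gear"),
      ("base_plate", "precedes", "bracket"),
      ("gear", "precedes", "nut"),
      ("nut", "made_of", "steel")]
print(single_hop(kg, "base_plate", "precedes"))  # ['bracket', 'shaft']
print(assembly_order(kg))  # ['base_plate', 'bracket', 'shaft', 'gear', 'nut']
```

The single-hop answer needs one lookup, whereas the full plan depends on every precedence edge in the product's graph; this is the kind of chained reasoning that multi-hop questions probe.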

Index terms

Human-Centered Robotics, AI-Enabled Robotics, Assembly