AssemMate: Graph-Based LLM for Robotic Assembly Assistance
Qi Zheng, Chaoran Zhang, Zijian Liang, Ente Lin, Shubo Cui, Qinghongbing Xie, Zhaobo Xu, Long Zeng
AI summary
Problem
Existing LLM-based robotic assembly assistants rely on natural language text for domain knowledge, which creates long contexts, redundancy, and slow reasoning that hinder real-time robotic control.
Approach
The method uses a self-supervised Graph Convolutional Network to encode assembly knowledge graphs into embeddings that align with an LLM, enabling efficient knowledge graph question answering and vision-enhanced grasp execution for cluttered scenes.
Key results
- 6.4% higher QA accuracy and 3× faster inference than text-based baselines
- 28× shorter context length with strong generalization on unseen graphs
- 71.2% optimal planning rate in simulation and 64.3% in real-world grasping
- Accurate single-hop (82.1%) and multi-hop (66.7% nLCS) assembly planning
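The nLCS figure above is not defined in this summary. As a rough illustration only, the sketch below assumes it means the longest common subsequence (LCS) between a predicted assembly sequence and a reference sequence, divided by the length of the longer one. The function names, the normalization choice, and the part names are all assumptions made for this example, not the paper's definition.

```python
def lcs_length(pred, ref):
    """Length of the longest common subsequence of two step sequences."""
    prev = [0] * (len(ref) + 1)
    for p in pred:
        curr = [0]
        for j, r in enumerate(ref, start=1):
            if p == r:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(pred, ref):
    """LCS normalized by the longer sequence (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return lcs_length(pred, ref) / longest if longest else 1.0


# Hypothetical multi-hop plan: two middle steps are swapped in the prediction.
reference = ["base", "shaft", "gear", "cover"]
predicted = ["base", "gear", "shaft", "cover"]
score = nlcs(predicted, reference)
print(score)  # 0.75
```

Under this assumed definition, a plan with the right parts in partly the wrong order still earns partial credit, which is why such a score suits multi-step planning better than exact match.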
Why it matters
It enables real-time, precise human-robot collaboration in industrial assembly by replacing inefficient text prompts with compact, structured graph knowledge.
Abstract
Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention. It requires injecting domain-specific knowledge to guide the assembly process through natural language interaction with humans. Despite some progress, existing methods represent knowledge as natural language text. Because of the resulting long context and redundant content, they struggle to meet robots' requirements for real-time and precise reasoning. To bridge this gap, we present a novel graph-based LLM, denoted AssemMate, which consists of two stages: graph-based question answering and vision-enhanced grasp execution. The first stage enables natural language question answering on a knowledge graph, supporting human-robot interaction and assembly task planning for specific products. The second stage takes the plan generated in the first stage as its target, senses stacked scenes, and executes grasping to assist with assembly. Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with the LLM's representation, enabling the LLM to understand graph information. In addition, a vision-enhanced strategy is employed to handle stacked scenes in grasping. In training and evaluation, AssemMate outperforms existing methods, achieving 6.4% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs. Robotic grasping experiments in both simulated and real-world settings further demonstrate the superiority of our approach. More details can be found on the project page: https://github.com/cristina304/AssemMate.git.
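To make the graph-encoding idea concrete, the following is a minimal sketch of one standard graph convolution layer followed by a linear projection into an LLM-sized embedding space. It is not the AssemMate implementation: the toy part graph, the dimensions, the random weights, and all function names are assumptions for illustration. It also treats the graph as undirected and ignores relation types and the self-supervised training objective, both of which the abstract mentions but does not detail.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def gcn_layer(a_hat, features, weight):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    hidden = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, x) for x in row] for row in hidden]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (random here, trained in practice)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy assembly graph: parts as nodes, connections as edges.
parts = ["base", "shaft", "gear", "cover"]
edges = [(0, 1), (1, 2), (0, 3)]

rng = random.Random(0)
feat_dim, hidden_dim, llm_dim = 4, 8, 16  # illustrative sizes only
features = [[1.0 if i == j else 0.0 for j in range(feat_dim)]
            for i in range(len(parts))]  # one-hot node features

a_hat = normalized_adjacency(len(parts), edges)
node_emb = gcn_layer(a_hat, features, random_matrix(feat_dim, hidden_dim, rng))

# Project each node embedding to the (assumed) LLM embedding width,
# yielding one vector per graph node instead of a long text description.
graph_tokens = matmul(node_emb, random_matrix(hidden_dim, llm_dim, rng))
print(len(graph_tokens), len(graph_tokens[0]))  # 4 16
```

In this sketch the four-node graph becomes four embedding vectors. Representing each entity by a single vector rather than a sentence of text is the intuition behind the much shorter context length the abstract reports, although the actual alignment procedure may differ.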