ICRA 2026

Transformer Driven Visual Servoing for Fabric Texture Matching Using Dual-Arm Manipulator

Fuyuki Tokuda, Akira Seino, Akinari Kobayashi, Kai Tang, Kazuhiro Kosuge

AI summary

A transformer-based visual servoing network trained entirely on synthetic data enables zero-shot, high-precision alignment of unseen fabric textures in real-world robotic manipulation.

Visual servoing, Transformer networks, Fabric manipulation, Dual-arm control, Sim-to-real transfer, Zero-shot learning

Problem

Existing visual servoing methods for deformable fabric manipulation generalize poorly to unseen textures and lighting conditions, and they rely heavily on costly real-world training data.

Approach

The system uses a Transformer network trained on synthetic images to predict pose differences for texture alignment, while a dual-arm impedance controller simultaneously flattens the fabric to ensure consistent visual feedback.
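To make the servoing loop concrete, here is a minimal sketch of closed-loop alignment driven by a predicted pose difference. It is not the authors' implementation: `predict_pose_diff` is a hypothetical stand-in for the Transformer network (here it simply returns the exact planar difference), the pose is assumed to be a planar `(x, y, theta)` triple with small angles, and the proportional `gain` and tolerance `tol` are illustrative values. Impedance control of the two arms is left out.

```python
def predict_pose_diff(current, target):
    """Hypothetical stand-in for the network: returns target - current exactly.

    In the real system this value would be predicted from camera images.
    """
    return tuple(t - c for c, t in zip(current, target))


def servo_step(pose, pose_diff, gain=0.5):
    """Move a fraction `gain` of the predicted difference (proportional law)."""
    return tuple(p + gain * d for p, d in zip(pose, pose_diff))


def run_servo(start, target, gain=0.5, tol=1e-4, max_iters=100):
    """Iterate until every component of the predicted difference is below tol."""
    pose = start
    for i in range(max_iters):
        diff = predict_pose_diff(pose, target)
        if max(abs(d) for d in diff) < tol:
            return pose, i
        pose = servo_step(pose, diff, gain)
    return pose, max_iters


if __name__ == "__main__":
    # Assumed units: millimetres for x and y, radians for theta.
    final_pose, iterations = run_servo((5.0, -3.0, 0.1), (0.0, 0.0, 0.0))
    print(final_pose, iterations)
```

With a gain below 1 the remaining error shrinks geometrically at each step, which is why a predictor that is only approximately correct can still drive the alignment toward the target as long as its estimates point in the right direction.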

Key results

Novel Transformer-based visual servoing network with Difference Extraction Attention Module (DEAM)
Zero-shot sim-to-real deployment trained exclusively on synthetic rendered data
Real-world alignment accuracy of 0.1 mm average position error across unseen textures
Robust performance under varying lighting conditions and diverse fabric patterns

Why it matters

This approach provides a scalable, data-efficient solution for automating precision fabric alignment in garment manufacturing, reducing reliance on costly real-world training data.

Abstract

In this paper, we propose a method to align and place a fabric piece on top of another using a dual-arm manipulator and a grayscale camera, so that their surface textures are accurately matched. We propose a novel control scheme that combines Transformer-driven visual servoing with dual-arm impedance control. This approach enables the system to simultaneously control the pose of the fabric piece and place it onto the underlying one while applying tension to keep the fabric piece flat. Our transformer-based network incorporates pre-trained backbones and a newly introduced Difference Extraction Attention Module (DEAM), which significantly enhances pose difference prediction accuracy. Trained entirely on synthetic images generated using rendering software, the network enables zero-shot deployment in real-world scenarios without requiring prior training on specific fabric textures. Real-world experiments demonstrate that the proposed system accurately aligns fabric pieces with different textures.

Index terms

Industrial Robots, Visual Servoing, Dual Arm Manipulation