ICRA 2026

MLLM-Fabric: Multimodal Large Language Model-Driven Robotic Framework for Fabric Sorting and Selection

Liman Wang, Hanyang Zhong, Tianyuan Wang, Shan Luo, Jihong Zhu

AI summary

A multimodal LLM trained on synchronized visual, tactile, and pressure data consistently outperforms pretrained vision-language baselines at ranking fabric properties and selecting materials for specific tasks.
Multimodal learning, Robotic fabric manipulation, Large language models, Visuotactile sensing, Property ranking, Knowledge distillation

Problem

Conventional robotic fabric classification fails to capture continuous physical properties like softness and elasticity, while prior multimodal approaches lack supervised property ranking and task-aware reasoning.

Approach

The system frames fabric selection as a property-specific pairwise comparison task and trains a multimodal LLM using supervised fine-tuning and explanation-guided knowledge distillation to enable interpretable, function-driven material ranking.
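The pairwise framing above can be illustrated with a minimal sketch, assuming a comparator that answers "which of these two fabrics has more of property X". In the paper that judgment would come from the fine-tuned model; here a hypothetical table of toy softness scores stands in for it, and the win-counting aggregation is an illustrative choice, not necessarily the authors' ranking procedure. All names (rank_by_pairwise_wins, toy_softness, softer) are invented for the example.

```python
from itertools import combinations
from typing import Callable, Dict, List


def rank_by_pairwise_wins(
    items: List[str], prefers: Callable[[str, str], bool]
) -> List[str]:
    """Rank items by how many pairwise comparisons each one wins.

    prefers(a, b) returns True when a beats b on the property of interest.
    """
    wins: Dict[str, int] = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        winner = a if prefers(a, b) else b
        wins[winner] += 1
    # Most wins first; sorted() is stable, so ties keep their input order.
    return sorted(items, key=lambda item: wins[item], reverse=True)


# Hypothetical softness scores standing in for a model's pairwise judgment.
toy_softness = {"denim": 0.2, "fleece": 0.9, "cotton": 0.6}


def softer(a: str, b: str) -> bool:
    """Toy comparator: True when fabric a is softer than fabric b."""
    return toy_softness[a] > toy_softness[b]


ranking = rank_by_pairwise_wins(list(toy_softness), softer)
print(ranking)
```

Swapping the comparator (for example, one per property such as softness or elasticity) yields a property-specific ranking without changing the aggregation step.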

Key results

  • A property-specific pairwise comparison framework for functional fabric selection
  • Fabric-Llama-90B model trained via supervised fine-tuning and explanation-guided distillation
  • Public release of a 220-fabric dataset, each fabric with RGB images and synchronized visuotactile and pressure data
  • Consistent outperformance of pretrained vision-language baselines in attribute ranking and selection reliability

Why it matters

Enables robots to make interpretable, function-driven material decisions for textile manufacturing, smart retail, and adaptive grasping applications.

Abstract

Choosing appropriate fabrics is critical for meeting functional and quality demands in robotic textile manufacturing, apparel production, and smart retail. We propose MLLM-Fabric, a robotic framework leveraging multimodal large language models (MLLMs) for fabric sorting and selection. Built on a multimodal robotic platform, the system is trained through supervised fine-tuning and explanation-guided distillation to rank fabric properties. We also release a dataset of 220 diverse fabrics, each with RGB images and synchronized visuotactile and pressure data. Experiments show that our Fabric-Llama-90B consistently outperforms pretrained vision-language baselines in both attribute ranking and selection reliability. Code and dataset are publicly available at https://github.com/limanwang/MLLM-Fabric.

Index terms

Object Detection, Segmentation and Categorization; Semantic Scene Understanding; Deep Learning in Grasping and Manipulation