ICRA 2026

GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Mingleyang Li, Yuran Wang, Yue Chen, Tianxing Chen, Jiaqi Liang, Zishun Shen, Haoran Lu, Ruihai Wu, Hao Dong

AI summary

GarmentPile++ enables robots to safely retrieve exactly one garment from a cluttered pile by combining vision-language reasoning with affordance-driven grasp planning and dynamic dual-arm cooperation.

cluttered garment retrieval, vision-language reasoning, affordance learning, dual-arm cooperation, robotic home assistance, SAM2 segmentation

Problem

Existing garment manipulation methods largely assume single, isolated garments in structured environments, failing in real-world cluttered piles where entanglement, occlusion, and language-guided task constraints make reliable single-item retrieval difficult.

Approach

The pipeline uses SAM2 segmentation and mask fine-tuning to isolate garments, a VLM to select targets based on language instructions, and a PointNet++ affordance model to predict optimal grasp points, dynamically triggering dual-arm cooperation when needed.
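The control flow described above can be sketched as a single planning function. This is a minimal illustration under stated assumptions, not the authors' implementation: `plan_retrieval`, `RetrievalPlan`, and the three injected callables are hypothetical names, the segmentation output is assumed to arrive as per-garment 3D point lists, and taking the highest-scoring point is only one plausible way to turn affordance scores into a grasp.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, z pointing up


@dataclass
class RetrievalPlan:
    target_id: int        # garment chosen by the reasoning step
    grasp_point: Point    # highest-scoring point on that garment
    use_second_arm: bool  # whether cooperation was requested


def plan_retrieval(
    instruction: str,
    garments: Dict[int, List[Point]],
    select_target: Callable[[str, Dict[int, List[Point]]], int],
    score_points: Callable[[List[Point]], List[float]],
    needs_second_arm: Callable[[List[Point], Point], bool],
) -> RetrievalPlan:
    """Plan one retrieval attempt.

    `garments` maps a segment id to the points inside that segment's mask
    (a stand-in for segmentation output; hypothetical data layout).
    The three callables stand in for the VLM, the affordance model,
    and the dual-arm trigger respectively.
    """
    if not garments:
        raise ValueError("no garment segments to choose from")

    # 1. Reasoning step: choose which garment to retrieve.
    target_id = select_target(instruction, garments)
    if target_id not in garments:
        raise KeyError(f"selector returned unknown garment id {target_id}")

    # 2. Affordance step: score every point on the target, keep the best.
    points = garments[target_id]
    scores = score_points(points)
    if len(scores) != len(points):
        raise ValueError("score_points must return one score per point")
    best = max(range(len(points)), key=scores.__getitem__)
    grasp = points[best]

    # 3. Cooperation step: ask whether a second arm is required.
    return RetrievalPlan(target_id, grasp, needs_second_arm(points, grasp))
```

Passing the three stages in as callables keeps the sketch independent of any particular model; in a real system each would wrap a model call, and the toy stand-ins used for testing here would be replaced.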

Key results

VLM-guided segmentation with mask fine-tuning accurately isolates garments in heavy occlusion
Affordance model predicts safe, single-arm grasp points maximizing retrieval feasibility
Dynamic dual-arm cooperation mechanism handles large garments and multi-garment lifts
Successful sequential and specific retrieval across real-world and simulation environments

Why it matters

Provides a robust, language-guided foundation for home-assistant robots to handle everyday cluttered garment piles reliably.

Abstract

Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ a visual segmentation model (SAM2) to execute object segmentation on the garment pile, aiding VLM-based reasoning with sufficient visual cues. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.
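To make the sagging case concrete, the following sketch shows one way a second-arm trigger could be decided. It is an assumption-laden illustration rather than the paper's mechanism: `second_grasp_if_sagging` is a hypothetical name, the garment is crudely modelled as hanging straight down from the grasp by the Euclidean distance of each point, and choosing the farthest point as the second grasp is a stand-in heuristic.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres


def second_grasp_if_sagging(
    points: List[Point],
    grasp: Point,
    lift_height: float,
    min_clearance: float,
) -> Optional[Point]:
    """Return a second grasp point if the lifted garment would sag too low.

    Assumed model (not from the paper): once the grasp point is raised to
    `lift_height`, a garment point at distance d from the grasp hangs at
    roughly `lift_height - d`. If the lowest such point falls below
    `min_clearance`, a second arm is requested at the farthest point.
    """
    if not points:
        raise ValueError("garment has no points")

    # The farthest point from the grasp hangs lowest under this model.
    farthest = max(points, key=lambda p: math.dist(p, grasp))
    lowest_height = lift_height - math.dist(farthest, grasp)

    if lowest_height >= min_clearance:
        return None  # a single arm keeps the garment clear of the pile
    return farthest  # candidate grasp for the cooperating arm
```

Under this model, a garment about 0.9 m long lifted to 0.6 m is flagged for a second arm, while a 0.3 m garment at the same height is not.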

Index terms

Manipulation Planning, Grasping, Representation Learning