ICRA 2026

ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation

Wang Enyi, Wen Fan, Dandan Zhang

AI summary

Dynamically weighting vision, tactile, and spatial inputs during multi-agent manipulation boosts success rates by 12–25% over static fusion baselines.

multi-agent manipulation, adaptive fusion, diffusion policy, tactile sensing, collision avoidance, decoupled training

Problem

Multi-agent robotic manipulation struggles with coordination, grasp stability, and collision avoidance in shared workspaces, while existing multi-modal policies rely on static fusion that wastes computation on irrelevant sensory data during different task phases.

Approach

ADM-DP decouples agent training while sharing end-effector positions, processes vision, tactile, and graph data through specialized encoders, and uses an attention mechanism to dynamically re-weight sensory inputs based on the current task context.
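
As a rough illustration of the re-weighting idea, the sketch below turns one relevance score per modality into softmax weights and forms a weighted sum of the modality embeddings. It is not the paper's implementation: the function names, the three-element embeddings, and the hand-set scores are all assumptions, and in ADM-DP the scores would come from a learned attention module conditioned on task context.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, scores):
    """Weight each modality embedding by its softmax score and sum them.

    features: dict of modality name -> list of floats (equal lengths)
    scores:   dict of modality name -> relevance score (hand-set here;
              assumed to come from a learned module in practice)
    """
    names = sorted(features)
    weights = softmax([scores[name] for name in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return dict(zip(names, weights)), fused


# Hypothetical embeddings and scores for a grasp phase.
features = {
    "vision": [1.0, 0.0, 0.0],
    "tactile": [0.0, 1.0, 0.0],
    "graph": [0.0, 0.0, 1.0],
}
scores = {"vision": 0.5, "tactile": 2.0, "graph": 0.2}
weights, fused = fuse_modalities(features, scores)
```

With these illustrative scores the tactile embedding receives the largest weight, which mirrors how a grasp phase might favor contact feedback over vision.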

Key results

12–25% success rate gains across seven multi-agent tasks over state-of-the-art baselines
Adaptive Modality Attention Mechanism dynamically prioritizes vision, tactile, or spatial cues per task phase
Tactile-guided grasping strategy enables real-time corrective grasp refinement using FSR feedback
Decoupled training paradigm scales to multi-agent setups while maintaining low interdependence and spatial awareness
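
The tactile-guided refinement in the list above can be pictured as a simple threshold loop: read the contact force, and tighten the grasp while the reading stays below a minimum. This is a hedged sketch only. The `SimulatedGripper` class, the `min_force` threshold, and the correction limit are hypothetical stand-ins, not the paper's controller or a real FSR driver.

```python
class SimulatedGripper:
    """Toy gripper whose contact force rises by a fixed gain per tightening."""

    def __init__(self, force=0.0, gain=0.25):
        self.force = force
        self.gain = gain

    def read_force(self):
        # Stand-in for a normalized FSR reading.
        return self.force

    def tighten(self):
        # Stand-in for a small corrective closing motion.
        self.force += self.gain


def refine_grasp(gripper, min_force=0.5, max_corrections=5):
    """Tighten until the force reading reaches min_force or attempts run out.

    Returns (grasp_ok, number_of_corrections_applied).
    """
    corrections = 0
    while gripper.read_force() < min_force:
        if corrections == max_corrections:
            return False, corrections
        gripper.tighten()
        corrections += 1
    return True, corrections


ok, used = refine_grasp(SimulatedGripper(force=0.0, gain=0.25))
```

Capping the number of corrections keeps the loop from running forever when contact never improves, for example if the object has slipped out of reach.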

Why it matters

Provides a scalable, robust framework for cooperative robotics by eliminating static sensory fusion bottlenecks, benefiting researchers and engineers in multi-robot manipulation.

Abstract

Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12–25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios. https://Enyi-Bean.github.io/ADM-DP/
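
FiLM, as named in the abstract, scales and shifts one feature vector using parameters predicted from another: out = gamma(c) * x + beta(c). The sketch below shows that operation in plain Python. Which modality conditions which is not stated here, so treating point-cloud features as the conditioning input for RGB features is an assumption, as are the tiny hand-written weight matrices.

```python
def linear(weights, bias, vector):
    """Affine map: one output per weight row, plus the matching bias."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def film(features, condition, gamma_params, beta_params):
    """Feature-wise Linear Modulation: gamma(condition) * features + beta(condition).

    gamma_params and beta_params are (weights, bias) pairs; they are
    hand-set here and would be learned in a real encoder.
    """
    gamma = linear(gamma_params[0], gamma_params[1], condition)
    beta = linear(beta_params[0], beta_params[1], condition)
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


# Assumed roles: point-cloud features condition the RGB features.
rgb_features = [3.0, 4.0]
point_cloud_features = [1.0, 2.0]
gamma_params = ([[2.0, 0.0], [0.0, 0.25]], [0.0, 0.0])  # gives gamma = [2.0, 0.5]
beta_params = ([[1.0, 0.0], [0.0, 0.0]], [0.0, -1.0])   # gives beta = [1.0, -1.0]
modulated = film(rgb_features, point_cloud_features, gamma_params, beta_params)
```

Because the scale and shift are computed per feature channel, the conditioning input can amplify some channels and suppress others without changing the dimensionality of the modulated vector.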

Index terms

Imitation Learning, Deep Learning in Grasping and Manipulation, Multi-Robot Systems