ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation
Wang Enyi, Wen Fan, Dandan Zhang
AI summary
Problem
Multi-agent robotic manipulation struggles with coordination, grasp stability, and collision avoidance in shared workspaces, while existing multi-modal policies rely on static fusion that wastes computation on irrelevant sensory data during different task phases.
Approach
ADM-DP decouples agent training while sharing end-effector positions, processes vision, tactile, and graph data through specialized encoders, and uses an attention mechanism to dynamically re-weight sensory inputs based on the current task context.
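The re-weighting idea can be illustrated with a small sketch. This is not the paper's mechanism: it swaps in plain scaled dot-product scoring followed by a softmax, with no learned parameters. The names `fuse_modalities`, `features`, and `context` are hypothetical, and the context vector stands in for whatever task-phase signal the policy actually uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, context):
    """Re-weight per-modality feature vectors by their relevance to a context.

    features: dict mapping a modality name to a feature vector (list of floats).
    context:  vector of the same length (assumed task-phase embedding).
    Returns (weights by modality, fused feature vector).
    """
    names = list(features)
    dim = len(context)
    # Assumption: relevance is a scaled dot product with the context vector.
    scores = [
        sum(f * c for f, c in zip(features[name], context)) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Fused representation is the attention-weighted sum of the modalities.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy 2-D features; a context aligned with the tactile axis mimics a grasp phase.
toy_features = {
    "vision": [1.0, 0.0],
    "tactile": [0.0, 1.0],
    "graph": [0.5, 0.5],
}
weights, fused = fuse_modalities(toy_features, context=[0.0, 2.0])
```

With this toy context the tactile modality receives the largest weight, which is the qualitative behavior the summary describes for contact-rich phases.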
Key results
- 12–25% success rate gains across seven multi-agent tasks over state-of-the-art baselines
- Adaptive Modality Attention Mechanism dynamically prioritizes vision, tactile, or spatial cues per task phase
- Tactile-guided grasping strategy enables real-time corrective grasp refinement using FSR feedback
- Decoupled training paradigm scales to multi-agent setups while maintaining low interdependence and spatial awareness
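As a rough illustration of the tactile-guided refinement listed above, the loop below checks a force reading against a threshold and triggers a corrective adjustment when contact is insufficient. Everything here is assumed for the sketch: the function name `refine_grasp`, the callables `read_fsr` and `adjust_grasp`, the threshold `min_force`, and the retry limit are hypothetical and are not taken from the paper.

```python
def refine_grasp(read_fsr, adjust_grasp, min_force=1.0, max_attempts=3):
    """Retry a grasp until the FSR reading indicates sufficient contact.

    read_fsr:     callable returning the current force reading (assumed units).
    adjust_grasp: callable performing one corrective adjustment.
    Returns (stable, adjustments_used).
    """
    for attempt in range(max_attempts + 1):
        if read_fsr() >= min_force:
            return True, attempt
        # Insufficient contact: adjust unless the retry budget is spent.
        if attempt < max_attempts:
            adjust_grasp()
    return False, max_attempts


# Simulated sensor: each adjustment moves to the next recorded reading.
readings = [0.2, 0.6, 1.3]
state = {"index": 0}


def fake_read():
    return readings[state["index"]]


def fake_adjust():
    state["index"] += 1


stable, adjustments = refine_grasp(fake_read, fake_adjust)
```

In the simulated run the grasp becomes stable after two adjustments. A real controller would replace the two callables with sensor and gripper interfaces.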
Why it matters
Provides a scalable, robust framework for cooperative robotics by eliminating static sensory fusion bottlenecks, benefiting researchers and engineers in multi-robot manipulation.
Abstract
Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages the shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12–25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios. https://Enyi-Bean.github.io/ADM-DP/
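For readers unfamiliar with FiLM, the general operation is an element-wise affine transform, output = gamma(c) * x + beta(c), where gamma and beta are predicted from a conditioning input c. The sketch below shows that operation only. Which stream conditions which is an assumption here (point-cloud features modulating RGB features), and the helper names and hand-picked weights are hypothetical, not the paper's encoder.

```python
def linear(weights, bias, x):
    """Apply a dense layer: one output per weight row, plus bias."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: gamma(cond) * features + beta(cond)."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-feature scale
    beta = linear(w_beta, b_beta, cond)     # per-feature shift
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Assumed roles: rgb_features are modulated, cloud_features condition them.
rgb_features = [1.0, 2.0]
cloud_features = [1.0, 0.0]
modulated = film(
    rgb_features,
    cloud_features,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]],
    b_gamma=[0.0, 1.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]],
    b_beta=[0.5, 0.0],
)
```

In a trained encoder the scale and shift layers would be learned jointly with the rest of the network rather than set by hand.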