ICRA 2026

TMR-VLA: Vision-Language-Action Model for Magnetic Motion Control of Tri-Leg Silicone-Based Soft Robot

Ruijie Tang, Chi Kit Ng, Kaixuan Wu, Long Bai, Guankun Wang, Yiming Huang, Yupeng Wang, Hongliang Ren

AI summary

TMR-VLA enables autonomous, sensor-free control of a magnetic soft robot by directly mapping external vision and language commands to coil voltages, achieving a 74% average success rate.

Vision-Language-Action; Magnetic Soft Robotics; End-to-End Control; Autonomous Navigation; Medical Robotics; Multimodal Learning

Problem

Miniature magnetic soft robots lack onboard power and sensors, forcing a decoupling of actuation and perception that makes autonomous control nearly impossible without manual expert guidance.

Approach

The authors introduce TMR-VLA, an end-to-end multimodal model that ingests sequential camera frames and natural language instructions to directly predict low-level voltage commands for external magnetic coils.
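The input/output contract described above can be sketched as a single closed-loop control step. This is only an illustrative sketch, not the authors' implementation: the names `Observation`, `control_step`, `clamp_voltages`, and `dummy_policy` are hypothetical, and the three-coil output, the four-frame window, and the 5 V limit are assumptions chosen for the example, since the summary does not state them.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in for a camera image: rows of pixel values.
Frame = List[List[int]]


@dataclass
class Observation:
    frames: Sequence[Frame]  # recent camera frames, oldest first
    instruction: str         # motion-type description, e.g. "squat"


# A policy maps an observation to one voltage per external coil.
Policy = Callable[[Observation], List[float]]


def clamp_voltages(voltages: List[float], v_max: float) -> List[float]:
    """Limit every coil voltage to the range [-v_max, v_max]."""
    return [max(-v_max, min(v_max, v)) for v in voltages]


def control_step(policy: Policy, history: List[Frame], new_frame: Frame,
                 instruction: str, window: int = 4,
                 v_max: float = 5.0) -> List[float]:
    """Append the newest frame, keep the last `window` frames,
    query the policy, and return clamped coil voltages.
    `window` and `v_max` are assumed values, not taken from the paper."""
    history.append(new_frame)
    del history[:-window]  # drop frames older than the window
    obs = Observation(frames=list(history), instruction=instruction)
    return clamp_voltages(policy(obs), v_max)


def dummy_policy(obs: Observation) -> List[float]:
    """Placeholder policy, NOT the TMR-VLA model: returns the same
    voltage for three assumed coils so the loop can be exercised."""
    level = 1.0 if "squat" in obs.instruction else 0.0
    return [level * len(obs.frames)] * 3
```

In a real system the placeholder policy would be replaced by the trained model, and the clamp would reflect the actual limits of the coil driver hardware.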

Key results

First end-to-end VLA framework for magnetic soft robotics
TrilegMR-Motion dataset with 15,793 image-voltage pairs
74% average success rate across five motion primitives
Outperforms strong open-source multimodal baselines in instruction understanding and execution

Why it matters

Provides a scalable, autonomous control paradigm for untethered medical robots, accelerating the development of minimally invasive in vivo diagnostics and therapies.

Abstract

In in-vivo environments, magnetically actuated soft robots offer advantages such as wireless operation and precise control, showing promising potential for painless detection and therapeutic procedures. We developed a trileg magnetically driven soft robot (TMR) whose multi-legged design enables more flexible gaits and diverse motion patterns. For a reconfigurable soft robot made of silicone, navigation can be decomposed into sequential motions such as squatting, rotating, lifting a leg, and walking. Its motion and behavior depend on its bending shape. To bridge motion-type descriptions and low-level voltage control, we introduce TMR-VLA, an end-to-end multimodal system that enables a trileg magnetic soft robot to perform hybrid motion types, a promising step toward navigation by adapting the robot's shape to language-constrained motion types. TMR-VLA leverages the embodied endoluminal localization ability of EndoVLA and takes sequential frames and natural language commands as input. It generates low-level voltage outputs based on the current observation and the given motion-type description. The results show that TMR-VLA can predict how the voltage applied to the TMR changes the dynamics of the silicone soft robot. TMR-VLA reached a 74% average success rate.

Index terms

Medical Robots and Systems; AI-Enabled Robotics; Modeling, Control, and Learning for Soft Robots