ICRA 2026

Dexora: Open-Source VLA for High-DoF Bimanual Dexterity

Hang Zhao, Pengwei Wang, Shanghang Zhang, Guocai Yao, Jianyu Chen, Hongyang Li, Hao Zhao

AI summary

Dexora is an open-source VLA for high-DoF dual-arm, dual-hand dexterous manipulation that achieves higher success rates than competing VLA baselines and generalizes across embodiments.

Vision-Language-Action, Dexterous Manipulation, Dual-Arm Control, Diffusion Transformer, Teleoperation, Cross-Embodiment Generalization

Problem

Existing Vision-Language-Action models are restricted to either low-DoF dual-arm grippers or single-arm dexterous hands, leaving a critical gap in coordinated, high-dimensional bimanual manipulation.

Approach

We introduce a hybrid teleoperation system that combines exoskeleton arm tracking with markerless finger capture, use it to build a large-scale simulated and real dataset, and train a diffusion-transformer policy on that data with clip-level weights from a learned data-quality discriminator.

Key results

First open-source VLA for dual-arm, dual-hand high-DoF manipulation
≥90% success on basic tasks and 66.7% average success on dexterous tasks (vs. 51.7% for competing VLA baselines)
Robust cross-embodiment generalization to grippers and low-DoF hands
Large-scale embodiment-matched dataset of 100K simulated and 10K real episodes

Why it matters

Provides a scalable, open-source pathway for training universal robot controllers capable of complex bimanual dexterity across diverse hardware platforms.

Abstract

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (177.5 hours, 3.2M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains ≥90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity. Demos, data, code, and models can be found at https://dexoravla.github.io.
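
The data-quality-aware recipe can be illustrated with a small sketch of a clip-weighted denoising loss. This is not the authors' implementation: the abstract states only that an offline discriminator supplies clip-level weights that down-weight low-quality demonstrations. The sigmoid mapping, the `temperature` and `floor` parameters, the mean-one normalization, and the function names `clip_weights` and `weighted_denoising_loss` are all assumptions made for illustration, and random arrays stand in for the policy's noise predictions.

```python
import numpy as np


def clip_weights(scores, temperature=1.0, floor=0.05):
    """Map discriminator scores (higher = better clip) to mean-one weights.

    Hypothetical mapping: a sigmoid squashes scores into (0, 1), a floor
    keeps every clip from being dropped entirely, and dividing by the mean
    keeps the overall loss scale comparable to unweighted training.
    """
    scores = np.asarray(scores, dtype=np.float64)
    w = 1.0 / (1.0 + np.exp(-scores / temperature))
    w = np.maximum(w, floor)
    return w / w.mean()


def weighted_denoising_loss(pred_noise, true_noise, weights):
    """Noise-prediction MSE per clip, averaged with clip-level weights.

    pred_noise, true_noise: arrays of shape (clips, horizon, action_dim).
    weights: array of shape (clips,), e.g. the output of clip_weights.
    """
    per_clip = ((pred_noise - true_noise) ** 2).mean(axis=(1, 2))
    return float((weights * per_clip).mean())


# Toy usage: 4 clips, 16-step action chunks, 8 action dimensions.
# (Assumed shapes; the paper's action dimensionality is not used here.)
rng = np.random.default_rng(0)
true_noise = rng.standard_normal((4, 16, 8))
pred_noise = true_noise + 0.1 * rng.standard_normal((4, 16, 8))
scores = np.array([-2.0, 0.0, 1.0, 3.0])  # stand-in discriminator outputs
loss = weighted_denoising_loss(pred_noise, true_noise, clip_weights(scores))
print(f"weighted loss: {loss:.5f}")
```

With uniform weights the function reduces to the ordinary mean-squared denoising loss, so the weighting changes only how much each demonstration clip contributes to the gradient.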

Index terms

Dexterous Manipulation, Deep Learning in Grasping and Manipulation, Dual Arm Manipulation