T(R, O) Grasp: Efficient Graph Diffusion of Robot-Object Spatial Transformation for Cross-Embodiment Dexterous Grasping
Xin Fei, Zhixuan Xu, Huaicong Fang, Tianrui Zhang, Lin Shao
AI summary
Problem
Current dexterous grasp synthesis methods trade off computational efficiency, cross-embodiment generalization, and robustness to initialization against one another, and often suffer from high memory overhead or brittle performance under partial observations.
Approach
The authors propose the T(R, O) Graph, a unified representation encoding spatial transformations between object patches and robotic hand links, and train a transformer-based graph diffusion model to efficiently generate grasps without relying on feasible initial poses.
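To make the representation concrete, here is a minimal sketch of a graph whose edges store the relative rigid transform between an object patch frame and a hand link frame. The function names, the 4x4 homogeneous-matrix encoding, and the restriction to patch-to-link edges are assumptions for illustration; the authors' actual node and edge features may differ.

```python
import numpy as np

def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform from a rotation about z and a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_transform(T_world_a, T_world_b):
    """Pose of frame b expressed in frame a: inv(T_world_a) @ T_world_b."""
    return np.linalg.inv(T_world_a) @ T_world_b

def build_graph(object_patch_poses, hand_link_poses):
    """Hypothetical edge layout: {(patch_idx, link_idx): 4x4 relative transform}."""
    return {
        (i, j): relative_transform(T_o, T_r)
        for i, T_o in enumerate(object_patch_poses)
        for j, T_r in enumerate(hand_link_poses)
    }

# Two object patches and one hand link, all expressed in a shared world frame.
patches = [make_pose(0.0, [0.0, 0.0, 0.0]), make_pose(np.pi / 2, [0.1, 0.0, 0.0])]
links = [make_pose(0.0, [0.0, 0.2, 0.0])]
graph = build_graph(patches, links)
```

Composing a patch pose with its edge transform recovers the link pose in the world frame (`patches[1] @ graph[(1, 0)]` equals `links[0]`). A set of predicted edge transforms therefore pins down where each link should sit relative to the object, and an inverse kinematics solver can turn those link poses into joint values.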
Key results
- 94.83% average success rate across diverse dexterous hands
- 0.21 s inference time and 41 grasps/second throughput on a single NVIDIA A100 40GB GPU
- 68% reduction in GPU memory usage compared to prior interaction-centric methods
- Enables reliable closed-loop manipulation through initialization-independent diffusion sampling
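The last point rests on sampling that starts from pure noise rather than from a feasible initial pose. The sketch below shows generic DDPM-style ancestral sampling under stated assumptions: a linear noise schedule, a hypothetical 6-D pose vector per hand link, and an oracle noise predictor standing in for the learned transformer. It is not the authors' sampler, whose details are not given in this summary.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, num_steps=1000, seed=0):
    """Ancestral DDPM sampling: start from Gaussian noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)  # random start, no feasible initial pose needed
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t, alpha_bars[t])
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the final one.
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Hypothetical target: one 6-D pose vector for each of 5 hand links.
target = np.linspace(-1.0, 1.0, 30).reshape(5, 6)

def oracle_noise(x_t, t, alpha_bar_t):
    """Exact noise predictor when the data distribution is the single point `target`."""
    return (x_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)

sample = ddpm_sample(oracle_noise, target.shape)
```

Because the oracle knows the single target, every random seed ends at the same sample. A learned denoiser trained on many grasps would instead map different noise draws to different valid grasps.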
Why it matters
It establishes a scalable, real-time capable foundation for cross-embodiment dexterous manipulation, bridging the gap between high-fidelity grasp generation and practical robotic deployment.
Abstract
Dexterous grasping remains a central challenge in robotics due to the complexity of its high-dimensional state and action space. We introduce T(R, O) Grasp, a diffusion-based framework that efficiently generates accurate and diverse grasps across multiple robotic hands. At its core is the T(R, O) Graph, a unified representation that models spatial transformations between robotic hands and objects while encoding their geometric properties. A graph diffusion model, coupled with an efficient inverse kinematics solver, supports both unconditioned and conditioned grasp synthesis. Extensive experiments on a diverse set of dexterous hands show that T(R, O) Grasp achieves an average success rate of 94.83%, an inference time of 0.21 s, and a throughput of 41 grasps per second on an NVIDIA A100 40GB GPU, substantially outperforming existing baselines. In addition, our approach is robust and generalizable across embodiments while significantly reducing memory consumption. More importantly, the high inference speed enables closed-loop dexterous manipulation, underscoring the potential of T(R, O) Grasp to scale into a foundation model for dexterous grasping. The code, appendix, and videos are available at https://nus-lins-lab.github.io/trograspweb/.
* denotes equal contribution. † denotes corresponding author. This research is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2024).