RoTri-Diff: A Spatial Robot–Object Triadic Interaction-Guided Diffusion Model for Bimanual Manipulation
Zixuan Chen, Nga Teng Chan, Yiwen Hou, Chenrui Tie, Zixuan Liu, Haonan Chen, Junting Chen, Jieqi Shi, Yang Gao, Jing Huo, Lin Shao
AI summary
Problem
Existing bimanual imitation learning methods overlook the dynamic geometric relationship between the two arms and the object, leading to inter-arm collisions, unstable grasps, and degraded coordination in complex tasks.
Approach
RoTri-Diff uses a hierarchical diffusion model to encode the relative 6D poses between both arms and the object, combining this triadic interaction signal with robot keyposes and object motion to generate stable, coordinated action sequences.
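The triadic signal described above can be illustrated with a minimal sketch. It assumes each arm's end-effector pose and the object pose are available as 4x4 homogeneous transforms in a shared world frame; the function names (`relative_pose`, `triadic_relative_poses`) and the example poses are hypothetical and are not taken from the RoTri-Diff implementation, whose exact encoding is not specified here.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def invert_pose(pose):
    """Invert a rigid transform via R^T and -R^T t."""
    rotation, translation = pose[:3, :3], pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse


def relative_pose(pose_a, pose_b):
    """Pose of frame b expressed in frame a (both inputs given in the world frame)."""
    return invert_pose(pose_a) @ pose_b


def triadic_relative_poses(left, right, obj):
    """Three pairwise relative poses: the edges of the arm-arm-object triangle."""
    return {
        "left_to_right": relative_pose(left, right),
        "left_to_object": relative_pose(left, obj),
        "right_to_object": relative_pose(right, obj),
    }


def rot_z(angle):
    """Rotation about the z axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical scene: two grippers facing each other with an object between them.
left = make_pose(np.eye(3), [0.0, 0.3, 0.5])
right = make_pose(rot_z(np.pi), [0.0, -0.3, 0.5])
obj = make_pose(np.eye(3), [0.0, 0.0, 0.5])

triad = triadic_relative_poses(left, right, obj)

# Triangle closure: the arm-to-arm edge is determined by the two arm-to-object edges.
closure = triad["left_to_object"] @ invert_pose(triad["right_to_object"])
print(np.allclose(closure, triad["left_to_right"]))
```

Because the three relative poses are tied together by this closure relation, a prediction that violates it is geometrically inconsistent. That is one way to read the "triangular geometric constraints" mentioned in the abstract, though how the model enforces them is not shown here.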
Key results
- 10.2% higher success rate than state-of-the-art baselines on 11 representative RLBench2 tasks
- Stable real-world execution across 4 challenging bimanual tasks
- Hierarchical diffusion architecture integrating keyposes, object flow, and triadic constraints
- Robust performance across keypose, continuous, and hybrid action modes
Why it matters
Provides a reliable, spatially-aware framework for dual-arm coordination, advancing imitation learning for complex robotic manipulation tasks.
Abstract
Bimanual manipulation is a fundamental robotic skill that requires continuous and precise coordination between two arms. While imitation learning (IL) is the dominant paradigm for acquiring this capability, existing approaches, whether robot-centric or object-centric, often overlook the dynamic geometric relationship among the two arms and the manipulated object. This limitation frequently leads to inter-arm collisions, unstable grasps, and degraded performance in complex tasks. To address this, we explicitly model the Robot–Object Triadic Interaction (RoTri) representation in bimanual systems by encoding the relative 6D poses between the two arms and the object, which captures their spatial triadic relationship and establishes continuous triangular geometric constraints. Building on this, we introduce RoTri-Diff, a diffusion-based imitation learning framework that combines RoTri constraints with robot keyposes and object motion in a hierarchical diffusion process. This enables the generation of stable, coordinated trajectories and robust execution across different modes of bimanual manipulation. Extensive experiments show that our approach outperforms state-of-the-art baselines by 10.2% on 11 representative RLBench2 tasks and achieves stable performance on 4 challenging real-world bimanual tasks. Project website: https://rotri-diff.github.io/.

†Corresponding Author
This work was completed during Zixuan and Nga Teng’s visit to the National University of Singapore.
1 Zixuan Chen and Jing Huo are with the School of Computer Science, Nanjing University, China. Emails: chenzx@nju.edu.cn, huojing@nju.edu.cn
2 Jieqi Shi and Yang Gao are with the School of Intelligence Science and Technology, Nanjing University, China. Emails: isjieqi@nju.edu.cn, gaoy@nju.edu.cn
3 Yiwen Hou, Chenrui Tie, Zixuan Liu, Haonan Chen, Junting Chen and Lin Shao are with the School of Computing, National University of Singapore, Singapore. Email: {yiwenhou,chenrui.tie,zixuanliu}@u.nus.edu, {chenhaonan,junting.chen,linshao}@u.nus.edu
4 Nga Teng Chan is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, China. Email: ntchanab@connect.ust.hk
This work is supported in part by the New Generation Artificial Intelligence-National Science and Technology Major Project (2025ZD0122904), the National Natural Science Foundation of China (62192783, 62276128, 62506153), the Jiangsu Science and Technology Major Project (BG2025035), the Fundamental Research Funds for the Central Universities (KG202514), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.