CaFe-TeleVision: A Coarse-To-Fine Teleoperation System with Immersive Situated Visualization for Enhanced Ergonomics
Zixin Tang, Yiming Chen, Quentin Rouxel, Dianxi Li, Shuang Wu, Fei Chen
AI summary
Problem
Current teleoperation systems suffer from workspace mismatches that cause physical strain, and from static multi-view VR feedback that increases cognitive load, causes visual distraction, and creates occlusion blind spots.
Approach
The system pairs a coarse-to-fine retargeting module—using natural mode for efficiency and joystick-assisted mode for fine ergonomic adjustments—with an immersive VR perception module that streams high-fidelity stereo video and displays wrist-camera views on-demand, spatially anchored to the grippers.
Key results
- Up to 28.89% higher task success rate than comparative methods
- 26.81% shorter task completion time
- Statistically significant reduction in cognitive task load and higher user acceptance
- Seamless mode switching preserves physical ergonomics without operational pauses
Why it matters
It offers a practical, deployable teleoperation framework that directly resolves critical ergonomic bottlenecks, benefiting remote manipulation, robotic skill data collection, and telepresence applications.
Abstract
Teleoperation presents a promising paradigm for remote control and robot proprioceptive data collection. Despite recent progress, current teleoperation systems still suffer from limitations in efficiency and ergonomics, particularly in challenging scenarios. In this paper, we propose CaFe-TeleVision, a coarse-to-fine teleoperation system with immersive situated visualization for enhanced ergonomics. At its core, a coarse-to-fine control mechanism is proposed in the retargeting module to bridge workspace disparities, jointly optimizing efficiency and physical ergonomics. To stream immersive feedback with adequate visual cues for human vision systems, an on-demand situated visualization technique is integrated in the perception module, which reduces the cognitive load for multi-view processing. The system is built on a humanoid collaborative robot and validated with six challenging bimanual manipulation tasks. User study among 24 participants confirms that CaFe-TeleVision enhances ergonomics with statistical significance, indicating a lower task load and a higher user acceptance during teleoperation. Quantitative results also validate the superior performance of our system across six tasks, surpassing comparative methods by up to 28.89% in success rate and accelerating by 26.81% in completion time. Project webpage: https://clover-cuhk.github.io/cafe_television/
Manuscript received 21 July 2025; revised 17 October 2025; accepted 14 November 2025. This paper was recommended for publication by Editor Ki-Uk Kyung upon evaluation of the Associate Editor and Reviewers’ comments. This work was supported in part by the Research Grants Council of the Hong Kong SAR under Grant 14211723, 14222722, 24209021 and C7100-22GF, in part by CUHK & HUAWEI Foundation Models and Interactive Intelligence Innovation Laboratory TH2520452 and in part by InnoHK of the Government of Hong Kong via the Hong Kong Centre for Logistics Robotics. (Corresponding authors: Fei Chen)
Zixin Tang, Yiming Chen, Quentin Rouxel, Dianxi Li, and Fei Chen are with the Department of Mechanical and Automation Engineering, T-Stone Robotics Institute, The Chinese University of Hong Kong, Hong Kong SAR (email: zxtang@mae.cuhk.edu.hk, ymchen@mae.cuhk.edu.hk, quentinrouxel@cuhk.edu.hk, dxli@mae.cuhk.edu.hk, f.chen@ieee.org).
Shuang Wu is with Huawei Hong Kong Research Center (email: wushuangust@gmail.com).