ICRA 2026

Real-Time Millimeter-Accurate Underwater Pose Estimation Via Tightly-Coupled Fusion of Vision and Optical Tracking

Yuer Gao, Tongqing Xu, Yi Cai

AI summary

Fusing high-frequency monocular vision with high-accuracy optical tracking enables real-time, millimeter-accurate underwater pose estimation at 62 FPS, improving position accuracy 1.6× over the best fusion baseline (EfficientPose + EKF) and 6.4× over vision-only estimation.

Underwater robotics, sensor fusion, pose estimation, optical tracking, visual odometry, real-time control

Problem

Underwater robotic applications require precise, high-frequency localization for agile control, but existing sensors face a fundamental speed-accuracy trade-off, with vision methods drifting over time and high-accuracy optical or acoustic systems lacking sufficient update rates.

Approach

A tightly-coupled Extended Kalman Filter fuses a high-frequency monocular vision pose estimator, augmented with a learned latent dynamics model to compensate for underwater disturbances, with periodic high-accuracy corrections from an external optical tracking system.
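The predict/correct structure described above can be illustrated with a deliberately simplified sketch. It is not the paper's filter: it uses a scalar, linear Kalman filter on a 1-D position instead of a tightly-coupled EKF on a full pose, and it omits the learned latent dynamics model. The function names, the noise variances `q` and `r`, the simulated drift, and the choice of one optical fix every 10 frames are all assumptions made for illustration.

```python
import math
import random


def fuse(vision_deltas, optical_fixes, q=1e-6, r=1e-6):
    """Scalar Kalman filter: propagate with vision increments, correct with optical fixes.

    vision_deltas: per-frame displacement estimates in metres, one per frame
    optical_fixes: dict mapping frame index -> absolute position in metres
    q, r: assumed process and measurement noise variances (m^2)
    """
    x, p = 0.0, 1e-4  # assumed initial position estimate and variance
    estimates = []
    for k, dx in enumerate(vision_deltas):
        # Predict: integrate the vision increment; uncertainty grows every frame.
        x += dx
        p += q
        # Correct: only on frames where an optical fix is available.
        if k in optical_fixes:
            gain = p / (p + r)
            x += gain * (optical_fixes[k] - x)
            p *= 1.0 - gain
        estimates.append(x)
    return estimates


def rmse(estimates, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


def simulate(n_frames=620, fix_every=10, seed=0):
    """Toy 1-D trajectory: biased, noisy vision increments plus sparse optical fixes."""
    rng = random.Random(seed)
    truth, deltas, fixes = [], [], {}
    pos = 0.0
    for k in range(n_frames):
        step = 0.001 * math.sin(k / 30.0)  # true per-frame motion (m)
        pos += step
        truth.append(pos)
        # Vision increment: true step plus a small bias (drift) and noise.
        deltas.append(step + 0.0002 + rng.gauss(0.0, 0.0002))
        if k % fix_every == 0:
            # Optical fix: absolute position with millimetre-level noise.
            fixes[k] = pos + rng.gauss(0.0, 0.001)
    vision_only = fuse(deltas, {})  # no corrections: pure dead reckoning
    fused = fuse(deltas, fixes)
    return rmse(vision_only, truth), rmse(fused, truth)


if __name__ == "__main__":
    vo, fu = simulate()
    print(f"vision-only RMSE: {vo * 1000:.1f} mm, fused RMSE: {fu * 1000:.1f} mm")
```

In this toy setting the uncorrected estimate accumulates the per-frame bias, while the sparse absolute fixes bound the error. That is the qualitative effect the approach relies on. The magnitudes it prints say nothing about the paper's results.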

Key results

Achieves 5.65 mm position RMSE at 62 FPS in controlled underwater tests
Improves accuracy by 1.6× over EfficientPose+EKF baseline and 6.4× over vision-only estimation
Introduces a neural network-based latent dynamics variable to implicitly compensate for unmodeled hydrodynamic disturbances
Releases a synchronized underwater localization dataset with video, control inputs, and high-precision optical ground truth
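The improvement factors above follow directly from the RMSE values reported in the abstract (5.65 mm fused, 9.20 mm for EfficientPose + EKF, 36 mm vision-only). A short check, with hypothetical helper and variable names:

```python
def improvement_factor(baseline_rmse_mm, fused_rmse_mm):
    """Ratio of baseline error to fused error; values above 1 mean the fused estimate is better."""
    return baseline_rmse_mm / fused_rmse_mm


FUSED_RMSE_MM = 5.65              # reported fused position RMSE
EFFICIENTPOSE_EKF_RMSE_MM = 9.20  # reported best-baseline RMSE
VISION_ONLY_RMSE_MM = 36.0        # reported vision-only RMSE
FRAME_RATE_HZ = 62                # reported update rate

if __name__ == "__main__":
    print(round(improvement_factor(EFFICIENTPOSE_EKF_RMSE_MM, FUSED_RMSE_MM), 1))  # 1.6
    print(round(improvement_factor(VISION_ONLY_RMSE_MM, FUSED_RMSE_MM), 1))        # 6.4
    print(round(1000 / FRAME_RATE_HZ, 1))  # frame period in ms: 16.1
```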

Why it matters

Provides high-fidelity, real-time state estimation, which is critical for validating control algorithms and for precise underwater manipulation in laboratory testbeds.

Abstract

Precise and high-frequency state estimation is required for advanced underwater robotic applications such as physical interaction and agile control, yet no single sensor can simultaneously provide both high accuracy and high update rates. Vision-based methods offer high-frequency updates but suffer from drift, while optical tracking systems are highly accurate but may not provide sufficiently high update rates for real-time control loops. This letter presents a tightly-coupled sensor fusion framework that combines a high-frequency (62 FPS) monocular vision-based pose estimator with a high-accuracy (millimeter-level) optical tracking system. Our approach uses a visual estimator for high-frequency state propagation (with a latent variable motion model to compensate for underwater disturbances), while the optical tracker provides periodic corrections. In a controlled underwater testbed, this achieves a position RMSE of 5.65 mm at 62 FPS, improving accuracy 1.6× compared to the best baseline method (EfficientPose + EKF: 9.20 mm) and 6.4× compared to vision-only estimation (36 mm). Our dataset and code are available upon request.

Index terms

Marine Robotics, Sensor Fusion, Localization