ICRA 2026

Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera

Tutian Tang, Xingyu Ji, Yutong Li, MingHao Liu, Wenqiang Xu, Cewu Lu

AI summary

Replacing monocular cameras with a single stereo camera enables real-time, metric-accurate, and shape-aware 3D human motion capture using sparse IMUs.

Stereo vision, IMU fusion, human motion capture, metric accuracy, shape-aware estimation, real-time pose

Problem

Existing hybrid visual-inertial motion capture systems rely on monocular cameras, whose depth ambiguity causes metric inaccuracies in global translation. They also ignore individual body shape variations, which leads to inconsistent local motions and foot-skating.
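
The depth ambiguity described above can be illustrated with a plain pinhole projection: moving a 3D point along its viewing ray leaves its pixel coordinates unchanged, so a single view cannot determine metric depth. The following is a minimal sketch only; the intrinsics and the function name `project` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

# Hypothetical pinhole intrinsics (illustrative values, not from the paper).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def project(point_xyz):
    """Project a 3D point in the camera frame (metres) to pixel coordinates."""
    x, y, z = point_xyz
    return np.array([FX * x / z + CX, FY * y / z + CY])

near = np.array([0.2, -0.1, 2.0])  # a joint 2 m in front of the camera
far = 1.5 * near                   # same viewing ray, 3 m in front of the camera

pixel_near = project(near)
pixel_far = project(far)

# Both points land on the same pixel, so one image cannot tell them apart.
print(pixel_near, pixel_far, np.allclose(pixel_near, pixel_far))
```

Any error in the assumed depth scales the recovered global translation by the same factor, which is the metric inaccuracy that monocular pipelines have to work around.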

Approach

The Stereo-Inertial Poser fuses six sparse IMUs with a single stereo camera, using state space models and a shape-aware fusion module to directly estimate metric 3D keypoints, anthropometric body shape, and drift-compensated joint and root movements in real time.
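
To show how a calibrated stereo baseline yields metric 3D keypoints, the sketch below applies the textbook depth-from-disparity relation for a rectified stereo pair, Z = f · B / d. It is not the authors' implementation, and the summary does not specify how the system extracts keypoints; the calibration values and the function name `triangulate_keypoint` are hypothetical.

```python
import numpy as np

# Hypothetical rectified-stereo calibration (illustrative values, not from the paper).
FOCAL_PX = 600.0        # focal length in pixels
CX, CY = 320.0, 240.0   # principal point in pixels
BASELINE_M = 0.12       # distance between the two camera centres in metres

def triangulate_keypoint(u_left, v_left, u_right):
    """Recover a metric 3D point from one keypoint matched across a rectified pair."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = FOCAL_PX * BASELINE_M / disparity   # depth in metres
    x = (u_left - CX) * z / FOCAL_PX        # lateral offset in metres
    y = (v_left - CY) * z / FOCAL_PX        # vertical offset in metres
    return np.array([x, y, z])

# A keypoint seen at column 380 in the left image and column 344 in the right image.
point = triangulate_keypoint(u_left=380.0, v_left=210.0, u_right=344.0)
print(point)
```

Because the baseline is known in metres, the recovered depth is metric without any assumption about the subject's height or limb lengths.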

Key results

Runs at over 200 FPS in real time without optimization-based post-processing
Produces drift-free global translation and metric-accurate 3D trajectories
Reduces foot-skating effects via dynamic shape-aware fusion
Sets state-of-the-art performance across multiple benchmark datasets

Why it matters

Provides a low-cost, highly accurate, and real-time motion capture solution critical for robotics, teleoperation, and human-robot interaction.

Abstract

Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects. The code, data, and supplementary materials are available at https://sites.google.com/view/stereo-inertial-poser.

Index terms

Gesture, Posture and Facial Expressions; Deep Learning for Visual Perception; Human Detection and Tracking