ICRA 2026

Robust Multimodal Dynamic Object Segmentation

Zhe Xin, Hanzhi Chang, Penghui Huang, Yinian Mao, Guoquan (Paul) Huang

AI summary

Fusing 2D tracking, 3D geometry, and semantic cues with iterative SAM refinement achieves state-of-the-art dynamic object segmentation and clean static scene reconstruction.

Dynamic object segmentation, multimodal fusion, 3D reconstruction, SAM refinement, 3D Gaussian Splatting, robotics

Problem

Existing methods rely on single modalities like optical flow or 3D reconstruction, causing inconsistent object boundaries, sensitivity to depth errors, and failure in multi-object scenarios.

Approach

The framework unifies 2D point tracks, 3D depth/pose data, and semantic features into trajectory-based inputs processed by a Transformer and clustering network, followed by a novel point-query SAM refinement to accurately separate multiple moving objects.
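To make the idea of a trajectory-based input concrete, here is a minimal sketch of how one tracked point could be turned into a per-frame feature row that combines its 2D position, its 3D world position lifted from depth and camera pose, and a semantic vector. The pinhole camera model, the camera-to-world pose convention, the feature layout, and all function names are illustrative assumptions; the summary does not specify how the paper builds its features.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pose = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (R, t), camera-to-world (assumed)


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) with depth into the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def cam_to_world(p_cam: Vec3, R: Sequence[Sequence[float]], t: Sequence[float]) -> Vec3:
    """Apply a camera-to-world pose: p_world = R * p_cam + t."""
    x, y, z = (sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def build_trajectory(track_2d: Sequence[Tuple[float, float]],
                     depths: Sequence[float],
                     poses: Sequence[Pose],
                     intrinsics: Tuple[float, float, float, float],
                     semantic: Sequence[float]) -> List[List[float]]:
    """Return one row per frame: [u, v, X, Y, Z, *semantic] (hypothetical layout)."""
    fx, fy, cx, cy = intrinsics
    rows = []
    for (u, v), d, (R, t) in zip(track_2d, depths, poses):
        X, Y, Z = cam_to_world(backproject(u, v, d, fx, fy, cx, cy), R, t)
        rows.append([u, v, X, Y, Z, *semantic])
    return rows


# A static point observed by a camera that moves 1 unit along x between frames.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rows = build_trajectory(
    track_2d=[(320.0, 240.0), (70.0, 240.0)],
    depths=[2.0, 2.0],
    poses=[(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [1.0, 0.0, 0.0])],
    intrinsics=(500.0, 500.0, 320.0, 240.0),
    semantic=[0.1, 0.9],
)
```

In this toy case the pixel moves by 250 px between frames while the lifted world position stays the same, which is the kind of signal a static/dynamic classifier can exploit when depth and pose are reliable. When they are not, the 2D and semantic parts of the row still carry information.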

Key results

State-of-the-art motion mask accuracy across PointOdyssey, DAVIS2017, and Sintel benchmarks
Superior static scene reconstruction quality using staticness-aware 3D Gaussian Splatting
Novel point-query SAM post-processing that correctly segments multiple objects per frame
Robust performance against feature degradation through adaptive multimodal trajectory classification
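One way to picture the multi-object point-query step is to split the points classified as dynamic in a frame into spatial groups, so that each group can serve as a separate set of point prompts for a segmenter such as SAM. The sketch below uses simple single-linkage grouping with a pixel radius. The grouping rule, the `radius` parameter, and the function name are assumptions made for illustration, not the paper's procedure, and no SAM call is shown.

```python
from math import hypot
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def group_points(points: List[Point], radius: float) -> List[List[Point]]:
    """Single-linkage grouping: points closer than `radius` share a group."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within the radius.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (xi, yi), (xj, yj) = points[i], points[j]
            if hypot(xi - xj, yi - yj) < radius:
                parent[find(i)] = find(j)

    # Collect points by their root; each list is one candidate object.
    groups: Dict[int, List[Point]] = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


# Dynamic points from two well-separated moving objects in one frame.
dynamic_points = [(10.0, 10.0), (12.0, 11.0), (200.0, 50.0), (203.0, 52.0)]
prompt_sets = group_points(dynamic_points, radius=20.0)
```

Each entry of `prompt_sets` would then be passed to the segmenter as its own query, so that two moving objects yield two masks rather than one merged region. The pairwise loop is quadratic in the number of points, which is acceptable for a few hundred tracks per frame but would need a spatial index for dense tracking.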

Why it matters

Provides a reliable pipeline for autonomous robotics and AR/VR systems to filter dynamic elements and reconstruct clean 3D environments from video.

Abstract

Dynamic object segmentation plays a critical role in many visual applications such as static scene reconstruction from dynamic videos. However, existing optical flow-based methods fail to ensure consistent static/dynamic segmentation along object boundaries, while 3D reconstruction-based approaches are highly sensitive to reconstruction errors. To address these limitations, we present a dynamic object segmentation framework that can generate both precise and complete dynamic masks by integrating multimodal cues including 2D point tracks, 3D reconstruction, and semantic information. We design a network combining Transformer architectures with feature clustering aggregation modules to perform static/dynamic classification of multimodal feature trajectories. It enables the model to adaptively determine which type of feature should dominate based on the characteristics of each scene, while also mitigating the impact of feature degradation. Additionally, we introduce a novel point-query-based SAM post-processing method capable of handling multiple objects within a single mask. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in both dynamic object segmentation and static scene reconstruction tasks.

Index terms

Computer Vision for Automation; Deep Learning for Visual Perception; Mapping