Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting
Weiquan Wang, Jun Xiao, Feifei Shao, Yi Yang, Yueting Zhuang, Long Chen
AI summary
Problem
Reconstructing dynamic scenes with multiple interacting humans and objects from sparse views is hindered by two problems: severe mutual occlusion, which breaks view consistency, and the combinatorial complexity of modeling subtle inter-instance dependencies at contact regions.
Approach
The method initializes 3D Gaussians from deformed canonical human and object models, then uses a cross-view fusion network to ensure per-instance consistency, followed by a global scene graph to refine Gaussian attributes based on inter-instance interactions.
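The initialization step can be illustrated with a minimal NumPy sketch. It assumes, for illustration only, that humans are deformed from canonical space by linear blend skinning and objects by a rigid transform; the summary does not specify the deformation model, and the function names `deform_human` and `deform_object` are hypothetical.

```python
import numpy as np

def deform_human(canonical_pts, skin_weights, bone_transforms):
    """Move canonical Gaussian centers to the posed frame with linear blend skinning.

    Assumed (not stated in the source) shapes:
      canonical_pts:   (N, 3) Gaussian centers in canonical space
      skin_weights:    (N, J) per-point bone weights, each row summing to 1
      bone_transforms: (J, 4, 4) homogeneous bone transforms for the current frame
    """
    # Blend the bone transforms per point: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", skin_weights, bone_transforms)
    ones = np.ones((canonical_pts.shape[0], 1))
    homo = np.concatenate([canonical_pts, ones], axis=1)  # (N, 4)
    posed = np.einsum("nab,nb->na", blended, homo)        # (N, 4)
    return posed[:, :3]

def deform_object(canonical_pts, rotation, translation):
    """Move canonical object points with a rigid transform (assumed 6-DoF pose).

      rotation:    (3, 3) rotation matrix
      translation: (3,) translation vector
    """
    return canonical_pts @ rotation.T + translation

# Usage: two bones, the second shifted by +1 along x, with equal weights.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
weights = np.full((2, 2), 0.5)
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, 0, 3] = 1.0
posed_human = deform_human(pts, weights, bones)  # every point moves +0.5 in x

rot_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
posed_object = deform_object(np.array([[1.0, 0.0, 0.0]]), rot_z90, np.array([0.0, 0.0, 1.0]))
```

The posed points would then serve as initial Gaussian centers; the remaining attributes (scale, rotation, opacity, color) are left out of this sketch.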
Key results
- Addresses the novel multi-human multi-object rendering task from sparse views
- Introduces MM-GS, a hierarchical framework decoupling per-instance fusion from scene-level interaction
- Designs specialized modules for cross-view consistency and graph-based inter-instance dependency modeling
- Reports state-of-the-art results on challenging datasets, with high-fidelity detail and plausible inter-instance contacts
Why it matters
Provides a critical foundation for safe human-robot interaction, navigation, intent prediction, and sim-to-real transfer by enabling realistic dynamic digital twins.
Abstract
Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.
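The two-stage hierarchy described in the abstract can be sketched in a few lines of NumPy. This is a toy stand-in, not the MM-GS modules: it assumes the fusion stage can be approximated by a visibility-weighted average over views and the interaction stage by one round of mean-aggregation message passing on a binary scene graph. The names `fuse_views` and `interaction_step`, and all array shapes, are hypothetical.

```python
import numpy as np

def fuse_views(view_feats, visibility):
    """Per-instance fusion: visibility-weighted average of per-view features.

    Assumed shapes:
      view_feats: (V, N, C) features sampled for N Gaussians from V views
      visibility: (V, N) weights in [0, 1]; 0 means occluded in that view
    """
    w = visibility[..., None]            # (V, N, 1)
    total = w.sum(axis=0)                # (N, 1)
    # Guard against Gaussians that are occluded in every view.
    return (w * view_feats).sum(axis=0) / np.maximum(total, 1e-8)

def interaction_step(instance_feats, adjacency):
    """Scene-level step: one round of mean message passing with a residual update.

      instance_feats: (K, C) one pooled feature per instance (human or object)
      adjacency:      (K, K) binary matrix, 1 where two instances interact
    """
    degree = adjacency.sum(axis=1, keepdims=True)
    messages = adjacency @ instance_feats / np.maximum(degree, 1.0)
    return instance_feats + messages

# Usage: two views of two Gaussians; the first Gaussian is occluded in view 1.
feats = np.stack([np.ones((2, 3)), 5.0 * np.ones((2, 3))])
vis = np.array([[1.0, 1.0], [0.0, 1.0]])
fused = fuse_views(feats, vis)

# Three instances; only instances 0 and 1 are in contact.
inst = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
refined = interaction_step(inst, adj)
```

Separating the two functions mirrors the decoupling in the abstract: occlusion is handled per instance before any cross-instance reasoning, so an isolated instance (row 2 above) passes through the interaction step unchanged.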