ICRA 2026

Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

Weiquan Wang, Jun Xiao, Feifei Shao, Yi Yang, Yueting Zhuang, Long Chen

AI summary

MM-GS enables high-fidelity, view-consistent rendering of complex multi-human and multi-object interactions from sparse views by decoupling per-instance fusion from scene-level graph-based interaction modeling.

Multi-Human Multi-Object Rendering, 3D Gaussian Splatting, Sparse-View Reconstruction, Inter-Instance Interaction, Digital Twins, Graph Attention

Problem

Reconstructing dynamic scenes with multiple interacting humans and objects from sparse views is hindered by two factors: severe mutual occlusion, which breaks view consistency, and the combinatorial complexity of modeling subtle inter-instance dependencies at contact regions.

Approach

The method initializes 3D Gaussians from deformed canonical human and object models, then uses a cross-view fusion network to ensure per-instance consistency, followed by a global scene graph to refine Gaussian attributes based on inter-instance interactions.
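
The scene-level refinement step can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes each instance is reduced to a single feature vector, uses plain dot-product attention with no learned weights, and adds the aggregated neighbour message as a residual. The function name `attend_over_scene_graph`, the instance ids, and the edge layout are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_over_scene_graph(features, edges):
    """One round of dot-product attention over a scene graph.

    features: dict instance_id -> feature vector (list of floats)
    edges:    dict instance_id -> list of neighbouring instance ids
    Returns a dict instance_id -> refined feature (input plus message).
    """
    refined = {}
    for node, feat in features.items():
        neighbours = edges.get(node, [])
        if not neighbours:
            refined[node] = list(feat)  # isolated instance: left unchanged
            continue
        scale = math.sqrt(len(feat))
        weights = softmax([dot(feat, features[n]) / scale for n in neighbours])
        # Attention-weighted sum of neighbour features, per dimension.
        message = [
            sum(w * features[n][i] for w, n in zip(weights, neighbours))
            for i in range(len(feat))
        ]
        refined[node] = [f + m for f, m in zip(feat, message)]
    return refined

# Hypothetical scene: two humans and one chair (assumed ids and features).
features = {"human_0": [1.0, 0.0], "human_1": [0.0, 1.0], "chair_0": [1.0, 1.0]}
edges = {"human_0": ["chair_0"], "chair_0": ["human_0", "human_1"]}
refined = attend_over_scene_graph(features, edges)
```

In this toy setup, an instance with no listed neighbours keeps its feature, while `chair_0` receives an equal-weight mix of both humans because their attention scores tie. A real system would presumably use learned projections and refine per-Gaussian attributes rather than one vector per instance.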

Key results

Addresses the novel multi-human multi-object rendering task from sparse views
Introduces MM-GS, a hierarchical framework decoupling per-instance fusion from scene-level interaction
Designs specialized modules for cross-view consistency and graph-based inter-instance dependency modeling
Achieves state-of-the-art performance on complex datasets with high-fidelity, coherent digital twins

Why it matters

Provides a critical foundation for safe human-robot interaction, navigation, intent prediction, and sim-to-real transfer by enabling realistic dynamic digital twins.

Abstract

Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.
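
As a rough intuition for aggregating information across views under occlusion, the sketch below averages one instance's per-view features, weighting each view by how visible the instance is in it. This is a hand-written stand-in, not the paper's learned fusion module: the function `fuse_views`, the visibility weights, and the feature values are assumptions made purely for illustration.

```python
def fuse_views(view_features, visibility):
    """Visibility-weighted average of one instance's per-view features.

    view_features: list of feature vectors, one per camera view
    visibility:    list of weights in [0, 1], one per view
                   (0 means the instance is fully occluded in that view)
    """
    total = sum(visibility)
    if total == 0:
        raise ValueError("instance is not visible in any view")
    dim = len(view_features[0])
    return [
        sum(w * f[i] for w, f in zip(visibility, view_features)) / total
        for i in range(dim)
    ]

# Three hypothetical views; the third is fully occluded and is ignored.
fused = fuse_views([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [1.0, 1.0, 0.0])
```

Here the occluded third view contributes nothing, so the fused feature is the mean of the first two views. A learned module would replace the fixed weights with predicted ones, but the principle of down-weighting occluded observations is the same.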

Index terms

Semantic Scene Understanding, Deep Learning for Visual Perception, Human-Centered Automation