ICRA 2026

NavCrafter: Exploring 3D Scenes from a Single Image

AI summary

NavCrafter synthesizes camera-controllable novel-view videos from a single image, enabling high-fidelity 3D scene reconstruction without dense multi-view inputs.

Novel View Synthesis, 3D Reconstruction, Video Diffusion Models, Camera Control, 3D Gaussian Splatting, Single-Image 3D

Problem

Generating flexible 3D scenes from a single image is critical but challenging due to geometric errors, weak camera supervision, and poor spatio-temporal consistency in existing generative models.

Approach

The framework leverages a video diffusion model with a multi-stage camera control mechanism to generate temporally consistent novel views, guided by a collision-aware trajectory planner and refined through an enhanced 3D Gaussian Splatting pipeline.
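The summary does not describe how the collision-aware planner works internally, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the scene is available as a set of 3D points (for example, unprojected from an estimated depth map) and that candidate trajectories are lists of camera positions. The function names and the `margin` parameter are hypothetical.

```python
import math

def min_clearance(position, scene_points):
    """Smallest Euclidean distance from a camera position to any scene point."""
    return min(math.dist(position, p) for p in scene_points)

def is_collision_free(trajectory, scene_points, margin=0.2):
    """True if every waypoint stays at least `margin` away from the scene.

    `margin` is an assumed safety distance in scene units.
    """
    return all(min_clearance(w, scene_points) >= margin for w in trajectory)

def filter_trajectories(candidates, scene_points, margin=0.2):
    """Keep only the candidate trajectories that pass the clearance check."""
    return [t for t in candidates if is_collision_free(t, scene_points, margin)]

# Toy scene: a flat wall of points at z = 1.0.
wall = [(x * 0.1, y * 0.1, 1.0) for x in range(-5, 6) for y in range(-5, 6)]

# Candidate 1 moves straight toward the wall and ends too close to it.
forward = [(0.0, 0.0, 0.3 * i) for i in range(4)]
# Candidate 2 moves sideways and keeps its distance from the wall.
sideways = [(0.2 * i, 0.0, 0.0) for i in range(4)]

safe = filter_trajectories([forward, sideways], wall)
```

In this toy setup only the sideways trajectory survives the filter. A real planner would also have to score the surviving trajectories for scene coverage, which this sketch leaves out.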

Key results

State-of-the-art novel-view synthesis under large viewpoint shifts
Substantially improved 3D reconstruction fidelity and geometric consistency
Precise camera control and broad scene coverage via collision-aware planning
Reduced camera pose errors and enhanced visual/temporal coherence

Why it matters

It provides a practical pipeline for high-fidelity 3D scene exploration and reconstruction from sparse inputs, benefiting VR/AR, robotics, and digital content creation.

Abstract

Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.

Index terms

Deep Learning for Visual Perception, Mapping, Visual Learning