ICRA 2026

TUN3D: Towards Real-World Scene Understanding from Unposed Images

Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova, Danila Rukhovich, Maksim Kolodiazhnyi

AI summary

TUN3D enables joint 3D object detection and room layout estimation from multi-view images without depth sensors or camera poses, achieving state-of-the-art real-world performance.

3D object detection, layout estimation, unposed images, sparse convolutions, scene understanding, structure-from-motion

Problem

Existing 3D scene understanding methods rely heavily on point clouds or depth sensors, limiting their use on consumer devices that only capture visual data. There is currently no method capable of jointly estimating room layouts and detecting 3D objects from unposed multi-view images.

Approach

TUN3D converts multi-view images into pseudo-point clouds using dense structure-from-motion, then processes them with a lightweight sparse-convolutional network. It jointly predicts 3D object bounding boxes and room layouts using a novel, geometrically constrained wall parameterization.
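
Sparse-convolutional networks typically consume a voxelized point cloud rather than raw points. As a rough illustration of that preprocessing step, the sketch below quantizes a (pseudo-)point cloud into a sparse set of occupied voxels. The function name `voxelize`, the 2 cm default voxel size, and the centroid feature are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from math import floor


def voxelize(points, voxel_size=0.02):
    """Quantize 3D points into a sparse set of occupied voxels.

    Hypothetical helper: returns a dict that maps integer voxel
    coordinates (i, j, k) to the centroid of the points inside that voxel.
    Only occupied voxels are stored, which is what makes the input sparse.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")

    buckets = defaultdict(list)
    for x, y, z in points:
        # floor() keeps negative coordinates in the correct voxel.
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        buckets[key].append((x, y, z))

    # Average the points of each voxel, one axis at a time.
    return {
        key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for key, pts in buckets.items()
    }
```

For example, `voxelize(cloud, voxel_size=0.05)` would merge all points that share a 5 cm cell into a single entry. Larger voxels reduce the number of active sites the network has to process, at the cost of geometric detail.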

Key results

State-of-the-art joint layout and detection performance across point clouds, posed, and unposed image inputs
Novel 2D-offset wall parameterization improves geometric consistency and layout accuracy
Robust real-world benchmark results on ScanNet and S3DIS without depth or pose supervision
Real-time inference speed matching specialized single-task detection models
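
The summary does not spell out the 2D-offset wall parameterization, so the following is only one plausible reading, not the paper's formulation. It assumes a wall is described by a reference point in the floor plane, two 2D offsets to the wall's endpoints, and a floor and ceiling height. The class `Wall` and all of its field names are hypothetical. The point of the sketch is that such a representation yields a vertical rectangle by construction, which is one way a parameterization can be geometrically constrained.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Wall:
    """Hypothetical wall: a floor-plane anchor plus two 2D endpoint offsets."""
    anchor: Tuple[float, float]     # (x, y) reference point in the floor plane
    offset_a: Tuple[float, float]   # 2D offset from the anchor to one endpoint
    offset_b: Tuple[float, float]   # 2D offset from the anchor to the other endpoint
    z_floor: float                  # height of the wall's bottom edge
    z_ceiling: float                # height of the wall's top edge

    def corners(self) -> List[Tuple[float, float, float]]:
        """Return the four 3D corners: bottom edge first, then top edge."""
        ax, ay = self.anchor
        (x0, y0), (x1, y1) = [
            (ax + dx, ay + dy) for dx, dy in (self.offset_a, self.offset_b)
        ]
        # Top corners reuse the bottom (x, y), so the wall is always vertical.
        return [
            (x0, y0, self.z_floor),
            (x1, y1, self.z_floor),
            (x1, y1, self.z_ceiling),
            (x0, y0, self.z_ceiling),
        ]
```

Predicting four free 3D corners would allow tilted or non-planar quads. Here the seven numbers can only ever describe an upright rectangle, so no post-hoc correction is needed.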

Why it matters

Enables real-time 3D scene understanding on everyday consumer devices without depth sensors, bridging the gap between academic benchmarks and practical AR/VR and robotics applications.

Abstract

Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding.

Index terms

Deep Learning for Visual Perception; RGB-D Perception; Computer Vision for Automation