ICRA 2026

Single-View 3D-Aware Representations for Reinforcement Learning by Cross-View Neural Radiance Fields

Daesol Cho, Seungyeon Yoo, Dongseok Shim, H. Jin Kim

AI summary

A novel framework extracts 3D-aware representations from single-view RGB images using cross-view NeRF completion, significantly boosting reinforcement learning performance in simulation and real-world robotic tasks without needing camera poses or multi-view inputs during deployment.

Reinforcement learning, 3D-aware representation, Neural Radiance Fields, single-view inference, robotic manipulation, cross-view completion

Problem

Image-based reinforcement learning typically relies on 2D visual features that lack 3D geometric awareness, while existing 3D-aware methods require impractical multi-view camera setups or precise camera poses during deployment, hindering real-world robotic applications.

Approach

The authors propose SinCro, which pre-trains a masked Vision Transformer encoder with a latent-conditioned NeRF decoder using cross-view completion and contrastive loss to learn view-invariant 3D scene representations, later deployed with only single-view RGB images for downstream reinforcement learning.
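The masked-encoder input mentioned in this approach can be illustrated with a minimal sketch. It only shows the generic idea of splitting an image into patches and hiding a random subset before encoding. The function names `patchify` and `random_mask`, the patch size of 16, and the mask ratio of 0.75 are all assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (N, P*P*C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(patches, mask_ratio, rng):
    """Keep a random subset of patches; return them with their indices."""
    n = patches.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return patches[keep_idx], keep_idx

# Hypothetical 64x64 RGB frame; patch size and mask ratio are assumed values.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patches = patchify(image, patch=16)
visible, keep_idx = random_mask(patches, mask_ratio=0.75, rng=rng)
print(patches.shape, visible.shape)  # (16, 768) (4, 768)
```

In a cross-view completion setup, only the visible patches of one view would be passed to the encoder, and the decoder would be asked to render a different view of the same scene. That rendering step is not sketched here.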

Key results

First framework to extract 3D-aware implicit representations using only single-view RGB images during RL deployment
Learns view-invariant, 3D geometry-aware scene representations via NeRF-based cross-view completion and contrastive regularization
Achieves superior downstream RL performance over prior methods in both Meta-World simulation and real-world UR3 robotic tasks
Demonstrates robustness to novel viewpoints without requiring camera pose information during deployment

Why it matters

Enables practical, deployment-ready robotic manipulation by eliminating the need for complex multi-view camera setups or pose calibration while preserving critical 3D spatial understanding for reinforcement learning.

Abstract

Reinforcement learning (RL) has enabled robots to develop complex skills, but its success in image-based tasks often depends on effective representation learning. Prior works have primarily focused on 2D representations, often overlooking the inherent 3D geometric structure of the world, or have attempted to learn 3D representations that require extensive resources such as synchronized multi-view images even during deployment. To address these issues, we propose a novel RL framework that extracts 3D-aware representations from single-view RGB input, without requiring camera pose or synchronized multi-view images during the downstream RL. Our method employs an autoencoder architecture, using a masked Vision Transformer (ViT) as the encoder and a latent-conditioned Neural Radiance Fields (NeRF) as the decoder, trained with cross-view completion to implicitly capture fine-grained, 3D geometry-aware representations. Additionally, we utilize a time contrastive loss that further regularizes the learned representation for consistency across different viewpoints, which enables viewpoint-robust robot manipulations. Our method significantly enhances the RL agent’s performance both in simulation and real-world experiments, demonstrating superior effectiveness compared to prior 3D-aware representation-based methods, even when using only single-view RGB images during deployment. Project page: https://sincro-ral.github.io/.
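The abstract does not give the exact form of the time contrastive loss, so the following is only a rough sketch of one common way to write such an objective. It uses an InfoNCE-style formulation, which is an assumption and may differ from the authors' loss. Embeddings of the same timestep seen from two viewpoints are treated as positives, and other timesteps as negatives. The function name `time_contrastive_loss` and the temperature of 0.1 are hypothetical.

```python
import numpy as np

def time_contrastive_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss (an assumed formulation, not the paper's).

    z_a, z_b: (T, D) embeddings of the same T timesteps from two views.
    Row t of z_a should match row t of z_b and mismatch all other rows.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # (T, T) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # positives on diagonal

# Synthetic embeddings: view B is a slightly perturbed copy of view A.
rng = np.random.default_rng(0)
z_view_a = rng.normal(size=(8, 16))
z_view_b = z_view_a + 0.01 * rng.normal(size=(8, 16))

loss_aligned = time_contrastive_loss(z_view_a, z_view_b)
loss_shuffled = time_contrastive_loss(z_view_a, np.roll(z_view_b, 1, axis=0))
print(loss_aligned < loss_shuffled)  # True
```

The loss is small when embeddings of the same timestep agree across views and larger when timesteps are misaligned. Minimizing it pushes the encoder toward viewpoint-consistent representations.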

Index terms

Reinforcement Learning, Representation Learning, Visual Learning