ICRA 2026

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

Yicheng Jiang, Jiaxu Wang, Junhao He, Zesen Gan, Junhao LI, Qiang Zhang, Jingkai SUN, Jiahang Cao, Mingyuan Sun, Xiangyu Yue, Qiming Shao

AI summary

A hybrid structural latent point representation combined with a lightweight 3D Gaussian Splatting pipeline significantly improves sample efficiency, robustness, and real-world performance in robotic manipulation.

3D-aware pretraining, structural latent points, 3D Gaussian Splatting, robotic manipulation, visual representation learning, latent variational autoencoder

Problem

Existing 3D-aware pretraining methods for robotics force a trade-off between expressive but unstructured implicit fields and structurally explicit but resolution-limited primitives, leading to poor generalization across varying scenes.

Approach

The authors insert a point-wise latent variational autoencoder into a point-cloud autoencoder to regularize features and coordinates toward a Gaussian prior, creating a compact structural latent representation. This is paired with a deliberately streamlined 3D Gaussian Splatting rendering pipeline for efficient, self-supervised 3D-aware pretraining.
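A minimal sketch of what a point-wise latent VAE bottleneck could look like, written in NumPy. It is illustrative only and not the authors' implementation: the function name `pointwise_vae_bottleneck`, the linear heads `w_mu` and `w_logvar`, and the choice to concatenate coordinates and features before encoding are all assumptions made here. The only parts taken from the summary are that each point gets its own latent and that the latents are pulled toward a Gaussian prior, which is expressed with the standard KL term against N(0, I).

```python
import numpy as np

def pointwise_vae_bottleneck(coords, feats, w_mu, w_logvar, rng):
    """Hypothetical point-wise VAE bottleneck over a set of latent points.

    coords: (N, 3) latent point coordinates
    feats:  (N, C) per-point features
    w_mu, w_logvar: (3 + C, D) linear heads (stand-ins for learned layers)
    Returns sampled latents z of shape (N, D) and the mean KL per point.
    """
    # Assumption: coordinates and features are encoded jointly per point.
    x = np.concatenate([coords, feats], axis=1)
    mu = x @ w_mu
    logvar = x @ w_logvar

    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps

    # KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, per point.
    kl_per_point = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return z, kl_per_point.mean()

# Example call with random stand-in data.
rng = np.random.default_rng(0)
N, C, D = 128, 16, 8
coords = rng.standard_normal((N, 3))
feats = rng.standard_normal((N, C))
w_mu = 0.1 * rng.standard_normal((3 + C, D))
w_logvar = 0.1 * rng.standard_normal((3 + C, D))
z, kl = pointwise_vae_bottleneck(coords, feats, w_mu, w_logvar, rng)
```

In a training loop, a KL term of this kind would typically be added, with a small weight, to the reconstruction or rendering loss, so that the latent points stay compact without being forced to encode exact geometry.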

Key results

Consistent gains in task success and sample efficiency on RLBench and ManiSkill2 benchmarks
Successful real-robot validation across six diverse manipulation tasks
Improved robustness to viewpoint and scene variations over strong baselines
Ablation studies confirm the critical role of the point-wise latent VAE and streamlined rendering pipeline

Why it matters

Bridges the expressiveness of implicit fields with the structural priors of explicit representations, offering a scalable and efficient foundation for downstream embodied AI and robotic manipulation.

Abstract

Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation: structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.
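For readers unfamiliar with 3DGS rendering, the sketch below shows the standard per-pixel step it relies on: front-to-back alpha compositing of depth-sorted Gaussians. This is the generic formulation, not the authors' streamlined pipeline, whose details are not given on this page. The function name `composite_front_to_back` and the toy inputs are hypothetical.

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Blend depth-sorted per-pixel contributions (nearest Gaussian first).

    colors: (K, 3) RGB contribution of each Gaussian at this pixel
    alphas: (K,)   opacity of each Gaussian at this pixel, in [0, 1]
    Returns the blended RGB value and the per-Gaussian blending weights.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching Gaussian i: product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance
    return weights @ colors, weights

# Toy example: a half-transparent red Gaussian in front of a green one.
pixel, weights = composite_front_to_back(
    colors=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    alphas=[0.5, 0.5],
)
```

Because every operation here is differentiable with respect to colors and opacities, a photometric loss on the rendered pixels can supply the self-supervised training signal for the upstream representation.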

Index terms

Deep Learning for Visual Perception, Representation Learning, RGB-D Perception