Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
Yicheng Jiang, Jiaxu Wang, Junhao He, Zesen Gan, Junhao Li, Qiang Zhang, Jingkai Sun, Jiahang Cao, Mingyuan Sun, Xiangyu Yue, Qiming Shao
AI summary
Problem
Existing 3D-aware pretraining methods for robotics force a trade-off between expressive but unstructured implicit fields and structurally explicit but resolution-limited primitives, leading to poor generalization across varying scenes.
Approach
The authors insert a point-wise latent variational autoencoder into a point-cloud autoencoder to regularize features and coordinates toward a Gaussian prior, creating a compact structural latent representation. This is paired with a deliberately streamlined 3D Gaussian Splatting rendering pipeline for efficient, self-supervised 3D-aware pretraining.
Key results
- Consistent gains in task success and sample efficiency on RLBench and ManiSkill2 benchmarks
- Successful real-robot validation across six diverse manipulation tasks
- Improved robustness to viewpoint and scene variations over strong baselines
- Ablation studies confirm the critical role of the point-wise latent VAE and streamlined rendering pipeline
Why it matters
Bridges the expressiveness of implicit fields with the structural priors of explicit representations, offering a scalable and efficient foundation for downstream embodied AI and robotic manipulation.
Abstract
Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation: structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.
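
To make the idea of jointly regularizing point-wise features and coordinates toward a Gaussian prior more concrete, here is a minimal NumPy sketch of a per-point VAE bottleneck. It is an illustration under stated assumptions, not the authors' implementation: the function name `pointwise_vae_bottleneck`, the linear projections `w_mu` and `w_logvar`, and the choice of a standard normal prior with a diagonal posterior are all hypothetical, since the abstract does not specify the architecture or loss weighting.

```python
import numpy as np


def pointwise_vae_bottleneck(coords, feats, w_mu, w_logvar, rng):
    """Hypothetical per-point VAE bottleneck (illustrative only).

    coords:   (N, 3) point coordinates from a point-cloud encoder
    feats:    (N, C) point-wise features from the same encoder
    w_mu:     (3 + C, D) assumed linear map to the posterior mean
    w_logvar: (3 + C, D) assumed linear map to the posterior log-variance
    rng:      numpy.random.Generator used for the reparameterization noise
    """
    # Treat coordinates and features as one per-point vector so that
    # both are regularized by the same prior.
    x = np.concatenate([coords, feats], axis=1)      # (N, 3 + C)

    mu = x @ w_mu                                    # (N, D)
    logvar = x @ w_logvar                            # (N, D)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps              # (N, D)

    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent
    # dimensions and averaged over points.
    kl_per_point = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return z, kl_per_point.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.standard_normal((128, 3))           # toy point cloud
    feats = rng.standard_normal((128, 16))           # toy point-wise features
    w_mu = 0.1 * rng.standard_normal((19, 8))
    w_logvar = 0.1 * rng.standard_normal((19, 8))
    z, kl = pointwise_vae_bottleneck(coords, feats, w_mu, w_logvar, rng)
    print(z.shape, float(kl))
```

In a full pipeline of the kind the abstract describes, a KL term like this would presumably be added to a rendering-based reconstruction loss, with the sampled latent points `z` passed on to the decoder and the 3DGS-based renderer. How the two terms are weighted is not stated in the abstract.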