ICRA 2026

VistaBot: View-Robust Robot Manipulation Via Spatiotemporal-Aware View Synthesis

Songen Gu, Yuhang Zheng, Weize Li, Yupeng Zheng, Yating Feng, Xiang Li, Yilun Chen, Pengfei Li, Wenchao Ding

AI summary

VistaBot enables robust cross-view robotic manipulation by fusing geometric priors with video diffusion models to synthesize canonical-view latents for closed-loop control without test-time calibration.

view-robust manipulation, video diffusion models, novel view synthesis, closed-loop control, cross-view generalization, geometric priors

Problem

End-to-end robotic manipulation policies suffer from severe performance degradation when camera viewpoints change during testing, typically requiring tedious re-calibration or retraining.

Approach

The framework uses a fine-tuned feed-forward geometric model to estimate 4D structure and a conditional video diffusion model to extract spatiotemporal latents from novel views, enabling a policy to learn actions directly from these geometry-aware representations.
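The three stages named above (4D geometry estimation, view-synthesis latent extraction, latent action learning) can be pictured as a simple data flow. The sketch below is only an illustration of that flow under loud assumptions: every function name, type, and toy rule in it (`estimate_geometry`, `extract_view_latent`, `latent_policy`, `control_step`, the 4-dimensional latent, the 7-dimensional action) is hypothetical and stands in for a learned model. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Toy grayscale image as nested lists; a real system would use image tensors.
Frame = List[List[float]]


@dataclass
class Geometry4D:
    """Hypothetical container for per-frame structure over time."""
    depth: List[Frame]


def estimate_geometry(frames: Sequence[Frame]) -> Geometry4D:
    """Stage 1 stand-in for the feed-forward geometric model.

    Toy rule (assumption): pixel intensity is reused as a depth proxy.
    """
    return Geometry4D(depth=[[row[:] for row in frame] for frame in frames])


def extract_view_latent(geometry: Geometry4D, latent_dim: int = 4) -> List[float]:
    """Stage 2 stand-in for the conditional video diffusion model.

    Toy rule (assumption): average consecutive chunks of depth values
    into a fixed-size vector that plays the role of the view latent.
    """
    values = [v for frame in geometry.depth for row in frame for v in row]
    chunk = max(1, len(values) // latent_dim)
    latent = []
    for i in range(latent_dim):
        part = values[i * chunk:(i + 1) * chunk] or [0.0]
        latent.append(sum(part) / len(part))
    return latent


def latent_policy(latent: Sequence[float], action_dim: int = 7) -> List[float]:
    """Stage 3 stand-in for the policy that acts on latents.

    Toy rule (assumption): broadcast the latent mean to every action dimension.
    """
    mean = sum(latent) / len(latent)
    return [mean] * action_dim


def control_step(frames: Sequence[Frame]) -> List[float]:
    """One closed-loop step: frames -> geometry -> latent -> action."""
    geometry = estimate_geometry(frames)
    latent = extract_view_latent(geometry)
    return latent_policy(latent)
```

The point of the sketch is the interface: the policy never sees raw pixels from the test camera, only the latent produced by the synthesis stage, and no camera pose enters any of the calls.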

Key results

Improves View Generalization Score by 2.79× over ACT and 2.63× over π0
Delivers high-fidelity novel view synthesis and robust closed-loop manipulation in simulation and real-world settings
Eliminates the need for test-time camera calibration or pose estimation
Introduces the View Generalization Score (VGS) metric for standardized cross-view evaluation

Why it matters

Enables scalable deployment of robust visuomotor policies in dynamic environments where camera positions cannot be fixed or calibrated at runtime.

Abstract

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when training with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based (π0) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79× and 2.63× over ACT and π0, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.

Index terms

Perception for Grasping and Manipulation; Deep Learning in Grasping and Manipulation; Imitation Learning