ICRA 2026

GAF: Gaussian Action Field As a 4D Representation for Dynamic World Modeling in Robotic Manipulation

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu

AI summary

GAF explicitly models dynamic scene evolution via a 4D Gaussian representation to directly predict and refine robotic actions, significantly outperforming static 3D and direct vision-to-action baselines.

Gaussian Splatting, 4D Representation, Robotic Manipulation, Dynamic World Model, Action Prediction, Vision-to-Action

Problem

Existing vision-based robotic manipulation methods often produce inaccurate actions because they rely on static 3D representations or direct 2D-to-action mappings, neither of which captures how the scene changes over time.

Approach

GAF extends 3D Gaussian Splatting with learnable motion attributes to reconstruct current and future scene states from sparse RGB inputs, then extracts initial actions from Gaussian displacement and refines them via an action-vision-aligned diffusion denoiser.
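One way to picture "extracting an initial action from Gaussian displacement" is a least-squares rigid fit between the current and predicted Gaussian centers. The sketch below is illustrative only: it assumes the Gaussians attached to the end-effector move rigidly and uses the standard Kabsch algorithm, which the summary does not confirm as the paper's actual procedure. The function name `rigid_transform_from_displacement` and the toy data are hypothetical.

```python
import numpy as np


def rigid_transform_from_displacement(mu_now, mu_next):
    """Least-squares rigid fit (Kabsch): mu_next ~= mu_now @ R.T + t.

    mu_now, mu_next: (N, 3) arrays of Gaussian centers at the current
    and predicted future time step (assumed to correspond row by row).
    """
    mu_now = np.asarray(mu_now, dtype=float)
    mu_next = np.asarray(mu_next, dtype=float)

    # Center both point sets on their centroids.
    c_now = mu_now.mean(axis=0)
    c_next = mu_next.mean(axis=0)

    # 3x3 cross-covariance between the centered sets.
    H = (mu_now - c_now).T @ (mu_next - c_next)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_next - R @ c_now
    return R, t


# Toy data (hypothetical): 50 centers rotated 10 degrees about z and shifted.
rng = np.random.default_rng(0)
mu_now = rng.normal(size=(50, 3))
theta = np.deg2rad(10.0)
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.02, 0.0, -0.01])
mu_next = mu_now @ R_true.T + t_true

R_est, t_est = rigid_transform_from_displacement(mu_now, mu_next)
```

In this reading, `(R_est, t_est)` would serve as a coarse relative end-effector pose that a downstream refinement stage, such as the diffusion denoiser mentioned above, could then correct.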

Key results

+11.54 dB PSNR, +0.39 SSIM, and -0.56 LPIPS improvements in scene reconstruction
+7.3% average success rate boost in robotic manipulation tasks over state-of-the-art baselines
Pose-free dynamic scene reconstruction and future frame prediction from sparse multi-view RGB inputs
Successful real-world deployment with robust closed-loop manipulation under occlusion
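For readers unfamiliar with the reconstruction metric, the sketch below shows the standard PSNR definition, assuming pixel values normalized to [0, 1]. A figure such as "+11.54 dB" is then the difference between two PSNR values (method minus baseline). The pixel values here are toy numbers, not data from the paper.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 0.1, so MSE = 0.01 and PSNR = 20 dB.
toy_psnr = psnr([0.0, 1.0], [0.1, 0.9])
```

Because PSNR is logarithmic, a gain of about 10 dB corresponds to roughly a tenfold reduction in mean squared error.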

Why it matters

It bridges high-fidelity 4D dynamic perception with precise action generation, enabling more reliable and generalizable vision-based robotic manipulation in unstructured environments.

Abstract

Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often produce inaccurate actions because of the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements over state-of-the-art methods: GAF improves reconstruction quality by +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS, and raises the average success rate in robotic manipulation tasks by 7.3%.

Index terms

Visual Learning, Perception for Grasping and Manipulation, Learning from Demonstration