ICRA 2026

Learning Unified Probabilistic Spatial Relation Representation from Visual Demonstrations

Paul Emil Hannuschka, Jianfeng Gao, Tamim Asfour

AI summary

A few-shot spotlight model learns unified probabilistic spatial relations from visual demonstrations, enabling robots to accurately reason about and generate 3D object configurations with quantified uncertainty.

Spatial relations, probabilistic modeling, robotic manipulation, few-shot learning, spotlight metaphor, 3D reasoning

Problem

Existing spatial relation models oversimplify object geometry and lack generative capabilities, while vision-language models require massive datasets and struggle with precise 3D metric reasoning for robotic manipulation.

Approach

The authors model spatial relations using a rotating spotlight metaphor that computes directional, distance, and overlap distributions from object point clouds, allowing the system to learn and generalize spatial constraints from just one or a few demonstrations.
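As a rough illustration of what directional and distance distributions computed from object point clouds can look like, the following minimal sketch builds a circular kernel density estimate over azimuth angles and a Gaussian summary of distances between two synthetic clouds. It is not the authors' spotlight formulation: the function names, the von Mises kernel, the bandwidth `kappa`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def pairwise_displacements(reference, target):
    """Vectors from every reference point to every target point, shape (N*M, 3)."""
    return (target[None, :, :] - reference[:, None, :]).reshape(-1, 3)

def azimuth_density(displacements, query_angles, kappa=8.0):
    """Circular KDE over xy-plane azimuth angles using von Mises kernels.

    kappa is a hypothetical concentration (inverse bandwidth) parameter.
    """
    angles = np.arctan2(displacements[:, 1], displacements[:, 0])
    diffs = query_angles[:, None] - angles[None, :]
    kernels = np.exp(kappa * np.cos(diffs)) / (2.0 * np.pi * np.i0(kappa))
    return kernels.mean(axis=1)

def distance_gaussian(displacements):
    """Mean and standard deviation of point-to-point distances."""
    norms = np.linalg.norm(displacements, axis=1)
    return norms.mean(), norms.std()

# Synthetic example: two small boxes, the target shifted 0.3 m along +x.
rng = np.random.default_rng(0)
reference = rng.uniform(-0.05, 0.05, size=(100, 3))
target = rng.uniform(-0.05, 0.05, size=(100, 3)) + np.array([0.3, 0.0, 0.0])

disp = pairwise_displacements(reference, target)
query = np.linspace(-np.pi, np.pi, 181)
density = azimuth_density(disp, query)
peak_angle = query[np.argmax(density)]          # direction with highest density
mean_dist, std_dist = distance_gaussian(disp)   # distance summary

print(f"peak azimuth: {peak_angle:.3f} rad, distance: {mean_dist:.3f} +/- {std_dist:.3f} m")
```

Because every point pair contributes, the resulting distributions reflect object extent rather than a single centroid offset, which is the kind of geometric detail the summary says point and bounding-box models lose.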

Key results

Unified probabilistic representation encoding directional, distance, and topological relations via kernel density estimation (KDE), Gaussian mixture models (GMM), and Gaussian distributions
Framework learns static and dynamic spatial relations from one or a few visual demonstrations
Superior performance on generative manipulation and discriminative reasoning tasks compared to baseline geometric models
Competitive results against large-scale vision-language models while requiring minimal training data
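To make the distinction between discriminative and generative use of a probabilistic relation model concrete, here is a deliberately simplified sketch. It fits a diagonal Gaussian to relative object offsets from three hypothetical demonstrations, scores candidate configurations by log-likelihood, and samples new placements. The centroid-offset representation, the variance floor `min_var`, and the demonstration values are assumptions for illustration only; the paper's representation is richer than this stand-in.

```python
import numpy as np

def fit_offset_gaussian(offsets, min_var=1e-4):
    """Fit a diagonal Gaussian to a handful of 3D offsets.

    min_var is a hypothetical floor that keeps the variance usable
    when only one or a few demonstrations are available.
    """
    offsets = np.atleast_2d(np.asarray(offsets, dtype=float))
    mean = offsets.mean(axis=0)
    var = np.maximum(offsets.var(axis=0), min_var)
    return mean, var

def log_likelihood(offset, mean, var):
    """Discriminative use: how well does a configuration match the relation?"""
    offset = np.asarray(offset, dtype=float)
    return float(-0.5 * np.sum((offset - mean) ** 2 / var + np.log(2.0 * np.pi * var)))

def sample_offsets(mean, var, n, rng):
    """Generative use: draw candidate placements consistent with the relation."""
    return rng.normal(mean, np.sqrt(var), size=(n, mean.shape[0]))

# Three made-up demonstrations of an "on top of" relation (offsets in metres).
demos = [[0.00, 0.01, 0.12], [0.01, -0.01, 0.10], [-0.01, 0.00, 0.11]]
mean, var = fit_offset_gaussian(demos)

on_top = log_likelihood([0.0, 0.0, 0.11], mean, var)   # similar to the demos
beside = log_likelihood([0.30, 0.0, 0.0], mean, var)   # a different relation

rng = np.random.default_rng(0)
samples = sample_offsets(mean, var, 5, rng)

print(f"on_top={on_top:.1f}, beside={beside:.1f}")
print(samples)
```

The same fitted parameters serve both tasks, and the variance gives a direct handle on uncertainty: a looser relation yields wider samples and flatter scores.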

Why it matters

Provides robots with a data-efficient, uncertainty-aware method to interpret and synthesize precise 3D spatial configurations, advancing autonomous manipulation and human-robot interaction.

Abstract

The ability to interpret and reason about spatial relations is fundamental for robotic manipulation tasks. For instance, a robot must understand that “inside” requires different geometric constraints than “touching”, and “closer” involves dynamic changes in distance relationships. Despite progress in modeling spatial relations, existing approaches face two critical limitations: they either oversimplify object geometry to points or bounding boxes, or they lack generative capabilities for synthesizing new spatial configurations. This paper introduces a novel generative and probabilistic model that jointly encodes object sizes, distances, and orientations within a unified representation, which captures distance-based, directional, and topological spatial relations while providing explicit uncertainty quantification. The model learns both static and dynamic semantic spatial relations from one or a few visual demonstrations and generalizes to novel contexts and configurations. We evaluate our approach across a set of spatial reasoning and robot manipulation tasks, demonstrating the model’s robust performance with varied object shapes, sizes, and spatial arrangements. Videos and source code are available at https://sites.google.com/view/spatial-relations.

Index terms

Semantic Scene Understanding; Learning Categories and Concepts; Representation Learning