Learning Unified Probabilistic Spatial Relation Representation from Visual Demonstrations
Paul Emil Hannuschka, Jianfeng Gao, Tamim Asfour
AI summary
Problem
Existing spatial relation models oversimplify object geometry and lack generative capabilities, while vision-language models require massive datasets and struggle with precise 3D metric reasoning for robotic manipulation.
Approach
The authors model spatial relations using a rotating spotlight metaphor that computes directional, distance, and overlap distributions from object point clouds, allowing the system to learn and generalize spatial constraints from just one or a few demonstrations.
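To make the three kinds of distributions concrete, the following is a minimal sketch under stated assumptions, not the authors' spotlight computation. It works on small 2D point sets instead of 3D point clouds. It scores direction with a simple circular kernel over pairwise angles, summarizes distance with a single Gaussian (mean and standard deviation), and measures overlap as the fraction of target points inside the reference's axis-aligned bounding box. All function names, the `kappa` parameter, and the example data are hypothetical.

```python
import math
from statistics import fmean, pstdev


def pairwise_angles_and_distances(reference, target):
    """Angle (radians) and Euclidean distance from every reference point to every target point."""
    angles, dists = [], []
    for rx, ry in reference:
        for tx, ty in target:
            dx, dy = tx - rx, ty - ry
            angles.append(math.atan2(dy, dx))
            dists.append(math.hypot(dx, dy))
    return angles, dists


def direction_score(angles, query, kappa=4.0):
    """Unnormalized circular kernel score in (0, 1] for the direction `query` (radians).

    Assumption: a von Mises-shaped kernel, exp(kappa * (cos(d) - 1)), averaged over samples.
    """
    return fmean([math.exp(kappa * (math.cos(query - a) - 1.0)) for a in angles])


def distance_gaussian(dists):
    """Mean and population standard deviation of the pairwise distances."""
    return fmean(dists), pstdev(dists)


def overlap_fraction(reference, target):
    """Fraction of target points inside the reference's axis-aligned bounding box."""
    xs = [p[0] for p in reference]
    ys = [p[1] for p in reference]
    inside = [
        min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)
        for x, y in target
    ]
    return sum(inside) / len(inside)


# Hypothetical example: a unit square and a copy shifted 3 units to the right.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
target = [(x + 3.0, y) for x, y in reference]

angles, dists = pairwise_angles_and_distances(reference, target)
right_score = direction_score(angles, 0.0)      # direction "to the right"
left_score = direction_score(angles, math.pi)   # direction "to the left"
mean_dist, std_dist = distance_gaussian(dists)
overlap = overlap_fraction(reference, target)
```

In this toy setup the score for "right" is larger than the score for "left", and the overlap is zero because the two squares are disjoint. A learned model in the paper's sense would fit such distributions from demonstrations; this sketch only computes them for one configuration.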
Key results
- Unified probabilistic representation encoding directional, distance, and topological relations using kernel density estimates (KDE), Gaussian mixture models (GMM), and Gaussian distributions
- Framework learns static and dynamic spatial relations from one or a few visual demonstrations
- Outperforms baseline geometric models on generative manipulation and discriminative reasoning tasks
- Competitive results against large-scale vision-language models while requiring minimal training data
Why it matters
Provides robots with a data-efficient, uncertainty-aware method to interpret and synthesize precise 3D spatial configurations, advancing autonomous manipulation and human-robot interaction.
Abstract
The ability to interpret and reason about spatial relations is fundamental for robotic manipulation tasks. For instance, a robot must understand that “inside” requires different geometric constraints than “touching”, and “closer” involves dynamic changes in distance relationships. Despite progress in modeling spatial relations, existing approaches face two critical limitations: they either oversimplify object geometry to points or bounding boxes, or they lack generative capabilities for synthesizing new spatial configurations. This paper introduces a novel generative and probabilistic model that jointly encodes object sizes, distances, and orientations within a unified representation, which captures distance-based, directional, and topological spatial relations while providing explicit uncertainty quantification. The model learns both static and dynamic semantic spatial relations from one or a few visual demonstrations and generalizes to novel contexts and configurations. We evaluate our approach across a set of spatial reasoning and robot manipulation tasks, demonstrating the model’s robust performance with varied object shapes, sizes, and spatial arrangements. Videos and source code are available at https://sites.google.com/view/spatial-relations.