HeRO: Hierarchical 3D Semantic Representation for Pose-Aware Object Manipulation
Chongyang Xu, Shen Cheng, Li Haipeng, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu
AI summary
Problem
Purely geometric 3D manipulation policies lack explicit part-level semantics, causing failures in tasks requiring precise object part alignment. Existing semantic fields often produce holistic representations that blur fine-grained part distinctions.
Approach
HeRO fuses discriminative DINOv2 and coherent Stable Diffusion features via dense semantic lifting to create 3D global and local semantic fields, which condition a diffusion policy through a permutation-invariant hierarchical module.
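As a rough illustration of what fusing two dense feature maps and lifting them onto 3D points could look like, the NumPy sketch below normalizes and concatenates two per-pixel feature maps, then gives each 3D point the feature of the pixel it projects to under a pinhole camera. This is not HeRO's implementation. The function names, the concatenation-based fusion, the nearest-pixel lookup, and the intrinsics matrix `K` are assumptions made for illustration, and random arrays stand in for real DINOv2 and Stable Diffusion features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse_feature_maps(feat_a, feat_b):
    """Hypothetical fusion: per-pixel L2 normalization, then channel concatenation.

    feat_a: (H, W, Ca), feat_b: (H, W, Cb) -> (H, W, Ca + Cb)
    """
    return np.concatenate([l2_normalize(feat_a), l2_normalize(feat_b)], axis=-1)

def lift_to_points(points_cam, feat_map, K):
    """Assign each 3D point the feature of the pixel it projects to.

    points_cam: (N, 3) points in the camera frame.
    feat_map:   (H, W, C) dense 2D features.
    K:          (3, 3) pinhole intrinsics (assumed known).
    Returns (N, C) features and an (N,) boolean mask of points that land
    inside the image with positive depth; other points get zero features.
    """
    H, W, C = feat_map.shape
    uvw = points_cam @ K.T                      # homogeneous pixel coordinates
    z = uvw[:, 2]
    valid_depth = z > 0
    safe_z = np.where(valid_depth, z, 1.0)      # avoid dividing by non-positive depth
    u = np.rint(uvw[:, 0] / safe_z).astype(int) # column index (nearest pixel)
    v = np.rint(uvw[:, 1] / safe_z).astype(int) # row index (nearest pixel)
    in_view = valid_depth & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    feats = np.zeros((points_cam.shape[0], C), dtype=feat_map.dtype)
    feats[in_view] = feat_map[v[in_view], u[in_view]]
    return feats, in_view

# Stand-in data: random maps in place of real backbone features.
rng = np.random.default_rng(0)
fused = fuse_feature_maps(rng.normal(size=(4, 4, 3)), rng.normal(size=(4, 4, 2)))
K = np.array([[10.0, 0.0, 2.0], [0.0, 10.0, 2.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])  # second point falls outside the image
point_feats, visible = lift_to_points(points, fused, K)
```

Nearest-pixel lookup keeps the sketch short; bilinear sampling and multi-view aggregation are common alternatives that are not shown here.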
Key results
- Sets a new state of the art on pose-aware manipulation benchmarks
- Improves success on the Place Dual Shoes task by 12.3%
- Averages a 6.5% success gain across six challenging pose-aware tasks
- Validated in both simulation and real-world robotic experiments
Why it matters
It enables robots to reliably execute complex, part-aware manipulation tasks by bridging the gap between geometric precision and semantic understanding.
Abstract
Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe’s “toe” from “heel”). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on the global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state of the art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
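To make the permutation-invariance idea concrete, the following NumPy sketch encodes a global field and a set of local fields with a shared per-point layer followed by max pooling, then max-pools again across the local fields. It is a minimal sketch under stated assumptions, not the architecture from the paper: the PointNet-style encoder, the `hierarchical_condition` function, the `params` dictionary, and all dimensions are hypothetical.

```python
import numpy as np

def encode_field(point_feats, W, b):
    """PointNet-style encoder (an assumption, not the paper's module).

    Applies the same linear layer + ReLU to every point, then takes the
    max over points, so the result does not depend on point order.
    point_feats: (N, C), W: (C, D), b: (D,) -> (D,)
    """
    hidden = np.maximum(point_feats @ W + b, 0.0)
    return hidden.max(axis=0)

def hierarchical_condition(global_field, local_fields, params):
    """Build one conditioning vector from a global field and a set of local fields.

    global_field: (N, C) array of per-point features.
    local_fields: list of (N_i, C) arrays; sizes may differ between fields.
    params:       hypothetical dict with weights "Wg", "bg", "Wl", "bl".
    """
    global_code = encode_field(global_field, params["Wg"], params["bg"])
    local_codes = np.stack(
        [encode_field(f, params["Wl"], params["bl"]) for f in local_fields]
    )
    # Max over the set of local codes: the order of local fields does not matter.
    pooled_local = local_codes.max(axis=0)
    return np.concatenate([global_code, pooled_local])

# Stand-in data: random features in place of real semantic fields.
rng = np.random.default_rng(0)
params = {
    "Wg": rng.normal(size=(5, 8)), "bg": rng.normal(size=8),
    "Wl": rng.normal(size=(5, 8)), "bl": rng.normal(size=8),
}
global_field = rng.normal(size=(32, 5))
local_fields = [rng.normal(size=(n, 5)) for n in (6, 9, 4)]
cond = hierarchical_condition(global_field, local_fields, params)
cond_shuffled = hierarchical_condition(global_field, local_fields[::-1], params)
```

Because both pooling steps are symmetric functions, `cond` and `cond_shuffled` agree up to floating-point tolerance. In a diffusion policy, a vector like this would be passed to the denoiser as conditioning; that step is outside the scope of this sketch.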