ICRA 2026

HeRO: Hierarchical 3D Semantic Representation for Pose-Aware Object Manipulation

Chongyang Xu, Shen Cheng, Li Haipeng, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

AI summary

HeRO sets a new state of the art in pose-aware robotic manipulation by fusing geometric and semantic features into a hierarchical 3D representation that guides a diffusion policy.

pose-aware manipulation, diffusion policy, 3D semantic fields, hierarchical conditioning, dense semantic lifting, robotic imitation learning

Problem

Purely geometric 3D manipulation policies lack explicit part-level semantics, causing failures in tasks requiring precise object part alignment. Existing semantic fields often produce holistic representations that blur fine-grained part distinctions.

Approach

HeRO fuses discriminative DINOv2 and coherent Stable Diffusion features via dense semantic lifting to create 3D global and local semantic fields, which condition a diffusion policy through a permutation-invariant hierarchical module.
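The fusion and lifting step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fuse_features`, `lift_to_points`), the normalize-then-concatenate fusion rule, the `weight_a` parameter, and the nearest-pixel sampling are all assumptions chosen for illustration. The sketch takes two per-pixel feature maps (standing in for DINOv2 and Stable Diffusion features), fuses them, and assigns each 3D point the fused feature of the pixel it projects to.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_features(feat_a, feat_b, weight_a=0.5):
    """Fuse two per-pixel feature maps of shape (H, W, C_a) and (H, W, C_b).

    Hypothetical fusion rule: normalize each map per pixel, scale by the
    square root of its weight, and concatenate along the channel axis, so
    the fused vector at each pixel again has approximately unit norm.
    """
    if feat_a.shape[:2] != feat_b.shape[:2]:
        raise ValueError("feature maps must share the same spatial size")
    a = np.sqrt(weight_a) * l2_normalize(feat_a)
    b = np.sqrt(1.0 - weight_a) * l2_normalize(feat_b)
    return np.concatenate([a, b], axis=-1)

def lift_to_points(feat_map, points_cam, K):
    """Give each 3D point (camera frame, z > 0) the feature of its pixel.

    feat_map:   (H, W, C) fused 2D features
    points_cam: (N, 3) points in the camera frame
    K:          (3, 3) pinhole intrinsics matrix
    """
    h, w, _ = feat_map.shape
    uvw = points_cam @ K.T                       # homogeneous pixel coords
    u = np.rint(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.rint(uvw[:, 1] / uvw[:, 2]).astype(int)
    u = np.clip(u, 0, w - 1)                     # keep indices inside the image
    v = np.clip(v, 0, h - 1)
    return feat_map[v, u]                        # (N, C_a + C_b)
```

Nearest-pixel sampling keeps the sketch short; bilinear interpolation and multi-view aggregation would be natural refinements, but whether HeRO uses them is not stated in this summary.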

Key results

Sets a new state of the art on pose-aware manipulation benchmarks
Improves success on Place Dual Shoes by 12.3%
Averages a 6.5% success gain across six challenging pose-aware tasks
Validated in both simulation and real-world robotic experiments

Why it matters

It enables robots to reliably execute complex, part-aware manipulation tasks by bridging the gap between geometric precision and semantic understanding.

Abstract

Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe’s “toe” from “heel”). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on the global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state of the art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
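To make the permutation-invariance property concrete, here is a minimal Deep Sets-style sketch of conditioning on one global feature and an unordered set of local features. The abstract does not specify the actual architecture, so everything here is an assumption for illustration: the names `shared_mlp` and `hierarchical_condition`, the single linear layer with ReLU, and the choice of max pooling. The point it shows is that a shared encoder followed by a symmetric pooling operation yields the same conditioning vector regardless of the order in which the local fields are listed.

```python
import numpy as np

def shared_mlp(x, w, b):
    """Apply the same linear layer + ReLU to every row of x."""
    return np.maximum(x @ w + b, 0.0)

def hierarchical_condition(global_feat, local_feats, w, b):
    """Build a conditioning vector from global and local features.

    global_feat: (D_g,)   feature summarizing the global field
    local_feats: (K, D_l) one feature per local field, in any order
    w, b:        (D_l, H) and (H,) weights shared across local fields
    Returns a vector of shape (D_g + H,).
    """
    encoded = shared_mlp(local_feats, w, b)   # (K, H), encoded row by row
    pooled = encoded.max(axis=0)              # symmetric over the K rows
    return np.concatenate([global_feat, pooled])
```

Because the maximum over rows ignores row order, shuffling `local_feats` leaves the output unchanged, which is one simple way to avoid the order-sensitive bias the abstract mentions. Mean or attention-based pooling would preserve the same property.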

Index terms

Imitation Learning, Representation Learning, Dual Arm Manipulation