ICRA 2026

PokeNet: Learning Kinematic Models of Articulated Objects from Human Observations

Anmol Gupta, Weiwei Gu, Omkar Deepak Patil, Jun Ki Lee, Nakul Gopalan

AI summary

PokeNet accurately predicts joint parameters, states, and manipulation order for multi-DoF objects from a single human demonstration without prior object knowledge.

Articulation modeling, Kinematic estimation, Human demonstration, Set prediction, Robot manipulation, Point cloud processing

Problem

Existing articulation modeling methods rely on strong priors, require extensive multi-view data, or fail to recover occluded joints and manipulation sequences in multi-DoF objects.

Approach

PokeNet processes sequential single-view point clouds from a human demonstration using a transformer-based set prediction framework to jointly estimate joint types, parameters, states, and their operational order.
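To make "set prediction" concrete: such frameworks usually score an unordered set of predictions by first matching each ground-truth item to one prediction at minimum total cost. The sketch below illustrates that matching step for joints. It is not PokeNet's implementation: the summary does not state the matching procedure or cost, so the cost terms (axis angle plus a fixed `type_penalty`), the dictionary layout, and the function names are all assumptions, and brute-force search stands in for the Hungarian algorithm that would be used at scale.

```python
import math
from itertools import permutations


def axis_angle_deg(a, b):
    """Angle in degrees between two 3D axes, ignoring sign (an axis and its negation coincide)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cosine = min(1.0, abs(dot) / (norm_a * norm_b))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def match_cost(pred, gt, type_penalty=90.0):
    """Hypothetical pairwise cost: axis angle plus a fixed penalty for a wrong joint type."""
    cost = axis_angle_deg(pred["axis"], gt["axis"])
    if pred["type"] != gt["type"]:
        cost += type_penalty
    return cost


def match_joints(preds, gts):
    """Minimum-cost one-to-one assignment of ground-truth joints to predictions.

    Brute force over permutations, which is fine for the handful of joints
    on a typical object. Returns (pairs, total_cost), where each pair is
    (prediction index, ground-truth index).
    """
    if len(preds) < len(gts):
        raise ValueError("need at least as many predictions as ground-truth joints")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(p, g) for g, p in enumerate(best_perm)], best_cost


# Three predicted joint slots, two ground-truth joints (made-up values).
preds = [
    {"type": "prismatic", "axis": (1.0, 0.0, 0.0)},
    {"type": "revolute", "axis": (0.0, 0.0, 1.0)},
    {"type": "revolute", "axis": (0.0, 1.0, 0.0)},
]
gts = [
    {"type": "revolute", "axis": (0.0, 0.0, -1.0)},
    {"type": "prismatic", "axis": (1.0, 0.0, 0.0)},
]
pairs, total = match_joints(preds, gts)
print(pairs, total)
```

Predictions left unmatched (here, index 2) would typically be trained toward a "no joint" class, which is how a set predictor can handle objects with an unknown number of joints.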

Key results

Improves joint axis and state estimation accuracy by up to 25% in simulation and 30% on real-world data
Successfully infers manipulation order for multi-DoF objects without prior structural assumptions
Generalizes to unseen object categories and scales in both simulated and real-world environments
Releases the largest annotated real-world dataset of 3D articulated object interactions (5,500 sequences)

Why it matters

Enables robots to safely manipulate complex, unseen articulated objects using minimal human demonstration data.

Abstract

Articulation modeling enables robots to learn the joint parameters of articulated objects for effective manipulation; these parameters can then be used downstream for skill learning or planning. Existing approaches often rely on prior knowledge about the objects, such as the number or type of joints. Some of these approaches also fail to recover occluded joints that are only revealed during interaction. Others require large numbers of multi-view images for every object, which is impractical in real-world settings. Furthermore, prior works neglect the order of manipulations, which is essential for many multi-DoF objects where one joint must be operated before another, such as a dishwasher. We introduce PokeNet, an end-to-end framework that estimates articulation models from a single human demonstration without prior object knowledge. Given a sequence of point cloud observations of a human manipulating an unknown object, PokeNet predicts joint parameters, infers manipulation order, and tracks joint states over time. PokeNet outperforms existing state-of-the-art methods, improving joint axis and state estimation accuracy by an average of over 27% across diverse objects, including novel and unseen categories. We demonstrate these gains in both simulation and real-world environments. Code and dataset are available on our webpage.
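As an illustration of what "manipulation order" means in terms of tracked joint states, the sketch below derives an order from per-joint state trajectories by sorting joints on the first frame in which each one moves. This is only a hand-written heuristic for intuition; PokeNet predicts the order with a learned model, and the function name, the `threshold` value, and the example trajectories are assumptions introduced here.

```python
def manipulation_order(states, threshold=0.05):
    """Order joints by the first frame in which they move.

    states maps a joint name to its per-frame state (radians for revolute,
    metres for prismatic). A joint counts as moved once its state differs
    from its initial value by more than threshold. Joints that never move
    are left out of the result.
    """
    onsets = {}
    for name, trajectory in states.items():
        if not trajectory:
            continue
        start = trajectory[0]
        for frame, value in enumerate(trajectory):
            if abs(value - start) > threshold:
                onsets[name] = frame
                break
    # Ties on onset frame are broken alphabetically to keep the output deterministic.
    return sorted(onsets, key=lambda name: (onsets[name], name))


# Made-up dishwasher demonstration: the door opens before the rack slides out.
states = {
    "rack": [0.0, 0.0, 0.0, 0.0, 0.1, 0.3],
    "door": [0.0, 0.2, 0.8, 1.5, 1.5, 1.5],
    "knob": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
print(manipulation_order(states))
```

In the dishwasher example the door must be opened before the rack can move, so a correct order places "door" ahead of "rack".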

Index terms

Perception for Grasping and Manipulation; Deep Learning for Visual Perception; Object Detection, Segmentation and Categorization