ICRA 2026

Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects

Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Lan Xu, Jingyi Yu

AI summary

Kinematify accurately infers kinematic topologies and joint parameters for high-DoF articulated objects from single images or text without requiring motion data.

articulated object modeling, kinematic topology inference, joint parameter estimation, Monte Carlo Tree Search, SDF optimization, open-vocabulary 3D generation

Problem

Creating accurate articulated 3D models for high-DoF objects is labor-intensive and typically relies on motion sequences or strong priors, making automated reconstruction from static inputs difficult.

Approach

The framework generates segmented meshes from images or text, uses Monte Carlo Tree Search with structural rewards to infer the kinematic tree, and optimizes joint parameters via a distance-weighted, contact-aware SDF objective.
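The tree-search step can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes parts are integer indices with part 0 as the root, and it uses a hypothetical stand-in reward (the fraction of parent-child links that join parts marked as touching) in place of the paper's structural rewards. The names `legal_actions`, `reward`, `Node`, and `mcts` are illustrative only.

```python
import math
import random


def legal_actions(edges, n_parts):
    """All (child, parent) links that attach an unattached part to an attached one."""
    attached = {0} | {child for child, _ in edges}
    return [(c, p) for c in range(n_parts) if c not in attached
            for p in sorted(attached)]


def reward(edges, contacts):
    """Hypothetical structural reward: share of links whose two parts touch."""
    if not edges:
        return 0.0
    return sum(frozenset(e) in contacts for e in edges) / len(edges)


class Node:
    def __init__(self, edges, parent=None):
        self.edges = edges      # tuple of (child, parent) links chosen so far
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.total = 0.0


def mcts(n_parts, contacts, iterations=1000, c_ucb=1.4, seed=0):
    """Return the best complete tree seen during search and its reward."""
    rng = random.Random(seed)
    root = Node(())
    best_edges, best_reward = (), -1.0
    for _ in range(iterations):
        node = root
        # Selection and expansion: descend by UCB1 until a node has an untried action.
        while True:
            actions = legal_actions(node.edges, n_parts)
            if not actions:
                break  # every part is attached: terminal node
            untried = [a for a in actions if a not in node.children]
            if untried:
                action = rng.choice(untried)
                node.children[action] = Node(node.edges + (action,), node)
                node = node.children[action]
                break
            log_n = math.log(node.visits)
            node = max(node.children.values(),
                       key=lambda ch: ch.total / ch.visits
                       + c_ucb * math.sqrt(log_n / ch.visits))
        # Rollout: complete the tree with uniformly random attachments.
        edges = node.edges
        while True:
            actions = legal_actions(edges, n_parts)
            if not actions:
                break
            edges = edges + (rng.choice(actions),)
        r = reward(edges, contacts)
        if r > best_reward:
            best_edges, best_reward = edges, r
        # Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits += 1
            node.total += r
            node = node.parent
    return best_edges, best_reward
```

Calling `mcts(4, {frozenset((0, 1)), frozenset((1, 2)), frozenset((1, 3))})` searches over trees for four parts in which part 1 touches parts 0, 2, and 3. Returning the best complete tree seen, rather than the most-visited path, is a simplification chosen for this sketch.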

Key results

Open-vocabulary framework generating articulated objects from RGB images or text without motion data or training.
MCTS-based kinematic tree inference that resolves multi-branch ambiguities using structural priors.
DW-CAVL algorithm for accurate joint parameter estimation from static geometry via SDF-driven optimization.
State-of-the-art accuracy in joint axis orientation and tree topology recovery on everyday objects and high-DoF robots.
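The two accuracy notions in the last item can be made concrete with a small sketch. The summary does not give the paper's exact metric definitions, so the functions below are assumptions: a sign-invariant angular error between predicted and ground-truth joint axes, and the fraction of parts whose predicted parent matches the ground truth. The names `axis_angle_error_deg` and `parent_accuracy` are hypothetical.

```python
import math


def axis_angle_error_deg(pred, gt):
    """Angle in degrees between two joint axes, ignoring axis sign.

    An axis and its negation describe the same line of rotation, so the
    absolute value of the cosine is used (an assumption of this sketch).
    """
    dot = sum(a * b for a, b in zip(pred, gt))
    norm = math.sqrt(sum(a * a for a in pred)) * math.sqrt(sum(b * b for b in gt))
    cos = min(1.0, abs(dot) / norm)  # clamp against floating-point overshoot
    return math.degrees(math.acos(cos))


def parent_accuracy(pred_parent, gt_parent):
    """Fraction of non-root parts whose predicted parent matches the ground truth.

    Both arguments map a part index to its parent index.
    """
    hits = sum(pred_parent.get(child) == parent
               for child, parent in gt_parent.items())
    return hits / len(gt_parent)
```

For example, `axis_angle_error_deg((0, 0, 1), (0, 0, -1))` treats the two axes as identical, and `parent_accuracy({1: 0, 2: 0}, {1: 0, 2: 1})` counts one of two parents as correct.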

Why it matters

Enables automated, physics-aware robot self-modeling and environment interaction without costly motion capture or manual annotation.

Abstract

A deep understanding of kinematic structures is essential for robot motion and interaction with the environment. Such understanding is captured through articulated object models, which are essential for physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static 3D geometry. To achieve this, we combine Monte Carlo Tree Search (MCTS) for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid models. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.

https://sites.google.com/deemos.com/kinematify

1 Deemos Corporation, Wilmington, DE 19801, USA. Emails: {joel.wang, dingyou, zhangqx}@deemos.com.
2 ShanghaiTech University, Shanghai, China. Emails: {wangdy2024, zhangqx1, yujingyi, xulan1}@shanghaitech.edu.cn.
3 Contextual Robotics Institute, UC San Diego, La Jolla, CA 92093, USA. Emails: {jiw179, jih189}@ucsd.edu.
† Project lead: Qixuan Zhang (zhangqx@deemos.com).
* Corresponding authors: Jingyi Yu (yujingyi@shanghaitech.edu.cn), Lan Xu (xulan1@shanghaitech.edu.cn).

Index terms

Semantic Scene Understanding, Perception for Grasping and Manipulation, Mechanism Design