ICRA 2026

ArthroCut: Autonomous Policy Learning for Robotic Bone Resection in Knee Arthroplasty

Xu Lu, Yiling Zhang, Wenquan Cheng, Longfei Ma, Fang Chen, Hongen Liao

AI summary

Fusing preoperative imaging with real-time multimodal surgical data lets a fine-tuned vision-language model reach an 86% average success rate across the six standard bone resections in bench-top trials on a knee prosthesis.

Autonomous surgery; Vision-language-action; Robotic bone resection; Multimodal learning; Knee arthroplasty; Constrained decoding

Problem

Current orthopedic robots primarily execute preplanned trajectories with limited real-time adaptation, failing to integrate heterogeneous intraoperative data for autonomous decision-making.

Approach

The framework fine-tunes a vision-language model to generate robot actions from two token families: preoperative anatomical tokens, and time-aligned intraoperative tokens covering video, tracking, and robot state. Decoding is constrained by a safe action grammar.
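As a rough illustration of what "time-aligned" can mean for heterogeneous streams, the sketch below pairs each video frame with the nearest tracking and robot-state sample by timestamp. It is a minimal sketch and not the paper's pipeline: the function names, the `max_skew` tolerance, and the toy data are all assumptions made for this example.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the entry in sorted `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose the closer neighbour; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_streams(frame_times, tracking, robot_state, max_skew=0.02):
    """Pair each frame with the nearest tracking and robot-state samples.

    `tracking` and `robot_state` are lists of (timestamp, payload) tuples
    sorted by timestamp. Frames whose nearest sample in either stream is
    more than `max_skew` seconds away are dropped (hypothetical policy).
    """
    track_t = [t for t, _ in tracking]
    state_t = [t for t, _ in robot_state]
    aligned = []
    for ft in frame_times:
        ti = nearest_index(track_t, ft)
        si = nearest_index(state_t, ft)
        if abs(track_t[ti] - ft) > max_skew or abs(state_t[si] - ft) > max_skew:
            continue  # no sufficiently synchronous sample for this frame
        aligned.append(
            {"t": ft, "tracking": tracking[ti][1], "robot_state": robot_state[si][1]}
        )
    return aligned


# Toy data: the third frame has no tracking sample within 20 ms.
frames = [0.00, 0.10, 0.20]
tracking = [(0.001, "pose_a"), (0.095, "pose_b"), (0.300, "pose_c")]
robot_state = [(0.0, "q0"), (0.1, "q1"), (0.2, "q2")]
samples = align_streams(frames, tracking, robot_state)
```

With this toy data, the first two frames are kept and the third is discarded because its nearest tracking sample is 100 ms away. A real system would more likely interpolate poses or rely on hardware synchronization than drop frames.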

Key results

First time-synchronized multimodal dataset for autonomous knee arthroplasty
86% average success rate across six standard bone resections
11.95% reduction in execution time compared to vanilla baselines
Ablations identify Time-Aligned Surgical Tokens (TAST) as the main driver of reliability and Preoperative Imaging Tokens (PIT) as the source of anatomical grounding

Why it matters

Provides a scalable, data-driven pathway toward safer, context-aware autonomy in orthopedic robotic surgery.

Abstract

Despite rapid commercialization of surgical robots, their autonomy and real-time decision-making remain limited in practice. To address this gap, we propose ArthroCut, an autonomous policy learning framework that upgrades knee arthroplasty robots from assistive execution to context-aware action generation. ArthroCut fine-tunes a Qwen-VL backbone on a self-built, time-synchronized multimodal dataset from 21 complete cases (23,205 RGB-D pairs), integrating preoperative CT/MR, intraoperative NDI tracking of bones and end effector, RGB-D surgical video, robot state, and textual intent. The method operates on two complementary token families: Preoperative Imaging Tokens (PIT), which encode patient-specific anatomy and planned resection planes, and Time-Aligned Surgical Tokens (TAST), which fuse real-time visual, geometric, and kinematic evidence. It emits an interpretable action grammar under grammar/safety-constrained decoding. In bench-top experiments on a knee prosthesis across seven trials, ArthroCut achieves an average success rate of 86% over the six standard resections, significantly outperforming strong baselines trained under the same protocol. Ablations show that TAST is the principal driver of reliability while PIT provides essential anatomical grounding, and their combination yields the most stable multi-plane execution. These results indicate that aligning preoperative geometry with time-aligned intraoperative perception and translating that alignment into tokenized, constrained actions is an effective path toward robust, interpretable autonomy in orthopedic robotic surgery.
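To make the idea of grammar-constrained decoding concrete, here is a minimal sketch in which a finite-state grammar restricts the tokens a greedy decoder may emit at each step. The grammar, the token names, and the stub scoring function are hypothetical and chosen only for illustration; the abstract does not specify ArthroCut's actual action grammar or safety rules.

```python
import math

# Hypothetical toy grammar: state -> {allowed token: next state}.
GRAMMAR = {
    "START": {"MOVE_TO": "PLANE", "STOP": "END"},
    "PLANE": {"distal_femur": "VERB", "proximal_tibia": "VERB"},
    "VERB": {"CUT": "DEPTH", "RETRACT": "END"},
    "DEPTH": {"1mm": "END", "2mm": "END"},
}


def constrained_greedy_decode(score_fn, grammar, start="START", end="END", max_len=8):
    """Greedy decoding in which only grammar-allowed tokens are eligible.

    `score_fn(prefix)` stands in for a model's next-token scores and returns
    a dict mapping token -> score. Tokens the grammar does not allow in the
    current state are never considered, however high their score.
    """
    state, output = start, []
    while state != end and len(output) < max_len:
        allowed = grammar[state]
        scores = score_fn(output)
        # Tokens missing from `scores` are treated as having score -inf.
        best = max(allowed, key=lambda tok: scores.get(tok, -math.inf))
        output.append(best)
        state = allowed[best]
    return output


def stub_scores(prefix):
    """Stand-in for model logits; it always ranks "CUT" highest."""
    return {
        "CUT": 5.0, "MOVE_TO": 1.0, "STOP": 0.5,
        "distal_femur": 2.0, "proximal_tibia": 1.5,
        "RETRACT": 0.1, "1mm": 0.3, "2mm": 0.9,
    }


action = constrained_greedy_decode(stub_scores, GRAMMAR)
# action == ["MOVE_TO", "distal_femur", "CUT", "2mm"]
```

An unconstrained argmax over `stub_scores` would emit "CUT" at every step. The grammar instead forces a well-formed sequence that names a target plane before cutting. Safety limits such as a maximum depth could be expressed the same way, by removing disallowed tokens from the relevant state.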

Index terms

Medical Robots and Systems; AI-Based Methods; Deep Learning Methods