ICRA 2026

Attention-Based Markerless Pose Estimation of the Assistant-Port Trocar in Robot-Assisted Surgery with a Head-Mounted Display

Nicholas Greene, Aoqi Long, Peter Kazanzides

AI summary

A markerless, real-time deep learning method accurately tracks surgical trocar pose via an AR head-mounted display, enabling reliable instrument insertion guidance without disrupting sterile workflows.

Markerless tracking; 6-DoF pose estimation; Augmented reality; Surgical robotics; Head-mounted display; Deep learning

Problem

Assistant surgeons in robot-assisted surgery lack immersive visual feedback and must insert instruments under suboptimal viewing conditions. Previous augmented reality solutions required fiducial markers attached to the tools, which disrupt sterile surgical workflows and complicate procedures.

Approach

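The method uses a U-Net-based neural network with cross-attention and Atrous Spatial Pyramid Pooling (ASPP) to predict 2D keypoints on the trocar from HMD camera images. A Perspective-n-Point (PnP) solver converts these keypoints into a 6-DoF trocar pose. From that pose, a multi-view method recovers the 4-DoF shaft line of the handheld instrument, which allows correction for misalignment between the trocar and the instrument shaft.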
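To make the keypoint-to-pose step concrete, here is a minimal sketch of the reprojection error that a PnP solver drives toward zero. It assumes a plain pinhole camera without lens distortion; the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative and are not taken from the paper.

```python
import math

def project(point_obj, R, t, fx, fy, cx, cy):
    """Project a 3D point from the trocar (object) frame to pixel coordinates.

    R is a 3x3 rotation (list of rows) and t a translation, together forming
    the candidate 6-DoF pose. Pinhole model, no distortion (an assumption).
    """
    # Camera-frame coordinates: X_c = R * X_o + t
    xc = [sum(R[i][j] * point_obj[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)

def mean_reprojection_error(model_points, keypoints_2d, R, t, fx, fy, cx, cy):
    """Mean pixel distance between predicted keypoints and projected model points."""
    errors = []
    for p, (u, v) in zip(model_points, keypoints_2d):
        pu, pv = project(p, R, t, fx, fy, cx, cy)
        errors.append(math.hypot(pu - u, pv - v))
    return sum(errors) / len(errors)
```

A PnP solver searches for the `R` and `t` that minimize this quantity over all keypoints. In practice one would typically call an existing solver rather than implement the search by hand.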

Key results

Real-time (66 Hz) markerless 6-DoF trocar tracking via HMD
U-Net architecture with cross-attention and ASPP for robust keypoint prediction
Three-stage synthetic-to-real training strategy overcoming imperfect annotations
Phantom validation yielding ~5.5 mm positional and ~1.9° angular error
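The positional and angular errors listed above can be read as a distance between positions and an angle between axes. The sketch below shows one common way to compute such metrics. The paper's exact definitions are not given on this page, so the choice of Euclidean distance and of the angle between axis directions is an assumption, and the function names are illustrative.

```python
import math

def position_error_mm(p_est, p_ref):
    """Euclidean distance between estimated and reference positions (both in mm)."""
    return math.dist(p_est, p_ref)

def axis_angle_error_deg(a_est, a_ref):
    """Angle in degrees between the estimated and reference axis directions."""
    dot = sum(x * y for x, y in zip(a_est, a_ref))
    norm = math.hypot(*a_est) * math.hypot(*a_ref)
    if norm == 0:
        raise ValueError("axis vectors must be non-zero")
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))
```

For example, `position_error_mm((0, 0, 0), (3, 4, 0))` evaluates to `5.0`, and two perpendicular axes give an angular error of 90 degrees.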

Why it matters

Offers a marker-free augmented reality guidance approach, validated on a phantom, that preserves sterile surgical workflows while improving the assistant surgeon's spatial awareness.

Abstract

In robotic-assisted minimally invasive surgery, an assistant surgeon stands at the bedside to insert and manipulate instruments while the primary surgeon operates the robot. Augmented reality (AR) head-mounted displays (HMDs) may improve the assistant’s spatial awareness, but require tracking of surgical tools (both robotic and hand-held) for accurate overlay. In this work, we propose a markerless method to estimate the 6-DoF trocar pose for the assistant port, which can convey the insertion trajectory of any handheld instrument to the assistant surgeon. The method is based on a deep U-Net architecture with cross-attention and Atrous Spatial Pyramid Pooling (ASPP) to predict 2D keypoints on the trocar, which are then used by a Perspective-n-Point (PnP) method to estimate the trocar’s pose. From the predicted trocar pose, we can also directly find the 4-DoF shaft-line of the handheld instrument using a multi-view method; this enables correction for misalignment of the trocar and instrument shaft. The trocar tracking runs in real-time (66 Hz) and can be integrated into an AR-assisted workflow. Experimental results with a phantom show an accuracy of ∼5.5 mm and angle error of ∼1.9 degrees, which is sufficient to guide instrument insertion into the endoscope field of view.
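The abstract does not spell out the multi-view method used for the shaft line. As a generic illustration only, and not the authors' algorithm, the sketch below shows one textbook way to recover a 3D line from two views. Each view that sees the shaft as a 2D image line defines a plane through that camera's center, and the 3D line is the intersection of the two planes. The function name and the plane representation (`n . x = d`, expressed in a common frame) are assumptions made for this example.

```python
def line_from_two_planes(n1, d1, n2, d2):
    """Intersect planes n1.x = d1 and n2.x = d2; return (point, unit_direction)."""
    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    # The line direction is perpendicular to both plane normals.
    direction = cross(n1, n2)
    sq = sum(c * c for c in direction)
    if sq < 1e-12:
        raise ValueError("planes are parallel; the line is not determined")

    # Point on both planes: p = (d1 * (n2 x dir) + d2 * (dir x n1)) / |dir|^2
    a = cross(n2, direction)
    b = cross(direction, n1)
    point = tuple((d1 * a[i] + d2 * b[i]) / sq for i in range(3))

    norm = sq ** 0.5
    unit = tuple(c / norm for c in direction)
    return point, unit
```

An infinite 3D line has four degrees of freedom because sliding the point along the line and rescaling the direction leave the line unchanged, which matches the "4-DoF shaft-line" wording in the abstract.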

Index terms

Surgical Robotics: Laparoscopy; Computer Vision for Medical Robotics; Visual Tracking