ICRA 2026

Fine-Grained Classification for Depth Estimation from Monocular Microscopy for Robotic Micromanipulation of Motile Cells

Yu Sun∗ and Zhuoran Zhang∗

AI summary

A fine-grained classification network with attention and feature augmentation enables real-time, accurate depth estimation from monocular microscopy, achieving 92% success in robotic sperm aspiration.

Monocular depth estimation, fine-grained classification, robotic micromanipulation, motile cell tracking, attention fusion, sperm aspiration

Problem

Obtaining accurate Z-axis depth feedback for motile cells under monocular microscopy is challenging. Traditional depth-from-focus methods need a time-consuming focus search, and depth-from-defocus methods rely on defocus models that become inaccurate when cell morphology changes rapidly and blur differences across focal planes are subtle.

Approach

The authors reformulate depth estimation as a fine-grained multi-class classification task and introduce a Fine-Grained Attention Fusion Module with channel-based feature augmentation and a weighted loss function to extract subtle depth-related features from moving cell images.
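The weighted loss is described only as penalizing errors in proportion to depth difference, so its exact form is not given here. The sketch below shows one common way to get that behavior, assuming depth classes are ordered bins indexed 0..K-1: standard cross-entropy plus the expected class-index distance from the target. The function names and the `alpha` weight are hypothetical, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def depth_weighted_loss(logits, target, alpha=1.0):
    """Cross-entropy plus a distance-proportional penalty (assumed form).

    logits: raw scores, one per ordered depth class.
    target: index of the true depth class.
    alpha:  hypothetical weight on the distance term.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    # Probability mass on class k is penalized by |k - target|,
    # so confident predictions far from the true depth cost more.
    distance_penalty = sum(p * abs(k - target) for k, p in enumerate(probs))
    return cross_entropy + alpha * distance_penalty


# Same confidence in a wrong class, one bin away vs. two bins away.
near_miss = depth_weighted_loss([0.0, 0.0, 1.0, 4.0, 0.0], target=2)
far_miss = depth_weighted_loss([4.0, 0.0, 1.0, 0.0, 0.0], target=2)
print(near_miss < far_miss)
```

With plain cross-entropy both predictions would score the same, because the probability assigned to the true class is identical. The distance term is what separates a near miss from a far one.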

Key results

83.52% top-1 and 96.88% top-3 depth classification accuracy
Real-time inference at 90 frames per second
92% success rate in robotic live motile sperm aspiration
Effective real-time 3D pipette localization guidance
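For readers less familiar with the top-1 and top-3 metrics above, the following sketch shows how top-k accuracy is typically computed for a classifier. The scores and labels are made-up illustration data, not results from the paper, and `top_k_accuracy` is a hypothetical helper name.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes.

    scores: list of per-sample score lists (one score per depth class).
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Illustrative data only: three samples, three depth classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=3))
```

A top-3 hit means only that the true depth class was among the three highest-ranked classes. It does not by itself guarantee that the top prediction was an adjacent depth bin.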

Why it matters

Enables precise, real-time 3D visual feedback for robotic manipulation of motile cells, advancing clinical IVF procedures and bio-hybrid micro-robot research.

Abstract

Manipulation of motile cells is crucial for biological research and clinical applications. However, obtaining Z-axis visual feedback under monocular microscopy remains a challenge for robotic micromanipulation. Traditional depth-from-focus and depth-from-defocus methods fail to handle motile cells due to time-consuming focus search or inaccurate defocus modeling. This paper addresses these limitations by reformulating depth estimation as a fine-grained multi-class depth classification problem that exploits the shallow depth-of-field characteristic of microscopy. We propose a Fine-Grained Attention Fusion Module (FGAF-Module) that combines multi-scale grouped convolution for extracting subtle depth-related features with attention mechanisms to focus on discriminative regions in cell images. Additionally, channel-based feature augmentation methods, including CrossNorm and SelfNorm, enhance fine-grained feature discrimination while improving model generalization to handle morphological variations during cell movement. A weighted loss function further guides the model to distinguish between adjacent depth categories by penalizing errors proportionally to depth differences. In network evaluation, the FGAF-Module-enhanced network achieved 83.52% top-1 classification accuracy and 96.88% top-3 classification accuracy while maintaining real-time performance at 90 frames per second. To demonstrate the capability of our approach in providing visual feedback for robotic manipulation of motile cells, the trained depth estimation model was integrated into a robotic sperm aspiration system. The model provided real-time visual depth feedback to guide 3D pipette localization during sperm aspiration procedures, achieving a 92% success rate for live motile sperm aspiration. These results validate the effectiveness of fine-grained classification for monocular depth estimation in micromanipulation applications.
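The abstract names CrossNorm as one of the channel-based augmentations but does not describe how it is applied in this network. As a rough illustration of the general idea (exchanging channel-wise mean and standard deviation between two feature maps), here is a minimal sketch in plain Python. Representing a feature map as a list of flattened channels, the helper names, and the `eps` value are assumptions made for this example.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Mean and (eps-stabilized) standard deviation of one flattened channel."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)


def cross_norm(feat_a, feat_b):
    """Swap channel-wise statistics between two feature maps.

    feat_a, feat_b: lists of channels, each channel a flat list of floats.
    Each channel is standardized with its own statistics, then rescaled
    with the statistics of the matching channel in the other feature map.
    """
    out_a, out_b = [], []
    for ch_a, ch_b in zip(feat_a, feat_b):
        mean_a, std_a = channel_stats(ch_a)
        mean_b, std_b = channel_stats(ch_b)
        out_a.append([(v - mean_a) / std_a * std_b + mean_b for v in ch_a])
        out_b.append([(v - mean_b) / std_b * std_a + mean_a for v in ch_b])
    return out_a, out_b


# Two single-channel feature maps with very different statistics.
a = [[1.0, 2.0, 3.0, 4.0]]
b = [[10.0, 20.0, 30.0, 40.0]]
new_a, new_b = cross_norm(a, b)
print(sum(new_a[0]) / len(new_a[0]), sum(new_b[0]) / len(new_b[0]))
```

After the swap, each feature map keeps its spatial pattern but takes on the other map's channel statistics, which is the sense in which this kind of augmentation varies appearance while preserving structure.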

Index terms

Biological Cell Manipulation; Automation at Micro-Nano Scales; Deep Learning for Visual Perception