ICRA 2026

MonoKey: Keypoint-Based Monocular 3D Object Detection Using Prior Guidance for Occlusion Robustness

Yeon Woo Cho, Jung Woo Cheon, Jae Hyun Yoon, Seok Bong Yoo

AI summary

MonoKey achieves state-of-the-art monocular 3D detection under occlusion by reconstructing missing object geometry via keypoint priors and frequency-based depth estimation.

Monocular 3D detection, occlusion robustness, keypoint reconstruction, frequency-domain depth, autonomous driving, graph refinement

Problem

Monocular 3D object detection relies heavily on appearance cues that degrade under occlusion, making accurate depth estimation and object localization challenging.

Approach

MonoKey estimates visible 2D keypoints and reconstructs occluded ones using a prior-guided autoencoder, then fuses these cues with frequency-based depth features and refines bounding boxes via a relational graph.
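
The abstract names fast Fourier convolution as the source of the frequency-based depth features. The sketch below illustrates the general idea on a 1D signal: a global branch that edits the spectrum, so every output sample depends on every input sample, is summed with a local branch that is an ordinary small-kernel convolution. This is a simplified illustration, not MonoKey's implementation. The function names, the per-bin scaling (standing in for the learned frequency-domain convolution), and the 1D setting are all assumptions.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform, O(n^2), for illustration only."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def global_branch(signal, freq_weights):
    """Scale each frequency bin, then transform back (hypothetical helper).

    The real part is kept; the result is exactly real only when
    freq_weights[k] == freq_weights[n - k] for all k > 0.
    """
    spectrum = dft(signal)
    scaled = [w * s for w, s in zip(freq_weights, spectrum)]
    return [v.real for v in idft(scaled)]

def local_branch(signal, kernel):
    """Zero-padded sliding-window filter with an odd-length kernel."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:  # samples outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out

def global_local(signal, freq_weights, kernel):
    """Sum of the global (spectral) and local (spatial) branches."""
    g = global_branch(signal, freq_weights)
    l = local_branch(signal, kernel)
    return [a + b for a, b in zip(g, l)]
```

With all frequency weights set to 1 and the kernel `[0, 1, 0]`, both branches reduce to the identity, so `global_local` returns twice the input up to floating-point rounding. Keeping only the zero-frequency bin makes the global branch return the signal mean at every position, which shows how a single spectral weight affects the whole output.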

Key results

State-of-the-art KITTI car detection under moderate and hard occlusion
Superior performance in adverse weather (snow, rain, fog) on CADC and Dense datasets
Robust reconstruction of occluded object geometry using symmetry and yaw priors
Mitigation of spectral bias in depth estimation via frequency-domain processing
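
To make the symmetry and yaw priors concrete, the following sketch mirrors a visible keypoint across an object's longitudinal axis to estimate its occluded partner. It is a geometric illustration only: MonoKey is described as using a prior-guided autoencoder on 2D image keypoints, whereas this example works in the ground plane with a closed-form reflection. The function names, the keypoint names, and the convention that the heading axis points along `(cos yaw, sin yaw)` in the `(x, z)` plane are all assumptions.

```python
import math

def reflect_across_heading(point, center, yaw):
    """Mirror a ground-plane point across the object's longitudinal axis.

    point, center: (x, z) tuples in the ground plane.
    yaw: heading in radians; the axis direction is assumed to be
         (cos yaw, sin yaw) in the (x, z) plane.
    """
    dx, dz = math.cos(yaw), math.sin(yaw)
    vx, vz = point[0] - center[0], point[1] - center[1]
    along = vx * dx + vz * dz  # component along the heading axis
    # Keep the along-axis component and flip the across-axis component.
    rx = 2.0 * along * dx - vx
    rz = 2.0 * along * dz - vz
    return (center[0] + rx, center[1] + rz)

def fill_occluded(keypoints, pairs, center, yaw):
    """Fill occluded keypoints (None) from their visible symmetric partners.

    keypoints: dict mapping name -> (x, z), or None when occluded.
    pairs: list of (name_a, name_b) symmetric partners (hypothetical naming).
    Pairs where both keypoints are occluded are left unchanged.
    """
    filled = dict(keypoints)
    for a, b in pairs:
        if filled[a] is None and filled[b] is not None:
            filled[a] = reflect_across_heading(filled[b], center, yaw)
        elif filled[b] is None and filled[a] is not None:
            filled[b] = reflect_across_heading(filled[a], center, yaw)
    return filled

# Usage: the front-right corner is occluded and recovered from the front-left.
keypoints = {"front_left": (2.0, 11.0), "front_right": None}
recovered = fill_occluded(
    keypoints, [("front_left", "front_right")], center=(0.0, 10.0), yaw=0.0
)
```

A fixed reflection like this fails when both partners are hidden or when the yaw estimate is poor, which is one reason a learned reconstruction could be preferable to a purely geometric rule.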

Why it matters

Provides a cost-effective, robust perception solution for autonomous vehicles and robotics operating in complex, occluded environments.

Abstract

Monocular 3D object detection has garnered attention for its cost efficiency and simpler setup compared with multisensor systems. In this task, accurate depth estimation is crucial for precise object localization; however, extracting sufficient depth cues from a single image remains challenging. Moreover, when occlusions occur, structural cues become limited, making precise object localization increasingly difficult. To address these problems, we propose MonoKey, a keypoint-based monocular 3D object detection method that is robust to occlusion. MonoKey uses 2D keypoints because they are well suited to recovering occluded regions. The occlusion-robust 2D keypoint detection approach estimates object keypoints and reconstructs occluded ones using prior information. The frequency-based global-local depth predictor estimates 3D cues using fast Fourier convolution to incorporate global and local contexts. These 3D cues and keypoints are fused in a 3D detection decoder. Relational graph refinement adjusts the initial bounding boxes for improved localization. The experimental results indicate that MonoKey outperforms existing monocular 3D object detection methods. The source code is available at https://github.com/yeonwoo29/MonoKey.git.

Index terms

Object Detection, Segmentation and Categorization; Computer Vision for Automation; AI-Based Methods