DroneKey++: A Size Prior-Free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images
Seo-Bin Hwang, Yeong-Jun Cho
AI summary
Problem
Existing methods rely on manually supplied physical sizes or 3D mesh priors, which limits deployment on unseen drones. Current datasets are also small-scale and model-specific, which makes it hard to test generalization reliably.
Approach
The framework jointly detects keypoints, classifies drone types, and estimates 3D pose using a learned decoder that implicitly encodes scale via class embeddings, eliminating external priors and PnP solvers.
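To illustrate why scale has to come from somewhere (a size prior, or here a learned class embedding), the following minimal sketch shows the standard pinhole scale ambiguity: a drone and a copy twice as large at twice the distance project to the same keypoints and the same viewing rays. The intrinsics `K`, the four-keypoint layout, and the helper names `project` and `pixels_to_rays` are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def project(points_cam, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def pixels_to_rays(pixels, K):
    """Back-project Nx2 pixels to unit-length viewing rays in the camera frame."""
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    rays = homog @ np.linalg.inv(K).T
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

# Assumed camera intrinsics (focal length 800 px, principal point at 320, 240).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical rotor keypoints in the drone frame, in metres.
keypoints_obj = np.array([[0.15, 0.15, 0.0],
                          [-0.15, 0.15, 0.0],
                          [-0.15, -0.15, 0.0],
                          [0.15, -0.15, 0.0]])
t = np.array([0.5, -0.2, 5.0])  # drone position in the camera frame

# Identity rotation for simplicity: a small drone nearby versus a
# drone twice the size at twice the distance.
small_near = keypoints_obj + t
large_far = 2.0 * keypoints_obj + 2.0 * t

pix_small = project(small_near, K)
pix_large = project(large_far, K)
rays_small = pixels_to_rays(pix_small, K)
rays_large = pixels_to_rays(pix_large, K)

# Both configurations give identical pixels and identical rays.
print(np.allclose(pix_small, pix_large), np.allclose(rays_small, rays_large))
```

Because the 2D keypoints alone cannot separate the two cases, some extra cue about drone size is needed to recover metric translation; in the framework summarized above, that cue is the predicted drone class rather than a manually supplied size.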
Key results
- Rotation MAE of 17.34° and translation MAE of 0.135 m
- Inference speeds of 414 FPS (GPU) and 19 FPS (CPU)
- Introduction of 6DroneSyn: 52K-image benchmark with 7 drone models and 88 backgrounds
- Strong generalization across diverse drone types without manual size inputs
Why it matters
Provides a scalable, prior-free solution for real-time anti-drone surveillance and a comprehensive benchmark to advance future research.
Abstract
Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34° and MedAE 17.1° for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is available at [link].
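As a rough sketch of how the MAE and MedAE figures reported above could be computed from per-sample errors: the abstract does not state how per-sample rotation and translation errors are defined, so the geodesic angle between rotation matrices and the Euclidean distance between translation vectors are assumptions here, and the function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices (assumed metric)."""
    cos_angle = (np.trace(R_pred.T @ R_true) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error_m(t_pred, t_true):
    """Euclidean distance in metres between two translation vectors (assumed metric)."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_true)))

def summarize(errors):
    """Mean absolute error (MAE) and median absolute error (MedAE) of per-sample errors."""
    abs_errors = np.abs(np.asarray(errors, dtype=float))
    return {"MAE": float(np.mean(abs_errors)), "MedAE": float(np.median(abs_errors))}

# Toy example: a prediction rotated 30 degrees about the z-axis versus identity.
theta = np.radians(30.0)
R_pred = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
rot_err = rotation_error_deg(R_pred, np.eye(3))
trans_err = translation_error_m([0.0, 0.0, 5.0], [0.0, 3.0, 1.0])

# Made-up per-sample errors, only to show how the two summaries differ.
stats = summarize([1.0, 2.0, 9.0])
print(rot_err, trans_err, stats)
```

Reporting both statistics is useful because the mean is pulled by a few large failures while the median reflects the typical sample.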