Spiking-Refined 3D Object Detection through YOLO–SNN Fusion
Budiarianto Suryo Kusumo, Ulrike Thomas
AI summary
Problem
Monocular 3D object detection suffers from depth ambiguity, temporal jitter, and high computational demands, hindering deployment on resource-constrained embedded platforms for robotics and AR/VR.
Approach
The pipeline integrates YOLOv11 for 2D detection, Depth Anything v2 for depth estimation, and a lightweight spiking neural network that refines class predictions via product-of-experts fusion, coupled with an uncertainty-aware Kalman filter for stable 3D tracking.
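As a rough illustration of the product-of-experts fusion named above, the sketch below multiplies two per-class probability vectors and renormalizes. The function name, the expert weights `w_yolo` and `w_snn`, and the `eps` floor are assumptions made for this example; the paper's actual fusion rule and weighting may differ.

```python
import math


def product_of_experts(p_yolo, p_snn, w_yolo=1.0, w_snn=1.0, eps=1e-9):
    """Fuse two class-probability vectors as p_yolo**w_yolo * p_snn**w_snn, renormalized.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """
    if len(p_yolo) != len(p_snn):
        raise ValueError("both experts must score the same set of classes")

    # Work in log space so that very small probabilities do not underflow.
    log_scores = [
        w_yolo * math.log(max(a, eps)) + w_snn * math.log(max(b, eps))
        for a, b in zip(p_yolo, p_snn)
    ]

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two classes: YOLO leans toward class 0, the SNN leans more strongly toward class 1.
fused = product_of_experts([0.6, 0.4], [0.2, 0.8])
```

With equal weights, a class must be supported by both experts to keep a high fused probability, so a confident SNN can overturn a weakly held YOLO label, as in the usage line above.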
Key results
- SNN validation accuracy reached ~99.9% with strong generalization
- YOLO–SNN fusion improved 3D detection accuracy by 2.2 points and reduced orientation error by ~1°
- Uncertainty-aware Kalman filtering reduced 3D trajectory jitter by 30%
- Pipeline achieves real-time performance (~20–30 FPS) on CPU-only hardware
Why it matters
Provides a lightweight, single-camera alternative to sensor-heavy setups for reliable 3D perception in robotics, AR/VR, and embedded applications where cost and power are critical.
Abstract
This paper presents Spiking-Refined 3D Object Detection through YOLO–SNN Fusion, a real-time pipeline that leverages both convolutional and spiking neural representations for enhanced scene perception. Our system integrates YOLOv11 for robust 2D detection, Depth Anything v2 for monocular depth inference, and geometry-based reasoning for 3D bounding box construction, while a Bird’s-Eye View visualizer provides spatial context. To further improve recognition consistency, we fuse the predictions of a trained Spiking Neural Network (SNN) with YOLO outputs, enabling class refinement that is more resilient to temporal noise and ambiguous appearances. Kalman filtering is employed to stabilize trajectories over time, ensuring coherent 3D tracking. Unlike sensor-heavy setups, our approach runs on a single RGB camera and lightweight models, making it suitable for robotic perception, AR/VR applications, and low-cost embedded platforms. Experiments on real-world video sequences demonstrate improved 3D detection accuracy, temporal stability, and cross-class discrimination compared to conventional monocular pipelines.
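The trajectory-stabilizing role of the Kalman filter can be sketched for a single coordinate of a tracked 3D box center. This is a minimal example under stated assumptions: a scalar random-walk motion model, and a measurement variance scaled by the inverse of the detection confidence as one possible way to make the filter uncertainty-aware. The class `ScalarKalman`, the helper `measurement_variance`, and all constants are hypothetical and are not taken from the paper.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk motion model (illustrative)."""

    def __init__(self, x0, p0=1.0, q=0.01):
        self.x = x0  # state estimate, e.g. depth of a box center in meters
        self.p = p0  # variance of the state estimate
        self.q = q   # process noise added at every step

    def update(self, z, r):
        """Fold in measurement z with variance r and return the new estimate."""
        # Predict: the state is assumed constant, so only the uncertainty grows.
        self.p += self.q
        # Correct: the gain approaches 1 when the measurement is trusted (small r).
        gain = self.p / (self.p + r)
        self.x += gain * (z - self.x)
        self.p *= 1.0 - gain
        return self.x


def measurement_variance(confidence, base_r=0.05, min_conf=0.05):
    """Assumed mapping: lower detection confidence means a noisier measurement."""
    return base_r / max(confidence, min_conf)


# The same measurement pulls the estimate further when the detector is confident.
confident = ScalarKalman(10.0).update(12.0, measurement_variance(0.9))
uncertain = ScalarKalman(10.0).update(12.0, measurement_variance(0.1))
```

In a full tracker, one such filter (or a joint multi-dimensional filter with a velocity state) would run per tracked object, so that low-confidence frames perturb the trajectory less than high-confidence ones.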