ICRA 2026

Spiking-Refined 3D Object Detection through YOLO–SNN Fusion

Budiarianto Suryo Kusumo, Ulrike Thomas

AI summary

Fusing a compact spiking neural network with YOLOv11 and monocular depth estimation improves 3D detection accuracy (by 2.2 points) and temporal stability (30% less trajectory jitter) in a real-time, low-cost, CPU-only pipeline.

Monocular 3D detection, Spiking neural networks, YOLOv11, Real-time tracking, Embedded robotics, Uncertainty-aware filtering

Problem

Monocular 3D object detection suffers from depth ambiguity, temporal jitter, and high computational demands, hindering deployment on resource-constrained embedded platforms for robotics and AR/VR.

Approach

The pipeline integrates YOLOv11 for 2D detection, Depth Anything v2 for depth estimation, and a lightweight spiking neural network that refines class predictions via product-of-experts fusion, coupled with an uncertainty-aware Kalman filter for stable 3D tracking.
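The product-of-experts step mentioned above can be illustrated with a minimal sketch. It assumes both models emit a probability vector over the same class list. The function name `product_of_experts`, the per-expert weights, and the example scores are hypothetical, and the paper's exact fusion rule may differ.

```python
import math

def product_of_experts(p_yolo, p_snn, w_yolo=1.0, w_snn=1.0, eps=1e-12):
    """Fuse two class-probability vectors by a weighted product, then renormalize.

    The weights are an illustrative assumption; with both set to 1.0 this is
    the plain product of the two distributions.
    """
    if len(p_yolo) != len(p_snn):
        raise ValueError("both experts must score the same class list")
    # Work in log space so that very small probabilities do not underflow.
    log_scores = [
        w_yolo * math.log(max(a, eps)) + w_snn * math.log(max(b, eps))
        for a, b in zip(p_yolo, p_snn)
    ]
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical scores over three classes: YOLO is undecided between the
# first two classes, while the SNN favours the second.
yolo = [0.50, 0.45, 0.05]
snn = [0.20, 0.75, 0.05]
fused = product_of_experts(yolo, snn)
best = max(range(len(fused)), key=fused.__getitem__)
```

A product of experts rewards classes that both models consider plausible: here YOLO alone would pick the first class, but the fused distribution picks the second because the SNN assigns it much more mass.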

Key results

SNN validation accuracy reached ~99.9% with strong generalization
YOLO–SNN fusion improved 3D detection accuracy by 2.2 points and reduced orientation error by ~1°
Uncertainty-aware Kalman filtering reduced 3D trajectory jitter by 30%
Pipeline achieves real-time performance (~20–30 FPS) on CPU-only hardware
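
One way to read "uncertainty-aware" Kalman filtering is that each measurement's noise variance is inflated when the detection is less trustworthy, so uncertain frames move the track less. The sketch below is an assumption-laden illustration, not the paper's filter: it tracks one coordinate with a constant-velocity model, and the `measurement_variance` mapping from confidence to variance, along with all numeric constants, is hypothetical.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over one axis with state [position, velocity]."""

    def __init__(self, pos, pos_var=1.0, vel_var=1.0, accel_var=0.5):
        self.pos, self.vel = pos, 0.0
        # Symmetric 2x2 covariance stored as p00, p01 (= p10), p11.
        self.p00, self.p01, self.p11 = pos_var, 0.0, vel_var
        self.accel_var = accel_var

    def predict(self, dt):
        self.pos += dt * self.vel
        # P = F P F^T + Q with F = [[1, dt], [0, 1]] and a
        # white-noise-acceleration process noise Q.
        q = self.accel_var
        p00 = self.p00 + 2 * dt * self.p01 + dt * dt * self.p11 + q * dt**4 / 4
        p01 = self.p01 + dt * self.p11 + q * dt**3 / 2
        p11 = self.p11 + q * dt**2
        self.p00, self.p01, self.p11 = p00, p01, p11

    def update(self, z, meas_var):
        """Fold in a position measurement z; return the position gain."""
        s = self.p00 + meas_var          # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s
        residual = z - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P = (I - K H) P with H = [1, 0]
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11
        return k0

def measurement_variance(confidence, base_var=0.04, floor=0.05):
    """Hypothetical mapping: lower detection confidence -> larger variance."""
    return base_var / max(confidence, floor) ** 2

# Two identical tracks receive the same measurement with different confidence.
hi = ConstantVelocityKalman1D(pos=10.0)
lo = ConstantVelocityKalman1D(pos=10.0)
for kf in (hi, lo):
    kf.predict(dt=0.05)
gain_hi = hi.update(12.0, measurement_variance(0.9))
gain_lo = lo.update(12.0, measurement_variance(0.2))
```

The low-confidence track receives a smaller gain and therefore moves less toward the same measurement, which is the mechanism by which such a filter damps jitter from unreliable frames.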

Why it matters

Provides a lightweight, single-RGB-camera alternative to sensor-heavy setups for reliable 3D perception in robotics, AR/VR, and embedded applications where cost and power are critical.

Abstract

This paper presents Spiking-Refined 3D Object Detection through YOLO–SNN Fusion, a real-time pipeline that leverages both convolutional and spiking neural representations for enhanced scene perception. Our system integrates YOLOv11 for robust 2D detection, Depth Anything v2 for monocular depth inference, and geometry-based reasoning for 3D bounding box construction, while a Bird’s-Eye View visualizer provides spatial context. To further improve recognition consistency, we fuse the predictions of a trained Spiking Neural Network (SNN) with YOLO outputs, enabling class refinement that is more resilient to temporal noise and ambiguous appearances. Kalman filtering is employed to stabilize trajectories over time, ensuring coherent 3D tracking. Unlike sensor-heavy setups, our approach runs on a single RGB camera and lightweight models, making it suitable for robotic perception, AR/VR applications, and low-cost embedded platforms. Experiments on real-world video sequences demonstrate improved 3D detection accuracy, temporal stability, and cross-class discrimination compared to conventional monocular pipelines.
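
The geometry-based step that lifts a 2D detection to 3D can be sketched with the standard pinhole camera model. The sketch assumes the depth value at the box centre is already metric (in metres) and that the camera intrinsics are known. The `Intrinsics` type, the `backproject_box` helper, and all numbers are hypothetical, and the paper's actual box construction (including orientation and object length) is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Intrinsics:
    fx: float  # focal length in pixels, x axis
    fy: float  # focal length in pixels, y axis
    cx: float  # principal point, x
    cy: float  # principal point, y

def backproject_box(box, depth_m, cam):
    """Lift a 2D box (x1, y1, x2, y2) in pixels to a 3D centre and extent.

    Returns ((X, Y, Z), (width, height)) in metres in the camera frame,
    assuming the whole box lies at the single depth depth_m.
    """
    x1, y1, x2, y2 = box
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
    X = (u - cam.cx) * depth_m / cam.fx
    Y = (v - cam.cy) * depth_m / cam.fy
    width = (x2 - x1) * depth_m / cam.fx
    height = (y2 - y1) * depth_m / cam.fy
    return (X, Y, depth_m), (width, height)

# Hypothetical 1280x720 camera and a detection 5 m away.
cam = Intrinsics(fx=800.0, fy=800.0, cx=640.0, cy=360.0)
center, size = backproject_box((600, 300, 760, 500), 5.0, cam)
# center == (0.25, 0.25, 5.0), size == (1.0, 1.25)
```

Because every term scales linearly with depth, errors in the monocular depth estimate propagate directly into the 3D position and size, which is one reason temporal filtering of the resulting trajectory helps.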

Index terms

Object Detection, Segmentation and Categorization; Visual Learning