ICRA 2026

Learning-Based Fusion for Robust Multi-Spectral Visual Servoing

Enrico Fiasche, Siddharth Singh Savner, Ezio Malis, Philippe Martinet

AI summary

Learning-based autoencoder fusion of multispectral data improves visual servoing positioning accuracy and keeps convergence stable under noisy conditions while preserving real-time efficiency.

Multispectral Visual Servoing, Autoencoder Fusion, Robotic Control, Real-time Perception, Noise-Robust Vision

Problem

Using multiple spectral bands (typically around a dozen) directly in real-time visual servoing is computationally heavy and sensitive to noise, and the only prior approach, a handcrafted Pixel Selection strategy based on image gradients, lacks robustness and generalizability.

Approach

A convolutional autoencoder compresses noisy multispectral inputs into a compact, denoised 2D image that directly feeds into a standard Direct Visual Servoing controller.
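
To make the data flow concrete, the sketch below shows only the shape of the fusion step: an (H, W, C) multispectral cube goes in and a single (H, W) image comes out. It is not the paper's network. The function `fuse_bands` and its fixed averaging weights are hypothetical stand-ins for a trained convolutional encoder, and the toy scene and noise level are made up for illustration.

```python
import numpy as np


def fuse_bands(cube, weights):
    """Collapse an (H, W, C) multispectral cube into one (H, W) image.

    Hypothetical stand-in for a trained encoder: a fixed per-pixel linear
    mix of the C bands, which is what a single 1x1 convolution computes.
    """
    cube = np.asarray(cube, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    if cube.ndim != 3 or weights.shape != (cube.shape[-1],):
        raise ValueError("expected an (H, W, C) cube and C weights")
    return cube @ weights  # contracts the band axis, result is (H, W)


# Toy data (assumption): 12 bands observing the same scene, each corrupted
# by independent Gaussian noise.
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 1.0, size=(64, 64))
noisy_cube = scene[..., None] + rng.normal(0.0, 0.1, size=(64, 64, 12))

# Uniform weights average the bands, which already attenuates independent
# noise; a learned encoder would replace these hand-picked weights.
fused = fuse_bands(noisy_cube, np.full(12, 1.0 / 12.0))
single_band_error = np.std(noisy_cube[..., 0] - scene)
fused_error = np.std(fused - scene)
```

Averaging is used here only because its noise behaviour is easy to check by hand. The point of the learned approach is that the mixing is trained rather than fixed, while the output remains a 2D image that a Direct Visual Servoing controller can consume unchanged.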

Key results

A generalized learning-driven fusion pipeline for multispectral visual servoing
A noise-attenuated 2D image representation that outperforms gradient-based pixel selection
Improved convergence stability and positioning accuracy under noisy conditions
Validation via simulation and real-robot experiments maintaining real-time efficiency

Why it matters

Enables reliable robotic control in complex, unstructured environments where standard RGB cameras fail due to lighting variations.

Abstract

Multispectral sensors, which measure multiple wavelength bands beyond the standard red, green, and blue channels, capture richer information than conventional RGB cameras. Such enriched data is especially valuable in visual servoing, where robot control critically depends on image content. However, leveraging multiple spectral bands (typically around a dozen) directly within real-time visual servoing constitutes a significant challenge. The only prior work tackled this problem using a Pixel Selection strategy based on image gradients. This paper introduces a learning-based framework to enhance Multi-Spectral Visual Servoing (MSVS) by fusing data from multispectral cameras into a single, robust representation for control. An autoencoder is employed to compress multispectral inputs into a noise-attenuated 2D image, which is then used within a standard rule-based Direct Visual Servoing (DVS) scheme. Comparison experiments both with simulated data and with a real robot in complex and unstructured environments show that the proposed learning-based fusion maintains stable convergence and improves positioning accuracy under noisy conditions while preserving computational efficiency.
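
The abstract does not spell out the DVS control law, so the following is a minimal sketch of one common formulation of direct (photometric) visual servoing, not necessarily the scheme used in the paper. It assumes brightness constancy, a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, a single constant scene depth `depth`, and an interaction matrix evaluated at the current image. The input image is any 2D array, for example a fused image produced from multispectral data.

```python
import numpy as np


def photometric_interaction_matrix(image, depth, fx, fy, cx, cy):
    """Stack the 1x6 intensity interaction row of every pixel into (H*W, 6)."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    x = (cols - cx) / fx  # normalized image coordinates
    y = (rows - cy) / fy
    # np.gradient returns the derivative along axis 0 (rows), then axis 1 (cols).
    d_rows, d_cols = np.gradient(image)
    grad_x = d_cols * fx  # chain rule: pixel units -> normalized units
    grad_y = d_rows * fy
    inv_z = np.full_like(x, 1.0 / depth)  # constant-depth assumption
    zeros = np.zeros_like(x)
    # Classical point-feature interaction rows for x and y.
    l_x = np.stack([-inv_z, zeros, x * inv_z, x * y, -(1.0 + x**2), y], axis=-1)
    l_y = np.stack([zeros, -inv_z, y * inv_z, 1.0 + y**2, -x * y, -x], axis=-1)
    # Brightness constancy gives L_I = -(dI/dx * L_x + dI/dy * L_y).
    l_i = -(grad_x[..., None] * l_x + grad_y[..., None] * l_y)
    return l_i.reshape(-1, 6)


def dvs_velocity(image, desired, depth, fx, fy, cx, cy, gain=0.5):
    """Camera velocity (vx, vy, vz, wx, wy, wz) from v = -gain * pinv(L_I) @ e."""
    image = np.asarray(image, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    if image.ndim != 2 or image.shape != desired.shape:
        raise ValueError("image and desired must be 2D arrays of equal shape")
    l_i = photometric_interaction_matrix(image, depth, fx, fy, cx, cy)
    error = (image - desired).reshape(-1)
    # Least squares yields the same result as multiplying by the pseudo-inverse.
    solution, *_ = np.linalg.lstsq(l_i, error, rcond=None)
    return -gain * solution
```

Because the controller only ever sees a 2D intensity image, the fusion stage can be swapped without touching the control law, which is the property the abstract relies on when it feeds the autoencoder output into a standard DVS scheme.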

Index terms

Visual Servoing, Deep Learning for Visual Perception, Visual Tracking