ICRA 2026

Neuromorphic Event Camera-Based Object Recognition and Grasping Position Detection Using a Transfer Learning-Enhanced Multi-Task Model

Muhammad Hamza Zafar, Syed Kumayl Raza Moosavi, Filippo Sanfilippo

AI summary

A unified multi-task neural network leveraging event camera data and transfer learning achieves state-of-the-art accuracy for simultaneous object recognition and grasp detection.

Event camera; multi-task learning; object recognition; grasp detection; transfer learning; neuromorphic vision

Problem

Robotic systems traditionally treat object recognition and grasp detection as separate tasks, causing computational inefficiency and integration challenges in dynamic environments. Existing models also fail to fully exploit the temporal advantages of event-based vision for unified multi-task learning.

Approach

The authors propose CSA-AInceptNet, a unified architecture combining channel sharpening attention with adaptive inception networks to process asynchronous event streams. Transfer learning bridges dataset gaps, enabling efficient simultaneous object classification and grasp bounding box prediction.
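
To make the idea of simultaneous classification and grasp box prediction concrete, the following is a minimal sketch of a joint training objective. It is an illustration under stated assumptions, not the loss used in the paper: the weighted sum of cross-entropy and smooth-L1 terms is a common multi-task formulation, and the names `multitask_loss` and `box_weight`, as well as the `(x_min, y_min, x_max, y_max)` box encoding, are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one sample given raw class scores (logits)."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def smooth_l1(predicted_box, target_box, beta=1.0):
    """Mean smooth-L1 distance between two boxes of equal length."""
    total = 0.0
    for p, t in zip(predicted_box, target_box):
        diff = abs(p - t)
        # Quadratic for small errors, linear for large ones.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(predicted_box)


def multitask_loss(logits, target_index, predicted_box, target_box, box_weight=1.0):
    """Assumed joint objective: classification loss plus weighted box loss."""
    classification_term = softmax_cross_entropy(logits, target_index)
    box_term = smooth_l1(predicted_box, target_box)
    return classification_term + box_weight * box_term


if __name__ == "__main__":
    # Three object classes, true class 0, boxes as (x_min, y_min, x_max, y_max).
    loss = multitask_loss(
        logits=[2.0, 0.5, -1.0],
        target_index=0,
        predicted_box=[0.10, 0.20, 0.60, 0.70],
        target_box=[0.12, 0.18, 0.58, 0.74],
    )
    print(f"joint loss: {loss:.4f}")
```

In a setup like this, a single shared backbone would feed both output heads, and `box_weight` would control how strongly the grasp localization error influences the shared features relative to the classification error.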

Key results

99.47% accuracy and 0.9370 mean IoU on E-Grasp
98.58% accuracy and 0.4897 mean IoU on Neuro-Grasp via transfer learning
Outperforms ConvNeXt, DarkNet, DenseNet, and VGG16 in accuracy and efficiency
Ablation studies confirm effectiveness of channel sharpening attention and adaptive inception modules
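
The two headline metrics above can be sketched as follows. This is a minimal illustration that assumes axis-aligned boxes encoded as `(x_min, y_min, x_max, y_max)`; the paper's exact grasp box parameterization and evaluation protocol are not given in this summary, and the function names are hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Overlap along each axis, clamped to zero when the boxes do not intersect.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(predicted_boxes, target_boxes):
    """Average IoU over paired predicted and ground-truth boxes."""
    scores = [box_iou(p, t) for p, t in zip(predicted_boxes, target_boxes)]
    return sum(scores) / len(scores)


def accuracy(predicted_labels, target_labels):
    """Fraction of samples whose predicted class matches the ground truth."""
    correct = sum(p == t for p, t in zip(predicted_labels, target_labels))
    return correct / len(target_labels)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 1, 1)]
    targets = [(0, 0, 2, 2), (2, 2, 3, 3)]
    print(mean_iou(predictions, targets))
    print(accuracy([1, 2, 3, 4], [1, 2, 3, 0]))
```

Read this way, a mean IoU of 0.9370 indicates predicted grasp boxes that overlap the ground truth almost completely on average, while 0.4897 indicates roughly half overlap even though classification accuracy stays high.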

Why it matters

Provides a computationally efficient, real-time perception solution for dynamic robotic manipulation, advancing human-robot collaboration and automated industrial applications.

Abstract

Object recognition and grasping position detection are critical tasks in robotic manipulation, particularly when operating in dynamic and unstructured environments. This paper presents the Channel Sharpening Attention-based Adaptive Inception Network (CSA-AInceptNet), a novel multi-task learning model designed for these tasks using event camera data. The proposed architecture integrates channel sharpening attention with adaptive inception networks to enhance feature extraction and improve robustness. The model’s performance is evaluated on two state-of-the-art event camera datasets, E-Grasp and Neuro-Grasp. On the E-Grasp dataset, CSA-AInceptNet achieves a remarkable accuracy of 99.47% and a mean Intersection over Union (IoU) of 0.9370, significantly surpassing existing methods. On the Neuro-Grasp dataset, leveraging transfer learning, the model attains 98.58% accuracy and a mean IoU of 0.4897, demonstrating strong generalization capabilities across datasets. Comparative analyses and ablation studies further validate the effectiveness of the proposed architecture, highlighting its superiority over conventional models like ConvNeXt, DarkNet, DenseNet, and VGG16. The results establish CSA-AInceptNet as a robust solution for event-based object recognition and grasping detection, paving the way for advancements in human-robot collaboration and dynamic robotic manipulation.

Note to Practitioners: This work provides a practical solution for improving object recognition and grasping position detection in robotic systems, particularly in unpredictable and fast-changing real-world environments. By leveraging event camera data, the proposed approach enables robots to efficiently identify objects and determine optimal grasping positions, even under challenging conditions. The results highlight the model’s ability to outperform existing methods, making it highly suitable for applications such as human-robot collaboration and precise object handling. This advancement has significant implications for industries like manufacturing, logistics, and healthcare, where robots must interact with objects quickly and accurately. Practitioners can adopt this method to enhance robotic performance, reduce errors, and improve operational efficiency. Future work could focus on testing the model in more complex environments and adapting it for real-time deployment in dynamic settings.

Received 17 December 2024; revised 5 May 2025 and 2 July 2025; accepted 7 August 2025. Date of publication 13 August 2025; date of current version 27 August 2025. This article was recommended for publication by Associate Editor Y. Wu and Editor X. Liu upon evaluation of the reviewers’ comments. (Corresponding author: Filippo Sanfilippo.)

Muhammad Hamza Zafar and Syed Kumayl Raza Moosavi are with the Department of Engineering Sciences, University of Agder, 4879 Grimstad, Norway (e-mail: muhammad.h.zafar@uia.no; syed.k.moosavi@uia.no). Filippo Sanfilippo is with the Department of Engineering Sciences, University of Agder, 4879 Grimstad, Norway, and also with the Department of Software Engineering, Kaunas University of Technology, 51368 Kaunas, Lithuania (e-mail: filippo.sanfilippo@uia.no).

Digital Object Identifier 10.1109/TASE.2025.3598695

Index terms

Deep Learning in Grasping and Manipulation; Industrial Robots; Computer Vision for Automation