IROS 2024

A Lightweight De-Confounding Transformer for Image Captioning in Wearable Assistive Navigation Device

Zhengcai Cao, Ji Xia, Yinbin Shi, MengChu Zhou

Abstract

Image captioning is a multi-modal task that enables the transformation from scene images to natural language, providing valuable insights for visually impaired individuals to understand their environment. Therefore, its application to wearable navigation devices for visually impaired individuals holds immense potential. However, in practical applications, confusion between scene visuals and semantics, coupled with model complexity, often leads to performance degradation, resulting in inaccurate environmental interpretation. In light of this, we introduce a Lightweight De-confounding Transformer Network (LDTNet) for image captioning, equipped with a Causal Adjustment module to eliminate confounders. Moreover, we design a Suppression Gate Unit that efficiently integrates fine-grained information from shallow features while reducing the number of network layers to yield a lightweight model. Experimental results demonstrate that our approach not only addresses the visual-semantic confusion issue effectively but also improves the response speed of wearable devices in comparison with the state of the art. Twenty volunteers are recruited to evaluate LDTNet’s efficacy in real-world settings, in terms of both response speed and generated outputs, by wearing the resulting assistive navigation devices. The outcomes demonstrate its strong performance and great potential for use by visually impaired individuals.

Index terms

Deep Learning for Visual Perception; Wearable Robotics; Deep Learning Methods