SII 2025

Foundation Models Need to Be Culturally Fine-Tuned

Jose Alfredo Garcia-Alvarado, Floris Marc Arden Erich, Tomohiro Motoda, Abdullah Mustafa, Yukiyasu Domae, Ixchel Georgina Ramirez-Alpizar

Abstract

This paper investigates how well Vision-Language Models (VLMs), exemplified by CLIP, adapt to environments outside the West. While models like CLIP perform well on established image datasets, their effectiveness in recognizing objects within specific cultural contexts remains an open question. Our experiments, conducted in a simulated environment, reveal noteworthy performance disparities between Western and Japanese datasets. Additionally, we explore the integration of a segmentation model to obtain segmentation masks with language-aligned features. By addressing these gaps, our study provides insights into the challenges of cross-cultural recognition within the vision-language paradigm. These findings contribute to informed and unbiased model development for practical applications across diverse cultural domains.

Index terms

Multi-Modal Perception, Machine Learning