Incorporating Scene Graphs into Pre-Trained Vision-Language Models for Multimodal Open-Vocabulary Action Recognition
Chao Wei, Zhidong Deng
Abstract
This paper presents Action-SGFA, a novel action feature alignment approach to learn unified joint embeddings across four action modalities incorporating scene graph (SG) comprehension. A new training paradigm for Action-SGFA is also devised to improve pre-trained VL models using datasets with SG annotation. When learning from image-SG pairs, it captures structure-associated action knowledge for visual and textual encoders. SG supervision generates fine-grained captions based on various graph augmentations highlighting different compositional aspects of action scenes. Furthermore, our research reveals that all combinations of paired data are unnecessary to train such unified embeddings, and only image-paired data is sufficient to bind all action modalities together. Our Action-SGFA can leverage existing large VL models, enhancing their zero-shot capabilities of new modalities due to their natural pairings with images. The open-vocabulary zero-shot performance improves with the strength of the pre- trained VL model and the SG comprehension. We establish a new state-of-the-art in several zero-shot action recognition tasks across modalities, significantly surpassing the vanilla skeleton zero-shot method by 27.0% and 19.7% on NTU-60 and NTU- 120, respectively. Additionally, in the context of RGB videos, we surpass the state-of-the-art method on Kinetics-400 by 2.1%.