Ebhaam at SemEval-2023 Task 1: A CLIP-Based Approach for Comparing Cross-modality and Unimodality in Visual Word Sense Disambiguation

Published in the Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Abstract

This paper presents an approach to Visual Word Sense Disambiguation (Visual-WSD), the task of selecting the image that best represents a given polysemous word in one of its particular senses. The proposed approach leverages the CLIP model, prompt engineering, and text-to-image models such as GLIDE and DALL-E 2 for both image retrieval and generation. To evaluate our approach, we participated in the SemEval-2023 shared task on “Visual Word Sense Disambiguation (Visual-WSD)” in a zero-shot learning setting, where we compared the accuracy of different combinations of tools: “Simple prompt-based” and “Generated prompt-based” methods for prompt engineering using completion models, and text-to-image models for changing the input modality from text to image. Moreover, we explored the benefits of cross-modality evaluation between text and candidate images using CLIP. Our experimental results demonstrate that the proposed approach reaches better results than cross-modality approaches, highlighting the potential of prompt engineering and text-to-image models to improve accuracy in Visual-WSD tasks. In this zero-shot scenario, our best attempt attained an accuracy of 68.75%.

Pipeline
Image-image pipeline

(a) Image-modality comparison method. Text prompts are used to generate images using text-to-image models. The CLIP image encoder is used to encode both the generated images and the candidate images, and the closest match is selected.

Image-text pipeline

(b) Cross-modality comparison method. CLIP encoders are used for both the input text and the candidate images. The CLIP model retrieves the candidate images that best match the input text prompts, and the retrieved images are evaluated against the ground truth images.
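
Both pipelines reduce to the same final step: rank the candidate images by how close their embeddings are to a query embedding, where the query is either the encoded text prompt or the encoded generated image. The following is a minimal sketch of that ranking step and of the accuracy metric, not the authors' implementation. It assumes cosine similarity as the closeness measure, and the function names and the toy 3-dimensional vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def accuracy(predicted, gold):
    """Fraction of instances whose top-ranked candidate is the gold image."""
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold)


# Toy vectors standing in for CLIP embeddings (hypothetical values).
candidates = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]

# Query taken from a text encoder (text prompt compared with images).
text_query = [0.2, 0.8, 0.1]
# Query taken from an image encoder (generated image compared with images).
generated_query = [0.1, 0.3, 0.9]

text_ranking = rank_candidates(text_query, candidates)
image_ranking = rank_candidates(generated_query, candidates)
print(text_ranking[0], image_ranking[0])
```

In this sketch the only difference between the two comparison methods is where the query vector comes from; the ranking and scoring code is shared.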
