Research Paper | Researchia: 202604.09008 | Computational Linguistics > NLP

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
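As a rough illustration of the three matching metrics the abstract names, here is a minimal Python sketch. The record schema (creator/origin/period fields) and the per-field matching rule are assumptions for illustration only; the paper's actual protocol uses an LLM-as-Judge to score semantic alignment, which the simple string comparison below merely stands in for.

```python
# Sketch of exact-match, partial-match, and attribute-level accuracy
# over structured metadata records. FIELDS and field_matches() are
# hypothetical stand-ins, not the paper's actual schema or judge.

FIELDS = ["creator", "origin", "period"]  # assumed metadata attributes

def field_matches(pred, ref):
    """Case-insensitive string equality as a crude stand-in for the
    paper's LLM-as-Judge semantic-alignment check."""
    return (pred is not None and ref is not None
            and pred.strip().lower() == ref.strip().lower())

def score(predictions, references):
    exact = 0                           # all fields correct in a record
    partial = 0                         # at least one field correct
    per_field = {f: 0 for f in FIELDS}  # attribute-level tallies

    for pred, ref in zip(predictions, references):
        hits = [f for f in FIELDS if field_matches(pred.get(f), ref.get(f))]
        exact += len(hits) == len(FIELDS)
        partial += len(hits) > 0
        for f in hits:
            per_field[f] += 1

    n = len(references)
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": {f: per_field[f] / n for f in FIELDS},
    }

# Usage with two toy records (invented examples):
preds = [{"creator": "Hokusai", "origin": "Japan", "period": "Edo"},
         {"creator": "unknown", "origin": "Japan", "period": "Meiji"}]
refs  = [{"creator": "Hokusai", "origin": "Japan", "period": "Edo"},
         {"creator": "Kuniyoshi", "origin": "Japan", "period": "Edo"}]
print(score(preds, refs))
# {'exact_match': 0.5, 'partial_match': 1.0,
#  'attribute_accuracy': {'creator': 0.5, 'origin': 1.0, 'period': 0.5}}
```

Note the distinction the sketch makes concrete: exact-match credits a record only when every attribute aligns, partial-match credits any overlap, and attribute-level accuracy breaks performance out per field, which is what lets the paper report variation across metadata types.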


Source: arXiv:2604.07338v1 - http://arxiv.org/abs/2604.07338v1
PDF: https://arxiv.org/pdf/2604.07338v1

Submission: 4/9/2026
Subjects: NLP; Computational Linguistics