The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning

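Summary

Large multimodal models (LMMs) show strong performance across a variety of multimodal tasks.
However, because most data and models are predominantly Western-centric, their effectiveness in cross-cultural contexts remains limited.
Conversely, multi-agent models have shown significant capability in solving complex tasks.
Our study evaluates the collective performance of LMMs in a multi-agent interaction setting on the novel task of cultural image captioning.
Our contributions are as follows: (1) We introduce MosAIC, a multi-agent framework that enhances cross-cultural image captioning using LMMs with distinct cultural personas.
(2) We provide a dataset of culturally enriched English image captions for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, and CVQA.
(3) We propose a culture-adaptable metric for evaluating cultural information in image captions.
(4) We show that multi-agent interaction outperforms single-agent models across different metrics, offering valuable insights for future research.
The dataset and models are available at https://github.com/MichiganNLP/MosAIC.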

Abstract (original)

Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at https://github.com/MichiganNLP/MosAIC.

arXiv information

Authors	Longju Bai, Angana Borah, Oana Ignat, Rada Mihalcea
Published	2024-11-18 17:37:10+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
