Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

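Abstract

As the performance of large-scale vision language models (LVLMs) improves, they are increasingly able to respond in multiple languages, and demand for LVLM-generated explanations is expected to grow.
However, both the pre-training of the vision encoder and the joint training of LLMs with the vision encoder rely mainly on English training data, so it is uncertain whether LVLMs can reach their full potential when generating explanations in languages other than English.
In addition, multilingual QA benchmarks built with machine translation carry cultural differences and biases, which makes them problematic as evaluation tasks.
To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation.
This dataset, which accounts for nuances and country-specific phrases, was then used to evaluate the explanation-generation abilities of LVLMs.
Furthermore, this study examined whether instruction tuning in resource-rich English improves performance in other languages.
Our findings indicate that LVLMs perform worse in languages other than English than they do in English.
In addition, LVLMs were observed to struggle to make effective use of the knowledge learned from English data.
The dataset is available at https://huggingface.co/datasets/naist-nlp/MultiExpArt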

Abstract (original)

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow. However, pre-training of Vision Encoder and the integrated training of LLMs with Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can completely handle their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation have cultural differences and biases, remaining issues for use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset that takes into account nuances and country-specific phrases was then used to evaluate the generation explanation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data. Our dataset is available at https://huggingface.co/datasets/naist-nlp/MultiExpArt

arXiv information

Authors	Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
Published	2025-02-14 09:56:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
