IRR: Image Review Ranking Framework for Evaluating Vision-Language Models

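Summary

Large-scale vision-language models (LVLMs) process both images and text and excel at multimodal tasks such as image captioning and description generation.
However, although these models are good at generating factual content, their ability to generate and evaluate texts that reflect different perspectives on the same image, depending on the context, has not been sufficiently explored.
To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives.
IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations.
We validate it using a dataset of images from 15 categories, where each image comes with five critic review texts and annotated rankings in both English and Japanese, for a total of more than 2,000 data instances.
The dataset is available at https://hf.co/datasets/naist-nlp/Wiki-ImageReview1.0.
Our results show that, although LVLMs performed consistently across languages, their correlation with human annotations was insufficient, which points to the need for further advances.
These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in vision-and-language tasks.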
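
The summary says that IRR measures how closely an LVLM's ranking of review texts agrees with the human-annotated ranking, but it does not name the correlation measure. As a minimal sketch, assuming Spearman's rank correlation is an acceptable stand-in and that rankings contain no ties, the comparison for one image could look like the following. The function names `scores_to_ranks` and `spearman_rho` and the example rankings are hypothetical and are not taken from the paper or the dataset.

```python
# Illustrative sketch only: Spearman's rho is an ASSUMED agreement measure;
# the abstract does not state which correlation IRR uses.

def scores_to_ranks(scores):
    """Turn per-review scores into ranks (1 = best), assuming higher is better and no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position + 1
    return ranks


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up example: five review texts for one image.
human_ranks = [1, 2, 3, 4, 5]   # hypothetical human-annotated ranking
model_ranks = [2, 1, 3, 5, 4]   # hypothetical ranking derived from a model
print(spearman_rho(human_ranks, model_ranks))
```

A value near 1 means the model orders the five reviews much as the annotators did, while a value near 0 means little agreement. Averaging such per-image values over a dataset is one plausible way to summarize a model, again as an assumption rather than the paper's stated procedure.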

Abstract (original)

Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models excel at generating factual content, their ability to generate and evaluate texts reflecting perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. The datasets are available at https://hf.co/datasets/naist-nlp/Wiki-ImageReview1.0. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, highlighting the need for further advancements. These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.

arXiv information

Authors	Kazuki Hayashi, Kazuma Onishi, Toma Suzuki, Yusuke Ide, Seiji Gobara, Shigeki Saito, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
Published	2024-12-16 16:09:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
