MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

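Summary

Multimodal large language models (MLLMs) have recently attracted considerable attention and show notable potential for artificial general intelligence.
However, evaluating the usefulness of MLLMs remains difficult, mainly because no multimodal benchmark aligned with human preferences exists.
Inspired by LLM-as-a-Judge for LLMs, this paper introduces a new benchmark called MLLM-as-a-Judge, which assesses the ability of MLLMs to assist judges across three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking.
The study finds that MLLMs show remarkably human-like discernment in Pair Comparison, but diverge significantly from human preferences in Scoring Evaluation and Batch Ranking.
Moreover, even advanced models such as GPT-4V still face challenges in judgment, including diverse biases, hallucinated responses, and inconsistencies.
These findings underscore the pressing need for improvements and further research before MLLMs can serve as fully reliable evaluators.
The code and dataset are available at https://github.com/Dongping-Chen/MLLM-as-a-Judge.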
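
As a rough illustration of what "agreement with human preferences" could mean for the three tasks, the sketch below computes one simple agreement measure per task. The choice of measures (Pearson correlation for scores, exact-match accuracy for pairwise choices, normalized edit distance for rankings) and all function names and toy data are assumptions made here for illustration; they are not taken from the abstract and may differ from what the benchmark's repository implements.

```python
from math import sqrt

def pearson(judge_scores, human_scores):
    """Pearson correlation between model and human scores (Scoring Evaluation).
    Assumes both lists have the same length and non-zero variance."""
    n = len(judge_scores)
    mj = sum(judge_scores) / n
    mh = sum(human_scores) / n
    cov = sum((j - mj) * (h - mh) for j, h in zip(judge_scores, human_scores))
    sj = sqrt(sum((j - mj) ** 2 for j in judge_scores))
    sh = sqrt(sum((h - mh) ** 2 for h in human_scores))
    return cov / (sj * sh)

def pair_accuracy(judge_choices, human_choices):
    """Fraction of pairwise comparisons where the model picks the same
    option as the human annotator (Pair Comparison)."""
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(human_choices)

def normalized_edit_distance(judge_ranking, human_ranking):
    """Levenshtein distance between two rankings, divided by the longer
    length (Batch Ranking). 0.0 means identical rankings."""
    prev = list(range(len(human_ranking) + 1))
    for i, a in enumerate(judge_ranking, 1):
        cur = [i]
        for j, b in enumerate(human_ranking, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1] / max(len(judge_ranking), len(human_ranking), 1)

# Toy data, invented for this example only.
score_agreement = pearson([1, 2, 3], [2, 4, 6])
pair_agreement = pair_accuracy(["A", "B", "tie"], ["A", "A", "tie"])
rank_distance = normalized_edit_distance("ABCD", "ABDC")
print(score_agreement, pair_agreement, rank_distance)
```

Higher correlation and accuracy, and lower edit distance, would indicate closer agreement between the model judge and human annotators under these assumed measures.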

Abstract (original)

Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Inspired by LLM-as-a-Judge in LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparisons, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking tasks. Furthermore, MLLMs still face challenges in judgment, including diverse biases, hallucinatory responses, and inconsistencies, even for advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts regarding MLLMs as fully reliable evaluators. Code and dataset are available at https://github.com/Dongping-Chen/MLLM-as-a-Judge.

arXiv information

Authors	Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, Lichao Sun
Published	2024-02-07 12:28:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
