Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Abstract

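Charts are ubiquitous because they help people understand and reason about data.
Recently, various downstream tasks have emerged, including chart question answering, Chart2Text, and fact-checking.
Large vision-language models (LVLMs) show promise on these tasks, but evaluating them is costly and time-consuming, which limits real-world deployment.
Using LVLMs as judges to assess the chart comprehension of other LVLMs could streamline evaluation, but challenges such as proprietary datasets, restricted access to powerful models, and evaluation costs hinder adoption in industrial settings.
To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks.
We design both pairwise and pointwise evaluation tasks covering criteria such as factual correctness, informativeness, and relevancy.
In addition, we analyze the LVLM judges in terms of format adherence, positional consistency, length bias, and instruction-following.
We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, and follow a standardized evaluation protocol and rubric to measure each LVLM judge's accuracy. The experimental results reveal notable variability: some open LVLM judges reach GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), while others struggle (around 10% agreement or less). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, although biases such as positional preference and length bias persist.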

Abstract (original)

Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge's accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
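
The abstract reports judge quality as agreement with GPT-4 judgments and names positional consistency as one axis of analysis. The sketch below illustrates one plausible way to compute both for pairwise verdicts. The abstract does not give the paper's exact metric definitions, so the function names, the "A"/"B"/"tie" label set, and the toy data are all assumptions made for illustration only.

```python
from typing import Sequence

# Hypothetical verdict labels for a pairwise comparison of two responses.
FLIP = {"A": "B", "B": "A", "tie": "tie"}


def agreement_rate(judge: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of items where the judge's verdict equals the reference verdict."""
    if not judge or len(judge) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(j == r for j, r in zip(judge, reference)) / len(judge)


def positional_consistency(original: Sequence[str], swapped: Sequence[str]) -> float:
    """Fraction of pairs where the judge prefers the same underlying response
    after the two responses are presented in swapped order.

    A verdict given on the swapped ordering is mapped back through FLIP
    before it is compared with the verdict on the original ordering.
    """
    if not original or len(original) != len(swapped):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(FLIP[s] == o for o, s in zip(original, swapped)) / len(original)


# Toy data (assumed, not taken from the paper).
reference_verdicts = ["A", "B", "A", "tie", "B"]   # e.g. verdicts from a reference judge
judge_verdicts = ["A", "B", "B", "tie", "B"]       # open LVLM judge, original order
judge_swapped = ["B", "A", "B", "tie", "B"]        # same judge, responses swapped

agreement = agreement_rate(judge_verdicts, reference_verdicts)
consistency = positional_consistency(judge_verdicts, judge_swapped)
print(f"agreement with reference: {agreement:.2f}")
print(f"positional consistency:   {consistency:.2f}")
```

A judge with a strong positional preference would tend to repeat the same label regardless of order, which lowers the consistency score even when its agreement with the reference looks acceptable.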

arXiv information

Authors	Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Published	2025-05-13 11:50:08+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
