TP-Eval: Tap Multimodal LLMs’ Potential in Evaluation by Customizing Prompts

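Abstract

Multimodal large language models (MLLMs) have recently attracted attention for their strong capabilities.
Evaluating MLLMs is becoming increasingly important for analyzing their properties and providing useful insights.
However, current benchmarks overlook prompt sensitivity: small variations in a prompt can cause large swings in performance.
As a result, an unsuitable prompt can obscure a model's capabilities and cause its performance to be underestimated.
Moreover, different models prefer different prompts, so using the same prompt for every model biases the evaluation.
This paper analyzes this deficiency in existing benchmarks and introduces TP-Eval, a new evaluation framework that uses prompt customization to reduce evaluation bias and tap the models' potential.
TP-Eval rewrites the original prompt into a different customized prompt for each model.
In particular, we propose several well-designed modules for prompt customization tailored to the MLLM evaluation scenario.
Extensive experiments demonstrate that our approach is effective at uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.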

Abstract (original)

Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical to analyzing attributes of MLLMs and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity – minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure the models’ capabilities, underestimating the models’ performance. Moreover, different models have different preferences for different prompts, and thus, using the same prompt for all models will cause evaluation bias. This paper analyzes this deficiency in existing benchmarks and further introduces a new evaluation framework named TP-Eval, which introduces a prompt customization method to reduce evaluation biases and tap models’ potential. TP-Eval will rewrite the original prompts to different customized prompts for different models. In particular, we propose some well-designed modules for prompt customization tailored to the scenario of MLLM evaluation. Extensive experiments demonstrate the effectiveness of our approach to uncovering models’ capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
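
The idea of giving each model its own prompt can be illustrated with a minimal sketch. This is not the TP-Eval algorithm, whose modules the abstract does not describe; it only assumes a fixed list of candidate prompt rewrites and picks, for each model, the candidate with the highest accuracy on a few labeled examples. The names `customize_prompts`, `accuracy`, `toy_model_a`, and `toy_model_b` are hypothetical, and the toy "models" are stand-in functions rather than real MLLMs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a "model" maps (prompt, question) to an answer string.
Model = Callable[[str, str], str]
Example = Tuple[str, str]  # (question, expected answer)


def accuracy(model: Model, prompt: str, examples: List[Example]) -> float:
    """Fraction of examples the model answers correctly under this prompt."""
    correct = sum(1 for question, answer in examples if model(prompt, question) == answer)
    return correct / len(examples)


def customize_prompts(
    models: Dict[str, Model], candidates: List[str], examples: List[Example]
) -> Dict[str, Tuple[str, float]]:
    """For each model, keep the candidate prompt with the highest accuracy.

    Assumption: exhaustive search over a fixed candidate list. Ties go to the
    earliest candidate, because max() returns the first maximal item.
    """
    chosen = {}
    for name, model in models.items():
        best = max(candidates, key=lambda p: accuracy(model, p, examples))
        chosen[name] = (best, accuracy(model, best, examples))
    return chosen


# Toy stand-ins that mimic prompt sensitivity: each one only answers
# correctly when the prompt contains the phrasing it "prefers".
ANSWERS = {"2+2": "4", "3+3": "6"}


def toy_model_a(prompt: str, question: str) -> str:
    return ANSWERS[question] if "briefly" in prompt else "unknown"


def toy_model_b(prompt: str, question: str) -> str:
    return ANSWERS[question] if "step by step" in prompt else "unknown"


EXAMPLES = list(ANSWERS.items())
CANDIDATES = [
    "Answer the question.",
    "Answer briefly.",
    "Think step by step, then answer.",
]
MODELS = {"toy_a": toy_model_a, "toy_b": toy_model_b}

# Accuracy when every model gets the same original prompt.
shared = {name: accuracy(m, CANDIDATES[0], EXAMPLES) for name, m in MODELS.items()}
# Accuracy when each model gets its own best candidate.
chosen = customize_prompts(MODELS, CANDIDATES, EXAMPLES)

print(shared)
print(chosen)
```

In this toy setup both stand-in models fail under the shared prompt and succeed under their own preferred phrasing, which is the kind of underestimation and bias the abstract attributes to one-prompt-for-all evaluation.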

arXiv information

Authors	Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
Published	2024-10-23 17:54:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
