LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Summary

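Advances in large foundation models call for benchmarks that offer wide coverage, low cost, and zero contamination.
Although language model evaluation has been explored continuously, comprehensive studies on evaluating Large Multimodal Models (LMMs) remain limited.
In this work, we introduce LMMs-Eval, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models, to promote transparent and reproducible evaluation.
Although LMMs-Eval offers comprehensive coverage, we find that it still falls short of achieving low cost and zero contamination.
To approach this evaluation trilemma, we further introduce LMMs-Eval Lite, a pruned evaluation toolkit that emphasizes both coverage and efficiency.
We also present Multimodal LiveBench, which uses continuously updated news and online forums to assess models' generalization abilities in the wild, offering a low-cost and zero-contamination evaluation approach.
In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions for navigating the trade-offs in evaluating large multimodal models, paving the way for more effective and reliable benchmarking of LMMs.
We open-source our codebase and maintain the LiveBench leaderboard at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench, respectively.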

Summary (original)

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models’ generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We opensource our codebase and maintain leaderboard of LIVEBENCH at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.

arXiv information

Authors	Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
Published	2025-05-05 04:48:45+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
