LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

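Abstract

Advances in large foundation models call for benchmarks that offer wide coverage, low cost, and zero contamination.
Although the evaluation of language models has been studied continuously, comprehensive studies on evaluating Large Multimodal Models (LMMs) remain limited.
In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models, designed to promote transparent and reproducible evaluation.
Although LMMS-EVAL offers comprehensive coverage, we find that it still falls short of achieving low cost and zero contamination.
To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency.
We also present Multimodal LIVEBENCH, which uses continuously updated news and online forums to assess models' generalization abilities in the wild, providing a low-cost, zero-contamination evaluation approach.
In summary, our work highlights the importance of considering the evaluation trilemma and offers practical solutions for navigating the trade-offs in evaluating large multimodal models, paving the way for more effective and reliable benchmarking of LMMs.
We open-source our codebase at https://github.com/EvolvingLMMs-Lab/lmms-eval and maintain the LIVEBENCH leaderboard at https://huggingface.co/spaces/lmms-lab/LiveBench.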

Abstract (original)

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models’ generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We opensource our codebase and maintain leaderboard of LIVEBENCH at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.

arXiv information

Authors	Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
Published	2024-07-17 17:51:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
