Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Summary

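Vibe-Eval is a new open benchmark and framework for evaluating multimodal chat models. It consists of 269 visual understanding prompts, 100 of which are hard, each with a gold-standard response written by experts. Vibe-Eval is open-ended and challenging, with two objectives: (i) vibe checking multimodal chat models on day-to-day tasks, and (ii) rigorously testing and probing the capabilities of current frontier models. Notably, more than 50% of the questions in the hard set are answered incorrectly by all frontier models. The authors explore the nuances of designing, evaluating, and ranking models on ultra-challenging prompts. They also discuss the trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates with human judgment. They offer free API access for lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on Vibe-Eval's automatic scores. The evaluation code and data are available at https://github.com/reka-ai/reka-vibe-eval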

Abstract (original)

We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, our hard set contains >50% questions that all frontier models answer incorrectly. We explore the nuances of designing, evaluating, and ranking models on ultra challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates to human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on the Vibe-Eval’s automatic scores. We release the evaluation code and data, see https://github.com/reka-ai/reka-vibe-eval
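The abstract's claim that automatic evaluation "roughly correlates" with human judgment can be made concrete with a rank correlation. The following is a minimal sketch, not the authors' evaluation code: it computes Spearman's rank correlation between two lists of per-prompt scores. The function names and the score values are hypothetical and chosen only for illustration; no real Vibe-Eval results are used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores for five prompts (illustrative only).
    judge_scores = [1, 2, 3, 4, 5]   # automatic evaluator
    human_scores = [1, 3, 2, 4, 5]   # human raters
    print(round(spearman(judge_scores, human_scores), 3))
```

A value near 1 would indicate that the automatic evaluator orders responses much as human raters do; a value near 0 would indicate little agreement in ordering.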

arXiv information

Authors	Piotr Padlewski,Max Bain,Matthew Henderson,Zhongkai Zhu,Nishant Relan,Hai Pham,Donovan Ong,Kaloyan Aleksiev,Aitor Ormazabal,Samuel Phua,Ethan Yeo,Eugenie Lamprecht,Qi Liu,Yuqi Wang,Eric Chen,Deyu Fu,Lei Li,Che Zheng,Cyprien de Masson d’Autume,Dani Yogatama,Mikel Artetxe,Yi Tay
Published	2024-05-03 17:59:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
