GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Summary

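Large multimodal models (LMMs) have shown proficiency across many visual tasks.
Many well-known benchmarks exist for evaluating model performance, but they increasingly lack headroom.
A new generation of benchmarks, challenging enough for the next generation of LMMs, is therefore urgently needed.
One area in which LMMs show potential is graph analysis: specifically, the tasks an analyst typically performs when interpreting figures, such as estimating the mean, intercepts, or correlations of functions and data series.
In this work, we introduce GRAB, a graph analysis benchmark suited to current and future frontier LMMs.
Our benchmark is entirely synthetic, which ensures high-quality, noise-free questions.
GRAB consists of 2,170 questions covering 4 tasks and 23 graph properties.
We evaluate 20 LMMs on GRAB and find it to be a challenging benchmark: even the highest-performing model scores just 21.7%.
Finally, we run various ablations to investigate where the models succeed and where they struggle.
We release GRAB to encourage progress in this important, growing domain.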
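
To make the kind of graph property named in the summary concrete, here is a minimal sketch of how the mean, y-intercept, and correlation of a synthetic data series can be computed exactly. It is an illustration only and not the authors' generation or evaluation code: the helper names (`make_series`, `fit_line`, `pearson`) and all parameter values are hypothetical.

```python
import math
import random


def make_series(slope, intercept, n=50, noise=0.0, seed=0):
    """Hypothetical generator: y = slope * x + intercept plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [i / (n - 1) * 10 for i in range(n)]  # n evenly spaced points on [0, 10]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, y-intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Assumed example values: a noise-free line, so the properties are known exactly.
xs, ys = make_series(slope=2.0, intercept=-3.0, noise=0.0)
slope, intercept = fit_line(xs, ys)
print(round(mean(ys), 3), round(intercept, 3), round(pearson(xs, ys), 3))
```

Because the series is generated from known parameters, each property has an exact reference value, which is one plausible reading of why a synthetic benchmark can keep its questions noise-free.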

Summary (original)

Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.

arXiv information

Authors	Jonathan Roberts, Kai Han, Samuel Albanie
Published	2024-08-21 17:59:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
