IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations

Abstract

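Current foundation models exhibit impressive capabilities when prompted with text only or with both image and text inputs.
But do those capabilities change depending on the input modality?
This work proposes $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games.
Each example is presented with multiple $\textbf{isomorphic representations}$ of the input, such as visual, textual, and mathematical representations.
IsoBench provides fine-grained feedback for diagnosing performance gaps caused by the form of the representation.
Across various foundation models, and on the same problem, models show a consistent preference for textual representations.
Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when given images instead of text.
Similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse.
Finally, the work presents two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.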
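
The per-representation gaps quoted above are differences in accuracy, in percentage points, between two representations of the same problems. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The record layout `(problem_id, representation, is_correct)`, the function names, and the toy records are all hypothetical and are not IsoBench data or results.

```python
from collections import defaultdict


def accuracy_by_representation(records):
    """Return accuracy (in percent) per representation.

    `records` is an iterable of (problem_id, representation, is_correct)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _problem_id, representation, is_correct in records:
        totals[representation] += 1
        correct[representation] += int(is_correct)
    return {rep: 100.0 * correct[rep] / totals[rep] for rep in totals}


def representation_gap(records, reference="text", other="image"):
    """Accuracy of `reference` minus accuracy of `other`, in points."""
    acc = accuracy_by_representation(records)
    return acc[reference] - acc[other]


# Hypothetical toy records: each problem appears once per representation.
toy_records = [
    ("p1", "text", True), ("p1", "image", True),
    ("p2", "text", True), ("p2", "image", False),
    ("p3", "text", True), ("p3", "image", False),
    ("p4", "text", False), ("p4", "image", False),
]

gap = representation_gap(toy_records)
print(f"text vs. image gap: {gap:.1f} points")
```

Because every problem appears in each representation, a positive gap reflects the form of the input rather than a difference in problem difficulty.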

Abstract (original)

Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple $\textbf{isomorphic representations}$ of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.

arXiv information

Authors	Deqing Fu, Ghazal Khalighinejad, Ollie Liu, Bhuwan Dhingra, Dani Yogatama, Robin Jia, Willie Neiswanger
Published	2024-04-02 15:46:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
