iWISDM: Assessing instruction following in multimodal models at scale

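Summary

The ability to carry out complex tasks from detailed instructions underlies many of our species' most remarkable achievements.
As humans, we can perform not only a wide variety of tasks but also very complex ones that may take hundreds or thousands of steps to complete.
Large language models, and their more recent multimodal counterparts that integrate textual and visual inputs, have achieved unprecedented success in performing complex tasks.
Yet most existing benchmarks are confined largely to single-modality inputs (either text or vision), which narrows the scope of multimodal evaluation, particularly for instruction following in multimodal contexts.
To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment, designed to generate a limitless array of vision-language tasks of varying complexity.
Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels and evaluated several recently developed multimodal models on them.
Our findings establish iWISDM as a robust benchmark for assessing instruction adherence in both existing and emerging multimodal models, and highlight a large gap between these models' ability to follow instructions precisely and that of humans. The iWISDM code is available on GitHub (https://github.com/BashivanLab/iWISDM).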

Abstract (original)

The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models’ ability to precisely follow instructions with that of humans. The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.

arXiv information

Authors	Xiaoxuan Lei, Lucas Gomez, Hao Yuan Bai, Pouya Bashivan
Published	2024-06-25 15:12:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
