AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

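Summary

In recent years, multimodal large language models (MLLMs) such as GPT-4o, Gemini 1.5 Pro, and Reka Core have extended their capabilities to cover the vision and audio modalities. Although these models perform impressively across a wide range of audio-visual applications, our proposed DeafTest shows that MLLMs often struggle with simple tasks that humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has the higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to evaluate whether MLLMs can truly understand audio-visual information. The benchmark contains 4,555 carefully crafted problems, each with text, visual, and audio components. To infer the correct answer, a model must make effective use of cues from both the visual and audio inputs. To evaluate MLLM responses accurately and objectively, we structure the questions as multiple-choice, removing the need for human or LLM-assisted evaluation. We benchmark a range of closed-source and open-source models and summarize our observations. By exposing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.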
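The multiple-choice format is what makes scoring possible without a human or LLM judge: a response is correct only if its chosen option letter matches the answer key. The following is a minimal sketch of that idea, not the benchmark's actual evaluation code. The four-option A-D layout, the function names `extract_choice` and `accuracy`, and the sample responses are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone capital letter A-D in a response, or None.

    Assumption: the model was asked to answer with an option letter.
    A response that starts with the article "A" would be misread, so a
    real harness would need a stricter answer format.
    """
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the answer key."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


# Hypothetical model outputs and answer key, for illustration only.
sample_responses = ["B", "The answer is C.", "(A)", "I am not sure"]
sample_answers = ["B", "C", "D", "A"]
score = accuracy(sample_responses, sample_answers)
```

Because scoring reduces to an exact match on a letter, the result is deterministic and does not depend on a grader's judgment. A response with no recognizable option letter simply counts as wrong.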

Abstract (original)

Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.

arXiv information

Authors	Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
Published	2024-12-03 17:41:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
