WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Summary

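In this paper, we introduce WorldSense, the first benchmark to evaluate multimodal video understanding that covers visual, audio, and text inputs simultaneously.
In contrast to existing benchmarks, WorldSense has several distinguishing features. (i) Collaboration of omni-modality: the evaluation tasks are designed around a strong coupling of audio and video, so that models must make effective use of the synergistic perception of all modalities.
(ii) Diversity of videos and tasks: WorldSense contains a diverse collection of 1,662 audio-visual synchronized videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover a broad range of scenarios, together with 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation.
(iii) High-quality annotations: all QA pairs are manually labeled by 80 expert annotators, with multiple rounds of correction to ensure quality.
Based on WorldSense, we extensively evaluate a variety of state-of-the-art models.
The experimental results show that existing models face significant challenges in understanding real-world scenarios (the best accuracy is 48.0%).
We hope that WorldSense can serve as a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.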

Summary (original)

In this paper, we introduce WorldSense, the first benchmark to assess the multi-modal video understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our WorldSense has several features: (i) collaboration of omni-modality, we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks, WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover the broad scenarios, and 3,172 multi-choice QA pairs across 26 distinct tasks to enable the comprehensive evaluation; (iii) high-quality annotations, all the QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on our WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope our WorldSense can provide a platform for evaluating the ability in constructing and understanding coherent contexts from omni-modality.

arXiv information

Authors	Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
Published	2025-02-06 18:59:40+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
