CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Abstract

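This paper addresses the challenge of answering questions in scenarios composed of rich, complex, and dynamic audio-visual components.
Existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, but their responses are sometimes ambiguous and fail to describe specific audio-visual events.
To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways. 1) Beyond directly bridging audio and video, we design a clue aggregator that gathers question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge the large language model needs.
2) CAT is trained on a mixed multimodal dataset, so it can be applied directly to audio-visual scenarios.
In particular, we collect an audio-visual joint instruction dataset named AVinstruct to further strengthen CAT's ability to model cross-semantic correlations.
3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy that retrains the model to favor unambiguous responses and improves its ability to localize specific audio-visual objects.
Extensive experiments show that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA).
The code and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.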
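
The third contribution builds on direct preference optimization (DPO). The abstract does not specify how the ambiguity-aware variant modifies the objective, so the sketch below shows only the standard per-pair DPO loss, not the authors' method. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are summed log-probabilities of a preferred (here, unambiguous) and a rejected (ambiguous) response under the policy being trained and under a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    This is the generic DPO objective, NOT the paper's ambiguity-aware
    variant; all names and the default beta are assumptions.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is equal to softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the preferred response.
    print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred response relative to the reference.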

Abstract (original)

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce the CAT, which enhances MLLM in three ways: 1) besides straightforwardly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required for large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct, to further enhance the capacity of CAT to model cross-semantic correlations. 3) we propose AI-assisted ambiguity-aware direct preference optimization, a strategy specialized in retraining the model to favor the non-ambiguity response and improve the ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially in Audio-Visual Question Answering (AVQA) tasks. The codes and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.

arXiv information

Authors	Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, Xiaochun Cao
Published	2024-03-07 16:31:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
