Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization

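Summary

The goal of Multilingual Visual Answer Localization (MVAL) is to locate the video segment that answers a given multilingual question.
Existing methods either focus solely on the visual modality or combine the visual and subtitle modalities.
However, these methods ignore the audio modality of the video, which leaves the input information incomplete and degrades performance on the MVAL task.
This paper proposes a unified Audio-Visual-Textual Span Localization (AVTSL) method that incorporates the audio modality to enhance both the visual and the textual representations for the MVAL task.
Specifically, the authors integrate features from the three modalities and develop three predictors, each tailored to the unique contributions of the fused modalities: an audio-visual predictor, a visual predictor, and a textual predictor.
Each predictor generates predictions based on its respective modality.
To keep the predicted results consistent, they introduce an Audio-Visual-Textual Consistency module.
This module uses a Dynamic Triangular Loss (DTL) function that allows each modality's predictor to learn dynamically from the others.
Through this collaborative learning, the model generates consistent and comprehensive answers.
Extensive experiments show that the proposed method outperforms several state-of-the-art (SOTA) methods, demonstrating the effectiveness of the audio modality.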

Abstract (original)

The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question. Existing methods either focus solely on visual modality or integrate visual and subtitle modalities. However, these methods neglect the audio modality in videos, consequently leading to incomplete input information and poor performance in the MVAL task. In this paper, we propose a unified Audio-Visual-Textual Span Localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations for the MVAL task. Specifically, we integrate features from three modalities and develop three predictors, each tailored to the unique contributions of the fused modalities: an audio-visual predictor, a visual predictor, and a textual predictor. Each predictor generates predictions based on its respective modality. To maintain consistency across the predicted results, we introduce an Audio-Visual-Textual Consistency module. This module utilizes a Dynamic Triangular Loss (DTL) function, allowing each modality’s predictor to dynamically learn from the others. This collaborative learning ensures that the model generates consistent and comprehensive answers. Extensive experiments show that our proposed method outperforms several state-of-the-art (SOTA) methods, which demonstrates the effectiveness of the audio modality.
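
The abstract does not give the formula for the Dynamic Triangular Loss, so the following is only a minimal sketch of the general idea of keeping three span predictors consistent with one another. It assumes each predictor outputs scores for the answer-span start position over the video time steps, and it uses a mean symmetric KL divergence over the three predictor pairs. That choice, the function names, and the equal pair weights are all assumptions made for illustration; the paper's actual DTL, including its dynamic weighting, may differ.

```python
import math


def softmax(logits):
    """Turn raw scores over video time steps into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pairwise_consistency_loss(av_logits, v_logits, t_logits):
    """Illustrative three-way consistency term (hypothetical, NOT the paper's DTL).

    Each argument holds one predictor's scores for the answer-span start
    position. The result is the mean symmetric KL divergence over the three
    predictor pairs, so it vanishes when all three predictors agree.
    """
    dists = [softmax(av_logits), softmax(v_logits), softmax(t_logits)]
    pairs = [(0, 1), (0, 2), (1, 2)]  # the three edges of the "triangle"
    total = 0.0
    for i, j in pairs:
        total += kl_divergence(dists[i], dists[j]) + kl_divergence(dists[j], dists[i])
    return total / len(pairs)


# Usage: made-up scores over four time steps for each predictor.
audio_visual = [2.0, 0.5, 0.1, 0.1]
visual = [1.5, 0.7, 0.2, 0.1]
textual = [0.2, 2.1, 0.3, 0.1]
loss = pairwise_consistency_loss(audio_visual, visual, textual)
```

In a training setup this kind of term would be added to each predictor's own localization loss, so that a predictor is penalized when it disagrees with the other two. How the weights change during training is exactly the part the abstract leaves unspecified.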

arXiv information

Authors	Zhibin Wen, Bin Li
Published	2024-11-05 06:49:14+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
