Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

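Summary

Recent work on learning-based sound source localization has focused mainly on localization performance.
However, prior studies and existing benchmarks overlook cross-modal interaction, which is essential for interactive sound source localization.
Cross-modal interaction is needed to understand audio-visual events whose semantics match or mismatch, such as silent objects or off-screen sounds.
This paper first comprehensively examines cross-modal interaction in existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks.
It then identifies the limitations of previous studies and makes several contributions to overcome them.
First, it introduces a new synthetic benchmark for interactive sound source localization.
Second, it introduces new evaluation metrics that rigorously assess sound source localization methods, measuring both localization performance and cross-modal interaction ability.
Third, it proposes a learning framework with a cross-modal alignment strategy that enhances cross-modal interaction.
Lastly, it evaluates interactive sound source localization together with auxiliary cross-modal retrieval tasks, both to thoroughly assess cross-modal interaction capability and to benchmark competing methods.
The new benchmarks and evaluation metrics reveal previously overlooked issues in sound source localization research.
The proposed method, with enhanced cross-modal alignment, shows superior sound source localization performance.
The authors describe the work as the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using new and standard evaluation metrics.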

Summary (original)

Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or off-screen sounds. In this paper, we first comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. Then, we identify the limitations of previous studies and make several contributions to overcome the limitations. First, we introduce a new synthetic benchmark for interactive sound source localization. Second, we introduce new evaluation metrics to rigorously assess sound source localization methods, focusing on accurately evaluating both localization performance and cross-modal interaction ability. Third, we propose a learning framework with a cross-modal alignment strategy to enhance cross-modal interaction. Lastly, we evaluate both interactive sound source localization and auxiliary cross-modal retrieval tasks together to thoroughly assess cross-modal interaction capabilities and benchmark competing methods. Our new benchmarks and evaluation metrics reveal previously overlooked issues in sound source localization studies. Our proposed novel method, with enhanced cross-modal alignment, shows superior sound source localization performance. This work provides the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using new and standard evaluation metrics.

arXiv information

Authors	Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
Published	2024-07-18 16:51:15+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
