Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Summary

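Audio-Visual Segmentation (AVS) aims at pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), an extension of AVS, additionally pursues semantic understanding of audio-visual scenes.
However, because the AVSS task requires establishing audio-visual correspondence and semantic understanding at the same time, we observe that previous methods struggle to handle this mix of objectives in end-to-end training, which results in insufficient learning and suboptimal optimization.
Therefore, we propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simpler subtasks, from localization to semantic understanding; each subtask is fully optimized in its own stage, achieving step-by-step global optimization.
This training strategy has also demonstrated its generality and effectiveness when applied to existing methods.
To further improve performance on AVS tasks, we propose a new framework, Adaptive Audio Visual Segmentation, which incorporates an adaptive audio query generator and integrates masked attention into the transformer decoder to facilitate adaptive fusion of visual and audio features.
Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks.
The project homepage is available at https://gewu-lab.github.io/stepping_stones/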

Abstract (original)

Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, since the AVSS task requires the establishment of audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods have struggled to handle this mashup of objectives in end-to-end training, resulting in insufficient learning and sub-optimization. Therefore, we propose a two-stage training strategy called \textit{Stepping Stones}, which decomposes the AVSS task into two simple subtasks from localization to semantic understanding, which are fully optimized in each stage to achieve step-by-step global optimization. This training strategy has also proved its generalization and effectiveness on existing methods. To further improve the performance of AVS tasks, we propose a novel framework Adaptive Audio Visual Segmentation, in which we incorporate an adaptive audio query generator and integrate masked attention into the transformer decoder, facilitating the adaptive fusion of visual and audio features. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks. The project homepage can be accessed at https://gewu-lab.github.io/stepping_stones/.
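
The decomposition described in the abstract, localization first and semantic understanding second, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the two stages are connected, so the sketch simply assumes that the second stage assigns a class only to pixels that the first stage marked as sounding. The function names, the 0.5 threshold, and the background label 0 are all hypothetical choices made for illustration.

```python
from typing import List

BACKGROUND = 0  # assumed label id for pixels that are not sounding


def stage1_localize(scores: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Stage 1 (localization): binary mask of sounding pixels.

    `scores` is a hypothetical per-pixel audio-visual correspondence score.
    """
    return [[1 if s > threshold else 0 for s in row] for row in scores]


def stage2_classify(mask: List[List[int]],
                    class_scores: List[List[List[float]]]) -> List[List[int]]:
    """Stage 2 (semantics): label only the pixels kept by the stage-1 mask.

    `class_scores[y][x]` holds hypothetical scores for classes 1..K.
    Using the mask as a hard gate is an assumption of this sketch.
    """
    semantic = []
    for mask_row, score_row in zip(mask, class_scores):
        out_row = []
        for keep, scores in zip(mask_row, score_row):
            if keep == 0:
                out_row.append(BACKGROUND)
            else:
                best = max(range(len(scores)), key=scores.__getitem__)
                out_row.append(best + 1)  # classes are numbered from 1
        semantic.append(out_row)
    return semantic


# Usage on a 2x2 frame with two candidate classes.
mask = stage1_localize([[0.9, 0.1], [0.2, 0.8]])
labels = stage2_classify(mask, [[[0.1, 0.7], [0.9, 0.1]],
                                [[0.3, 0.3], [0.6, 0.2]]])
print(labels)  # [[2, 0], [0, 1]]
```

Splitting the problem this way means each function has one objective: the first answers where the sound comes from, the second answers what is sounding there.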

arXiv information

Authors	Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu
Published	2024-07-16 15:08:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
