Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

Summary

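How to make audio and vision interact effectively has attracted considerable interest in multi-modal research.
Recently, a new audio-visual segmentation (AVS) task was proposed, which aims to segment the sounding objects in video frames under the guidance of audio cues.
However, most existing AVS methods suffer from a modality imbalance: because audio cues are integrated unidirectionally and insufficiently, visual features tend to dominate the audio features.
This imbalance skews the feature representation toward the visual side, hinders the learning of joint audio-visual representations, and can cause inaccurate segmentation.
To address this issue, we propose AVSAC.
Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, which strengthens audio cues and fosters continuous interaction between the audio and visual modalities.
This bidirectional interaction narrows the modality imbalance and enables more effective learning of integrated audio-visual representations.
In addition, we present a strategy for audio-visual frame-wise synchrony that serves as fine-grained guidance for BAVD.
This strategy increases the share of auditory components in the visual features, contributing to a more balanced audio-visual representation.
Extensive experiments show that our method attains new benchmark results for AVS performance.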
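
The abstract does not spell out how BAVD is built, so the following is only a minimal sketch of the general idea of bidirectional audio-visual interaction, not the authors' implementation. It assumes plain single-head scaled dot-product cross-attention with residual connections and no learned projections; the names `cross_attend` and `bidirectional_layer`, the token counts, and the feature dimension are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention with a residual connection.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (n_queries, n_context)
    return queries + weights @ context, weights


def bidirectional_layer(visual, audio):
    """One hypothetical layer in which each modality updates the other."""
    # Audio -> visual: every visual token attends to the audio tokens.
    visual_new, w_v = cross_attend(visual, audio)
    # Visual -> audio: every audio token attends to the updated visual tokens,
    # so the audio stream is refreshed instead of being used only once.
    audio_new, w_a = cross_attend(audio, visual_new)
    return visual_new, audio_new, w_v, w_a


rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 8))  # 16 visual tokens (e.g. a 4x4 feature map), dim 8
audio = rng.standard_normal((2, 8))    # 2 audio tokens, dim 8

# Stack a few layers so the two modalities keep interacting.
for _ in range(3):
    visual, audio, w_v, w_a = bidirectional_layer(visual, audio)

print(visual.shape, audio.shape, w_v.shape, w_a.shape)
```

In a one-way design only the first `cross_attend` call would be present, so the audio features would never be updated; the second call is what makes the exchange bidirectional in this sketch.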

Summary (original)

How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate those of the audio modality, due to a unidirectional and insufficient integration of audio cues. This imbalance skews the feature representation towards the visual aspect, impeding the learning of joint audio-visual representations and potentially causing segmentation inaccuracies. To address this issue, we propose AVSAC. Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing audio cues and fostering continuous interplay between audio and visual modalities. This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. Additionally, we present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD. This strategy enhances the share of auditory components in visual features, contributing to a more balanced audio-visual representation learning. Extensive experiments show that our method attains new benchmarks in AVS performance.

arXiv information

Authors	Tianxiang Chen, Zhentao Tan, Tao Gong, Qi Chu, Yue Wu, Bin Liu, Le Lu, Jieping Ye, Nenghai Yu
Published	2024-02-06 11:35:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
