Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction

Summary

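Both visual and auditory information matter when determining the salient regions of a video. Deep convolutional neural networks (CNNs) have shown strong capability on the audio-visual saliency prediction task. However, factors such as the shooting scene and the weather often produce a moderate distribution discrepancy between the source training data and the target test data, and this discrepancy degrades a CNN model's performance on the target test data. This paper makes an early attempt at the unsupervised domain adaptation problem for audio-visual saliency prediction. The authors propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a dedicated domain discrimination branch is built to align the auditory feature distributions. Next, these auditory features are fused into the visual features through a cross-modal self-attention module. A second domain discrimination branch is then devised to reduce the domain discrepancy of the visual features and of the audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks show that the method can relieve the performance degradation caused by domain discrepancy.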

Abstract (original)

Both visual and auditory information are valuable to determine the salient regions in videos. Deep convolution neural networks (CNN) showcase strong capacity in coping with the audio-visual saliency prediction task. Due to various factors such as shooting scenes and weather, there often exists moderate distribution discrepancy between source training data and target testing data. The domain discrepancy induces to performance degradation on target testing data for CNN models. This paper makes an early attempt to tackle the unsupervised domain adaptation problem for audio-visual saliency prediction. We propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a specific domain discrimination branch is built up for aligning the auditory feature distributions. Then, those auditory features are fused into the visual features through a cross-modal self-attention module. The other domain discrimination branch is devised to reduce the domain discrepancy of visual features and audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks demonstrate that our method can relieve the performance degradation caused by domain discrepancy.
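
The pipeline described in the abstract (an audio domain branch, cross-modal fusion, and a second domain branch on the fused features) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption rather than the authors' implementation: the abstract does not specify the attention form, the loss, or any weights. The sketch uses un-projected dot-product attention as a stand-in for the cross-modal self-attention module, a standard DANN-style min-max objective, and hypothetical names such as `fuse_audio_into_visual`, `lam_audio`, and `lam_fused`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_audio_into_visual(visual, audio):
    """Simplified cross-modal attention (assumption: no learned projections).

    Each visual feature vector attends over all audio feature vectors, and
    the attended audio vector is added back as a residual. Visual and audio
    vectors are assumed to share the same dimensionality.
    """
    scale = math.sqrt(len(audio[0]))
    fused = []
    for v in visual:
        weights = softmax([dot(v, a) / scale for a in audio])
        attended = [
            sum(w * a[d] for w, a in zip(weights, audio))
            for d in range(len(v))
        ]
        fused.append([vi + ai for vi, ai in zip(v, attended)])
    return fused


def domain_bce(prob_source, is_source):
    """Binary cross-entropy of a domain discriminator's predicted probability
    that a sample comes from the source domain."""
    eps = 1e-12
    p = min(max(prob_source, eps), 1.0 - eps)
    return -math.log(p) if is_source else -math.log(1.0 - p)


def dual_adversarial_objective(saliency_loss, audio_domain_loss,
                               fused_domain_loss,
                               lam_audio=0.1, lam_fused=0.1):
    """Feature-extractor objective under a DANN-style formulation (assumed).

    The extractor minimizes the saliency loss while trying to increase both
    discriminators' losses, so source and target features become hard to
    tell apart. The two lambda weights are hypothetical.
    """
    return (saliency_loss
            - lam_audio * audio_domain_loss
            - lam_fused * fused_domain_loss)


# Toy usage: one visual location, two identical audio tokens.
fused = fuse_audio_into_visual([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
objective = dual_adversarial_objective(
    saliency_loss=1.0,
    audio_domain_loss=domain_bce(0.5, is_source=True),
    fused_domain_loss=domain_bce(0.5, is_source=False),
)
```

In a real training loop the two discriminators would be trained to minimize their own domain losses while the feature extractor optimizes the objective above, which is commonly implemented with a gradient reversal layer; whether the paper does so is not stated in the abstract.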

arXiv information

Authors	Yingzi Fan, Longfei Han, Yue Zhang, Lechao Cheng, Chen Xia, Di Hu
Published	2022-08-10 08:50:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
