Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Summary

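Title: Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Summary:

– Audiovisual representation learning typically relies on the correspondence between sight and sound.
– However, a single visual scene can often correspond to multiple audio tracks.
– Consider, for example, different conversations on the same crowded street.
– The effect of such counterfactual pairs on audiovisual representation learning has not previously been studied.
– To investigate this, the authors use dubbed versions of movies to augment cross-modal contrastive learning.
– The approach learns to represent alternate audio tracks, which differ only in speech content, similarly to the same video.
– Dub-augmented training improves performance on a range of auditory and audiovisual tasks without significantly affecting linguistic task performance overall.
– Compared with a strong baseline that removes speech before pretraining, dub-augmented training is more effective; speech removal degrades performance on paralinguistic and audiovisual tasks.
– These findings highlight the importance of accounting for speech variation when learning scene-level audiovisual correspondences, and suggest that dubbed audio can be a useful augmentation technique for training more robust audiovisual models.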

Summary (original)

Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech content, similarly to the same video. Our results show that dub-augmented training improves performance on a range of auditory and audiovisual tasks, without significantly affecting linguistic task performance overall. We additionally compare this approach to a strong baseline where we remove speech before pretraining, and find that dub-augmented training is more effective, including for paralinguistic and audiovisual tasks where speech removal leads to worse performance. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance.
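The abstract describes augmenting cross-modal contrastive learning with dubbed audio so that alternate audio tracks are represented similarly to the same video. The following is a minimal sketch of one way such an objective could look, not the authors' implementation: it assumes a symmetric InfoNCE loss computed separately for each dubbed track and then averaged. The function names (`info_nce`, `dub_augmented_loss`), the temperature value, and the averaging scheme are all hypothetical choices made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross_entropy_diag(sim):
    """Mean of -log softmax(sim[i])[i]: the matching pair sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(sim):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def info_nce(video, audio, temperature=0.1):
    """Symmetric InfoNCE between N video and N audio embeddings.

    video[i] and audio[i] are assumed to come from the same clip;
    every other pairing in the batch is treated as a negative.
    """
    v = [_normalize(x) for x in video]
    a = [_normalize(x) for x in audio]
    sim_va = [[_dot(vi, aj) / temperature for aj in a] for vi in v]
    sim_av = [list(col) for col in zip(*sim_va)]  # transpose: audio-to-video
    return 0.5 * (_cross_entropy_diag(sim_va) + _cross_entropy_diag(sim_av))


def dub_augmented_loss(video, audio_tracks, temperature=0.1):
    """Average the contrastive loss over several audio tracks of the same clips.

    audio_tracks is a list of batches, one per dubbed language (assumption:
    every dub of clip i is a positive for video[i]).
    """
    losses = [info_nce(video, track, temperature) for track in audio_tracks]
    return sum(losses) / len(losses)


# Toy usage: two clips, two hypothetical dubs with 2-D embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
dub_a = [[1.0, 0.1], [0.1, 1.0]]
dub_b = [[0.9, 0.0], [0.0, 0.9]]
loss = dub_augmented_loss(video, [dub_a, dub_b])
```

Averaging over tracks pulls every dub of a clip toward the same video embedding, which is one plausible reading of "represent alternate audio tracks similarly to the same video"; sampling a single dub per training step would be another.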

arXiv information

Authors	Nikhil Singh,Chih-Wei Wu,Iroro Orife,Mahdi Kalayeh
Published	2023-04-12 04:17:45+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
