CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection

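Summary

Voice activity detection (VAD) is the task of automatically determining whether a person is speaking and identifying when their speech occurs in audiovisual data.
Traditionally, this task has been addressed by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning.
Drawing inspiration from recent advances in vision-language models, this work introduces a new approach built on Contrastive Language-Image Pretraining (CLIP) models.
The CLIP visual encoder analyzes video segments showing an individual's upper body, while the text encoder processes textual descriptions generated automatically through prompt engineering.
The embeddings from the two encoders are then fused by a deep neural network, which performs VAD.
Experiments on three VAD benchmarks show that the method outperforms existing visual VAD approaches.
Notably, despite its simplicity and without requiring pre-training on extensive audio-visual datasets, the approach also outperforms several audio-visual methods.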
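
The fusion step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the embeddings are random stand-ins for CLIP encoder outputs, the fusion is plain concatenation followed by a tiny untrained two-layer network, and the names `fake_embedding`, `FusionMLP`, and `speaking_probability` are hypothetical. The abstract only states that the embeddings are fused through a deep neural network.

```python
import math
import random

EMBED_DIM = 8  # hypothetical size; real CLIP embeddings are much larger


def fake_embedding(seed, dim=EMBED_DIM):
    """Stand-in for a CLIP encoder output: a deterministic unit-norm vector."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class FusionMLP:
    """Tiny two-layer network over the concatenated embeddings (untrained)."""

    def __init__(self, in_dim, hidden_dim=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def forward(self, x):
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Single output unit with a sigmoid, read as P(speaking).
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


def speaking_probability(visual_emb, text_emb, model):
    """Fuse by concatenation (an assumption) and score the segment."""
    fused = visual_emb + text_emb  # list concatenation
    return model.forward(fused)


visual_emb = fake_embedding(seed=1)  # would come from the CLIP visual encoder
text_emb = fake_embedding(seed=2)    # would come from the CLIP text encoder
model = FusionMLP(in_dim=2 * EMBED_DIM)
prob = speaking_probability(visual_emb, text_emb, model)
print(f"speaking probability (untrained): {prob:.3f}")
```

Because the weights are random, the printed value carries no meaning. In a real system the network would be trained on labeled speaking and non-speaking segments, and the score would be thresholded to produce the VAD decision.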

Abstract (original)

Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments composed of the upper body of an individual, while the text encoder handles textual descriptions automatically generated through prompt engineering. Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.

arXiv information

Authors	Andrea Appiani, Cigdem Beyan
Published	2024-10-18 14:43:34+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
