End-To-End Audiovisual Feature Fusion for Active Speaker Detection

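Abstract

Active speaker detection plays a vital role in human-machine interaction.
Several end-to-end audiovisual frameworks have emerged recently.
However, the inference time of these models has not been examined, and their complexity and large input size make them unsuitable for real-time applications.
In addition, they share a similar feature extraction strategy that applies a ConvNet to both the audio and the visual input.
This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with raw Mel Frequency Cepstrum Coefficients (MFCC) features extracted from the audio waveform.
Two BiGRU layers are attached to each stream to model that stream's temporal dynamics before fusion.
After fusion, one BiGRU layer is attached to model the joint temporal dynamics.
Experimental results on the AVA-ActiveSpeaker dataset indicate that the new feature extraction strategy is more robust to noisy signals and has a better inference time than models that employ a ConvNet on both modalities.
The proposed model produces a prediction within 44.41 ms, which is fast enough for real-time applications.
Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.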
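
The two-stream layout described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. It assumes the VGG-M visual features and the MFCC audio features are already computed and aligned to the same number of time steps, that "two BiGRU layers attached to each stream" means two stacked layers per stream, that fusion is a simple concatenation, and that the sequence is mean-pooled before classification. The class name `TwoStreamASD` and all dimensions (`visual_dim`, `audio_dim`, `hidden`) are hypothetical choices for illustration; the abstract does not specify any of them.

```python
import torch
import torch.nn as nn


class TwoStreamASD(nn.Module):
    """Illustrative two-stream audiovisual fusion model (hypothetical sketch)."""

    def __init__(self, visual_dim=512, audio_dim=13, hidden=128, num_classes=2):
        super().__init__()
        # Assumption: two stacked BiGRU layers per stream, applied before fusion.
        self.visual_gru = nn.GRU(visual_dim, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        self.audio_gru = nn.GRU(audio_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        # One BiGRU layer over the fused sequence (joint temporal dynamics).
        self.fusion_gru = nn.GRU(4 * hidden, hidden, num_layers=1,
                                 batch_first=True, bidirectional=True)
        # Speaking / not-speaking classifier head (assumed).
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, T, visual_dim), e.g. precomputed VGG-M features
        # audio_feats:  (batch, T, audio_dim), e.g. precomputed MFCC frames
        v, _ = self.visual_gru(visual_feats)   # (batch, T, 2 * hidden)
        a, _ = self.audio_gru(audio_feats)     # (batch, T, 2 * hidden)
        # Assumption: fusion by concatenation along the feature axis.
        fused = torch.cat([v, a], dim=-1)      # (batch, T, 4 * hidden)
        joint, _ = self.fusion_gru(fused)      # (batch, T, 2 * hidden)
        # Assumption: mean-pool over time, then classify.
        return self.classifier(joint.mean(dim=1))  # (batch, num_classes)


# Usage with random tensors standing in for real features.
model = TwoStreamASD()
logits = model(torch.randn(4, 10, 512), torch.randn(4, 10, 13))
print(logits.shape)  # shape is (4, 2): one score pair per clip
```

Feeding MFCC frames straight into a recurrent layer, rather than passing a spectrogram through a second ConvNet, removes one convolutional branch from the forward pass. That is consistent with the abstract's claim of better inference time, though this sketch does not reproduce the reported numbers.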

Abstract (original)

Active speaker detection plays a vital role in human-machine interaction. Recently, a few end-to-end audiovisual frameworks emerged. However, these models’ inference time was not explored and are not applicable for real-time applications due to their complexity and large input size. In addition, they explored a similar feature extraction strategy that employs the ConvNet on audio and visual inputs. This work presents a novel two-stream end-to-end framework fusing features extracted from images via VGG-M with raw Mel Frequency Cepstrum Coefficients features extracted from the audio waveform. The network has two BiGRU layers attached to each stream to handle each stream’s temporal dynamic before fusion. After fusion, one BiGRU layer is attached to model the joint temporal dynamics. The experiment result on the AVA-ActiveSpeaker dataset indicates that our new feature extraction strategy shows more robustness to noisy signals and better inference time than models that employed ConvNet on both modalities. The proposed model predicts within 44.41 ms, which is fast enough for real-time applications. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.

arXiv information

Authors	Fiseha B. Tesema, Zheyuan Lin, Shiqiang Zhu, Wei Song, Jason Gu, Hong Wu
Published	2022-07-27 10:25:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
