DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization

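Summary

Deepfake technology has advanced rapidly and poses a serious threat to information integrity and public trust.
Deepfake detection has improved considerably, but manipulating the audio and visual modalities at the same time, sometimes in only a small segment that nonetheless changes the meaning, makes detection harder still.
We propose a new audio-visual deepfake detection framework that exploits inter-modality differences in the machine perception of speech, based on the assumption that in real samples, unlike in deepfakes, the visual and audio signals agree in the information they carry.
The framework uses features from deep networks specialized in visual and audio speech recognition to identify frame-level cross-modal inconsistencies and thereby localize the forgery in time.
To this end, DiMoDif adopts a Transformer encoder-based architecture with a feature pyramid scheme and local attention, and optimizes the detection model with a composite loss function that accounts for both frame-level detection and fake-interval localization.
On the temporal forgery localization task, DiMoDif exceeds the state of the art by +47.88% AP@0.75 on AV-Deepfake1M and performs on par on LAV-DF.
On the deepfake detection task, it exceeds the state of the art by +30.5% AUC on AV-Deepfake1M and +2.8% AUC on FakeAVCeleb, and performs on par on LAV-DF.
The code is available at https://github.com/mever-team/dimodif.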
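
The core assumption (audio and visual speech features agree on real frames and diverge on manipulated ones) can be illustrated with a toy sketch. This is not the DiMoDif model, which the abstract describes as a Transformer encoder with a feature pyramid and local attention. It only scores each frame by the cosine distance between two per-frame feature vectors and groups high-scoring frames into intervals. The function names, the threshold of 0.5, and the two-dimensional toy features are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def flag_intervals(audio_feats, visual_feats, threshold=0.5):
    """Score each frame and group consecutive high-score frames.

    Returns (scores, intervals), where intervals are half-open
    (start_frame, end_frame) index pairs. The threshold is a
    hypothetical value, not one taken from the paper.
    """
    scores = [cosine_distance(a, v) for a, v in zip(audio_feats, visual_feats)]
    intervals = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a suspicious run begins
        elif score <= threshold and start is not None:
            intervals.append((start, i))  # the run ended at frame i
            start = None
    if start is not None:
        intervals.append((start, len(scores)))
    return scores, intervals


# Toy features: frames 2 and 3 disagree across modalities.
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
visual = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
scores, intervals = flag_intervals(audio, visual)
```

A learned model replaces the fixed distance and threshold with trained components, but the input and output shapes are the same: two aligned feature sequences in, per-frame scores and fake intervals out.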

Summary (original)

Deepfake technology has rapidly advanced, posing significant threats to information integrity and societal trust. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes at small parts but still altering the meaning, presents a more challenging detection scenario. We present a novel audio-visual deepfake detection framework that leverages the inter-modality differences in machine perception of speech, based on the assumption that in real samples – in contrast to deepfakes – visual and audio signals coincide in terms of information. Our framework leverages features from deep networks that specialize in video and audio speech recognition to spot frame-level cross-modal incongruities, and in that way to temporally localize the deepfake forgery. To this end, DiMoDif employs a Transformer encoder-based architecture with a feature pyramid scheme and local attention, and optimizes the detection model through a composite loss function accounting for frame-level detections and fake intervals localization. DiMoDif outperforms the state-of-the-art on the Temporal Forgery Localization task by +47.88% AP@0.75 on AV-Deepfake1M, and performs on-par on LAV-DF. On the Deepfake Detection task, it outperforms the state-of-the-art by +30.5% AUC on AV-Deepfake1M, +2.8% AUC on FakeAVCeleb, and performs on-par on LAV-DF. Code available at https://github.com/mever-team/dimodif.
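
For readers unfamiliar with the AP@0.75 figure: in temporal localization benchmarks, a predicted interval usually counts as correct only if its temporal intersection-over-union (IoU) with a ground-truth interval reaches the stated threshold, here 0.75. The sketch below shows that matching step only. It is not the benchmark's evaluation code, it does not compute average precision itself, and the function names are assumptions.

```python
def temporal_iou(pred, truth):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, iou_threshold=0.75):
    """True if the prediction overlaps the ground truth tightly enough."""
    return temporal_iou(pred, truth) >= iou_threshold


truth = (2.0, 4.0)
tight = is_match((2.2, 4.0), truth)   # IoU is about 0.9, so this matches
loose = is_match((3.0, 5.0), truth)   # IoU is about 0.33, so this does not
```

The high threshold rewards predictions whose boundaries are close to the true start and end of the manipulated segment, not just predictions that overlap it.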

arXiv information

Authors	Christos Koutlis, Symeon Papadopoulos
Published	2024-11-15 13:47:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
