Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

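Abstract

Generalization is a major problem for current audio deepfake detectors, which struggle to give reliable results on out-of-distribution data.
Given how quickly increasingly accurate synthesis methods appear, it is essential to design methods that also work well on data they were not trained on.
This paper studies the potential of large-scale pre-trained models for audio deepfake detection, with a particular focus on generalization ability.
To this end, detection is reformulated within a speaker verification framework: fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity.
Under this paradigm, no fake audio samples are needed for training, which severs any link to the generation method at the root, ensuring full generalization ability.
Features are extracted by general-purpose large pre-trained models, so no training or fine-tuning on dedicated fake detection or speaker verification datasets is required.
At detection time, only a limited set of voice fragments of the identity under test is needed.
Experiments on several datasets widely used in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and substantially outperforming them on out-of-distribution data.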

Abstract (original)

Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which more and more accurate synthesis methods are developed, it is very important to design techniques that work well also on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root, and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and show strong generalization ability, rivaling supervised methods on in-distribution data and largely overcoming them on out-of-distribution data.
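
The verification-style decision described in the abstract can be illustrated with a minimal sketch. It assumes that speaker embeddings have already been extracted by some pre-trained model (not shown here), and it uses the maximum cosine similarity against the reference set with a fixed threshold as the decision rule. The function names, the toy vectors, and the threshold value are all hypothetical choices for illustration; the abstract does not specify the scoring rule the authors actually use.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must have non-zero norm")
    return dot / (norm_a * norm_b)


def identity_score(test_embedding: Vector,
                   reference_embeddings: Sequence[Vector]) -> float:
    """Best match between the test sample and the claimed identity's references.

    Assumption: taking the maximum over the reference set is one plausible
    aggregation; averaging would be another.
    """
    if not reference_embeddings:
        raise ValueError("at least one reference embedding is required")
    return max(cosine_similarity(test_embedding, ref)
               for ref in reference_embeddings)


def is_fake(test_embedding: Vector,
            reference_embeddings: Sequence[Vector],
            threshold: float = 0.5) -> bool:
    """Flag the sample as fake when it does not match the claimed identity.

    The threshold is a hypothetical value and would need calibration.
    """
    return identity_score(test_embedding, reference_embeddings) < threshold


# Toy 3-dimensional "embeddings"; real embeddings would be much larger.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
matching_sample = [0.95, 0.05, 0.0]
mismatched_sample = [0.0, 1.0, 0.0]

print(is_fake(matching_sample, references))
print(is_fake(mismatched_sample, references))
```

Note that no fake sample appears anywhere in this sketch: only genuine reference fragments of the claimed identity are needed, which is the property the abstract relies on for generalization.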

arXiv information

Authors	Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
Published	2024-07-01 12:25:45+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
