CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection

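Summary

Recent advances in Generative Adversarial Networks (GANs) and the emergence of diffusion models have made highly realistic synthetic content far easier to produce and widely accessible.
As a result, effective general-purpose detection mechanisms are urgently needed to mitigate the potential risks posed by deepfakes.
This paper investigates how effective pre-trained vision-language models (VLMs) are for universal deepfake detection when paired with recent adaptation methods.
Following previous work in this area, only a single dataset (ProGAN) is used to adapt CLIP for deepfake detection.
However, in contrast to prior work, which relies solely on the visual part of CLIP and ignores its text part, the authors' analysis shows that retaining the text part is crucial.
Consequently, the simple and lightweight prompt-tuning-based adaptation strategy they adopt outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while using less than one third of the training data (200k images compared to 720k).
To assess the real-world applicability of the proposed models, the authors conduct a comprehensive evaluation across a range of scenarios.
This includes rigorous testing on images drawn from 21 distinct datasets, including images generated by GAN-based, diffusion-based, and commercial tools.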
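
To make the role of the text part concrete, here is a minimal sketch of how a CLIP-style classifier can score an image against text embeddings for two labels. It is an illustration only, not the authors' implementation: the three-dimensional embeddings, the label names `real` and `fake`, the `classify` helper, and the temperature value are all assumptions made for the example. In prompt tuning, the text embeddings would come from learned prompt vectors passed through a frozen text encoder; only the final scoring step is shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled image-text cosine similarities.

    The temperature is an assumed value chosen for illustration.
    """
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings (hypothetical). Real encoders produce vectors with
# hundreds of dimensions.
text_embs = {"real": [0.9, 0.1, 0.0], "fake": [0.1, 0.9, 0.2]}
image_emb = [0.2, 0.8, 0.1]

probs = classify(image_emb, text_embs)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

With this setup the decision depends on both branches: changing the text embeddings changes the classifier without touching the image encoder, which is the lever a prompt-tuning approach adjusts.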

Summary (original)

The recent advancements in Generative Adversarial Networks (GANs) and the emergence of Diffusion models have significantly streamlined the production of highly realistic and widely accessible synthetic content. As a result, there is a pressing need for effective general-purpose detection mechanisms to mitigate the potential risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we employ only a single dataset (ProGAN) in order to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while utilizing less than one third of the training data (200k images as compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios. This involves rigorous testing on images sourced from 21 distinct datasets, including those generated by GAN-based, Diffusion-based and Commercial tools.

arXiv information

Authors	Sohail Ahmed Khan,Duc-Tien Dang-Nguyen
Published	2024-02-20 11:26:42+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
