Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

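Summary

State-of-the-art visual generation models, such as diffusion models (DMs) and vision auto-regressive models (VARs), generate highly realistic images.
Prior work has mitigated Not Safe For Work (NSFW) content in the visual domain, but we identify a new threat: the generation of NSFW text embedded in images.
This includes offensive language such as insults, racial slurs, and sexually explicit terms, and it poses significant risks to users.
We show that all state-of-the-art DMs (e.g., SD3, Flux, DeepFloyd IF) and VARs (e.g., Infinity) are vulnerable to this problem.
Through extensive experiments, we demonstrate that existing mitigation techniques that are effective for visual content fail to prevent harmful text generation while substantially degrading benign text generation.
As a first step toward addressing this threat, we explore safety fine-tuning of the text encoder underlying major DM architectures, using a customized dataset.
This suppresses NSFW generation while preserving overall image and text generation quality.
Finally, to advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images.
ToxicBench provides a curated dataset of harmful prompts, new metrics, and an evaluation pipeline that assesses both NSFW-ness and generation quality.
Our benchmark aims to guide future efforts to mitigate NSFW text generation in text-to-image models.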
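
To make the idea of scoring NSFW text in generated images more concrete, here is a minimal sketch of one check an evaluation pipeline of this kind could run. It is not ToxicBench's actual metric, which the abstract does not specify. It assumes an OCR step has already extracted text from each generated image, and it measures how often the word requested in the prompt shows up (allowing small spelling deviations) in that text. All names (levenshtein, best_similarity, reproduction_rate), the 0.8 threshold, and the placeholder word "badword" are hypothetical.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def best_similarity(ocr_text: str, target: str) -> float:
    """Highest (1 - normalized edit distance) between target and any OCR token."""
    goal = normalize(target)
    tokens = [tok for tok in (normalize(t) for t in ocr_text.split()) if tok]
    if not goal or not tokens:
        return 0.0
    return max(1 - levenshtein(tok, goal) / max(len(tok), len(goal))
               for tok in tokens)


def reproduction_rate(samples: Iterable[Tuple[str, str]],
                      threshold: float = 0.8) -> float:
    """Fraction of (ocr_text, target_word) pairs where the target is rendered.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    pairs = list(samples)
    if not pairs:
        return 0.0
    hits = sum(best_similarity(ocr, target) >= threshold for ocr, target in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Hypothetical OCR outputs paired with the word the prompt asked for.
    samples = [
        ("A sign saying BADWORD!", "badword"),
        ("A sign saying hello", "badword"),
        ("", "badword"),
    ]
    print(f"reproduction rate: {reproduction_rate(samples):.2f}")
```

A lower rate on harmful prompts would suggest a mitigation suppresses NSFW text. Running the same function on benign target words gives a rough view of how much benign text rendering degrades, which is the trade-off the summary describes.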

Abstract (original)

State-of-the-art visual generation models, such as Diffusion Models (DMs) and Vision Auto-Regressive Models (VARs), produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, Flux, DeepFloyd IF) and VARs (e.g., Infinity) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we explore safety fine-tuning of the text encoder underlying major DM architectures using a customized dataset. Thereby, we suppress NSFW generation while preserving overall image and text generation quality. Finally, to advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. ToxicBench provides a curated dataset of harmful prompts, new metrics, and an evaluation pipeline assessing both NSFW-ness and generation quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models.

arXiv information

Authors	Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch
Published	2025-02-10 14:58:43+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
