Is Your Text-to-Image Model Robust to Caption Noise?

Summary

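In text-to-image (T2I) generation, a common training technique uses Vision Language Models (VLMs) to re-caption images.
VLMs are known to hallucinate, generating descriptive content that deviates from visual reality, yet the effect of such caption hallucinations on T2I generation performance remains under-explored.
Through an empirical investigation, we first build a comprehensive dataset of VLM-generated captions, and then systematically analyze how caption hallucination affects generation outcomes.
Our findings show that (1) differences in caption quality persistently affect model outputs during fine-tuning;
(2) VLM confidence scores serve as reliable indicators for detecting and characterizing noise-related patterns in the data distribution; and
(3) even subtle differences in caption fidelity have a significant effect on the quality of learned representations.
Together, these findings underscore the profound impact of caption quality on model performance and highlight the need for more sophisticated, robust training algorithms in T2I.
In response, we propose an approach that leverages VLM confidence scores to mitigate caption noise, thereby making T2I models more robust to caption hallucination.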
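
The summary does not say how the confidence scores are used, so the following is only a minimal sketch of one plausible reading: down-weighting each training sample's loss by the confidence of its caption. The function name `confidence_weighted_loss`, the `floor` parameter, and the assumption that confidences lie in [0, 1] are all hypothetical and are not taken from the paper.

```python
def confidence_weighted_loss(per_sample_losses, caption_confidences, floor=0.0):
    """Weighted mean of per-sample losses.

    Hypothetical illustration, not the paper's method: each sample's loss is
    weighted by the VLM confidence of its caption, so captions the VLM is
    unsure about (and which are more likely hallucinated) contribute less.
    """
    if len(per_sample_losses) != len(caption_confidences):
        raise ValueError("losses and confidences must have the same length")

    # Clamp each confidence into [floor, 1.0]; a floor above zero keeps
    # low-confidence samples from being discarded entirely.
    weights = [max(floor, min(1.0, c)) for c in caption_confidences]

    total_weight = sum(weights)
    if total_weight == 0:
        # Every caption had zero weight, so there is nothing to learn from.
        return 0.0

    weighted_sum = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted_sum / total_weight
```

With `floor=0.0`, a sample whose caption has zero confidence is ignored completely; raising `floor` trades some noise robustness for keeping more of the data in play.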

Summary (original)

In text-to-image (T2I) generation, a prevalent training technique involves utilizing Vision Language Models (VLMs) for image re-captioning. Even though VLMs are known to exhibit hallucination, generating descriptive content that deviates from the visual reality, the ramifications of such caption hallucinations on T2I generation performance remain under-explored. Through our empirical investigation, we first establish a comprehensive dataset comprising VLM-generated captions, and then systematically analyze how caption hallucination influences generation outcomes. Our findings reveal that (1) the disparities in caption quality persistently impact model outputs during fine-tuning; (2) VLM confidence scores serve as reliable indicators for detecting and characterizing noise-related patterns in the data distribution; and (3) even subtle variations in caption fidelity have significant effects on the quality of learned representations. These findings collectively emphasize the profound impact of caption quality on model performance and highlight the need for more sophisticated robust training algorithms in T2I. In response to these observations, we propose an approach leveraging VLM confidence scores to mitigate caption noise, thereby enhancing the robustness of T2I models against hallucination in captions.

arXiv information

Authors	Weichen Yu, Ziyan Yang, Shanchuan Lin, Qi Zhao, Jianyi Wang, Liangke Gui, Matt Fredrikson, Lu Jiang
Published	2024-12-27 08:53:37+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
