RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Summary

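Subject-driven text-to-image (T2I) generation aims to create images that match a given text description while preserving the visual identity of a referenced subject image.
Despite its broad downstream applicability, from enhanced personalization in image generation to consistent character representation in video rendering, progress in this field is limited by the lack of reliable automatic evaluation.
Existing methods either evaluate only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation.
To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction.
Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving gains of up to 6.4 points in textual alignment and up to 8.5 points in subject consistency.
It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.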

Summary (original)

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability, ranging from enhanced personalization in image generation to consistent character representation in video rendering, progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.

arXiv information

Authors	Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
Published	2025-04-24 12:44:51+00:00
arXiv site	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
