InstructEngine: Instruction-driven Text-to-Image Alignment

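Summary

Reinforcement learning from human or AI feedback (RLHF/RLAIF) is widely used for preference alignment of text-to-image models.
Existing methods face limitations in terms of both data and algorithm.
On the data side, most approaches rely on manually annotated preference data, used either to fine-tune the generator directly or to train a reward model that supplies the training signal.
However, high annotation cost makes these approaches hard to scale, and a reward model consumes extra computation without guaranteeing accuracy.
On the algorithm side, most methods neglect the value of text and use image feedback only as a comparative signal, which is inefficient and sparse.
To alleviate these drawbacks, we propose the InstructEngine framework.
To address annotation cost, we first construct a taxonomy for text-to-image generation and then develop an automated data construction pipeline based on it.
Using advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs.
Finally, we introduce a cross-validation alignment method, which improves data efficiency by organizing semantically analogous samples into mutually comparable pairs.
Evaluations on DrawBench show that InstructEngine improves the performance of SD v1.5 and SDXL by 10.53% and 5.30%, respectively, outperforming state-of-the-art baselines, and an ablation study confirms the benefit of every InstructEngine component.
A win rate above 50% in human reviews further shows that InstructEngine aligns better with human preferences.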
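
The abstract does not spell out how semantically analogous samples are organized into mutually comparable pairs, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that each sample carries a hypothetical `group` key shared by analogous prompts, and that within a group each prompt's own image is treated as preferred over the image generated for a sibling prompt. The names `Sample`, `PreferencePair`, and `build_mutual_pairs` are illustrative and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Sample:
    group: str   # hypothetical key shared by semantically analogous prompts
    prompt: str  # the text instruction
    image: str   # placeholder for an image path or identifier


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # image assumed to match the prompt
    rejected: str  # image generated for an analogous but different prompt


def build_mutual_pairs(samples):
    """Turn each group of analogous samples into mutually comparable pairs.

    Assumption (not stated in the abstract): for two samples a and b in the
    same group, a's image is preferred for a's prompt and b's image is
    preferred for b's prompt, so every unordered pair yields two
    preference pairs.
    """
    by_group = defaultdict(list)
    for sample in samples:
        by_group[sample.group].append(sample)

    pairs = []
    for members in by_group.values():
        # Ordered pairs (a, b) with a != b cover both directions.
        for a, b in permutations(members, 2):
            pairs.append(PreferencePair(a.prompt, a.image, b.image))
    return pairs
```

Under these assumptions a group of n analogous samples yields n(n-1) preference pairs, which is one way the same generated images could be reused as both chosen and rejected examples. A group with a single sample contributes no pairs.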

Abstract (original)

Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manual annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and only take the image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce cross-validation alignment method, which refines data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves SD v1.5 and SDXL’s performance by 10.53% and 5.30%, outperforming state-of-the-art baselines, with ablation study confirming the benefits of InstructEngine’s all components. A win rate of over 50% in human reviews also proves that InstructEngine better aligns with human preferences.

arXiv information

Authors	Xingyu Lu,Yuhang Hu,YiFan Zhang,Kaiyu Jiang,Changyi Liu,Tianke Zhang,Jinpeng Wang,Bin Wen,Chun Yuan,Fan Yang,Tingting Gao,Di Zhang
Published	2025-04-14 15:36:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
