HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

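Summary

The remarkable success of the autoregressive paradigm has driven major progress in Multimodal Large Language Models (MLLMs), with powerful models such as Show-o, Transfusion, and Emu3 achieving notable advances in unified image understanding and generation.
For the first time, the authors uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two.
Building on this insight, they propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs.
Specifically, homologous data is taken as input to curate homologous preference data for both understanding and generation.
Through Pair-DPO and self-play iterative optimization, HermesFlow uses this homologous preference data to effectively align multimodal understanding and generation.
Extensive experiments demonstrate the significant superiority of the approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation.
These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models.
Code: https://github.com/Gen-Verse/HermesFlow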

Summary (original)

The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
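The abstract names Pair-DPO but does not state its objective. As a rough illustration only, the sketch below computes the standard DPO loss for one preference pair and then sums it over an understanding pair and a generation pair derived from the same input. The summation, the function names (`dpo_loss`, `paired_dpo_loss`), and the default `beta` are assumptions made for this example, not the authors' formulation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def paired_dpo_loss(understanding, generation, beta=0.1):
    """Hypothetical paired objective (an assumption, not the paper's definition):
    the sum of a DPO term on the understanding pair and one on the generation pair.
    Each argument is a tuple (logp_w, logp_l, ref_logp_w, ref_logp_l).
    """
    return dpo_loss(*understanding, beta=beta) + dpo_loss(*generation, beta=beta)

# Usage: the policy already favors the preferred caption slightly more than
# the reference does, while the generation pair is still undecided.
loss = paired_dpo_loss((-1.0, -3.0, -1.5, -2.5), (-4.0, -4.0, -4.0, -4.0))
```

When the policy and reference agree on a pair, its term equals log 2; the term shrinks as the policy raises the preferred response relative to the rejected one.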

arXiv information

Authors	Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
Published	2025-02-17 18:57:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
