CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

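Abstract

Rapid progress in large vision-language models (LVLMs) has driven significant advances in multimodal tasks, enabling models to interpret, reason about, and generate outputs across both the visual and textual domains.
Although they excel at generative tasks, existing LVLMs often face limitations on tasks that require high-fidelity representation learning, such as producing image or text embeddings for retrieval.
Recent work has proposed finetuning LVLMs for representation learning, but the finetuned models often lose their generative capabilities because of the representation-learning training paradigm.
To address this trade-off, we introduce CAFe, a contrastive-autoregressive finetuning framework that enhances LVLMs for both representation and generation tasks.
By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks and achieves state-of-the-art results on both multimodal retrieval and multimodal generation benchmarks, including object hallucination (OH) mitigation.
CAFe establishes a new framework that combines embedding and generation capabilities in a single model, laying a foundation for future multimodal models that excel at both retrieval precision and coherent output generation.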

Abstract (original)

The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed finetuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
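
The abstract says that a contrastive objective is integrated with autoregressive language modeling, but it does not give the loss itself. The following is a minimal sketch of what such a joint objective could look like, not the authors' implementation. It assumes an InfoNCE-style contrastive term over in-batch negatives and a plain next-token cross-entropy term, added together. The function names, the `temperature` default, and the `lm_weight` mixing coefficient are all hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(query_embs, target_embs, temperature=0.05):
    """InfoNCE-style loss (assumed form): query i should match target i,
    and every other target in the batch acts as a negative."""
    q = [l2_normalize(v) for v in query_embs]
    t = [l2_normalize(v) for v in target_embs]
    total = 0.0
    for i, qi in enumerate(q):
        sims = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        total -= log_softmax(sims)[i]
    return total / len(q)


def autoregressive_loss(token_logits, next_tokens):
    """Mean next-token cross-entropy over a sequence of logit rows."""
    total = 0.0
    for logits, tok in zip(token_logits, next_tokens):
        total -= log_softmax(logits)[tok]
    return total / len(next_tokens)


def combined_loss(query_embs, target_embs, token_logits, next_tokens,
                  lm_weight=1.0, temperature=0.05):
    """Hypothetical joint objective: contrastive term plus a weighted
    language-modeling term."""
    return (contrastive_loss(query_embs, target_embs, temperature)
            + lm_weight * autoregressive_loss(token_logits, next_tokens))
```

In this sketch, setting `lm_weight` to zero reduces the objective to pure contrastive finetuning, the setting the abstract associates with lost generative ability. A positive weight keeps the language-modeling signal in every update.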

arXiv information

Authors	Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hanchao Yu
Published	2025-03-25 17:57:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
