Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

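Summary

Recent research has placed growing emphasis on training vision-language models (VLMs) with long, detailed image captions.
Small-scale VLMs, however, often struggle to balance the richness of these captions against the risk of hallucinating content during fine-tuning.
This paper investigates how well VLMs adapt to such captions.
To quantify caption quality, the authors propose Decomposed NLI (DNLI), an evaluation framework that breaks a generated caption into individual propositions and assesses each one in isolation.
This fine-grained analysis reveals a critical balance between capturing descriptive detail and preventing hallucinations.
The findings show that simply reducing caption complexity or applying standard data curation techniques does not effectively resolve the problem.
To address it, the authors introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts the training data to the model's existing knowledge and visual understanding.
KnowAda minimizes hallucinations while preserving high descriptiveness.
The approach is validated across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness.
The results show that KnowAda outperforms various baselines on both automatic metrics and human evaluations.
The authors state that they will release their code and models.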

Summary (original)

Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model’s existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations. We will release our code and models.
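The abstract describes DNLI only at a high level: a caption is broken into propositions and each is judged on its own. A minimal sketch of that idea, under loud assumptions, might look like the following. The sentence-based splitter, the three label names, the `judge` callback, and the fraction-based aggregation are all hypothetical stand-ins chosen for illustration; the abstract does not specify how the authors decompose captions, which NLI model they use, or how scores are aggregated.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical label set for the per-proposition judgment (an assumption,
# not taken from the paper).
LABELS = ("entailed", "contradicted", "neutral")


def split_into_propositions(caption: str) -> List[str]:
    """Naive stand-in for proposition decomposition: one proposition per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def score_caption(caption: str, judge: Callable[[str], str]) -> Dict[str, float]:
    """Judge each proposition in isolation and return the fraction per label.

    `judge` is a placeholder for an NLI-style check of one proposition
    against a reference (for example a ground-truth description).
    """
    propositions = split_into_propositions(caption)
    counts = Counter(judge(p) for p in propositions)
    total = len(propositions)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


def toy_judge(proposition: str) -> str:
    """Toy judge backed by a hand-written lookup table, for demonstration only."""
    known = {
        "A dog sits on a sofa.": "entailed",
        "The sofa is red.": "contradicted",
    }
    return known.get(proposition, "neutral")


result = score_caption("A dog sits on a sofa. The sofa is red.", toy_judge)
print(result)
```

In this toy run, one of the two propositions is supported and one is contradicted, which illustrates the trade-off the abstract mentions: adding more detail to a caption creates more propositions that can each turn out to be hallucinated.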

arXiv information

Authors	Moran Yanuka, Assaf Ben Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes
Published	2025-01-24 15:12:58+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
