Learning to Instruct for Visual Instruction Tuning

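Summary

We propose LIT, an advancement of visual instruction tuning (VIT).
VIT equips multimodal LLMs (MLLMs) with promising multimodal capabilities, but its current design choices often lead to overfitting and shortcut learning, which can degrade performance.
This gap arises from an overemphasis on instruction-following ability at the expense of actively understanding visual information.
Motivated by this, LIT takes a simple yet effective approach: it applies the loss function to both the instruction and the response sequences.
This seamlessly expands the training data and regularizes MLLMs against over-reliance on language priors.
As a result, LIT achieves a relative improvement of up to 9% on comprehensive multimodal benchmarks, while requiring no additional training data and incurring negligible computational overhead.
Notably, LIT also attains exceptional fundamental visual capabilities, improving captioning performance by up to 18% while alleviating hallucination in MLLMs.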
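
The core idea, computing the loss over the instruction tokens as well as the response tokens, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the segment tags, the `build_labels` and `masked_nll` helpers, the `IGNORE_INDEX` sentinel, and the toy probabilities below are all assumptions made for illustration. Next-token shifting is omitted for brevity.

```python
import math

# Assumed sentinel for positions excluded from the loss
# (mirrors the common -100 "ignore index" convention).
IGNORE_INDEX = -100


def build_labels(token_ids, segments, supervise_instruction):
    """Return per-token labels; excluded positions get IGNORE_INDEX.

    segments[i] is "image", "instruction", or "response" (hypothetical tags).
    """
    supervised = {"response"}
    if supervise_instruction:
        supervised.add("instruction")
    return [
        tok if seg in supervised else IGNORE_INDEX
        for tok, seg in zip(token_ids, segments)
    ]


def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over positions that are not ignored.

    token_probs[i] is the probability the model assigned to the true token i.
    """
    terms = [
        -math.log(p)
        for p, lab in zip(token_probs, labels)
        if lab != IGNORE_INDEX
    ]
    return sum(terms) / len(terms)


# Toy sequence: 2 image placeholders, 3 instruction tokens, 2 response tokens.
token_ids = [900, 901, 11, 12, 13, 21, 22]
segments = ["image", "image", "instruction", "instruction", "instruction",
            "response", "response"]
token_probs = [0.5, 0.5, 0.25, 0.5, 0.5, 0.5, 0.25]  # made-up model outputs

# Response-only supervision: instruction tokens are masked out.
response_only = build_labels(token_ids, segments, supervise_instruction=False)
# Instruction-and-response supervision: only image placeholders are masked.
both = build_labels(token_ids, segments, supervise_instruction=True)

loss_response_only = masked_nll(token_probs, response_only)
loss_both = masked_nll(token_probs, both)
supervised_counts = (
    sum(lab != IGNORE_INDEX for lab in response_only),
    sum(lab != IGNORE_INDEX for lab in both),
)
```

In this toy case the same sample supervises five tokens instead of two, which is one way to read the claim that the approach expands the training signal without adding data.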

Summary (original)

We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, LIT adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs from overly relying on language priors. Based on this merit, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs.

arXiv information

Authors	Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang
Published	2025-03-28 08:04:51+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
