Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning

Abstract

Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear whether this adaptation compromises their original language reasoning capabilities. In this work, we explore the effects of multimodal instruction tuning on language reasoning performance. We focus on LLaVA, a leading multimodal framework that integrates LLMs such as Vicuna or Mistral with the CLIP vision encoder. We compare the performance of the original LLMs with their multimodal-adapted counterparts across eight language reasoning tasks. Our experiments yield several key insights. First, the impact of multimodal learning varies between Vicuna and Mistral: we observe a degradation in language reasoning for Mistral but improvements for Vicuna across most tasks. Second, while multimodal instruction learning consistently degrades performance on mathematical reasoning tasks (e.g., GSM8K), it enhances performance on commonsense reasoning tasks (e.g., CommonsenseQA). Finally, we demonstrate that a training-free model merging technique can effectively mitigate the language reasoning degradation observed in multimodal-adapted Mistral and even improve performance on visual tasks.
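
The abstract does not say which training-free merging technique is used. Purely as an illustration, the sketch below assumes the simplest form of weight merging: linear interpolation between the original LLM's parameters and those of its multimodal-adapted counterpart. The function name `merge_weights`, the mixing coefficient `alpha`, and the plain-dictionary representation of parameters are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, alpha: float) -> Weights:
    """Linearly interpolate two parameter sets with matching names and sizes.

    alpha = 0.0 returns the base (original LLM) weights;
    alpha = 1.0 returns the tuned (multimodal-adapted) weights.
    This is an assumed merging rule, not necessarily the paper's method.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("parameter names differ between the two models")

    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged
```

No gradient steps are involved, which is the sense in which such a merge is training-free. Under this assumption, `alpha` controls how much of the multimodal adaptation is retained relative to the original language model.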

arXiv information

Authors	Neale Ratzlaff, Man Luo, Xin Su, Vasudev Lal, Phillip Howard
Published	2024-12-04 16:56:20+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, DeepL
