Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

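Summary

Vision-language models (VLMs) combine visual perception with the general capabilities of large language models (LLMs), such as reasoning.
However, the mechanisms by which these two abilities combine and contribute remain poorly understood.
In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models.
Unlike previous work, which often focuses on merging models of the same kind, we propose merging models across modalities, which makes it possible to incorporate the reasoning capabilities of LLMs into VLMs.
Through extensive experiments, we demonstrate that model merging offers a successful pathway for transferring reasoning abilities from LLMs to VLMs in a training-free manner.
Moreover, we use the merged models to understand the internal mechanisms of perception and reasoning and how merging affects them.
We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers.
After merging, all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged.
These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.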
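
The abstract does not state which merging rule is used, so the following is only a minimal sketch of one common training-free approach: linear interpolation of the parameters that two models share. The function name `merge_shared_params`, the mixing weight `alpha`, and the toy parameter names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_shared_params(vlm: Params, llm: Params, alpha: float = 0.9) -> Params:
    """Linearly interpolate the parameters present in both models.

    alpha is the fraction kept from the VLM; (1 - alpha) comes from the LLM.
    Parameters that exist only in the VLM (for example a vision encoder)
    are copied unchanged. This is an assumed rule, not the paper's method.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    merged: Params = {}
    for name, vlm_w in vlm.items():
        llm_w = llm.get(name)
        if llm_w is None:
            # No counterpart in the LLM: keep the VLM weights as they are.
            merged[name] = list(vlm_w)
            continue
        if len(llm_w) != len(vlm_w):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [alpha * v + (1.0 - alpha) * l for v, l in zip(vlm_w, llm_w)]
    return merged


# Hypothetical toy checkpoints: the LLM has no vision parameters.
vlm = {"vision.proj": [1.0, 2.0], "layers.0.mlp": [0.0, 4.0]}
llm = {"layers.0.mlp": [2.0, 0.0]}
merged = merge_shared_params(vlm, llm, alpha=0.5)
```

Because only the shared language-model parameters are interpolated, the vision-side weights stay untouched and no gradient step is needed, which is what makes this kind of merge training-free.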

Summary (original)

Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore to compose perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.

arXiv information

Authors	Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
Published	2025-05-08 17:56:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
