An Empirical Study of Multimodal Model Merging

Summary

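Model merging (for example, by interpolation or task arithmetic) fuses multiple models trained on different tasks into a single multi-task solution.
Previous studies have shown the technique to work when the models are trained on similar tasks and start from the same initialization.
This paper extends the idea to a multimodal setting by merging transformers trained on different modalities.
The study also pursues a new goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture to obtain a parameter-efficient, modality-agnostic architecture.
Through comprehensive experiments, the authors systematically investigate the key factors that affect performance after merging, including initialization, merging mechanisms, and model architecture.
They also propose two metrics that measure the distance between the weights to be merged and can serve as indicators of the merging outcome.
The analysis yields an effective training recipe that, via model merging, matches the performance of the modality-agnostic baseline (i.e., one pre-trained from scratch).
The method also significantly outperforms naive merging across tasks, with improvements of 3% on VQA, 7% on COCO retrieval, 25% on NLVR2, 14% on Flickr30k, and 3% on ADE20k.
The code is available at https://github.com/ylsung/vl-merging.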
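
As a rough illustration of the two merging mechanisms named above, the following is a minimal sketch of weight interpolation and task arithmetic on toy parameters. It is not the authors' implementation (that lives in the linked repository). The names `Weights`, `interpolate`, `task_arithmetic`, `alpha`, and `lam` are hypothetical, and plain Python lists stand in for model tensors.

```python
from typing import Dict, List

# Assumption: parameter name -> flattened values, a toy stand-in for tensors.
Weights = Dict[str, List[float]]


def interpolate(a: Weights, b: Weights, alpha: float = 0.5) -> Weights:
    """Element-wise linear interpolation: alpha * a + (1 - alpha) * b."""
    if a.keys() != b.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def task_arithmetic(init: Weights, finetuned: List[Weights], lam: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors (finetuned - init) back onto init."""
    merged: Weights = {}
    for name, base in init.items():
        merged[name] = [
            w0 + lam * sum(model[name][i] - w0 for model in finetuned)
            for i, w0 in enumerate(base)
        ]
    return merged


# Toy example: one shared initialization and two models trained from it.
init = {"layer.weight": [0.0, 1.0]}
vision = {"layer.weight": [1.0, 1.0]}
language = {"layer.weight": [0.0, 3.0]}

avg = interpolate(vision, language, alpha=0.5)
ta = task_arithmetic(init, [vision, language], lam=0.5)
print(avg, ta)
```

In this toy case the two results coincide: with two models and `lam = 0.5`, adding half of each task vector to the shared initialization is the same as averaging the two models. The mechanisms differ for other values of `lam`.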

Abstract (original)

Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks to generate a multi-task solution. The technique has been proven successful in previous studies, where the models are trained on similar tasks and with the same initialization. In this paper, we expand on this concept to a multimodal setup by merging transformers trained on different modalities. Furthermore, we conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors impacting model performance after merging, including initialization, merging mechanisms, and model architectures. We also propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes. Our analysis leads to an effective training recipe for matching the performance of the modality-agnostic baseline (i.e., pre-trained from scratch) via model merging. Our method also outperforms naive merging significantly on various tasks, with improvements of 3% on VQA, 7% on COCO retrieval, 25% on NLVR2, 14% on Flickr30k and 3% on ADE20k. Our code is available at https://github.com/ylsung/vl-merging

arXiv information

Authors	Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, Lijuan Wang
Published	2023-10-11 15:08:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
