Multimodal Lego: Model Merging and Fine-Tuning Across Topologies and Modalities in Biomedicine

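Summary

Learning holistic computational representations of physical, chemical, or biological systems requires the ability to process information from different distributions and modalities within the same model.
Demand for multimodal machine learning models has therefore risen sharply for modalities beyond vision and language, such as sequences, graphs, time series, and tabular data.
Many multimodal fusion and alignment approaches are available, but most of them require end-to-end training, scale quadratically with the number of modalities, cannot handle high modality imbalance in the training set, or are highly topology-specific, which makes them too restrictive for many biomedical learning tasks.
This paper introduces Multimodal Lego (MM-Lego), a general-purpose fusion framework that turns any set of encoders into a competitive multimodal model with no or minimal fine-tuning.
To achieve this, we introduce a wrapper for any unimodal encoder that enforces shape consistency between modality representations.
The wrapper harmonises these representations by learning features in the frequency domain, which enables model merging with little signal interference.
We show that MM-Lego 1) can be used as a model merging method that achieves competitive performance with end-to-end fusion models without any fine-tuning, 2) can operate on any unimodal encoder, and 3) is a model fusion method that, with minimal fine-tuning, surpasses all benchmarks in five out of seven datasets.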

Abstract (original)

Learning holistic computational representations in physical, chemical or biological systems requires the ability to process information from different distributions and modalities within the same model. Thus, the demand for multimodal machine learning models has sharply risen for modalities that go beyond vision and language, such as sequences, graphs, time series, or tabular data. While there are many available multimodal fusion and alignment approaches, most of them require end-to-end training, scale quadratically with the number of modalities, cannot handle cases of high modality imbalance in the training set, or are highly topology-specific, making them too restrictive for many biomedical learning tasks. This paper presents Multimodal Lego (MM-Lego), a general-purpose fusion framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We achieve this by introducing a wrapper for any unimodal encoder that enforces shape consistency between modality representations. It harmonises these representations by learning features in the frequency domain to enable model merging with little signal interference. We show that MM-Lego 1) can be used as a model merging method which achieves competitive performance with end-to-end fusion models without any fine-tuning, 2) can operate on any unimodal encoder, and 3) is a model fusion method that, with minimal fine-tuning, surpasses all benchmarks in five out of seven datasets.
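
The two ideas named in the abstract, shape-consistent latents and merging in the frequency domain, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the helper names `to_fixed_shape` and `merge_in_frequency_domain` are hypothetical, a fixed random projection stands in for the learned encoder wrapper, and the aggregation rule (harmonic mean of magnitudes, circular mean of phases) is one plausible choice that the abstract does not confirm.

```python
import numpy as np


def to_fixed_shape(features, latent_shape, seed):
    """Hypothetical stand-in for the wrapper: map features of any shape
    to a fixed latent shape using a random (not learned) projection."""
    rng = np.random.default_rng(seed)
    flat = np.asarray(features, dtype=float).ravel()
    n_out = latent_shape[0] * latent_shape[1]
    proj = rng.standard_normal((n_out, flat.size)) / np.sqrt(flat.size)
    return (proj @ flat).reshape(latent_shape)


def merge_in_frequency_domain(latents, eps=1e-8):
    """Merge same-shaped latents in the Fourier domain.

    Assumed rule: harmonic mean of magnitudes, circular mean of phases.
    """
    spectra = np.stack([np.fft.fft2(z) for z in latents])
    # Harmonic mean of the magnitudes; eps guards against division by zero.
    magnitude = len(latents) / np.sum(1.0 / (np.abs(spectra) + eps), axis=0)
    # Circular mean of the phases, computed from summed unit phasors.
    phase = np.angle(np.sum(np.exp(1j * np.angle(spectra)), axis=0))
    merged_spectrum = magnitude * np.exp(1j * phase)
    # Inputs are real, so the imaginary part is only rounding noise.
    return np.fft.ifft2(merged_spectrum).real


# Two toy "modalities" with different input shapes.
tabular = np.arange(10.0)
image = np.ones((4, 6))
latents = [
    to_fixed_shape(tabular, (8, 8), seed=0),
    to_fixed_shape(image, (8, 8), seed=1),
]
merged = merge_in_frequency_domain(latents)
```

Because every latent has the same shape, the merge step needs no modality-specific logic, and adding a modality only adds one more array to the list. A plain average of the spectra would be equivalent to averaging the latents directly, since the Fourier transform is linear, which is why the sketch uses a non-linear aggregation instead.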

arXiv information

Authors	Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik
Published	2025-04-16 16:43:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
