MokA: Multimodal Low-Rank Adaptation for MLLMs

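Abstract

This paper shows that most current efficient multimodal fine-tuning methods share a key limitation: they are borrowed directly from LLMs, often neglect the intrinsic differences of multimodal scenarios, and can even prevent all modalities from being fully utilized.
Motivated by empirical observation, the authors argue that unimodal adaptation and cross-modal adaptation are both essential to fine-tuning MLLMs effectively.
From this perspective, they propose Multimodal Low-Rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into account.
MokA compresses unimodal information with modality-specific parameters while explicitly enhancing cross-modal interaction, so that both unimodal and cross-modal adaptation are ensured.
Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text) and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.).
Consistent improvements indicate the efficacy and versatility of the proposed method.
Ablation studies and an efficiency evaluation are also conducted to assess the method fully.
Overall, the authors believe MokA provides a more targeted solution for efficiently adapting MLLMs and paves the way for further exploration.
The project page is at https://gewu-lab.github.io/MokA.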

Abstract (original)

In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.
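
The abstract names two ingredients: modality-specific parameters that compress unimodal information, and an explicit cross-modal interaction step. The following is only a minimal NumPy sketch of how such a low-rank update could be arranged, not the authors' implementation. Everything in it is an assumption made for illustration: the function name `multimodal_lowrank_update`, one down-projection matrix per modality in `A`, a single shared up-projection `B`, and an attention-style mixing in which text tokens attend to non-text tokens inside the low-rank space.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_lowrank_update(tokens, modality_ids, A, B, text_id=0):
    """Hypothetical low-rank update with per-modality compression.

    tokens:       (n, d_in) token features
    modality_ids: (n,) integer modality label per token
    A:            dict mapping modality id -> (d_in, r) down-projection
    B:            (r, d_out) up-projection shared by all modalities (assumed)
    """
    n, r = tokens.shape[0], B.shape[0]
    z = np.zeros((n, r))
    # Unimodal step: each token is compressed by its own modality's matrix.
    for m, A_m in A.items():
        mask = modality_ids == m
        z[mask] = tokens[mask] @ A_m
    # Cross-modal step (assumed form): text tokens attend to non-text
    # tokens in the rank-r space and add the attended summary.
    text = modality_ids == text_id
    other = ~text
    if text.any() and other.any():
        scores = z[text] @ z[other].T / np.sqrt(r)
        z[text] = z[text] + softmax(scores) @ z[other]
    # Project back up; the result would be added to the frozen layer's output.
    return z @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
tokens = rng.normal(size=(5, d_in))
modality_ids = np.array([0, 0, 1, 1, 2])  # 0 = text, 1 and 2 = other modalities
A = {m: rng.normal(size=(d_in, r)) for m in (0, 1, 2)}
B = rng.normal(size=(r, d_out))
delta = multimodal_lowrank_update(tokens, modality_ids, A, B)
```

In this sketch, an input that contains only text tokens reduces to an ordinary single-matrix low-rank update, which makes the per-modality matrices and the mixing step the only differences from plain LoRA-style adaptation.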

arXiv information

Authors	Yake Wei, Yu Miao, Dongzhan Zhou, Di Hu
Published	2025-06-05 16:04:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
