MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

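Summary

Speculative decoding significantly accelerates language model inference by letting a lightweight draft model propose multiple tokens that a larger target model verifies simultaneously.
However, applying this technique to vision-language models (VLMs) raises two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that take visual context into account.
The paper introduces Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach.
MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning, using responses generated by the target VLM, to align token predictions.
Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families show that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually grounded tasks.
MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.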

Abstract (original)

Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM’s vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.
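
The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. It assumes greedy decoding and uses toy stand-ins for the two models: `draft_next`, `target_next`, `speculative_step`, and `generate` are hypothetical names introduced here, not part of MASSV or any library. The per-round count below covers only draft tokens the target agreed with; the paper's exact definition of accepted length may differ.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(
    prefix: List[Token], draft_next: NextToken, target_next: NextToken, k: int
) -> Tuple[List[Token], int]:
    """One draft-then-verify round with greedy acceptance.

    Returns (tokens to append, number of draft tokens accepted).
    """
    # The draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # The target checks every proposed position. A real system does this in
    # one batched forward pass; the toy version loops for clarity.
    out: List[Token] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            out.append(expected)
            return out, len(out) - 1
        out.append(tok)
        ctx.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    out.append(target_next(ctx))
    return out, k


def generate(
    prompt: List[Token],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
    max_new: int,
) -> Tuple[List[Token], List[int]]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    seq = list(prompt)
    accepted_per_round: List[int] = []
    while len(seq) - len(prompt) < max_new:
        new_tokens, n_accepted = speculative_step(seq, draft_next, target_next, k)
        seq.extend(new_tokens)
        accepted_per_round.append(n_accepted)
    return seq[: len(prompt) + max_new], accepted_per_round


# Toy models (assumptions for illustration): the target counts upward mod 10;
# the draft agrees except right after token 4, where it guesses wrong.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, accepted = generate([0], draft_next, target_next, k=3, max_new=8)
```

With greedy acceptance the output is identical to decoding with the target alone; the draft only changes how many target passes are needed. A better-aligned drafter is accepted more often per round, which is the quantity the abstract reports as improved.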

arXiv information

Authors	Mugilan Ganesan, Shane Segal, Ankur Aggarwal, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa
Published	2025-05-15 17:37:00+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
