Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

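Abstract

This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs).
The approach employs a sequence of visual resamplers that capture visual details at various spatial scales.
This architecture not only leverages global and local visual context effectively, but also allows the number of visual tokens to be extended flexibly through a compound token scaling strategy, increasing the token count by up to 16x after pre-training.
As a result, Chain-of-Sight needs significantly fewer visual tokens in the pre-training phase than in the fine-tuning phase.
This intentional reduction of visual tokens during pre-training notably accelerates the process, cutting wall-clock training time by about 73%.
Empirical results on a series of vision-language benchmarks show that this acceleration comes without sacrificing performance: Chain-of-Sight matches or surpasses the standard pipeline that uses all visual tokens throughout training.
Scaling up the number of visual tokens used for pre-training further yields stronger performance, competitive with existing approaches on a series of benchmarks.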

Abstract (original)

This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches in a series of benchmarks.
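
The token-count claim can be made concrete with a little arithmetic. The sketch below is illustrative only: the pre-training budget of 32 visual tokens and the split of the 16x ceiling into two 4x factors are assumptions chosen for the example, not values stated in the abstract, and `scaled_token_count` is a hypothetical helper rather than part of any released code.

```python
# Illustrative arithmetic only. The concrete numbers below are assumptions,
# not values taken from the abstract.

def scaled_token_count(pretrain_tokens: int, scale_factors: list[int]) -> int:
    """Multiply a pre-training token budget by a list of scaling factors."""
    total = pretrain_tokens
    for factor in scale_factors:
        total *= factor
    return total


# Hypothetical pre-training budget of 32 visual tokens per image.
pretrain_tokens = 32

# Assumed decomposition of the 16x ceiling into two multiplicative 4x factors.
finetune_tokens = scaled_token_count(pretrain_tokens, [4, 4])

print(pretrain_tokens, finetune_tokens)  # 32 512
```

Under these assumptions the model would see 32 visual tokens per image during pre-training and up to 512 during fine-tuning. That gap is where the abstract locates the pre-training speedup.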

arXiv information

Authors	Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
Published	2024-07-22 17:33:49+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
