Scaling Capability in Token Space: An Analysis of Large Vision Language Model

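Summary

Scaling capability has been widely validated in neural language models with respect to the number of parameters and the size of the training data.
An important question is whether a similar scaling capability also exists with respect to the number of vision tokens in large vision language models.
This study fills that gap by investigating the relationship between the number of vision tokens and the performance of vision language models.
Our theoretical analysis and empirical evaluation show that the model exhibits scalable performance \(S(N_l)\) with respect to the number of vision tokens \(N_l\), characterized by the relationship \(S(N_l) \approx (c/N_l)^{\alpha}\).
We also investigate the impact of a fusion mechanism that integrates the user's question with the vision tokens.
The results reveal two key findings.
First, the scaling capability remains intact when the fusion mechanism is incorporated.
Second, the fusion mechanism improves model performance, particularly when the user's question is task-specific and relevant.
The analysis was conducted on fifteen diverse benchmarks spanning a broad range of tasks and domains, and validates the effectiveness of the proposed approach.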
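
To make the relationship \(S(N_l) \approx (c/N_l)^{\alpha}\) concrete, here is a minimal sketch of how \(c\) and \(\alpha\) could be estimated from (token count, score) pairs. It is not the authors' procedure: the function name `fit_power_law`, the token counts, and the synthetic scores are all hypothetical, and the abstract does not state the sign of \(\alpha\) or whether \(S\) is an accuracy-like or error-like quantity. The sketch assumes positive scores and a non-zero \(\alpha\), and uses the fact that taking logarithms gives \(\log S = \alpha \log c - \alpha \log N_l\), which is a straight line in log-log space.

```python
import numpy as np


def fit_power_law(num_tokens, scores):
    """Estimate (c, alpha) in S(N) = (c / N)**alpha by a log-log line fit.

    Taking logs: log S = alpha * log c - alpha * log N,
    so the fitted slope is -alpha and the intercept is alpha * log c.
    Assumes all inputs are positive and the fitted alpha is non-zero.
    """
    log_n = np.log(np.asarray(num_tokens, dtype=float))
    log_s = np.log(np.asarray(scores, dtype=float))
    slope, intercept = np.polyfit(log_n, log_s, 1)
    alpha = -slope
    c = np.exp(intercept / alpha)
    return c, alpha


# Synthetic, noise-free data generated from assumed values (not from the paper).
true_c, true_alpha = 2.0, 0.5
tokens = np.array([16, 64, 144, 256, 576])
scores = (true_c / tokens) ** true_alpha

c_hat, alpha_hat = fit_power_law(tokens, scores)
print(f"c = {c_hat:.4f}, alpha = {alpha_hat:.4f}")
```

On real benchmark scores the fit would be noisy, so it would be worth inspecting the residuals in log-log space before reading much into the fitted exponent.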

Abstract (original)

The scaling capability has been widely validated in neural language models with respect to the number of parameters and the size of training data. One important question is that does the scaling capability also exists similarly with respect to the number of vision tokens in large vision language Model? This study fills the gap by investigating the relationship between the number of vision tokens and the performance on vision-language models. Our theoretical analysis and empirical evaluations demonstrate that the model exhibits scalable performance \(S(N_l)\) with respect to the number of vision tokens \(N_l\), characterized by the relationship \(S(N_l) \approx (c/N_l)^{\alpha}\). Furthermore, we also investigate the impact of a fusion mechanism that integrates the user’s question with vision tokens. The results reveal two key findings. First, the scaling capability remains intact with the incorporation of the fusion mechanism. Second, the fusion mechanism enhances model performance, particularly when the user’s question is task-specific and relevant. The analysis, conducted on fifteen diverse benchmarks spanning a broad range of tasks and domains, validates the effectiveness of the proposed approach.

arXiv information

Authors	Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao
Published	2024-12-30 11:00:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
