A Single Transformer for Scalable Vision-Language Modeling

Abstract

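We present SOLO, a single Transformer for scalable vision-language modeling.
Current large vision-language models (LVLMs) such as LLaVA mostly adopt heterogeneous architectures that connect a pre-trained visual encoder to a large language model (LLM) to support visual recognition and complex reasoning.
These models achieve remarkable performance with relatively lightweight training, but they have four main scalability limitations. (1) Visual capacity is bounded by the pre-trained visual encoder, which is typically an order of magnitude smaller than the LLM.
(2) The heterogeneous architecture complicates the use of established hardware and software infrastructure.
(3) Studying scaling laws for such an architecture requires considering three separate components (visual encoder, connector, and LLM), which complicates the analysis.
(4) Using an existing visual encoder usually means following a predefined image pre-processing specification, such as reshaping inputs into fixed-resolution square images, which makes it difficult to process and train on high-resolution images or images with unusual aspect ratios.
A unified single-Transformer architecture like SOLO effectively addresses these scalability concerns in LVLMs.
However, its limited adoption today likely stems from the absence of a reliable training recipe that balances both modalities and ensures stable training for billion-scale models.
This paper introduces the first open-source training recipe for developing SOLO, an open-source 7B LVLM, using moderate academic resources.
The recipe consists of initialization from an LLM, sequential pre-training on ImageNet and web-scale data, and instruction fine-tuning on curated high-quality datasets.
In extensive evaluation, SOLO performs comparably to LLaVA-v1.5-7B and excels in particular at visual mathematical reasoning.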
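The abstract does not say how SOLO turns an image into model inputs, so the following is only an illustrative sketch of limitation (4): one way a model without a fixed-resolution encoder could accept images of any aspect ratio is to cut them into square patches and feed the flattened patches as a sequence. The names `patchify` and `sequence_length`, the zero-padding choice, and the grayscale nested-list image format are all assumptions made for this example, not details from the paper.

```python
from typing import List

# Illustrative image type: rows of grayscale pixel values (an assumption).
Image = List[List[float]]


def sequence_length(height: int, width: int, patch: int) -> int:
    """Number of patches produced for an image of the given size."""
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    return rows * cols


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened patch vectors in row-major order.

    The image is zero-padded on the bottom and right so that H and W become
    multiples of `patch`. It is never resized to a fixed square, so the
    sequence length grows with the resolution and follows the aspect ratio.
    """
    if patch <= 0:
        raise ValueError("patch must be positive")
    if not image or not image[0]:
        return []
    height, width = len(image), len(image[0])
    padded_h = -(-height // patch) * patch
    padded_w = -(-width // patch) * patch

    def pixel(r: int, c: int) -> float:
        # Pixels outside the original image are padding.
        return image[r][c] if r < height and c < width else 0.0

    patches = []
    for top in range(0, padded_h, patch):
        for left in range(0, padded_w, patch):
            patches.append([
                pixel(r, c)
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


# A wide 3 x 5 image with 2 x 2 patches is padded to 4 x 6.
wide = [[float(r * 5 + c) for c in range(5)] for r in range(3)]
tokens = patchify(wide, 2)
print(len(tokens), len(tokens[0]))
```

In a real model each flattened patch would then be mapped to an embedding (for example by a learned linear layer) before entering the Transformer; that step is omitted here because the abstract gives no details about it.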

Abstract (original)

We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with large language models (LLMs) to facilitate visual recognition and complex reasoning. Although achieving remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) Study of scaling laws on such architecture must consider three separate components – visual encoder, connector, and LLMs, which complicates the analysis. (4) The use of existing visual encoders typically requires following a pre-defined specification of image inputs pre-processing, for example, by reshaping inputs to fixed-resolution square images, which presents difficulties in processing and training on high-resolution images or those with unusual aspect ratio. A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs; however, its limited adoption in the modern context likely stems from the absence of reliable training recipes that balance both modalities and ensure stable training for billion-scale models. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM using moderate academic resources. The training recipe involves initializing from LLMs, sequential pre-training on ImageNet and web-scale data, and instruction fine-tuning on our curated high-quality datasets. On extensive evaluation, SOLO demonstrates performance comparable to LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning.

arXiv information

Authors	Yangyi Chen, Xingyao Wang, Hao Peng, Heng Ji
Published	2024-11-13 18:21:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
