The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Abstract

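This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within one architecture.
Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder and presents a more minimalist architecture design.
Rather than introducing new architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of the visual and textual modalities.
The paper systematically compares SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs.
By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs.
Notably, removing the pretrained ViT component enhances SAIL's scalability and results in significantly different cross-modal information flow patterns.
Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B on vision tasks such as semantic segmentation.
Code and models are available at https://github.com/bytedance/SAIL.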
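
The abstract names mix-attention but does not spell out the masking rule. As a rough sketch under one plausible reading (an assumption, not the authors' implementation), text tokens attend causally while tokens inside the same contiguous image span attend to each other in both directions. The function name `mixed_attention_mask` and the example sequence are hypothetical.

```python
def mixed_attention_mask(is_image):
    """Return an n x n list of booleans; mask[i][j] is True when token i may attend to token j.

    Assumed rule (not taken from the abstract): causal attention everywhere,
    plus bidirectional attention among tokens of the same contiguous image span.
    """
    n = len(is_image)
    # Label each contiguous run of image tokens with its own id; 0 marks text.
    span_id, current, prev = [], 0, False
    for flag in is_image:
        if flag and not prev:
            current += 1
        span_id.append(current if flag else 0)
        prev = bool(flag)
    # Token i may attend to token j if j is not in the future (causal),
    # or if both tokens sit in the same image span (bidirectional).
    return [
        [j <= i or (span_id[i] != 0 and span_id[i] == span_id[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sequence: two text tokens, three image patches, two text tokens.
mask = mixed_attention_mask([0, 0, 1, 1, 1, 0, 0])
```

In this toy sequence the three image patches see one another regardless of order, while the text tokens before and after them remain strictly causal.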

Abstract (original)

This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL’s properties-including scalability, cross-modal information flow patterns, and visual representation capabilities-with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL’s scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.

arXiv information

Authors	Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang
Published	2025-04-14 17:50:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
