Multimodal Autoregressive Pre-training of Large Vision Encoders

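Summary

We introduce a new method for pre-training large-scale vision encoders.
Building on recent advances in autoregressive pre-training of vision models, we extend this framework to a multimodal setting covering images and text.
We present AIMV2, a family of generalist vision encoders characterized by a simple pre-training process, scalability, and strong performance across a wide range of downstream tasks.
This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens.
Our encoders excel not only in multimodal evaluations but also on vision benchmarks such as localization, grounding, and classification.
Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk.
Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.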
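
The summary states only that a multimodal decoder autoregressively generates raw image patches and text tokens. As a rough illustration of what such a training objective could look like, the sketch below combines a squared-error term over next-patch predictions with a cross-entropy term over next-token predictions. The specific loss forms, the `image_weight` parameter, and all function names are assumptions made for illustration and are not taken from the summary; this is not the authors' implementation.

```python
import math


def patch_regression_loss(pred_patches, target_patches):
    """Mean squared error between predicted and true next patches (assumed loss form)."""
    total, count = 0.0, 0
    for pred, target in zip(pred_patches, target_patches):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


def token_cross_entropy(logits_per_step, target_ids):
    """Mean negative log-likelihood of the true next text tokens."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


def multimodal_ar_loss(pred_patches, patches, text_logits, text_ids, image_weight=1.0):
    """Hypothetical combined objective: text loss plus weighted image loss.

    pred_patches[i] is the decoder's prediction for patches[i + 1];
    text_logits[j] is the decoder's prediction for text_ids[j].
    """
    image_loss = patch_regression_loss(pred_patches, patches[1:])
    text_loss = token_cross_entropy(text_logits, text_ids)
    return text_loss + image_weight * image_loss


# Toy example: three 2-pixel patches followed by one text token (vocabulary of 2).
patches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
perfect_preds = [[1.0, 1.0], [2.0, 2.0]]   # exact next-patch predictions
uniform_logits = [[0.0, 0.0]]              # decoder is undecided about the token
loss = multimodal_ar_loss(perfect_preds, patches, uniform_logits, [0])
print(round(loss, 4))  # 0.6931, i.e. log(2): zero image loss plus uniform text loss
```

In this toy setup the image term vanishes because every next patch is predicted exactly, so the total reduces to the cross-entropy of a uniform guess over two tokens. In a real system the predictions would come from a causal decoder conditioned on the vision encoder's output, which the sketch leaves out.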

Summary (original)

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.

arXiv information

Authors	Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby
Published	2024-11-21 18:31:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
