Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

Summary

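We present the Qwen2-VL series, an advanced upgrade of the earlier Qwen-VL models that redefines the conventional fixed-resolution approach to visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which lets the model dynamically process images of varying resolutions into different numbers of visual tokens.
This approach allows the model to produce more efficient and accurate visual representations that align closely with human perceptual processes.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), which facilitates the effective fusion of positional information across text, images, and videos.
It adopts a unified paradigm for processing both images and videos, strengthening the model's visual perception capabilities.
To explore the potential of large multimodal models, Qwen2-VL investigates scaling laws for large vision-language models (LVLMs).
By scaling both the model size (with 2B, 8B, and 72B parameter versions) and the amount of training data, the Qwen2-VL series achieves highly competitive performance.
Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across a range of multimodal benchmarks, and outperforms other generalist models.
Code is available at https://github.com/QwenLM/Qwen2-VL.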
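
To make the idea of dynamic resolution concrete, the following is a minimal sketch of how the number of visual tokens could grow with image size. It is not taken from the Qwen2-VL code base: the function name `visual_token_count` is hypothetical, and the patch size of 14 pixels and the 2x2 merging of adjacent patches are illustrative assumptions that the summary above does not state.

```python
import math


def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Estimate the visual token count for an image of the given size.

    Assumptions (not stated in the summary): the image is split into
    square patches of `patch` pixels, and each `merge` x `merge` group
    of adjacent patches is compressed into a single visual token.
    """
    unit = patch * merge  # pixels covered by one token along each side
    rows = math.ceil(height / unit)  # round up so partial groups still count
    cols = math.ceil(width / unit)
    return rows * cols


if __name__ == "__main__":
    # Larger images map to more tokens instead of being resized to one fixed shape.
    for h, w in [(224, 224), (448, 448), (672, 1008)]:
        print(f"{h}x{w} -> {visual_token_count(h, w)} visual tokens")
```

Under these assumptions the token count scales with image area, which is the contrast with a fixed-resolution pipeline that always emits the same number of tokens.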

Summary (original)

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model’s visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.

arXiv information

Authors	Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
Published	2024-09-18 17:59:32+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
