Qwen2.5-VL Technical Report

Summary

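We introduce Qwen2.5-VL, the latest flagship model in the Qwen vision-language series, which shows significant advances in both foundational capabilities and new functionality.
Qwen2.5-VL takes a major step forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension.
A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points.
It also provides robust structured data extraction from invoices, forms, and tables, along with detailed analysis of charts, diagrams, and layouts.
To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended duration (up to hours) while localizing events at the granularity of seconds.
This allows the model to natively perceive spatial scale and temporal dynamics without relying on traditional normalization techniques.
By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating window attention, the authors reduce computational overhead while maintaining native resolution.
As a result, Qwen2.5-VL excels not only at static image and document understanding but also as an interactive visual agent capable of reasoning, tool use, and task execution in real-world scenarios such as operating computers and mobile devices.
Qwen2.5-VL is available in three sizes, covering use cases from edge AI to high-performance computing.
The flagship Qwen2.5-VL-72B model matches state-of-the-art models such as GPT-4o and Claude 3.5 Sonnet, and is particularly strong in document and diagram understanding.
In addition, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.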
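
The summary states that window attention reduces computational overhead while keeping native resolution. As a rough illustration of why, the sketch below counts query-key pairs for a grid of image patches under full attention versus attention restricted to non-overlapping windows. The function names, the grid sizes, and the window size are hypothetical choices for this example and are not taken from the report; the actual ViT design may differ.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every patch attends to every patch."""
    n = height * width
    return n * n


def window_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when patches attend only inside non-overlapping windows.

    Assumption for this sketch: windows at the right and bottom edges may be
    smaller than window x window and are left unpadded.
    """
    total = 0
    for top in range(0, height, window):
        for left in range(0, width, window):
            h = min(window, height - top)
            w = min(window, width - left)
            total += (h * w) ** 2
    return total


if __name__ == "__main__":
    # Hypothetical 64 x 64 patch grid with 8 x 8 windows.
    full = full_attention_pairs(64, 64)
    windowed = window_attention_pairs(64, 64, 8)
    print(full, windowed, full // windowed)
```

With full attention the pair count grows quadratically in the number of patches, while with fixed-size windows it grows roughly linearly, which is why the saving becomes larger as the input resolution increases.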

Summary (original)

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

arXiv information

Authors	Shuai Bai,Keqin Chen,Xuejing Liu,Jialin Wang,Wenbin Ge,Sibo Song,Kai Dang,Peng Wang,Shijie Wang,Jun Tang,Humen Zhong,Yuanzhi Zhu,Mingkun Yang,Zhaohai Li,Jianqiang Wan,Pengfei Wang,Wei Ding,Zheren Fu,Yiheng Xu,Jiabo Ye,Xi Zhang,Tianbao Xie,Zesen Cheng,Hang Zhang,Zhibo Yang,Haiyang Xu,Junyang Lin
Published	2025-02-19 18:00:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
