TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

Summary

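Reading dense text and locating objects within images are fundamental abilities for large vision-language models (LVLMs) tasked with advanced jobs.
Previous LVLMs, including strong proprietary models such as GPT-4o, have struggled to excel at both tasks simultaneously.
Moreover, previous LVLMs with fine-grained perception require thousands of tokens per image, which makes them resource-intensive.
TextHawk2 is a bilingual LVLM that features efficient fine-grained perception and demonstrates state-of-the-art performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens.
The key improvements are: (1) Token compression: building on the efficient architecture of its predecessor, TextHawk2 reduces the number of tokens per image by a factor of 16, which makes it possible to train and deploy the TextHawk series with minimal resources.
(2) Visual encoder reinforcement: the visual encoder is strengthened through LVLM co-training, unlocking its potential for tasks it had not previously seen, such as Chinese OCR and grounding.
(3) Data diversity: the sources of pre-training data are diversified while a comparable scale of 100 million samples is maintained.
TextHawk2 is evaluated on multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale, achieving 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.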
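
The DocVQA figure in the summary is reported as ANLS (Average Normalized Levenshtein Similarity). The following is a minimal sketch of how that metric is commonly computed, not the authors' evaluation code. The function names, the 0.5 cutoff default, and the lowercase/strip normalization are assumptions made for illustration.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best similarity against any accepted answer.

    Assumption: a normalized distance at or above `threshold` scores zero,
    and strings are compared case-insensitively after stripping whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        p = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            a = answer.strip().lower()
            longest = max(len(p), len(a))
            distance = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under these assumptions, an exact match scores 1, a near miss receives partial credit, and an answer that differs in half or more of its characters scores 0.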

Summary (original)

Reading dense text and locating objects within images are fundamental abilities for Large Vision-Language Models (LVLMs) tasked with advanced jobs. Previous LVLMs, including superior proprietary models like GPT-4o, have struggled to excel in both tasks simultaneously. Moreover, previous LVLMs with fine-grained perception cost thousands of tokens per image, making them resource-intensive. We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens. Critical improvements include: (1) Token Compression: Building on the efficient architecture of its predecessor, TextHawk2 significantly reduces the number of tokens per image by 16 times, facilitating training and deployment of the TextHawk series with minimal resources. (2) Visual Encoder Reinforcement: We enhance the visual encoder through LVLM co-training, unlocking its potential for previously unseen tasks like Chinese OCR and grounding. (3) Data Diversity: We maintain a comparable scale of 100 million samples while diversifying the sources of pre-training data. We assess TextHawk2 across multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale, such as achieving 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.
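
The grounding result is reported as accuracy@0.5, which is usually read as the share of predicted boxes whose intersection over union (IoU) with the ground-truth box reaches 0.5. The sketch below illustrates that reading and is not the authors' evaluation code. The `(x1, y1, x2, y2)` box layout, the function names, and the use of `>=` at the threshold are assumptions; some evaluation scripts use a strict `>` instead.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at_iou(predicted: Sequence[Box],
                    ground_truth: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the matching box meets the threshold."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, ground_truth) if iou(p, g) >= threshold)
    return hits / len(predicted)
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving an IoU of 1/7, which counts as a miss at the 0.5 threshold.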

arXiv information

Authors	Ya-Qi Yu, Minghui Liao, Jiwen Zhang, Jihao Wu
Published	2024-10-07 17:58:35+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
