LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks

Summary

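Text-rich images, in which text is the central visual element guiding overall understanding, are prevalent in real-world applications such as presentation slides, scanned documents, and web page snapshots.
Tasks involving multiple text-rich images are especially challenging because they require not only understanding the content of each image but also reasoning about the interrelationships and logical flow across multiple visual inputs.
Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle with such tasks because of two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution against visual feature sequence length.
To address these challenges, we propose LEOPARD, an MLLM designed specifically for vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning instances tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images.
Experiments across a wide range of benchmarks demonstrate the model's superior capabilities on text-rich, multi-image evaluations and competitive performance on general-domain evaluations.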
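
The summary only states that visual sequence length is allocated according to each image's original aspect ratio and resolution; it does not give the procedure. The following is a minimal sketch of one way such an allocation could work, not the authors' method. The function names, the 336-pixel tile size, and the 12-tile budget are all hypothetical choices made for illustration.

```python
import math


def choose_grid(width, height, max_tiles):
    """Pick a (cols, rows) tile grid with cols * rows <= max_tiles whose
    aspect ratio is closest to the image's. Hypothetical helper."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Compare aspect ratios on a log scale so 2:1 and 1:2 are symmetric.
            err = abs(math.log((cols / rows) / target))
            more_tiles = cols * rows > best[0] * best[1]
            # Prefer the closer ratio; on ties, prefer the finer grid.
            if err < best_err - 1e-9 or (abs(err - best_err) <= 1e-9 and more_tiles):
                best, best_err = (cols, rows), err
    return best


def allocate_tiles(sizes, total_budget, tile=336):
    """Split a shared tile budget across several images.

    sizes: list of (width, height) in pixels.
    total_budget: maximum number of tiles across all images (assumed knob).
    tile: side length of one square tile in pixels (assumed value).
    """
    if total_budget < len(sizes):
        raise ValueError("budget must allow at least one tile per image")
    # Tiles each image would need to be covered at native resolution.
    wanted = [math.ceil(w / tile) * math.ceil(h / tile) for w, h in sizes]
    # Reserve one tile per image, then share the rest in proportion to need.
    extra = [n - 1 for n in wanted]
    spare = total_budget - len(sizes)
    scale = min(1.0, spare / sum(extra)) if sum(extra) > 0 else 0.0
    caps = [1 + math.floor(e * scale) for e in extra]
    return [choose_grid(w, h, cap) for (w, h), cap in zip(sizes, caps)]


if __name__ == "__main__":
    # A landscape slide, a tall scanned page, and a small icon.
    images = [(1280, 720), (600, 1600), (300, 300)]
    for size, grid in zip(images, allocate_tiles(images, total_budget=12)):
        print(size, "->", grid)
```

In this sketch, larger images receive a larger share of the budget, and each grid follows the image's orientation, so a tall scanned page is split vertically rather than squashed into a square. A real encoder would additionally map each tile to a fixed number of visual tokens, which is outside what the summary describes.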

Summary (original)

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose LEOPARD, an MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model’s superior capabilities in text-rich, multi-image evaluations and competitive performance in general domain evaluations.

arXiv information

Authors	Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, Dong Yu
Published	2024-10-02 16:55:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
