Towards Visual Text Grounding of Multimodal Large Language Model

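Summary

Despite recent progress in Multimodal Large Language Models (MLLMs), they still struggle with visual text grounding, particularly on text-rich document images.
Document images such as scanned forms and infographics pose significant challenges because of their complex layouts and dense textual content.
Current benchmarks do not fully address these challenges, because they focus mainly on visual grounding in natural images rather than in text-rich document images.
To bridge this gap, we introduce TRIG (Text-Rich Image Grounding), a new task with a newly designed instruction dataset for benchmarking and improving the text-rich image grounding capabilities of MLLMs in document question answering.
Specifically, we propose an OCR-LLM-human interaction pipeline that, drawing on four diverse datasets, produces 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic samples.
A comprehensive evaluation of various MLLMs on the proposed benchmark reveals substantial limitations in their grounding ability on text-rich images.
We also propose two simple and effective TRIG methods, one based on general instruction tuning and the other on plug-and-play efficient embedding.
Fine-tuning MLLMs on the synthetic dataset improves their spatial reasoning and grounding capabilities.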

Abstract (original)

Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-neglectable limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By finetuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.

arXiv information

Authors	Ming Li, Ruiyi Zhang, Jian Chen, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun
Published	2025-04-07 12:01:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
