Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

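Summary

Reasoning about text-rich images, such as charts and documents, is an important application of vision-language models (VLMs).
However, VLMs often struggle in these domains because diverse text-rich vision-language data is scarce.
To address this, the authors present CoSyn, a framework that uses the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data.
Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) that renders synthetic images.
Because the underlying code serves as a textual representation of each synthetic image, CoSyn can again rely on a text-only LLM to generate high-quality instruction-tuning data.
Using CoSyn, the authors built a dataset of 400K images and 2.7M rows of vision-language instruction-tuning data.
Comprehensive experiments on seven benchmarks show that models trained on this synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash.
In addition, CoSyn can produce synthetic pointing data that lets VLMs ground information within input images, which suggests its potential for developing multimodal agents that can act in real-world environments.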

Abstract (original)

Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., ‘nutrition fact labels’), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.
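
The two-stage flow described in the abstract (domain text to rendering code, then rendering code to instruction-tuning data) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `llm` callable, the prompt wording, the `Q: ... | A: ...` line format, and every function and class name here are hypothetical, and the step that actually renders the code into an image is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


@dataclass
class SyntheticExample:
    domain: str
    code: str                         # rendering code returned by the LLM
    qa_pairs: List[Tuple[str, str]]   # (question, answer) pairs derived from the code


def build_code_prompt(domain: str, language: str = "HTML") -> str:
    # Assumed prompt wording; the abstract does not give the real prompts.
    return f"Write {language} code that renders a realistic image of: {domain}."


def build_qa_prompt(code: str) -> str:
    # The code acts as the textual representation of the image,
    # so a text-only model can write questions about it.
    return (
        "The following code renders an image. Using only the code, write "
        "question-answer pairs, one per line, as 'Q: ... | A: ...'.\n\n" + code
    )


def parse_qa(text: str) -> List[Tuple[str, str]]:
    # Keep only lines that follow the assumed 'Q: ... | A: ...' format.
    pairs = []
    for line in text.splitlines():
        if line.startswith("Q:") and "| A:" in line:
            question, answer = line.split("| A:", 1)
            pairs.append((question[2:].strip(), answer.strip()))
    return pairs


def synthesize(domain: str, llm: LLM, language: str = "HTML") -> SyntheticExample:
    code = llm(build_code_prompt(domain, language))   # stage 1: domain -> code
    qa_pairs = parse_qa(llm(build_qa_prompt(code)))   # stage 2: code -> Q&A data
    return SyntheticExample(domain, code, qa_pairs)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model so the sketch runs offline.
    if prompt.startswith("Write"):
        return "<table><tr><td>Calories</td><td>120</td></tr></table>"
    return "Q: How many calories are listed? | A: 120"


if __name__ == "__main__":
    example = synthesize("nutrition fact labels", fake_llm)
    print(example.code)
    print(example.qa_pairs)
```

Swapping `fake_llm` for a call to a real text-only model, and adding a renderer for the returned code, would be the natural next steps. Both depend on tooling the abstract does not specify.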

arXiv information

Authors	Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
Published	2025-02-20 18:55:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
