Harnessing Webpage UIs for Text-Rich Visual Understanding

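Summary

Text-rich visual understanding, the ability to process environments in which dense textual content is integrated with visuals, is essential for multimodal large language models (MLLMs) to interact effectively with structured environments.
To strengthen this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs).
Although they lack direct visual input, text-based LLMs can process the structured text representations found in webpage accessibility trees.
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset of 7.3 million samples from 1 million websites that covers diverse multimodal tasks and UI layouts.
Models trained on MultiUI not only excel at web UI tasks (up to a 48% improvement on VisualWebBench and a 19.1% gain in action accuracy on the web agent dataset Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation.
These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across a variety of scenarios.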
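
The synthesis step described in the summary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `AXNode` schema, the prompt wording, the output dictionary keys, and the `text_llm` callable (a placeholder for whatever text-only LLM is used). It only shows the general idea: the accessibility tree is serialized to text for the LLM, and the resulting instruction is paired with the screenshot rather than with the tree.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AXNode:
    """Toy stand-in for one accessibility-tree node (hypothetical schema)."""
    role: str
    name: str = ""
    children: list["AXNode"] = field(default_factory=list)


def serialize_tree(node: AXNode, depth: int = 0) -> str:
    """Render the tree as indented text so a text-only LLM can read it."""
    line = "  " * depth + f"[{node.role}] {node.name}".rstrip()
    parts = [line]
    for child in node.children:
        parts.append(serialize_tree(child, depth + 1))
    return "\n".join(parts)


def synthesize_sample(
    tree: AXNode,
    screenshot_path: str,
    text_llm: Callable[[str], str],
) -> dict:
    """Ask the text LLM for an instruction, then pair it with the screenshot.

    The prompt wording and the returned keys are illustrative assumptions.
    """
    prompt = (
        "Write one question and its answer about this webpage.\n"
        + serialize_tree(tree)
    )
    # The multimodal model later sees only the image and the instruction,
    # not the accessibility tree text used to generate it.
    return {"image": screenshot_path, "conversation": text_llm(prompt)}
```

In this sketch `text_llm` is passed in as a plain function, so the pairing logic can be exercised with a stub before any real model is connected.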

Summary (original)

Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks (achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on a web agent dataset Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.

arXiv information

Authors	Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
Published	2024-10-17 17:48:54+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
