ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Summary

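Screen user interfaces (UIs) and infographics share a similar visual language and design principles, and play important roles in human communication and human-machine interaction.
We introduce ScreenAI, a vision-language model specialized in UI and infographics understanding.
Our model improves on the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets.
At the core of this mixture is a novel screen annotation task in which the model must identify the type and location of UI elements.
These text annotations are used to describe screens to large language models and to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale.
We run ablation studies to demonstrate the impact of these design choices.
With only 5 billion parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF, and Widget Captioning) and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared with models of similar size.
Finally, we release three new datasets: one focused on the screen annotation task and two focused on question answering.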
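
The flexible patching idea mentioned above can be illustrated with a minimal sketch: instead of resizing every screenshot to a fixed square, a patch grid is chosen that roughly preserves the image's aspect ratio while staying within a fixed patch budget. This follows the general approach described for pix2struct; the function name `patch_grid`, the patch size of 16, and the budget of 1024 patches are illustrative assumptions, not values taken from the abstract or from the ScreenAI implementation.

```python
import math


def patch_grid(height, width, patch_size=16, max_patches=1024):
    """Pick (rows, cols) with rows * cols <= max_patches, roughly keeping aspect ratio.

    Hypothetical helper for illustration only; all defaults are assumptions.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Scale factor such that the rescaled image fits the patch budget.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


# A tall phone screenshot gets more rows than columns.
rows, cols = patch_grid(2400, 1080)
print(rows, cols, rows * cols)
```

Under these assumptions, a tall mobile screenshot and a wide desktop screenshot receive differently shaped grids under the same budget, which avoids the distortion a fixed square resize would introduce.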

Abstract (original)

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

arXiv information

Authors	Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
Published	2024-02-19 17:03:36+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
