Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus

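Summary

Understanding mobile UIs is important for enabling various interaction tasks, such as UI automation and accessibility.
Previous mobile UI modeling has often relied on a screen's view hierarchy, which directly provides the structural data of the UI, in the hope of bypassing the difficult task of visual modeling from screen pixels.
However, view hierarchies are not always available, and they are often corrupted by missing object descriptions or misaligned structure information.
As a result, although using view hierarchies may offer short-term gains, it may ultimately hinder the applicability and performance of the model.
In this paper, we propose Spotlight, a vision-only approach to mobile UI understanding.
Specifically, we enhance a vision-language model that takes only a UI screenshot and a region of interest on the screen (the focus) as input.
This general architecture of Spotlight is easily scalable and can perform a range of UI modeling tasks.
Our experiments show that the model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as input.
Furthermore, we explore the multi-task learning and few-shot prompting capabilities of the proposed model and show promising results in the multi-task learning direction.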
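
The summary says the model's only inputs are a screenshot and a region of interest (the focus). As a rough illustration of what such an input pair might look like, here is a minimal sketch that represents the focus as a pixel bounding box and scales it to resolution-independent coordinates. The `FocusRegion` class, its field names, and the normalization step are assumptions made for illustration; the abstract does not specify how Spotlight actually encodes the focus.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FocusRegion:
    """Hypothetical region of interest, in screenshot pixel coordinates."""

    left: float
    top: float
    right: float
    bottom: float

    def normalized(self, width: int, height: int) -> Tuple[float, float, float, float]:
        """Scale the box to [0, 1] so it does not depend on screen resolution."""
        if width <= 0 or height <= 0:
            raise ValueError("screenshot dimensions must be positive")
        inside_x = 0 <= self.left < self.right <= width
        inside_y = 0 <= self.top < self.bottom <= height
        if not (inside_x and inside_y):
            raise ValueError("focus region must lie inside the screenshot")
        return (
            self.left / width,
            self.top / height,
            self.right / width,
            self.bottom / height,
        )


# Illustrative values only: a UI element on a 1080x1920 screenshot.
focus = FocusRegion(left=108, top=192, right=540, bottom=384)
box = focus.normalized(width=1080, height=1920)
```

Normalizing the box is one common way to let the same region description apply across devices with different screen sizes; whether the paper does this is not stated here.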

Summary (original)

Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, with the hope to bypass challenging tasks of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, despite the use of view hierarchies could offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen — the focus — as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction.

arXiv information

Authors	Gang Li, Yang Li
Published	2023-02-24 01:41:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
