UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

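Summary

Autonomous agents that navigate graphical user interfaces (GUIs) to automate tasks such as document editing and file management can greatly enhance computer workflows.
Existing research focuses on online settings, while desktop environments, which are critical for many professional and everyday tasks, remain underexplored because of data collection challenges and licensing issues.
We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments.
Unlike online benchmarks, UI-Vision provides:
(i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs), across 83 software applications; and
(ii) three tasks ranging from fine-grained to coarse-grained (Element Grounding, Layout Grounding, and Action Prediction), with well-defined metrics to rigorously evaluate agents' performance in desktop environments.
Our evaluation reveals critical limitations in state-of-the-art models such as UI-TARS-72B, including difficulty with understanding professional software, spatial reasoning, and complex actions like drag-and-drop.
These findings highlight the challenges in developing fully autonomous computer use agents.
By releasing UI-Vision as open source, we aim to advance the development of more capable agents for real-world desktop tasks.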

Abstract (original)

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks-Element Grounding, Layout Grounding, and Action Prediction-with well-defined metrics to rigorously evaluate agents’ performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.
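
The abstract mentions bounding-box annotations and an Element Grounding task with well-defined metrics, but it does not spell those metrics out. As an illustration only, the sketch below scores grounding the way GUI grounding benchmarks commonly do: a predicted click point counts as correct when it falls inside the annotated bounding box. The names `BBox` and `grounding_accuracy`, the pixel coordinate convention, and the metric itself are assumptions made for this example, not UI-Vision's published definitions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class BBox:
    """Hypothetical axis-aligned box in screen pixels (not UI-Vision's schema)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the box edges are treated as inside.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Iterable[Tuple[float, float]],
    targets: Iterable[BBox],
) -> float:
    """Fraction of predicted (x, y) points that land inside their target box.

    Assumed metric for illustration; returns 0.0 when there are no samples.
    """
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    hits = sum(1 for (x, y), box in pairs if box.contains(x, y))
    return hits / len(pairs)


if __name__ == "__main__":
    boxes = [BBox(0, 0, 10, 10), BBox(20, 20, 30, 30)]
    clicks = [(5, 5), (0, 0)]  # first click hits its box, second misses
    print(grounding_accuracy(clicks, boxes))
```

A point-in-box check is deliberately lenient about where inside the element the agent clicks. A benchmark that predicts regions rather than points, as a layout-level task might, would more plausibly use an overlap measure such as intersection over union.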

arXiv information

Authors	Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Published	2025-05-06 17:43:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
