A Survey on (M)LLM-Based GUI Agents

Abstract

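Graphical user interface (GUI) agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts into sophisticated AI-driven systems that can understand and execute complex interface operations.
This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies.
We identify and analyze the four fundamental components that make up modern GUI agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension;
(2) exploration mechanisms that build and maintain knowledge bases through internal modeling, historical experience, and external information retrieval;
(3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and
(4) interaction systems that manage action generation with robust safety controls.
Through rigorous analysis of these components, we show how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms.
We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization.
The survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, and outlines promising research directions for enhancing the capabilities of GUI agents.
Our systematic review gives researchers and practitioners a thorough understanding of the field's current state and offers insights into future developments in intelligent interface automation.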

Abstract (original)

Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents’ capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field’s current state and offers insights into future developments in intelligent interface automation.

arXiv information

Authors	Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang
Published	2025-06-04 17:29:13+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
