You Only Look at Screens: Multimodal Chain-of-Action Agents

Abstract

Autonomous user interface (UI) agents aim to automate tasks by interacting with the user interface without manual intervention. Recent studies have investigated how to elicit the capabilities of large language models (LLMs) for effective engagement in diverse environments. To match the input-output requirements of LLMs, existing approaches are developed in a sandbox setting, where they rely on external tools and application-specific APIs to parse the environment into textual elements and to interpret the predicted actions. Consequently, those approaches often suffer from inefficient inference and risk propagating errors. To mitigate these challenges, we introduce Auto-UI, a multimodal solution that interacts directly with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique, which leverages a series of intermediate previous action histories and future action plans, to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark, AITW, with 30K unique instructions spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-UI achieves state-of-the-art performance, with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-UI.
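The abstract describes the chain-of-action idea only at a high level, so the following is a minimal sketch of how previous action histories and a future action plan might be represented, not the Auto-UI implementation. The `Action` class, its fields, the text format, and both helper functions are hypothetical names introduced for illustration; the screenshot input that a multimodal agent would also consume is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Action:
    """One UI action. The fields are illustrative, not the paper's schema."""
    action_type: str      # e.g. "click", "type", "scroll"
    argument: str = ""    # e.g. a screen position or the text to type

    def render(self) -> str:
        return f"{self.action_type}({self.argument})"


def build_chain_of_action_input(goal: str, history: List[Action],
                                max_history: int = 8) -> str:
    """Format the goal and the most recent actions as the text side of the input."""
    # Guard max_history <= 0 explicitly: history[-0:] would return the whole list.
    recent = history[-max_history:] if max_history > 0 else []
    rendered = ", ".join(a.render() for a in recent) or "none"
    return f"Previous actions: {rendered}\nGoal: {goal}"


def split_plan_and_action(predicted: List[Action]) -> Tuple[Action, List[Action]]:
    """Assume the first predicted action is executed now; the rest is the future plan."""
    if not predicted:
        raise ValueError("model produced no actions")
    return predicted[0], predicted[1:]
```

Under these assumptions, each step would rebuild the text input from the executed actions so far, and only the first element of the predicted sequence would be sent to the device, with the remainder kept as the plan.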

arXiv information

Authors	Zhuosheng Zhang, Aston Zhang
Published	2023-09-20 16:12:32+00:00
