Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Summary

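Although foundation agents built on large models (LMs) and advanced tools have succeeded in specific tasks and scenarios, they still fail to generalize across scenarios, mainly because observations and actions differ dramatically from one scenario to the next.
In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only the computer's screen images (and possibly audio) as input and producing keyboard and mouse operations as output, much as in human-computer interaction.
The main challenges in achieving GCC are: 1) multimodal observations for decision-making, 2) the requirement for accurate keyboard and mouse control, 3) the need for long-term memory and reasoning, and 4) the ability to explore efficiently and self-improve.
To target GCC, we introduce Cradle, an agent framework with six main modules: 1) information gathering, to extract multi-modality information, 2) self-reflection, to rethink past experiences, 3) task inference, to choose the best next task, 4) skill curation, to generate and update skills relevant to given tasks, 5) action planning, to generate specific operations for keyboard and mouse control, and 6) memory, to store and retrieve past experiences and known skills.
To demonstrate Cradle's capabilities for generalization and self-improvement, we deploy it in the complex AAA game Red Dead Redemption II, as a preliminary attempt towards GCC with a challenging target.
To the best of our knowledge, our work is the first to enable LMM-based agents to follow the main storyline and finish real missions in complex AAA games with minimal reliance on prior knowledge or resources.
The project website is at https://baai-agents.github.io/Cradle/.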
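
The six modules listed above can be pictured as one decision step of an agent loop. The following is a minimal sketch under stated assumptions, not the Cradle implementation: every name in it (`Memory`, `gather_information`, `step`, and so on) is hypothetical, the order in which the modules are called is an assumption that the abstract does not specify, and the placeholder rules stand in for what would presumably be large-model calls. The screenshot is represented by a plain string and the key names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Memory:
    """Module 6 (hypothetical): stores past experiences and known skills."""
    experiences: List[str] = field(default_factory=list)
    skills: Dict[str, List[str]] = field(default_factory=dict)

    def recent(self, k: int = 1) -> List[str]:
        # Retrieve the k most recent experiences.
        return self.experiences[-k:]


def gather_information(screenshot: str) -> str:
    """Module 1: placeholder for extracting information from a screen image."""
    return f"observed:{screenshot}"


def self_reflect(memory: Memory) -> str:
    """Module 2: placeholder that looks back at the last stored experience."""
    recent = memory.recent(1)
    return "no history" if not recent else f"last step was {recent[0]}"


def infer_task(observation: str, reflection: str) -> str:
    """Module 3: placeholder rule for choosing the next task."""
    return "close_map" if "map" in observation else "open_map"


def curate_skills(task: str, memory: Memory) -> None:
    """Module 4: add a skill for the task if none is known yet."""
    # Illustrative key bindings only; not taken from the paper or the game.
    known = {"open_map": ["key_press:m"], "close_map": ["key_press:esc"]}
    memory.skills.setdefault(task, known.get(task, ["noop"]))


def plan_actions(task: str, memory: Memory) -> List[str]:
    """Module 5: turn the chosen task into keyboard and mouse operations."""
    return list(memory.skills[task])


def step(screenshot: str, memory: Memory) -> List[str]:
    """Run one decision step: screen input in, keyboard/mouse operations out."""
    observation = gather_information(screenshot)
    reflection = self_reflect(memory)
    task = infer_task(observation, reflection)
    curate_skills(task, memory)
    actions = plan_actions(task, memory)
    memory.experiences.append(f"{task} -> {actions}")
    return actions
```

Calling `step` repeatedly with successive frames and a shared `Memory` instance gives the loop; the point of the sketch is only that the interface matches the GCC setting, with screen content as the sole input and keyboard and mouse operations as the sole output.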

Summary (original)

Despite the success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize to different scenarios, mainly due to dramatic differences in the observations and actions across scenarios. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input, and producing keyboard and mouse operations as output, similar to human-computer interaction. The main challenges of achieving GCC are: 1) the multimodal observations for decision-making, 2) the requirements of accurate control of keyboard and mouse, 3) the need for long-term memory and reasoning, and 4) the abilities of efficient exploration and self-improvement. To target GCC, we introduce Cradle, an agent framework with six main modules, including: 1) information gathering to extract multi-modality information, 2) self-reflection to rethink past experiences, 3) task inference to choose the best next task, 4) skill curation for generating and updating relevant skills for given tasks, 5) action planning to generate specific operations for keyboard and mouse control, and 6) memory for storage and retrieval of past experiences and known skills. To demonstrate the capabilities of generalization and self-improvement of Cradle, we deploy it in the complex AAA game Red Dead Redemption II, serving as a preliminary attempt towards GCC with a challenging target. To our best knowledge, our work is the first to enable LMM-based agents to follow the main storyline and finish real missions in complex AAA games, with minimal reliance on prior knowledge or resources. The project website is at https://baai-agents.github.io/Cradle/.

arXiv information

Authors	Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu
Published	2024-03-07 14:41:56+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
