Towards Evaluating Generalist Agents: An Automated Benchmark in Open World

Summary

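Evaluating generalist agents is difficult, both because their abilities are so broad and because current benchmarks are limited in how well they measure true generalization.
We introduce Minecraft Universe (MCU), a fully automated benchmarking framework set in the open-world game Minecraft.
MCU dynamically generates and evaluates a wide range of tasks and provides three core components: 1) a task generation mechanism with high degrees of freedom and variability, 2) an ever-expanding set of more than 3K composable atomic tasks, and 3) a general evaluation framework that supports open-ended task assessment.
By integrating large language models (LLMs), MCU dynamically creates diverse environments for each evaluation, fostering agent generalization.
The framework uses a vision-language model (VLM) to generate evaluation criteria automatically and reaches over 90% agreement with human ratings across multi-dimensional assessments, indicating that MCU is a scalable and explainable solution for evaluating generalist agents.
We also show that state-of-the-art foundation models perform well on specific tasks but often struggle as task diversity and difficulty increase.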

Abstract (original)

Evaluating generalist agents presents significant challenges due to their wide-ranging abilities and the limitations of current benchmarks in assessing true generalization. We introduce the Minecraft Universe (MCU), a fully automated benchmarking framework set within the open-world game Minecraft. MCU dynamically generates and evaluates a broad spectrum of tasks, offering three core components: 1) a task generation mechanism that provides high degrees of freedom and variability, 2) an ever-expanding set of over 3K composable atomic tasks, and 3) a general evaluation framework that supports open-ended task assessment. By integrating large language models (LLMs), MCU dynamically creates diverse environments for each evaluation, fostering agent generalization. The framework uses a vision-language model (VLM) to automatically generate evaluation criteria, achieving over 90% agreement with human ratings across multi-dimensional assessments, which demonstrates that MCU is a scalable and explainable solution for evaluating generalist agents. Additionally, we show that while state-of-the-art foundational models perform well on specific tasks, they often struggle with increased task diversity and difficulty.
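
The abstract reports over 90% agreement between VLM and human ratings across several evaluation dimensions, but does not state how agreement is computed. As a rough illustration only, the sketch below assumes agreement means the fraction of items on which both raters give the same label, computed separately per dimension. The dimension names, labels, and toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def agreement_rate(vlm: List[str], human: List[str]) -> float:
    """Fraction of items on which the two raters give the same label."""
    if len(vlm) != len(human):
        raise ValueError("rating lists must have the same length")
    if not vlm:
        raise ValueError("rating lists must not be empty")
    matches = sum(v == h for v, h in zip(vlm, human))
    return matches / len(vlm)


def agreement_by_dimension(
    vlm: Dict[str, List[str]], human: Dict[str, List[str]]
) -> Dict[str, float]:
    """Agreement rate for each evaluation dimension rated by the humans."""
    return {dim: agreement_rate(vlm[dim], human[dim]) for dim in human}


# Hypothetical toy data: four rated items per dimension.
# The dimension names and labels are assumptions for illustration.
human_ratings = {
    "task_progress": ["A", "B", "A", "A"],
    "efficiency": ["B", "B", "A", "tie"],
}
vlm_ratings = {
    "task_progress": ["A", "B", "A", "B"],
    "efficiency": ["B", "B", "A", "tie"],
}

scores = agreement_by_dimension(vlm_ratings, human_ratings)
for dimension, score in scores.items():
    print(f"{dimension}: {score:.2f}")
```

Reporting the rate per dimension rather than as one pooled number makes it easier to see where an automatic rater diverges from human judgment.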

arXiv information

Authors	Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Zilong Zheng, Yitao Liang
Published	2024-11-29 10:39:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
