A LLM Benchmark based on the Minecraft Builder Dialog Agent Task

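Abstract

In this work, we propose adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially oriented tasks and for informing builder agent design.
Previous works have proposed corpora with a variety of complex structures and human-written instructions.
We instead attempt to provide a comprehensive synthetic benchmark for testing builder agents over a series of distinct tasks composed of common building operations.
We believe this approach allows us to probe the specific strengths and weaknesses of different agents and to test the ability of LLMs in the challenging area of spatial reasoning and vector-based math.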

Abstract (original)

In this work we proposing adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially orientated tasks, and informing builder agent design. Previous works have proposed corpora with varying complex structures, and human written instructions. We instead attempt to provide a comprehensive synthetic benchmark for testing builder agents over a series of distinct tasks that comprise of common building operations. We believe this approach allows us to probe specific strengths and weaknesses of different agents, and test the ability of LLMs in the challenging area of spatial reasoning and vector based math.

arXiv information

Authors	Chris Madge, Massimo Poesio
Published	2024-07-17 16:52:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
