GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning

Abstract

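Spatial reasoning is an important faculty of human cognition with many practical applications. It is one of the core commonsense skills that is not purely language-based, and reaching a satisfactory (as opposed to optimal) solution requires at least a minimal degree of planning.
Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial $\textit{descriptions}$, rather than directly evaluating a plan produced by the LLM in response to a $\textit{specific}$ spatial reasoning problem.
In this paper, we construct a large-scale benchmark called GRASP, which consists of 16,000 grid-based environments in which the agent is tasked with an energy collection problem.
These environments comprise 100 grid instances instantiated from each of 160 different grid settings, which involve five different energy distributions, two modes of agent starting position, two distinct obstacle configurations, and three kinds of agent constraints.
Using GRASP, we compare classic baseline approaches, such as random walk and greedy search methods, with advanced LLMs such as GPT-3.5-Turbo, GPT-4o, and GPT-o1-mini.
The experimental results indicate that even these advanced LLMs struggle to consistently achieve satisfactory solutions.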

Abstract (original)

Spatial reasoning, an important faculty of human cognition with many practical applications, is one of the core commonsense skills that is not purely language-based and, for satisfying (as opposed to optimal) solutions, requires some minimum degree of planning. Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial $\textit{descriptions}$ rather than directly evaluate a plan produced by the LLM in response to a $\textit{specific}$ spatial reasoning problem. In this paper, we construct a large-scale benchmark called GRASP, which consists of 16,000 grid-based environments where the agent is tasked with an energy collection problem. These environments include 100 grid instances instantiated using each of the 160 different grid settings, involving five different energy distributions, two modes of agent starting position, and two distinct obstacle configurations, as well as three kinds of agent constraints. Using GRASP, we compare classic baseline approaches, such as random walk and greedy search methods, with advanced LLMs like GPT-3.5-Turbo, GPT-4o, and GPT-o1-mini. The experimental results indicate that even these advanced LLMs struggle to consistently achieve satisfactory solutions.
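
The benchmark size quoted above can be checked with a short enumeration. The sketch below is illustrative only: the abstract gives the factor counts (five energy distributions, two starting modes, two obstacle configurations, three kinds of agent constraints) and the totals (160 settings, 16,000 environments), but not how the three constraints combine. The code assumes that each constraint is an independent on/off toggle, which is one reading consistent with the stated total of 160. All variable names are hypothetical.

```python
from itertools import product

# Factor counts taken from the abstract; the individual values are
# represented by placeholder indices because the abstract does not name them.
ENERGY_DISTRIBUTIONS = range(5)
START_MODES = range(2)
OBSTACLE_CONFIGS = range(2)

# ASSUMPTION: each of the three agent constraints is an independent
# on/off toggle, giving 2**3 = 8 constraint combinations.
CONSTRAINT_COMBINATIONS = list(product([False, True], repeat=3))

INSTANCES_PER_SETTING = 100  # stated in the abstract

# One setting = one choice from every factor.
settings = list(
    product(ENERGY_DISTRIBUTIONS, START_MODES, OBSTACLE_CONFIGS, CONSTRAINT_COMBINATIONS)
)
total_environments = len(settings) * INSTANCES_PER_SETTING

print(len(settings), total_environments)  # prints: 160 16000
```

Under this assumption, 5 x 2 x 2 x 8 = 160 settings, and 160 x 100 = 16,000 environments, matching the figures in the abstract.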

arXiv information

Authors	Zhisheng Tang, Mayank Kejriwal
Published	2025-01-17 04:29:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
