Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges

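Abstract

Spatiotemporal reasoning plays a key role in cyber-physical systems (CPS).
Despite advances in large language models (LLMs) and large reasoning models (LRMs), their ability to reason about complex spatiotemporal signals remains underexplored.
This paper proposes STARK, a hierarchical spatiotemporal reasoning benchmark, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation).
The authors curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges that models answer either directly or through a Python code interpreter.
Evaluating 3 LRMs and 8 LLMs, they find that LLMs achieve only limited success on tasks requiring geometric reasoning (e.g., multilateration or triangulation), particularly as complexity increases.
Surprisingly, LRMs show robust performance across tasks of varying difficulty, often rivaling or surpassing traditional first-principle-based methods.
The results show that on reasoning tasks requiring world knowledge, the performance gap between LLMs and LRMs narrows, with some LLMs even surpassing LRMs.
However, the LRM o3 model continues to achieve leading performance across all evaluated tasks, a result attributed primarily to the larger size of the reasoning models.
By providing a structured framework for identifying limitations in the spatiotemporal reasoning of LLMs and LRMs, STARK motivates future innovations in model architectures and reasoning paradigms for intelligent CPS.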

Abstract (original)

Spatiotemporal reasoning plays a key role in Cyber-Physical Systems (CPS). Despite advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs), their capacity to reason about complex spatiotemporal signals remains underexplored. This paper proposes a hierarchical SpatioTemporal reAsoning benchmaRK, STARK, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation). We curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges where models answer directly or by Python Code Interpreter. Evaluating 3 LRMs and 8 LLMs, we find LLMs achieve limited success in tasks requiring geometric reasoning (e.g., multilateration or triangulation), particularly as complexity increases. Surprisingly, LRMs show robust performance across tasks with various levels of difficulty, often competing or surpassing traditional first-principle-based methods. Our results show that in reasoning tasks requiring world knowledge, the performance gap between LLMs and LRMs narrows, with some LLMs even surpassing LRMs. However, the LRM o3 model continues to achieve leading performance across all evaluated tasks, a result attributed primarily to the larger size of the reasoning models. STARK motivates future innovations in model architectures and reasoning paradigms for intelligent CPS by providing a structured framework to identify limitations in the spatiotemporal reasoning of LLMs and LRMs.
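For readers unfamiliar with the geometric tasks named in the abstract, the following is a minimal sketch of 2D multilateration, the kind of first-principles method the LRMs are compared against. It is an illustration only and is not taken from the paper or the STARK benchmark: the function name `multilaterate_2d`, the linearized least-squares formulation, and the noise-free ranges are all assumptions made for this example.

```python
import math

def multilaterate_2d(anchors, ranges):
    """Estimate (x, y) from anchor positions and measured distances.

    Hypothetical helper for illustration; not part of STARK.
    Subtracting the first range equation from the others removes the
    quadratic terms, leaving a linear system solved by least squares.
    """
    if len(anchors) != len(ranges) or len(anchors) < 3:
        raise ValueError("need at least three anchors with one range each")

    (x0, y0), d0 = anchors[0], ranges[0]

    # Accumulate the 2x2 normal equations (A^T A) p = A^T b.
    s_aa = s_ab = s_bb = s_ac = s_bc = 0.0
    for (xi, yi), di in zip(anchors[1:], ranges[1:]):
        a = 2.0 * (xi - x0)
        b = 2.0 * (yi - y0)
        c = d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2
        s_aa += a * a
        s_ab += a * b
        s_bb += b * b
        s_ac += a * c
        s_bc += b * c

    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("anchors are (nearly) collinear; position is ambiguous")

    # Cramer's rule for the 2x2 system.
    x = (s_ac * s_bb - s_ab * s_bc) / det
    y = (s_aa * s_bc - s_ab * s_ac) / det
    return x, y
```

Calling `multilaterate_2d` with four anchors at the corners of a square and the exact distances to an interior point recovers that point up to floating-point error. With noisy ranges the same code returns a least-squares estimate, and a real baseline would likely refine it with a nonlinear solver.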

arXiv information

Authors	Pengrui Quan, Brian Wang, Kang Yang, Liying Han, Mani Srivastava
Published	2025-05-27 16:52:19+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
