Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

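Summary

Recently, there has been growing interest in using large language models (LLMs) to generate symbolic world models from textual descriptions.
Although LLMs have been studied extensively in the context of world modeling, prior work ran into several challenges, including randomness in evaluation, reliance on indirect metrics, and limited domain scope.
To address these limitations, we introduce Text2World, a new benchmark based on the Planning Domain Definition Language (PDDL) that features hundreds of diverse domains and adopts multi-criteria, execution-based metrics for more robust evaluation.
We benchmark current LLMs with Text2World and find that reasoning models trained with large-scale reinforcement learning outperform the other models.
However, even the best-performing model shows limited world modeling capability.
Building on these insights, we examine several promising strategies for strengthening the world modeling abilities of LLMs, including test-time scaling and agent training.
We hope that Text2World can serve as an important resource and lay the groundwork for future research on using LLMs as world models.
The project page is available at https://text-to-world.github.io/.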

Abstract (original)

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.

arXiv information

Authors	Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
Published	2025-02-18 17:59:48+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google