AlphaMaze: Enhancing Large Language Models’ Spatial Intelligence via GRPO

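Abstract

Large language models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks that require genuine visual spatial reasoning.
This paper introduces a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation.
First, supervised fine-tuning (SFT) on a curated dataset of tokenized maze representations teaches the model to predict step-by-step movement commands.
Next, Group Relative Policy Optimization (GRPO), a technique used in DeepSeekR1, is applied with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors.
Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning raises accuracy to 93%.
Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of the approach to bridge the gap between language models and visual spatial tasks.
These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.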
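
As a rough illustration of what predicting step-by-step movement commands involves, the sketch below checks whether a command sequence actually solves a maze. Everything in it is an assumption made for illustration: the command names (`up`, `down`, `left`, `right`), the grid encoding with `#` as a wall cell, and the function name `follows_valid_path` are hypothetical and are not taken from the paper, whose token format is not given in the abstract.

```python
# Hypothetical command vocabulary (assumption; not the paper's actual tokens).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def follows_valid_path(grid, start, target, commands):
    """Return True if the commands lead from start to target
    without leaving the grid or stepping onto a wall cell ('#')."""
    row, col = start
    for command in commands:
        if command not in MOVES:
            return False  # unknown command
        d_row, d_col = MOVES[command]
        row, col = row + d_row, col + d_col
        inside = 0 <= row < len(grid) and 0 <= col < len(grid[0])
        if not inside or grid[row][col] == "#":
            return False  # left the grid or hit a wall
    return (row, col) == target


# Toy 3x3 maze: '.' is an open cell, '#' is a wall (assumed encoding).
GRID = [
    "..#",
    ".##",
    "...",
]

solved = follows_valid_path(GRID, (0, 0), (2, 2), ["down", "down", "right", "right"])
```

A check of this kind is one plausible way to score a predicted path as correct or incorrect; the abstract does not say how the reported accuracy figures were computed.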

Abstract (original)

Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine-Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeekR1, with a carefully crafted reward function to refine the model’s sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
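
The "group relative" part of GRPO can be sketched briefly: several completions are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group to form an advantage. The snippet below is a minimal sketch of that normalization only, not the authors' training code. The function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation
    of the group of completions sampled for the same prompt."""
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population std; an implementation choice
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four sampled solutions to one maze:
# 1.0 if the path reached the target, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above their group's average get a positive advantage and those below get a negative one, so no separate value network is needed to provide the baseline.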

arXiv information

Authors	Alan Dao, Dinh Bach Vu
Published	2025-02-20 16:05:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
