STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

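Summary

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning.
We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task that requires identifying object transformations across images taken from varying viewpoints.
Traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, while sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence.
To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR.
Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning.
Comprehensive evaluations show that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios.
Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects in order to improve spatial reasoning.
Our work provides critical insights for advancing research on MLLMs and reasoning models.
The code, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.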
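
The reward idea described above (partial credit, a penalty for over-enumerating, a penalty for answering nothing) can be illustrated with a small sketch. This is not the authors' reward function: the `Transformation` type, the `dense_reward` name, the credit tiers (2.0 / 1.0 / 0.5), and the penalty values are all hypothetical, chosen only to show how such a dense signal differs from an all-or-nothing reward.

```python
from typing import NamedTuple, Sequence


class Transformation(NamedTuple):
    """Hypothetical record of one object change between two images."""
    object_id: int
    attribute: str  # e.g. "color", "size"
    value: str      # e.g. "red", "large"


def dense_reward(
    predicted: Sequence[Transformation],
    target: Sequence[Transformation],
    wrong_penalty: float = 1.0,   # assumed value, not from the paper
    empty_penalty: float = 2.0,   # assumed value, not from the paper
) -> float:
    """Illustrative dense reward with partial credit and two penalties."""
    # Passive inaction: an empty answer is penalized outright.
    if not predicted:
        return -empty_penalty

    target_values = {(t.object_id, t.attribute): t.value for t in target}
    target_objects = {t.object_id for t in target}

    reward = 0.0
    # Deduplicate so repeating one correct guess earns nothing extra.
    for p in set(predicted):
        key = (p.object_id, p.attribute)
        if key in target_values:
            # Full credit for an exact match, partial credit when only
            # the object and attribute are right.
            reward += 2.0 if target_values[key] == p.value else 1.0
        elif p.object_id in target_objects:
            # Smaller partial credit for finding the changed object.
            reward += 0.5
        else:
            # Excessive enumeration: each guess about an unchanged
            # object costs reward, so listing everything does not pay.
            reward -= wrong_penalty
    return reward


target = [Transformation(0, "color", "red"), Transformation(1, "size", "large")]
exact = dense_reward(target, target)
partial = dense_reward([Transformation(0, "color", "blue")], target)
padded = dense_reward(
    target + [Transformation(2, "color", "red"), Transformation(3, "size", "small")],
    target,
)
empty = dense_reward([], target)
```

Under these assumed values, a partially correct answer still earns a positive signal, while padding a correct answer with extra guesses scores lower than the correct answer alone. That is the property the summary attributes to the fine-grained reward.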

Summary (original)

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1’s anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights in advancing the research of MLLMs and reasoning models. The codes, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.

arXiv information

Authors	Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang
Published	2025-05-26 16:00:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
