Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

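Summary

Active vision, also known as active perception, is the process of actively selecting where and how to look in order to gather task-relevant information.
It is a key component of efficient perception and decision-making in humans and advanced embodied agents.
Recently, using Multimodal Large Language Models (MLLMs) as the central planning and decision-making module in robotic systems has attracted wide attention.
However, despite the importance of active perception in embodied intelligence, how MLLMs can be equipped with or learn active perception capabilities has been little explored, if at all.
In this paper, we first provide a systematic definition of MLLM-based active perception tasks.
We point out that the zoom-in search strategy of the recently proposed GPT-o3 model can be regarded as a special case of active perception.
However, it still suffers from low search efficiency and inaccurate region selection.
To address these issues, we propose Active-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities.
We further establish a comprehensive benchmark suite to evaluate Active-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving as well as fine-grained interactive segmentation.
In addition, Active-O3 demonstrates strong zero-shot reasoning ability on the V* Benchmark without relying on any explicit reasoning data.
We hope our work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.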

Abstract (original)

Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model’s zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
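The abstract names GRPO as the training backbone but gives no details. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO is generally known for: several responses are sampled for the same input, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, the `eps` constant, and the use of the population standard deviation are all assumptions made for this example. This is not the authors' implementation, and the rewards here are not from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style sketch).

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four region proposals sampled for one image and query.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.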

arXiv information

Authors	Muzhi Zhu,Hao Zhong,Canyu Zhao,Zongze Du,Zheng Huang,Mingyu Liu,Hao Chen,Cheng Zou,Jingdong Chen,Ming Yang,Chunhua Shen
Published	2025-05-27 17:29:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
