Visuospatial Cognitive Assistant

Summary

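Video-based spatial cognition is essential for robotics and embodied AI, but it remains a challenge for current vision-language models (VLMs).
This paper makes two key contributions.
First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs drawn from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), which provides supervision for 3D metadata-grounded queries and video-based complex reasoning.
Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which sets a new state of the art on all eight VSI-Bench tasks and outperforms existing models, including larger ones (for example, +26.1 on Absolute Distance).
For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning.
Our work highlights the importance of targeted data and suggests paths toward improved temporal-spatial modeling.
We release all resources to foster research in robust visuospatial intelligence.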

Summary (original)

Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.
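To make the phrase "3D metadata-grounded queries" concrete, here is a minimal sketch of how one absolute-distance QA pair might be derived from per-object 3D bounding boxes. The metadata layout, the function names `center` and `absolute_distance_qa`, and the question template are all hypothetical illustrations; the abstract does not describe the actual ViCA-322K generation pipeline.

```python
import math

def center(box):
    """Return the centre of an axis-aligned box given as (min_xyz, max_xyz)."""
    lo, hi = box
    return tuple((a + b) / 2 for a, b in zip(lo, hi))

def absolute_distance_qa(objects, name_a, name_b):
    """Build one QA pair asking for the centre-to-centre distance in metres.

    `objects` maps an object name to its (min corner, max corner) box.
    This schema is an assumption for illustration only.
    """
    dist = math.dist(center(objects[name_a]), center(objects[name_b]))
    question = (
        f"What is the distance between the {name_a} and the {name_b} in meters?"
    )
    return {"question": question, "answer": f"{dist:.1f}"}

# Hypothetical scene metadata, coordinates in metres.
scene = {
    "sofa": ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0)),
    "table": ((4.0, 4.0, 0.0), (6.0, 5.0, 1.0)),
}
qa = absolute_distance_qa(scene, "sofa", "table")
print(qa["question"], "->", qa["answer"])
```

Using box centres is only one possible convention; a real pipeline could just as well measure the closest points between objects, which would give different ground-truth answers.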

arXiv information

Author	Qi Feng
Published	2025-05-28 08:31:26+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
