Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

Abstract

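Understanding object orientation is a fundamental challenge in visual perception and is critical for applications such as robotic manipulation and augmented reality.
Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding.
We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark that establishes object orientation perception as a primary evaluation target.
DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding.
Through carefully curated tasks drawn from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insight into how multimodal systems understand object orientation.
An evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, and performance deteriorates further on tasks that require reference frame shifts or compound rotations.
These findings demonstrate the need for dedicated orientation representation mechanisms: models are systematically unable to perform precise angular estimation, track orientation changes across viewpoints, or understand compound rotations, which suggests limitations in their internal 3D spatial representations.
As the first diagnostic framework designed specifically for orientation awareness in multimodal systems, DORI has implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments.
DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark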

Abstract (original)

Object orientation understanding represents a fundamental challenge in visual perception critical for applications like robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark establishing object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insights on how multi-modal systems understand object orientations. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms, as models show systematic inability to perform precise angular estimations, track orientation changes across viewpoints, and understand compound rotations – suggesting limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments. DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark

arXiv information

Authors	Keanu Nichols, Nazia Tasnim, Yuting Yan, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, Bryan A. Plummer
Published	2025-06-04 17:28:29+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
