Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble

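Summary

Augmented reality (AR) enhances the real world by integrating virtual content, but ensuring the quality, usability, and safety of AR experiences remains a major challenge.
Can vision-language models (VLMs) provide a solution for automatically evaluating AR-generated scenes?
This study evaluates how well three state-of-the-art commercial VLMs (GPT, Gemini, and Claude) identify and describe AR scenes.
To this end, it uses DiverseAR, the first AR dataset designed specifically to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities.
The findings show that VLMs can generally perceive and describe AR scenes, reaching a true positive rate (TPR) of up to 93% for perception and 71% for description.
They excel at identifying obvious virtual objects, such as a glowing apple, but struggle with seamlessly integrated content, such as a virtual pot with realistic shadows.
The results highlight both the strengths and the limitations of VLMs in understanding AR scenarios.
Key factors affecting VLM performance include virtual content placement, rendering quality, and physical plausibility.
The study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.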

Abstract (original)

Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs (GPT, Gemini, and Claude) in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs’ ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
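
For readers unfamiliar with the metric, the true positive rate quoted above is the share of actual positives that are correctly flagged: TPR = TP / (TP + FN). The sketch below only illustrates that standard definition. It is not the authors' evaluation code, the abstract does not say exactly what counts as a positive in their setup, and the function name and toy data are hypothetical.

```python
def true_positive_rate(labels, predictions):
    """Return TP / (TP + FN), computed over samples whose true label is positive."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # positives that were detected
    fn = sum(1 for y, p in pairs if y and not p)    # positives that were missed
    if tp + fn == 0:
        raise ValueError("TPR is undefined without positive samples")
    return tp / (tp + fn)


# Hypothetical toy data (not from the paper): True means the image contains
# virtual content, and a True prediction means the model reported virtual content.
labels = [True, True, True, True, False]
predictions = [True, True, True, False, True]

print(f"TPR = {true_positive_rate(labels, predictions):.2f}")
```

Note that the negative sample in the toy data does not affect the result, since TPR only looks at samples whose ground-truth label is positive.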

arXiv information

Authors	Lin Duan, Yanming Xiu, Maria Gorlatova
Published	2025-01-30 15:35:03+00:00

