3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Summary

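A 3D scene graph is a compact scene model that stores information about objects and the semantic relationships between them, which makes it promising for robotic tasks.
When interacting with a user, an embodied intelligent agent must be able to respond to a variety of natural-language queries about the scene.
Large language models (LLMs) are well suited to user-robot interaction because of their natural language understanding and reasoning abilities.
Recent methods for creating learnable representations of 3D scenes have shown that adapting LLMs to the 3D world can improve the quality of their responses.
However, existing methods do not explicitly use information about the semantic relationships between objects and rely only on the objects' coordinates.
This work proposes 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph.
The learnable representation serves as input to an LLM for performing 3D vision-language tasks.
Experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects.
The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.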

Summary (original)

A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLMs responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose a method 3DGraphLLM for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.

arXiv information

Authors	Tatiana Zemskova, Dmitry Yudin
Published	2024-12-24 14:21:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
