3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

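Abstract

Humans excel at performing complex tasks by drawing on long-term memory that spans temporal and spatial experiences.
In contrast, current large language models (LLMs) struggle to plan and act effectively in dynamic, multi-room 3D environments.
We posit that part of this limitation stems from the lack of proper 3D spatial-temporal memory modeling in LLMs.
To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments.
Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and action in LLMs.
Our model uses working memory tokens, which represent the current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions.
This approach lets the agent focus on task-relevant information while remaining memory-efficient in complex, long-horizon environments.
Experimental results show that 3DLLM-Mem achieves state-of-the-art performance across a range of tasks, outperforming the strongest baselines by 16.5% in success rate on the most challenging in-the-wild embodied tasks of 3DMem-Bench.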
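
The abstract describes working memory tokens acting as queries over episodic memory. The sketch below illustrates that general idea with plain scaled dot-product cross-attention. It is not the authors' implementation: the function names (`softmax`, `fuse_memory`), the residual addition, and the omission of learned query/key/value projections are all assumptions made for illustration, and the vectors are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_memory(working_tokens, episodic_features):
    """Hypothetical fusion step (an assumption, not the paper's code).

    working_tokens: list of d-dimensional vectors for the current observation.
    episodic_features: list of d-dimensional vectors for past observations.
    Each working token is used as a query; the result is the token plus an
    attention-weighted sum of the episodic features.
    """
    dim = len(working_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in working_tokens:
        # Similarity between this query and every stored memory feature.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in episodic_features
        ]
        weights = softmax(scores)
        # Weighted sum of memory features, computed per dimension.
        retrieved = [
            sum(w * feat[i] for w, feat in zip(weights, episodic_features))
            for i in range(dim)
        ]
        # Residual addition keeps the current observation in the output.
        fused.append([q + r for q, r in zip(query, retrieved)])
    return fused


# Toy example: two working tokens, three episodic memory entries.
working = [[1.0, 0.0], [0.0, 1.0]]
episodic = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
fused = fuse_memory(working, episodic)
```

In this toy setup each working token gives the most weight to the memory entry aligned with it, so the fused vector leans toward that entry. A real model would presumably apply learned projections and operate on far larger token sets.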

Abstract (original)

Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent’s ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represents current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench’s most challenging in-the-wild embodied tasks.

arXiv information

Authors	Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang
Published	2025-05-28 17:59:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
