DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes

Abstract

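Analyzing events in dynamic environments is a fundamental challenge in developing intelligent agents and robots that can interact with humans.
Current approaches rely mainly on visual models.
However, these methods often capture information from images only implicitly and lack interpretable spatial-temporal object representations.
To address this problem, we introduce DyGEnc, a novel method for encoding a dynamic graph.
The method integrates a compressed spatial-temporal structural representation of observations with the cognitive capabilities of large language models.
The goal of this integration is to enable advanced question answering based on a sequence of textual scene graphs.
Extended evaluations on the STAR and AGQA datasets indicate that DyGEnc outperforms existing visual methods by a large margin of 15-25% in answering queries about the history of human-to-object interactions.
Furthermore, the proposed method can be seamlessly extended to process raw input images by using foundation models to extract explicit textual scene graphs, as demonstrated by the results of a robotic experiment conducted on a wheeled manipulator platform.
We hope these findings will contribute to the implementation of robust, compressed graph-based robotic memory for long-horizon reasoning.
Code is available at github.com/linukc/DyGEnc.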

Abstract (original)

The analysis of events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly utilize visual models. However, these methods often capture information implicitly from images, lacking interpretable spatial-temporal object representations. To address this issue we introduce DyGEnc – a novel method for Encoding a Dynamic Graph. This method integrates compressed spatial-temporal structural observation representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on a sequence of textual scene graphs. Extended evaluations on the STAR and AGQA datasets indicate that DyGEnc outperforms existing visual methods by a large margin of 15-25% in addressing queries regarding the history of human-to-object interactions. Furthermore, the proposed method can be seamlessly extended to process raw input images utilizing foundational models for extracting explicit textual scene graphs, as substantiated by the results of a robotic experiment conducted with a wheeled manipulator platform. We hope that these findings will contribute to the implementation of robust and compressed graph-based robotic memory for long-horizon reasoning. Code is available at github.com/linukc/DyGEnc.
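
To make the phrase "a sequence of textual scene graphs" concrete, the following is a minimal sketch of one way such a sequence could be represented and flattened into text for a question-answering prompt. It is not the authors' implementation: the `Triplet` class, the `serialize` and `build_prompt` helpers, the prompt wording, and the example objects and relations are all hypothetical. The sketch also uses plain-text serialization only and does not attempt the compressed graph encoding that the abstract describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One edge of a textual scene graph (hypothetical structure)."""
    subject: str
    relation: str
    obj: str


def serialize(graphs: list[list[Triplet]]) -> str:
    """Flatten a time-ordered list of scene graphs into one line per timestep."""
    lines = []
    for t, graph in enumerate(graphs):
        facts = "; ".join(f"{tr.subject} {tr.relation} {tr.obj}" for tr in graph)
        lines.append(f"t={t}: {facts}")
    return "\n".join(lines)


def build_prompt(graphs: list[list[Triplet]], question: str) -> str:
    """Combine the serialized history with a question (assumed prompt format)."""
    return (
        "Scene graph history:\n"
        f"{serialize(graphs)}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative history of human-to-object interactions (made-up data).
history = [
    [Triplet("person", "holding", "cup")],
    [Triplet("person", "holding", "cup"), Triplet("person", "sitting_on", "sofa")],
    [Triplet("cup", "on", "table"), Triplet("person", "sitting_on", "sofa")],
]

prompt = build_prompt(history, "What was the person holding before sitting down?")
print(prompt)
```

A plain-text history like this grows linearly with the number of timesteps, which is one reason a compressed representation is attractive for long-horizon reasoning.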

arXiv information

Authors	Sergey Linok, Vadim Semenov, Anastasia Trunova, Oleg Bulichev, Dmitry Yudin
Published	2025-05-06 14:41:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
