DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

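Summary

The field of AI agents is advancing at an unprecedented pace, driven by the capabilities of large language models (LLMs).
However, LLM-driven visual agents focus mainly on tasks in the image modality, which limits their ability to understand the dynamic nature of the real world and keeps them far from real-life applications such as guiding students through laboratory experiments and identifying their mistakes.
Considering that the video modality better reflects the ever-changing, perceptually intensive nature of real-world scenarios, we devise DoraemonGPT, a comprehensive and conceptually elegant LLM-driven system for handling dynamic video tasks.
Given a video with a question or task, DoraemonGPT begins by converting the input video, with its massive content, into a symbolic memory that stores \textit{task-related} attributes.
This structured representation lets sub-task tools perform spatial-temporal querying and reasoning, yielding concise and relevant intermediate results.
Recognizing that the internal knowledge of LLMs is limited in specialized domains (e.g., analyzing the scientific principles underlying an experiment), we incorporate plug-and-play tools to assess external knowledge and address tasks across different domains.
Moreover, to efficiently explore the large planning space for scheduling the various tools, we introduce a novel LLM-driven planner based on Monte Carlo Tree Search.
The planner iteratively finds feasible solutions by backpropagating the reward of each result, and multiple solutions can be summarized into an improved final answer.
We extensively evaluate DoraemonGPT on dynamic scenes and provide in-the-wild showcases demonstrating its ability to handle more complex questions than previous studies.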
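
The symbolic memory described above can be pictured as a small table that tools query by time and attribute. The sketch below is only an illustration under stated assumptions: the summary does not specify the storage format, so the use of SQLite, the `objects` table, its columns, and the toy rows are all hypothetical.

```python
import sqlite3

# Hypothetical toy data: (frame, time in seconds, object label, x, y).
ROWS = [
    (0, 0.0, "beaker", 0.2, 0.5),
    (30, 1.0, "beaker", 0.2, 0.5),
    (30, 1.0, "hand", 0.6, 0.4),
    (60, 2.0, "flame", 0.3, 0.7),
]


def build_memory(rows):
    """Store per-frame object attributes in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects "
        "(frame INTEGER, time_s REAL, label TEXT, x REAL, y REAL)"
    )
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def first_appearance(conn, label):
    """Temporal query: earliest time at which `label` is seen, or None."""
    row = conn.execute(
        "SELECT MIN(time_s) FROM objects WHERE label = ?", (label,)
    ).fetchone()
    return row[0]


def labels_between(conn, start_s, end_s):
    """Temporal query: distinct labels seen in [start_s, end_s], sorted."""
    rows = conn.execute(
        "SELECT DISTINCT label FROM objects "
        "WHERE time_s BETWEEN ? AND ? ORDER BY label",
        (start_s, end_s),
    ).fetchall()
    return [r[0] for r in rows]
```

In this sketch, a "when" style sub-task tool would call `first_appearance(conn, "hand")`, and a "what" style tool would call `labels_between(conn, 0.5, 2.0)`; both return short results rather than raw video content.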

Summary (original)

The field of AI agents is advancing at an unprecedented rate due to the capabilities of large language models (LLMs). However, LLM-driven visual agents mainly focus on solving tasks for the image modality, which limits their ability to understand the dynamic nature of the real world, making it still far from real-life applications, e.g., guiding students in laboratory experiments and identifying their mistakes. Considering the video modality better reflects the ever-changing and perceptually intensive nature of real-world scenarios, we devise DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to handle dynamic video tasks. Given a video with a question/task, DoraemonGPT begins by converting the input video with massive content into a symbolic memory that stores \textit{task-related} attributes. This structured representation allows for spatial-temporal querying and reasoning by sub-task tools, resulting in concise and relevant intermediate results. Recognizing that LLMs have limited internal knowledge when it comes to specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to assess external knowledge and address tasks across different domains. Moreover, we introduce a novel LLM-driven planner based on Monte Carlo Tree Search to efficiently explore the large planning space for scheduling various tools. The planner iteratively finds feasible solutions by backpropagating the result’s reward, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT in dynamic scenes and provide in-the-wild showcases demonstrating its ability to handle more complex questions than previous studies.
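
The planner idea in the abstract (tree search over tool schedules, with the reward of each result backpropagated) can be sketched as a generic Monte Carlo Tree Search loop. This is a minimal toy under stated assumptions, not the authors' planner: the tool names, the fixed plan length, the UCB1 selection rule, the random rollout, and the hand-written `reward` function (standing in for an LLM judging an answer) are all hypothetical.

```python
import math
import random

TOOLS = ["when", "what", "count", "search"]  # hypothetical tool names
MAX_DEPTH = 3                                # hypothetical plan length


def reward(plan):
    """Toy stand-in for scoring an answer: some plans count as feasible."""
    return 1.0 if plan[0] == "what" and plan[-1] == "count" else 0.0


class Node:
    def __init__(self, plan, parent=None):
        self.plan = plan          # sequence of tools chosen so far
        self.parent = parent
        self.children = {}        # tool name -> child Node
        self.visits = 0
        self.value = 0.0          # sum of backpropagated rewards

    def is_terminal(self):
        return len(self.plan) >= MAX_DEPTH

    def untried(self):
        return [t for t in TOOLS if t not in self.children]


def ucb1(child, parent_visits, c=1.4):
    """Mean reward plus an exploration bonus for rarely visited children."""
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(iterations, rng):
    root = Node([])
    solutions = []
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded.
        while not node.is_terminal() and not node.untried():
            node = max(node.children.values(),
                       key=lambda ch: ucb1(ch, node.visits))
        # Expansion: add one untried tool call.
        if not node.is_terminal():
            tool = rng.choice(node.untried())
            child = Node(node.plan + [tool], parent=node)
            node.children[tool] = child
            node = child
        # Rollout: complete the plan at random and score it.
        plan = list(node.plan)
        while len(plan) < MAX_DEPTH:
            plan.append(rng.choice(TOOLS))
        r = reward(plan)
        if r > 0 and plan not in solutions:
            solutions.append(plan)
        # Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return root, solutions
```

Calling `search(2000, random.Random(0))` returns the search tree root and the list of distinct rewarded plans found along the way; in the system described above, several such solutions would then be summarized into one final answer.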

arXiv information

Authors	Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang
Published	2024-01-16 14:33:09+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
