VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Summary

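Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence.
A particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding together with spatial and temporal reasoning.
An embodied agent must ground its understanding of navigation instructions in observations of a real-world environment such as Street View.
Despite the impressive results of LLMs in other research areas, how best to connect them to an interactive visual environment remains an open problem.
This work proposes VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as the contextual prompt for the next action.
Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view.
With only two in-context examples, VELMA can successfully follow navigation instructions in Street View.
Fine-tuning the LLM agent further on a few thousand examples yields a 25% to 30% relative improvement in task completion over the previous state of the art on two datasets.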
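
The verbalization step described above can be pictured with a minimal sketch. It assumes that landmark names have already been extracted from the instructions and that an image-text similarity score per landmark (for example from CLIP) is already available. The threshold value, the prompt wording, and the names `Step`, `verbalize_observation`, and `build_prompt` are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cut-off: the summary does not say how visibility is decided
# from the CLIP scores, so a fixed threshold is assumed here.
VISIBILITY_THRESHOLD = 0.5


@dataclass
class Step:
    """One past step of the trajectory (illustrative structure)."""
    action: str                         # e.g. "forward", "left"
    landmark_scores: dict[str, float]   # landmark -> assumed similarity score


def verbalize_observation(scores: dict[str, float],
                          threshold: float = VISIBILITY_THRESHOLD) -> str:
    """Turn per-landmark scores into a short sentence about what is visible."""
    seen = [name for name, score in scores.items() if score >= threshold]
    if not seen:
        return "No listed landmark is visible."
    return "Visible landmarks: " + ", ".join(seen) + "."


def build_prompt(instructions: str, history: list[Step],
                 current_scores: dict[str, float]) -> str:
    """Concatenate instructions, verbalized trajectory, and current view.

    The prompt ends with an open "Action:" slot for the LLM to complete.
    """
    lines = [f"Navigation instructions: {instructions}", "Trajectory so far:"]
    for number, step in enumerate(history, start=1):
        observation = verbalize_observation(step.landmark_scores)
        lines.append(f"{number}. {observation} Action: {step.action}")
    current = verbalize_observation(current_scores)
    lines.append(f"{len(history) + 1}. {current} Action:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Scores below are made up for illustration only.
    past = [Step("forward", {"a red awning": 0.8, "a bank": 0.1})]
    print(build_prompt("Go past the red awning and turn left at the bank.",
                       past, {"a red awning": 0.2, "a bank": 0.9}))
```

Keeping the whole trajectory in text form means the next-action decision reduces to ordinary text completion, which is why a handful of in-context examples can be enough to get started.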

Abstract (original)

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it is an ongoing problem of how to best connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
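
The 25%-30% figure is a relative improvement, meaning the gain is measured as a fraction of the previous score rather than in percentage points. A small sketch of that arithmetic follows; the two task-completion values are hypothetical numbers picked to make the calculation easy to follow and are not results from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the gain of `new` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical task-completion rates (in percent), for illustration only.
    baseline_tc = 20.0
    new_tc = 25.0
    gain = relative_improvement(baseline_tc, new_tc)
    # A 5-point absolute gain on a baseline of 20 is a 25% relative gain.
    print(f"absolute: {new_tc - baseline_tc:.1f} points, relative: {gain:.0%}")
```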

arXiv information

Authors	Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang
Published	2024-01-24 15:10:07+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
