SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction

Summary

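Linguistic ambiguity is ubiquitous in daily life.
Previous work has used interaction between robots and humans to resolve it.
Nevertheless, deploying interactive robots in everyday environments poses significant challenges for natural human-robot interaction, stemming from complex and unpredictable visual inputs, open-ended interaction, and diverse user demands.
This paper presents SInViG, a self-evolving interactive visual agent for natural-language human-robot interaction that aims to resolve language ambiguity, when present, through multi-turn visual-language dialogue.
It learns continuously and automatically from unlabeled images and large language models, without human intervention, becoming more robust to visual and linguistic complexity.
Thanks to this self-evolution, it sets a new state of the art on several interactive visual grounding benchmarks.
In addition, the human-robot interaction experiments show that the evolved models are consistently preferred more and more by human users.
The model was also deployed on a Franka robot for interactive manipulation tasks.
The results show that the model can follow diverse user instructions and interact naturally with humans in natural language, despite the complexity and disturbances of the environment.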

Abstract (original)

Linguistic ambiguity is ubiquitous in our daily lives. Previous works adopted interaction between robots and humans for language disambiguation. Nevertheless, when interactive robots are deployed in daily environments, there are significant challenges for natural human-robot interaction, stemming from complex and unpredictable visual inputs, open-ended interaction, and diverse user demands. In this paper, we present SInViG, which is a self-evolving interactive visual agent for human-robot interaction based on natural languages, aiming to resolve language ambiguity, if any, through multi-turn visual-language dialogues. It continuously and automatically learns from unlabeled images and large language models, without human intervention, to be more robust against visual and linguistic complexity. Benefiting from self-evolving, it sets new state-of-the-art on several interactive visual grounding benchmarks. Moreover, our human-robot interaction experiments show that the evolved models consistently acquire more and more preferences from human users. Besides, we also deployed our model on a Franka robot for interactive manipulation tasks. Results demonstrate that our model can follow diverse user instructions and interact naturally with humans in natural language, despite the complexity and disturbance of the environment.

arXiv information

Authors	Jie Xu, Hanbo Zhang, Xinghang Li, Huaping Liu, Xuguang Lan, Tao Kong
Published	2024-02-20 04:33:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
