Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors

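Summary

In recent years, many models that learn the relationship between vision and language from large datasets have been released.
These models perform a variety of tasks, such as answering questions about images, retrieving the sentence that best matches an image, and locating the region of an image that corresponds to a phrase.
Although a few examples exist, the connection between these pre-trained vision-language models and robotics remains weak.
When the models are tied directly to robot motions, the robot's embodiment and the difficulty of data collection cost them their generality, and they can no longer be applied to a wide range of bodies and situations.
This study therefore categorizes and summarizes ways to use pre-trained vision-language models flexibly and easily, in a form the robot can understand, without connecting them directly to robot motions.
It describes how to use these models for robot motion selection and motion planning without re-training them.
It considers five types of methods for extracting information a robot can understand and, by combining them, shows results for state recognition, object recognition, affordance recognition, relation recognition, and anomaly detection.
The study is expected to add flexibility, ease of use, and new applications to the recognition behaviors of existing robots.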

Summary (original)

In recent years, a number of models that learn the relations between vision and language from large datasets have been released. These models perform a variety of tasks, such as answering questions about images, retrieving sentences that best correspond to images, and finding regions in images that correspond to phrases. Although there are some examples, the connection between these pre-trained vision-language models and robotics is still weak. If they are directly connected to robot motions, they lose their versatility due to the embodiment of the robot and the difficulty of data collection, and become inapplicable to a wide range of bodies and situations. Therefore, in this study, we categorize and summarize the methods to utilize the pre-trained vision-language models flexibly and easily in a way that the robot can understand, without directly connecting them to robot motions. We discuss how to use these models for robot motion selection and motion planning without re-training the models. We consider five types of methods to extract information understandable for robots, and show the results of state recognition, object recognition, affordance recognition, relation recognition, and anomaly detection based on the combination of these five methods. We expect that this study will add flexibility and ease-of-use, as well as new applications, to the recognition behavior of existing robots.

arXiv information

Authors	Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Kei Okada, Masayuki Inaba
Published	2023-03-10 02:55:50+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
