Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors

Summary

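In recent years, many models that learn the relationship between vision and language from large datasets have been released.
These models perform a variety of tasks, such as answering questions about an image, retrieving the sentence that best matches an image, and locating the region of an image that corresponds to a phrase.
Although a few examples exist, the connection between these pre-trained vision-language models and robotics remains weak.
When the models are tied directly to robot motions, they lose their generality because of the robot's embodiment and the difficulty of data collection, and can no longer be applied to a wide range of bodies and situations.
This study therefore categorizes and summarizes ways to use pre-trained vision-language models flexibly and easily, in a form the robot can understand, without connecting them directly to robot motions.
It describes how to use these models for robot motion selection and motion planning without retraining them.
Five types of methods for extracting robot-understandable information are considered, and results are shown for state recognition, object recognition, affordance recognition, relation recognition, and anomaly detection based on combinations of these five methods.
The study is expected to add flexibility, ease of use, and new applications to the recognition behaviors of existing robots.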

Abstract (original)

In recent years, a number of models that learn the relations between vision and language from large datasets have been released. These models perform a variety of tasks, such as answering questions about images, retrieving sentences that best correspond to images, and finding regions in images that correspond to phrases. Although there are some examples, the connection between these pre-trained vision-language models and robotics is still weak. If they are directly connected to robot motions, they lose their versatility due to the embodiment of the robot and the difficulty of data collection, and become inapplicable to a wide range of bodies and situations. Therefore, in this study, we categorize and summarize the methods to utilize the pre-trained vision-language models flexibly and easily in a way that the robot can understand, without directly connecting them to robot motions. We discuss how to use these models for robot motion selection and motion planning without re-training the models. We consider five types of methods to extract information understandable for robots, and show the results of state recognition, object recognition, affordance recognition, relation recognition, and anomaly detection based on the combination of these five methods. We expect that this study will add flexibility and ease-of-use, as well as new applications, to the recognition behavior of existing robots.

arXiv information

Authors	Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Kei Okada, Masayuki Inaba
Published	2023-10-11 08:54:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
