Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization

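Summary

For robots that provide daily life support or perform security tasks, recognizing the state of the environment and of objects, such as whether a door is open or closed or whether a light is on or off, is indispensable.
Existing state recognition methods have relied on training neural networks with manual annotations, preparing dedicated sensors for recognition, or hand-written programs that extract features from point clouds or raw images.
In contrast, we propose a robotic state recognition method that uses a pre-trained vision-language model capable of Image-to-Text Retrieval (ITR) tasks.
We prepare several kinds of language prompts in advance, compute the similarity between these prompts and the current image with ITR, and recognize the state from the result.
Applying an optimal weight to each prompt, found by black-box optimization, makes state recognition more accurate.
Experiments show that this method enables a variety of state recognition tasks simply by preparing multiple prompts, without retraining neural networks or manual programming.
In addition, because each recognizer needs only its prompts and their weights, there is no need to prepare multiple models, which makes resource management easier.
Through language, the method can recognize states that have been difficult to handle so far: whether a transparent door is open or closed, whether water is running from a faucet, and even qualitative states such as whether a kitchen is clean.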

Summary (original)

State recognition of the environment and objects, such as the open/closed state of doors and the on/off of lights, is indispensable for robots that perform daily life support and security tasks. Until now, state recognition methods have been based on training neural networks from manual annotations, preparing special sensors for the recognition, or manually programming to extract features from point clouds or raw images. In contrast, we propose a robotic state recognition method using a pre-trained vision-language model, which is capable of Image-to-Text Retrieval (ITR) tasks. We prepare several kinds of language prompts in advance, calculate the similarity between these prompts and the current image by ITR, and perform state recognition. By applying the optimal weighting to each prompt using black-box optimization, state recognition can be performed with higher accuracy. Experiments show that this theory enables a variety of state recognitions by simply preparing multiple prompts without retraining neural networks or manual programming. In addition, since only prompts and their weights need to be prepared for each recognizer, there is no need to prepare multiple models, which facilitates resource management. It is possible to recognize the open/closed state of transparent doors, the state of whether water is running or not from a faucet, and even the qualitative state of whether a kitchen is clean or not, which have been challenging so far, through language.
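
The pipeline described in the abstract (score each prompt against the image, combine the scores with per-prompt weights, and tune the weights with black-box optimization) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the prompt strings, the similarity values (made-up numbers standing in for the output of a vision-language model), the zero decision threshold, and the use of plain random search as the optimizer. The abstract does not say which optimizer or decision rule the authors use.

```python
import numpy as np

# Hypothetical prompts; the abstract does not list the ones actually used.
PROMPTS = ["a photo of an open door", "a photo of a closed door"]


def predict_state(similarities, weights, threshold=0.0):
    """Return 1 where the weighted sum of prompt similarities exceeds threshold, else 0."""
    scores = np.asarray(similarities) @ np.asarray(weights)
    return (scores > threshold).astype(int)


def accuracy(similarities, labels, weights):
    """Fraction of images whose predicted state matches the label."""
    predictions = predict_state(similarities, weights)
    return float(np.mean(predictions == np.asarray(labels)))


def random_search(similarities, labels, n_trials=200, seed=0):
    """Random search over prompt weights: a simple stand-in for a black-box optimizer."""
    rng = np.random.default_rng(seed)
    n_prompts = np.asarray(similarities).shape[1]
    # Start from equal weights so the result is never worse than that baseline.
    best_w = np.ones(n_prompts)
    best_acc = accuracy(similarities, labels, best_w)
    for _ in range(n_trials):
        w = rng.uniform(-1.0, 1.0, size=n_prompts)
        acc = accuracy(similarities, labels, w)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Made-up image-to-prompt similarities (rows: images, columns: PROMPTS).
# In practice these would come from a pre-trained vision-language model.
SIMS = np.array([
    [0.31, 0.22],
    [0.29, 0.24],
    [0.21, 0.30],
    [0.23, 0.28],
])
LABELS = np.array([1, 1, 0, 0])  # 1 = open, 0 = closed (illustrative labels)

best_w, best_acc = random_search(SIMS, LABELS)
print("weights:", best_w, "accuracy:", best_acc)
```

Only the prompts and the weight vector differ between recognizers in this sketch, which mirrors the abstract's point that a single shared model can serve several recognition tasks.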

arXiv information

Authors	Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Kei Okada, Masayuki Inaba
Published	2024-10-30 05:34:52+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
