RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

Summary

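Robotic vision applications often require a wide range of visual perception tasks, such as object detection, segmentation, and identification.
Although each of these tasks has advanced considerably, integrating specialized models into a unified vision pipeline brings significant engineering challenges and costs.
Recently, Multimodal Large Language Models (MLLMs) have emerged as new backbones for a variety of downstream tasks.
We argue that leveraging the pre-training capabilities of MLLMs makes a simplified framework possible and reduces the need for task-specific encoders.
Specifically, the large-scale pretrained knowledge in MLLMs makes fine-tuning for downstream robotic vision tasks easier and yields superior performance.
We introduce the RoboLLM framework, built on a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge, a large-scale robotic manipulation dataset of real-world warehouse scenarios.
RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden of model selection and tuning.
The source code is publicly available at https://github.com/longkukuhi/armbench.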

Abstract (original)

Robotic vision applications often necessitate a wide range of visual perception tasks, such as object detection, segmentation, and identification. While there have been substantial advances in these individual tasks, integrating specialized models into a unified vision pipeline presents significant engineering challenges and costs. Recently, Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks. We argue that leveraging the pre-training capabilities of MLLMs enables the creation of a simplified framework, thus mitigating the need for task-specific encoders. Specifically, the large-scale pretrained knowledge in MLLMs allows for easier fine-tuning to downstream robotic vision tasks and yields superior performance. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge, a large-scale robotic manipulation dataset about real-world warehouse scenarios. RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden associated with model selection and tuning. The source code is publicly available at https://github.com/longkukuhi/armbench.

arXiv information

Authors	Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa
Published	2024-02-23 15:18:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
