DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

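Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities on complex language and visual data.
These advances have spurred the vision of a generalist robotic MLLM that understands complex human instructions and accomplishes a variety of embodied tasks.
However, developing MLLMs for real-world robots is challenging because robotic platforms typically offer limited computation and memory capacity.
MLLM inference, by contrast, requires storing billions of parameters and performing a tremendous amount of computation, which imposes significant hardware demands.
In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM to each situation at hand.
The approach leverages a multi-exit architecture in the MLLM: once a suitably sized portion of the model has been activated for a given situation, processing terminates, which avoids further redundant computation.
In addition, we develop novel algorithms that establish early-termination criteria for DeeR conditioned on predefined demands: average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage.
These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance.
On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory of the LLM by 2-6x without compromising performance.
Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.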
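
The multi-exit idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exit criterion, so this example assumes one (stop when the action predicted at two consecutive exits changes by less than a threshold). The names `early_exit_inference`, `stages`, `action_head`, and `thresholds` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def early_exit_inference(
    features: Vector,
    stages: Sequence[Callable[[Vector], Vector]],
    action_head: Callable[[Vector], Vector],
    thresholds: Sequence[float],
) -> Tuple[Vector, int]:
    """Run stages in order and stop once consecutive action predictions agree.

    ASSUMPTION: the exit rule (distance between consecutive predictions below
    a per-exit threshold) is illustrative, not taken from the source.
    Returns the predicted action and the number of stages actually executed.
    thresholds[0] is never read, because the first exit has no predecessor.
    """
    if not stages:
        raise ValueError("need at least one stage")
    previous_action: Optional[Vector] = None
    hidden = features
    for index, stage in enumerate(stages):
        hidden = stage(hidden)          # activate one more block of the model
        action = action_head(hidden)    # decode an action at this exit
        if previous_action is not None:
            if l2_distance(action, previous_action) < thresholds[index]:
                return action, index + 1  # remaining stages are skipped
        previous_action = action
    return action, len(stages)          # no exit fired: the full model was used
```

With toy stages that each move the hidden state halfway toward 1.0, an identity action head, and a threshold of 0.2 at every exit, the loop stops after three of six stages. Easy inputs therefore consume fewer stages than hard ones, which is the source of the compute savings the abstract describes.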

Abstract (original)

MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
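
The abstract says the termination criteria are derived from predefined resource demands but gives no algorithm. As a rough sketch of what meeting an average-cost budget could look like, the example below picks a single shared exit threshold on a calibration set so that the mean number of executed stages stays within a budget. Everything here is an assumption made for illustration: the score convention (lower means more confident), the single shared threshold, the use of stage count as a proxy for compute, and the names `exit_depth`, `average_depth`, and `calibrate_threshold`.

```python
from typing import Optional, Sequence


def exit_depth(scores: Sequence[float], threshold: float) -> int:
    """Stages executed for one sample: the first exit whose score is below
    the threshold, or all stages if no exit fires (the last exit always ends)."""
    for depth, score in enumerate(scores, start=1):
        if score < threshold:
            return depth
    return len(scores)


def average_depth(calibration: Sequence[Sequence[float]], threshold: float) -> float:
    """Mean number of executed stages over the calibration samples."""
    return sum(exit_depth(s, threshold) for s in calibration) / len(calibration)


def calibrate_threshold(
    calibration: Sequence[Sequence[float]], budget: float
) -> Optional[float]:
    """Return the smallest (most conservative) candidate threshold whose
    average depth is within `budget`, or None if no candidate qualifies.

    ASSUMPTION: a single threshold shared by all exits; hypothetical scheme.
    """
    if not calibration:
        raise ValueError("calibration set is empty")
    candidates = sorted({score for s in calibration for score in s})
    # One extra candidate above every score, so "always exit at stage 1" is reachable.
    candidates.append(candidates[-1] + 1.0)
    for threshold in candidates:  # ascending order: later exits are preferred first
        if average_depth(calibration, threshold) <= budget:
            return threshold
    return None
```

Average depth can only fall as the threshold rises, so the first candidate that satisfies the budget is the most conservative one. A peak-compute or memory limit could plausibly be handled separately by capping the number of stages that are loaded at all, but that is likewise an assumption rather than something the abstract specifies.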

arXiv information

Authors	Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
Published	2024-11-04 18:26:08+00:00

Source and services used

arxiv.jp, Google
