QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

Abstract

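This paper addresses the inherent inference latency challenges of deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks.
Our investigation shows that conventional parameter reduction techniques ultimately degrade the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose.
We introduce a novel latency-free quadruped MLLM, called QUART-Online, designed to improve inference efficiency without degrading the performance of the language foundation model.
By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information.
We then fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space.
Experimental results show that QUART-Online operates in tandem with the existing MLLM system, achieves real-time inference in sync with the underlying controller frequency, and raises the success rate across various tasks by 65%.
The project page is https://quart-online.github.io.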
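
The abstract describes Action Chunk Discretization only at a high level: continuous action values are mapped onto a smaller set of discrete representative vectors. The sketch below illustrates that general idea with plain nearest-neighbor vector quantization in NumPy. It is not the authors' implementation; the function names, chunk length, action dimension, and codebook size are all hypothetical, and the codebook is random here, whereas a real system would learn it from data.

```python
import numpy as np


def discretize_chunks(actions, codebook, chunk_len):
    """Map each chunk of continuous actions to the index of its nearest codebook vector.

    actions:  (n_steps, action_dim) continuous action values
    codebook: (n_codes, chunk_len * action_dim) representative vectors
    """
    n_steps, action_dim = actions.shape
    n_chunks = n_steps // chunk_len  # a trailing partial chunk is dropped
    chunks = actions[: n_chunks * chunk_len].reshape(n_chunks, chunk_len * action_dim)
    # Squared Euclidean distance from every chunk to every codebook vector.
    dists = ((chunks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def reconstruct_chunks(tokens, codebook, chunk_len):
    """Look up the representative vectors and unfold them back into per-step actions."""
    return codebook[tokens].reshape(len(tokens) * chunk_len, -1)


# Hypothetical sizes, chosen only for illustration.
chunk_len, action_dim, n_codes = 4, 12, 8
rng = np.random.default_rng(0)
codebook = rng.normal(size=(n_codes, chunk_len * action_dim))

# Build two action chunks that sit close to codebook entries 3 and 5.
clean = codebook[[3, 5]].reshape(2 * chunk_len, action_dim)
actions = clean + 0.01 * rng.normal(size=clean.shape)

tokens = discretize_chunks(actions, codebook, chunk_len)
recovered = reconstruct_chunks(tokens, codebook, chunk_len)
```

In this toy setup, eight action steps of twelve values each are represented by two integer tokens, which is the kind of compression that lets a language model emit fewer action tokens per control step.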

Abstract (original)

This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference in sync with the underlying controller frequency, significantly boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.

arXiv information

Authors	Xinyang Tong, Pengxiang Ding, Yiguo Fan, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu
Published	2025-03-11 14:09:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
