QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

Abstract

This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference in sync with the underlying controller frequency and boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
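
The abstract describes ACD only at a high level: continuous action values are mapped onto a smaller set of discrete representative vectors. As a rough illustration of that general idea, the sketch below uses plain nearest-codebook vector quantization with a k-means codebook. This is a stand-in technique chosen for illustration, not the authors' ACD implementation; the function names, the chunk length, the codebook size, and the 12-dimensional action are all hypothetical.

```python
import numpy as np


def make_chunks(actions: np.ndarray, chunk_len: int) -> np.ndarray:
    """Split a (T, D) action trajectory into flattened (N, chunk_len * D) chunks."""
    n_chunks = actions.shape[0] // chunk_len
    trimmed = actions[: n_chunks * chunk_len]  # drop any incomplete trailing chunk
    return trimmed.reshape(n_chunks, chunk_len * actions.shape[1])


def encode(chunks: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each chunk, the index of its nearest codebook vector."""
    # Squared Euclidean distance between every chunk and every code: shape (N, K).
    dists = ((chunks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_codebook(chunks: np.ndarray, num_codes: int,
                 iters: int = 20, seed: int = 0) -> np.ndarray:
    """Learn a (num_codes, chunk_len * D) codebook with plain k-means (assumed here)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(chunks.shape[0], size=num_codes, replace=False)
    codebook = chunks[init].copy()
    for _ in range(iters):
        ids = encode(chunks, codebook)
        for k in range(num_codes):
            members = chunks[ids == k]
            if len(members) > 0:  # leave a code unchanged if no chunk maps to it
                codebook[k] = members.mean(axis=0)
    return codebook


def decode(ids: np.ndarray, codebook: np.ndarray, chunk_len: int) -> np.ndarray:
    """Map discrete ids back to a continuous (N * chunk_len, D) action sequence."""
    action_dim = codebook.shape[1] // chunk_len
    return codebook[ids].reshape(-1, action_dim)


if __name__ == "__main__":
    # Synthetic data only: 1000 steps of a hypothetical 12-dimensional action.
    demo_actions = np.random.default_rng(0).normal(size=(1000, 12))
    demo_chunks = make_chunks(demo_actions, chunk_len=5)
    demo_codebook = fit_codebook(demo_chunks, num_codes=16)
    demo_ids = encode(demo_chunks, demo_codebook)
    print(demo_ids[:10])  # one discrete id per 5-step chunk
```

In a setup like this, a model would only need to predict one discrete id per chunk rather than every continuous value at every control step, which is one plausible reading of how compressing the action space could reduce inference cost. How QUART-Online actually builds and uses its discrete vectors is not specified in the abstract.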

arXiv information

Authors	Xinyang Tong, Pengxiang Ding, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Yiguo Fan, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu
Published	2024-12-20 05:17:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
