HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

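Summary

Recent advances in vision-language models (VLMs) for common-sense reasoning have driven the development of vision-language-action (VLA) models, which enable robots to perform generalized manipulation.
Existing autoregressive VLA methods leverage large-scale pretrained knowledge, but they disrupt the continuity of actions.
Other VLA methods add a diffusion head to predict continuous actions, but that head relies solely on VLM-extracted features, which limits reasoning capability.
This paper introduces HybridVLA, a unified framework that integrates the strengths of both autoregressive and diffusion policies within a single large language model instead of simply connecting them.
To bridge the generation gap, the authors propose a collaborative training recipe that injects diffusion modeling directly into next-token prediction.
With this recipe, the two forms of action prediction not only reinforce each other but also perform differently across tasks.
The authors therefore design a collaborative action ensemble mechanism that adaptively fuses the two predictions, leading to more robust control.
In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across a range of simulation and real-world tasks, on both single-arm and dual-arm robots, and shows stable manipulation in previously unseen configurations.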

Abstract (original)

Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
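The abstract does not say how the collaborative action ensemble fuses the two predictions. As a rough illustration only, the sketch below assumes one plausible rule: average the autoregressive and diffusion actions when the autoregressive prediction is confident, and otherwise fall back to the diffusion action alone. The function name `fuse_actions`, the `ar_confidence` input, and the `threshold` value are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_actions(
    ar_action: Sequence[float],
    diff_action: Sequence[float],
    ar_confidence: float,
    threshold: float = 0.9,  # hypothetical value, not from the paper
) -> List[float]:
    """Confidence-gated fusion of two action predictions (illustrative).

    ar_action:     action decoded from autoregressive tokens
    diff_action:   action produced by the diffusion process
    ar_confidence: assumed scalar confidence of the autoregressive
                   prediction, e.g. a mean token probability in [0, 1]
    """
    if len(ar_action) != len(diff_action):
        raise ValueError("both actions must have the same dimension")

    if ar_confidence >= threshold:
        # Confident autoregressive prediction: average element-wise.
        return [(a + d) / 2.0 for a, d in zip(ar_action, diff_action)]

    # Low confidence: rely on the continuous diffusion action only.
    return list(diff_action)


if __name__ == "__main__":
    ar = [0.0, 2.0]
    diff = [2.0, 4.0]
    print(fuse_actions(ar, diff, ar_confidence=0.95))  # averaged
    print(fuse_actions(ar, diff, ar_confidence=0.50))  # diffusion only
```

A hard gate like this is only one way to make the fusion adaptive; a weighted blend that scales with confidence would fit the same description in the abstract.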

arXiv information

Authors	Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang
Published	2025-03-13 17:59:52+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
