BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

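Summary

Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks.
However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems.
1-bit pretraining has proven effective at improving the inference efficiency of large language models with minimal performance loss, but its application to VLA models remains underexplored.
In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}.
To further reduce the memory footprint of the vision encoder, we propose a distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights.
During this process, a full-precision encoder serves as a teacher model to better align latent representations.
Despite lacking large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory.
These results highlight BitVLA's promise for deployment on memory-constrained edge devices.
The code and model weights are released at https://github.com/ustcwhy/BitVLA.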

Summary (original)

Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for enhancing the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint of the vision encoder, we propose the distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights. During this process, a full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA’s promise for deployment on memory-constrained edge devices. We release the code and model weights in https://github.com/ustcwhy/BitVLA.
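
The abstract states that every parameter is ternary and refers to "1.58-bit" weights, but it does not describe the quantizer itself. The following is a minimal sketch of one common way to obtain ternary weights, assuming absmean rounding in the style of BitNet b1.58 rather than the authors' confirmed method. The names ternarize and dequantize are hypothetical and chosen only for illustration. The sketch also shows where the figure 1.58 comes from: a weight with three possible values carries log2(3), roughly 1.585, bits of information.

```python
import math


def ternarize(weights):
    """Map real-valued weights to {-1, 0, 1} plus one per-tensor scale.

    Assumption: absmean scaling followed by round-and-clip. The abstract
    does not confirm that BitVLA uses exactly this scheme.
    """
    # Per-tensor scale: mean absolute value (epsilon avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + 1e-8
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Approximate the original weights from ternary codes and the scale."""
    return [v * scale for v in q]


# Information content of a three-valued weight, in bits (about 1.585).
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -0.05, -1.3, 0.4, 0.02, -0.6]
q, scale = ternarize(weights)
print(q)  # [1, 0, -1, 1, 0, -1]
print(round(BITS_PER_TERNARY_WEIGHT, 3))  # 1.585
```

Small weights collapse to 0 and large ones saturate at -1 or 1, so only the ternary codes and a single scale need to be stored per tensor.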

arXiv information

Authors	Hongyu Wang, Chuyan Xiong, Ruiping Wang, Xilin Chen
Published	2025-06-09 08:15:11+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
