FAST: Efficient Action Tokenization for Vision-Language-Action Models

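Summary

Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, are highly effective at capturing complex, generalizable robot behaviors.
However, such models require choosing a tokenization for continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions.
We find that current approaches to robot action tokenization, which rely on simple per-dimension, per-timestep binning schemes, generally perform poorly when learning dexterous skills from high-frequency robot data.
To address this challenge, we propose a new compression-based tokenization scheme for robot actions, built on the discrete cosine transform.
Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), makes it possible to train autoregressive VLAs on highly dexterous, high-frequency tasks where standard discretization methods fail completely.
Building on FAST, we release FAST+, a universal robot action tokenizer trained on 1 million real robot action trajectories.
It can be used as a black-box tokenizer for a wide range of robot action sequences with diverse action spaces and control frequencies.
Finally, we show that, when combined with the pi0 VLA, our method scales to training on 10,000 hours of robot data and matches the performance of diffusion VLAs while reducing training time by up to 5x.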

Abstract (original)

Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
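
To make the contrast in the abstract concrete, here is a minimal sketch comparing per-timestep binning with a DCT-based representation of a single one-dimensional action chunk. It is not the authors' implementation and does not reproduce FAST or FAST+. The function names, the 50-step sine trajectory, the 256 bins, and the quantization `scale` are all assumptions chosen for illustration. The sketch only shows why a frequency-space encoding of a smooth, densely sampled signal can need fewer non-zero symbols than one token per timestep.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct_ii(c):
    """Inverse of dct_ii (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            s += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(s)
    return out


def bin_tokens(x, num_bins=256, low=-1.0, high=1.0):
    """Naive binning: one token per timestep (assumed range [low, high])."""
    width = (high - low) / num_bins
    return [min(num_bins - 1, max(0, int((v - low) / width))) for v in x]


def dct_tokens(x, scale=10.0):
    """Quantize DCT coefficients by scaling and rounding to integers."""
    return [round(scale * c) for c in dct_ii(x)]


def dct_detokenize(tokens, scale=10.0):
    """Undo the scaling and invert the DCT to recover an approximate chunk."""
    return idct_ii([q / scale for q in tokens])


# Hypothetical action chunk: 50 timesteps of a slowly varying joint command.
chunk = [0.5 * math.sin(2 * math.pi * t / 50) for t in range(50)]

binned = bin_tokens(chunk)              # always 50 tokens
coeffs = dct_tokens(chunk)              # 50 integers, many of them zero
nonzero = sum(1 for q in coeffs if q != 0)
recon = dct_detokenize(coeffs)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))

print("binning tokens:", len(binned))
print("non-zero DCT coefficients:", nonzero)
print("max reconstruction error:", max_err)
```

Raising `scale` keeps more coefficients and lowers the reconstruction error, while lowering it discards more high-frequency detail. A practical tokenizer would still need a way to encode the sparse integer sequence compactly, which this sketch leaves out.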

arXiv information

Authors	Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
Published	2025-01-16 18:57:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
