Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents

Summary

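Pre-training vision-language representations on human action videos has emerged as a promising way to reduce the reliance on large-scale expert demonstrations when training embodied agents.
However, prior methods often use time-contrastive learning based on goal-reaching heuristics, progressively aligning the language instruction with frames from the initial one to the final one.
This overemphasis on future frames can produce erroneous vision-language associations, because an action may end early or the video may include irrelevant moments at the end.
To address this issue, we propose Action Temporal Coherence Learning (AcTOL), which learns ordered and continuous vision-language representations without rigid goal-based constraints.
AcTOL treats a video as a continuous trajectory: it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames.
Extensive imitation learning experiments on both simulated and real robots show that the pretrained features significantly improve downstream manipulation tasks and are highly robust to different linguistic styles of instruction, offering a viable path toward generalized embodied agents.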

Summary (original)

Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments in the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraint. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames. Extensive imitation learning experiments on both simulated and real robots show that the pretrained features significantly enhance downstream manipulation tasks with high robustness to different linguistic styles of instructions, offering a viable pathway toward generalized embodied agents.
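
To make the "local Brownian bridge constraint" more concrete, here is a minimal sketch of the underlying idea, not the paper's actual loss. It relies only on the standard property that a Brownian bridge pinned at two endpoints has a mean that linearly interpolates between them and a variance of (t - s)(e - t)/(e - s) at time t. The function names `brownian_bridge_stats` and `bridge_penalty`, the use of NumPy arrays as frame embeddings, and the variance-normalized squared-error form of the penalty are all assumptions made for illustration.

```python
import numpy as np


def brownian_bridge_stats(z_start, z_end, t_start, t_end, t):
    """Mean and variance of a standard Brownian bridge pinned at both endpoints.

    The mean is the linear interpolation between the endpoint embeddings;
    the variance is zero at the endpoints and largest at the midpoint.
    """
    alpha = (t - t_start) / (t_end - t_start)
    mean = (1.0 - alpha) * z_start + alpha * z_end
    var = (t - t_start) * (t_end - t) / (t_end - t_start)
    return mean, var


def bridge_penalty(z, times, start, end):
    """Hypothetical smoothness penalty over the frames strictly between start and end.

    Each intermediate embedding is compared with the bridge mean, and the
    squared deviation is scaled by the bridge variance, so frames near the
    endpoints are held more tightly than frames in the middle.
    """
    total = 0.0
    for i in range(start + 1, end):
        mean, var = brownian_bridge_stats(
            z[start], z[end], times[start], times[end], times[i]
        )
        total += np.sum((z[i] - mean) ** 2) / (2.0 * var)
    # Average over the intermediate frames (guard against an empty window).
    return total / max(end - start - 1, 1)


# Usage: embeddings that move along a straight line incur no penalty,
# while a frame that jumps off the line is penalized.
times = np.arange(5, dtype=float)
z_line = np.outer(times, np.array([1.0, -2.0]))
z_jump = z_line.copy()
z_jump[2] += 1.0
penalty_line = bridge_penalty(z_line, times, 0, 4)
penalty_jump = bridge_penalty(z_jump, times, 0, 4)
```

In this sketch the window `(start, end)` plays the role of the "local" interval from the abstract; how such windows are sampled, and how the term is combined with the ordering contrast, is not specified here and would have to come from the paper itself.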

arXiv information

Authors	Zhizhen Zhang,Lei Zhu,Zhen Fang,Zi Huang,Yadan Luo
Published	2025-05-22 08:03:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
