Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models

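Abstract

End-to-end learning maps sensory inputs directly to actions, producing highly integrated and efficient policies for complex robotics tasks.
However, such models are difficult to train efficiently and often struggle to generalize beyond their training scenarios, which limits their adaptability to new environments, tasks, and concepts.
This work investigates the minimal data requirements and architectural adaptations needed to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts.
To this end, the authors design datasets with varying levels of data representation richness, refine feature extraction protocols by leveraging multi-modal foundation model encoders, and assess the suitability of different policy network heads.
The findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors to generate spatially aware embeddings that integrate semantic and visual information.
These rich features form the basis for training highly robust downstream policies that generalize across platforms, environments, and text-specified tasks.
The approach is demonstrated on quadrotor fly-to-target tasks, where agents trained via behavior cloning on a small simulated dataset generalize successfully to real-world scenes and handle diverse novel goals and command formulations.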

Abstract (original)

End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models are tricky to efficiently train and often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. To this end, we design datasets with various levels of data representation richness, refine feature extraction protocols by leveraging multi-modal foundation model encoders, and assess the suitability of different policy network heads. Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. These rich features form the basis for training highly robust downstream policies capable of generalizing across platforms, environments, and text-specified tasks. We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes, handling diverse novel goals and command formulations.
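The pipeline the abstract describes (a frozen patch-wise feature extractor, text-conditioned embeddings, and a policy head trained by behavior cloning) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the frozen encoder is replaced by a fixed random projection, the text embedding is a random vector, the fusion rule (cosine-similarity weighting of patches) is a guess at one plausible design, and the policy head is a linear map fitted by ridge regression on synthetic "expert" actions.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 8        # patch side length in pixels (assumed)
EMBED_DIM = 16   # embedding width (assumed)
ACTION_DIM = 4   # size of the action vector (assumed)

# Hypothetical stand-in for a frozen pre-trained encoder: a fixed random
# projection that is never updated during policy training.
W_FROZEN = rng.normal(size=(PATCH * PATCH * 3, EMBED_DIM))


def patch_features(image):
    """Split an HxWx3 image into non-overlapping patches and embed each one."""
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    patches = image[: gh * PATCH, : gw * PATCH].reshape(gh, PATCH, gw, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    return patches @ W_FROZEN  # shape: (num_patches, EMBED_DIM)


def fuse_with_text(feats, text_emb):
    """Weight each patch embedding by its cosine similarity to the text embedding.

    This fusion rule is an illustrative assumption, not the paper's method.
    """
    norms = np.linalg.norm(feats, axis=1) * np.linalg.norm(text_emb) + 1e-8
    sim = (feats @ text_emb) / norms
    return (feats * sim[:, None]).ravel()  # flattened, spatial order preserved


def fit_policy_head(X, Y, lam=1.0):
    """Behavior cloning as supervised regression: ridge fit of a linear head."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


# Synthetic demonstration data (placeholders for simulated expert rollouts).
images = rng.random(size=(64, 32, 32, 3))
text_emb = rng.normal(size=EMBED_DIM)
expert_actions = rng.normal(size=(64, ACTION_DIM))

X = np.stack([fuse_with_text(patch_features(im), text_emb) for im in images])
W_head = fit_policy_head(X, expert_actions)

mse_trained = float(np.mean((X @ W_head - expert_actions) ** 2))
mse_zero = float(np.mean(expert_actions ** 2))  # baseline: always output zero
print(f"training MSE: {mse_trained:.4f} (zero-action baseline: {mse_zero:.4f})")
```

Only the policy head is fitted here; the encoder weights stay fixed, which mirrors the "frozen feature extractor" idea in the abstract. A real system would swap the random projection for an actual VLM encoder and the linear head for a learned network.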

arXiv information

Authors	Makram Chahine, Alex Quach, Alaa Maalouf, Tsun-Hsuan Wang, Daniela Rus
Published	2024-10-16 19:59:31+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
