PaLM-E: An Embodied Multimodal Language Model

Summary

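Large language models excel at a wide range of complex tasks. However, general inference in the real world, for example in robotics problems, raises the challenge of grounding. We propose embodied language models that incorporate real-world continuous sensor modalities directly into language models, thereby establishing the link between words and percepts. The inputs to our embodied language model are multimodal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, together with a pre-trained large language model, on multiple embodied tasks, including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks from a variety of observation modalities and on multiple embodiments. It also exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B (562B parameters), is trained on robotics tasks and is also a visual-language generalist with state-of-the-art performance on OK-VQA, and it retains generalist language capabilities with increasing scale.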

Summary (original)

Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
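The "multi-modal sentences" described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: the function names, the 4-dimensional embedding size, the whitespace tokenizer, and the fixed linear projection are all hypothetical stand-ins chosen for illustration. It only shows the idea of interleaving text token embeddings with vectors derived from continuous observations, in order, within a single input sequence.

```python
import random

EMBED_DIM = 4  # hypothetical, tiny embedding size for illustration only


def embed_token(token, dim=EMBED_DIM):
    """Toy stand-in for a language model's token embedding lookup."""
    rng = random.Random(token)  # string seed gives a deterministic vector per token
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def project_observation(values, dim=EMBED_DIM):
    """Toy stand-in for a learned encoder that maps a continuous
    observation (e.g. a state estimate) into the embedding space."""
    total = sum(values)
    # Fixed linear map: output component i is 0.1 * (i + 1) * sum(values).
    return [0.1 * (i + 1) * total for i in range(dim)]


def build_multimodal_sentence(segments):
    """Interleave text and observation segments into one vector sequence.

    segments: list of ("text", str) or ("obs", list of floats).
    Returns a list of (kind, vector) pairs in the original order.
    """
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            # Assumption: whitespace tokenization, one vector per token.
            for token in payload.split():
                sequence.append(("text", embed_token(token)))
        elif kind == "obs":
            # Assumption: each observation becomes a single vector.
            sequence.append(("obs", project_observation(payload)))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence


# Hypothetical prompt: text, then a continuous observation, then more text.
example = [
    ("text", "Q: what is in"),
    ("obs", [0.2, 0.5, 0.1]),
    ("text", "? A:"),
]
sentence = build_multimodal_sentence(example)
kinds = [kind for kind, _ in sentence]
```

In this sketch the observation vector occupies a position in the sequence just like a word would, which is the sense in which the inputs are "sentences". In a real system the encoders would be trained end-to-end with the language model rather than fixed as they are here.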

arXiv information

Authors	Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence
Published	2023-03-06 18:58:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
