Magma: A Foundation Model for Multimodal AI Agents

Summary

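Magma is a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds.
Magma is a significant extension of vision-language (VL) models: it retains their VL understanding ability (verbal intelligence) and is also equipped to plan and act in the visual-spatial world (spatial-temporal intelligence), completing agentic tasks ranging from UI navigation to robot manipulation.
To acquire these agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data.
Actionable visual objects in images (e.g., clickable buttons in a GUI) are labeled with Set-of-Mark (SoM) for action grounding, and object movements in videos (e.g., traces of human hands or robotic arms) are labeled with Trace-of-Mark (ToM) for action planning.
Extensive experiments show that SoM and ToM work in strong synergy and facilitate the acquisition of spatial-temporal intelligence by the Magma model, which is fundamental to a wide range of tasks, as shown in Fig. 1.
In particular, Magma sets new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models tailored specifically to those tasks.
On image- and video-related multimodal tasks, Magma also compares favorably with popular large multimodal models trained on much larger datasets.
The model and code are publicly available for reproducibility at https://microsoft.github.io/Magma.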

Summary (original)

We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.

arXiv information

Authors	Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
Published	2025-02-18 18:55:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
