LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Summary

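LLaVA-Plus is a general-purpose multimodal assistant that extends the capabilities of large multimodal models.
It maintains a skill repository of pre-trained vision and vision-language models and activates the relevant tools based on the user's input to carry out real-world tasks.
LLaVA-Plus is trained on multimodal instruction-following data to learn tool use, covering visual understanding, generation, external knowledge retrieval, and compositions of these skills.
Experimental results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones.
Its distinguishing feature is that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, which significantly improves tool-use performance and enables new scenarios.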
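
The idea of a skill repository whose tools are activated from the user's input can be illustrated with a small sketch. This is not the LLaVA-Plus implementation: in LLaVA-Plus the trained model itself decides which tool to call, whereas the sketch below substitutes simple keyword matching as a stand-in. All names (`Skill`, `SkillRepository`, `answer`, the keyword lists, and the placeholder tools) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    """A hypothetical tool entry: a name, trigger keywords, and a callable."""
    name: str
    keywords: List[str]
    run: Callable[[str, str], str]  # (image_path, instruction) -> result text


class SkillRepository:
    """Holds registered skills and picks the ones relevant to an instruction."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def select(self, instruction: str) -> List[Skill]:
        # Assumption: keyword matching stands in for the model's learned choice.
        text = instruction.lower()
        return [s for s in self._skills.values()
                if any(k in text for k in s.keywords)]


def answer(repo: SkillRepository, image_path: str, instruction: str) -> str:
    """Run every selected skill on the same image and join the results."""
    selected = repo.select(instruction)
    if not selected:
        return "no tool needed: answer directly"
    # The image is passed to every tool call, so it stays part of the session.
    return "\n".join(f"{s.name}: {s.run(image_path, instruction)}" for s in selected)


repo = SkillRepository()
# Placeholder tools that only return strings; real tools would be vision models.
repo.register(Skill("detection", ["detect", "find"], lambda img, q: f"boxes for {img}"))
repo.register(Skill("retrieval", ["who", "search"], lambda img, q: f"facts about {img}"))

result = answer(repo, "photo.jpg", "Detect the dogs in this image")
print(result)
```

An instruction that matches several skills runs all of them on the same image, which loosely mirrors the "compositions" mentioned in the summary.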

Summary (original)

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users’ inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.

arXiv information

Authors	Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
Published	2023-11-09 15:22:26+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
