LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

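Summary

LLaVA-Interactive is a research prototype for multimodal human-AI interaction.
The system holds multi-turn dialogues with human users by accepting multimodal user inputs and generating multimodal responses.
Importantly, LLaVA-Interactive goes beyond language prompts: visual prompts are also supported, so that the system can better align with human intent during the interaction.
Development of LLaVA-Interactive is highly cost-efficient because the system combines three multimodal skills of pre-built AI models without any additional model training: visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN.
A diverse set of application scenarios is presented to demonstrate the potential of LLaVA-Interactive and to inspire future research on multimodal interactive systems.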
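
The idea of composing pre-built skills without retraining can be illustrated with a small dispatcher. The sketch below is only an illustration of that pattern, not the LLaVA-Interactive implementation: `Turn`, `Session`, `run_turn`, and the three `*_stub` functions are hypothetical names, and the stubs return placeholder strings instead of calling LLaVA, SEEM, or GLIGEN.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Turn:
    """One user turn: which skill to use, plus text and/or a visual prompt."""
    skill: str                            # "chat", "segment", or "generate"
    text: str = ""
    visual_prompt: Optional[str] = None   # e.g. an identifier for a user-drawn stroke


@dataclass
class Session:
    """Shared state carried across turns (hypothetical structure)."""
    image: str = "initial-image"
    history: List[str] = field(default_factory=list)


def chat_stub(session: Session, turn: Turn) -> str:
    # Stand-in for a visual-chat model; answers in text about the current image.
    return f"[chat about {session.image}] {turn.text}"


def segment_stub(session: Session, turn: Turn) -> str:
    # Stand-in for a segmentation model driven by a visual prompt.
    return f"[mask on {session.image} from {turn.visual_prompt}]"


def generate_stub(session: Session, turn: Turn) -> str:
    # Stand-in for image generation/editing; replaces the current image.
    session.image = f"edited({session.image})"
    return f"[new image {session.image}] {turn.text}"


# The skills are frozen, pre-built components; the dispatcher only routes.
SKILLS: Dict[str, Callable[[Session, Turn], str]] = {
    "chat": chat_stub,
    "segment": segment_stub,
    "generate": generate_stub,
}


def run_turn(session: Session, turn: Turn) -> str:
    """Route one turn to the requested skill and record the response."""
    if turn.skill not in SKILLS:
        raise ValueError(f"unknown skill: {turn.skill}")
    response = SKILLS[turn.skill](session, turn)
    session.history.append(response)
    return response
```

Because every skill reads and writes the same `Session`, an image edited in one turn is the image discussed or segmented in the next, which is the property a multi-turn, multimodal dialogue needs.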

Abstract (original)

LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can have multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses. Importantly, LLaVA-Interactive goes beyond language prompt, where visual prompt is enabled to align human intents in the interaction. The development of LLaVA-Interactive is extremely cost-efficient as the system combines three multimodal skills of pre-built AI models without additional model training: visual chat of LLaVA, image segmentation from SEEM, as well as image generation and editing from GLIGEN. A diverse set of application scenarios is presented to demonstrate the promises of LLaVA-Interactive and to inspire future research in multimodal interactive systems.

arXiv information

Authors	Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li
Published	2023-11-01 15:13:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
