Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Summary

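ChatGPT is attracting interest across fields because it provides a language interface with remarkable conversational and reasoning capabilities across many domains.
However, because ChatGPT is trained on language, it currently cannot process or generate images from the visual world.
Meanwhile, Visual Foundation Models such as Visual Transformers and Stable Diffusion show strong visual understanding and generation capabilities, but they are only experts on specific tasks with one-round, fixed inputs and outputs.
To this end, we build a system called Visual ChatGPT, which incorporates different Visual Foundation Models so that users can interact with ChatGPT by 1) sending and receiving not only language but also images, 2) providing complex visual questions or visual editing instructions that require multiple AI models to collaborate over multiple steps, and 3) providing feedback and asking for corrected results.
We design a series of prompts to inject visual model information into ChatGPT, taking into account models with multiple inputs/outputs and models that require visual feedback.
Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models.
Our system is publicly available at https://github.com/microsoft/visual-chatgpt.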
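
The prompt design mentioned in the abstract can be pictured with a small sketch. The code below is not taken from the paper or from the repository; it only illustrates the general idea of describing visual models to a language model as text and routing a chosen call to the right model. Every name in it (`VisualTool`, `build_system_prompt`, `dispatch`, the two stub tools, and the file-naming scheme) is a hypothetical stand-in, and the stubs return placeholder strings instead of running real models.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VisualTool:
    """Hypothetical wrapper around one visual foundation model."""
    name: str
    description: str            # tells the language model when to use the tool
    inputs: str                 # textual description of the expected input
    run: Callable[[str], str]   # takes a text argument, returns text


def fake_caption(image_path: str) -> str:
    # Stub: a real system would run an image-captioning model here.
    return f"a placeholder caption for {image_path.strip()}"


def fake_edit(argument: str) -> str:
    # Stub: a real system would run an image-editing model here.
    # The argument is assumed to look like "image_path, instruction".
    image_path, _instruction = argument.split(",", 1)
    stem, ext = os.path.splitext(image_path.strip())
    # Return a new file name so that later steps can refer to the result.
    return f"{stem}_edited{ext}"


TOOLS: Dict[str, VisualTool] = {
    tool.name: tool
    for tool in (
        VisualTool("caption_image", "Describes what an image shows.",
                   "image_path", fake_caption),
        VisualTool("edit_image", "Edits an image according to a text instruction.",
                   "image_path, instruction", fake_edit),
    )
}


def build_system_prompt(tools: Dict[str, VisualTool]) -> str:
    """Turn the tool registry into text that can be placed in a prompt."""
    lines = ["You can use the following visual tools. Refer to images by file name."]
    for tool in tools.values():
        lines.append(f"- {tool.name}: {tool.description} Input: {tool.inputs}.")
    return "\n".join(lines)


def dispatch(tools: Dict[str, VisualTool], tool_name: str, argument: str) -> str:
    """Run the tool that the language model selected, with its text argument."""
    if tool_name not in tools:
        raise KeyError(f"unknown tool: {tool_name}")
    return tools[tool_name].run(argument)
```

Because each stub returns plain text (a caption or a new file name), the output of one call can be passed to the next, for example `dispatch(TOOLS, "caption_image", dispatch(TOOLS, "edit_image", "cat.png, make it blue"))`. That chaining is one simple way to picture the multi-step collaboration the abstract describes.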

Summary (original)

ChatGPT is attracting a cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained with languages, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, they are only experts on specific tasks with one-round fixed inputs and outputs. To this end, We build a system called \textbf{Visual ChatGPT}, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only languages but also images 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models with multi-steps. 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models of multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at \url{https://github.com/microsoft/visual-chatgpt}.

arXiv information

Authors	Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan
Published	2023-03-08 15:50:02+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
