MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

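Summary

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
This paper defines and explores a comprehensive list of advanced vision tasks that are intriguing to solve but may exceed the capabilities of existing vision and vision-language models.
To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos.
This prompt design lets language models accept, associate, and process multimodal information, which makes it possible to combine ChatGPT with various vision experts synergistically.
Zero-shot experiments demonstrate MM-REACT's effectiveness on the specified capabilities of interest and its broad applicability to scenarios that require advanced visual understanding.
We also discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models to multimodal scenarios through joint finetuning.
Code, demo, video, and visualizations: https://multimodal-react.github.io/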

Summary (original)

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT’s prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT’s effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT’s system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
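
The abstract describes a textual prompt design that carries text descriptions, textualized spatial coordinates, and file names standing in for images or videos. The following is a minimal sketch of that idea, not the authors' actual prompt format: the `Region` class, the `textualize_observation` function, the line layout, and the sample file name and boxes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A detected object with a pixel bounding box (hypothetical structure)."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def textualize_observation(file_name: str, caption: str, regions: List[Region]) -> str:
    """Render an image reference, a caption, and boxes as plain text.

    The image itself is never placed in the prompt; only its file name is,
    so a text-only language model can refer back to it. The exact wording
    of each line is an assumption made for this sketch.
    """
    lines = [f"Image file: {file_name}", f"Description: {caption}"]
    for region in regions:
        x1, y1, x2, y2 = region.box
        lines.append(f"Object: {region.label} at (x1={x1}, y1={y1}, x2={x2}, y2={y2})")
    return "\n".join(lines)


# Illustrative input; in a real system these values would come from vision experts.
prompt = textualize_observation(
    "example.jpg",
    "two dogs playing in a park",
    [Region("dog", (10, 20, 110, 220)), Region("dog", (150, 30, 260, 240))],
)
print(prompt)
```

Because everything is serialized to text, the output of one expert (for example, a detector's boxes) can be appended to the conversation and read by the language model in the same way as any other message.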

arXiv information

Authors	Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
Published	2023-03-20 18:31:47+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
