Caption Anything: Interactive Image Description with Diverse Multimodal Controls

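Abstract

Controllable image captioning is an emerging multimodal topic that aims to describe an image in natural language according to a human purpose, e.g., attending to specified regions or writing in a particular text style.
State-of-the-art methods are trained on annotated pairs of input controls and output captions.
However, such well-annotated multimodal data is scarce, which largely limits the usability and scalability of these methods for interactive AI systems.
Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data.
This paper presents Caption AnyThing (CAT), a foundation-model-augmented image captioning framework that supports a wide range of multimodal controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality.
Powered by the Segment Anything Model (SAM) and ChatGPT, CAT unifies visual and language prompts in a modularized framework, enabling flexible combinations of different controls.
Extensive case studies demonstrate the framework's ability to align with user intent, shedding light on effective user interaction modeling in vision-language applications.
The code is publicly available at https://github.com/ttengwang/Caption-Anything.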
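
The two control families named in the abstract can be pictured as plain data that is later turned into prompts. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `VisualControl`, `LanguageControl`, and `build_instruction` are hypothetical, the instruction wording is invented for illustration, and the segmentation and language-model calls are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass(frozen=True)
class VisualControl:
    """Hypothetical container for one visual prompt."""
    kind: str                 # "point", "box", or "trajectory"
    coords: Sequence[Point]   # one point, two box corners, or a path

    def __post_init__(self) -> None:
        if self.kind not in {"point", "box", "trajectory"}:
            raise ValueError(f"unknown visual control: {self.kind!r}")
        if not self.coords:
            raise ValueError("a visual control needs at least one point")
        if self.kind == "box" and len(self.coords) != 2:
            raise ValueError("a box needs exactly two corner points")


@dataclass(frozen=True)
class LanguageControl:
    """Hypothetical container for the language controls in the abstract."""
    sentiment: Optional[str] = None  # e.g. "positive"
    length: Optional[int] = None     # target length in words
    language: Optional[str] = None   # e.g. "Japanese"
    factual: bool = True             # False allows imaginative additions


def build_instruction(raw_caption: str, control: LanguageControl) -> str:
    """Compose a text instruction for an instruction-following model.

    The wording is an assumption made for this sketch; it only shows how
    independent controls can be combined into a single prompt.
    """
    parts = [f'Rewrite this image-region caption: "{raw_caption}".']
    if control.sentiment:
        parts.append(f"Use a {control.sentiment} tone.")
    if control.length is not None:
        parts.append(f"Use about {control.length} words.")
    if control.language:
        parts.append(f"Write in {control.language}.")
    if control.factual:
        parts.append("Do not add details that are not in the caption.")
    else:
        parts.append("You may add imaginative details.")
    return " ".join(parts)
```

Because each control is optional and contributes its own sentence, any subset of controls can be combined, which mirrors the flexible combination the abstract describes. In a full system the `VisualControl` would be passed to a segmentation model and the instruction to a language model.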

Abstract (original)

Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose, e.g., looking at the specified regions or telling in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation model augmented image captioning framework supporting a wide range of multimodal controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling the flexible combination between different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.

arXiv information

Authors	Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao
Published	2023-07-06 13:47:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
