Controllable Dense Captioner with Multimodal Embedding Bridging

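Summary

This paper proposes ControlCap, a controllable dense captioner that introduces linguistic guidance so that dense captioning can follow the user's intent.
ControlCap is defined as a multimodal embedding bridging architecture consisting of a multimodal embedding generation (MEG) module and a bi-directional embedding bridging (BEB) module.
The MEG module represents objects/regions by combining embeddings of detailed information with context-aware embeddings, and it also lets ControlCap adapt to specialized controls by using them as linguistic guidance.
The BEB module aligns the linguistic guidance with the visual embeddings by borrowing features from the visual domain and returning them to it, then gathering those features to predict text descriptions.
In experiments on the Visual Genome and VG-COCO datasets, ControlCap outperforms state-of-the-art methods by 1.5% and 3.7% mAP, respectively.
Finally, because it can convert region-category pairs into region-text pairs, ControlCap can serve as a powerful data engine for dense captioning.
Code is available at https://github.com/callsys/ControlCap.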
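
The data-engine use described in the summary (turning region-category pairs into region-text pairs) can be pictured with a minimal Python sketch. Everything named here is hypothetical and is not taken from the ControlCap repository: the `RegionCategory` and `RegionText` types, the `(x1, y1, x2, y2)` box format, the captioner call signature, and the choice to pass the category label as the control word are all assumptions made for illustration. The `stub_captioner` is a stand-in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RegionCategory:
    """A region annotated only with a category label (e.g. from a detection dataset)."""
    box: Box
    category: str


@dataclass
class RegionText:
    """A region annotated with a free-form text description."""
    box: Box
    text: str


# Hypothetical captioner interface: (image, box, control words) -> caption.
Captioner = Callable[[object, Box, List[str]], str]


def to_region_text(image: object,
                   pairs: List[RegionCategory],
                   captioner: Captioner) -> List[RegionText]:
    """Convert region-category pairs to region-text pairs.

    Assumption: the category label is handed to the captioner as its
    linguistic guidance (control), one region at a time.
    """
    return [RegionText(p.box, captioner(image, p.box, [p.category])) for p in pairs]


def stub_captioner(image: object, box: Box, controls: List[str]) -> str:
    """Stand-in for a real controllable captioner; it only echoes the controls."""
    return "a region showing " + " ".join(controls)


pairs = [
    RegionCategory((0.0, 0.0, 10.0, 10.0), "dog"),
    RegionCategory((5.0, 5.0, 20.0, 20.0), "frisbee"),
]
result = to_region_text(None, pairs, stub_captioner)
for r in result:
    print(r.box, r.text)
```

Replacing `stub_captioner` with a call into an actual model would leave the conversion loop unchanged, which is the point of keeping the captioner behind a plain callable.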

Summary (original)

In this paper, we propose a controllable dense captioner (ControlCap), which accommodates user’s intention to dense captioning by introducing linguistic guidance. ControlCap is defined as a multimodal embedding bridging architecture, which comprises multimodal embedding generation (MEG) module and bi-directional embedding bridging (BEB) module. While MEG module represents objects/regions by combining embeddings of detailed information with context-aware ones, it also endows ControlCap the adaptability to specialized controls by utilizing them as linguistic guidance. BEB module aligns the linguistic guidance with visual embeddings through borrowing/returning features from/to the visual domain and gathering such features to predict text descriptions. Experiments on Visual Genome and VG-COCO datasets show that ControlCap respectively outperforms the state-of-the-art methods by 1.5% and 3.7% (mAP). Last but not least, with the capability of converting region-category pairs to region-text pairs, ControlCap is able to act as a powerful data engine for dense captioning. Code is available at https://github.com/callsys/ControlCap.

arXiv information

Authors	Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Qixiang Ye, Fang Wan
Published	2024-01-31 15:15:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
