Semi-supervised multimodal coreference resolution in image narrations

Summary

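This paper studies multimodal coreference resolution in the setting where a longer descriptive text, i.e., a narration, is paired with an image.
This setting is challenging because it requires fine-grained image-text alignment, because narrative language is inherently ambiguous, and because no large annotated training sets are available.
To address these challenges, we propose a data-efficient semi-supervised approach that uses image-narration pairs to resolve coreference and perform narrative grounding in a multimodal context.
Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework.
Our evaluation shows that the proposed approach outperforms strong baselines, both quantitatively and qualitatively, on the tasks of coreference resolution and narrative grounding.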
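
The summary says only that losses on labeled and unlabeled data are combined; it does not specify the loss terms. As a rough illustration of that general idea, the sketch below adds a supervised cross-entropy term to a weighted unsupervised term. Everything in it is an assumption made for illustration: the function names, the `unlabeled_weight` parameter, and the choice of entropy minimization as the unlabeled term are generic stand-ins, not the losses used in the paper.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the annotated class (supervised term)."""
    return -math.log(probs[target_index])


def entropy(probs):
    """Shannon entropy of a predicted distribution (stand-in unsupervised term)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean(values):
    """Average of a list, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def semi_supervised_loss(labeled, unlabeled, unlabeled_weight=0.5):
    """Combine a supervised and an unsupervised loss into one objective.

    labeled:   list of (predicted_probs, target_index) pairs
    unlabeled: list of predicted_probs without annotations
    unlabeled_weight: hypothetical hyperparameter balancing the two terms
    """
    supervised = mean([cross_entropy(p, y) for p, y in labeled])
    unsupervised = mean([entropy(p) for p in unlabeled])
    return supervised + unlabeled_weight * unsupervised


if __name__ == "__main__":
    labeled_batch = [([0.7, 0.3], 0), ([0.2, 0.8], 1)]
    unlabeled_batch = [[0.6, 0.4], [0.9, 0.1]]
    print(semi_supervised_loss(labeled_batch, unlabeled_batch))
```

The weighting keeps the small annotated set as the main training signal while still letting unannotated image-narration pairs contribute; in practice the unlabeled term and its weight would be chosen to suit the task.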

Summary (original)

In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration is paired with an image. This poses significant challenges due to fine-grained image-text alignment, inherent ambiguity present in narrative language, and unavailability of large annotated training sets. To tackle these challenges, we present a data efficient semi-supervised approach that utilizes image-narration pairs to resolve coreferences and narrative grounding in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines both quantitatively and qualitatively, for the tasks of coreference resolution and narrative grounding.

arXiv information

Authors	Arushi Goel, Basura Fernando, Frank Keller, Hakan Bilen
Published	2023-10-20 16:10:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
