Contextual AD Narration with Interleaved Multimodal Sequence

Summary

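The Audio Description (AD) task aims to generate descriptions of visual elements so that visually impaired people can access long-form video content such as movies.
Taking video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience follow the movie's storyline.
To achieve this goal, we propose Uni-AD, a simple and unified framework that leverages pre-trained foundation models to generate ADs from an interleaved multimodal sequence as input.
To strengthen the alignment of features across modalities at a finer granularity, we introduce a simple, lightweight module that maps video features into the textual feature space.
In addition, we propose a character-refinement module that provides more precise information by identifying the main characters who play a more significant role in the video context.
Building on these designs, we further incorporate contextual information and a contrastive loss into the architecture to generate smoother, more contextual ADs.
Experiments on the MAD-eval dataset show that Uni-AD achieves state-of-the-art performance on AD generation, demonstrating the effectiveness of our approach.
The code is available at https://github.com/MCG-NJU/Uni-AD.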

Abstract (original)

The Audio Description (AD) task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video contents, like movie. With video feature, text, character bank and context information as inputs, the generated ADs are able to correspond to the characters by name and provide reasonable, contextual descriptions to help audience understand the storyline of movie. To achieve this goal, we propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs with interleaved multimodal sequence as input, termed as Uni-AD. To enhance the alignment of features across various modalities with finer granularity, we introduce a simple and lightweight module that maps video features into the textual feature space. Moreover, we also propose a character-refinement module to provide more precise information by identifying the main characters who play more significant role in the video context. With these unique designs, we further incorporate contextual information and a contrastive loss into our architecture to generate more smooth and contextual ADs. Experiments on the MAD-eval dataset show that Uni-AD can achieve state-of-the-art performance on AD generation, which demonstrates the effectiveness of our approach. Code will be available at https://github.com/MCG-NJU/Uni-AD.

arXiv information

Authors	Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, Limin Wang
Published	2024-03-19 17:27:55+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
