AutoAD II: The Sequel — Who, When, and What in Movie Audio Description

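Summary

Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences.
For movies, this presents notable challenges: AD must occur only during existing pauses in dialogue, should refer to characters by name, and should aid understanding of the storyline as a whole.
To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech.
The model addresses all three of the "who", "when", and "what" questions:
(i) Who: we introduce a character bank consisting of each character's name, the actor who played the part, and a CLIP feature of their face, for the principal cast of each movie, and show how it can be used to improve naming in the generated AD.
(ii) When: we investigate several models for deciding whether an AD should be generated for a given time interval, based on the visual content of the interval and its neighbours.
(iii) What: we implement a new vision-language model for this task. It ingests the proposals from the character bank while conditioning on the visual features using cross-attention, and we show that it improves over previous architectures for AD text generation in an apples-to-apples comparison.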
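
To make the character bank idea concrete, here is a minimal sketch of how such a bank could be stored and queried. It is not the authors' implementation: the `CharacterEntry` class, the `propose_characters` function, the cosine-similarity matching rule, and the `threshold` value are all assumptions made for illustration, and the short three-dimensional vectors merely stand in for real CLIP face features.

```python
import math
from dataclasses import dataclass


@dataclass
class CharacterEntry:
    """One row of a hypothetical character bank."""
    name: str                   # character name from the cast list
    actor: str                  # actor who played the part
    face_feature: list[float]   # stand-in for a CLIP feature of the face


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propose_characters(frame_features, bank, threshold=0.8):
    """Return names of bank entries whose face feature matches at least one
    frame feature with similarity >= threshold, best match first.

    The threshold rule is an assumption for this sketch only.
    """
    scored = []
    for entry in bank:
        best = max((cosine(entry.face_feature, f) for f in frame_features),
                   default=0.0)
        if best >= threshold:
            scored.append((best, entry.name))
    scored.sort(reverse=True)
    return [name for _, name in scored]


# Placeholder data: names and vectors are invented for the example.
bank = [
    CharacterEntry("Character A", "Actor A", [1.0, 0.0, 0.0]),
    CharacterEntry("Character B", "Actor B", [0.0, 1.0, 0.0]),
]
frames = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
proposals = propose_characters(frames, bank)
```

With the placeholder data above, only "Character A" is proposed. In a pipeline like the one the summary describes, such proposals would be passed to the text generator so that it can name characters instead of describing them generically.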

Summary (original)

Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges — AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the ‘who’, ‘when’, and ‘what’ questions: (i) who — we introduce a character bank consisting of the character’s name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when — we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what — we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
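
The constraint that AD must occur only during existing pauses in dialogue can be illustrated with a small sketch that turns the temporal locations of speech into candidate intervals. This covers only the dialogue-gap constraint. The "when" models in the abstract additionally use the visual content of each interval and its neighbours, which is not modelled here. The function name `dialogue_gaps` and the `min_gap` parameter are hypothetical.

```python
def dialogue_gaps(speech_segments, movie_duration, min_gap=2.0):
    """Return (start, end) intervals, in seconds, that contain no speech
    and last at least `min_gap` seconds.

    speech_segments: iterable of (start, end) times; may overlap or be unsorted.
    """
    gaps = []
    cursor = 0.0  # end time of the speech seen so far
    for start, end in sorted(speech_segments):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping segments
    if movie_duration - cursor >= min_gap:
        gaps.append((cursor, movie_duration))
    return gaps


# Invented timings for illustration.
speech = [(0.0, 4.0), (5.0, 9.5), (14.0, 20.0)]
candidates = dialogue_gaps(speech, movie_duration=25.0)
```

Each candidate interval would then be passed to a classifier that decides whether an AD should actually be generated for it.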

arXiv information

Authors	Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Published	2023-10-10 17:59:53+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
