AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Abstract

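Our goal is to generate Audio Descriptions (ADs) for both movies and TV series without any training.
We draw on off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task.
Our contributions are three-fold. (i) We show that a VLM can name and refer to characters, without any fine-tuning, when it is prompted directly with character information through visual indications.
(ii) We develop a two-stage process for generating ADs: the first stage asks the VLM to describe the video comprehensively, and the second stage uses an LLM to summarise the dense textual information into one succinct AD sentence.
(iii) We formulate a new dataset for TV audio description.
Our approach, named AutoAD-Zero, performs strongly in AD generation for both movies and TV series (it is even competitive with some models fine-tuned on ground-truth ADs) and achieves state-of-the-art CRITIC scores.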

Abstract (original)

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising a LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
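The two-stage process in contribution (ii) can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `Clip`, `stage1_prompt`, `stage2_prompt`, `generate_ad`, the prompt wording, and the stub models are all hypothetical names introduced here, and the VLM and LLM are treated as plain callables so that any real client could be wrapped to fit.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type aliases: any VLM or LLM client can be wrapped to fit them.
VLM = Callable[[Sequence[str], str], str]   # (frames, prompt) -> dense description
LLM = Callable[[str], str]                  # prompt -> text


@dataclass
class Clip:
    frames: Sequence[str]            # placeholder frame identifiers or paths
    character_names: Sequence[str]   # names assumed to be marked visually on the frames


def stage1_prompt(clip: Clip) -> str:
    """Ask for a comprehensive description and point the VLM at the marked characters."""
    names = ", ".join(clip.character_names) or "none"
    return (
        "Describe the video clip comprehensively. "
        f"Characters indicated in the frames: {names}. "
        "Refer to them by name."
    )


def stage2_prompt(dense_description: str) -> str:
    """Ask for a single succinct AD sentence summarising the dense description."""
    return (
        "Summarise the following description into one succinct "
        f"audio description sentence:\n{dense_description}"
    )


def generate_ad(clip: Clip, vlm: VLM, llm: LLM) -> str:
    dense = vlm(clip.frames, stage1_prompt(clip))   # stage 1: dense description
    return llm(stage2_prompt(dense)).strip()        # stage 2: one-sentence AD


# Stub models (assumptions) so the sketch runs without any external service.
def fake_vlm(frames: Sequence[str], prompt: str) -> str:
    return f"{len(frames)} frames. Prompt mentioned Anna: {'Anna' in prompt}."


def fake_llm(prompt: str) -> str:
    return " Anna walks to the door. "


clip = Clip(frames=["f0", "f1"], character_names=["Anna"])
print(generate_ad(clip, fake_vlm, fake_llm))
```

Keeping both models behind callables means no training step appears anywhere in the sketch, which mirrors the training-free framing of the abstract; the actual prompts and visual character indications used by AutoAD-Zero are not given in this summary.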

arXiv information

Authors	Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Published	2024-07-22 17:59:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
