DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models

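Summary

Recent advances in image and video creation, particularly AI-based image synthesis, have produced large numbers of visual scenes that are highly abstract and diverse.
As a result, visual storytelling (VST), the task of generating a meaningful, coherent narrative from a collection of images, has become more challenging, and demand for it increasingly extends beyond real-world images.
Existing VST techniques, which typically use autoregressive decoders, have made significant progress, but they suffer from slow inference and are poorly suited to synthetic scenes.
To address this, we propose DiffuVST, a novel diffusion-based system that models the generation of a series of visual descriptions as a single conditional denoising process.
Because DiffuVST is stochastic and non-autoregressive at inference time, it can generate highly diverse narratives more efficiently.
In addition, DiffuVST features a distinctive design with bidirectional text history guidance and multimodal adapter modules, which effectively improve inter-sentence coherence and image-to-text fidelity.
Extensive experiments on story generation across four fictional visual-story datasets show that DiffuVST outperforms traditional autoregressive models in both text quality and inference speed.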

Abstract (original)

Recent advances in image and video creation, especially AI-based image synthesis, have led to the production of numerous visual scenes that exhibit a high level of abstractness and diversity. Consequently, Visual Storytelling (VST), a task that involves generating meaningful and coherent narratives from a collection of images, has become even more challenging and is increasingly desired beyond real-world imagery. While existing VST techniques, which typically use autoregressive decoders, have made significant progress, they suffer from low inference speed and are not well-suited for synthetic scenes. To this end, we propose a novel diffusion-based system DiffuVST, which models the generation of a series of visual descriptions as a single conditional denoising process. The stochastic and non-autoregressive nature of DiffuVST at inference time allows it to generate highly diverse narratives more efficiently. In addition, DiffuVST features a unique design with bi-directional text history guidance and multimodal adapter modules, which effectively improve inter-sentence coherence and image-to-text fidelity. Extensive experiments on the story generation task covering four fictional visual-story datasets demonstrate the superiority of DiffuVST over traditional autoregressive models in terms of both text quality and inference speed.
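
The phrase "a single conditional denoising process" can be illustrated with a toy sampler. The sketch below is not the authors' implementation: `toy_denoiser`, `denoise_story`, the linear noise schedule, and the feature vectors are all hypothetical, and the bidirectional history guidance and adapter modules are not modeled. It only shows the non-autoregressive idea, in which every token embedding of every sentence is refined in parallel at each denoising step, conditioned on the image features.

```python
import math
import random


def toy_denoiser(x_t, image_feats, tokens_per_image):
    """Hypothetical stand-in for a trained network.

    It ignores the noisy input x_t and predicts each token's clean
    embedding as the feature vector of the image it belongs to.
    """
    return [list(feat) for feat in image_feats for _ in range(tokens_per_image)]


def denoise_story(image_feats, tokens_per_image=3, num_steps=20, seed=0):
    """Toy conditional denoising over all story tokens at once."""
    rng = random.Random(seed)
    dim = len(image_feats[0])
    n_tokens = len(image_feats) * tokens_per_image

    # Assumed linear noise schedule; alpha_bar[t] is the cumulative signal level.
    betas = [1e-4 + (0.02 - 1e-4) * t / (num_steps - 1) for t in range(num_steps)]
    alpha_bar, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bar.append(prod)

    # Start from Gaussian noise for every token of every sentence.
    x = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

    for t in range(num_steps - 1, -1, -1):
        # One network call predicts clean embeddings for the whole story.
        x0_hat = toy_denoiser(x, image_feats, tokens_per_image)
        if t == 0:
            x = x0_hat
        else:
            # Re-noise the prediction to the noise level of step t - 1.
            a = alpha_bar[t - 1]
            x = [
                [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0, 1) for v in row]
                for row in x0_hat
            ]
    return x


feats = [[1.0, 0.0], [0.0, 1.0]]  # two toy "image" feature vectors
story = denoise_story(feats)
```

The number of network calls here is `num_steps`, independent of story length, whereas an autoregressive decoder needs one call per generated token. Different seeds give different intermediate trajectories, which is the source of sample diversity in a real model with a learned denoiser.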

arXiv information

Authors	Shengguang Wu, Mei Yuan, Qi Su
Published	2023-12-12 08:40:38+00:00

Source and services used

arxiv.jp, Google
