Intelligent Grimm — Open-ended Visual Storytelling via Latent Diffusion Models

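Summary

Generative models have recently shown strong capabilities in a range of scenarios, such as generating images from text descriptions.
This work focuses on open-ended visual storytelling: the task of generating a coherent sequence of images that follows a given storyline.
We make three contributions: (i) to carry out visual storytelling, we introduce two modules into a pre-trained Stable Diffusion model and build an auto-regressive image generator called StoryGen,
which generates the current frame conditioned on both a text prompt and the preceding frame.
(ii) To train the proposed model, we collect paired image and text samples from various online sources such as videos and e-books, and establish a data processing pipeline for building a diverse dataset named StorySalon,
which has a far larger vocabulary than existing animation-specific datasets.
(iii) We adopt a three-stage curriculum training strategy whose stages enable style transfer, visual context conditioning, and human feedback alignment, respectively.
Quantitative experiments and human evaluation validate the superiority of the proposed model in terms of image quality, style consistency, content consistency, and visual-language alignment.
We will release the code, model, and dataset to the research community.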
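
The auto-regressive generation in contribution (i) can be pictured as a loop in which each frame is produced from its own text prompt plus the frame generated just before it. The following is only a minimal sketch under stated assumptions: `generate_story`, `generate_frame`, and `toy_generator` are hypothetical names introduced for illustration, frames are represented as plain strings rather than images, and nothing here reflects StoryGen's actual interface or its diffusion sampler.

```python
from typing import Callable, List, Optional

# Assumption: a frame is a plain string here; a real system would use an image tensor.
Frame = str


def generate_story(
    prompts: List[str],
    generate_frame: Callable[[str, Optional[Frame]], Frame],
) -> List[Frame]:
    """Generate one frame per prompt, feeding each frame into the next step."""
    frames: List[Frame] = []
    previous: Optional[Frame] = None  # the first frame has no visual context
    for prompt in prompts:
        # Condition on both the text prompt and the preceding frame.
        frame = generate_frame(prompt, previous)
        frames.append(frame)
        previous = frame
    return frames


def toy_generator(prompt: str, previous: Optional[Frame]) -> Frame:
    """Hypothetical stand-in for a conditioned image generator."""
    context = "none" if previous is None else previous
    return f"frame({prompt} | prev={context})"


if __name__ == "__main__":
    for frame in generate_story(["a fox wakes up", "the fox finds a river"], toy_generator):
        print(frame)
```

Passing the generator in as a callable keeps the loop independent of any particular model, so the same structure would hold if the toy function were swapped for a real text-and-image-conditioned sampler.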

Summary (original)

Generative models have recently exhibited exceptional capabilities in various scenarios, for example, image generation based on text description. In this work, we focus on the task of generating a series of coherent image sequence based on a given storyline, denoted as open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling, we introduce two modules into a pre-trained stable diffusion model, and construct an auto-regressive image generator, termed as StoryGen, that enables to generate the current frame by conditioning on both a text prompt and a preceding frame; (ii) to train our proposed model, we collect paired image and text samples by sourcing from various online sources, such as videos, E-books, and establish a data processing pipeline for constructing a diverse dataset, named StorySalon, with a far larger vocabulary than existing animation-specific datasets; (iii) we adopt a three-stage curriculum training strategy, that enables style transfer, visual context conditioning, and human feedback alignment, respectively. Quantitative experiments and human evaluation have validated the superiority of our proposed model, in terms of image quality, style consistency, content consistency, and visual-language alignment. We will make the code, model, and dataset publicly available to the research community.

arXiv information

Authors	Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Weidi Xie
Published	2023-06-01 17:58:50+00:00

Sources and services used

arxiv.jp, Google
