EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Summary

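We present EasyGen, an efficient model designed to strengthen multimodal understanding and generation by combining the capabilities of diffusion models and large language models (LLMs). Existing multimodal models rely mainly on encoders such as CLIP or ImageBind and need large amounts of training data to bridge modalities.
EasyGen instead uses BiDiffuser, a bidirectional conditional diffusion model, to enable more efficient interaction between modalities.
EasyGen generates text by training a projection layer that links BiDiffuser to an LLM, and supports image generation by training an adapter that aligns the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges of multimodal generation.
The source code is available at https://github.com/zxy556677/EasyGen.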

Abstract (original)

We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominately depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM’s text space with the BiDiffuser’s image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.

arXiv information

Authors	Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu
Published	2024-05-17 08:30:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
