MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

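Summary

The development of multimodal interactive systems is hindered by the lack of rich multimodal (text and image) conversational data, which LLMs need in large quantities.
Previous approaches augment text dialogues with retrieved images, which imposes constraints on privacy, diversity, and quality.
This work introduces Multimodal Augmented Generative Images Dialogues (MAGID), a framework that augments text-only dialogues with diverse, high-quality images.
A diffusion model is then applied to create the corresponding images, ensuring alignment with the identified text.
Finally, MAGID incorporates an innovative feedback loop between an image description generation module (a textual LLM) and image quality modules (covering aesthetics, image-text matching, and safety), which work together to generate high-quality multimodal dialogues.
We compare MAGID with other SOTA baselines on three dialogue datasets, using both automated and human evaluation.
Our results show that MAGID is comparable to or better than the baselines, with significant improvements in human evaluation, especially against retrieval baselines when the image database is small.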
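
The feedback loop between the description module and the quality modules can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function names (`generate_with_feedback`, `describe`, `render`, `score`), the thresholds, the retry limit, and the toy stand-ins are hypothetical and are not taken from the paper, which does not specify these details in its abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QualityScores:
    """Hypothetical container for the three quality checks named in the abstract."""
    aesthetics: float   # higher is better
    text_match: float   # image-text matching score, higher is better
    is_safe: bool


def generate_with_feedback(
    utterance: str,
    describe: Callable[[str, Optional[str]], str],   # stands in for the textual LLM
    render: Callable[[str], str],                    # stands in for the diffusion model
    score: Callable[[str, str], QualityScores],      # stands in for the quality modules
    min_aesthetics: float = 0.5,                     # assumed threshold
    min_text_match: float = 0.5,                     # assumed threshold
    max_attempts: int = 3,                           # assumed retry limit
) -> Optional[str]:
    """Retry description and image generation until all quality checks pass."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        description = describe(utterance, feedback)
        image = render(description)
        scores = score(image, description)

        problems = []
        if not scores.is_safe:
            problems.append("unsafe content")
        if scores.aesthetics < min_aesthetics:
            problems.append("low aesthetics")
        if scores.text_match < min_text_match:
            problems.append("weak image-text match")

        if not problems:
            return image
        # Pass the failed checks back to the description module for the next try.
        feedback = "; ".join(problems)
    return None  # no acceptable image within the attempt budget


# Toy stand-ins so the sketch runs without any model.
def toy_describe(utterance: str, feedback: Optional[str]) -> str:
    suffix = " (revised)" if feedback else ""
    return f"a photo illustrating: {utterance}{suffix}"


def toy_render(description: str) -> str:
    return f"<image of {description}>"


def toy_score(image: str, description: str) -> QualityScores:
    revised = "(revised)" in description
    return QualityScores(aesthetics=0.9 if revised else 0.2, text_match=0.8, is_safe=True)


result = generate_with_feedback("I just adopted a puppy!", toy_describe, toy_render, toy_score)
print(result)
```

With the toy stand-ins, the first attempt fails the aesthetics check and the second attempt, produced from the revised description, is accepted. In a real system the three callables would wrap an LLM, a diffusion model, and learned scorers.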

Summary (original)

Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images. Subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), that work in tandem to generate high-quality and multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.

arXiv information

Authors	Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
Published	2024-03-05 18:31:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
