Teaching Text-to-Image Models to Communicate




Text-to-image generation has been studied extensively. Although existing models perform well on that task, significant challenges arise when they are applied directly to generating images in dialogs. In this paper, we first highlight a new problem, dialog-to-image generation: given a dialog context, the model should respond with a realistic image that is consistent with the conversation. To tackle this problem, we propose an efficient approach to dialog-to-image generation that uses no intermediate translation and thereby maximizes the semantic information extracted from the dialog. To reflect the structure of a dialog, we place a segment token before each sentence in a dialog turn to distinguish the speakers. We then fine-tune pre-trained text-to-image models so that they can generate images conditioned on the processed dialog context. After fine-tuning, our approach consistently improves the performance of various models across multiple metrics. Experimental results on a public benchmark demonstrate the effectiveness and practicality of our method.
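
The segment-token step described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the token strings `[A]` and `[B]`, the helper names `split_sentences` and `flatten_dialog`, and the naive punctuation-based sentence splitter are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how a multi-turn dialog might be flattened into a single conditioning string in which each sentence is prefixed by its speaker's token.

```python
import re

# Hypothetical segment tokens; the abstract does not name the actual tokens.
SEGMENT_TOKENS = ("[A]", "[B]")


def split_sentences(utterance):
    """Naively split an utterance on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", utterance.strip())
    return [p for p in parts if p]


def flatten_dialog(turns, tokens=SEGMENT_TOKENS):
    """Prefix every sentence of every turn with its speaker's segment token.

    `turns` is a list of (speaker_index, utterance) pairs (an assumed format).
    """
    pieces = []
    for speaker, utterance in turns:
        token = tokens[speaker % len(tokens)]
        for sentence in split_sentences(utterance):
            pieces.append(f"{token} {sentence}")
    return " ".join(pieces)


dialog = [
    (0, "I went hiking yesterday. The view was amazing!"),
    (1, "Can you show me a photo?"),
]
prompt = flatten_dialog(dialog)
print(prompt)
```

In a fine-tuning setup, a string like `prompt` would stand in for the usual caption as the text condition. If the chosen tokens are not already in the text encoder's vocabulary, they would presumably need to be registered with its tokenizer first; how the paper handles this is not stated in the abstract.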


Authors: Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen, Dongyan Zhao
Published: 2023-09-27 09:33:16+00:00


Categories: cs.AI, cs.CL, cs.CV