ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

Summary

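Image-grounded dialogue systems benefit greatly from integrating visual information, which lets them generate higher-quality responses.
However, current models struggle to use such information effectively in zero-resource scenarios, mainly because of the gap between the image and text modalities.
To overcome this challenge, we propose ZRIGF, a multimodal framework that incorporates image-grounded information for dialogue generation in zero-resource settings.
ZRIGF uses a two-stage learning strategy consisting of contrastive pre-training and generative pre-training.
Contrastive pre-training includes a text-image matching module, which maps images and texts into a unified encoded vector space, and a text-assisted masked image modeling module, which preserves pre-trained visual features and promotes further multimodal feature alignment.
Generative pre-training uses a multimodal fusion module and an information transfer module to produce insightful responses based on the harmonized multimodal representations.
Comprehensive experiments on both text-based and image-grounded dialogue datasets demonstrate that ZRIGF generates contextually relevant and informative responses.
In addition, we adopt a fully zero-resource scenario on the image-grounded dialogue dataset to demonstrate the framework's robust generalization to novel domains.
The code is available at https://github.com/zhangbo-nlp/ZRIGF.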
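
The summary says the text-image matching module maps images and texts into a shared vector space, but it does not give the loss function. As a rough illustration only, the sketch below uses a generic symmetric contrastive (InfoNCE-style) objective, which is a common choice for this kind of alignment and is an assumption here, not the authors' confirmed method. The function names and the `temperature` value of 0.07 are hypothetical, and the inputs are assumed to be non-zero vectors where row k of the images pairs with row k of the texts.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_z = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_z for x in row]


def contrastive_matching_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_vecs[k], text_vecs[k]).

    Hypothetical helper: a generic InfoNCE-style objective, assumed for
    illustration and not taken from the ZRIGF code base.
    """
    images = [l2_normalize(v) for v in image_vecs]
    texts = [l2_normalize(v) for v in text_vecs]
    n = len(images)

    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each image should pick out its own text.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same idea over the columns of the matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (image_to_text + text_to_image)


# Toy 2-D embeddings: correctly paired versus deliberately swapped.
aligned_loss = contrastive_matching_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_matching_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
```

With these toy vectors the correctly paired batch yields a much smaller loss than the swapped batch, which is the behavior that pulls matching image and text embeddings together in the shared space.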

Abstract (original)

Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF’s efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework’s robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.

arXiv information

Authors	Bo Zhang, Jian Wang, Hui Ma, Bo Xu, Hongfei Lin
Published	2023-08-01 09:28:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
