Universal Multimodal Representation for Language Understanding

Abstract

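Representation learning is the foundation of natural language processing (NLP). This work presents new methods that use visual information as an auxiliary signal for general NLP tasks. For each sentence, we first retrieve a flexible number of images from either a lightweight topic-image lookup table extracted from existing sentence-image pairs or a shared cross-modal embedding space pre-trained on off-the-shelf text-image pairs. The text and images are then encoded by a Transformer encoder and a convolutional neural network, respectively. The two sequences of representations are fused by an attention layer so that the two modalities can interact. The retrieval process is controllable and flexible, and the universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs. The method can therefore be applied easily to text-only tasks without manually annotated multimodal parallel corpora. We apply it to a wide range of natural language generation and understanding tasks, including neural machine translation, natural language inference, and semantic similarity. Experimental results show that the method is generally effective across tasks and languages. Analysis indicates that the visual signals enrich the textual representations of content words, provide fine-grained grounding information about the relationships between concepts and events, and may help with disambiguation.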
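The retrieve-then-fuse pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the stopword list, the vote-based ranking, the function names (`build_lookup`, `retrieve`, `fuse`), and the residual addition after attention are all hypothetical, and the toy vectors stand in for the outputs of a Transformer encoder and a convolutional network.

```python
import math
from collections import defaultdict

# Hypothetical stopword list; the abstract does not say how topic words are chosen.
STOPWORDS = {"a", "an", "the", "is", "on", "in", "of", "and", "with"}


def topic_words(sentence):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in sentence.lower().split() if w not in STOPWORDS]


def build_lookup(pairs):
    """Map each topic word to the images paired with sentences containing it."""
    table = defaultdict(list)
    for sentence, image_id in pairs:
        for word in topic_words(sentence):
            if image_id not in table[word]:
                table[word].append(image_id)
    return table


def retrieve(sentence, table, max_images):
    """Return up to max_images image ids, ranked by matching topic words."""
    votes = defaultdict(int)
    for word in topic_words(sentence):
        for image_id in table.get(word, []):
            votes[image_id] += 1
    ranked = sorted(votes, key=lambda i: (-votes[i], i))
    return ranked[:max_images]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(text_vecs, image_vecs):
    """Scaled dot-product attention with text vectors as queries and image
    vectors as keys and values. The attended image summary is added to each
    text vector (the residual addition is an assumption of this sketch)."""
    if not image_vecs:
        return [list(t) for t in text_vecs]
    scale = math.sqrt(len(image_vecs[0]))
    fused = []
    for t in text_vecs:
        scores = [sum(a * b for a, b in zip(t, v)) / scale for v in image_vecs]
        weights = softmax(scores)
        context = [
            sum(w * v[d] for w, v in zip(weights, image_vecs))
            for d in range(len(t))
        ]
        fused.append([a + c for a, c in zip(t, context)])
    return fused


# Toy sentence-image pairs; the image ids are placeholders.
pairs = [
    ("a dog runs on the beach", "img_dog_beach"),
    ("a cat sleeps on the sofa", "img_cat_sofa"),
    ("the dog plays with a ball", "img_dog_ball"),
]
table = build_lookup(pairs)
print(retrieve("the dog is on the beach", table, max_images=2))
# prints ['img_dog_beach', 'img_dog_ball']
```

The `max_images` argument mirrors the "flexible number of images" in the abstract: the lookup needs no image annotations for the input sentence itself, so the same table could in principle serve any text-only task.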

Abstract (original)

Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as assistant signals to general NLP tasks. For each sentence, we first retrieve a flexible number of images either from a light topic-image lookup table extracted over the existing sentence-image pairs or a shared cross-modal embedding space that is pre-trained on out-of-shelf text-image pairs. Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively. The two sequences of representations are further fused by an attention layer for the interaction of the two modalities. In this study, the retrieval process is controllable and flexible. The universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs. Our method can be easily applied to text-only tasks without manually annotated multimodal parallel corpora. We apply the proposed method to a wide range of natural language generation and understanding tasks, including neural machine translation, natural language inference, and semantic similarity. Experimental results show that our method is generally effective for different tasks and languages. Analysis indicates that the visual signals enrich textual representations of content words, provide fine-grained grounding information about the relationship between concepts and events, and potentially conduce to disambiguation.

arXiv information

Authors	Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao
Published	2023-01-09 13:54:11+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
