Distilled Dual-Encoder Model for Vision-Language Understanding

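Summary

We propose a cross-modal attention distillation framework for training dual-encoder models on vision-language understanding tasks such as visual reasoning and visual question answering.
Dual-encoder models run inference faster than fusion-encoder models, and their image and text representations can be pre-computed.
However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks.
To learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of the dual-encoder model.
We further show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages yields additional improvements.
Experimental results show that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment, and visual question answering while running inference much faster than fusion-encoder models.
Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.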

Abstract (original)

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance for visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
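The distillation idea in the abstract can be illustrated with a small sketch. It assumes the training signal is a KL divergence between the teacher's and the student's cross-modal attention distributions; the abstract does not state the loss, so treat that choice as an assumption. The function names, the dictionary layout, and the toy vectors are hypothetical and are not taken from the authors' repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """Attention distribution of each query over the keys of the OTHER modality."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_distillation_loss(teacher, student):
    """Mean KL between teacher and student cross-modal attention rows.

    Each argument is a dict of query/key vectors (hypothetical layout):
    'text_q', 'text_k', 'image_q', 'image_k'.
    """
    pairs = []
    for q_name, k_name in (("text_q", "image_k"), ("image_q", "text_k")):
        # text-to-image on the first pass, image-to-text on the second
        t_rows = cross_attention(teacher[q_name], teacher[k_name])
        s_rows = cross_attention(student[q_name], student[k_name])
        pairs.extend(zip(t_rows, s_rows))
    return sum(kl_divergence(t, s) for t, s in pairs) / len(pairs)


# Toy vectors: two text tokens and three image patches, dimension 2.
teacher = {
    "text_q": [[1.0, 0.0], [0.0, 1.0]],
    "text_k": [[1.0, 0.0], [0.0, 1.0]],
    "image_q": [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]],
    "image_k": [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
}
# The student differs only in its image keys, so only text-to-image attention changes.
student = dict(teacher, image_k=[[0.0, 2.0], [2.0, 0.0], [1.0, 1.0]])

loss = cross_modal_distillation_loss(teacher, student)
```

In this sketch the loss is zero when the student's cross-modal attention matches the teacher's and positive otherwise. Only the text-to-image and image-to-text blocks enter the loss, since those are the distributions the abstract names as the distillation signal.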

arXiv information

Authors	Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei
Published	2022-10-17 16:27:09+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
