PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

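Summary

This paper presents a novel scheme for multimodal learning, named the Parallel Attention mechanism.
In addition, to take advantage of grammar and context in Vietnamese, the authors propose the Hierarchical Linguistic Features Extractor in place of an LSTM network for extracting linguistic features.
Building on these two modules, they introduce the Parallel Attention Transformer (PAT), which achieves the best accuracy on the ViVQA benchmark dataset, outperforming all baselines as well as other SOTA methods including SAAA and MCAN.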

Abstract (original)

We present in this paper a novel scheme for multimodal learning named the Parallel Attention mechanism. In addition, to take into account the advantages of grammar and context in Vietnamese, we propose the Hierarchical Linguistic Features Extractor instead of using an LSTM network to extract linguistic features. Based on these two novel modules, we introduce the Parallel Attention Transformer (PAT), achieving the best accuracy compared to all baselines on the benchmark ViVQA dataset and other SOTA methods including SAAA and MCAN.

arXiv information

Authors	Nghia Hieu Nguyen, Kiet Van Nguyen
Published	2023-07-17 05:05:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
