Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering

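Summary

Medical Visual Question Answering (MedVQA) is attracting growing attention at the intersection of computer vision and natural language processing.
Because it can interpret radiological images and give accurate answers to clinical queries, MedVQA is positioned as a valuable tool for supporting physicians' diagnostic decisions and reducing radiologists' workload.
Recent approaches focus on large pre-trained models unified for multimodal fusion, such as cross-modal Transformers, while research on more efficient fusion methods remains relatively scarce in this field.
This paper introduces a new fusion model that integrates orthogonality loss, multi-head attention, and a bilinear attention network (OMniBAN) to achieve high computational efficiency and strong performance without pre-training.
The authors run comprehensive experiments and clarify how bilinear attention fusion can be enhanced to reach performance comparable to that of large models.
The results show that OMniBAN outperforms traditional models on key MedVQA benchmarks at a lower computational cost, indicating its potential for efficient clinical use in radiology and pathology image question answering.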

Summary (original)

Medical Visual Question Answering (MedVQA) has gained increasing attention at the intersection of computer vision and natural language processing. Its capability to interpret radiological images and deliver precise answers to clinical inquiries positions MedVQA as a valuable tool for supporting diagnostic decision-making for physicians and alleviating the workload on radiologists. While recent approaches focus on using unified pre-trained large models for multi-modal fusion like cross-modal Transformers, research on more efficient fusion methods remains relatively scarce within this discipline. In this paper, we introduce a novel fusion model that integrates Orthogonality loss, Multi-head attention and Bilinear Attention Network (OMniBAN) to achieve high computational efficiency and strong performance without the need for pre-training. We conduct comprehensive experiments and clarify aspects of how to enhance bilinear attention fusion to achieve performance comparable to that of large models. Experimental results show that OMniBAN outperforms traditional models on key MedVQA benchmarks while maintaining a lower computational cost, which indicates its potential for efficient clinical application in radiology and pathology image question answering.
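The abstract names bilinear attention and an orthogonality loss but gives no formulas, so the following is only a rough illustration of those two ideas, not the authors' OMniBAN implementation. It assumes a single attention map, a hypothetical per-dimension weight vector `weights` in place of learned projections, and squared cosine similarity as the orthogonality penalty; multi-head attention is left out entirely.

```python
import math


def bilinear_attention_fuse(image_feats, question_feats, weights):
    """Fuse two feature sets with one bilinear attention map.

    image_feats:    list of n_v vectors, each of length d (e.g. image regions)
    question_feats: list of n_q vectors, each of length d (e.g. question tokens)
    weights:        length-d vector (hypothetical stand-in for learned parameters)
    """
    # Bilinear logit for every (region, token) pair: sum_k w_k * v_k * q_k
    logits = [
        [sum(w * v * q for w, v, q in zip(weights, vec_v, vec_q))
         for vec_q in question_feats]
        for vec_v in image_feats
    ]
    # Softmax taken jointly over all pairs (shifted by the max for stability)
    peak = max(max(row) for row in logits)
    exps = [[math.exp(x - peak) for x in row] for row in logits]
    total = sum(sum(row) for row in exps)
    attention = [[e / total for e in row] for row in exps]
    # Fused vector: attention-weighted sum of elementwise products
    dim = len(weights)
    fused = [0.0] * dim
    for i, vec_v in enumerate(image_feats):
        for j, vec_q in enumerate(question_feats):
            for k in range(dim):
                fused[k] += attention[i][j] * vec_v[k] * vec_q[k]
    return fused, attention


def orthogonality_penalty(a, b):
    """Squared cosine similarity of a and b; zero when they are orthogonal.

    This form is an assumption; the abstract does not define the loss.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return (dot / (norm_a * norm_b)) ** 2
```

In a trained model the logits and the fused vector would normally pass through learned low-rank projections, and a penalty like this would be added to the task loss with a tunable weight; both choices are outside what the abstract specifies.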

arXiv information

Authors	Zhilin Zhang, Jie Wang, Ruiqi Zhu, Xiaoliang Gong
Published	2024-10-28 13:24:12+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
