VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

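Summary

The growing availability of multimodal data spanning text, tables, and images poses new challenges for building models capable of complex cross-modal reasoning.
Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capability, reliance on modality conversion, and inadequate alignment between visual and textual representations.
To address these limitations, this paper introduces the Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model.
VLMT uses a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, which eliminates the need for intermediate projection layers.
To strengthen cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed that progressively aligns vision-language representations and improves the model's capacity for multimodal understanding.
On top of the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and applies a relative threshold with a top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers from the retrieved evidence.
Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach.
On the MultimodalQA validation set, VLMT-Large achieves 76.5% Exact Match and 80.1% F1, outperforming the previous state of the art by +9.1% in Exact Match and +8.8% in F1.
On WebQA, it attains a QA score of 47.6, surpassing prior models such as PERQA by +3.2.
These results highlight VLMT's strong multimodal reasoning capabilities and its potential to advance real-world information retrieval and question answering systems.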
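
The summary mentions that the reranker selects context using a relative threshold combined with a top-k strategy, but it does not spell out the rule. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes non-negative relevance scores, keeps documents whose score is at least a fraction of the best score, and caps the result at k items. The function name `select_context` and the parameters `ratio` and `top_k` are hypothetical.

```python
from typing import List, Tuple


def select_context(
    scored_docs: List[Tuple[str, float]],
    ratio: float = 0.5,   # assumed: fraction of the best score a document must reach
    top_k: int = 3,       # assumed: maximum number of documents to keep
) -> List[str]:
    """Keep documents scoring at least `ratio` times the best score, capped at `top_k`.

    Assumes scores are non-negative (for example, predicted relevance probabilities).
    """
    if not scored_docs:
        return []
    # Rank documents from most to least relevant.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    # The cutoff is relative to the top score rather than a fixed absolute value.
    cutoff = ranked[0][1] * ratio
    kept = [doc_id for doc_id, score in ranked if score >= cutoff]
    # The top-k cap bounds the context size even when many documents pass the cutoff.
    return kept[:top_k]
```

Under these assumptions, the relative cutoff adapts to how confident the reranker is on a given question, while the cap limits how much evidence is passed to the question answering model.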

Summary (original)

The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capabilities, reliance on modality conversion, and inadequate alignment between visual and textual representations. To address these limitations, this paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, eliminating the need for intermediate projection layers. To enhance cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed to progressively align vision-language representations and improve the model’s capacity for multimodal understanding. Based on the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and utilizes a relative threshold with top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers based on the retrieved evidence. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach. On MultimodalQA validation set, VLMT-Large achieves 76.5% Exact Match and 80.1% F1, outperforming the previous state-of-the-art by +9.1% in Exact Match and +8.8% in F1. On WebQA, it attains a QA score of 47.6, surpassing prior models such as PERQA by +3.2. These results highlight VLMT’s strong capabilities in multimodal reasoning and its potential to advance real-world information retrieval and question answering systems.
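
The abstract describes token-level injection only at a high level. As a rough illustration, and not the paper's actual code, the sketch below assumes that the input sequence contains placeholder positions for image patches and that the vision encoder already emits vectors with the same dimensionality as the text embeddings, so they can be written into those positions without a projection layer. The names `IMAGE_TOKEN_ID` and `inject_visual_tokens` are hypothetical.

```python
from typing import List, Sequence

IMAGE_TOKEN_ID = -1  # hypothetical placeholder id marking an image-patch position


def inject_visual_tokens(
    token_ids: Sequence[int],
    text_embedding: Sequence[Sequence[float]],
    visual_embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Build one embedding sequence by placing visual vectors at placeholder positions.

    Assumes visual vectors share the text embedding dimensionality.
    """
    fused: List[List[float]] = []
    visual_iter = iter(visual_embeddings)
    for token_id in token_ids:
        if token_id == IMAGE_TOKEN_ID:
            try:
                # Use the next visual vector directly, with no projection step.
                fused.append(list(next(visual_iter)))
            except StopIteration:
                raise ValueError("more image placeholders than visual embeddings")
        else:
            # Ordinary text token: look up its embedding row.
            fused.append(list(text_embedding[token_id]))
    if next(visual_iter, None) is not None:
        raise ValueError("more visual embeddings than image placeholders")
    return fused
```

In a real model the fused sequence would then be fed to the sequence-to-sequence language model; that step is omitted here.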

arXiv information

Authors	Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Kalaiarasi Sonai Muthu Anbananthen
Published	2025-04-11 05:51:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
