Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

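Abstract

Multimodal reasoning is a key component in building artificial intelligence systems that exhibit human-like intelligence, especially on complex tasks.
Although the chain-of-thought (CoT) technique has attracted considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches.
To address this gap, we present the COCO Multi-Modal Reasoning Dataset (COCO-MMRD), a new dataset comprising an extensive collection of open-ended questions, rationales, and answers derived from the large-scale object dataset COCO.
Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the multimodal CoT setting, posing a harder problem that more effectively tests the reasoning ability of CoT models.
Through comprehensive evaluation and detailed analysis, we provide insights and propose new techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to strengthen the image and text encoders.
Extensive experiments demonstrate the effectiveness of the proposed dataset and techniques and offer new perspectives for advancing multimodal reasoning.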

Abstract (original)

Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches. To address this gap, we present COCO Multi-Modal Reasoning Dataset (COCO-MMRD), a novel dataset that encompasses an extensive collection of open-ended questions, rationales, and answers derived from the large object dataset COCO. Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the context of multimodal CoT, introducing a more challenging problem that effectively assesses the reasoning capability of CoT models. Through comprehensive evaluations and detailed analyses, we provide valuable insights and propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders. Extensive experiments demonstrate the efficacy of the proposed dataset and techniques, offering novel perspectives for advancing multimodal reasoning.
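
The abstract names sentence-level contrastive learning as one way to strengthen the text and image encoders but does not specify the loss. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE objective over paired sentence and image embeddings. This is a swapped-in, commonly used formulation, not the authors' method; the function name `info_nce_loss`, the `temperature` default, and the embedding shapes are all assumptions made for the example.

```python
import numpy as np


def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Hypothetical helper: row i of text_emb is assumed to be the positive
    match for row i of image_emb; every other row in the batch is a negative.
    """
    # L2-normalise so that dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # shape: (batch, batch)

    def cross_entropy_on_diagonal(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))
    image = text + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
    print("aligned pairs:   ", info_nce_loss(text, image))
    print("mismatched pairs:", info_nce_loss(text, image[::-1]))
```

Under these assumptions, correctly paired embeddings should yield a lower loss than mismatched ones, which is the signal a contrastive objective of this kind uses to pull matching sentence and image representations together.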

arXiv information

Authors	Jingxuan Wei, Cheng Tan, Zhangyang Gao, Linzhuang Sun, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li
Published	2023-07-24 08:58:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
