Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models

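Summary

Representations from deep neural networks (DNNs) have proven remarkably predictive of the neural activity involved in both visual and linguistic processing.
Despite these successes, most studies to date have concerned unimodal DNNs, which encode either visual or textual input but not both.
Yet there is growing evidence that human meaning representations integrate linguistic and sensory-motor information.
Here we investigate whether the integration of multimodal information performed by current vision-and-language DNN models (VLMs) leads to representations that are more aligned with human brain activity than those obtained from language-only and vision-only DNNs.
We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or an accompanying picture.
Our results reveal that VLM representations correlate more strongly than those of language-only and vision-only DNNs with activations in brain areas functionally related to language processing.
A comparison of different types of visuo-linguistic architectures shows that recent generative VLMs tend to be less brain-aligned than earlier architectures that perform worse on downstream applications.
Moreover, an additional analysis comparing brain and behavioural alignment across multiple VLMs shows that, with one notable exception, representations that align strongly with behavioural judgments do not correlate highly with brain responses.
This indicates that brain similarity and behavioural similarity do not go hand in hand.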

Abstract (original)

Representations from deep neural networks (DNNs) have proven remarkably predictive of neural activity involved in both visual and linguistic processing. Despite these successes, most studies to date concern unimodal DNNs, encoding either visual or textual input but not both. Yet, there is growing evidence that human meaning representations integrate linguistic and sensory-motor information. Here we investigate whether the integration of multimodal information operated by current vision-and-language DNN models (VLMs) leads to representations that are more aligned with human brain activity than those obtained by language-only and vision-only DNNs. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or an accompanying picture. Our results reveal that VLM representations correlate more strongly than language- and vision-only DNNs with activations in brain areas functionally related to language processing. A comparison between different types of visuo-linguistic architectures shows that recent generative VLMs tend to be less brain-aligned than previous architectures with lower performance on downstream applications. Moreover, through an additional analysis comparing brain vs. behavioural alignment across multiple VLMs, we show that — with one remarkable exception — representations that strongly align with behavioural judgments do not correlate highly with brain responses. This indicates that brain similarity does not go hand in hand with behavioural similarity, and vice versa.
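The abstract does not state which measure is used to quantify how well model representations "correlate" with brain activations. One common choice for this kind of comparison is representational similarity analysis (RSA), and the following is only a minimal sketch under that assumption, not the authors' pipeline. All function names and the toy numbers are hypothetical; real inputs would be model embeddings and fMRI voxel patterns for the same set of concept words.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rdm_upper(vectors):
    """Upper triangle of a representational dissimilarity matrix.

    Dissimilarity between two items is taken to be 1 - Pearson r
    (an assumed, commonly used choice).
    """
    pairs = combinations(range(len(vectors)), 2)
    return [1.0 - pearson(vectors[i], vectors[j]) for i, j in pairs]


def ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def alignment(model_vectors, brain_vectors):
    """RSA score: rank correlation between the two dissimilarity structures.

    Both inputs hold one vector per concept, in the same concept order.
    The vector lengths may differ between model and brain.
    """
    return spearman(rdm_upper(model_vectors), rdm_upper(brain_vectors))


# Toy data (made up for illustration): 4 concepts.
toy_model = [[1.0, 0.0, 2.0], [0.9, 0.1, 2.1], [0.0, 3.0, 1.0], [0.2, 2.8, 0.5]]
toy_brain = [
    [0.5, 1.2, 0.1, 0.9, 0.3],
    [0.6, 1.1, 0.2, 1.0, 0.2],
    [1.4, 0.1, 0.8, 0.2, 1.1],
    [1.3, 0.3, 0.9, 0.1, 1.0],
]

if __name__ == "__main__":
    print(alignment(toy_model, toy_brain))
```

Because the comparison is made on pairwise dissimilarities rather than on the raw vectors, model embeddings and voxel patterns do not need to share a dimensionality. Under this sketch, comparing a VLM against a language-only or vision-only model would amount to computing `alignment` once per model against the same brain data.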

arXiv information

Authors	Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández
Published	2024-07-25 10:08:37+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
