Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)

Summary

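Transformer-based language models, though not explicitly trained to mimic brain recordings, show surprising alignment with brain activity.
Progress in these models, through increased size, instruction-tuning, and multimodality, has led to better representational alignment with neural data.
Recently, a new class of instruction-tuned multimodal LLMs (MLLMs) has emerged, showing remarkable zero-shot capabilities on open-ended multimodal vision tasks.
However, it is unknown whether MLLMs, when prompted with natural instructions, lead to better brain alignment and effectively capture instruction-specific representations.
To address this, we first investigate brain alignment, i.e., we measure how well neural visual activity can be predicted from embeddings of the MLLMs' text output responses while participants watch natural scenes.
Experiments with 10 different instructions show that MLLMs exhibit significantly better brain alignment than vision-only models and perform comparably to non-instruction-tuned multimodal models such as CLIP.
We also find that while these MLLMs are effective at generating high-quality responses suited to task-specific instructions, not all instructions are relevant for brain alignment.
Further, by varying the instructions, we make the MLLMs encode instruction-specific visual concepts related to the input image.
This analysis shows that MLLMs effectively capture count-related and recognition-related concepts, demonstrating strong alignment with brain activity.
Notably, the majority of the explained variance of the brain encoding models is shared between MLLM embeddings for image captioning and for the other instructions.
These results suggest that enhancing MLLMs' ability to capture task-specific information could lead to better differentiation between various types of instructions, thereby improving their precision in predicting brain responses.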
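As a rough illustration of what "brain alignment" means here, the sketch below fits a linear encoding model that predicts voxel responses from stimulus embeddings and scores it by held-out Pearson correlation. It is a minimal sketch under stated assumptions, not the authors' pipeline: the data are synthetic, ridge regression with a fixed `alpha` is assumed as the regressor, and the names `fit_ridge` and `pearson_per_voxel` are hypothetical helpers introduced for this example.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def pearson_per_voxel(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)

# Synthetic stand-ins (assumption): rows are images, columns of X are
# embedding dimensions, columns of Y are voxels.
rng = np.random.default_rng(0)
n_train, n_test, n_dim, n_voxels = 200, 50, 16, 5
X = rng.standard_normal((n_train + n_test, n_dim))
true_W = rng.standard_normal((n_dim, n_voxels))
Y = X @ true_W + 0.5 * rng.standard_normal((n_train + n_test, n_voxels))

X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]

# Fit on training images, evaluate predictivity on held-out images.
W = fit_ridge(X_train, Y_train, alpha=1.0)
scores = pearson_per_voxel(Y_test, X_test @ W)
print("mean held-out correlation:", scores.mean())
```

In practice `X` would hold embeddings of a model's response to one instruction and `Y` the recorded responses to the same images; comparing `scores` across models or instructions gives a relative measure of alignment.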

Abstract (original)

Transformer-based language models, though not explicitly trained to mimic brain recordings, have demonstrated surprising alignment with brain activity. Progress in these models, through increased size, instruction-tuning, and multimodality, has led to better representational alignment with neural data. Recently, a new class of instruction-tuned multimodal LLMs (MLLMs) has emerged, showing remarkable zero-shot capabilities in open-ended multimodal vision tasks. However, it is unknown whether MLLMs, when prompted with natural instructions, lead to better brain alignment and effectively capture instruction-specific representations. To address this, we first investigate brain alignment, i.e., measuring the degree of predictivity of neural visual activity using text output response embeddings from MLLMs as participants engage in watching natural scenes. Experiments with 10 different instructions show that MLLMs exhibit significantly better brain alignment than vision-only models and perform comparably to non-instruction-tuned multimodal models like CLIP. We also find that while these MLLMs are effective at generating high-quality responses suitable to the task-specific instructions, not all instructions are relevant for brain alignment. Further, by varying instructions, we make the MLLMs encode instruction-specific visual concepts related to the input image. This analysis shows that MLLMs effectively capture count-related and recognition-related concepts, demonstrating strong alignment with brain activity. Notably, the majority of the explained variance of the brain encoding models is shared between MLLM embeddings of image captioning and other instructions. These results suggest that enhancing MLLMs’ ability to capture task-specific information could lead to better differentiation between various types of instructions, thereby improving their precision in predicting brain responses.
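The statement that most explained variance is "shared" between captioning embeddings and other instructions can be made concrete with variance partitioning. The sketch below uses one common formulation, assumed here rather than taken from the paper: fit encoding models on feature set A, on feature set B, and on their concatenation, then compute shared = R²(A) + R²(B) − R²(A∪B) and unique(A) = R²(A∪B) − R²(B). All data are synthetic, and `r2_heldout` is a hypothetical helper.

```python
import numpy as np

def r2_heldout(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit ridge regression on the training split; return held-out R^2 per voxel."""
    gram = X_train.T @ X_train + alpha * np.eye(X_train.shape[1])
    W = np.linalg.solve(gram, X_train.T @ Y_train)
    ss_res = ((Y_test - X_test @ W) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic feature spaces (assumption): A and B share four latent
# dimensions and each has four dimensions of its own.
rng = np.random.default_rng(1)
n, n_train, n_voxels = 400, 300, 3
z_shared = rng.standard_normal((n, 4))
z_a = rng.standard_normal((n, 4))
z_b = rng.standard_normal((n, 4))
A = np.hstack([z_shared, z_a])  # stands in for captioning embeddings
B = np.hstack([z_shared, z_b])  # stands in for another instruction

# Responses are driven mostly by the shared latents, plus noise.
signal = z_shared.sum(axis=1) + 0.5 * z_a.sum(axis=1) + 0.5 * z_b.sum(axis=1)
Y = signal[:, None] + rng.standard_normal((n, n_voxels))

def split(M):
    return M[:n_train], M[n_train:]

(A_tr, A_te), (B_tr, B_te), (Y_tr, Y_te) = split(A), split(B), split(Y)
AB_tr, AB_te = np.hstack([A_tr, B_tr]), np.hstack([A_te, B_te])

r2_a = r2_heldout(A_tr, Y_tr, A_te, Y_te).mean()
r2_b = r2_heldout(B_tr, Y_tr, B_te, Y_te).mean()
r2_ab = r2_heldout(AB_tr, Y_tr, AB_te, Y_te).mean()

shared_var = r2_a + r2_b - r2_ab  # variance explained by both feature sets
unique_a = r2_ab - r2_b           # variance explained only by A
unique_b = r2_ab - r2_a           # variance explained only by B
print("shared:", shared_var, "unique A:", unique_a, "unique B:", unique_b)
```

When two instructions yield embeddings that explain largely the same variance, `shared_var` dominates and the unique terms are small, which is the pattern the abstract reports for captioning versus the other instructions.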

arXiv information

Authors	Subba Reddy Oota, Akshett Jindal, Ishani Mondal, Khushbu Pahwa, Satya Sai Srinath Namburi, Manish Shrivastava, Maneesh Singh, Bapi S. Raju, Manish Gupta
Published	2025-05-26 14:18:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
