Quantifying Cross-Modality Memorization in Vision-Language Models

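Summary

Understanding what neural networks memorize during training, and how, matters both for the unintended memorization of potentially sensitive information and for effective knowledge acquisition in real-world, knowledge-intensive tasks.
Previous studies have mainly examined memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, even though unified multimodal models are becoming increasingly common in practical applications.
In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models.
To enable controlled experiments, we first introduce a synthetic persona dataset containing diverse synthetic person images and textual descriptions.
We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other.
Our results show that facts learned in one modality do transfer to the other, but a significant gap remains between recalling information in the source modality and in the target modality.
We also find that this gap persists across a range of scenarios, including more capable models, machine unlearning, and the multi-hop case.
Finally, we propose a baseline method to mitigate this challenge.
We hope this study inspires future research on more robust multimodal learning techniques that improve cross-modal transferability.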

Abstract (original)

Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. At the end, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.
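
The abstract describes training on one modality, evaluating recall in the other, and reporting the gap between the two. As a rough illustration only, the sketch below computes such a gap from per-question outcomes. The `RecallResult` record, the function names, the choice of accuracy as the metric, and the toy outcomes are all assumptions made here for illustration; they are not taken from the paper, which may define its measurements differently.

```python
from dataclasses import dataclass


@dataclass
class RecallResult:
    """Outcome of one factual question (hypothetical record, not from the paper)."""
    modality: str   # modality used to pose the question: "image" or "text"
    correct: bool   # whether the model recalled the fact


def accuracy(results, modality):
    """Fraction of correct answers among questions posed in `modality`."""
    subset = [r for r in results if r.modality == modality]
    if not subset:
        raise ValueError(f"no results for modality {modality!r}")
    return sum(r.correct for r in subset) / len(subset)


def cross_modal_gap(results, source, target):
    """Source-modality accuracy minus target-modality accuracy (assumed metric)."""
    return accuracy(results, source) - accuracy(results, target)


# Toy, made-up outcomes for a model imagined to be trained on images only.
toy_results = [
    RecallResult("image", True),
    RecallResult("image", True),
    RecallResult("image", True),
    RecallResult("image", False),
    RecallResult("text", True),
    RecallResult("text", False),
    RecallResult("text", False),
    RecallResult("text", False),
]

gap = cross_modal_gap(toy_results, source="image", target="text")
print(f"source accuracy: {accuracy(toy_results, 'image'):.2f}")
print(f"target accuracy: {accuracy(toy_results, 'text'):.2f}")
print(f"cross-modal gap: {gap:.2f}")
```

A positive gap in this sketch means the facts are recalled more reliably in the training modality than in the held-out one, which is the pattern the abstract reports.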

arXiv information

Authors	Yuxin Wen, Yangsibo Huang, Tom Goldstein, Ravi Kumar, Badih Ghazi, Chiyuan Zhang
Published	2025-06-05 16:10:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
