Revisiting Data Auditing in Large Vision-Language Models

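Summary

With the surge of large language models (LLMs), large vision-language models (VLMs), which integrate vision encoders with LLMs for accurate visual grounding, have shown great potential in tasks like generalist agents and robotic control.
However, VLMs are typically trained on massive web-scraped image collections, which raises concerns over copyright infringement and privacy violations and makes data auditing increasingly urgent.
Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs such as LLaVA (AUC > 80%).
This work revisits those advances and uncovers a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, which introduce shortcut cues that inflate MI performance.
The authors further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy.
To evaluate MI in realistic settings, they construct new benchmarks with i.i.d. member and non-member images.
Existing MI methods fail under these unbiased conditions, performing only marginally better than chance.
They also explore the theoretical upper bound of MI by probing Bayes optimality within the VLM's embedding space, and find that the irreducible error rate remains high.
Despite this pessimistic outlook, they analyze why MI for VLMs is particularly challenging and identify three practical scenarios in which auditing becomes feasible: fine-tuning, access to ground-truth texts, and set-based inference.
The study presents a systematic view of the limits and opportunities of MI for VLMs and offers guidance for future work on trustworthy data auditing.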

Summary (original)

With the surge of large language models (LLMs), Large Vision-Language Models (VLMs)–which integrate vision encoders with LLMs for accurate visual grounding–have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC > 80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM’s embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios–fine-tuning, access to ground-truth texts, and set-based inference–where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing.
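The shortcut problem described in the abstract can be illustrated with a toy sketch. It is not the paper's benchmark or its optimal-transport metric: it assumes a single hypothetical one-dimensional image feature that never touches the model, computes AUC as the probability that a member outscores a non-member, and uses the sorted-matching form of the 1-D Wasserstein-1 distance for equal-sized samples. The names `auc`, `wasserstein_1d`, and the Gaussian parameters are illustrative assumptions.

```python
import random
from statistics import mean


def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as 0.5, which matches the usual ROC AUC definition.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1-D samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute gap between order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return mean(abs(a - b) for a, b in zip(sorted(xs), sorted(ys)))


rng = random.Random(0)  # fixed seed so the toy data is reproducible

# Hypothetical model-independent feature (e.g. some image statistic).
nonmembers = [rng.gauss(0.0, 1.0) for _ in range(500)]
iid_members = [rng.gauss(0.0, 1.0) for _ in range(500)]      # same distribution
shifted_members = [rng.gauss(1.5, 1.0) for _ in range(500)]  # shifted distribution

auc_iid = auc(iid_members, nonmembers)
auc_shifted = auc(shifted_members, nonmembers)
w_iid = wasserstein_1d(iid_members, nonmembers)
w_shifted = wasserstein_1d(shifted_members, nonmembers)

print(f"i.i.d. split:  AUC={auc_iid:.3f}  W1={w_iid:.3f}")
print(f"shifted split: AUC={auc_shifted:.3f}  W1={w_shifted:.3f}")
```

In this sketch the "attack" never queries a model, yet it separates the shifted split well, while on the i.i.d. split it stays near chance. The transport distance grows with the shift, which is the kind of discrepancy a benchmark check would want to flag.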

arXiv information

Authors	Hongyu Zhu, Sichu Liang, Wenwen Wang, Boheng Li, Tongxin Yuan, Fangqi Li, ShiLin Wang, Zhuosheng Zhang
Published	2025-04-25 13:38:23+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
