MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment

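Summary

The Model Context Protocol (MCP) has been widely adopted as an open standard that enables the seamless integration of generative AI agents.
However, recent work has shown that MCP is susceptible to retrieval-based "falsely benign" attacks (FBAs), which allow malicious system access and credential theft but require users to download compromised files directly to their systems.
Here, we show that the threat model for MCP-based attacks is significantly broader than previously thought: attackers need only post malicious content online to deceive MCP agents into carrying out attacks on unsuspecting victims' systems.
To improve alignment guardrails against such attacks, we introduce a new MCP dataset of FBAs and (truly) benign samples, and use it to explore how effective direct preference optimization (DPO) is for the refusal training of large language models (LLMs).
DPO improves model guardrails against such attacks, but we show that the efficacy of refusal learning varies drastically with the model's original post-training alignment scheme (e.g., GRPO-based LLMs learn to refuse extremely poorly).
Therefore, to further improve FBA refusals, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a novel preference alignment strategy based on RAG.
We show that RAG-Pref significantly improves the ability of LLMs to refuse FBAs, particularly when combined with DPO alignment, and thus drastically improves guardrails against MCP-based attacks.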

Abstract (original)

The model context protocol (MCP) has been widely adapted as an open standard enabling the seamless integration of generative AI agents. However, recent work has shown the MCP is susceptible to retrieval-based ‘falsely benign’ attacks (FBAs), allowing malicious system access and credential theft, but requiring that users download compromised files directly to their systems. Herein, we show that the threat model of MCP-based attacks is significantly broader than previously thought, i.e., attackers need only post malicious content online to deceive MCP agents into carrying out their attacks on unsuspecting victims’ systems. To improve alignment guardrails against such attacks, we introduce a new MCP dataset of FBAs and (truly) benign samples to explore the effectiveness of direct preference optimization (DPO) for the refusal training of large language models (LLMs). While DPO improves model guardrails against such attacks, we show that the efficacy of refusal learning varies drastically depending on the model’s original post-training alignment scheme–e.g., GRPO-based LLMs learn to refuse extremely poorly. Thus, to further improve FBA refusals, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a novel preference alignment strategy based on RAG. We show that RAG-Pref significantly improves the ability of LLMs to refuse FBAs, particularly when combined with DPO alignment, thus drastically improving guardrails against MCP-based attacks.
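As a rough illustration of the preference objective the abstract refers to, the sketch below computes the standard DPO loss for a single preference pair, where the "chosen" response would be a refusal of an FBA and the "rejected" response would be compliance with it. This is the generic DPO formulation, not code from the paper; the function name, the argument names, the example log-probabilities, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model (hypothetical inputs here).
    """
    # How much more the policy favours the chosen response than the reference does,
    # minus the same quantity for the rejected response.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical values: the policy has shifted probability toward the refusal.
loss = dpo_loss(policy_chosen_logp=-5.0, policy_rejected_logp=-20.0,
                ref_chosen_logp=-10.0, ref_rejected_logp=-10.0)
print(loss)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the refusal than the reference does.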

arXiv information

Author	John Halloran
Published	2025-05-29 16:44:29+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
