Get my drift? Catching LLM Task Drift with Activation Deltas

Summary

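LLMs are commonly used in retrieval-augmented applications to execute user instructions based on data from external sources.
For example, modern search engines use LLMs to answer queries based on relevant search results.
Email plugins summarize emails by processing their content through an LLM.
However, these data sources may be untrusted, which opens the door to prompt injection attacks. In such an attack, the LLM is manipulated by natural language instructions embedded in the external data and deviates from the user's original instruction(s).
The authors define this deviation as task drift.
Task drift is a significant concern because it allows attackers to exfiltrate data or influence the LLM's output for other users.
The authors study LLM activations as a way to detect task drift, and show that activation deltas (the difference in activations before and after processing external data) are strongly correlated with this phenomenon.
Through two probing methods, they demonstrate that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
They evaluate these methods while making minimal assumptions about how users' tasks, system prompts, and attacks can be phrased.
They observe that the approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks.
Because the solution requires no modifications to the LLM (e.g., fine-tuning) and is compatible with existing meta-prompting solutions, it is cost-efficient and easy to deploy.
To encourage further research on activation-based task inspection, decoding, and interpretability, the authors release the large-scale TaskTracker toolkit, featuring a dataset of over 500K instances, representations from six SoTA language models, and a suite of inspection tools.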
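
As an illustration of the idea summarized above, here is a minimal Python sketch of an activation delta fed to a linear probe and scored with ROC AUC. It is not the authors' implementation and does not use the TaskTracker toolkit. The activations are synthetic random vectors rather than hidden states from a real LLM, the test set is drawn from the same toy distribution (so it is not out-of-distribution), and every name here (activation_delta, synthetic_example, train_linear_probe, roc_auc, run_demo) is hypothetical. The assumption that drift shifts activations along one fixed direction is made only so that the toy data is linearly separable.

```python
import math
import random


def activation_delta(before, after):
    """Element-wise difference: activations after reading external data minus before."""
    return [a - b for a, b in zip(after, before)]


def synthetic_example(rng, drifted, drift_direction):
    """Toy stand-in for real hidden states (assumption: drift adds a fixed shift)."""
    before = [rng.gauss(0.0, 1.0) for _ in drift_direction]
    after = [
        b + rng.gauss(0.0, 0.3) + (d if drifted else 0.0)
        for b, d in zip(before, drift_direction)
    ]
    return activation_delta(before, after), int(drifted)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear probe output: probability that the delta indicates drift."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(xs, ys, epochs=200, lr=0.1):
    """Logistic regression fitted with plain batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_dataset(rng, size, drift_direction):
    """Alternate clean and drifted examples so both classes are present."""
    pairs = [synthetic_example(rng, i % 2 == 1, drift_direction) for i in range(size)]
    return [x for x, _ in pairs], [y for _, y in pairs]


def run_demo(seed=0, dim=8):
    rng = random.Random(seed)
    drift_direction = [0.5] * dim  # hypothetical "drift" direction
    train_x, train_y = make_dataset(rng, 200, drift_direction)
    test_x, test_y = make_dataset(rng, 100, drift_direction)
    w, b = train_linear_probe(train_x, train_y)
    return roc_auc([score(w, b, x) for x in test_x], test_y)


if __name__ == "__main__":
    print(f"ROC AUC on held-out synthetic deltas: {run_demo():.3f}")
```

With a real model, the before and after vectors would instead come from the model's hidden states at a chosen layer, captured once after the user instruction and once after the external data has been appended. How to extract them depends on the inference framework and is not shown here.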

Summary (original)

LLMs are commonly used in retrieval-augmented applications to execute user instructions based on data from external sources. For example, modern search engines use LLMs to answer queries based on relevant search results; email plugins summarize emails by processing their content through an LLM. However, the potentially untrusted provenance of these data sources can lead to prompt injection attacks, where the LLM is manipulated by natural language instructions embedded in the external data, causing it to deviate from the user’s original instruction(s). We define this deviation as task drift. Task drift is a significant concern as it allows attackers to exfiltrate data or influence the LLM’s output for other users. We study LLM activations as a solution to detect task drift, showing that activation deltas – the difference in activations before and after processing external data – are strongly correlated with this phenomenon. Through two probing methods, we demonstrate that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We evaluate these methods by making minimal assumptions about how users’ tasks, system prompts, and attacks can be phrased. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks. Interestingly, the fact that this solution does not require any modifications to the LLM (e.g., fine-tuning), as well as its compatibility with existing meta-prompting solutions, makes it cost-efficient and easy to deploy. To encourage further research on activation-based task inspection, decoding, and interpretability, we release our large-scale TaskTracker toolkit, featuring a dataset of over 500K instances, representations from six SoTA language models, and a suite of inspection tools.

arXiv information

Authors	Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, Andrew Paverd
Published	2025-03-06 17:43:10+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
