Jatmo: Prompt Injection Defense by Task-Specific Finetuning

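Summary

Large language models (LLMs) are attracting significant research attention because of their instruction-following abilities, which let users and developers apply them to a wide range of tasks.
However, LLMs are vulnerable to prompt-injection attacks: a class of attacks that hijack the model's instruction-following ability and change its response to a prompt into an undesired, possibly malicious one.
This work introduces Jatmo, a method for generating task-specific models that are resilient to prompt-injection attacks.
Jatmo exploits the fact that an LLM can follow instructions only after it has undergone instruction tuning.
It uses an instruction-tuned teacher model to generate a task-specific dataset, then uses that dataset to fine-tune a base model (that is, a model that has not been instruction-tuned).
Jatmo needs only a task prompt and a dataset of inputs for the task; the outputs are generated by the teacher model.
When no pre-existing dataset is available, Jatmo can produce a fully synthetic dataset from a single example, or in some cases from no examples at all.
Experiments on six tasks show that Jatmo models produce outputs of the same quality as standard LLMs on their specific task while remaining resilient to prompt injection.
The best attacks succeeded in fewer than 0.5% of cases against the Jatmo models, compared with a success rate of over 90% against GPT-3.5-Turbo.
Jatmo is released at https://github.com/wagner-group/prompt-injection-defense.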

Summary (original)

Large Language Models (LLMs) are attracting significant research attention due to their instruction-following abilities, allowing users and developers to leverage LLMs for a variety of tasks. However, LLMs are vulnerable to prompt-injection attacks: a class of attacks that hijack the model’s instruction-following abilities, changing responses to prompts to undesired, possibly malicious ones. In this work, we introduce Jatmo, a method for generating task-specific models resilient to prompt-injection attacks. Jatmo leverages the fact that LLMs can only follow instructions once they have undergone instruction tuning. It harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model (i.e., a non-instruction-tuned model). Jatmo only needs a task prompt and a dataset of inputs for the task: it uses the teacher model to generate outputs. For situations with no pre-existing datasets, Jatmo can use a single example, or in some cases none at all, to produce a fully synthetic dataset. Our experiments on six tasks show that Jatmo models provide the same quality of outputs on their specific task as standard LLMs, while being resilient to prompt injections. The best attacks succeeded in less than 0.5% of cases against our models, versus over 90% success rate against GPT-3.5-Turbo. We release Jatmo at https://github.com/wagner-group/prompt-injection-defense.
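
The dataset-generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_finetune_records`, `to_jsonl`, and `toy_teacher` are hypothetical names, the teacher is a stand-in callable rather than a real model API, and the choice to store only the raw input (without the task prompt) in each training record is an assumption that the abstract does not spell out.

```python
import json
from typing import Callable, Iterable


def build_finetune_records(
    task_prompt: str,
    inputs: Iterable[str],
    teacher: Callable[[str], str],
) -> list[dict[str, str]]:
    """Label each task input with the teacher model's output."""
    records = []
    for text in inputs:
        # The instruction-tuned teacher sees the task prompt plus the input.
        teacher_query = f"{task_prompt}\n\n{text}"
        output = teacher(teacher_query)
        # Assumption: the base model is later fine-tuned on the raw input
        # only, so the record does not carry the task prompt.
        records.append({"input": text, "output": output})
    return records


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def toy_teacher(query: str) -> str:
    # Hypothetical stand-in for a call to a real instruction-tuned model:
    # it just uppercases the last line of the query.
    return query.splitlines()[-1].upper()


if __name__ == "__main__":
    recs = build_finetune_records(
        "Uppercase the text.", ["hello", "jatmo"], toy_teacher
    )
    print(to_jsonl(recs))
```

In a real pipeline, `toy_teacher` would be replaced by a call to the teacher model, and the resulting JSON Lines file would be passed to whatever fine-tuning procedure is used for the base model.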

arXiv information

Authors	Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner
Published	2023-12-29 16:37:53+00:00

Source, services used

arxiv.jp, Google
