Can LLMs Follow Simple Rules?

要約

大規模言語モデル (LLM) は現実世界での責任が増大するにつれて展開されるため、これらのシステムの動作を信頼性の高い方法で指定および制約できることが重要です。
モデル開発者は、「不正なコンテンツを生成しない」など、モデルに明示的なルールを設定したい場合がありますが、これらはジェイルブレイク技術によって回避される可能性があります。
LLM に対する敵対的な攻撃と防御の既存の評価には、通常、高価な手動レビューか、信頼性の低いヒューリスティックチェックが必要です。
この問題に対処するために、LLM のルール従う能力を測定するためのプログラムフレームワークであるルール従う言語評価シナリオ (RuLES) を提案します。
RuLES は 14 の単純なテキストシナリオで構成されており、モデルはユーザーと対話しながらさまざまなルールに従うように指示されます。
各シナリオには、モデルが会話内のルールに違反していないかどうかを判断するためのプログラムによる評価関数があります。
独自のモデルとオープンモデルを評価したところ、現在のほとんどすべてのモデルは、たとえ単純なテストケースであっても、シナリオルールに従うのに苦労していることがわかりました。
また、単純な最適化攻撃だけでテストケースの失敗率を大幅に高めることができることも示します。
最後に、テスト時のステアリングと監視付き微調整という 2 つの潜在的な改善手段を検討します。

要約(オリジナル)

As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as ‘do not generate abusive content’, but these may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attacks and defenses on LLMs generally require either expensive manual review or unreliable heuristic checks. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user. Each scenario has a programmatic evaluation function to determine whether the model has broken any rules in a conversation. Our evaluations of proprietary and open models show that almost all current models struggle to follow scenario rules, even on straightforward test cases. We also demonstrate that simple optimization attacks suffice to significantly increase failure rates on test cases. We conclude by exploring two potential avenues for improvement: test-time steering and supervised fine-tuning.

arxiv情報

著者	Norman Mu,Sarah Chen,Zifan Wang,Sizhe Chen,David Karamardian,Lulwa Aljeraisy,Basel Alomair,Dan Hendrycks,David Wagner
発行日	2024-03-07 10:18:42+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Can LLMs Follow Simple Rules?

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー