Efficient Detection of Toxic Prompts in Large Language Models

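Summary

Large language models (LLMs) such as ChatGPT and Gemini have significantly advanced natural language processing, enabling applications such as chatbots and automated content generation.
However, malicious users can exploit these models by crafting toxic prompts that elicit harmful or unethical responses.
Such users often rely on jailbreaking techniques to bypass safety mechanisms, which underscores the need for robust toxic prompt detection.
Existing detection techniques, both blackbox and whitebox, struggle with the diversity of toxic prompts, scalability, and computational efficiency.
In response, the authors propose ToxicDetector, a lightweight greybox method designed to detect toxic prompts in LLMs efficiently.
ToxicDetector uses LLMs to create toxic concept prompts, forms feature vectors from embedding vectors, and classifies prompts with a Multi-Layer Perceptron (MLP) classifier.
Evaluation on several versions of the Llama models, Gemma-2, and multiple datasets shows that ToxicDetector achieves an accuracy of 96.39% and a false positive rate of 2.00%, outperforming state-of-the-art methods.
Its processing time of 0.0780 seconds per prompt makes it well suited to real-time applications.
Together, this accuracy, efficiency, and scalability make ToxicDetector a practical method for toxic prompt detection in LLMs.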
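
The pipeline described above (toxic concept prompts, embedding-based feature vectors, an MLP classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the feature vectors are built, so the sketch assumes one cosine-similarity score per concept embedding. The embeddings are hypothetical 3-dimensional toy vectors rather than real LLM embeddings, and `cosine`, `feature_vector`, and `TinyMLP` are illustrative names.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_vector(prompt_emb, concept_embs):
    """Assumed feature design: one similarity score per toxic-concept embedding."""
    return [cosine(prompt_emb, c) for c in concept_embs]


class TinyMLP:
    """One-hidden-layer perceptron with a sigmoid output, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        # tanh hidden layer followed by a sigmoid output unit
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, x):
        """Estimated probability that the feature vector x belongs to a toxic prompt."""
        return self._forward(x)[1]

    def fit(self, xs, ys, epochs=200, lr=0.1):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                h, p = self._forward(x)
                g = p - y  # gradient of binary cross-entropy w.r.t. the pre-sigmoid output
                for j, hj in enumerate(h):
                    gh = g * self.w2[j] * (1.0 - hj * hj)  # backprop through tanh
                    self.w2[j] -= lr * g * hj
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * gh * xi
                    self.b1[j] -= lr * gh
                self.b2 -= lr * g


# Hypothetical toy embeddings; a real system would read these from the LLM.
concept_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
train_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9], [-0.2, 0.0, 1.0]]
train_labels = [1, 1, 0, 0]  # 1 = toxic, 0 = benign

features = [feature_vector(e, concept_embs) for e in train_embs]
clf = TinyMLP(n_in=len(concept_embs), n_hidden=4)
clf.fit(features, train_labels)
score = clf.predict_proba(feature_vector([0.8, 0.2, 0.0], concept_embs))
print(f"toxicity score: {score:.3f}")
```

The only part of this sketch taken from the abstract is the overall shape: features derived from embeddings, then a small MLP. In practice the similarity function, the layer the embeddings are read from, and the classifier size would all need to follow the paper itself.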

Abstract (original)

Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. Additionally, ToxicDetector’s processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.
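
The reported figures (accuracy, false positive rate, seconds per prompt) follow standard definitions, which the short sketch below spells out. The label lists are hypothetical and are not data from the paper; the function names are illustrative. The only number taken from the abstract is the 0.0780 seconds per prompt, which corresponds to roughly 12.8 prompts per second if prompts are processed one at a time (an assumption, since the abstract does not mention batching).

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives; label 1 means 'toxic'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all prompts classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(tn, fp):
    """Fraction of benign prompts wrongly flagged as toxic."""
    return fp / (fp + tn)


def throughput(seconds_per_prompt):
    """Prompts per second, assuming strictly sequential processing."""
    return 1.0 / seconds_per_prompt


# Hypothetical labels for illustration only (not results from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"accuracy: {accuracy(tp, tn, fp, fn):.2%}")
print(f"false positive rate: {false_positive_rate(tn, fp):.2%}")
print(f"throughput at 0.0780 s/prompt: {throughput(0.0780):.1f} prompts/s")
```

A low false positive rate matters in a deployed filter because every false positive is a benign user request that gets blocked, so it is worth tracking separately from overall accuracy.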

arXiv information

Authors	Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu
Published	2024-08-21 15:54:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
