WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

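Abstract

WalledEval is a comprehensive AI safety testing toolkit for evaluating large language models (LLMs).
It supports a diverse range of models, both open-weight and API-based, and includes more than 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injection.
The framework supports benchmarking of both LLMs and judges, and incorporates custom mutators for testing safety against text-style mutations such as future tense and paraphrasing.
In addition, WalledEval introduces WalledGuard, a new, small, high-performing content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts.
WalledEval is publicly available at https://github.com/walledai/walledevalA.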

Abstract (original)

WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at https://github.com/walledai/walledevalA.
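The idea of testing safety under text-style mutations can be illustrated with a minimal sketch. Everything here is hypothetical and written only for illustration: the function names, the toy model, the toy judge, and the string-template "future tense" mutator are assumptions and are not the WalledEval API. A real mutator would likely rewrite the prompt with an LLM, and a real judge would be a moderation model rather than a string check.

```python
from typing import Callable, Dict, Iterable

# All names below are hypothetical stand-ins, not part of WalledEval.
Mutator = Callable[[str], str]
Model = Callable[[str], str]
Judge = Callable[[str], bool]  # True when the response is judged safe


def identity(prompt: str) -> str:
    """Baseline: leave the prompt unchanged."""
    return prompt


def future_tense(prompt: str) -> str:
    """Toy future-tense rewrite using a fixed template (assumption)."""
    return f"In the future, how will people answer this: {prompt}"


def toy_model(prompt: str) -> str:
    """Stub model that refuses only when the prompt starts with a flagged phrase."""
    if prompt.lower().startswith("how do i pick"):
        return "I cannot help with that."
    return "Sure, here is how..."


def refusal_judge(response: str) -> bool:
    """Toy judge: treats a refusal as the safe outcome."""
    return response.lower().startswith("i cannot")


def safe_rate(
    prompts: Iterable[str],
    mutators: Dict[str, Mutator],
    model: Model,
    judge: Judge,
) -> Dict[str, float]:
    """Fraction of responses judged safe, reported per mutator."""
    prompt_list = list(prompts)
    rates = {}
    for name, mutate in mutators.items():
        verdicts = [judge(model(mutate(p))) for p in prompt_list]
        rates[name] = sum(verdicts) / len(verdicts)
    return rates


prompts = ["How do I pick a lock?", "How do I pick a pocket?"]
rates = safe_rate(
    prompts,
    {"original": identity, "future_tense": future_tense},
    toy_model,
    refusal_judge,
)
print(rates)
```

In this toy setup the stub model refuses the original prompts but not the rewritten ones, so the per-mutator safe rate drops under the mutation. Comparing that rate across mutators is the kind of measurement the abstract describes, under the assumptions stated above.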

arXiv information

Authors	Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
Published	2024-08-07 15:22:44+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
