Trust-Oriented Adaptive Guardrails for Large Language Models

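Abstract

Guardrails are an emerging mechanism designed to ensure that large language models (LLMs) align with human values by moderating harmful or toxic responses, and their design requires a sociotechnical approach. This paper addresses a critical issue: existing guardrails lack a well-founded methodology for accommodating the diverse needs of different user groups, particularly with regard to access rights. We introduce an adaptive guardrail mechanism that dynamically moderates access to sensitive content based on user trust metrics. The mechanism is supported by trust modeling (primarily the "social" aspect) and enhanced with online in-context learning via retrieval-augmented generation (the "technical" aspect). User trust metrics are defined as a novel combination of direct interaction trust and authority-verified trust; they allow the system to tailor the strictness of content moderation precisely to the user's credibility and the specific context of their inquiries. Our empirical evaluation demonstrates that the adaptive guardrail effectively meets diverse user needs, outperforming existing guardrails while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. To the best of our knowledge, this work is the first to introduce a trust-oriented concept into a guardrail system, offering a scalable solution that enriches the discourse on the ethical deployment of next-generation LLM services.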
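To make the idea of trust-based moderation more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract states only that user trust combines direct interaction trust and authority-verified trust; the weighted average, the [0, 1] score range, the threshold comparison, and every name below (`UserTrust`, `combined_trust`, `allow_access`, `weight_direct`, `sensitivity`) are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserTrust:
    # Both scores are hypothetical values in [0, 1]; the paper's actual
    # scales and update rules are not given in the abstract.
    direct_interaction: float   # trust built up from past interactions
    authority_verified: float   # trust backed by a verified credential


def combined_trust(user: UserTrust, weight_direct: float = 0.5) -> float:
    """Combine the two trust sources with a weighted average.

    The weighted average is an assumption made for this sketch; the
    abstract only says the two sources are combined.
    """
    for value in (user.direct_interaction, user.authority_verified, weight_direct):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must lie in [0, 1]")
    return (weight_direct * user.direct_interaction
            + (1.0 - weight_direct) * user.authority_verified)


def allow_access(user: UserTrust, sensitivity: float,
                 weight_direct: float = 0.5) -> bool:
    """Grant access when combined trust meets the content's sensitivity.

    `sensitivity` is a hypothetical per-query score in [0, 1]; a higher
    value means stricter moderation is required.
    """
    return combined_trust(user, weight_direct) >= sensitivity


if __name__ == "__main__":
    user = UserTrust(direct_interaction=0.8, authority_verified=0.4)
    print(allow_access(user, sensitivity=0.5))  # True
    print(allow_access(user, sensitivity=0.9))  # False
```

In this sketch, the same user is granted access to moderately sensitive content but refused highly sensitive content, which mirrors the abstract's notion of adjusting strictness to both user credibility and query context.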

Abstract (original)

Guardrail, an emerging mechanism designed to ensure that large language models (LLMs) align with human values by moderating harmful or toxic responses, requires a sociotechnical approach in their design. This paper addresses a critical issue: existing guardrails lack a well-founded methodology to accommodate the diverse needs of different user groups, particularly concerning access rights. Supported by trust modeling (primarily on ‘social’ aspect) and enhanced with online in-context learning via retrieval-augmented generation (on ‘technical’ aspect), we introduce an adaptive guardrail mechanism, to dynamically moderate access to sensitive content based on user trust metrics. User trust metrics, defined as a novel combination of direct interaction trust and authority-verified trust, enable the system to precisely tailor the strictness of content moderation by aligning with the user’s credibility and the specific context of their inquiries. Our empirical evaluation demonstrates the effectiveness of the adaptive guardrail in meeting diverse user needs, outperforming existing guardrails while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. To the best of our knowledge, this work is the first to introduce trust-oriented concept into a guardrail system, offering a scalable solution that enriches the discourse on ethical deployment for next-generation LLM service.

arXiv information

Authors	Jinwei Hu, Yi Dong, Xiaowei Huang
Published	2025-02-03 16:03:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
