Human-Calibrated Automated Testing and Validation of Generative Language Models

Summary

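This paper presents a comprehensive framework for evaluating and validating generative language models (GLMs), focusing on retrieval-augmented generation (RAG) systems deployed in high-stakes domains such as banking.
GLM evaluation is difficult because outputs are open-ended and quality assessment is subjective.
Taking advantage of the structured nature of RAG systems, in which generated responses are grounded in a predefined document collection, the authors propose the Human-Calibrated Automated Testing (HCAT) framework.
HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk, and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction.
The framework also includes robustness testing, which evaluates model performance under adversarial, out-of-distribution, and varied input conditions, and targeted weakness identification, which uses marginal and bivariate analysis to pinpoint specific areas for improvement.
This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, and a practical, reliable way to deploy GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.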

Abstract (original)

This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
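The abstract names conformal prediction as one stage of aligning machine-generated evaluations with human judgments but gives no details. The following is only a minimal sketch of generic split conformal prediction, not the authors' implementation. It assumes a machine metric and a human rating on the same 0 to 1 scale, uses the absolute residual as the nonconformity score, and runs on made-up calibration data; the function names `conformal_quantile` and `prediction_interval` are hypothetical.

```python
import math


def conformal_quantile(machine_scores, human_scores, alpha=0.1):
    """Split-conformal quantile of |human - machine| for target coverage 1 - alpha.

    Assumption: the nonconformity score is the absolute residual between
    the human rating and the machine score on a held-out calibration set.
    """
    residuals = sorted(abs(h - m) for m, h in zip(machine_scores, human_scores))
    n = len(residuals)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to guarantee this coverage level.
        return math.inf
    return residuals[rank - 1]


def prediction_interval(machine_score, q):
    """Interval expected to contain the human rating with coverage 1 - alpha."""
    return (machine_score - q, machine_score + q)


# Hypothetical calibration set: machine metric vs. human rating per test case.
machine = [0.62, 0.80, 0.35, 0.91, 0.50, 0.73, 0.28, 0.66, 0.84, 0.45]
human = [0.60, 0.70, 0.40, 0.95, 0.55, 0.75, 0.20, 0.70, 0.80, 0.50]

q = conformal_quantile(machine, human, alpha=0.2)
low, high = prediction_interval(0.58, q)
print(f"quantile={q:.2f}, interval=({low:.2f}, {high:.2f})")
```

Under the usual exchangeability assumption, an interval built this way covers the human rating for a new test case with probability at least 1 - alpha. A wide interval signals that the machine metric is a poor proxy for human judgment on that kind of input.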

arXiv information

Authors	Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava
Published	2024-11-25 13:53:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
