Evaluation Framework for AI Systems in ‘the Wild’

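Summary

Generative AI (GenAI) models have become important across industries, but current evaluation methods have not adapted to their widespread use.
Traditional evaluations often rely on benchmarks and fixed datasets that fail to reflect real-world performance, leaving a gap between lab-tested results and practical applications.
This white paper proposes a comprehensive framework for evaluating real-world GenAI systems, emphasizing diverse, evolving inputs and holistic, dynamic, and continuous evaluation approaches.
It gives practitioners guidance on designing evaluation methods that accurately reflect real-time capabilities, and gives policymakers recommendations for crafting GenAI policies focused on societal impact rather than fixed performance numbers or parameter sizes.
The authors advocate holistic frameworks that integrate performance, fairness, and ethics, together with continuous, outcome-oriented methods that combine human and automated assessment and remain transparent in order to foster trust among stakeholders.
Implementing these strategies ensures that GenAI models are not only technically proficient but also ethically responsible and impactful.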

Summary (original)

Generative AI (GenAI) models have become vital across industries, yet current evaluation methods have not adapted to their widespread use. Traditional evaluations often rely on benchmarks and fixed datasets, frequently failing to reflect real-world performance, which creates a gap between lab-tested outcomes and practical applications. This white paper proposes a comprehensive framework for how we should evaluate real-world GenAI systems, emphasizing diverse, evolving inputs and holistic, dynamic, and ongoing assessment approaches. The paper offers guidance for practitioners on how to design evaluation methods that accurately reflect real-time capabilities, and provides policymakers with recommendations for crafting GenAI policies focused on societal impacts, rather than fixed performance numbers or parameter sizes. We advocate for holistic frameworks that integrate performance, fairness, and ethics and the use of continuous, outcome-oriented methods that combine human and automated assessments while also being transparent to foster trust among stakeholders. Implementing these strategies ensures GenAI models are not only technically proficient but also ethically responsible and impactful.

arXiv information

Authors	Sarah Jabbour, Trenton Chang, Anindya Das Antar, Joseph Peper, Insu Jang, Jiachen Liu, Jae-Won Chung, Shiqi He, Michael Wellman, Bryan Goodman, Elizabeth Bondi-Kelly, Kevin Samy, Rada Mihalcea, Mosharaf Chowdhury, David Jurgens, Lu Wang
Published	2025-04-23 14:52:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
