A Statistical Framework for Ranking LLM-Based Chatbots

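Summary

Large language models (LLMs) have transformed natural language processing, and frameworks such as Chatbot Arena provide pioneering platforms for evaluating these models.
By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone of LLM evaluation, offering rich datasets for ranking models on open-ended conversational tasks.
Building on this foundation, we propose a statistical framework that incorporates key advances to address specific challenges in pairwise comparison analysis.
First, we introduce a factored tie model, which improves the handling of ties, an integral aspect of human-judged comparisons, and significantly improves the model's fit to observed data.
Second, we extend the framework to model covariance between competitors, enabling deeper insight into performance relationships and facilitating intuitive grouping into performance tiers.
Third, we resolve the optimization challenges that arise from parameter non-uniqueness by introducing novel constraints, ensuring stable and interpretable parameter estimates.
Through rigorous evaluation and extensive experiments, our framework demonstrates substantial improvements over existing methods in modeling pairwise comparison data.
To support reproducibility and practical adoption, we release leaderbot, an open-source Python package that implements our models and analyses.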
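
As a rough illustration of how ties can enter a pairwise comparison model, the sketch below uses the classical Davidson extension of the Bradley-Terry model. It is not the factored tie model proposed in the paper and it does not use the leaderbot API; the function names, the tie parameter `nu`, and the example scores are assumptions made purely for illustration. The `center` helper shows one standard way to remove the shift non-uniqueness of the scores (a zero-mean constraint), which is not necessarily the constraint the authors introduce.

```python
import math


def davidson_probs(x_i, x_j, nu=0.5):
    """Return (P(i wins), P(j wins), P(tie)) under the Davidson model.

    x_i, x_j are log-strength scores; nu >= 0 controls how likely ties are
    (nu = 0 recovers the plain Bradley-Terry model).
    """
    pi_i, pi_j = math.exp(x_i), math.exp(x_j)
    tie = nu * math.sqrt(pi_i * pi_j)
    total = pi_i + pi_j + tie
    return pi_i / total, pi_j / total, tie / total


def center(scores):
    """Shift scores to zero mean.

    Adding a constant to every score leaves all probabilities unchanged,
    so the scores are only identified up to a shift; centering picks one
    representative.
    """
    mean = sum(scores.values()) / len(scores)
    return {name: s - mean for name, s in scores.items()}


def rank(scores):
    """Return competitor names ordered from strongest to weakest."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for three hypothetical models.
    scores = center({"model_a": 1.2, "model_b": 0.4, "model_c": -0.1})
    print(rank(scores))
    print(davidson_probs(scores["model_a"], scores["model_b"]))
```

Because the win, loss, and tie probabilities depend only on score differences, fitting such a model without a constraint like the centering above leaves the optimizer free to drift along the shift direction.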

Abstract (original)

Large language models (LLMs) have transformed natural language processing, with frameworks like Chatbot Arena providing pioneering platforms for evaluating these models. By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation, offering rich datasets for ranking models in open-ended conversational tasks. Building upon this foundation, we propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle ties — an integral aspect of human-judged comparisons — significantly improving the model’s fit to observed data. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and facilitating intuitive groupings into performance tiers. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints, ensuring stable and interpretable parameter estimation. Through rigorous evaluation and extensive experimentation, our framework demonstrates substantial improvements over existing methods in modeling pairwise comparison data. To support reproducibility and practical adoption, we release leaderbot, an open-source Python package implementing our models and analyses.

arXiv information

Authors	Siavash Ameli, Siyuan Zhuang, Ion Stoica, Michael W. Mahoney
Published	2024-12-24 12:54:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
