Benchmarking LLMs’ Judgments with No Gold Standard

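Abstract

We introduce GEM (Generative Estimator for Mutual Information), an evaluation metric for language generation by large language models (LLMs). It is particularly suited to assessing how well a model generates informative judgments when no gold-standard reference is available.
GEM broadens the scenarios in which LLM generation can be benchmarked, from traditional ones such as machine translation and summarization, where gold-standard references are readily available, to subjective tasks with no clear gold standard, such as academic peer review.
GEM uses a generative model to estimate the mutual information between candidate and reference responses, without requiring the reference to be a gold standard.
In experiments on a human-annotated dataset, GEM correlates with human scores competitively with the state-of-the-art GPT-4o Examiner and outperforms all other baselines.
GEM is also more robust to strategic manipulations, such as rephrasing or elongation, that can artificially inflate scores under a GPT-4o Examiner.
We also present GRE-bench (Generating Review Evaluation Benchmark), which evaluates LLMs on how well they can generate high-quality peer reviews of academic research papers.
Because GRE-bench is built on GEM, it inherits GEM's robustness properties.
GRE-bench also avoids data contamination (data leakage) by drawing on the continuous influx of new open-access research papers and peer reviews each year.
Using the ICLR2023 dataset, we report GRE-bench results on the peer-review capabilities of various popular LLMs.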
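
The idea of scoring a candidate by the mutual information it shares with a (possibly imperfect) reference can be illustrated with a pointwise estimate: the average of log P(reference | candidate) minus log P(reference). The abstract does not spell out the estimator, so the following is only a minimal sketch under that assumption. The names `pmi_score` and `make_toy_logprob` are hypothetical, and a smoothed unigram model stands in for the generative model purely so that the example runs without any external service.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical interface: log-probability of `text` given a (possibly empty) `context`.
LogProbFn = Callable[[str, str], float]


def pmi_score(pairs: Sequence[Tuple[str, str]], logprob: LogProbFn) -> float:
    """Average of log P(reference | candidate) - log P(reference) over the pairs."""
    total = 0.0
    for candidate, reference in pairs:
        total += logprob(reference, candidate) - logprob(reference, "")
    return total / len(pairs)


def make_toy_logprob(background: str, boost: int = 5) -> LogProbFn:
    """Toy stand-in for a generative model (an assumption, not the paper's model):
    a unigram model with add-one smoothing whose word counts are increased
    by `boost` for every word occurrence in the context."""
    base = Counter(background.lower().split())

    def logprob(text: str, context: str) -> float:
        counts = Counter(base)
        for word in context.lower().split():
            counts[word] += boost
        words = text.lower().split()
        vocab = set(counts) | set(words)
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        return sum(math.log((counts[w] + 1) / denom) for w in words)

    return logprob


toy = make_toy_logprob(
    "the paper proposes a method the results are strong the writing is clear"
)
reference = "the proof of theorem two has a gap"
relevant = pmi_score([("theorem two proof gap", reference)], toy)
irrelevant = pmi_score([("nice colorful figures", reference)], toy)
print(relevant, irrelevant)
```

In this toy setting a candidate that shares content with the reference makes the reference more predictable and scores above zero, while an unrelated candidate scores below zero. A real setup would replace `toy` with log-probabilities obtained from an actual language model.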

Abstract (original)

We introduce the GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by Large Language Models (LLMs), particularly in generating informative judgments, without the need for a gold standard reference. GEM broadens the scenarios where we can benchmark LLM generation performance, from traditional ones, like machine translation and summarization, where gold standard references are readily available, to subjective tasks without clear gold standards, such as academic peer review. GEM uses a generative model to estimate mutual information between candidate and reference responses, without requiring the reference to be a gold standard. In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner, and outperforms all other baselines. Additionally, GEM is more robust against strategic manipulations, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner. We also present GRE-bench (Generating Review Evaluation Benchmark), which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers. Because GRE-bench is based upon GEM, it inherits its robustness properties. Additionally, GRE-bench circumvents data contamination problems (or data leakage) by using the continuous influx of new open-access research papers and peer reviews each year. We show GRE-bench results of various popular LLMs on their peer review capabilities using the ICLR2023 dataset.
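
Agreement between an automatic metric and human scores, as mentioned in the abstract, is commonly summarized with a correlation coefficient. The abstract does not say which statistic is used, so Spearman's rank correlation is assumed here purely for illustration, and the function names and sample numbers are hypothetical.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up metric scores and human ratings for four responses.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_scores = [1, 3, 2, 4]
print(spearman(metric_scores, human_scores))
```

Because only the ordering matters, a metric can correlate perfectly with human ratings under this statistic even when its raw values sit on a different scale.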

arXiv information

Authors	Shengwei Xu, Yuxuan Lu, Grant Schoenebeck, Yuqing Kong
Published	2024-11-11 16:58:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
