Can Large Language Models Be an Alternative to Human Evaluations?

Summary

Title: Can Large Language Models Be an Alternative to Human Evaluations?

Summary:

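– Human evaluation is indispensable for assessing the quality of text generated by machine learning models or written by humans.
– However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, which hinders fair comparison among different natural language processing (NLP) models and algorithms.
– Recently, large language models (LLMs) have shown exceptional performance on unseen tasks when given only the task instructions.
– This paper explores whether LLMs can be used as an alternative to human evaluation.
– LLM evaluation presents the LLMs with exactly the same instructions, samples, and questions used for human evaluation, and asks the LLMs to generate responses to those questions.
– Texts from two NLP tasks, open-ended story generation and adversarial attacks, are evaluated with both human evaluation and LLM evaluation.
– The results of LLM evaluation are consistent with those of expert human evaluation: texts rated higher by human experts are also rated higher by the LLMs.
– The results of LLM evaluation are also stable across different formatting of the task instructions and across the sampling algorithm used to generate the answer.
– The authors state that they are the first to show the potential of using LLMs to assess text quality, and they discuss the limitations and ethical considerations of LLM evaluation.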
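The LLM evaluation procedure described in the list above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt layout, the 1 to 5 rating scale, the helper names (`build_prompt`, `parse_rating`, `llm_evaluate`), and the stub `fake_llm` are all assumptions introduced here. In practice the `llm` argument would wrap a call to an actual model.

```python
import re
from typing import Callable, Optional


def build_prompt(instructions: str, sample: str, question: str) -> str:
    """Combine the same three pieces a human rater would be shown.

    The layout below is an assumption; the summary only says that the
    instructions, sample, and question are identical to the human setup.
    """
    return f"{instructions}\n\nText:\n{sample}\n\n{question}\nAnswer:"


def parse_rating(response: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the response that lies in [low, high].

    The 1-5 scale is a hypothetical default, not taken from the source.
    """
    for match in re.finditer(r"\d+", response):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None  # the model did not produce a usable rating


def llm_evaluate(
    llm: Callable[[str], str], instructions: str, sample: str, question: str
) -> Optional[int]:
    """Ask the model the rating question and extract a numeric answer."""
    return parse_rating(llm(build_prompt(instructions, sample, question)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "I would rate it 4 out of 5."


rating = llm_evaluate(
    fake_llm,
    "Please rate the following story fragment.",
    "Once upon a time, a fox found a lantern.",
    "How grammatically correct is the text? (1 = lowest, 5 = highest)",
)
```

Passing the model as a plain callable keeps the sketch independent of any particular API, so the same function can be reused with different models or sampling settings.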

Summary (original)

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable over different formatting of the task instructions and the sampling algorithm used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts and discuss the limitations and ethical considerations of LLM evaluation.
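One simple way to check the kind of consistency the abstract describes (texts rated higher by human experts are also rated higher by the LLMs) is to compare the ordering of mean ratings per text source. The sketch below is only an illustration under stated assumptions: the function name `rank_by_mean`, the source labels, and every rating value are made up for the example and are not results from the paper, and the abstract does not say which agreement measure the authors use.

```python
from statistics import mean
from typing import Dict, List


def rank_by_mean(ratings: Dict[str, List[int]]) -> List[str]:
    """Order text sources from highest to lowest mean rating."""
    return sorted(ratings, key=lambda name: mean(ratings[name]), reverse=True)


# Hypothetical ratings, invented purely for illustration.
human_ratings = {"source_a": [4, 5, 4], "source_b": [3, 2, 3]}
llm_ratings = {"source_a": [5, 4, 4], "source_b": [3, 3, 2]}

# True when both evaluators place the sources in the same order.
same_order = rank_by_mean(human_ratings) == rank_by_mean(llm_ratings)
```

With more than two sources, a rank correlation coefficient would give a finer-grained picture than this all-or-nothing comparison.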

arXiv information

Authors	Cheng-Han Chiang, Hung-yi Lee
Published	2023-05-03 07:28:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
