F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Abstract

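Large language models (LLMs) have attracted significant attention for their unprecedented performance, and the number of studies evaluating LLMs is growing.
However, these evaluation benchmarks are limited to assessing instruction-following capabilities and overlook the fundamental abilities that emerge during the pre-training stage.
Previous subjective evaluation methods have relied mainly on scoring by API models.
However, in the absence of references, large models have limited ability to discern subtle differences.
To bridge this gap, we propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
The tasks in F-Eval include multi-choice objective tasks, open-ended objective tasks, reference-based subjective tasks, and reference-free subjective tasks.
For the reference-free subjective tasks, we devise new evaluation methods that serve as alternatives to scoring by API models.
We evaluate 13 advanced LLMs.
The results show that our evaluation methods achieve higher correlation coefficients and greater distinction than other evaluators.
In addition, we discuss the influence of different model sizes, dimensions, and normalization methods.
We expect F-Eval to facilitate the study of LLMs' fundamental abilities.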

Abstract (original)

Large language models (LLMs) garner significant attention for their unprecedented performance, leading to an increasing number of studies evaluating LLMs. However, these evaluation benchmarks are limited to assessing instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly rely on scoring by API models. However, in the absence of references, large models have shown limited ability to discern subtle differences. To bridge the gap, we propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic. The tasks in F-Eval include multi-choice objective tasks, open-ended objective tasks, reference-based subjective tasks and reference-free subjective tasks. For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models. We conduct evaluations on 13 advanced LLMs. Results show that our evaluation methods achieve higher correlation coefficients and larger distinction than other evaluators. Additionally, we discuss the influence of different model sizes, dimensions, and normalization methods. We anticipate that F-Eval will facilitate the study of LLMs’ fundamental abilities.

arXiv information

Authors	Yu Sun, Keyu Chen, Shujie Wang, Qipeng Guo, Hang Yan, Xipeng Qiu, Xuanjing Huang, Dahua Lin
Published	2024-01-26 13:55:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
