Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?

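Summary

One goal of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems in a way that matches human preferences. However, current automatic evaluation follows a different procedure from human evaluation. Specifically, human evaluation determines rankings by aggregating sentence-level relative judgments (for example, pairwise comparisons) with a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores into a corpus-level score and sorts systems by that score. To bridge this gap, this study proposes an aggregation method for existing automatic evaluation metrics that follows the human evaluation procedure. Experiments with a range of metrics, including edit-based, n-gram-based, and sentence-level metrics, show that closing the gap improves results for most metrics on the SEEDA benchmark. The experiments also show that BERT-based metrics can in some cases outperform GPT-4-based metrics. The proposed ranking method is integrated into gec-metrics.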
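
The contrast between the two ranking procedures can be illustrated with a small sketch. This is not the authors' implementation and not the gec-metrics API: the function names and the toy scores are hypothetical, and because the abstract does not name the rating algorithm, a plain win rate is used as a stand-in for it.

```python
from itertools import combinations


def rank_by_average(scores):
    """Conventional procedure: average sentence-level scores, then sort."""
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def rank_by_pairwise_wins(scores):
    """Human-style procedure: compare systems per sentence, then aggregate.

    Assumption: a simple win rate stands in for the rating algorithm.
    """
    wins = {name: 0.0 for name in scores}
    games = {name: 0 for name in scores}
    n_sentences = len(next(iter(scores.values())))
    for i in range(n_sentences):
        # Turn absolute scores into relative judgments for every system pair.
        for a, b in combinations(scores, 2):
            games[a] += 1
            games[b] += 1
            if scores[a][i] > scores[b][i]:
                wins[a] += 1
            elif scores[a][i] < scores[b][i]:
                wins[b] += 1
            else:  # a tie gives half a win to each side
                wins[a] += 0.5
                wins[b] += 0.5
    rate = {name: wins[name] / games[name] for name in scores}
    return sorted(rate, key=rate.get, reverse=True)


# Hypothetical sentence-level metric scores: three systems, four sentences.
toy_scores = {
    "sys_a": [0.9, 0.1, 0.1, 0.1],
    "sys_b": [0.2, 0.3, 0.3, 0.3],
    "sys_c": [0.1, 0.2, 0.2, 0.2],
}

print(rank_by_average(toy_scores))        # ['sys_a', 'sys_b', 'sys_c']
print(rank_by_pairwise_wins(toy_scores))  # ['sys_b', 'sys_c', 'sys_a']
```

In this toy data, sys_a tops the averaged ranking because of one high-scoring sentence, yet it loses most sentence-level comparisons and falls to last place under pairwise aggregation. The two procedures can therefore produce different rankings from the same sentence-level scores.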

Summary (original)

One of the goals of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems such that the ranking matches human preferences. However, current automatic evaluations are based on procedures that diverge from human evaluation. Specifically, human evaluation derives rankings by aggregating sentence-level relative evaluation results, e.g., pairwise comparisons, using a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores to obtain corpus-level scores, which are then sorted to determine rankings. In this study, we propose an aggregation method for existing automatic evaluation metrics which aligns with human evaluation methods to bridge this gap. We conducted experiments using various metrics, including edit-based metrics, n-gram-based metrics, and sentence-level metrics, and show that resolving the gap improves results for most metrics on the SEEDA benchmark. We also found that even BERT-based metrics sometimes outperform the metrics of GPT-4. The proposed ranking method is integrated into gec-metrics.

arXiv information

Authors	Takumi Goto, Yusuke Sakai, Taro Watanabe
Published	2025-06-03 17:24:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
