GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models

Summary

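The field of relation extraction (RE) is shifting notably toward generative relation extraction (GRE), which leverages the capabilities of large language models (LLMs).
However, we found that traditional RE metrics such as precision and recall fall short when evaluating GRE methods.
The shortfall arises because these metrics rely on exact matching against human-annotated reference relations, whereas GRE methods often produce diverse, semantically accurate relations that differ from the references.
To fill this gap, we introduce GenRES, a multi-dimensional assessment of GRE results in terms of topic similarity, uniqueness, granularity, factualness, and completeness.
With GenRES, we empirically found that (1) precision and recall fail to justify the performance of GRE methods;
(2) human-annotated reference relations can be incomplete; and
(3) prompting LLMs with a fixed set of relations or entities can cause hallucinations.
Next, we conducted a human evaluation of GRE methods, which shows that GenRES is consistent with human preferences for RE quality.
Finally, to set a benchmark for future GRE research, we comprehensively evaluated fourteen leading LLMs with GenRES on document-, bag-, and sentence-level RE datasets.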
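The exact-matching problem described in the summary can be illustrated with a small sketch. It is not code from the paper: the function `exact_match_pr`, the triple format, and the example relations are all hypothetical, chosen only to show how a semantically correct extraction can score zero under exact-match precision and recall.

```python
from typing import Set, Tuple

# A relation is modeled here as a (subject, relation, object) triple.
# This representation is an assumption made for illustration.
Triple = Tuple[str, str, str]


def exact_match_pr(predicted: Set[Triple], reference: Set[Triple]) -> Tuple[float, float]:
    """Return (precision, recall) where a prediction counts only if it
    is string-identical to a reference triple."""
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical human-annotated reference: a single triple.
reference = {("Marie Curie", "place_of_birth", "Warsaw")}

# Hypothetical generative output: the first triple states the same fact
# with a different relation label, and the second is a true fact that
# the annotators did not record.
predicted = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
}

precision, recall = exact_match_pr(predicted, reference)
print(precision, recall)  # prints: 0.0 0.0
```

Both predicted triples are factually correct, yet neither matches the reference string for string, so the score is zero on both metrics. This is the kind of mismatch, together with incomplete references, that motivates evaluating along other dimensions.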

Summary (original)

The field of relation extraction (RE) is experiencing a notable shift towards generative relation extraction (GRE), leveraging the capabilities of large language models (LLMs). However, we discovered that traditional relation extraction (RE) metrics like precision and recall fall short in evaluating GRE methods. This shortfall arises because these metrics rely on exact matching with human-annotated reference relations, while GRE methods often produce diverse and semantically accurate relations that differ from the references. To fill this gap, we introduce GenRES for a multi-dimensional assessment in terms of the topic similarity, uniqueness, granularity, factualness, and completeness of the GRE results. With GenRES, we empirically identified that (1) precision/recall fails to justify the performance of GRE methods; (2) human-annotated referential relations can be incomplete; (3) prompting LLMs with a fixed set of relations or entities can cause hallucinations. Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality. Last, we made a comprehensive evaluation of fourteen leading LLMs using GenRES across document, bag, and sentence level RE datasets, respectively, to set the benchmark for future research in GRE.

arXiv information

Authors	Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han
Published	2024-02-16 15:01:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
