Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension

Abstract

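The performance of NLP systems is typically evaluated by collecting a large-scale dataset through crowdsourcing, training a data-driven model on it, and evaluating the model on a held-out portion of the data.
This approach has been shown to suffer from spurious correlations and from a lack of challenging examples that represent the diversity of natural language.
Instead, we examine a framework for evaluating optimised models on synthetically generated challenge sets in a training-set-free setting.
We find that, despite the simplicity of the generation method, the data can compete with crowd-sourced datasets in naturalness and lexical diversity for the purpose of evaluating the linguistic capabilities of MRC models.
We conduct further experiments and show that state-of-the-art language-model-based MRC systems can learn to succeed on the challenge set, yet do so without capturing the general notion of the evaluated phenomenon.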

Abstract (original)

Performance of NLP systems is typically evaluated by collecting a large-scale dataset by means of crowd-sourcing to train a data-driven model and evaluate it on a held-out portion of the data. This approach has been shown to suffer from spurious correlations and the lack of challenging examples that represent the diversity of natural language. Instead, we examine a framework for evaluating optimised models in training-set free setting on synthetically generated challenge sets. We find that despite the simplicity of the generation method, the data can compete with crowd-sourced datasets with regard to naturalness and lexical diversity for the purpose of evaluating the linguistic capabilities of MRC models. We conduct further experiments and show that state-of-the-art language model-based MRC systems can learn to succeed on the challenge set correctly, although, without capturing the general notion of the evaluated phenomenon.

arXiv information

Authors	Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro
Published	2024-08-09 12:23:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
