Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics

Abstract (original)

Machine Translation (MT) evaluation metrics assess translation quality automatically. Recently, researchers have employed MT metrics for various new use cases, such as data filtering and translation re-ranking. However, most MT metrics return assessments as scalar scores that are difficult to interpret, posing a challenge to making informed design choices. Moreover, MT metrics’ capabilities have historically been evaluated using correlation with human judgment, which, despite its efficacy, falls short of providing intuitive insights into metric performance, especially in terms of new metric use cases. To address these issues, we introduce an interpretable evaluation framework for MT metrics. Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases. Furthermore, by measuring the performance of MT metrics using Precision, Recall, and F-score, we offer clearer insights into their capabilities than correlation with human judgments. Finally, we raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines, reporting a notably low agreement with Multidimensional Quality Metrics (MQM) annotations.
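The abstract describes evaluating metrics with Precision, Recall, and F-score instead of correlation. The sketch below is a minimal illustration under stated assumptions. It treats a metric as a binary classifier for the data-filtering proxy: a translation counts as "kept" when its metric score reaches a threshold, and this decision is compared against binary human "good/bad" labels, which could come from MQM annotations, for example. The function name `precision_recall_f1`, the threshold, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def precision_recall_f1(
    metric_scores: Sequence[float],
    is_good: Sequence[bool],
    threshold: float,
) -> tuple[float, float, float]:
    """Hypothetical helper: score a metric as a binary 'good translation' classifier.

    A translation is predicted 'good' (kept by the filter) when its metric
    score is >= threshold. Precision, recall, and F1 are computed against
    the human 'good' labels (assumed to be given as booleans).
    """
    if len(metric_scores) != len(is_good):
        raise ValueError("scores and labels must have the same length")

    tp = fp = fn = 0
    for score, good in zip(metric_scores, is_good):
        predicted_good = score >= threshold
        if predicted_good and good:
            tp += 1  # kept, and humans judged it good
        elif predicted_good and not good:
            fp += 1  # kept, but humans judged it bad
        elif not predicted_good and good:
            fn += 1  # filtered out, but humans judged it good

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy, made-up data: five translations with metric scores and human labels.
    scores = [0.9, 0.8, 0.4, 0.7, 0.2]
    labels = [True, False, False, True, True]
    p, r, f = precision_recall_f1(scores, labels, threshold=0.6)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Unlike a correlation coefficient, these numbers answer a direct question about the filtering use case: what fraction of kept translations are actually good, and what fraction of good translations survive. One possible way to choose a cutoff is to sweep `threshold` over the observed score range and keep the value with the highest F-score. The abstract does not say whether the paper selects thresholds this way.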

arXiv information

Authors	Stefano Perrella, Lorenzo Proietti, Pere-Lluís Huguet Cabot, Edoardo Barba, Roberto Navigli
Published	2024-10-07 16:42:10+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
