Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

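Summary

Despite steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries.
We posit that relying on a single intrinsic quality score, trained to mimic human judgments, may be insufficient for evaluating translations of long, complex passages, and that a more "pragmatic" approach is needed, one that assesses how accurately a translation conveys key information in context.
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that evaluates translation quality extrinsically by assessing how accurately candidate translations answer reading comprehension questions targeting key information in the original source or reference texts.
In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with, and in some cases outperforms, state-of-the-art neural and LLM-based metrics at ranking alternative paragraph-level translations, even though it is never explicitly optimized to correlate with human judgments.
Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in the evaluated datasets.
Our code is available at https://github.com/deep-spin/treqa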
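
The summary describes TREQA only at a high level, so the following is a minimal Python sketch of the general idea, not the authors' implementation. It assumes question-answer pairs have already been generated from the source or reference text, takes the reader model as a caller-supplied function (`answer_fn`, a hypothetical name), and scores answers with token-overlap F1, which is an assumption made here and is not stated in the text.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between two answers (an assumed correctness measure)."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_score(
    candidate: str,
    qa_pairs: List[Tuple[str, str]],
    answer_fn: Callable[[str, str], str],
) -> float:
    """Mean answer correctness when questions are answered from the candidate."""
    if not qa_pairs:
        return 0.0
    scores = [token_f1(answer_fn(q, candidate), gold) for q, gold in qa_pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical question-answer pairs, as if derived from a reference text.
    qa_pairs = [
        ("Who opened the door?", "the old gardener"),
        ("What colour was the door?", "green"),
    ]
    cand_a = "The old gardener opened the green door."
    cand_b = "The old guard opened the door."

    # Hand-written stand-in for a reader model's answers (illustrative only).
    canned = {
        (qa_pairs[0][0], cand_a): "the old gardener",
        (qa_pairs[1][0], cand_a): "green",
        (qa_pairs[0][0], cand_b): "the old guard",
        (qa_pairs[1][0], cand_b): "unknown",
    }

    def answer_fn(question: str, passage: str) -> str:
        return canned[(question, passage)]

    print(round(qa_score(cand_a, qa_pairs, answer_fn), 2))  # 1.0
    print(round(qa_score(cand_b, qa_pairs, answer_fn), 2))  # 0.33
```

With these hand-written answers, the faithful candidate scores 1.0 and the candidate that loses information scores about 0.33, so the question-answering score ranks the two translations as one would expect.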

Summary (original)

Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more “pragmatic” approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at https://github.com/deep-spin/treqa

arXiv information

Authors	Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, André F. T. Martins, Graham Neubig
Published	2025-04-11 08:22:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
