Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT

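Abstract

Generative large language models (LLMs) such as ChatGPT have shown remarkable proficiency on several NLP tasks, including machine translation, question answering, text summarization, and natural language understanding.
Recent research shows that using ChatGPT to assess machine translation (MT) quality achieves state-of-the-art performance at the system level but performs poorly at the segment level.
To further improve the performance of LLMs on MT quality assessment, we investigated several prompting methods.
Our results indicate that combining Chain-of-Thoughts with Error Analysis, in a new prompting method called \textbf{\texttt{Error Analysis Prompting}}, enables LLMs like ChatGPT to \textit{generate human-like MT evaluations at both the system and segment level}.
We also found that ChatGPT has some limitations as an MT evaluator, such as unstable scoring and biases when multiple translations are provided in a single query.
Our findings aim to provide preliminary experience for appropriately evaluating translation quality with ChatGPT, while offering a variety of tricks for designing prompts for in-context learning.
We anticipate that this report will shed new light on advancing the field of translation evaluation with LLMs by enhancing both the accuracy and the reliability of metrics.
The project can be found at \url{https://github.com/Coldmist-Lu/ErrorAnalysis_Prompt}.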

Abstract (original)

Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks such as machine translation, question answering, text summarization, and natural language understanding. Recent research has shown that utilizing ChatGPT for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conducted an investigation into several prompting methods. Our results indicate that by combining Chain-of-Thoughts and Error Analysis, a new prompting method called \textbf{\texttt{Error Analysis Prompting}}, LLMs like ChatGPT can \textit{generate human-like MT evaluations at both the system and segment level}. Additionally, we discovered some limitations of ChatGPT as an MT evaluator, such as unstable scoring and biases when provided with multiple translations in a single query. Our findings aim to provide a preliminary experience for appropriately evaluating translation quality on ChatGPT while offering a variety of tricks in designing prompts for in-context learning. We anticipate that this report will shed new light on advancing the field of translation evaluation with LLMs by enhancing both the accuracy and reliability of metrics. The project can be found in \url{https://github.com/Coldmist-Lu/ErrorAnalysis_Prompt}.

arXiv information

Authors	Qingyu Lu, Baopu Qiu, Liang Ding, Liping Xie, Dacheng Tao
Published	2023-03-24 05:05:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
