M-Prometheus: A Suite of Open Multilingual LLM Judges

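Summary

Language models are increasingly used to automatically evaluate long-form text (LLM-as-a-judge), but most LLM judges are optimized exclusively for English, and strategies for strengthening their multilingual evaluation capabilities remain largely unexplored in the current literature.
This creates a disparity in the quality of automatic evaluation methods for non-English languages and ultimately hinders the development of models with better multilingual capabilities.
To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters.
M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages and on literary machine translation (MT) evaluation covering 4 language pairs.
Furthermore, M-Prometheus models can be used at decoding time to significantly improve generated outputs in all 3 tested languages, which demonstrates their utility for developing better multilingual models.
Finally, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data rather than translated data.
We release our models, training dataset, and code.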

Abstract (original)

The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data instead of translated data. We release our models, training dataset, and code.
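The abstract does not say how the judges are applied at decoding time. One common approach is best-of-n reranking: sample several candidate outputs and keep the one the judge prefers. The sketch below illustrates that idea for both feedback modes the abstract mentions (direct assessment and pairwise comparison). It is a minimal illustration under stated assumptions, not the authors' procedure: `toy_judge` and `toy_prefer` are hypothetical stand-ins for real judge-model calls, and the function names are invented for this example.

```python
from typing import Callable, Sequence


def best_of_n_direct(candidates: Sequence[str],
                     judge_score: Callable[[str], float]) -> str:
    """Direct assessment: score every candidate and keep the highest-scoring one."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    # max() returns the first maximal element, so earlier candidates win ties.
    return max(candidates, key=judge_score)


def best_of_n_pairwise(candidates: Sequence[str],
                       prefer: Callable[[str, str], bool]) -> str:
    """Pairwise comparison: keep a running winner, challenged by each candidate in turn."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    best = candidates[0]
    for challenger in candidates[1:]:
        # prefer(a, b) is True when the judge prefers a over b.
        if prefer(challenger, best):
            best = challenger
    return best


def toy_judge(text: str) -> float:
    """Hypothetical judge: 1 point plus 1 per word, capped at 5 (a real judge would call a model)."""
    return min(5.0, 1.0 + len(text.split()))


def toy_prefer(a: str, b: str) -> bool:
    """Hypothetical pairwise judge derived from the toy scores."""
    return toy_judge(a) > toy_judge(b)


if __name__ == "__main__":
    samples = ["Bom dia", "Bom dia a todos", "Bom dia a"]
    print(best_of_n_direct(samples, toy_judge))
    print(best_of_n_pairwise(samples, toy_prefer))
```

In practice the two callables would wrap prompts to a judge model. Direct assessment needs n judge calls, while this sequential pairwise scheme needs n - 1 comparisons but can be sensitive to candidate order.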

arXiv information

Authors	José Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, André F. T. Martins
Published	2025-04-07 11:37:26+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
