Truth Knows No Language: Evaluating Truthfulness Beyond English

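Abstract

We introduce a professionally translated extension of the TruthfulQA benchmark, designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish.
Truthfulness evaluations of large language models (LLMs) have been conducted primarily in English.
However, the ability of LLMs to maintain truthfulness across languages remains under-explored.
Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring.
Our findings reveal that, while LLMs perform best in English and worst in Basque, overall truthfulness discrepancies across languages are smaller than anticipated.
Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics do, and that informativeness plays a critical role in truthfulness assessment.
Our results also indicate that machine translation is a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation.
Finally, we observe that universal knowledge questions are handled better across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability.
The dataset and code are publicly available under open licenses.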

Abstract (original)

We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.

arXiv information

Authors	Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
Published	2025-02-13 15:04:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
