Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

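Summary

Multilingual large language models are designed, claimed, and expected to serve speakers of many different languages.
We hypothesise that current practices for fine-tuning and evaluating these models may not fully align with this goal because they rely heavily on translation, which cannot cover language-specific knowledge and can introduce translation defects.
It remains unknown whether the nature of the instruction data affects model output.
Conversely, it is questionable whether translated test sets can capture such nuances.
Because translated data is often used in both stages together, such imperfections may have been overlooked.
This work investigates these issues using controlled native or translated data in the instruction tuning and evaluation stages.
We show that native or generation benchmarks reveal a notable difference between native and translated instruction data, especially when model performance is high, whereas other types of test sets do not.
The comparison between round-trip and single-pass translation reflects the importance of knowledge from language-native resources.
Finally, we show that regularization helps bridge this gap on structured tasks but not on generative tasks.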

Summary (original)

Multilingual large language models are designed, claimed, and expected to cater to speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation, which cannot cover language-specific knowledge but can introduce translation defects. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues using controlled native or translated data during the instruction tuning and evaluation stages. We show that native or generation benchmarks reveal a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from language-native resources. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.

arXiv information

Authors	Pinzhen Chen,Simon Yu,Zhicheng Guo,Barry Haddow
Published	2024-09-26 17:39:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
