Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

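Summary

Large language models, particularly multilingual ones, are designed, claimed, and expected to serve native speakers of many different languages.
We hypothesise that current practices for fine-tuning and evaluating these models rely heavily on translation, which can introduce translation artefacts and defects, and so may not fully align with this goal.
It remains unknown whether the nature of the instruction data affects model output.
Conversely, it is questionable whether translated test sets can capture such nuances.
Because translated data is often used in both stages together, such imperfections may have been overlooked.
This work investigates these issues using controlled native or translated data in the instruction tuning and evaluation stages.
Experiments on eight base models and eight different benchmarks show that native or generation benchmarks reveal a notable difference between native and translated instruction data, especially when model performance is high, whereas other types of test sets do not.
The comparison between round-trip and single-pass translation reflects the importance of knowledge from language-native resources.
Finally, we show that regularization helps bridge this gap on structured tasks but not on generative tasks.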

Summary (original)

Large language models, particularly multilingual ones, are designed, claimed, and expected to cater to native speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation, which can introduce translation artefacts and defects. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues using controlled native or translated data during instruction tuning and evaluation stages. Experiments on eight base models and eight different benchmarks show that native or generation benchmarks reveal a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from language-native resources. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.

arXiv information

Authors	Pinzhen Chen, Simon Yu, Zhicheng Guo, Barry Haddow
Published	2024-07-11 16:37:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
