Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese

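Summary

Creating instruction data and evaluation benchmarks for large language models often requires an enormous amount of human annotation.
The problem is especially pronounced when such resources must be developed quickly for a non-English language such as Japanese.
Rather than following the common practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), the authors propose an efficient self-instruct method based on GPT-4.
First, a small number of English instructions are translated into Japanese and post-edited to reach native-level quality.
GPT-4 then uses them as demonstrations to generate Japanese instruction data automatically.
The authors also build an evaluation benchmark of 80 questions across 8 categories, in which GPT-4 automatically assesses the response quality of LLMs without human references.
The empirical results suggest that models fine-tuned on the GPT-4 self-instruct data significantly outperformed Japanese-Alpaca on all three base pre-trained models.
With the GPT-4 self-instruct data, the LLaMA 13B model beat GPT-3.5 (Davinci-003) with a 54.37% win rate.
Human evaluation shows that GPT-4's assessments are consistent with human preference.
The high-quality instruction data and the evaluation benchmark have been released.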

Summary (original)

The creation of instruction data and evaluation benchmarks for serving Large language models often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4. We first translate a small amount of English instructions into Japanese and post-edit them to obtain native-level quality. GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data. We also construct an evaluation benchmark containing 80 questions across 8 categories, using GPT-4 to automatically assess the response quality of LLMs without human references. The empirical results suggest that the models fine-tuned on our GPT-4 self-instruct data significantly outperformed the Japanese-Alpaca across all three base pre-trained models. Our GPT-4 self-instruct data allowed the LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37% win-rate. The human evaluation exhibits the consistency between GPT-4’s assessments and human preference. Our high-quality instruction data and evaluation benchmark have been released here.
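
The win rate mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how ties are scored or how the judge's verdicts are encoded, so the `win`/`tie`/`loss` labels, the `win_rate` helper, and the choice to count a tie as half a win are all assumptions made here for illustration, and the verdicts below are made up rather than taken from the paper.

```python
from collections import Counter


def win_rate(judgments, tie_weight=0.5):
    """Return the win rate in percent for a list of 'win'/'tie'/'loss' labels.

    tie_weight is an assumption: the share of a win credited for each tie.
    """
    if not judgments:
        raise ValueError("judgments must not be empty")
    counts = Counter(judgments)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    score = counts["win"] + tie_weight * counts["tie"]
    return 100.0 * score / len(judgments)


# Hypothetical verdicts for an 80-question benchmark (not the paper's data).
example = ["win"] * 40 + ["tie"] * 10 + ["loss"] * 30
print(f"win rate: {win_rate(example):.2f}%")
```

Setting `tie_weight=0` would instead count only outright wins, which is another common convention.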

arXiv information

Authors	Yikun Sun, Zhen Wan, Nobuhiro Ueda, Sakiko Yahata, Fei Cheng, Chenhui Chu, Sadao Kurohashi
Published	2024-03-06 13:17:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
