Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

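Summary

Pretrained language models are an essential part of AI applications, but the high computational cost of training them limits their accessibility.
Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development.
Despite these efforts, such models still face challenges: limited multilingual capabilities, the risk of catastrophic forgetting during continual pretraining, the high cost of training models from scratch, and the need to align with AI safety standards and regulatory frameworks.
This paper presents Aurora-M, a 15B-parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code.
Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M exceeds 2T tokens in total training token count.
It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, so its development is aligned not only with conventional red-teaming considerations but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.
We evaluate Aurora-M across a wide range of tasks and languages, demonstrating its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations.
Aurora-M and its variants are released as open source at https://huggingface.co/aurora-m to encourage responsible open-source development of large language models.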

Summary (original)

Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

arXiv information

Authors	Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo
Published	2024-12-27 03:53:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
