A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias

Abstract

Building on Large Language Models (LLMs), Multilingual Large Language Models (MLLMs) have been developed to address the challenges of multilingual natural language processing tasks, with the aim of transferring knowledge from high-resource to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs and examine these critical issues in depth. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual corpora widely used for training MLLMs and the multilingual datasets oriented toward downstream tasks that are crucial for enhancing the cross-lingual capability of MLLMs. Third, we survey existing studies on multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories and evaluation metrics, and summarize existing debiasing techniques. Finally, we discuss existing challenges and point out promising research directions. By covering these aspects, this paper aims to facilitate a deeper understanding of MLLMs and their potential in various domains.

arXiv information

Authors	Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Yuqi Ye, Hanwen Gu
Published	2024-06-06 16:04:15+00:00

Source, services used

arxiv.jp, Google
