Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

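Summary

Large language models (LLMs) need to serve everyone, including the global majority of non-English speakers.
However, most LLMs today, and open LLMs in particular, are often intended for use in English only (e.g. Llama2, Mistral) or in a small handful of high-resource languages (e.g. Mixtral, Qwen).
Recent research shows that, despite these limits on intended use, people prompt LLMs in many different languages.
This paper therefore investigates the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use.
To this end, we introduce MultiQ, a new silver-standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages.
With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy.
All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use.
Most models are more accurate when they respond faithfully.
However, differences across models are large, and there is a long tail of languages in which models are neither accurate nor faithful.
We explore differences in tokenization as a potential explanation for our findings and identify possible correlations that warrant further investigation.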

Summary (original)

Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use in just English (e.g. Llama2, Mistral) or a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.
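
The two quantities named in the abstract, language fidelity and question answering accuracy, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Record` type, both function names, and the toy data are hypothetical, and the sketch assumes that the response language and the correctness judgment have already been produced by some external language detector and answer judge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One evaluated model response (hypothetical schema, not from the paper)."""
    prompt_lang: str    # language the question was asked in
    response_lang: str  # language detected in the answer (detector not shown)
    correct: bool       # whether the answer was judged correct (judge not shown)


def per_language_scores(records):
    """Return {prompt_lang: (fidelity, accuracy)} as fractions in [0, 1]."""
    counts = defaultdict(lambda: [0, 0, 0])  # total, faithful, correct
    for r in records:
        c = counts[r.prompt_lang]
        c[0] += 1
        c[1] += r.response_lang == r.prompt_lang  # faithful: same language as prompt
        c[2] += r.correct
    return {lang: (f / n, a / n) for lang, (n, f, a) in counts.items()}


def accuracy_by_fidelity(records):
    """Return (accuracy when faithful, accuracy when unfaithful).

    A value is None when the corresponding group has no records.
    """
    groups = {True: [], False: []}
    for r in records:
        groups[r.response_lang == r.prompt_lang].append(r.correct)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(groups[True]), mean(groups[False])


# Toy data for illustration only; these are not results from the paper.
records = [
    Record("de", "de", True),
    Record("de", "en", False),
    Record("sw", "en", False),
    Record("sw", "sw", True),
    Record("sw", "en", True),
    Record("sw", "en", False),
]

scores = per_language_scores(records)
faithful_acc, unfaithful_acc = accuracy_by_fidelity(records)
```

Splitting accuracy by fidelity, as in `accuracy_by_fidelity`, is one simple way to check a claim of the form "models are more accurate when they respond faithfully" on one's own data.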

arXiv information

Authors	Carolin Holtermann, Paul Röttger, Timm Dill, Anne Lauscher
Published	2024-03-06 16:01:44+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
