Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Summary

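Title: Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Summary:
- The authors explore prompting multilingual large language models (LLMs) in a zero-shot manner to generate code-mixed text for five South East Asian languages (Indonesian, Malay, Chinese, Tagalog, Vietnamese) and the creole language Singlish.
- ChatGPT shows the most potential, producing code-mixed text 68% of the time when the term "code-mixing" is explicitly defined.
- ChatGPT and InstructGPT (davinci-003) perform notably well at generating Singlish text, averaging a 96% success rate across a variety of prompts. However, their code-mixing proficiency is dampened by word choice errors that lead to semantic inaccuracies.
- Other multilingual models (BLOOMZ, Flan-T5-XXL) are unable to produce code-mixed text at all.
- By highlighting the limited promise of LLMs for this specific form of low-resource data generation, the authors call for a measured approach when applying similar techniques to other data-scarce NLP contexts.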
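
The zero-shot setup summarized above, where the prompt may or may not define "code-mixing" explicitly, can be illustrated with a minimal Python sketch. The definition wording, the prompt template, and the names `build_prompt` and `success_rate` are all hypothetical and are not taken from the paper; the sketch only shows how such a prompt could be assembled and how a success rate could be computed from manually annotated outputs.

```python
# Hypothetical definition text; the paper's exact wording is not given in this summary.
CODE_MIXING_DEFINITION = (
    "Code-mixing refers to combining two or more languages "
    "within a single sentence."
)


def build_prompt(language: str, topic: str, define_term: bool = True) -> str:
    """Assemble a zero-shot prompt asking for English-<language> code-mixed text.

    The template below is an illustrative assumption, not the authors' prompt.
    """
    request = (
        f"Write one code-mixed sentence about {topic} "
        f"that mixes English and {language}."
    )
    if define_term:
        # Prepend the explicit definition of the term.
        return f"{CODE_MIXING_DEFINITION} {request}"
    return request


def success_rate(is_code_mixed: list[bool]) -> float:
    """Fraction of generated outputs that an annotator marked as code-mixed."""
    if not is_code_mixed:
        raise ValueError("need at least one annotated output")
    return sum(is_code_mixed) / len(is_code_mixed)


if __name__ == "__main__":
    print(build_prompt("Tagalog", "food"))
    # Toy annotations, invented purely to exercise the function.
    print(success_rate([True, False, True, True]))
```

The resulting prompt string would then be sent to whichever model is being evaluated; that call is left out here because it depends on the provider's API.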

Summary (original)

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The proliferation of Large Language Models (LLMs) in recent times compels one to ask: can these systems be used for data generation? In this article, we explore prompting multilingual LLMs in a zero-shot manner to create code-mixed data for five languages in South East Asia (SEA) — Indonesian, Malay, Chinese, Tagalog, Vietnamese, as well as the creole language Singlish. We find that ChatGPT shows the most potential, capable of producing code-mixed text 68% of the time when the term ‘code-mixing’ is explicitly defined. Moreover, both ChatGPT’s and InstructGPT’s (davinci-003) performances in generating Singlish texts are noteworthy, averaging a 96% success rate across a variety of prompts. Their code-mixing proficiency, however, is dampened by word choice errors that lead to semantic inaccuracies. Other multilingual models such as BLOOMZ and Flan-T5-XXL are unable to produce code-mixed texts altogether. By highlighting the limited promises of LLMs in a specific form of low-resource data generation, we call for a measured approach when applying similar techniques to other data-scarce NLP contexts.

arXiv information

Authors	Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Samuel Cahyawijaya, Holy Lovenia, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Long Phan, Yin Lin Tan, Alham Fikri Aji
Published	2023-03-30 14:59:26+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
