Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation

Summary

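Research on open-domain dialogue agents focuses predominantly on English, in both models and datasets.
Moreover, crowdsourcing such datasets for fine-tuning demands a substantial investment of money and time, especially when several languages are involved.
Fortunately, advances in large language models (LLMs) have opened up many possibilities across diverse tasks.
In particular, instruction tuning lets LLMs carry out tasks from natural-language instructions, occasionally outperforming human crowdworkers.
These models can also operate in several languages within a single thread.
We therefore propose leveraging these capabilities to replicate the data collection process and generate new samples in different languages.
We introduce a pipeline that uses LLMs to generate open-domain dialogue data in multiple target languages, with demonstrations provided in a single source language.
Avoiding explicit machine translation in this approach improves adherence to language-specific nuances.
We apply this methodology to the PersonaChat dataset.
To make the generated dialogues more open and closer to real-life scenarios, we add the notion of speech events, which correspond to the type of conversation the speakers are engaged in, and the notion of common ground, which represents the premises of a conversation.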
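
The pipeline idea described above (source-language demonstrations, a target-language request, and conditioning on personas, a speech event, and common ground) can be illustrated with a small prompt builder. This is only a sketch under stated assumptions: the prompt layout, the field labels, and the names `Demonstration`, `format_conditions`, and `build_prompt` are hypothetical and are not taken from the paper, and the example personas and speech events are invented for illustration. The resulting string would be passed to whichever instruction-tuned LLM is used; no model call is shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """A source-language example dialogue plus the fields it was conditioned on (hypothetical schema)."""
    personas: List[str]
    speech_event: str
    common_ground: str
    turns: List[str]


def format_conditions(personas: List[str], speech_event: str, common_ground: str) -> str:
    """Render the conditioning fields as labelled lines (labels are an assumption)."""
    return "\n".join([
        "Personas: " + " | ".join(personas),
        f"Speech event: {speech_event}",
        f"Common ground: {common_ground}",
    ])


def build_prompt(
    demos: List[Demonstration],
    source_language: str,
    target_language: str,
    personas: List[str],
    speech_event: str,
    common_ground: str,
) -> str:
    """Assemble a few-shot prompt: examples in the source language, open slot in the target language."""
    instruction = (
        f"The examples below are written in {source_language}. "
        f"Write a new dialogue directly in {target_language}, following the given "
        "personas, speech event, and common ground. Do not translate the examples."
    )
    blocks = [instruction]
    # Demonstrations stay in the source language; nothing is machine-translated.
    for demo in demos:
        blocks.append("\n".join([
            f"Language: {source_language}",
            format_conditions(demo.personas, demo.speech_event, demo.common_ground),
            "Dialogue:",
            *demo.turns,
        ]))
    # The final block is left open so the model continues in the target language.
    blocks.append("\n".join([
        f"Language: {target_language}",
        format_conditions(personas, speech_event, common_ground),
        "Dialogue:",
    ]))
    return "\n\n".join(blocks)


# Illustrative usage with invented data.
demo = Demonstration(
    personas=["I like hiking.", "I have two dogs."],
    speech_event="small talk",
    common_ground="Two strangers waiting for the same train.",
    turns=["A: Hi! Do you take this train often?", "B: Only on weekends, when I go hiking."],
)
prompt = build_prompt(
    [demo], "English", "French",
    ["I am a baker.", "I love jazz."],
    "catching up",
    "Two old friends meeting after several years.",
)
print(prompt)
```

Ending the prompt with an open `Dialogue:` slot labelled with the target language is one simple way to ask for text written directly in that language rather than a translation of the examples; other layouts, such as chat-style messages, would serve the same purpose.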

Abstract (original)

The prevailing paradigm in the domain of Open-Domain Dialogue agents predominantly focuses on the English language, encompassing both models and datasets. Furthermore, the financial and temporal investments required for crowdsourcing such datasets for finetuning are substantial, particularly when multiple languages are involved. Fortunately, advancements in Large Language Models (LLMs) have unveiled a plethora of possibilities across diverse tasks. Specifically, instruction-tuning has enabled LLMs to execute tasks based on natural language instructions, occasionally surpassing the performance of human crowdworkers. Additionally, these models possess the capability to function in various languages within a single thread. Consequently, to generate new samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating Open-Domain Dialogue data in multiple Target Languages using LLMs, with demonstrations provided in a unique Source Language. By eschewing explicit Machine Translation in this approach, we enhance the adherence to language-specific nuances. We apply this methodology to the PersonaChat dataset. To enhance the openness of generated dialogues and mimic real life scenarii, we added the notion of speech events corresponding to the type of conversation the speakers are involved in and also that of common ground which represents the premises of a conversation.

arXiv information

Authors	Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, Fabrice Lefèvre
Published	2025-03-05 12:52:14+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
