The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Abstract

Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering the impact of alignment on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.
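
The abstract describes using the separation between harmful and harmless inputs in embedding space as a quantitative signal, but it does not state which metric is used. The sketch below is only an illustration of the general idea, not the authors' method: it assumes a simple score (distance between class centroids divided by the average within-class spread) and uses synthetic Gaussian vectors in place of real model embeddings. The function names `separation_score` and `make_embeddings`, the dimensions, and the offsets are all hypothetical.

```python
import numpy as np


def separation_score(harmless, harmful):
    """Centroid distance divided by mean within-class spread.

    This is an assumed, illustrative metric; the abstract does not
    specify how separation in the safety space is measured.
    """
    mu_a = harmless.mean(axis=0)
    mu_b = harmful.mean(axis=0)
    between = np.linalg.norm(mu_a - mu_b)
    # Average distance of each point to its own class centroid.
    spread_a = np.linalg.norm(harmless - mu_a, axis=1).mean()
    spread_b = np.linalg.norm(harmful - mu_b, axis=1).mean()
    within = 0.5 * (spread_a + spread_b)
    return between / within


def make_embeddings(rng, n, dim, offset):
    """Synthetic stand-in for embeddings: two Gaussian clouds whose
    centroids differ by `offset` along the first dimension."""
    harmless = rng.normal(0.0, 1.0, size=(n, dim))
    harmful = rng.normal(0.0, 1.0, size=(n, dim))
    harmful[:, 0] += offset
    return harmless, harmful


rng = np.random.default_rng(0)
# Hypothetical "before alignment" case: the two classes nearly overlap.
base = separation_score(*make_embeddings(rng, 200, 16, offset=0.5))
# Hypothetical "after alignment" case: the two classes are pushed apart.
aligned = separation_score(*make_embeddings(rng, 200, 16, offset=6.0))

print(f"synthetic separation before alignment: {base:.2f}")
print(f"synthetic separation after alignment:  {aligned:.2f}")
```

In a real analysis, the synthetic arrays would presumably be replaced by hidden-state embeddings of toxic and non-toxic inputs, computed per language for both the base and the preference-tuned checkpoint, so that the score can be compared between high-resource and low-resource languages.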

arXiv information

Authors	Nikhil Verma, Manasa Bharadwaj
Published	2025-04-03 15:46:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
