Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations

Summary

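Large pretrained multilingual language models (ML-LMs) show a remarkable capacity for zero-shot cross-lingual transfer, even without direct cross-lingual supervision.
Promising as these results are, follow-up work has found that multilingual embedding spaces carry strong language identity information, which hinders the expression of linguistic factors shared across languages.
For semantic tasks such as cross-lingual sentence retrieval, it is desirable to remove these language identity signals so that semantic information can be fully exploited.
This work offers a new perspective: projecting language-specific factors away from a multilingual embedding space.
Specifically, we find that there is a low-rank subspace that primarily encodes information irrelevant to semantics (e.g., syntactic information).
To identify this subspace, we propose a simple but effective unsupervised method based on singular value decomposition that takes multiple monolingual corpora as input.
Once the subspace is found, the original embeddings can be projected directly into the null space to increase language agnosticism without any fine-tuning.
We systematically evaluate the method on a range of tasks, including the challenging language-agnostic QA retrieval task.
Empirical results show that applying the method consistently improves over commonly used ML-LMs.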
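
The summary names two steps: find a low-rank subspace with singular value decomposition, then project embeddings into its null space. The sketch below illustrates that idea on synthetic data. It is a simplified illustration, not the authors' exact algorithm: the summary does not say how the monolingual corpora enter the decomposition, so stacking per-language mean embeddings is an assumption made here, and the function names `language_subspace` and `project_out` are hypothetical.

```python
import numpy as np


def language_subspace(embeddings_by_lang, rank):
    """Estimate an orthonormal basis (d x rank) for a language-identity subspace.

    embeddings_by_lang: dict mapping a language code to an array of shape (n_i, d).
    Assumption (not from the source): language identity is captured by how the
    per-language mean embeddings differ from one another.
    """
    # One mean vector per language, stacked as columns: shape (d, num_languages).
    means = np.stack([e.mean(axis=0) for e in embeddings_by_lang.values()], axis=1)
    # Subtract the component shared by all languages.
    centered = means - means.mean(axis=1, keepdims=True)
    # Left singular vectors span the directions along which language means differ.
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :rank]


def project_out(embeddings, basis):
    """Remove from each row of `embeddings` its component inside span(basis)."""
    return embeddings - (embeddings @ basis) @ basis.T


# Synthetic example: identical "semantic" vectors plus a per-language offset.
rng = np.random.default_rng(0)
d, n = 16, 200
semantic = rng.normal(size=(n, d))
data = {lang: semantic + 5.0 * rng.normal(size=d) for lang in ("en", "de", "ja")}

# Three languages give at most two independent mean-difference directions.
basis = language_subspace(data, rank=2)
cleaned = {lang: project_out(x, basis) for lang, x in data.items()}
```

In this toy setup the language offsets lie entirely inside the estimated subspace, so the projected embeddings of the same sentence coincide across languages. Real embeddings would not separate this cleanly, and the rank would need to be chosen empirically.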

Abstract (original)

Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer, without direct cross-lingual supervision. While these results are promising, follow-up works found that, within the multilingual embedding spaces, there exists strong language identity information which hinders the expression of linguistic factors shared across languages. For semantic tasks like cross-lingual sentence retrieval, it is desired to remove such language identity signals to fully leverage semantic information. In this work, we provide a novel view of projecting away language-specific factors from a multilingual embedding space. Specifically, we discover that there exists a low-rank subspace that primarily encodes information irrelevant to semantics (e.g., syntactic information). To identify this subspace, we present a simple but effective unsupervised method based on singular value decomposition with multiple monolingual corpora as input. Once the subspace is found, we can directly project the original embeddings into the null space to boost language agnosticism without finetuning. We systematically evaluate our method on various tasks including the challenging language-agnostic QA retrieval task. Empirical results show that applying our method consistently leads to improvements over commonly used ML-LMs.

arXiv information

Authors	Zhihui Xie, Handong Zhao, Tong Yu, Shuai Li
Published	2024-01-11 09:54:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
