KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model

Summary

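Most biomedical pretrained language models are monolingual and cannot meet the growing demand for cross-lingual support.
The scarcity of non-English domain corpora, to say nothing of parallel data, is a major obstacle to training multilingual biomedical models.
Because knowledge forms the core of a domain-specific corpus and can be translated accurately into different languages, we propose KBioXLM, a model that adapts the multilingual pretrained model XLM-R to the biomedical domain using a knowledge-anchored approach.
We build a multilingual biomedical corpus by incorporating knowledge alignments at three granularities (entity, fact, and passage level) into monolingual corpora.
We then design three corresponding training tasks (entity masking, relation masking, and passage relation prediction) and continue pretraining from XLM-R to strengthen its cross-lingual ability within the domain.
To validate the model, we translate English benchmarks for multiple tasks into Chinese.
Experimental results show that our model significantly outperforms both monolingual and multilingual pretrained models in cross-lingual zero-shot and few-shot scenarios, with improvements of up to 10+ points.
Our code is publicly available at https://github.com/ngwlh-gl/KBioXLM.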

Abstract (original)

Most biomedical pretrained language models are monolingual and cannot handle the growing cross-lingual requirements. The scarcity of non-English domain corpora, not to mention parallel data, poses a significant hurdle in training multilingual biomedical models. Since knowledge forms the core of domain-specific corpora and can be translated into various languages accurately, we propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach. We achieve a biomedical multilingual corpus by incorporating three granularity knowledge alignments (entity, fact, and passage levels) into monolingual corpora. Then we design three corresponding training tasks (entity masking, relation masking, and passage relation prediction) and continue training on top of the XLM-R model to enhance its domain cross-lingual ability. To validate the effectiveness of our model, we translate the English benchmarks of multiple tasks into Chinese. Experimental results demonstrate that our model significantly outperforms monolingual and multilingual pretrained models in cross-lingual zero-shot and few-shot scenarios, achieving improvements of up to 10+ points. Our code is publicly available at https://github.com/ngwlh-gl/KBioXLM.
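
The abstract names entity masking as one of the three training tasks but does not spell out how it works. As a rough illustration only, the sketch below shows one common reading of the idea: every token inside a selected entity span is replaced by the mask token, and the original tokens are kept as prediction targets. The function name `mask_entities`, the `mask_prob` parameter, and the span format are assumptions made for this example and are not taken from the paper or its repository; only the `<mask>` string follows XLM-R's tokenizer convention.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # mask token string used by the XLM-R tokenizer


def mask_entities(
    tokens: List[str],
    entity_spans: List[Tuple[int, int]],
    mask_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask whole entity spans (hypothetical helper, not the authors' code).

    entity_spans holds half-open (start, end) token index pairs.
    Returns the masked token list and a label list that carries the
    original token at masked positions and None everywhere else.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for start, end in entity_spans:
        # Decide once per entity so that a span is masked as a unit.
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


if __name__ == "__main__":
    toks = ["aspirin", "treats", "head", "ache"]
    spans = [(0, 1), (2, 4)]  # "aspirin" and "head ache" as entities
    print(mask_entities(toks, spans, mask_prob=1.0))
```

Masking at the span level rather than per token forces the model to recover an entire entity from its context, which is the usual motivation for entity-level masking; how KBioXLM combines this with its cross-lingual knowledge alignment is described in the paper itself, not here.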

arXiv information

Authors	Lei Geng, Xu Yan, Ziqiang Cao, Juntao Li, Wenjie Li, Sujian Li, Xinjie Zhou, Yang Yang, Jun Zhang
Published	2023-11-20 07:02:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
