Can linguists better understand DNA?

Summary

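Multilingual transfer ability, meaning how well a model fine-tuned on one source language carries over to other languages, has been studied extensively in multilingual pre-trained models.
Whether comparable transfer exists between natural language and gene sequences (treated as a language) remains underexplored. This study addresses that gap, taking inspiration from the sentence-pair classification task used to evaluate sentence similarity in natural language.
We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination).
Both tasks were designed to test whether capabilities transfer from natural language to gene sequences.
Even a small pre-trained model such as GPT-2-small, pre-trained on English, reached 78% accuracy on the DNA-pair classification task after being fine-tuned on English sentence-pair classification data (XTREME PAWS-X).
With a BERT model trained on multilingual text, accuracy reached 89%.
On the more complex DNA-protein-pair classification task, however, the model's output was barely distinguishable from random output. Taken together, the experiments confirm that capability transfer from natural language to biological language is clearly present.
Building on this, we also investigated how model parameter scale and pre-training affect the transfer.
We provide recommendations for facilitating capability transfer from natural language to genetic language, along with new approaches to biological research based on this capability. The study offers an intriguing new perspective on the relationship between natural language and genetic language.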
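The two tasks described above can be pictured as labeled pairs, in the same shape as sentence-pair classification data. The following is only a minimal sketch: the summary does not say how the pairs were built, so the synthetic random sequences, the mutation-based positives, the 50/50 label split, and every function name here are assumptions made for illustration. The only non-hypothetical ingredient is the standard genetic code used for translation.

```python
import random

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA codon by codon; a trailing partial codon is dropped.

    Stop codons are kept and written as '*'.
    """
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def random_dna(rng, length):
    """Hypothetical helper: a uniformly random DNA string."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(rng, dna, rate):
    """Hypothetical helper: substitute each base with probability `rate`."""
    out = []
    for base in dna:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def dna_pair_examples(rng, n, length=60, rate=0.1):
    """Assumed construction: label 1 pairs a sequence with a lightly mutated
    copy, label 0 pairs it with an independent random sequence."""
    examples = []
    for i in range(n):
        seq = random_dna(rng, length)
        if i % 2 == 0:
            examples.append((seq, mutate(rng, seq, rate), 1))
        else:
            examples.append((seq, random_dna(rng, length), 0))
    return examples


def dna_protein_examples(rng, n, length=60):
    """Assumed construction: label 1 pairs a sequence with its own
    translation, label 0 pairs it with the translation of another sequence."""
    examples = []
    for i in range(n):
        seq = random_dna(rng, length)
        if i % 2 == 0:
            examples.append((seq, translate(seq), 1))
        else:
            examples.append((seq, translate(random_dna(rng, length)), 0))
    return examples


rng = random.Random(0)
dna_pairs = dna_pair_examples(rng, 4)
dna_protein_pairs = dna_protein_examples(rng, 4)
```

Each example is a `(first, second, label)` triple, so it can be fed to a classifier in the same way as a PAWS-X sentence pair. With balanced labels like these, a model that guesses at random lands near 50% accuracy, which is the baseline against which the reported DNA-protein result should be read.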

Abstract (original)

Multilingual transfer ability, which reflects how well models fine-tuned on one source language can be applied to other languages, has been well studied in multilingual pre-trained models. However, the existence of such capability transfer between natural language and gene sequences/languages remains underexplored. This study addresses this gap by drawing inspiration from the sentence-pair classification task used for evaluating sentence similarity in natural language. We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences. Even a small-scale pre-trained model like GPT-2-small, which was pre-trained on English, achieved an accuracy of 78% on the DNA-pair classification task after being fine-tuned on English sentence-pair classification data (XTREME PAWS-X). While training a BERT model on multilingual text, the precision reached 89%. On the more complex DNA-protein-pair classification task, however, the model's output was barely distinguishable from random output. Experimental validation has confirmed that the transfer of capabilities from natural language to biological language is unequivocally present. Building on this foundation, we have also investigated the impact of model parameter scale and pre-training on this capability transfer. We provide recommendations for facilitating the transfer of capabilities from natural language to genetic language, as well as new approaches for conducting biological research based on this capability. This study offers an intriguing new perspective on exploring the relationship between natural language and genetic language.

arXiv information

Author	Wang Liang
Published	2025-01-17 08:54:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
