CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

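Abstract

Recent work has widely adopted large-scale language model pretraining on source code, proposed source-code-specific pretraining objectives, and investigated how well various Transformer-based language model architectures apply to source code.
This work investigates another important aspect of such models, namely the effect of different subtokenization options, and aims to identify the most effective and length-efficient subtokenizations while taking the specifics of code into account.
We propose a subtokenization that reduces average length by 17% without degrading downstream performance, and show that a carefully chosen subtokenization can improve quality by 0.5-2%, possibly at the cost of a slight increase in length.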

Abstract (original)

Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, namely the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking into account code specifics. We propose a subtokenization that reduces average length by 17% without downstream performance drop, and show that a carefully chosen subtokenization may improve quality by 0.5-2%, possibly with some length increase.
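
To make the "average length" figure concrete, the sketch below shows one way a relative length reduction between two subtokenizations could be computed. It is a toy illustration only: the two regex-based schemes, the function names, and the two sample snippets are assumptions made for this example, and they are not the subtokenizations or data evaluated in the paper.

```python
import re

def fine_grained(code: str) -> list[str]:
    # Toy scheme A (assumed): alphanumeric runs are tokens,
    # and every punctuation character is its own token.
    return re.findall(r"\w+|[^\w\s]", code)

def punct_merged(code: str) -> list[str]:
    # Toy scheme B (assumed): same as A, but a run of adjacent
    # punctuation characters is kept as a single token.
    return re.findall(r"\w+|[^\w\s]+", code)

def average_length(snippets, tokenize) -> float:
    # Mean number of tokens per snippet under the given scheme.
    return sum(len(tokenize(s)) for s in snippets) / len(snippets)

def relative_reduction(baseline: float, candidate: float) -> float:
    # Fraction by which the candidate shortens sequences relative to the baseline.
    return (baseline - candidate) / baseline

# Hypothetical sample inputs, chosen only to exercise both schemes.
snippets = ["foo(bar[0])", "x += 1"]
base = average_length(snippets, fine_grained)
cand = average_length(snippets, punct_merged)
print(f"baseline={base}, candidate={cand}, "
      f"reduction={relative_reduction(base, cand):.1%}")
```

A reported reduction such as 17% would correspond to `relative_reduction` returning 0.17 on the evaluation corpus; the number printed here comes from the toy snippets and has no bearing on the paper's results.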

arXiv information

Authors	Nadezhda Chirkova, Sergey Troshin
Published	2023-08-01 17:40:48+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
