Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

Summary

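Current language models tuned for code tasks often adopt the pre-train-then-fine-tune paradigm from natural language processing and model source code as plain text.
This approach, however, overlooks the unambiguous structure inherent in programming languages.
This work explores data-efficient adaptation of pre-trained code models by further pre-training them and fine-tuning them on program structure.
Specifically, programs are represented as parse trees, also known as concrete syntax trees (CSTs), and the pre-trained models are adapted to serialized CSTs.
Although the adapted models were pre-trained only on the surface form of programs, a small amount of continued pre-training and fine-tuning on CSTs, with no change to the model architecture, yields improvements over the baseline approach across a variety of code tasks.
The improvement is especially pronounced when training examples are limited, which shows that integrating program structure with the plain-text representation is effective even when the backbone model was not pre-trained on structure.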

Abstract (original)

Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees — also known as concrete syntax trees (CSTs) — and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks. The improvements are found to be particularly significant when there are limited training examples, demonstrating the effectiveness of integrating program structures with plain-text representation even when working with backbone models that have not been pre-trained with structures.
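
The abstract says programs are represented as parse trees and the models are adapted on serialized CSTs, but it does not give the serialization format. The following is only a minimal sketch of the general idea: the `Node` class, the `serialize` function, and the XML-like open/close tags are all illustrative assumptions, not the authors' format, and the tree is built by hand rather than produced by a real parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical parse-tree node: internal nodes carry a grammar label,
    leaf nodes carry the source text of a token."""
    label: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def serialize(node: Node) -> str:
    """Flatten a tree into one string. Internal nodes become open/close tags
    around their children; leaves emit their token text unchanged.
    The tag syntax is an assumption made for illustration."""
    if not node.children:
        return node.text if node.text is not None else ""
    inner = " ".join(serialize(child) for child in node.children)
    return f"<{node.label}> {inner} </{node.label}>"


def surface_tokens(node: Node) -> List[str]:
    """Recover the plain-text token sequence by reading leaves left to right."""
    if not node.children:
        return [node.text] if node.text is not None else []
    tokens: List[str] = []
    for child in node.children:
        tokens.extend(surface_tokens(child))
    return tokens


# Hand-built tree for the statement `x = 1` (no parser involved).
tree = Node("assignment", children=[
    Node("identifier", children=[Node("token", text="x")]),
    Node("token", text="="),
    Node("integer", children=[Node("token", text="1")]),
])

serialized = serialize(tree)
print(serialized)
print(" ".join(surface_tokens(tree)))
```

In this sketch the serialized string still contains every surface token in order, so a model that reads plain text can consume it without any architectural change; the structure is carried entirely by the extra tag tokens.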

arXiv information

Authors	Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen
Published	2024-01-19 14:27:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
