Topical: Learning Repository Embeddings from Source Code using Attention

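Summary

Machine learning on source code (MLOnCode) promises to transform how software is delivered.
By mining the context and relationships between software artefacts, MLOnCode augments software developers' capabilities with code auto-generation, code recommendation, code auto-tagging, and other data-driven enhancements.
A script-level representation of code is sufficient for many of these tasks, but others, such as auto-tagging repositories with topics or auto-documenting repository code, require a repository-level representation that accounts for dependencies and repository structure.
Existing methods for computing repository-level representations have two weaknesses: (a) they rely on natural-language documentation of the code (for example, README files), and (b) they aggregate method- or script-level representations naively, for example by concatenation or averaging.
This paper introduces Topical, a deep neural network that generates repository-level embeddings of publicly available GitHub code repositories directly from source code.
Topical incorporates an attention mechanism that projects the source code, the full dependency graph, and script-level textual information into a dense repository-level representation.
To compute these representations, Topical is trained to predict the topics associated with a repository, using a dataset of publicly available GitHub repositories crawled together with their ground-truth topic tags.
Experiments show that, on the task of repository auto-tagging, the embeddings computed by Topical can outperform multiple baselines, including ones that naively combine method-level representations through averaging or concatenation.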
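
The contrast between naive averaging and attention-based aggregation can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the single-query scaled dot-product attention, the sigmoid multi-label head, and every name here (`mean_pool`, `attention_pool`, `topic_probabilities`, `query`, `topic_weights`) are assumptions chosen for illustration, with random vectors standing in for script-level embeddings and learned parameters.

```python
import math
import random


def mean_pool(script_embeddings):
    """Naive baseline: average the script-level vectors element-wise."""
    n = len(script_embeddings)
    dim = len(script_embeddings[0])
    return [sum(vec[i] for vec in script_embeddings) / n for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(script_embeddings, query):
    """Weight each script by a softmax over scaled dot products with `query`.

    `query` stands in for a learned parameter (an assumption of this sketch).
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim)
        for vec in script_embeddings
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, script_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def topic_probabilities(repo_embedding, topic_weights):
    """Hypothetical multi-label head: one independent sigmoid per topic."""
    probs = []
    for row in topic_weights:
        logit = sum(w * x for w, x in zip(row, repo_embedding))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


rng = random.Random(0)
# Three scripts with 4-dimensional embeddings (random stand-ins).
scripts = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
query = [rng.gauss(0, 1) for _ in range(4)]
# Two topics, one weight row each (random stand-ins for trained weights).
topic_weights = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]

repo_mean = mean_pool(scripts)
repo_attn, weights = attention_pool(scripts, query)
probs = topic_probabilities(repo_attn, topic_weights)
```

With an all-zero query every script receives the same weight, so attention pooling reduces to the averaging baseline; a trained query lets the model emphasise the scripts that are most informative for the topic labels.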

Summary (original)

Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationship between software artefacts, MLOnCode augments the software developers' capabilities with code auto-generation, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script level representation of code is sufficient; however, in many cases a repository level representation that takes into account various dependencies and repository structure is imperative, for example, auto-tagging repositories with topics or auto-documentation of repository code. Existing methods for computing repository level representations suffer from (a) reliance on natural language documentation of code (for example, README files) and (b) naive aggregation of method/script-level representation, for example, by concatenation or averaging. This paper introduces Topical, a deep neural network to generate repository level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the script level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that were crawled along with their ground truth topic tags. Our experiments show that the embeddings computed by Topical are able to outperform multiple baselines, including baselines that naively combine the method-level representations through averaging or concatenation, at the task of repository auto-tagging.

arXiv information

Authors	Agathe Lherondelle, Varun Babbar, Yash Satsangi, Fran Silavong, Shaltiel Eloul, Sean Moran
Published	2023-07-07 13:44:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
