Guess & Sketch: Language Model Guided Transpilation

Abstract

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
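
The abstract notes that assembly code can be divided into shorter non-branching basic blocks that suit symbolic methods. The following is a minimal sketch of such a split, not the authors' implementation. It assumes that a label line (ending in ":") opens a block and that a control-transfer instruction closes one. The mnemonic set `BRANCH_MNEMONICS`, the function name `split_basic_blocks`, and the sample program are all hypothetical and chosen only for illustration.

```python
# Hypothetical set of control-transfer mnemonics, for illustration only.
# A real tool would use the full list for its target instruction set.
BRANCH_MNEMONICS = {"b", "beq", "bne", "blt", "bge", "j", "jmp", "ret"}


def split_basic_blocks(lines):
    """Split assembly lines into non-branching basic blocks.

    Assumption: a block starts at a label and ends right after a
    control-transfer instruction.
    """
    blocks, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        is_label = line.endswith(":")
        if is_label and current:
            # A label is a possible jump target, so it opens a new block.
            blocks.append(current)
            current = []
        current.append(line)
        if not is_label and line.split()[0].lower() in BRANCH_MNEMONICS:
            # A branch or return ends the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Made-up sample program, not taken from the paper's dataset.
program = """
main:
    mov x0, 1
    cmp x0, 2
    beq done
    add x0, x0, 1
done:
    ret
""".splitlines()

blocks = split_basic_blocks(program)
for i, block in enumerate(blocks):
    print(f"block {i}: {block}")
```

Each resulting block is straight-line code, which is the kind of short fragment a symbolic solver can check for semantic equivalence without reasoning about control flow.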

arXiv information

Authors	Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David Brooks, Stephen Chong, Gu-Yeon Wei, Alexander M. Rush
Published	2024-03-15 17:03:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
