LLMs in the Heart of Differential Testing: A Case Study on a Medical Rule Engine

Summary

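The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support its core cancer registry activities: data capture, data curation, and the production of data products and statistics for various stakeholders.
GURI is a core component of CaReSS and is responsible for validating incoming data against medical rules.
These medical rules are implemented manually by medical experts based on medical standards, regulations, and research.
Because large language models (LLMs) have been trained on a large amount of public information, including these documents, they can be used to generate tests for GURI.
We therefore propose LLMeDiff, an LLM-based test generation and differential testing approach, to test GURI.
We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the LLMs' hallucination, success, time efficiency, and robustness in generating tests, as well as the ability of these tests to find potential issues in GURI.
Our results show that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust.
However, it has the worst time efficiency.
Differential testing revealed 22 medical rules with implementation inconsistencies (for example, in how rule versions are handled).
Finally, we provide insights for practitioners and researchers based on these results.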

Summary (original)

The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support core cancer registry activities, i.e., data capture, data curation, and producing data products and statistics for various stakeholders. GURI is a core component of CaReSS, which is responsible for validating incoming data with medical rules. Such medical rules are manually implemented by medical experts based on medical standards, regulations, and research. Since large language models (LLMs) have been trained on a large amount of public information, including these documents, they can be employed to generate tests for GURI. Thus, we propose an LLM-based test generation and differential testing approach (LLMeDiff) to test GURI. We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the hallucination, success, time efficiency, and robustness of the LLMs to generate tests, and these tests’ ability to find potential issues in GURI. Our results showed that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust; however, it has the worst time efficiency. Our differential testing revealed 22 medical rules where implementation inconsistencies were discovered (e.g., regarding handling rule versions). Finally, we provide insights for practitioners and researchers based on the results.
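
The differential testing idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two engines (`engine_a`, `engine_b`), the toy age-range rule, the record layout, and the hard-coded inputs that stand in for LLM-generated tests. It only shows the general pattern of running the same tests against two implementations and reporting where their verdicts differ.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical record layout: field name -> integer value.
Record = Dict[str, int]
# A rule engine returns True (pass), False (fail), or None (rule not applicable).
Engine = Callable[[Record], Optional[bool]]


def engine_a(record: Record) -> Optional[bool]:
    """Hypothetical first implementation of a made-up age-range rule."""
    age = record.get("age")
    if age is None:
        return None  # rule cannot be evaluated without an age
    return 0 <= age <= 120


def engine_b(record: Record) -> Optional[bool]:
    """Hypothetical second implementation with a deliberate boundary difference."""
    age = record.get("age")
    if age is None:
        return None
    return 0 < age <= 120  # disagrees with engine_a when age == 0


def differential_test(
    tests: List[Record], first: Engine, second: Engine
) -> List[dict]:
    """Run every test on both engines and collect the inputs where they disagree."""
    mismatches = []
    for record in tests:
        verdict_first = first(record)
        verdict_second = second(record)
        if verdict_first != verdict_second:
            mismatches.append(
                {"input": record, "first": verdict_first, "second": verdict_second}
            )
    return mismatches


# Stand-in for LLM-generated tests (hard-coded here, an assumption).
generated_tests: List[Record] = [{"age": 0}, {"age": 45}, {"age": 130}, {}]

for mismatch in differential_test(generated_tests, engine_a, engine_b):
    print(mismatch)
```

A mismatch does not say which implementation is wrong; it only flags an input worth inspecting by hand.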

arXiv information

Authors	Erblin Isaku, Christoph Laaber, Hassan Sartaj, Shaukat Ali, Thomas Schwitalla, Jan F. Nygård
Published	2025-01-29 12:36:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
