LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text

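Abstract

As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks focus mainly on factual accuracy and largely neglect important aspects of linguistic quality such as clarity, coherence, and terminology.
To address this gap, we propose three steps. First, we develop a regression model that evaluates the quality of legal text based on clarity, coherence, and terminology.
Second, we create a specialized set of legal questions.
Third, we use this evaluation framework to analyze 49 LLMs.
The analysis identifies three key findings. First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7\%$ at 72 billion parameters.
Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016.
Third, reasoning models consistently outperform base architectures.
A significant outcome of our research is the release of a ranking list and a Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs.
This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches.
Code and models are available at https://github.com/lyxx3rd/LegalEval-Q.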

Abstract (original)

As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7\%$ noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.
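
The abstract mentions a Pareto analysis of cost-performance tradeoffs but does not describe how it is computed. As a minimal sketch of the general technique only (not the authors' implementation), the following selects the models that are not dominated by any other model, assuming each model is described by a single cost value (lower is better) and a single quality score (higher is better). The `ModelPoint` type, the `pareto_front` function, the model names, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class ModelPoint(NamedTuple):
    """Hypothetical record for one evaluated model (not from the paper)."""
    name: str
    cost: float     # assumed cost measure, lower is better
    quality: float  # assumed quality score, higher is better


def pareto_front(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the points that no other point dominates.

    A point is dominated if another point has cost <= its cost and
    quality >= its quality, with at least one strict inequality.
    Exact duplicates are reported once.
    """
    front: List[ModelPoint] = []
    best_quality = float("-inf")
    # Scan from cheapest to most expensive; at equal cost, best quality first.
    for p in sorted(points, key=lambda p: (p.cost, -p.quality)):
        # Keep a point only if it beats every cheaper-or-equal-cost point.
        if p.quality > best_quality:
            front.append(p)
            best_quality = p.quality
    return front


# Made-up illustrative values, not results from the paper.
models = [
    ModelPoint("model-a", cost=1.0, quality=0.60),
    ModelPoint("model-b", cost=2.0, quality=0.75),
    ModelPoint("model-c", cost=3.0, quality=0.70),
    ModelPoint("model-d", cost=4.0, quality=0.80),
]
front_names = [p.name for p in pareto_front(models)]
print(front_names)
```

In this toy data, `model-c` is excluded because `model-b` is both cheaper and higher in quality; the remaining models each offer a tradeoff that no other model strictly improves on.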

arXiv information

Authors	Li yunhan, Wu gengshen
Published	2025-05-30 17:30:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
