LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Abstract

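Despite recent advances in test-time scaling, long-form legal reasoning remains a key challenge for large language models (LLMs).
We introduce LEXam, a new benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.
The dataset contains 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions.
In addition to reference answers, the open questions come with explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application.
Our evaluation on both open-ended and multiple-choice questions shows that the benchmark poses significant challenges for current LLMs.
In particular, they struggle with open questions that require structured, multi-step legal reasoning.
Moreover, our results underscore the dataset's effectiveness in differentiating between models with varying capabilities.
Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately.
The evaluation setup provides a scalable method for assessing legal reasoning quality beyond simple accuracy metrics.
Project page: https://lexam-benchmark.github.io/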

Abstract (original)

Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation on both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/

arXiv information

Authors	Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
Published	2025-05-29 15:37:57+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
