Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

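Summary

This study examines the relationship between the accuracy of deep learning (DL) models and their agreement with experts when classifying crash narratives.
Five DL models, including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier, are evaluated against expert-labeled data and narrative text.
The analysis is then extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude.
The results reveal a counterintuitive trend: models with higher technical accuracy often agree less with domain experts, whereas LLMs show greater expert alignment despite relatively lower accuracy scores.
Model-expert agreement is quantified and interpreted using Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques.
The findings indicate that expert-aligned models tend to rely on contextual and temporal language cues rather than location-specific keywords.
These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications.
The authors advocate incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.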

Abstract (original)

This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models — including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier — against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen’s Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
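The distinction between accuracy and expert agreement can be illustrated with a small sketch. It is not the authors' code or data: the labels "A" and "B", the three label lists, and the names `model_x` and `model_y` are hypothetical, chosen only to show how a model can match the recorded labels perfectly while agreeing poorly with an expert. Cohen's Kappa is computed from its standard definition, (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance.

```python
from collections import Counter


def accuracy(pred, gold):
    """Fraction of predictions that match the reference labels."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    n = len(rater_a)
    # Observed agreement: share of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data (assumption): ten narratives with recorded dataset labels,
# an expert's labels, and predictions from two imaginary models.
recorded = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
expert = ["A", "A", "A", "B", "B", "B", "B", "B", "A", "A"]
model_x = list(recorded)  # reproduces the recorded labels exactly
model_y = ["A", "A", "A", "B", "B", "B", "B", "B", "B", "A"]

for name, pred in [("model_x", model_x), ("model_y", model_y)]:
    print(name,
          "accuracy:", round(accuracy(pred, recorded), 2),
          "kappa vs expert:", round(cohen_kappa(pred, expert), 2))
```

On this toy data, `model_x` reaches an accuracy of 1.0 but a Kappa of about 0.2 against the expert, while `model_y` has an accuracy of 0.7 and a Kappa of about 0.8. The pattern mirrors the trend the abstract describes, although the numbers themselves are invented for illustration.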

arXiv information

Authors	Sudesh Ramesh Bhagat, Ibne Farabi Shihab, Anuj Sharma
Published	2025-04-17 16:29:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
