HRET: A Self-Evolving LLM Evaluation Toolkit for Korean

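Abstract

Recent advances in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, but the lack of a standardized evaluation framework has led to inconsistent results and limited comparability.
To address this, we introduce HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework tailored specifically to Korean LLMs.
HRET unifies diverse evaluation methods, including logit-based scoring, exact match, language-inconsistency penalization, and LLM-as-a-Judge assessment.
Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints).
With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.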

Abstract (original)

Recent advancements in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, yet the lack of a standardized evaluation framework has led to inconsistent results and limited comparability. To address this, we introduce HRET Haerae Evaluation Toolkit, an open-source, self-evolving evaluation framework tailored specifically for Korean LLMs. HRET unifies diverse evaluation methods, including logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments. Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints). With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.
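The abstract names a registry-based architecture and two of the scoring methods (exact match and language-inconsistency penalization) without showing what they might look like. The following is a minimal sketch of that general idea, not HRET's actual API: every name here (`EVALUATORS`, `register_evaluator`, `hangul_ratio`, `exact_match_korean_only`, `min_ratio`) is hypothetical, and the penalty rule (a correct answer scores zero when fewer than half of its letters are Hangul) is an assumption chosen only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a method name to a scoring function.
# This illustrates the registry pattern only; it is not HRET's real interface.
Scorer = Callable[[List[str], List[str]], float]
EVALUATORS: Dict[str, Scorer] = {}


def register_evaluator(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that adds a scoring function to the registry under `name`."""
    def decorator(fn: Scorer) -> Scorer:
        if name in EVALUATORS:
            raise ValueError(f"evaluator {name!r} is already registered")
        EVALUATORS[name] = fn
        return fn
    return decorator


def hangul_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hangul syllables (U+AC00 to U+D7A3)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 1.0  # no letters (e.g. a purely numeric answer): nothing to penalize
    hangul = sum(1 for ch in letters if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(letters)


def _check_lengths(predictions: List[str], references: List[str]) -> None:
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")


@register_evaluator("exact_match")
def exact_match(predictions: List[str], references: List[str]) -> float:
    """Share of predictions equal to their reference after stripping whitespace."""
    _check_lengths(predictions, references)
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


@register_evaluator("exact_match_korean_only")
def exact_match_korean_only(
    predictions: List[str], references: List[str], min_ratio: float = 0.5
) -> float:
    """Exact match, but a hit counts only if the prediction is mostly Hangul.

    Assumed penalty rule for illustration: a matching answer scores zero
    when its Hangul ratio falls below `min_ratio`.
    """
    _check_lengths(predictions, references)
    if not references:
        return 0.0
    hits = sum(
        p.strip() == r.strip() and hangul_ratio(p) >= min_ratio
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def evaluate(method: str, predictions: List[str], references: List[str]) -> float:
    """Look up a scoring method by name and apply it."""
    if method not in EVALUATORS:
        raise KeyError(f"unknown evaluator {method!r}; known: {sorted(EVALUATORS)}")
    return EVALUATORS[method](predictions, references)
```

Under these assumptions, a caller selects a method by name, for example `evaluate("exact_match_korean_only", predictions, references)`, and a new scoring method can be added by decorating one more function without touching `evaluate`. That is the main appeal of a registry design for a toolkit that is meant to keep growing.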

arXiv information

Authors	Hanwool Lee, Soo Yong Kim, Dasol Choi, SangWon Baek, Seunghyeok Hong, Ilgyun Jeong, Inseon Hwang, Naeun Lee, Guijin Son
Published	2025-04-01 12:37:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
