ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

Summary

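Research on large language models (LLMs) has recently shown growing interest in extending model context size to better capture dependencies within long documents.
Benchmarks have been proposed to evaluate long-range abilities, but existing efforts mainly considered generic tasks that do not necessarily match real-world applications.
In contrast, we propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, and the inherent noisiness and oral nature of such data pose unique challenges for LLMs.
Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers, as well as noisy versions of the meeting transcripts altered to target different word error rate levels.
Experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models, and point out their discrepancies in robustness to transcript noise.
We also provide a thorough analysis of our GPT-4-based evaluation, including insights from a crowdsourcing study.
Our findings indicate that while GPT-4's scores align with human judges, its ability to distinguish beyond three score levels may be limited.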

Abstract (original)

Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending the models’ context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, we propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario in which the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers, as well as noisy versions of meeting transcripts altered to target different Word Error Rate levels. Our experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models, and point out their discrepancies in terms of robustness to transcript noise. We also provide a thorough analysis of our GPT-4-based evaluation, including insights from a crowdsourcing study. Our findings indicate that while GPT-4’s scores align with human judges, its ability to distinguish beyond three score levels may be limited.
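
The abstract refers to transcripts altered to target different Word Error Rate (WER) levels. As a reminder of what that metric measures, here is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the tokenization (lowercasing and whitespace splitting) are assumptions made for illustration; this is not the authors' noise-injection procedure, which the abstract does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Tokenization here (lowercase + whitespace split) is an assumption;
    real ASR evaluations often apply additional text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the meeting starts at noon" with the hypothesis "the meting starts noon" involves one substitution and one deletion over five reference words, so the function returns a WER of 0.4.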

arXiv information

Authors	Thibaut Thonet, Jos Rozen, Laurent Besacier
Published	2025-01-17 09:32:54+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
