The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input

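Summary

The paper introduces FACTS Grounding, an online leaderboard and associated benchmark that evaluates how well language models generate text that is factually accurate with respect to the context given in the user prompt.
Each prompt in the benchmark contains a user request and a full document of up to 32k tokens, and requires a long-form response.
That response must be fully grounded in the provided context document while still fulfilling the user request.
Models are evaluated by automated judge models in two phases: (1) a response is disqualified if it does not fulfill the user request;
(2) a response is judged accurate if it is fully grounded in the provided document.
The automated judge models were comprehensively evaluated against a held-out test set to select the best prompt template, and the final factuality score is an aggregate over multiple judge models, which mitigates evaluation bias.
The FACTS Grounding leaderboard will be actively maintained over time and contains both public and private splits, allowing external participation while guarding the leaderboard's integrity.
It is available at https://www.kaggle.com/facts-leaderboard.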
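
The two-phase evaluation and the multi-judge aggregate can be illustrated with a minimal sketch. It is not the benchmark's actual implementation: the abstract only says the final score is "an aggregate", so the unweighted mean used here is an assumption, as is counting a disqualified response as inaccurate. The names `JudgeVerdict`, `judge_score`, and `factuality_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgeVerdict:
    """One judge model's verdict on one response (hypothetical structure)."""
    eligible: bool  # phase 1: does the response fulfill the user request?
    grounded: bool  # phase 2: is the response fully grounded in the document?


def judge_score(verdicts):
    """Fraction of responses a single judge marks as both eligible and grounded.

    Assumption: a disqualified (ineligible) response counts as inaccurate.
    """
    return mean(1.0 if v.eligible and v.grounded else 0.0 for v in verdicts)


def factuality_score(verdicts_by_judge):
    """Combine per-judge scores into one number.

    Assumption: the aggregate is an unweighted mean over judges.
    """
    return mean(judge_score(v) for v in verdicts_by_judge.values())


# Two hypothetical judges scoring the same four responses.
verdicts_by_judge = {
    "judge_a": [
        JudgeVerdict(True, True),
        JudgeVerdict(True, False),
        JudgeVerdict(False, True),   # disqualified in phase 1
        JudgeVerdict(True, True),
    ],
    "judge_b": [
        JudgeVerdict(True, True),
        JudgeVerdict(True, True),
        JudgeVerdict(False, True),   # disqualified in phase 1
        JudgeVerdict(True, True),
    ],
}
print(factuality_score(verdicts_by_judge))
```

Using several judges means a single judge's systematic preference, for example toward responses from its own model family, shifts the final number less than it would in a single-judge setup.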

Abstract (original)

We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models’ ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.

arXiv information

Authors	Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das
Published	2025-01-06 18:28:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
