GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Summary

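GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
We ensure that the questions are high-quality and extremely difficult. Experts who hold or are pursuing a PhD in the corresponding domain reach 65% accuracy (74% when discounting clear mistakes that the experts identified in retrospect).
Highly skilled non-expert validators reach only 34% accuracy, despite spending over 30 minutes on average with unrestricted access to the web (that is, the questions are "Google-proof").
The questions are also difficult for state-of-the-art AI systems: our strongest GPT-4-based baseline achieves 39% accuracy.
If we are to use future AI systems to help answer very hard questions, for example when developing new scientific knowledge, we need scalable oversight methods that let humans supervise their outputs. This may be difficult even when the supervisors are themselves skilled and knowledgeable.
Because GPQA is difficult for both skilled non-experts and frontier AI systems, it should enable realistic scalable oversight experiments. We hope these can help devise ways for human experts to reliably obtain truthful information from AI systems that surpass human capabilities.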

Abstract (original)

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are ‘Google-proof’). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

arXiv information

Authors	David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman
Published	2023-11-20 18:57:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
