GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

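Summary

With the rapid advancement of large language models (LLMs), a comprehensive evaluation suite for assessing their capabilities and limitations is urgently needed.
Existing LLM leaderboards often cite scores reported in other papers without consistent settings or prompts, which may inadvertently encourage cherry-picking favorable settings and prompts to obtain better results.
In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals.
We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
Our retrospective study of OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4.
The community is currently eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, and how large the alignment tax is.
Our analysis sheds light on many of these questions, with the aim of improving the transparency of advanced LLMs.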

Abstract (original)

With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI’s legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI’s earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical details like whether adding code data improves LLM’s reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, how much is the alignment tax, etc. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.

arXiv information

Authors	Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang
Published	2023-12-19 07:41:50+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
