MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

Summary

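Large language models (LLMs) equipped with techniques such as chain-of-thought prompting have demonstrated impressive capabilities, but they still fall short at reasoning robustly in complex settings.
Evaluating LLM reasoning is difficult, however, because system capabilities keep growing while benchmark datasets for tasks such as logical deduction have remained static.
We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives.
This dataset has two key features.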
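First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, which enables the construction of complex reasoning instances that challenge GPT-4 (for example, murder mysteries roughly 1000 words long) and which can be scaled further as more capable LLMs are released.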
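Second, the dataset instances are free-text narratives corresponding to real-world domains of reasoning.
This makes the dataset much more challenging than other synthetically crafted benchmarks, while remaining realistic and tractable enough for human annotators to solve with high accuracy.
We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques such as chain-of-thought to perform robust reasoning.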

Abstract (original)

While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.

arXiv information

Authors	Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett
Published	2023-10-24 17:59:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
