Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models

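Summary

The Critical Questions Generation (CQs-Gen) task aims to foster critical thinking by enabling systems to generate questions that expose assumptions and challenge the reasoning in arguments.
Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards.
This work presents a comprehensive approach to support the development and benchmarking of systems for this task.
We construct the first large-scale manually annotated dataset.
We also investigate automatic evaluation methods and identify a reference-based technique using large language models (LLMs) as the strategy that best correlates with human judgments.
A zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task.
Data, code, and a public leaderboard are provided to encourage further research, not only on model performance but also on the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.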

Summary (original)

The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose assumptions and challenge the reasoning in arguments. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This work presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale manually-annotated dataset. We also investigate automatic evaluation methods and identify a reference-based technique using large language models (LLMs) as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data, code, and a public leaderboard are provided to encourage further research not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.

arXiv information

Authors	Blanca Calvo Figueras, Rodrigo Agerri
Published	2025-05-16 15:08:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
