THiNK: Can Large Language Models Think-aloud?

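Summary

Assessing the higher-order thinking skills of large language models (LLMs) is a fundamental challenge, especially in tasks that go beyond surface-level accuracy.
In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy.
THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement.
This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills.
We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs.
The results show that models perform reliably well on the lower-order categories, but struggle to apply knowledge in realistic contexts and exhibit limited abstraction.
Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking.
Qualitative evaluations further confirm that THiNK-guided outputs align better with domain logic and problem structure.
The framework's code, available in the authors' GitHub repository, provides a scalable methodology for probing and enhancing LLM reasoning and offers new directions for evaluation grounded in learning science.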

Summary (original)

Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom’s Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think-aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models reliably perform lower-order categories well, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. The code of our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science, which is available at our GitHub repository.
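
The generate, critique, and revise cycle described in the abstract can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the `Critique` type, the `refine` function, the per-level scores in [0, 1], the `threshold` and `max_rounds` parameters, and the toy agents are all assumptions introduced here. In a real setup each callable would wrap an LLM call. The level names are the six categories of the revised Bloom's Taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The six levels of the revised Bloom's Taxonomy, from lower to higher order.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


@dataclass
class Critique:
    """Hypothetical critique result: one score per Bloom level plus feedback."""
    scores: Dict[str, float]  # assumed to lie in [0, 1]
    feedback: str


def refine(
    generate: Callable[[], str],
    critique: Callable[[str], Critique],
    revise: Callable[[str, str], str],
    threshold: float = 0.8,  # assumed pass mark, not taken from the paper
    max_rounds: int = 3,     # assumed round limit, not taken from the paper
) -> Tuple[str, List[Critique]]:
    """Generate a problem, then critique and revise it until every level passes."""
    problem = generate()
    history: List[Critique] = []
    for round_index in range(max_rounds):
        result = critique(problem)
        history.append(result)
        passed = min(result.scores.values()) >= threshold
        # Stop on success or on the last round, so the returned problem
        # is always the one that was most recently critiqued.
        if passed or round_index == max_rounds - 1:
            break
        problem = revise(problem, result.feedback)
    return problem, history


# Toy stand-ins for the agents: each revision appends "+" and raises the score.
def toy_generate() -> str:
    return "draft"


def toy_critique(problem: str) -> Critique:
    quality = min(1.0, 0.5 + 0.2 * problem.count("+"))
    return Critique({level: quality for level in BLOOM_LEVELS}, "add realistic context")


def toy_revise(problem: str, feedback: str) -> str:
    return problem + "+"


if __name__ == "__main__":
    final_problem, rounds = refine(toy_generate, toy_critique, toy_revise, max_rounds=5)
    print(final_problem, len(rounds))
```

Using the minimum score across levels as the stopping rule is one possible design choice: it keeps the loop running while any single level, for example "apply", still falls short.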

arXiv information

Authors	Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski
Published	2025-05-26 16:27:02+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
