DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process

Summary

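Large language models (LLMs) are increasingly used in scientific research assessment, particularly for automated paper review.
However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation.
To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation.
Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B while using fewer tokens.
In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations.
Our work sets a new benchmark for LLM-based paper review, and all resources are publicly available.
The code, model, dataset, and demo are released at http://ai-researcher.net.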

Summary (original)

Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset and demo have been released at http://ai-researcher.net.

arXiv information

Authors	Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang
Published	2025-03-11 15:59:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
