Attribution in Scientific Literature: New Benchmark and Methods

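Summary

Large language models (LLMs) are a promising but challenging frontier for automated source citation in scientific communication.
Earlier approaches to citation generation have been limited by citation ambiguity and by LLM overgeneralization.
The paper introduces REASONS, a new dataset with sentence-level annotations spanning 12 scientific domains on arXiv.
The evaluation framework covers two key citation scenarios: indirect queries (matching a sentence to a paper title) and direct queries (author attribution), both enriched with contextual metadata.
Extensive experiments are run on models such as GPT-O1, GPT-4O, GPT-3.5, and DeepSeek, as well as smaller models such as Perplexity AI (7B).
Top-tier LLMs achieve high performance on sentence attribution but struggle with high hallucination rates, a key metric for scientific reliability.
The metadata-augmented approach lowers hallucination rates across all tasks, pointing to a promising direction for improvement.
Retrieval-augmented generation (RAG) with Mistral improves performance on indirect queries, cutting hallucination rates by 42% while keeping precision competitive with larger models.
Adversarial testing, however, exposes difficulty in linking paper titles to abstracts, revealing fundamental limitations of current LLMs.
REASONS provides a challenging benchmark for developing reliable and trustworthy LLMs for scientific applications.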

Abstract (original)

Large language models (LLMs) present a promising yet challenging frontier for automated source citation in scientific communication. Previous approaches to citation generation have been limited by citation ambiguity and LLM overgeneralization. We introduce REASONS, a novel dataset with sentence-level annotations across 12 scientific domains from arXiv. Our evaluation framework covers two key citation scenarios: indirect queries (matching sentences to paper titles) and direct queries (author attribution), both enhanced with contextual metadata. We conduct extensive experiments with models such as GPT-O1, GPT-4O, GPT-3.5, DeepSeek, and other smaller models like Perplexity AI (7B). While top-tier LLMs achieve high performance in sentence attribution, they struggle with high hallucination rates, a key metric for scientific reliability. Our metadata-augmented approach reduces hallucination rates across all tasks, offering a promising direction for improvement. Retrieval-augmented generation (RAG) with Mistral improves performance in indirect queries, reducing hallucination rates by 42% and maintaining competitive precision with larger models. However, adversarial testing highlights challenges in linking paper titles to abstracts, revealing fundamental limitations in current LLMs. REASONS provides a challenging benchmark for developing reliable and trustworthy LLMs in scientific applications.
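
To make the two citation scenarios concrete, here is a minimal Python sketch of how an indirect query (sentence to paper title) and a direct query (title to authors) might be phrased, with optional contextual metadata appended. The abstract does not give the actual prompts, dataset schema, or scoring rule used for REASONS, so the `CitationExample` fields, the prompt wording, and the `is_match` helper are all hypothetical and for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class CitationExample:
    """Hypothetical record; the real REASONS schema may differ."""
    sentence: str                      # sentence that cites a paper
    title: str                         # title of the cited paper
    authors: list[str]                 # authors of the cited paper
    metadata: dict[str, str] = field(default_factory=dict)  # e.g. domain, year


def build_query(example: CitationExample, mode: str, use_metadata: bool = False) -> str:
    """Build an illustrative prompt for one of the two scenarios."""
    if mode == "indirect":
        # Indirect query: match a sentence to the title of the paper it cites.
        prompt = f'Which paper does this sentence cite?\nSentence: "{example.sentence}"'
    elif mode == "direct":
        # Direct query: attribute authors to a given paper title.
        prompt = f'Who are the authors of the paper titled "{example.title}"?'
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if use_metadata and example.metadata:
        # Metadata augmentation: append contextual fields to the prompt.
        context = "; ".join(f"{k}: {v}" for k, v in sorted(example.metadata.items()))
        prompt += f"\nContext: {context}"
    return prompt


def is_match(predicted: str, expected: str) -> bool:
    """Case- and whitespace-insensitive exact match (an assumed, simplistic scorer)."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(predicted) == normalize(expected)
```

Calling `build_query(example, "indirect", use_metadata=True)` returns the sentence-to-title prompt followed by a `Context:` line. A real evaluation would also need a definition of hallucination, which the abstract does not provide and this sketch does not attempt.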

arXiv information

Authors	Yash Saxena,Deepa Tilwani,Ali Mohammadi,Edward Raff,Amit Sheth,Srinivasan Parthasarathy,Manas Gaur
Published	2025-04-11 07:20:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
