Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers

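Summary

Tracking how research papers mention and use data yields insights that help improve data discoverability, quality, and production.
Manually identifying and classifying dataset mentions across the vast academic literature, however, is resource-intensive and does not scale.
The paper presents a machine learning framework that automates dataset mention detection across research domains using large language models (LLMs), synthetic data, and a two-stage fine-tuning process.
The authors combine zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.
The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset and then fine-tuned on a manually annotated subset.
At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
Evaluated on a held-out, manually annotated sample, the fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy.
The results highlight how LLM-generated synthetic data can address training data scarcity and improve generalization in low-resource settings.
The framework offers a path toward scalable monitoring of dataset usage, enhancing transparency and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.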
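
The extract, judge, and refine steps used to build the weakly supervised dataset can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions: `curate`, `Candidate`, and the `toy_*` functions are hypothetical names, and the toy functions are rule-based stand-ins for the LLM extractor, the LLM-as-a-Judge, and the reasoning agent, none of which are specified in the abstract. The single repair attempt per passage is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    passage: str
    mentions: List[str]


def curate(
    passages: Iterable[str],
    extract: Callable[[str], List[str]],
    judge: Callable[[Candidate], bool],
    refine: Callable[[Candidate], Optional[Candidate]],
) -> List[Candidate]:
    """Keep candidates the judge accepts; give rejected ones one repair attempt."""
    kept: List[Candidate] = []
    for passage in passages:
        cand: Optional[Candidate] = Candidate(passage, extract(passage))
        if not judge(cand):
            cand = refine(cand)  # assumption: a single refinement pass
            if cand is None or not judge(cand):
                continue  # drop passages that cannot be repaired
        kept.append(cand)
    return kept


# Toy stand-ins (hypothetical): a real system would call LLMs here.
KNOWN = {"DHS", "LSMS"}  # illustrative dataset acronyms only


def toy_extract(passage: str) -> List[str]:
    # Over-eager "zero-shot" extractor: every all-caps token is a mention.
    words = [w.strip(".,") for w in passage.split()]
    return [w for w in words if len(w) > 1 and w.isupper()]


def toy_judge(cand: Candidate) -> bool:
    # Accept only non-empty extractions made up entirely of known names.
    return bool(cand.mentions) and all(m in KNOWN for m in cand.mentions)


def toy_refine(cand: Candidate) -> Optional[Candidate]:
    # Drop the mentions the judge would reject; give up if nothing is left.
    mentions = [m for m in cand.mentions if m in KNOWN]
    return Candidate(cand.passage, mentions) if mentions else None


if __name__ == "__main__":
    demo = [
        "We use DHS data to estimate GDP effects.",
        "The LSMS panel covers 2010.",
        "No data source is named here.",
    ]
    curated = curate(demo, toy_extract, toy_judge, toy_refine)
    print([c.mentions for c in curated])  # [['DHS'], ['LSMS']]
```

Passing the three stages in as callables keeps the loop independent of any particular model or prompt, which is convenient when the judge or the refiner is swapped out.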

Abstract (original)

Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.
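
The inference-time design described in the abstract, a lightweight classifier placed in front of a more expensive extraction model, can be sketched as a gate over text passages. This is a minimal illustration and not the authors' implementation: `extract_with_gate`, `toy_gate_score`, and `toy_slow_extract` are hypothetical names, the keyword scorer stands in for the ModernBERT-based classifier, and the `threshold` value is an assumed parameter.

```python
from typing import Callable, Iterable, List, Tuple


def extract_with_gate(
    passages: Iterable[str],
    score: Callable[[str], float],
    extract: Callable[[str], List[str]],
    threshold: float = 0.5,
) -> List[Tuple[str, List[str]]]:
    """Run the expensive extractor only on passages the cheap scorer lets through."""
    results: List[Tuple[str, List[str]]] = []
    for passage in passages:
        if score(passage) < threshold:
            continue  # skipped passages never reach the extractor
        results.append((passage, extract(passage)))
    return results


# Toy stand-ins (hypothetical): a real system would call trained models here.
CUES = ("dataset", "survey", "census")
calls: List[str] = []  # records which passages reached the extractor


def toy_gate_score(passage: str) -> float:
    # Stand-in for a classifier probability: 1.0 if any cue word appears.
    text = passage.lower()
    return 1.0 if any(cue in text for cue in CUES) else 0.0


def toy_slow_extract(passage: str) -> List[str]:
    # Stand-in for the fine-tuned LLM: treat all-caps tokens as mentions.
    calls.append(passage)
    words = [w.strip(".,") for w in passage.split()]
    return [w for w in words if len(w) > 1 and w.isupper()]


if __name__ == "__main__":
    demo = [
        "Household data come from the DHS survey.",
        "We thank the reviewers for comments.",
    ]
    print(extract_with_gate(demo, toy_gate_score, toy_slow_extract))
    print(len(calls), "of", len(demo), "passages reached the extractor")
```

Lowering `threshold` lets more passages through, trading extra extractor calls for fewer missed mentions, which is the usual way to keep recall high with a gate of this kind.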

arXiv information

Authors	Aivin V. Solatorio, Rafael Macalaba, James Liounis
Published	2025-02-14 16:16:02+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
