From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs

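Summary

The rapid spread of fake news is a significant global challenge, particularly in low-resource languages such as Bangla, which lack adequate datasets and detection tools.
Manual fact-checking is accurate, but it is too expensive and too slow to stop fake news from spreading.
To address this gap, the authors introduce BanFakeNews-2.0, a robust dataset intended to strengthen Bangla fake news detection.
This version adds 11,700 meticulously curated fake news articles validated against credible sources, yielding a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories.
For rigorous evaluation, the authors also built a manually curated, independent test set of 460 fake and 540 authentic news items.
They put effort into collecting fake news from credible sources and verifying it manually while preserving its linguistic richness.
They develop a benchmark system built on transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers (BERT) variants (F1 of 87%) and Large Language Models with Quantized Low-Rank Approximation (F1 of 89%), which significantly outperforms traditional methods.
BanFakeNews-2.0 offers a valuable resource for advancing research and applications in fake news detection for low-resource languages.
To foster research in this direction, the authors publicly release the dataset and model on GitHub.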
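The dataset sizes quoted above imply quite different class balances for the main dataset and the independent test set, which matters when reading the reported F1 scores. The following is a minimal sketch of that arithmetic only. The counts are the rounded figures from the abstract, the function name `fake_share` is hypothetical, and the abstract does not say which F1 averaging the authors used.

```python
def fake_share(authentic: int, fake: int) -> float:
    """Return the fraction of items in a split that are labelled fake."""
    total = authentic + fake
    if total == 0:
        raise ValueError("split is empty")
    return fake / total


# Counts as stated in the abstract (rounded figures, not exact file counts).
SPLITS = {
    "main dataset": {"authentic": 47_000, "fake": 13_000},
    "independent test set": {"authentic": 540, "fake": 460},
}

for name, counts in SPLITS.items():
    print(f"{name}: {fake_share(**counts):.1%} fake")
```

Under these figures the main dataset is roughly one fifth fake, while the test set is close to balanced, so a classifier that favours the majority class would be penalised more visibly on the test set.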

Abstract (original)

The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is expensive and slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We invest efforts in collecting fake news from credible sources and manually verified while preserving the linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers variants (F1-87%) and Large Language Models with Quantized Low-Rank Approximation (F1-89%), that significantly outperforms traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resourced languages. We publicly release our dataset and model on Github to foster research in this direction.

arXiv information

Authors	Hrithik Majumdar Shibu, Shrestha Datta, Md. Sumon Miah, Nasrullah Sami, Mahruba Sharmin Chowdhury, Md. Saiful Islam
Published	2025-01-16 15:24:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
