TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

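Summary

Telecom fraud detection faces a major challenge: there is a lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented text analysis.
To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset designed specifically for automated telecom fraud analysis.
The dataset is built through three strategies: (1) privacy-preserving text-truth sample generation, which uses call recordings transcribed by automatic speech recognition (ASR), with the original audio anonymized, and ensures real-world consistency by regenerating the speech with a text-to-speech (TTS) model;
(2) semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs, to expand scenario coverage;
(3) multi-agent adversarial synthesis, which simulates emerging fraud tactics through predefined communication scenarios and fraud typologies.
The resulting dataset contains 28,511 rigorously processed speech-text pairs with detailed annotations for fraud reasoning.
The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification.
In addition, to support systematic testing of model performance on telecom fraud detection tasks, we build TeleAntiFraud-Bench, a standardized evaluation benchmark made up of instances sampled proportionally from the dataset.
We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, and we open-source the data processing framework to enable community-driven dataset expansion.
This work establishes a foundational framework for multimodal anti-fraud research while addressing key challenges in data privacy and scenario diversity.
The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.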

Summary (original)

The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
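
The abstract says the benchmark is made up of instances sampled proportionally from the dataset but does not describe the procedure. One common way to do this is stratified sampling, sketched below in Python. This is an illustrative assumption, not the authors' method: the function name `proportional_sample`, the record fields `id` and `fraud_type`, the label values, and the 10% fraction are all hypothetical.

```python
import random
from collections import defaultdict


def proportional_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each group defined by `key`.

    Hypothetical helper for illustration; not taken from the TeleAntiFraud code.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the records by their label so each class is sampled separately.
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for label in sorted(groups):
        items = groups[label]
        # Keep at least one instance per class so rare classes stay represented.
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    return sample


# Toy records with made-up labels; the real dataset schema may differ.
records = (
    [{"id": i, "fraud_type": "impersonation"} for i in range(60)]
    + [{"id": i, "fraud_type": "investment"} for i in range(60, 90)]
    + [{"id": i, "fraud_type": "none"} for i in range(90, 100)]
)

bench = proportional_sample(records, key="fraud_type", fraction=0.1)
```

With this toy input, each class contributes the same share to `bench` as it has in `records`, which is the property a proportionally sampled benchmark is meant to preserve.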

arXiv information

Authors	Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
Published	2025-04-02 13:32:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
