TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

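Summary

Moral stories are a long-established means of passing on values, but modern NLP has no large, structured corpus that pairs coherent narratives with explicit ethical lessons.
TF1-EN-3M closes this gap: it is the first open dataset of three million English-language fables, generated exclusively by instruction-tuned models of at most 8B parameters.
Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral) and is generated by a combinatorial prompt engine that ensures genre fidelity while covering a broad thematic space.
A hybrid evaluation pipeline combines (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics.
Of ten open-weight candidates, an 8B-parameter Llama-3 variant offers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at about 13.5 cents per 1,000 fables. The dataset, generation code, evaluation scripts, and full metadata are released under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, and shows that large-scale moral storytelling no longer requires proprietary giant models.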
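
As a rough illustration of the cost figure, the sketch below scales the quoted rate of about 13.5 cents per 1,000 fables to a given number of fables. It assumes the rate applies uniformly to every fable, which the abstract does not state; the function name `estimate_cost_usd` and the constant are hypothetical and are not taken from the released code.

```python
from decimal import Decimal

# Rate quoted in the abstract: about 13.5 cents per 1,000 fables.
CENTS_PER_1000_FABLES = Decimal("13.5")


def estimate_cost_usd(num_fables: int) -> Decimal:
    """Estimate generation cost in US dollars.

    Assumption: the quoted rate holds uniformly for every fable.
    """
    total_cents = Decimal(num_fables) / Decimal(1000) * CENTS_PER_1000_FABLES
    return total_cents / Decimal(100)


if __name__ == "__main__":
    # Cost of three million fables under the uniform-rate assumption.
    print(estimate_cost_usd(3_000_000))
```

Under that assumption, three million fables would cost roughly 405 US dollars. `Decimal` is used instead of floats so the arithmetic on cents stays exact.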

Summary (original)

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
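
The six-slot scaffold lends itself to a small sketch of what a combinatorial prompt engine might look like. This is a minimal illustration, not the authors' implementation: the slot vocabularies, the template wording, and the function names `all_prompts` and `sample_prompts` are all hypothetical, since the abstract gives only the slot names and their order.

```python
import itertools
import random

# Slot order as listed in the abstract.
SLOT_ORDER = ["character", "trait", "setting", "conflict", "resolution", "moral"]

# Hypothetical slot vocabularies; the real lists are not given in the abstract.
SLOTS = {
    "character": ["a clever fox", "a patient tortoise"],
    "trait": ["greedy", "humble"],
    "setting": ["a quiet forest", "a busy marketplace"],
    "conflict": ["a dispute over food", "a race against a rival"],
    "resolution": ["an act of sharing", "an honest apology"],
    "moral": ["kindness is repaid", "pride comes before a fall"],
}

# Hypothetical prompt wording.
TEMPLATE = (
    "Write a short fable. Character: {character}. Trait: {trait}. "
    "Setting: {setting}. Conflict: {conflict}. Resolution: {resolution}. "
    "Moral: {moral}."
)


def all_prompts(slots=SLOTS):
    """Yield one prompt per combination of slot values (Cartesian product)."""
    for combo in itertools.product(*(slots[name] for name in SLOT_ORDER)):
        yield TEMPLATE.format(**dict(zip(SLOT_ORDER, combo)))


def sample_prompts(n, seed=0, slots=SLOTS):
    """Draw n prompts at random, with replacement, reproducibly for a seed."""
    rng = random.Random(seed)
    return [
        TEMPLATE.format(**{name: rng.choice(slots[name]) for name in SLOT_ORDER})
        for _ in range(n)
    ]


if __name__ == "__main__":
    for prompt in sample_prompts(3, seed=42):
        print(prompt)
```

With two values per slot the full product already holds 2^6 = 64 prompts, so the number of combinations grows quickly as the vocabularies get larger. Sampling with a fixed seed is one way to keep a large run reproducible without enumerating every combination.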

arXiv information

Authors	Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu
Published	2025-04-29 10:15:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
