FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration


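A commonly used method for data deduplication is the MinHash LSH algorithm.
Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency.
This paper proposes FED, a GPU-accelerated deduplication framework that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions.
FED outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs.
The related code is publicly available on GitHub (\href{https://github.com/mcrl/FED}{https://github.com/mcrl/FED}).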


Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of large language models. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. FED significantly outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs. Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260 times compared to the CPU baseline. Despite these gains in efficiency, FED maintains high deduplication quality, with the duplicate document sets reaching a Jaccard similarity of over 0.96 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (\href{https://github.com/mcrl/FED}{https://github.com/mcrl/FED}).
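
For readers unfamiliar with the baseline, the following is a minimal CPU-only Python sketch of textbook MinHash LSH, the algorithm the abstract refers to. It is not FED's GPU implementation and does not reproduce its partially reusable hash functions. All function names, the character 5-gram shingling, the CRC32 base hash, and the linear hash family modulo a Mersenne prime are illustrative assumptions.

```python
import random
import zlib
from collections import defaultdict

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus (assumption)


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """MinHash signature: for each hash function, the minimum over all shingles."""
    # CRC32 is a simple non-cryptographic stand-in for the base hash.
    base = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands, rows):
    """Split each signature into bands; documents sharing any band are candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "The quick brown fox jumps over the lazy dog near the river bank.",
        "b": "The quick brown fox jumps over the lazy dog near the river bank!",
        "c": "Completely unrelated text about GPU clusters and hash functions.",
    }
    params = make_hash_params(num_hashes=64, seed=42)
    sigs = {d: minhash_signature(shingles(t), params) for d, t in docs.items()}
    print("exact Jaccard(a, b):", jaccard(shingles(docs["a"]), shingles(docs["b"])))
    print("estimated Jaccard(a, b):", estimate_similarity(sigs["a"], sigs["b"]))
    print("candidate pairs:", lsh_candidate_pairs(sigs, bands=16, rows=4))
```

In this sketch the signature loop evaluates every hash function on every shingle, which corresponds to the signature generation phase the abstract describes as time-consuming on CPUs. The `bands` and `rows` values trade recall against the number of candidate pairs and are chosen here only for illustration.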


Authors: Youngjun Son, Chaewon Kim, Jaejin Lee
Published: 2025-03-12 13:36:32+00:00
arXiv page: arxiv_id (pdf)

