cedar: Optimized and Unified Machine Learning Input Data Pipelines

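Summary

The input data pipeline is an essential component of every machine learning (ML) training job.
It reads massive amounts of training data, processes batches of samples using complex transformations, and loads them onto training nodes at low latency and high throughput.
Skyrocketing data volumes and training throughput demands are making performant input data systems increasingly critical.
Unfortunately, current input data systems cannot fully leverage key performance optimizations, which results in hugely inefficient infrastructure that requires significant resources or, worse, underutilizes expensive accelerators.
To address these demands, we present cedar, an optimized and unified programming framework for ML input data pipelines.
cedar lets users define input data pipelines using composable operators that support arbitrary ML frameworks and libraries.
cedar introduces an extensible optimizer that systematically applies a complex combination of optimizations (e.g., offloading, caching, prefetching, fusion, and reordering).
Without user input, it orchestrates processing across a customizable set of local and distributed compute resources to improve processing performance and efficiency.
Across eight pipelines, cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.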
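
To make "composable operators" and "fusion" concrete, here is a minimal Python sketch. It is not cedar's API: the names `Pipeline`, `map`, `fuse`, and `batch` are hypothetical and chosen only for illustration. It assumes per-sample map operators that can be chained, and shows fusion as collapsing adjacent operators into a single function without changing the output.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional

# Hypothetical names throughout; this is an illustration, not cedar's API.
Op = Callable[[Any], Any]


class Pipeline:
    """A chain of per-sample map operators applied lazily to a source."""

    def __init__(self, source: Iterable[Any], ops: Optional[List[Op]] = None):
        self.source = source
        self.ops: List[Op] = list(ops or [])

    def map(self, fn: Op) -> "Pipeline":
        # Composition returns a new pipeline, so operators can be chained.
        return Pipeline(self.source, self.ops + [fn])

    def fuse(self) -> "Pipeline":
        # Fusion: collapse all map operators into one function, so each
        # sample passes through a single operator instead of several.
        ops = self.ops

        def fused(sample: Any) -> Any:
            for fn in ops:
                sample = fn(sample)
            return sample

        return Pipeline(self.source, [fused] if ops else [])

    def __iter__(self) -> Iterator[Any]:
        for sample in self.source:
            for fn in self.ops:
                sample = fn(sample)
            yield sample


def batch(samples: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Group samples into lists of at most `size` items."""
    buf: List[Any] = []
    for sample in samples:
        buf.append(sample)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf  # final, possibly smaller, batch


if __name__ == "__main__":
    pipe = Pipeline(range(5)).map(lambda x: x + 1).map(lambda x: x * 2)
    fused = pipe.fuse()
    print(len(pipe.ops), len(fused.ops))
    print(list(batch(fused, 2)))
```

In this sketch the fused pipeline yields the same samples as the unfused one. A real optimizer would additionally have to decide when such a rewrite actually pays off, which this example does not model.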

Summary (original)

The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources – or worse – underutilize expensive accelerators. To address these demands, we present cedar, an optimized and unified programming framework for ML input data pipelines. cedar allows users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. cedar introduces an extensible optimizer that systematically applies a complex combination of optimizations (e.g., offloading, caching, prefetching, fusion, and reordering). It orchestrates processing across a customizable set of local and distributed compute resources in order to improve processing performance and efficiency, all without user input. Across eight pipelines, cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.

arXiv information

Authors	Mark Zhao, Emanuel Adamiak, Christos Kozyrakis
Published	2024-11-27 18:05:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
