Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

Summary

Title: Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

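Summary:
– As communication protocols evolve, datacenter network utilization increases and congestion becomes more frequent, causing higher latency and packet loss.
– Workloads are also growing more complex, which makes manual design of congestion control (CC) algorithms extremely difficult and calls for AI approaches to replace the human effort.
– AI models cannot currently be deployed on network devices because of their limited computational capabilities. To address this, the authors build a computationally light solution based on a recent reinforcement learning CC algorithm (RL-CC) [arXiv:2207.02295].
– Distilling RL-CC's complex neural network into decision trees reduces inference time by a factor of 500. This enables real-time inference within the $\mu$-sec decision-time requirement, with a negligible effect on quality.
– The transformed policy is deployed on NVIDIA NICs in a live cluster and compared with popular CC algorithms used in production. RL-CC is the only method that performs well on all benchmarks tested across a large range of flow counts.
– RL-CC balances multiple metrics simultaneously: bandwidth, latency, and packet drops. The results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.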

Summary (original)

As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities. Here, we offer a solution to this problem by building a computationally-light solution based on a recent reinforcement learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC by x500 by distilling its complex neural network into decision trees. This transformation enables real-time inference within the $\mu$-sec decision-time requirement, with a negligible effect on quality. We deploy the transformed policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used in production, RL-CC is the only method that performs well on all benchmarks tested over a large range of number of flows. It balances multiple metrics simultaneously: bandwidth, latency, and packet drops. These results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.
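
The distillation step described above, replacing a neural policy with decision trees, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: `teacher_policy` is a made-up stand-in for a trained network (it is not the RL-CC model), the two state features `rtt_ratio` and `rate_fraction` are hypothetical, and the tree is fit with a generic greedy variance-reduction split rather than the procedure used in the paper. The idea shown is only the general one: label sampled states with the teacher's actions, then fit a shallow regression tree to those labels.

```python
import math
import random

def teacher_policy(state):
    # Hypothetical stand-in for a trained neural policy; NOT the RL-CC network.
    # state = (rtt_ratio, rate_fraction): measured/target RTT and current/line rate.
    rtt_ratio, rate_fraction = state
    hidden = math.tanh(2.0 * (1.0 - rtt_ratio) + 0.5 * (0.5 - rate_fraction))
    return math.tanh(1.5 * hidden)  # action in (-1, 1): <0 slow down, >0 speed up

def sse(values):
    # Sum of squared errors around the mean: the impurity of a candidate leaf.
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(samples, depth, min_leaf=5):
    # samples: list of (state, action) pairs labelled by the teacher.
    actions = [a for _, a in samples]
    leaf = ("leaf", sum(actions) / len(actions))
    if depth == 0 or len(samples) < 2 * min_leaf:
        return leaf
    best = None  # (cost, feature, threshold, left_samples, right_samples)
    for feature in range(len(samples[0][0])):
        ordered = sorted(samples, key=lambda s: s[0][feature])
        for i in range(min_leaf, len(ordered) - min_leaf + 1):
            lo, hi = ordered[i - 1][0][feature], ordered[i][0][feature]
            if lo == hi:
                continue  # cannot separate equal feature values
            left, right = ordered[:i], ordered[i:]
            cost = sse([a for _, a in left]) + sse([a for _, a in right])
            if best is None or cost < best[0]:
                best = (cost, feature, (lo + hi) / 2.0, left, right)
    if best is None:
        return leaf
    _, feature, threshold, left, right = best
    return ("split", feature, threshold,
            fit_tree(left, depth - 1, min_leaf),
            fit_tree(right, depth - 1, min_leaf))

def predict(tree, state):
    # Inference is one comparison per level, so at most `depth` comparisons.
    while tree[0] == "split":
        _, feature, threshold, left, right = tree
        tree = left if state[feature] <= threshold else right
    return tree[1]

def sample_state(rng):
    return (rng.uniform(0.5, 2.0), rng.uniform(0.0, 1.0))

rng = random.Random(0)
train = [(s, teacher_policy(s)) for s in (sample_state(rng) for _ in range(300))]
tree = fit_tree(train, depth=6)

# Compare student and teacher on states the tree was not fitted on.
held_out = [sample_state(rng) for _ in range(200)]
mae = sum(abs(predict(tree, s) - teacher_policy(s)) for s in held_out) / len(held_out)
print(f"mean absolute gap between tree and teacher: {mae:.4f}")
```

In this sketch the student answers a query with at most six threshold comparisons and no floating-point multiplications, which is the kind of property that makes a tree attractive on a device with a tight per-decision time budget. The depth and sample count are arbitrary choices for the example, not values from the paper.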

arXiv information

Authors	Benjamin Fuhrer, Yuval Shpigelman, Chen Tessler, Shie Mannor, Gal Chechik, Eitan Zahavi, Gal Dalal
Published	2023-04-30 13:12:49+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
