Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA – Madhukar Korupolu & Sanjay Chatterjee, NVIDIA
With the growing scale of DL and ML applications, distributed execution of jobs across multiple nodes is becoming increasingly critical for solving bigger problems faster, as illustrated by the recent MLPerf results. However, running such workloads in a production K8s cluster shared by multiple jobs and users poses several challenges. In this talk, we'll give an overview of this area, including distributed TensorFlow, PyTorch, Horovod, and MPI, and the use of GPU nodes with NCCL and RDMA for accelerated performance. We'll describe our end-to-end flow for multi-node jobs in K8s, including the gang scheduling, quotas, fairness, and backfilling implemented in our custom scheduler for GPUs. Our cluster includes high-speed networking through RoCE and SR-IOV / Multus CNI. We'll share our design choices, lessons learned, and operational experience, including failure handling, performance, and telemetry.
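For readers new to the terms, gang scheduling means a multi-node job starts only if all of its workers can be placed at once, and backfilling lets smaller jobs use resources a blocked job cannot. The following is a minimal sketch of those two ideas, not the scheduler described in the talk. The names `Job`, `gang_place`, and `schedule` are hypothetical, and the model assumes each node is described only by its free GPU count.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job model, for illustration only.
    name: str
    workers: int          # number of workers that must start together
    gpus_per_worker: int  # GPUs each worker needs (assumed positive)


def gang_place(job, free_gpus):
    """Return {node: worker_count} if the whole gang fits, else None.

    free_gpus maps node name -> free GPU count and is not modified.
    """
    placement = {}
    remaining = job.workers
    for node, free in free_gpus.items():
        fit = min(remaining, free // job.gpus_per_worker)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    return None  # a partial placement is discarded: all or nothing


def schedule(queue, free_gpus):
    """Walk the queue in order and start every job whose full gang fits.

    A job that does not fit is skipped, so later, smaller jobs can
    backfill the GPUs it could not use. free_gpus is updated in place.
    """
    started = []
    for job in queue:
        placement = gang_place(job, free_gpus)
        if placement is None:
            continue  # job keeps waiting
        for node, count in placement.items():
            free_gpus[node] -= count * job.gpus_per_worker
        started.append(job.name)
    return started


free = {"n1": 8, "n2": 4}
queue = [Job("big", workers=2, gpus_per_worker=8),
         Job("small", workers=1, gpus_per_worker=4)]
started = schedule(queue, free)
```

In this example `big` needs two full 8-GPU nodes and cannot be placed, so none of its workers start, while `small` backfills onto `n1`. A sketch this simple can starve large jobs indefinitely; a production scheduler would typically pair backfilling with reservations, quotas, or fairness policies.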