Production GPU Cluster with K8s for AI and DL Workloads – Madhukar Korupolu, NVIDIA


We will present NVIDIA’s experience in building and operating a production GPU cluster with K8s for AI/DL and HPC workloads. Running GPU-accelerated workloads in K8s poses unique challenges, and we will describe how we addressed some of them in production at scale. We will describe the tools we have built for automated provisioning of GPU nodes (including CUDA driver upgrades), a custom scheduler specialized for batch jobs, and monitoring of GPU jobs in production with health checks and telemetry. We will also discuss gaps we have identified that need to be addressed to enable more reliable and efficient utilization of GPU resources (e.g., GPU affinity, sharing, co-scheduling) and share an update on our current projects.

https://sched.co/MPai
