Senior Manager, Technical Program Management - DGX Cloud @ NVIDIA

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, we're at the forefront of AI innovation, powering breakthroughs in research, autonomous vehicles, robotics, and more. The DGX Cloud team builds and operates the AI infrastructure that fuels this progress.

NVIDIA is seeking an experienced and driven Senior Manager of Technical Program Management to lead a high-impact team within our DGX Cloud Infrastructure organization. You will play a critical role in driving sophisticated, cross-functional programs involving Compute Platform and cluster bring-ups (including cutting-edge systems like GB200), and in ensuring world-class fleet availability, occupancy optimization, and infrastructure metrics tracking across the global DGX Cloud fleet.

As a DGX Cloud leader within the Technical Program Management team, you will serve as the vital bridge between NVIDIA Research and DGXC Engineering, driving the development of resilient, high-performance infrastructure for AI training and inference. You'll lead and scale a team that supports mission-critical systems empowering over 1,000 researchers. Your mission is to accelerate NVIDIA's research by delivering a world-class AI environment, from GPU clusters to software stack, setting industry standards in productivity, performance, and global impact.

What you will be doing:

Lead with impact to build and scale a high-performing team of Technical Program Managers focused on delivering a world-class AI platform that empowers over 1,000 NVIDIA researchers.
Ensure the team is customer-obsessed, prioritizing developer productivity, platform usability, and end-to-end user experience.

Deep understanding of Slurm (architecture, configuration, workload management, job prioritization and fair-share policies), as well as alternative schedulers and hybrid scheduling architectures, to drive the capacity management and allocation process across internal NVIDIA research teams.

Experience with end-to-end cluster bring-ups and integration with MLOps stacks, including deep familiarity with operational models, fleet efficiency metrics, and deployment across hyperscaler environments such as OCI, GCP, and others.

Skilled in capacity modeling, demand forecasting, and supply-demand balancing, with…