Job Description:
Senior Kubernetes Platform Engineer (Zero-Touch GPU Cloud – GitOps Automation)
We are looking for a Senior Kubernetes Platform Engineer with 10+ years of infrastructure experience to design and implement the Zero-Touch Build, Upgrade, and Certification pipeline for our on-premises GPU cloud platform.
This role focuses on automating the Kubernetes layer and its dependencies (e.g., GPU drivers, networking, runtime) using 100% GitOps workflows.
You will work across teams to deliver a fully declarative, scalable, and reproducible infrastructure stack—from hardware to Kubernetes and platform services.
Key Responsibilities
- Architect and implement GitOps-driven Kubernetes cluster lifecycle automation using tools such as kubeadm, Cluster API, Helm, and Argo CD.
- Develop and manage declarative infrastructure components for:
  - GPU stack deployment (e.g., NVIDIA GPU Operator)
  - Container runtime configuration (containerd)
  - Networking layers (CNI plugins such as Calico and Cilium)
- Lead automation efforts to enable zero-touch upgrades and certification pipelines for Kubernetes clusters and associated workloads.
- Maintain Git-backed sources of truth for all platform configurations and integrations.
- Standardize deployment practices across multi-cluster GPU environments, ensuring scalability, repeatability, and compliance.
- Drive observability, testing, and validation as part of the continuous delivery process (e.g., cluster conformance, GPU health checks).
- Collaborate with infrastructure, security, and SRE teams to ensure seamless handoffs between lower layers (hardware/OS) and the Kubernetes platform.
- Mentor junior engineers and contribute to the platform automation roadmap.
Required Skills & Experience
- 10+ years of hands-on experience in infrastructure engineering, with a strong focus on Kubernetes-based environments.
- Primary skills: Kubernetes API, Helm templating, Argo CD GitOps integration, Go/Python scripting, and containerd.
- Deep knowledge and hands-on experience with:
  - Kubernetes cluster management (kubeadm, Cluster API)
  - Argo CD for GitOps-based delivery
  - Helm for application and cluster add-on packaging
  - containerd as a container runtime and its integration with GPU workloads
- Experience deploying and operating the NVIDIA GPU Operator or equivalent in production environments.
- Solid understanding of CNI plugin ecosystems, network policies, and multi-tenant networking in Kubernetes.
- Strong GitOps mindset with experience managing infrastructure as code through Git-based workflows.
- Experience building Kubernetes clusters in on-prem environments (vs. managed cloud services).
- Proven ability to scale and manage multi-cluster, GPU-accelerated workloads with high availability and security.
- Solid scripting and automation skills (Bash, Python, or Go).
- Familiarity with Linux internals, systemd, and OS-level tuning for container workloads.
- Bonus:
  - Experience with custom controllers, operators, or Kubernetes API extensions
  - Contributions to Kubernetes or CNCF projects
  - Exposure to service meshes, ingress controllers, or workload identity providers