Job Description:   
Senior Kubernetes Platform Engineer (Zero-Touch GPU Cloud – GitOps Automation)  
We are looking for a Senior Kubernetes Platform Engineer with 10+ years of infrastructure experience to design and implement the Zero-Touch Build, Upgrade, and Certification pipeline for our on-premises GPU cloud platform.
This role focuses on automating the Kubernetes layer and its dependencies (e.g., GPU drivers, networking, runtime) using 100% GitOps workflows.
You will work across teams to deliver a fully declarative, scalable, and reproducible infrastructure stack—from hardware to Kubernetes and platform services.
Key Responsibilities  
- Architect and implement GitOps-driven Kubernetes cluster lifecycle automation using tools such as kubeadm, Cluster API, Helm, and Argo CD.
- Develop and manage declarative infrastructure components for:
  - GPU stack deployment (e.g., NVIDIA GPU Operator)
  - Container runtime configuration (containerd)
  - Networking layers (CNI plugins such as Calico and Cilium)
- Lead automation efforts to enable zero-touch upgrades and certification pipelines for Kubernetes clusters and associated workloads.
- Maintain Git-backed sources of truth for all platform configurations and integrations.
- Standardize deployment practices across multi-cluster GPU environments, ensuring scalability, repeatability, and compliance.
- Drive observability, testing, and validation as part of the continuous delivery process (e.g., cluster conformance, GPU health checks).
- Collaborate with infrastructure, security, and SRE teams to ensure seamless handoffs between lower layers (hardware/OS) and the Kubernetes platform.
- Mentor junior engineers and contribute to the platform automation roadmap.
Required Skills & Experience  
- 10+ years of hands-on experience in infrastructure engineering, with a strong focus on Kubernetes-based environments.
- Primary skills: Kubernetes API, Helm templating, Argo CD GitOps integration, Go/Python scripting, and containerd.
- Deep knowledge of and hands-on experience with:
  - Kubernetes cluster management (kubeadm, Cluster API)
  - Argo CD for GitOps-based delivery
  - Helm for application and cluster add-on packaging
  - containerd as a container runtime, including its integration with GPU workloads
- Experience deploying and operating the NVIDIA GPU Operator (or an equivalent) in production environments.
- Solid understanding of CNI plugin ecosystems, network policies, and multi-tenant networking in Kubernetes.
- Strong GitOps mindset with experience managing infrastructure as code through Git-based workflows.
- Experience building Kubernetes clusters in on-premises environments (as opposed to managed cloud services).
- Proven ability to scale and manage multi-cluster, GPU-accelerated workloads with high availability and security.
- Solid scripting and automation skills (Bash, Python, or Go).
- Familiarity with Linux internals, systemd, and OS-level tuning for container workloads.
- Bonus:
  - Experience with custom controllers, operators, or Kubernetes API extensions
  - Contributions to Kubernetes or CNCF projects
  - Exposure to service meshes, ingress controllers, or workload identity providers