Job description
Senior Infrastructure Test & Validation Engineer (Zero-Touch GPU Cloud – GitOps Validation & Certification)
We are seeking a Senior Infrastructure Test & Validation Engineer with 10+ years of experience to lead the Zero-Touch Validation, Upgrade, and Certification automation of our on-prem GPU cloud platform.
This role focuses on ensuring the stability, performance, and conformance of the entire stack—from hardware to Kubernetes—using automated, GitOps-based validation pipelines.
The ideal candidate has a strong infrastructure background with deep hands-on skills in Sonobuoy, LitmusChaos, k6, and pytest, and is passionate about automated test orchestration, platform resilience, and continuous conformance.
Key Responsibilities
- Design and implement automated, GitOps-compliant pipelines for validation and certification of the GPU cloud stack across hardware, OS, Kubernetes, and platform layers.
- Integrate Sonobuoy for Kubernetes conformance and certification testing.
- Design and orchestrate chaos engineering workflows using LitmusChaos to validate system resilience across failure scenarios.
- Implement performance testing suites using k6 and system-level benchmarks, integrated into CI/CD pipelines.
- Develop and maintain end-to-end test frameworks using pytest and/or Go, focusing on cluster lifecycle events, upgrade paths, and GPU workloads.
- Ensure test coverage and validation across multiple dimensions: conformance, performance, fault injection, and post-upgrade validation.
- Build and maintain dashboards and reporting for automated test results, including traceability, drift detection, and compliance tracking.
- Collaborate with infrastructure, SRE, and platform teams to embed testing and validation early in the deployment lifecycle.
- Own quality assurance gates for all automation-driven deployments.
Required Skills & Experience
- 10+ years of hands-on experience in infrastructure engineering, systems validation, or SRE roles.
- Primary skills required: pytest, Go, k6 scripting, integration of automation frameworks (Sonobuoy, LitmusChaos), and CI integration.
- Strong experience with:
  - Sonobuoy for Kubernetes conformance and diagnostics
  - LitmusChaos for fault injection and resilience validation
  - k6 for performance/load testing in distributed environments
  - pytest or Go-based test frameworks for automation and validation scripting
- Deep understanding of Kubernetes architecture, upgrade patterns, and operational risks.
- Experience validating infrastructure components (GPU drivers, kernel modules, CNI, CRI, etc.) across lifecycle events.
- Proficiency in GitOps workflows and in integrating tests into declarative, Git-backed pipelines (e.g., with Argo CD, Flux).
- Hands-on experience with CI/CD systems (e.g., GitHub Actions, GitLab CI, Jenkins) to automate test orchestration.
- Solid scripting and automation experience (Python, Bash, or Go).
- Familiarity with GPU-based infrastructure and its performance characteristics is a strong plus.
- Strong debugging, root cause analysis, and incident investigation skills.
Required Skill Profession
Engineers