Job Description:
Senior Infrastructure Automation Engineer (Zero-Touch GPU Cloud Stack – Linux Image Lifecycle)
We are seeking a Senior Infrastructure Automation Engineer with 10+ years of experience to lead the design and implementation of a Zero-Touch Build, Upgrade, and Certification pipeline for our on-prem GPU cloud infrastructure.
This role focuses on automating the full stack, from hardware provisioning through OS and Kubernetes deployment, using 100% GitOps workflows.
The candidate will bring deep expertise in Linux systems automation, image management, and compliance hardening, with a strong foundation in infrastructure engineering.
Key Responsibilities
- Architect and implement a fully automated, GitOps-based pipeline for building, upgrading, and certifying the Linux operating system layer in the GPU cloud stack (hardware → OS → Kubernetes → platform).
- Design and automate Linux image builds using Packer, Kickstart, and Ansible.
- Integrate CIS/STIG compliance hardening and OpenSCAP scanning directly into the image lifecycle and validation workflows.
- Own and manage kernel module/driver automation, ensuring version compatibility and hardware enablement for GPU nodes.
- Collaborate with platform, SRE, and security teams to standardize image build and deployment practices across the stack.
- Maintain GitOps-compliant infrastructure-as-code repositories, ensuring traceability and reproducibility of all automation logic.
- Build self-service capabilities and frameworks for zero-touch provisioning, image certification, and drift detection.
- Mentor junior engineers and contribute to strategic automation roadmap initiatives.
Required Skills & Experience
- 10+ years of hands-on experience in Linux infrastructure engineering, system automation, and OS lifecycle management.
- Primary skills required: Ansible, Python, Packer, Kickstart, and OpenSCAP.
- Deep expertise with:
  - Packer for automated image builds
  - Kickstart for unattended OS provisioning
  - OpenSCAP for security compliance and policy enforcement
  - Ansible for configuration management and post-build customization
- Strong understanding of CIS/STIG hardening standards and their application in automated pipelines.
- Experience with kernel and driver management, particularly in hardware-accelerated (GPU) environments.
- Proven ability to implement GitOps workflows for infrastructure automation (e.g., Git-backed pipelines for image release and validation).
- Solid knowledge of Linux internals, bootloaders, and provisioning mechanisms in bare-metal environments.
- Exposure to Kubernetes, particularly in the context of OS-level customization and compliance.
- Strong collaboration skills across teams including security, SRE, platform, and hardware engineering.
- Bonus:
  - Familiarity with image signing, SBOM generation, or secure boot workflows
  - Experience working in regulated or compliance-heavy environments (e.g., FedRAMP, PCI-DSS)
  - Contributions to infrastructure automation frameworks or open-source tools