Job description
 
Role Description
Senior Infrastructure Automation Engineer (Zero-Touch GPU Cloud Build & Upgrade)
We are looking for a Senior Infrastructure Automation Engineer with 10+ years of hands-on experience building and scaling infrastructure automation systems to lead the design and implementation of a Zero-Touch Build, Upgrade, and Certification framework for our on-prem GPU cloud environment.
This role demands deep technical expertise across bare-metal provisioning, configuration management, and full-stack automation, from hardware to Kubernetes, built entirely on GitOps principles.
Key Responsibilities
Architect, lead, and implement a fully automated, zero-touch deployment pipeline for GPU cloud infrastructure spanning hardware → OS → Kubernetes → platform layers.
Build robust GitOps-based workflows to manage the end-to-end infrastructure lifecycle, from provisioning to continuous compliance.
Design and maintain automation for:
Bare-metal control: power cycling, provisioning, remote installs
Firmware and configuration flashing: BIOS, NIC, RAID, etc.
Hardware inventory management
Configuration drift detection and remediation
Develop and extend internal automation frameworks using Ansible, Python, and related infrastructure tooling.
Serve as a technical authority and mentor, guiding junior engineers and collaborating cross-functionally with hardware, SRE, and platform engineering teams.
Lead architectural and design reviews for infrastructure automation systems.
Define and implement best practices for infrastructure as code, compliance, and operational resilience.
Champion automation-driven operational models and reduce manual intervention to near-zero.
Bonus: Familiarity with Terraform, Chef, and cloud automation platforms.
Required Skills & Experience
10+ years of hands-on experience in infrastructure engineering, automation, and systems design, with a strong track record of delivering scalable and maintainable solutions.
Primary skills required: Ansible, Python, ipmitool, firmware scripting, and Linux shell scripting.
Deep expertise in:
Ansible for automation and configuration management
Python for scripting, integration, and automation logic
ipmitool and related tools for low-level hardware management (e.g., IPMI, Redfish)
Proven experience with bare-metal automation in data center environments, including:
Power control and PXE booting
BIOS/NIC/RAID firmware upgrades
Hardware and platform inventory systems
Strong foundation in Linux systems, networking, and Kubernetes infrastructure.
Fluency with GitOps workflows and tools.
Experience with CI/CD systems and managing Git-based pipelines for infrastructure.
Familiarity with infrastructure monitoring, logging, and drift detection.
Strong cross-team collaboration and communication skills, especially across hardware, platform, and SRE teams.
Bonus:
Prior leadership or mentorship roles
Experience contributing to or maintaining open-source infrastructure projects
Exposure to GPU-based compute stacks and high-performance workloads
 
                    
                    
Required Skill Profession: Engineers