Job Title: Senior Engineer-HPC
Department: Production & Support
Location: Faridabad
Position Summary:
We are seeking an accomplished HPC Systems Engineer with 8–10 years of enterprise Linux administration experience, including more than 5 years of hands-on experience managing large-scale HPC clusters exceeding 500 cores and multi-petabyte storage environments.
The ideal candidate has proven expertise in designing, implementing, and optimizing HPC infrastructure, including compute, storage, and high-speed networking, to deliver maximum performance for demanding workloads.
Key Responsibilities:
HPC Cluster Management & Optimization
- Design, implement, and maintain HPC environments, including compute, storage, and network components.
- Configure and optimize Slurm, PBS Pro, or other workload managers/schedulers for efficient job scheduling and resource allocation.
- Tune CPU, GPU, memory, I/O, and network subsystems to meet workload demands.
- Manage HPC filesystem solutions such as Lustre, BeeGFS, or GPFS/Spectrum Scale.
Linux Administration
- Administer enterprise-grade Linux distributions (RHEL, CentOS, Rocky Linux, Ubuntu) in large-scale compute environments.
- Manage kernel upgrades, patching, and security hardening.
- Troubleshoot kernel-level and system-level issues for performance and stability.
Automation & Configuration Management
- Develop and maintain Ansible playbooks/roles for automated provisioning, configuration, and patching of HPC systems.
- Integrate Ansible with CI/CD pipelines to support infrastructure-as-code (IaC) practices.
- Automate cluster deployment and enforce environment consistency across hundreds of nodes.
Monitoring, Troubleshooting & Support
- Implement and maintain monitoring tools (e.g., Grafana, Prometheus, Nagios, Ganglia).
- Troubleshoot complex HPC workloads, MPI communication issues, and application performance bottlenecks.
- Provide Tier-3 escalation support for Linux/HPC-related incidents.
Collaboration & Documentation
- Work closely with research teams, DevOps engineers, and system architects to deliver high-performance solutions.
- Document architecture, SOPs, troubleshooting guides, and performance tuning methodologies.
Requirements
Required Skills & Experience
- 8–10 years of hands-on Linux system administration experience in production environments.
- 5+ years managing HPC clusters at scale (500+ cores / multiple petabytes of storage).
- Strong Ansible automation skills (complex playbooks, roles, variables, templates).
- Deep understanding of MPI, OpenMP, and GPU/accelerator integration in HPC workloads.
- Proficient with HPC job schedulers (Slurm, PBS Pro, LSF).
- Experience with HPC storage (Lustre, BeeGFS, GPFS).
- Strong knowledge of TCP/IP networking, InfiniBand, and RDMA technologies.
- Experience with performance tuning and benchmarking tools (perf, HPCToolkit, Intel VTune, iperf, fio).
- Scripting proficiency in Bash, Python, or Perl for automation and tooling.
Preferred Qualifications
- Experience with containerized HPC (Singularity, Apptainer, or Podman).
- Familiarity with cloud-HPC integration (AWS ParallelCluster, Azure CycleCloud, GCP HPC).
- Knowledge of security compliance standards (CIS benchmarks, STIG).
- Contributions to HPC community tools or open-source projects.
Soft Skills
- Strong problem-solving and analytical thinking.
- Ability to mentor junior engineers and collaborate across teams.
- Excellent communication skills for technical and non-technical stakeholders.