Job Description
<p><b>Key Responsibilities:</b></p>
<p>- Lead and mentor a team of SREs/DevOps Engineers, fostering a culture of ownership, reliability, and continuous improvement.</p>
<p>- Own the availability, scalability, and performance of production systems and services.</p>
<p>- Design and manage distributed systems and microservices architectures at scale.</p>
<p>- Develop and implement incident response strategies, conduct root cause analysis, and create actionable postmortems.</p>
<p>- Drive improvements in infrastructure automation, CI/CD pipelines, and deployment strategies.</p>
<p>- Collaborate with cross-functional teams, including engineering, product, and QA, to embed SRE best practices.</p>
<p>- Implement observability tools (e.g., Prometheus, Grafana, ELK, Datadog) to monitor system performance and proactively detect issues.</p>
<p>- Manage and optimize cloud infrastructure on AWS, including services such as EC2, ELB, Auto Scaling, S3, CloudFront, and CloudWatch.</p>
<p>- Use Infrastructure-as-Code tools such as Terraform, CloudFormation, or Pulumi to provision and maintain infrastructure.</p>
<p>- Apply strong Linux, networking, load balancing, and security principles to ensure platform resilience.</p>
<p>- Use Docker and Kubernetes for container orchestration and scalable deployments.</p>
<p>- Build internal tools and automation using Python, Go, or Bash scripting.</p>
<p>- Support event-driven architectures that use Kafka or RabbitMQ for high-throughput, real-time systems.</p>
<p>- Proactively contribute to reliability-focused architecture and design.</p>
<p><b>Skills & Experience:</b></p>
<p>- 6 to 10 years of overall experience in backend engineering, infrastructure, DevOps, or SRE roles.</p>
<p>- Minimum 3 years of experience leading SRE, DevOps, or Infrastructure teams.</p>
<p>- Proven track record managing distributed systems and microservices at scale.</p>
<p>- Deep understanding of Linux systems, networking fundamentals, load balancing, and infrastructure security.</p>
<p>- Strong hands-on experience with AWS services: EC2, ELB, Auto Scaling, CloudFront, S3, and CloudWatch.</p>
<p>- Expert-level knowledge of Docker and Kubernetes in production environments.</p>
<p>- Proficiency with Infrastructure-as-Code tools: Terraform, CloudFormation, or Pulumi.</p>
<p>- Hands-on experience with monitoring and observability tools: Prometheus, Grafana, ELK Stack, or Datadog.</p>
<p>- Strong scripting or programming skills in Python, Go, Bash, or similar languages.</p>
<p>- Familiarity with Kafka or RabbitMQ for event-driven and messaging architectures.</p>
<p>- Excellent incident management skills, including triage, RCA, and communication.</p>
<p>- Ability to thrive in fast-paced environments and adapt to changing</p>
<p><b>Qualifications:</b></p>
<p>- Bachelor's degree in Computer Science, Engineering, or a related field.</p>
<p>- Experience in startup or high-growth environments.</p>
<p>- Contributions to open-source DevOps or SRE tools are a plus.</p>
<p>- Certifications in AWS, Kubernetes, or other cloud-native technologies are advantageous.</p>
<p>(ref:hirist.tech)</p>