Job description
Senior SRE (Engineering & Reliability)
Job Summary:
We are seeking an experienced and dynamic Site Reliability Engineering (SRE) Lead to oversee the reliability, scalability, and performance of our critical systems. As a Senior SRE, you will play a pivotal role in establishing and implementing SRE practices, leading a team of engineers, and driving automation, monitoring, and incident response strategies.
This position combines software engineering and systems engineering expertise to build and maintain high-performing, reliable systems.
Experience: 7+ years
Key Responsibilities:
Reliability & Performance:
- Lead efforts to maintain high availability and reliability of critical services.
- Define and monitor SLIs, SLOs, and SLAs to ensure business requirements are met.
- Proactively identify and resolve performance bottlenecks and system inefficiencies.
Incident Management & Response:
- Establish and improve incident management processes and on-call rotations.
- Lead incident response and root cause analysis for high-priority outages.
- Drive post-incident reviews and ensure actionable insights are implemented.
Automation & Tooling:
- Develop and implement automated solutions to reduce manual operational tasks.
- Enhance system observability through metrics, logging, and distributed tracing tools (e.g., Prometheus, Grafana, Elastic APM).
- Optimize CI/CD pipelines for seamless deployments.
Collaboration:
- Partner with software engineering teams to improve the reliability of applications and infrastructure.
- Work closely with product and engineering teams to design scalable and robust systems.
- Ensure seamless integration of monitoring and alerting systems across teams.
Leadership & Team Building:
- Manage, mentor, and grow a team of SREs.
- Promote SRE best practices and foster a culture of reliability and performance across the organization.
- Drive performance reviews, skills development, and career progression for team members.
Capacity Planning & Cost Optimization:
- Perform capacity planning and implement autoscaling solutions to handle traffic spikes.
- Optimize infrastructure and cloud costs while maintaining reliability and performance.
Skills & Qualifications:
Required Skills:
- Technical Expertise:
o Experience with cloud platforms (AWS / Azure / GCP) and Kubernetes.
o Hands-on knowledge of infrastructure-as-code tools like Terraform / Helm / Ansible.
o Proficiency in Java.
o Expertise in distributed systems, databases, and load balancing.
- Monitoring & Observability:
o Proficient with tools like Prometheus, Grafana, Elastic APM, or New Relic.
o Understanding of metrics-driven approaches for system monitoring and alerting.
- Automation & CI/CD:
o Hands-on experience with CI/CD pipelines (e.g., Jenkins, Azure Pipelines).
o Skilled in automation frameworks and tools for infrastructure and application deployments.
- Incident Management:
o Proven track record in handling incidents, post-mortems, and implementing solutions to prevent recurrence.
Leadership & Communication Skills:
- Strong people management and leadership skills with the ability to inspire and motivate teams.
- Excellent problem-solving and decision-making skills.
- Clear and concise communication, with the ability to translate technical concepts for non-technical stakeholders.
Preferred Qualifications:
- Experience with database optimization, Kafka, or other messaging systems.
- Knowledge of autoscaling techniques.
- Previous experience in an SRE, DevOps, or infrastructure engineering leadership role.
- Understanding of compliance and security best practices in distributed systems.