Senior SRE (Engineering & Reliability)
Job Summary:
We are seeking an experienced and dynamic Site Reliability Engineering (SRE) Lead to oversee the reliability, scalability, and performance of our critical systems. As a Senior SRE, you will play a pivotal role in establishing and implementing SRE practices, leading a team of engineers, and driving automation, monitoring, and incident response strategies.
This position combines software engineering and systems engineering expertise to build and maintain high-performing, reliable systems.
 
Experience: 7+ years
Key Responsibilities:
Reliability & Performance:
• Lead efforts to maintain high availability and reliability of critical services.
• Define and monitor SLIs, SLOs, and SLAs to ensure business requirements are met.
• Proactively identify and resolve performance bottlenecks and system inefficiencies.
Incident Management & Response:   
• Establish and improve incident management processes and on-call rotations.
• Lead incident response and root cause analysis for high-priority outages.
• Drive post-incident reviews and ensure actionable insights are implemented.
Automation & Tooling:
• Develop and implement automated solutions to reduce manual operational tasks.
• Enhance system observability through metrics, logging, and distributed tracing tools (e.g., Prometheus, Grafana, Elastic APM).
• Optimize CI/CD pipelines for seamless deployments.
Collaboration:
• Partner with software engineering teams to improve the reliability of applications and infrastructure.
• Work closely with product and engineering teams to design scalable and robust systems.
• Ensure seamless integration of monitoring and alerting systems across teams.
Leadership & Team Building:  
• Manage, mentor, and grow a team of SREs.
• Promote SRE best practices and foster a culture of reliability and performance across the organization.
• Drive performance reviews, skills development, and career progression for team members.
Capacity Planning & Cost Optimization:   
• Perform capacity planning and implement autoscaling solutions to handle traffic spikes.
• Optimize infrastructure and cloud costs while maintaining reliability and performance.
Skills & Qualifications:
Required Skills:
• Technical Expertise:
o Experience with cloud platforms (AWS / Azure / GCP) and Kubernetes.
o Hands-on knowledge of infrastructure-as-code tools like Terraform / Helm / Ansible.
o Proficiency in Java.
o Expertise in distributed systems, databases, and load balancing.
• Monitoring & Observability:
o Proficient with tools like Prometheus, Grafana, Elastic APM, or New Relic.
o Understanding of metrics-driven approaches for system monitoring and alerting.
• Automation & CI/CD:
o Hands-on experience with CI/CD pipelines (e.g., Jenkins, Azure Pipelines).
o Skilled in automation frameworks and tools for infrastructure and application deployments.
• Incident Management:
o Proven track record in handling incidents, post-mortems, and implementing solutions to prevent recurrence.
Leadership & Communication Skills:
• Strong people management and leadership skills with the ability to inspire and motivate teams.
• Excellent problem-solving and decision-making skills.
• Clear and concise communication, with the ability to translate technical concepts for non-technical stakeholders.
Preferred Qualifications:
• Experience with database optimization, Kafka, or other messaging systems.
• Knowledge of autoscaling techniques.
• Previous experience in an SRE, DevOps, or infrastructure engineering leadership role.
• Understanding of compliance and security best practices in distributed systems.