Job description
 
Senior SRE (Engineering & Reliability)
Job Summary:
We are seeking an experienced and dynamic Site Reliability Engineering (SRE) Lead to oversee the reliability, scalability, and performance of our critical systems. As a Senior SRE, you will play a pivotal role in establishing and implementing SRE practices, leading a team of engineers, and driving automation, monitoring, and incident response strategies.
This position combines software engineering and systems engineering expertise to build and maintain high-performing, reliable systems.
Experience: 7+ years
Key Responsibilities:
Reliability & Performance:
-  Lead efforts to maintain high availability and reliability of critical services.
-  Define and monitor SLIs, SLOs, and SLAs to ensure business requirements are met.
-  Proactively identify and resolve performance bottlenecks and system inefficiencies.
Incident Management & Response: 
-  Establish and improve incident management processes and on-call rotations.
-  Lead incident response and root cause analysis for high-priority outages.
-  Drive post-incident reviews and ensure actionable insights are implemented.
Automation & Tooling:
-  Develop and implement automated solutions to reduce manual operational tasks.
-  Enhance system observability through metrics, logging, and distributed tracing tools (e.g., Prometheus, Grafana, Elastic APM).
-  Optimize CI/CD pipelines for seamless deployments.
Collaboration:
-  Partner with software engineering teams to improve the reliability of applications and infrastructure.
-  Work closely with product and engineering teams to design scalable and robust systems.
-  Ensure seamless integration of monitoring and alerting systems across teams.
Leadership & Team Building:
-  Manage, mentor, and grow a team of SREs.
-  Promote SRE best practices and foster a culture of reliability and performance across the organization.
-  Drive performance reviews, skills development, and career progression for team members.
Capacity Planning & Cost Optimization: 
-  Perform capacity planning and implement autoscaling solutions to handle traffic spikes.
-  Optimize infrastructure and cloud costs while maintaining reliability and performance.
Skills & Qualifications:
Required Skills:
-  Technical Expertise:
o Experience with cloud platforms (AWS / Azure / GCP) and Kubernetes.
o Hands-on knowledge of infrastructure-as-code tools like Terraform / Helm / Ansible.
o Proficiency in Java.
o Expertise in distributed systems, databases, and load balancing.
-  Monitoring & Observability:
o Proficient with tools like Prometheus, Grafana, Elastic APM, or New Relic.
o Understanding of metrics-driven approaches for system monitoring and alerting.
-  Automation & CI/CD:
o Hands-on experience with CI/CD pipelines (e.g., Jenkins, Azure Pipelines).
o Skilled in automation frameworks and tools for infrastructure and application deployments.
-  Incident Management:
o Proven track record in handling incidents, post-mortems, and implementing solutions to prevent recurrence.
Leadership & Communication Skills:
-  Strong people management and leadership skills with the ability to inspire and motivate teams.
-  Excellent problem-solving and decision-making skills.
-  Clear and concise communication, with the ability to translate technical concepts for non-technical stakeholders.
Preferred Qualifications: 
-  Experience with database optimization, Kafka, or other messaging systems.
-  Knowledge of autoscaling techniques.
-  Previous experience in an SRE, DevOps, or infrastructure engineering leadership role.
-  Understanding of compliance and security best practices in distributed systems.
 
                    
Required Skill Profession: Computer Occupations