Job Overview
Category
Computer Occupations
Job Description
<p><b>Responsibilities:</b></p>
<p>- Team Leadership: Manage and mentor a team of SREs, assigning tasks, providing technical guidance, and fostering a culture of collaboration and continuous learning.</p>
<p>- System Design and Implementation: Lead the implementation of reliable, scalable, and fault-tolerant systems, including infrastructure, monitoring, and alerting.</p>
<p>- Incident Management: Manage incident response processes, including root cause analysis, post-mortem reviews, and proactive mitigation strategies, to minimize system downtime and impact.</p>
<p>- Monitoring and Alerting: Develop and maintain comprehensive monitoring systems to identify potential issues early, set appropriate alerting thresholds, and optimize system performance.</p>
<p>- Automation and Tooling: Drive automation initiatives to streamline operational tasks, including deployments, scaling, and configuration management, using relevant tools and technologies.</p>
<p>- Capacity Planning: Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under load.</p>
<p>- Performance Optimization: Analyze system metrics to identify bottlenecks, implement performance improvements, and optimize resource utilization.</p>
<p>- Collaboration: Work closely with development teams, product managers, and other stakeholders to ensure alignment on reliability goals and smooth integration of new features.</p>
<p>- Technical Strategy: Develop and implement the SRE roadmap, including technology adoption, standards, and best practices, to maintain a high level of system reliability.</p>
<p><b>Requirements:</b></p>
<p>- Technical Expertise: Strong proficiency in system administration, cloud computing (AWS, Azure), networking, distributed systems, and containerization technologies (Docker, Kubernetes).</p>
<p>- Programming Skills: Expertise in scripting languages (Python, Bash) and the ability to develop automation tools. A basic understanding of Java is good to have.</p>
<p>- Monitoring and Alerting: Deep understanding of monitoring systems (Prometheus, Grafana), alerting configurations, and log analysis.</p>
<p>- Incident Management: Proven experience in managing critical incidents, performing root cause analysis, and coordinating response efforts.</p>
<p>- Leadership and Communication: Excellent communication skills to convey technical concepts to both technical and non-technical audiences, and the ability to lead and motivate a team.</p>
<p>- Problem-Solving: Strong analytical and troubleshooting skills to identify and resolve complex technical issues.</p>
(ref:hirist.tech)
Neemtree is actively hiring for this Lead Site Reliability Engineer - Cloud Computing position