Key Responsibilities:
- Develop and maintain automation scripts and tools primarily using Python to support infrastructure provisioning, monitoring, and incident response.
- Collaborate with development and operations teams to build and maintain highly available, scalable systems.
- Implement and manage monitoring, alerting, and incident management solutions using tools such as Prometheus, Grafana, the ELK Stack, and Datadog.
- Participate in on-call rotations to respond to and resolve production incidents.
- Conduct root cause analysis of outages and implement preventative measures.
- Design and implement CI/CD pipelines to automate deployment processes.
- Optimize system performance, reliability, and scalability on cloud platforms such as AWS, Azure, or GCP.
- Manage container orchestration platforms like Kubernetes and container tools like Docker.
- Document operational procedures, runbooks, and best practices.
- Drive continuous improvement in system architecture and operational processes.
Qualifications and Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a related role.
- Strong programming skills in Python with experience writing production-grade automation scripts and tools.
- Experience with cloud platforms (AWS, Azure, GCP) and infrastructure-as-code tools such as Terraform, CloudFormation, or Ansible.
- Proficient with containerization and orchestration technologies like Docker and Kubernetes.
- Hands-on experience with monitoring and alerting tools (Prometheus, Grafana, ELK Stack, Datadog).
- Solid understanding of Linux system administration and networking concepts.
- Experience with CI/CD tools such as Jenkins, GitLab CI, CircleCI, or similar.
- Strong problem-solving, analytical, and communication skills.
- Familiarity with incident management and ITIL processes is a plus.
Skills Required:
Jenkins, AWS, Azure, GCP, Python, Terraform, CloudFormation