Job description
YOUR IMPACT: Reliability, Automation, and Observability
As a hybrid Site Reliability Engineer/DevOps Engineer, you'll be a key driver in ensuring the stability, performance, and scalability of our mission-critical SaaS platform.
You'll apply engineering principles to operational challenges, constantly striving to eliminate toil through automation.
Operational Excellence & Reliability
● Provide day-to-day management of system alerts, check system health, and escalate issues as necessary to maintain high availability.
● Actively participate in a 24x7 on-call rotation for critical SaaS platform incidents, and be available in case of emergencies.
● Lead the incident response process, ensuring fast and effective mitigation and resolution of production issues.
● Perform thorough Root Cause Analysis (RCA) and lead blameless post-mortems to identify systemic weaknesses and create a corrective action plan to prevent recurrence.
● Collaborate with engineering teams to set and enforce error budgets derived from Service Level Objectives (SLOs), ensuring a healthy balance between development speed and system stability.
Platform Automation & Infrastructure Development
● Automate routine operational tasks to reduce manual toil and increase overall team efficiency.
● Design, deploy, and maintain cloud infrastructure using Infrastructure as Code (IaC), specifically leveraging Terraform and Helm for deployment to EKS/K8s clusters.
● Improve existing infrastructure health by developing and implementing checks and scripts to proactively correct known issues and self-heal the platform.
● Maintain, develop, and evolve our Continuous Integration/Continuous Delivery (CI/CD) deployment code and pipelines.
● Learn and maintain existing infrastructure running under Docker and Docker Swarm while driving migration strategies toward EKS/K8s.
● Implement and integrate new technologies and services into our Cloud Infrastructure to enhance platform capabilities and resilience.
Monitoring & Observability
● Design and implement comprehensive Observability strategies across all three pillars: Metrics, Logs, and Traces.
● Proactively create and refine robust monitoring and alerting configurations within the EKS/K8s ecosystem.
● Utilize and maintain our Observability platform, Datadog, to gather performance data, create complex synthetic tests, and visualize system health via dashboards.
● Leverage existing monitoring solutions such as Grafana and Prometheus while planning and executing the migration or integration of data into a unified platform.
● Document all issues, remediation steps, system architecture, and runbooks to facilitate knowledge transfer and rapid incident response.
● Collaborate closely with Support, Customer Success, Migration, and Professional Services teams to provide the highest level of SaaS service and minimize customer impact during changes.
● Apply a real customer focus when planning deployments/updates, always considering the impact on the end-user before making changes.
YOUR EXPERIENCE: Essential Skills and Qualifications
● Hands-on AWS Cloud Engineer experience, with expert working knowledge of the AWS Cloud ecosystem, including a good understanding of AWS IAM roles and policies.
● Proficiency with container orchestration technologies: EKS/Kubernetes (K8s).
● Demonstrable experience with Infrastructure as Code (IaC) tools, specifically Terraform and Helm.
● Working experience with Docker and maintaining systems using Docker Swarm.
● Expertise in setting up and managing logging and monitoring solutions. Direct experience with Datadog is highly preferred, including setting up APM, infrastructure monitoring, and custom dashboards.
● Experience with existing monitoring solutions such as Grafana and Prometheus is required.
● Proficiency in a Linux environment, with strong Bash and/or Python scripting skills for automation and troubleshooting.
● A strong understanding of web technologies, including REST APIs, systems architecture, design, and databases.
● Experience in Product/Application Support for high-availability SaaS-based products.
● Experience in designing, implementing, and operating in a DevSecOps environment.
● Excellent oral and written communication skills, with the ability to clearly explain complex technical issues and RCAs to both technical and customer-facing audiences.