Position Summary:
In this role, you will help ensure the reliability, performance, and scalability of our cloud-based platform.
Your expertise will be essential in maintaining the health and availability of critical systems and applications, contributing directly to the seamless delivery of high-quality software and services.
By applying strong technical knowledge and support best practices, you will proactively troubleshoot issues, optimize system performance, and collaborate with cross-functional teams to minimize downtime and improve infrastructure efficiency.
Your efforts will help drive operational excellence and ensure a resilient and scalable platform that meets business demands.
EXPERIENCE AND REQUIRED SKILL SETS
Education:
Bachelor's or master's degree in Computer Science, Engineering, Software Engineering, or a related field
Experience:
3+ years of relevant experience in an SRE, Production Support, or Product Support role, with a track record of implementing SRE practices
Key Responsibilities:
-  Ensure 24x7 uptime and reliability of production systems
-  Investigate, troubleshoot, and resolve production issues in real time
-  Collaborate with development and engineering teams to optimize system performance and reduce operational toil
-  Participate in an on-call rotation to provide support for critical systems
-  Develop and implement automation for deployments, monitoring, and routine tasks
-  Continuously enhance infrastructure and workflows to reduce manual intervention
-  Maintain and improve CI/CD pipelines and Infrastructure-as-Code practices
-  Contribute to system monitoring, logging, and alerting enhancements
-  Work closely with stakeholders across time zones and cultures
-  Engage with clients on calls to understand reported issues and conduct real-time investigations when necessary
Required Qualifications:
-  Proven track record implementing SRE practices and improving system resilience
-  Hands-on experience with cloud platforms such as AWS, Azure, or GCP. Relevant certifications are a plus.
-  Hands-on experience with Linux OS, including system commands and shell scripting
-  Proficiency in Python and in Docker and containerization, with experience in at least one additional scripting language such as Bash or PowerShell.
-  Hands-on experience with MongoDB, including designing, configuring, and managing replica sets. Familiarity with replication, failover, high availability, and performance optimization is required.
-  Strong problem-solving skills and the ability to analyse complex technical issues
-  Excellent communication and collaboration skills across global teams
-  Proven experience managing and meeting customer-facing Service Level Agreements (SLAs), ensuring timely resolution of issues and maintaining high levels of customer satisfaction.
-  Hands-on experience and proficiency with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible for automating cloud infrastructure deployment and management.
-  Good understanding of the following tools:
   o  Orchestration – Autosys / Airflow or Cron
   o  Monitoring & Logging – PagerDuty, Prometheus & Grafana or Datadog, Splunk
   o  Project Management / ITSM – ServiceNow (basic ability to navigate and to create change tickets and incidents), Jira (basic ability to create tickets and filter your work)