Key Job Responsibilities and Duties:
The core premise of SRE is to treat operational issues as software problems.
We code our way out of operational problems, addressing availability,
scalability, latency, and efficiency challenges across our vast infrastructure.
- You will impact millions of people all over the globe with your creative solutions
- You will work at one of the biggest e-commerce companies in the world
- You will solve exciting problems at scale by writing and deploying code across tens of thousands of servers
- You will have the opportunity to collaborate with many of the world’s leading SREs
- You will be free to launch your own ideas and solutions within our sophisticated production environment
- Some of the tools and technologies we use: Python, Go, Puppet, Kubernetes, Elasticsearch, Prometheus, HAProxy, Cassandra, Kafka, and more
What you’ll be doing:
- Design, develop and implement systems software that improves the stability, scalability, availability and latency of our products;
- Take ownership of one or more services and have the freedom to do what is best for our business and customers;
- Solve problems occurring with our highly available production systems and build solutions and automation to prevent them from happening again;
- Build effective monitoring to track the health of your systems, and jump in to handle outages;
- Build and run capacity tests to handle the growth of your systems;
- Plan for reliability by designing systems to work across our multinational data centers;
- Develop tools that help the product development teams successfully deploy thousands of change sets every day;
- Share the on-call rotation and act as an escalation contact for incidents (depending on the level of the role).
What you’ll bring:
- Solid experience in at least one programming language;
- Experience with building, operating and maintaining scalable distributed systems, and with operations automation;
- Experience with Infrastructure as Code technologies;
- Knowledge of cloud computing fundamentals;
- Solid foundation in Linux administration and troubleshooting;
- Understanding of service level agreements (SLAs) and service level objectives (SLOs);
- Additional experience in OpenStack, Kubernetes, Networking, Security or Storage is desirable;
- Experience with monitoring and observability technologies such as Prometheus, Graphite, Grafana, Kibana, and Elasticsearch is a plus;
- Good interpersonal skills;
- Proficient command of the English language, both written and spoken.