Job Description
<p><b>Site Reliability Engineer (SRE)</b></p><p>We are hiring a Site Reliability Engineer (SRE) to support the night-time operations of a mission-critical banking platform for a US-based enterprise client.
This is a permanent night shift role tailored for experienced engineers who thrive in production environments and bring a proactive approach to incident resolution and automation.<br/><br/>You will work on system monitoring, incident response, and platform stability, while also improving observability, creating automation scripts, and collaborating with developers and DevOps teams.
You won't just respond to alerts; you'll help prevent them.<br/><br/><b>Work Mode:</b> Permanent Night Shift<br/><br/><b>Note:</b> This is a fixed night shift role.
Candidates must have prior experience with, or explicitly confirm readiness for, permanent US time zone shifts.</p><p><b>Key Responsibilities:</b></p><p>- Monitor system health, SLIs/SLOs, and infrastructure using tools such as Prometheus, Grafana, ELK, and Stackdriver.<br/><br/></p><p>- Lead incident triage for P1/P2 alerts, engage in war rooms, update tickets (JIRA/SNOW), and participate in post-incident RCA documentation.<br/><br/></p><p>- Create or enhance automation scripts (Bash/Python) for log ingestion, alert suppression, auto-recovery, and health checks.<br/><br/></p><p>- Analyze application runtime issues, such as JVM logs, memory usage, GC pauses, or thread deadlocks, to support root cause analysis.<br/><br/></p><p>- Participate in daily DevOps/SRE standups, collaborating closely with engineering teams to improve production reliability.<br/><br/></p><p>- Handle database performance alerts (Oracle/Postgres) and collaborate with DBAs or developers to resolve backend bottlenecks.<br/><br/></p><p>- Track and interpret SLO breaches, availability metrics, and system latencies to enforce production SLAs.<br/><br/></p><p><b>Core Skills &amp; Expertise (Technical Skills):</b></p><p>- Experience with Grafana, Prometheus, ELK Stack, or Stackdriver.
Able to define alerts, read logs, and correlate cross-system issues.<br/><br/></p><p>- Full ownership of P1/P2 incidents, including triage, ticketing, stakeholder communication, and RCA participation.<br/><br/></p><p>- Proficient in Bash or Python scripting to automate routine SRE tasks and recovery workflows.<br/><br/></p><p>- Experience managing production workloads on GCP, AWS, or Azure, with the ability to inspect cloud logs, VM status, networking, and storage configurations.<br/><br/></p><p>- Familiar with concepts like error budgets, latency thresholds, and SLO tracking.
Capable of interpreting breaches and reporting anomalies.<br/><br/></p><p>- Able to spot symptoms of JVM issues such as GC pauses, memory leaks, and thread contention, and to raise appropriate diagnostics.<br/><br/></p><p>- Able to identify backend delays or errors from logs and assist in pinpointing query or connection-related issues.<br/><br/></p><p>- Strong communication skills to work with distributed teams during escalations, code fixes, or configuration changes.<br/><br/></p><p>- Must be fully aligned to a permanent night shift (US time) and self-sufficient in a remote-first environment.<br/><br/></p><p><b>Nice-to-Have Skills:</b></p><p>- Familiarity with ServiceNow, change advisory boards, rollback planning, and structured release processes.<br/><br/></p><p>- Experience monitoring CPU, memory, and traffic metrics to recommend infrastructure scale-up/down strategies.<br/><br/></p><p>- Exposure to embedding SRE gates, smoke tests, or health validations in CI pipelines such as Jenkins or GitHub Actions.<br/><br/></p><p>- Basic understanding of tools like SLO Generator or Datadog for automated error budget tracking and alerting.<br/><br/></p><p>- Can interpret Terraform code related to monitoring, infrastructure, or alert rules.
Not required to author full modules.<br/><br/></p><p>- Holding a GCP Associate Cloud Engineer or similar certification is a plus but not mandatory.</p> (ref:hirist.tech)