Description
GSPANN is hiring a Site Reliability Engineer with expertise in Java and Spark.
The role involves ensuring service reliability, automating operations, and supporting Java-based big data applications using Spark.
You'll work closely with cross-functional teams to enhance system performance, observability, and scalability.
Role and Responsibilities
Gain a deep understanding of the business and map the full customer journey end-to-end.
Apply software development principles to operations, leveraging broad experience in software engineering and Site Reliability Engineering (SRE) practices.
Collaborate with stakeholders to enhance the design, observability, availability, scalability, and performance of critical services.
Clearly communicate your availability to both the team and your manager.
Automate manual workflows, investigate incidents thoroughly, and lead blameless post-mortems for continuous learning.
Use standardized telemetry data to improve alert management, incident analysis, decision-making, and system optimization.
Support planned changes by managing deployments, monitoring systems post-deployment, and creating or updating dashboards and alerts as needed.
Develop and enhance new services, and deploy tools that automate the support of systems and services.
Meet and uphold organizational Service Level Objectives (SLOs) consistently.
Create value-focused deliverables including Standard Operating Procedures (SOPs), presentations, case studies, and accelerators.
Skills and Experience
5+ years of experience in software development, technical operations, and managing large-scale application environments.
5+ years in Service Engineering, IT Support, or Production Operations.
5+ years of hands-on experience with Java application development and support, including knowledge of the Spring and Hibernate frameworks.
4+ years of experience setting up and debugging Apache Spark jobs, with a solid understanding of data processing, cleansing, and integrity validation.
3+ years of hands-on experience writing and maintaining Unix shell scripts.
Working knowledge of Microsoft Azure, Azure Cosmos DB, Azure Synapse Analytics, and Apache Kafka is preferred.
Creative problem-solving skills for resolving cross-functional technical challenges in dynamic, fast-changing environments.
Effective communication, with the ability to take ownership of triage calls and drive critical incidents to logical closure.
Openness to working in rotational shifts as required.