Job description
Key Responsibilities
• Ensure platform uptime and application health as per SLOs/KPIs
• Monitor infrastructure and applications using ELK, Prometheus, Zabbix, etc.
• Debug and resolve complex production issues, performing root cause analysis
• Automate routine tasks and implement self-healing systems
• Design and maintain dashboards, alerts, and operational playbooks
• Participate in incident management, problem resolution, and RCA documentation
• Own and update SOPs for repeatable processes
• Collaborate with L3 and Product teams for deeper issue resolution
• Support and guide L1 operations team
• Conduct periodic system maintenance and performance tuning
• Respond to user data requests and ensure timely resolution
• Address and mitigate security vulnerabilities and compliance issues
Technical Skillset
• Hands-on with Spark, Hive, Cloudera Hadoop, Kafka, Ranger
• Strong Linux fundamentals and scripting (Python, Shell)
• Experience with Apache NiFi, Airflow, YARN, and ZooKeeper
• Proficient in monitoring and observability tools: ELK Stack, Prometheus, Loki
• Working knowledge of Kubernetes, Docker, Jenkins CI/CD pipelines
• Strong SQL skills (Oracle/Exadata preferred)
• Familiarity with DataHub, DataMesh, and security best practices is a plus
• Strong problem-solving and debugging mindset
• Ability to work under pressure in a fast-paced environment
• Excellent communication and collaboration skills
• Ownership, customer orientation, and a bias for action
Skills Required
Airflow, Hive, Hadoop, PySpark, Shell Scripting, Python
Required Skill Profession
Computer Occupations