We are building the future of healthcare analytics.
Join us to design, build, and scale robust data pipelines that power nationwide analytics and support our machine learning systems.
Our goal: pipelines that are reliable, observable, and continuously improving in production.
This is a fully remote role, open to candidates based in Europe or India, with periodic team gatherings in Mountain View, California.
What You’ll Do
- Design, build, and maintain scalable ETL pipelines using Python (Pandas, PySpark) and SQL, orchestrated with Airflow (MWAA).
- Develop and maintain the SAIVA Data Lake/Lakehouse on AWS, ensuring quality, governance, scalability, and accessibility.
- Run and optimize distributed data processing jobs with Spark on AWS EMR and/or EKS.
- Implement batch and streaming ingestion frameworks (APIs, databases, files, event streams).
- Enforce validation and quality checks to ensure reliable analytics and ML readiness.
- Monitor and troubleshoot pipelines with CloudWatch, integrating observability tools like Grafana, Prometheus, or Datadog.
- Automate infrastructure provisioning with Terraform, following AWS best practices.
- Manage SQL Server, PostgreSQL, and Snowflake integrations into the Lakehouse.
- Participate in an on-call rotation to support pipeline health and resolve incidents quickly.
- Write production-grade code and contribute to design/code reviews and engineering best practices.
What We’re Looking For
- 5+ years in data engineering, ETL pipeline development, or data platform roles (flexible for exceptional candidates).
- Experience designing and operating data lake or lakehouse architectures on AWS (S3, Glue, Lake Formation, Delta Lake, Iceberg).
- Strong SQL skills with PostgreSQL, SQL Server, and at least one cloud data warehouse (Snowflake or Redshift).
- Proficiency in Python (Pandas, PySpark); Scala or Java a plus.
- Hands-on experience with Spark on AWS EMR and/or EKS for distributed processing.
- Strong background in Airflow (MWAA) for workflow orchestration.
- Expertise with AWS services: S3, Glue, Lambda, Athena, Step Functions, ECS, CloudWatch.
- Proficiency with Terraform for IaC; familiarity with Docker, ECS, and CI/CD pipelines.
- Experience building monitoring, validation, and alerting into pipelines with CloudWatch, Grafana, Prometheus, or Datadog.
- Strong communication skills and ability to collaborate with data scientists, analysts, and product teams.
- A track record of delivering production-ready, scalable AWS pipelines, not just prototypes.