Job Description
Responsibilities:

- Define and lead the data architecture vision and strategy, ensuring it supports analytics, ML, and business operations at scale.
- Architect and manage cloud-native data platforms using Databricks and AWS, leveraging the lakehouse architecture to unify data engineering and ML workflows.
- Build and optimize large-scale batch and streaming pipelines using Apache Spark, Airflow, and AWS Glue, ensuring high availability and fault tolerance.
- Design and develop data marts, warehouses, and analytics-ready datasets tailored for BI, product, and data science teams.
- Implement robust ETL/ELT pipelines with a focus on reusability, modularity, and automated testing.
- Enforce and scale data governance practices, including data lineage, cataloging, access management, and compliance with security and privacy standards.
- Partner with ML engineers and data scientists to build and deploy ML pipelines, leveraging Databricks MLflow, Feature Store, and MLOps practices.
- Provide architectural leadership across data modeling, data observability, pipeline monitoring, and CI/CD for data workflows.
- Evaluate emerging tools and frameworks, recommending technologies that align with platform scalability and cost-efficiency.
- Mentor data engineers and foster a culture of technical excellence, innovation, and ownership across data teams.

Requirements:

- 8+ years of hands-on experience in data engineering, with at least 4 years in a lead or architect-level role.
- Deep expertise in Apache Spark, with proven experience developing large-scale distributed data processing pipelines.
- Strong experience with the Databricks platform and its ecosystem (e.g., Delta Lake, Unity Catalog, MLflow, job orchestration, workspaces, clusters, lakehouse architecture).
- Extensive experience with workflow orchestration using Apache Airflow.
- Proficiency in both SQL and NoSQL databases (e.g., Postgres, DynamoDB, MongoDB, Cassandra), with a deep understanding of schema design, query tuning, and data partitioning.
- Proven background in building data warehouse/data mart architectures using AWS services such as Redshift, Athena, Glue, Lambda, DMS, and S3.
- Strong programming and scripting ability in Python (preferred) or other AWS-compatible languages.
- Solid understanding of data modeling techniques, versioned datasets, and performance tuning strategies.
- Hands-on experience implementing data governance, lineage tracking, data cataloging, and compliance frameworks (GDPR, HIPAA, etc.).
- Experience with real-time data streaming using tools such as Kafka, Kinesis, or Flink.
- Working knowledge of MLOps tooling and workflows, including automated model deployment, monitoring, and ML pipeline orchestration.
- Familiarity with MLflow, Feature Store, and Databricks-native ML tooling is a plus.
- Strong grasp of CI/CD for data and ML pipelines, automated testing, and infrastructure-as-code (Terraform, CDK, etc.).
- Excellent communication, leadership, and mentoring skills with a collaborative mindset and the ability to influence across functions.

(ref:hirist.tech)