JOB DESCRIPTION – Data Engineer
Designation – Data Engineer
Experience – 5+ Years
Location – Mumbai (onsite)
Job Summary: We are seeking a highly skilled Data Engineer with deep expertise in Apache Kafka integration with Databricks, structured streaming, and large-scale data pipeline design using the Medallion Architecture.
The ideal candidate will demonstrate strong hands-on experience in building and optimizing real-time and batch pipelines, and will be expected to solve real coding problems during the interview.
Job Description:
- Design, develop, and maintain real-time and batch data pipelines in Databricks.
- Integrate Apache Kafka with Databricks using Structured Streaming.
- Implement robust data ingestion frameworks using Databricks Autoloader.
- Build and maintain Medallion Architecture pipelines across Bronze, Silver, and Gold layers (a minimal sketch follows this list).
- Implement checkpointing, appropriate output modes, and trigger (processing) modes in Structured Streaming jobs.
- Design and implement Change Data Capture (CDC) workflows and Slowly Changing Dimensions (SCD) Type 1 and Type 2 logic.
- Develop reusable components for merge/upsert operations and window function-based transformations.
- Handle large volumes of data efficiently through proper partitioning, caching, and cluster tuning techniques.
- Collaborate with cross-functional teams to ensure data availability, reliability, and consistency.
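
The Medallion Architecture responsibility above can be illustrated with a minimal batch sketch. It assumes a Databricks/Spark environment with Delta Lake available; the /mnt/lake/... paths, the orders dataset, and all column names are hypothetical placeholders rather than part of any actual pipeline for this role.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land raw JSON as-is, adding only ingestion metadata.
raw = (spark.read.format("json").load("/mnt/lake/landing/orders/")
       .withColumn("_ingested_at", F.current_timestamp()))
raw.write.format("delta").mode("append").save("/mnt/lake/bronze/orders")

# Silver: cleanse and deduplicate the bronze data.
bronze = spark.read.format("delta").load("/mnt/lake/bronze/orders")
silver = (bronze
          .filter(F.col("order_id").isNotNull())
          .dropDuplicates(["order_id"]))
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/orders")

# Gold: business-level aggregate ready for reporting.
gold = (silver.groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"),
             F.count("order_id").alias("order_count")))
gold.write.format("delta").mode("overwrite").save("/mnt/lake/gold/customer_order_summary")
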
Must Have:
- Apache Kafka: Integration, topic management, schema registry (Avro/JSON).
- Databricks & Spark Structured Streaming (sketch after this list):
  o Output modes: Append, Update, Complete
  o Sinks: Memory, Console, File, Kafka, Delta
  o Checkpointing and fault tolerance
- Databricks Autoloader (sketch after this list): Schema inference, schema evolution, incremental loads.
- Medallion Architecture implementation expertise.
- Performance Optimization (sketch after this list):
  o Data partitioning strategies
  o Caching and persistence
  o Adaptive query execution and cluster configuration tuning
- SQL & Spark SQL: Proficiency in writing efficient queries and transformations.
- Data Governance: Schema enforcement, data quality checks, and monitoring.
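
For the Kafka and Structured Streaming items above, a minimal PySpark sketch is shown below. It assumes a Databricks cluster with network access to Kafka; the broker address, topic name, payload schema, and paths are placeholder assumptions. For Avro payloads managed by a schema registry, from_avro would replace from_json.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema for the JSON payload carried in the Kafka value.
event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "orders")                       # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Append output mode plus a checkpoint location gives fault-tolerant,
# exactly-once writes into a Bronze Delta table.
query = (events.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/lake/_checkpoints/orders_bronze")
         .trigger(processingTime="1 minute")
         .start("/mnt/lake/bronze/orders_stream"))
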
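For the Autoloader item, a minimal sketch with schema inference and evolution enabled. It assumes a Databricks runtime (where the cloudFiles source is available); the paths, file format, and trigger interval are placeholder assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally ingest new files; the schemaLocation enables schema inference
# and tracks schema evolution across runs.
incoming = (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "/mnt/lake/_schemas/orders")
            .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
            .load("/mnt/lake/landing/orders/"))

(incoming.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/lake/_checkpoints/orders_autoloader")
 .option("mergeSchema", "true")   # allow newly added columns through to the Delta sink
 .trigger(processingTime="5 minutes")
 .start("/mnt/lake/bronze/orders"))
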
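For the performance-optimization item, a few illustrative knobs; the config values, paths, and column names are placeholders to be tuned per workload, not recommended defaults.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Adaptive query execution: let Spark re-plan joins and coalesce shuffle
# partitions at runtime (enabled by default on recent runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200")   # placeholder starting point

# Partition a Silver table on a low-cardinality column so queries can prune files.
orders = spark.read.format("delta").load("/mnt/lake/silver/orders")
(orders.write.format("delta")
 .mode("overwrite")
 .partitionBy("order_date")
 .save("/mnt/lake/silver/orders_by_date"))

# Cache a hot intermediate result that feeds several downstream aggregations.
completed = orders.filter(F.col("status") == "COMPLETED").cache()
completed.count()   # materialize the cache before reuse
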
Good to Have:
- Strong coding skills in Python and PySpark.
- Experience working in CI/CD environments for data pipelines.
- Exposure to cloud platforms (AWS/Azure/GCP).
- Understanding of Delta Lake, time travel, and data versioning (see the sketch after this list).
- Familiarity with orchestration tools like Airflow or Azure Data Factory.
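
As a note on the Delta Lake time-travel item above, a minimal sketch of version- and timestamp-based reads; the table path, version number, and timestamp are placeholder assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/lake/silver/orders"   # placeholder Delta table path

current = spark.read.format("delta").load(path)
as_of_version = spark.read.format("delta").option("versionAsOf", 5).load(path)
as_of_time = (spark.read.format("delta")
              .option("timestampAsOf", "2024-01-01 00:00:00")
              .load(path))

# Review the commit history that time travel relies on.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)
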
Mandatory Hands-on Coding Assessment (During Interview): Candidates will be required to demonstrate hands-on proficiency in the following areas (illustrative sketches for each area follow the list):
1. Window Functions:
   o Implement logic using ROW_NUMBER, RANK, and DENSE_RANK in Spark.
   o Use cases such as deduplication, ranking within groups.
2. Merge/Upsert Logic:
   o Write PySpark code to perform MERGE operations in Delta Lake.
3. SCD Implementation:
   o SCD Type 1: Overwriting existing records.
   o SCD Type 2: Versioning records with effective start/end dates or is_current flags.
4. CDC (Change Data Capture):
   o Capture and process changes using techniques such as:
     ▪ Comparison with previous snapshots
     ▪ Using audit columns or timestamps
     ▪ Kafka-based event-driven ingestion
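
Illustrative sketches for the four assessment areas follow; the table paths and column names in all of them are hypothetical. First, window functions for deduplication and in-group ranking:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
orders = spark.read.format("delta").load("/mnt/lake/silver/orders")   # placeholder source

# Deduplication: keep only the most recent record per order_id.
latest_first = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped = (orders
           .withColumn("rn", F.row_number().over(latest_first))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Ranking within groups: customers ranked by spend inside each region;
# RANK leaves gaps after ties, DENSE_RANK does not.
spend_rank = Window.partitionBy("region").orderBy(F.col("total_amount").desc())
ranked = (orders.groupBy("region", "customer_id")
          .agg(F.sum("amount").alias("total_amount"))
          .withColumn("rank", F.rank().over(spend_rank))
          .withColumn("dense_rank", F.dense_rank().over(spend_rank)))
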
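Next, a minimal Delta Lake MERGE (upsert) sketch; the target/source paths and the join key are assumptions:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

updates = spark.read.format("delta").load("/mnt/lake/bronze/order_updates")   # placeholder source
target = DeltaTable.forPath(spark, "/mnt/lake/silver/orders")                 # placeholder target

# Update existing orders and insert new ones in a single atomic operation.
(target.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
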
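For SCD: Type 1 reduces to the whenMatchedUpdateAll merge shown above; a Type 2 sketch with is_current flags and effective dates is below. The dimension and staging paths and the tracked column (address) are assumptions, and the sketch assumes at most one change per customer_id per batch.

from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

dim_path = "/mnt/lake/gold/dim_customer"                                     # placeholder
changes = spark.read.format("delta").load("/mnt/lake/silver/customer_feed")  # placeholder

# Keep only rows that are new customers or whose tracked attribute changed.
current = (spark.read.format("delta").load(dim_path)
           .filter("is_current = true")
           .select("customer_id", F.col("address").alias("current_address")))
to_apply = (changes.join(current, "customer_id", "left")
            .filter(F.col("current_address").isNull() |
                    (F.col("address") != F.col("current_address")))
            .drop("current_address"))

# Step 1: expire the current version of changed customers.
(DeltaTable.forPath(spark, dim_path).alias("t")
 .merge(to_apply.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
 .whenMatchedUpdate(set={"is_current": F.lit(False),
                         "effective_to": F.current_timestamp()})
 .execute())

# Step 2: append the new current version of each changed or new customer.
(to_apply
 .withColumn("effective_from", F.current_timestamp())
 .withColumn("effective_to", F.lit(None).cast("timestamp"))
 .withColumn("is_current", F.lit(True))
 .write.format("delta").mode("append").save(dim_path))
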
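Finally, CDC. Kafka-based event-driven ingestion is essentially the streaming sketch in the Must Have section; the sketch below covers the audit-column and snapshot-comparison techniques. The paths, previous version number, watermark value, and hashed columns are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
source = spark.read.format("delta").load("/mnt/lake/bronze/customers")   # placeholder

# Technique 1: audit/timestamp column — pull only rows modified since the last load.
last_load_ts = "2024-01-01 00:00:00"   # normally persisted in a control table
incremental = source.filter(F.col("updated_at") > F.lit(last_load_ts))

# Technique 2: snapshot comparison — hash tracked columns and anti-join against
# a previous snapshot (here via Delta time travel) to find new or changed rows.
tracked = ["customer_id", "name", "address"]
prev = spark.read.format("delta").option("versionAsOf", 10).load("/mnt/lake/bronze/customers")
cur_hashed = source.withColumn("row_hash", F.sha2(F.concat_ws("||", *tracked), 256))
prev_hashes = prev.withColumn("row_hash", F.sha2(F.concat_ws("||", *tracked), 256)).select("row_hash")
changed_or_new = cur_hashed.join(prev_hashes, "row_hash", "left_anti").drop("row_hash")
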