We are seeking a highly skilled and motivated Data Engineer to join our team.
The ideal candidate will be responsible for designing, developing, and optimizing large-scale data pipelines and data warehouse solutions using a modern, cloud-native data stack.
You'll play a crucial role in transforming raw data into actionable insights, ensuring data quality, and maintaining the infrastructure required for seamless data flow.
Key Responsibilities
- Design, build, test, and maintain robust, scalable ETL pipelines using PySpark for processing and Apache Airflow for workflow orchestration (a minimal sketch follows this list).
- Design and implement both batch and streaming ETL processes to handle varied data ingestion requirements.
- Build and optimize data structures and schemas in cloud data warehouses such as Amazon Redshift.
- Work extensively with AWS data services, including Amazon EMR for big data processing, AWS Glue for serverless ETL, and Amazon S3 for data storage.
- Implement and manage real-time data ingestion pipelines using technologies such as Kafka and Debezium for Change Data Capture (CDC).
- Integrate data from relational and NoSQL databases such as MySQL, PostgreSQL, and MongoDB.
- Monitor, troubleshoot, and optimize data pipeline performance and reliability.
- Collaborate with data scientists, analysts, and other engineering teams to understand data needs and deliver high-quality, reliable data solutions.
- Ensure data governance, security, and quality across all data platforms.
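To illustrate the first responsibility above, here is a minimal, hypothetical sketch of a daily batch ETL job: a small PySpark transform wrapped in an Airflow DAG. It assumes Airflow 2.4+ and that Spark is available where the task runs (in practice the job would often be submitted as an EMR step or via spark-submit instead of a PythonOperator); bucket names, paths, and column names are placeholders, not a prescribed setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from pyspark.sql import SparkSession, functions as F


def run_daily_events_etl(**_context):
    """Toy batch ETL: read raw JSON events from S3, aggregate, write Parquet."""
    spark = SparkSession.builder.appName("daily_events_etl").getOrCreate()

    # Extract: raw events landed in S3 (hypothetical bucket/path).
    raw = spark.read.json("s3://example-raw-bucket/events/")

    # Transform: keep completed orders and total revenue per customer.
    revenue = (
        raw.filter(F.col("status") == "completed")
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
    )

    # Load: write Parquet that downstream Redshift / Spectrum loads can pick up.
    revenue.write.mode("overwrite").parquet(
        "s3://example-curated-bucket/daily_revenue/"
    )
    spark.stop()


with DAG(
    dag_id="daily_events_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_daily_events_etl",
        python_callable=run_daily_events_etl,
    )
```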
Required Skills & Qualifications
Technical Skills
- Expert proficiency in developing ETL/ELT solutions using PySpark.
- Strong experience with workflow orchestration and scheduling tools, specifically Apache Airflow.
- In-depth knowledge of AWS data services, including:
  - Amazon EMR (Elastic MapReduce)
  - AWS Glue
  - Amazon Redshift
  - Amazon S3
- Proven experience implementing and managing data streams using Kafka.
- Familiarity with Change Data Capture (CDC) tools such as Debezium (see the sketch after this list).
- Hands-on experience with diverse database technologies: MySQL, PostgreSQL, and MongoDB.
- Solid understanding of data warehousing concepts, dimensional modeling, and best practices for both batch and real-time data processing.
- Proficiency in a scripting language, preferably Python.
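As a concrete illustration of the Kafka and Debezium items above, the sketch below consumes Debezium change events from a Kafka topic and routes them as upserts or deletes. It assumes the kafka-python client and Debezium's default JSON envelope; the topic name, broker address, and field names are hypothetical placeholders rather than part of our stack.

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "inventory.public.orders",             # hypothetical Debezium topic: <server>.<schema>.<table>
    bootstrap_servers=["localhost:9092"],  # placeholder broker address
    value_deserializer=lambda raw: json.loads(raw) if raw else None,
    auto_offset_reset="earliest",
    enable_auto_commit=True,
    group_id="orders-cdc-sink",
)

for message in consumer:
    envelope = message.value
    if envelope is None:
        continue  # tombstone record (key-only delete marker)

    # Debezium wraps the change in a "payload" when schemas are enabled.
    payload = envelope.get("payload", envelope)
    op = payload.get("op")  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read

    if op in ("c", "u", "r"):
        row = payload["after"]    # new row state to upsert downstream
        print(f"upsert: {row}")
    elif op == "d":
        row = payload["before"]   # last known row state before deletion
        print(f"delete: {row}")
```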
General Qualifications
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Excellent problem-solving, analytical, and communication skills.
- Ability to work independently and collaboratively in a fast-paced, dynamic environment.
Nice to Have (Preferred Skills)
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
- Knowledge of containerization technologies (Docker, Kubernetes).
- Familiarity with CI/CD pipelines.