Key Responsibilities:
- Design, develop, and optimize big data pipelines and ETL workflows using PySpark and the Hadoop ecosystem (HDFS, MapReduce, Hive, HBase); an illustrative PySpark sketch follows this list.
- Develop and maintain data ingestion, transformation, and integration processes on Google Cloud Platform services such as BigQuery, Dataflow, Dataproc, and Cloud Storage.
- Ensure data quality, security, and governance across all pipelines.
- Monitor and troubleshoot performance issues in data pipelines and storage systems.
- Collaborate with data scientists and analysts to understand data needs and deliver clean, processed datasets.
- Implement batch and real-time data processing solutions.
- Write efficient, reusable, and maintainable code in Python and PySpark.
- Automate deployment and orchestration using tools such as Airflow or Cloud Composer (an illustrative DAG sketch follows this list).
- Stay current with emerging big data technologies and recommend improvements.
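For context on the pipeline work described above, the sketch below shows a minimal PySpark ETL job that reads raw files from Cloud Storage, cleans them, and writes a curated table to BigQuery. All bucket, project, dataset, and column names are placeholders, and it assumes the spark-bigquery connector bundled with Dataproc; it illustrates the kind of work involved and is not part of any existing codebase.

```python
# Illustrative PySpark ETL sketch. All paths, table names, and columns are
# placeholders; assumes the spark-bigquery connector (bundled on Dataproc).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-daily-etl").getOrCreate()

# Extract: raw CSV files landed in a Cloud Storage raw zone.
raw = (spark.read
       .option("header", True)
       .csv("gs://example-raw-zone/orders/2024-01-01/*.csv"))

# Transform: deduplicate, drop bad records, and enforce types.
clean = (raw
         .dropDuplicates(["order_id"])
         .filter(F.col("order_id").isNotNull())
         .withColumn("order_ts", F.to_timestamp("order_ts"))
         .withColumn("amount", F.col("amount").cast("double")))

# Load: write the curated table to BigQuery via a temporary GCS staging bucket.
(clean.write
 .format("bigquery")
 .option("table", "example_project.analytics.orders_clean")
 .option("temporaryGcsBucket", "example-staging-bucket")
 .mode("overwrite")
 .save())
```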
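Orchestration with Airflow or Cloud Composer typically wraps such a job in a DAG. The sketch below submits a hypothetical PySpark job to an existing Dataproc cluster on a daily schedule; the project, region, cluster, and GCS URIs are placeholders, and it assumes the apache-airflow-providers-google package (preinstalled on Cloud Composer).

```python
# Illustrative Airflow DAG sketch for Cloud Composer. Project, region, cluster,
# and GCS URIs are placeholders; assumes apache-airflow-providers-google.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": "example-project"},
    "placement": {"cluster_name": "example-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://example-code/etl/orders_daily_etl.py"},
}

with DAG(
    dag_id="orders_daily_etl",
    schedule_interval="@daily",   # one run per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Submit the PySpark ETL job to an existing Dataproc cluster.
    run_etl = DataprocSubmitJobOperator(
        task_id="run_pyspark_etl",
        job=PYSPARK_JOB,
        region="us-central1",
        project_id="example-project",
    )
```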
Qualifications and Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 3+ years of experience in big data engineering or related roles.
- Strong hands-on experience with Google Cloud Platform (GCP) services for big data processing.
- Proficiency with Hadoop ecosystem tools such as HDFS, MapReduce, Hive, and HBase.
- Expert-level knowledge of PySpark for data processing and analytics.
- Experience with data warehousing concepts and tools such as BigQuery.
- Good understanding of ETL processes, data modeling, and pipeline orchestration.
- Proficiency in Python programming and shell scripting.
- Familiarity with containerization (Docker) and CI/CD pipelines.
- Strong analytical and problem-solving skills.
Desirable Skills:
- Experience with streaming data platforms like Kafka or Pub/Sub.
- Knowledge of data governance and compliance standards (GDPR, HIPAA).
- Familiarity with ML workflows and their integration with big data platforms.
- Experience with Terraform or other infrastructure-as-code tools.
- Google Cloud Professional Data Engineer certification or equivalent.
Skills Required:
GDPR, HIPAA, PySpark, Python, Hadoop