Job Overview
Company
Masin Projects Pvt. Ltd
Category
Computer Occupations
Job Description
Data Engineer - Multi-source ETL & GenAI Pipelines (3+ Years)

Roles and Responsibilities:

- Build and maintain scalable, fault-tolerant data pipelines to support GenAI and analytics workloads across OCR, documents, and case data.
- Manage ingestion and transformation of semi-structured legal documents (PDF, Word, Excel) into structured formats.
- Enable RAG workflows by processing data into chunked, vectorized formats with metadata.
- Handle large-scale ingestion from multiple sources into cloud-native data lakes (S3, GCS), data warehouses (BigQuery, Snowflake), and PostgreSQL.
- Automate pipelines using orchestration tools such as Airflow or Prefect, including retry logic, alerting, and metadata tracking.
- Collaborate with ML Engineers to ensure data availability, traceability, and performance for inference and training pipelines.
- Implement data validation and testing frameworks using Great Expectations or dbt.
- Integrate OCR pipelines and post-processing outputs for embedding and document search.
- Design infrastructure for streaming versus batch data needs and optimize for cost, latency, and reliability.

Qualifications:

- Bachelor's or Master's degree in Computer Science, Data Engineering, or equivalent.
- 3+ years of experience building distributed data pipelines and managing multi-source ingestion.
- Proficiency with Python, SQL, and data tools such as Pandas and PySpark.
- Experience working with data orchestration tools (Airflow, Prefect) and file formats such as Parquet, Avro, and JSON.
- Hands-on experience with cloud storage and data warehouse systems (S3, GCS, BigQuery, Redshift).
- Understanding of GenAI and vector database ingestion pipelines is a strong plus.
- Bonus: Experience with OCR tools (Tesseract, Google Document AI), PDF parsing libraries (PyMuPDF), and API-based document processors.

(ref:hirist.tech)
Masin Projects Pvt. Ltd is actively hiring for this Data Engineer - Python/SQL position