Role : ML Ops Lead Engineer (Machine Learning Operations Lead Engineer)
Job Mode : Remote
Experience : 7+ Years
Notice Period : Immediate / 10 to 15 Days
Role Overview
We are seeking a highly skilled ML Ops Lead Engineer with extensive experience in Machine Learning Operations, Cloud Infrastructure, and Generative AI platforms.
The ideal candidate will have deep expertise in both Azure and AWS ecosystems, with a proven track record of designing, deploying, and maintaining scalable and secure ML solutions across hybrid environments.
Key Responsibilities
- Design, deploy, and manage scalable, secure, and reliable cloud-based ML infrastructures leveraging Azure and AWS services.
- Lead ML Ops initiatives to streamline model development, deployment, and monitoring pipelines.
- Collaborate with data scientists, ML engineers, and platform teams to operationalize ML models efficiently.
- Implement and maintain CI/CD pipelines for ML workflows using Azure DevOps or AWS CodePipeline.
- Drive governance, observability, compliance, and audit controls within ML and GenAI environments.
- Refine and enforce security best practices, including IAM, RBAC, Azure Policy, and AWS SCPs.
- Oversee AI evaluation, prompt security scans, and red teaming using the Azure AI Evaluation SDK.
- Manage data storage, compute, and networking integrations across S3, DynamoDB, Cosmos DB, RDS, and Blob Storage.
- Build Infrastructure as Code (IaC) using Terraform, ARM/Bicep, CloudFormation, or equivalent tools.
- Implement monitoring and observability solutions using Grafana, Prometheus, Application Insights, and Azure Monitor.
- Support ML model lifecycle management, including deployment, monitoring, retraining, and drift detection.
- Collaborate with stakeholders to resolve ML pipeline issues and support model infrastructure needs.
Required Skills & Experience
- 7+ years of experience in platform engineering, ML Ops, or DevOps with cloud infrastructure expertise.
- Proficiency with Azure (Azure ML, Databricks, AKS, AI Services, Azure Search) and AWS (SageMaker, Bedrock, Lambda).
- Experience in Generative AI and Agentic AI ecosystems, including Azure OpenAI, AI Foundry, AI Hub, Bedrock, Anthropic Claude, OpenAI API, LlamaCloud, and LangChain.
- Strong understanding of token usage, prompt injection risks, jailbreak attempts, and mitigation techniques.
- Expertise in Azure DevOps / AWS CodePipeline for ML CI/CD automation.
- Proficient in Azure Blob Storage, Cosmos DB, Key Vault, AWS S3, RDS, DynamoDB, and integrations with AI services.
- Advanced understanding of networking (DNS, load balancing, VPNs, VNets) and security concepts (IAM, policies, encryption).
- Proficiency in Infrastructure as Code (IaC): Azure ARM/Bicep, Terraform, or CloudFormation.
- Knowledge of Python (with AI/ML libraries like TensorFlow, PyTorch, Scikit-learn) and scripting in Bash / PowerShell.
- Experience with containerization and orchestration using Docker and Kubernetes.
- Familiarity with Azure Bot Framework, API Management, Application Gateway, and M365 Copilot.
- Working knowledge of monitoring and logging tools such as Grafana, Prometheus, and Azure Log Analytics.
ML Engineering & Model Lifecycle Expertise
- Hands-on experience with Azure Machine Learning Studio, Python SDK (v2), and CLI (v2) for ML model management.
- Understanding of ML/DL algorithms, model training, evaluation, and deployment workflows.
- Practical exposure to CI/CD orchestration for data science pipelines and post-deployment model monitoring.
- Experience enabling production-grade ML models, including drift monitoring, model retraining, and business validation.
Security & Governance
- Familiarity with Microsoft Active Directory (AD) and the principle of least privilege for RBAC enforcement.
- Experience applying unit testing, integration testing, and CI/CD best practices within Azure DevOps (ADO).
Cloud-Specific Expertise
AWS
- Proficiency in AWS services: RDS, DynamoDB, Redshift, Aurora, EC2, EBS, EFS, Lambda, SQS, SNS, EventBridge, Step Functions, KMS, and ECR.
- Strong experience with AWS CloudFormation, CDK, and the Python SDK (Boto3).
Azure
- Expertise in Azure databases (Cosmos DB, Azure SQL Serverless), compute services (VMs, Scale Sets), and serverless components (Functions, Event Grid/Hub, Service Bus, Queue Storage).
- Experience managing AKS/ACR, Azure Machine Learning, Azure Data Lake, Azure Key Vault, and ARM/Bicep templates.