Role Overview  
We are looking for an experienced MLOps Lead with deep expertise in the Azure and AWS cloud ecosystems who can design, deploy, and manage scalable AI/ML infrastructure.
The ideal candidate brings a strong background in cloud governance, GenAI tooling, automation, and CI/CD pipelines, with hands-on experience across modern MLOps frameworks.
Key Responsibilities  
- Design, implement, and manage scalable cloud-based AI/ML infrastructure across Azure and AWS.
- Drive the end-to-end MLOps lifecycle: model deployment, monitoring, retraining, and governance.
- Enable GenAI and agentic AI platforms using Azure OpenAI, Bedrock, Anthropic Claude, LangChain, and similar tools.
- Implement CI/CD pipelines using Azure DevOps or AWS CodePipeline.
- Ensure security, observability, and compliance across ML and GenAI ecosystems.
- Manage infrastructure automation with Terraform, Bicep, CloudFormation, or similar IaC tools.
- Collaborate with data science and engineering teams to optimize ML workflows, data pipelines, and API integrations.
- Implement monitoring and alerting using Grafana, Prometheus, Azure Monitor, and Application Insights.
- Oversee networking, identity management, and role-based access controls (IAM, RBAC) across clouds.
- Support model lifecycle management: drift monitoring, retraining, technical evaluation, and business validation.
 
Technical Skills & Expertise  
Cloud & MLOps Platforms  
- Azure: Azure ML, Azure AI Services, Azure OpenAI, Azure Kubernetes Service (AKS), Databricks, Azure Search, Azure Blob Storage, Cosmos DB, Azure SQL, Azure Functions, Azure Event Hubs, Azure Resource Manager (ARM), Bicep.
- AWS: SageMaker, Bedrock, Lambda, DynamoDB, S3, RDS, Redshift, ECR, CloudFormation, CDK, KMS, EventBridge, Step Functions.
 
AI/ML & Programming  
- Hands-on experience in Python, with exposure to TensorFlow, PyTorch, and scikit-learn.
- Understanding of LLM tokenization, prompt injection risks, jailbreak prevention, and AI safety techniques.
- Familiarity with LangChain, LlamaCloud, AI Foundry, and related frameworks.
- Experience in model monitoring, retraining, and evaluation workflows.
 
DevOps & Infrastructure  
- Expertise in CI/CD pipelines, containerization (Docker, Kubernetes), and infrastructure automation.
- Strong in governance, audit logging, and security policies (Azure Policy, AWS SCPs, IAM).
- Deep understanding of networking, DNS, load balancers, VNets/VPCs, and VPNs.
- Skilled in IaC tools: Terraform, Bicep, ARM, CloudFormation.
 
Monitoring & Observability  
- Experience with Grafana, Prometheus, Application Insights, Log Analytics workspaces, and Azure Monitor.
 
Security & Access Management  
- Understanding of Microsoft AD, least-privilege principles, IAM, and RBAC.
 
Testing & Automation  
- Familiarity with unit testing and integration testing in CI/CD workflows (preferably Azure DevOps).
 
Good to Have  
- Experience with Azure Bot Framework, M365 Copilot, and Azure API Management (APIM).
- Exposure to code assistants such as GitHub Copilot, Cursor, and Claude Code.
- Knowledge of the Boto3 SDK (AWS SDK for Python) and TypeScript for IaC.
 
Preferred Background  
- Strong background in cloud infrastructure engineering and machine learning operations.
- Proven ability to lead cross-functional teams and implement AI governance at scale.
- Excellent problem-solving, communication, and documentation skills.