Role Overview
We are looking for an experienced MLOps Lead with deep expertise in the Azure and AWS cloud ecosystems who can design, deploy, and manage scalable AI/ML infrastructure.
The ideal candidate should bring a strong background in cloud governance, GenAI tooling, automation, and CI/CD pipelines, with hands-on experience across modern MLOps frameworks.
Key Responsibilities
- Design, implement, and manage scalable cloud-based AI/ML infrastructure across Azure and AWS.
- Drive the end-to-end MLOps lifecycle: model deployment, monitoring, retraining, and governance.
- Enable GenAI and agentic AI platforms using Azure OpenAI, Amazon Bedrock, Anthropic Claude, LangChain, and similar tools.
- Implement CI/CD pipelines using Azure DevOps or AWS CodePipeline.
- Ensure security, observability, and compliance across ML and GenAI ecosystems.
- Manage infrastructure automation via Terraform, Bicep, CloudFormation, or similar IaC tools.
- Collaborate with data science and engineering teams to optimize ML workflows, data pipelines, and API integrations.
- Implement monitoring and alerting using Grafana, Prometheus, Azure Monitor, and Application Insights.
- Oversee networking, identity management, and role-based access controls (IAM, RBAC) across clouds.
- Support model lifecycle management: drift monitoring, retraining, technical evaluation, and business validation.
 
Technical Skills & Expertise
Cloud & MLOps Platforms
- Azure: Azure ML, Azure AI Services, Azure OpenAI, Azure Kubernetes Service (AKS), Azure Databricks, Azure AI Search, Azure Blob Storage, Cosmos DB, Azure SQL, Azure Functions, Azure Event Hubs, Azure Resource Manager (ARM), Bicep.
- AWS: SageMaker, Bedrock, Lambda, DynamoDB, S3, RDS, Redshift, ECR, CloudFormation, CDK, KMS, EventBridge, Step Functions.
 
AI/ML & Programming
- Hands-on experience in Python, with exposure to TensorFlow, PyTorch, and scikit-learn.
- Understanding of LLM tokenization, prompt injection risks, jailbreak prevention, and AI safety techniques.
- Familiarity with LangChain, LlamaCloud, Azure AI Foundry, and related frameworks.
- Experience in model monitoring, retraining, and evaluation workflows.
 
DevOps & Infrastructure
- Expertise in CI/CD pipelines, containerization (Docker, Kubernetes), and infrastructure automation.
- Strong in governance, audit logging, and security policies (Azure Policy, AWS SCPs, IAM).
- Deep understanding of networking, DNS, load balancers, VNets/VPCs, and VPNs.
- Skilled in IaC tools: Terraform, Bicep, ARM, CloudFormation.
 
Monitoring & Observability
- Experience with Grafana, Prometheus, Application Insights, Log Analytics Workspaces, and Azure Monitor.
 
Security & Access Management
- Understanding of Microsoft Active Directory (AD), least-privilege principles, IAM, and RBAC.
 
Testing & Automation
- Familiarity with unit testing and integration testing in CI/CD workflows (preferably Azure DevOps).
 
Good to Have
- Experience with Azure Bot Framework, M365 Copilot, and Azure API Management (APIM).
- Exposure to code assistants such as GitHub Copilot, Cursor, and Claude Code.
- Knowledge of Boto3 (the AWS SDK for Python) and TypeScript for IaC.
 
Preferred Background
- Strong background in cloud infrastructure engineering and machine learning operations.
- Proven ability to lead cross-functional teams and implement AI governance at scale.
- Excellent problem-solving, communication, and documentation skills.