Role Overview
We are looking for an experienced MLOps Lead with deep expertise in the Azure and AWS cloud ecosystems who can design, deploy, and manage scalable AI/ML infrastructure.
The ideal candidate brings a strong background in cloud governance, GenAI tooling, automation, and CI/CD pipelines, with hands-on experience across modern MLOps frameworks.
Key Responsibilities
- Design, implement, and manage scalable cloud-based AI/ML infrastructure across Azure and AWS.
- Drive the end-to-end MLOps lifecycle: model deployment, monitoring, retraining, and governance.
- Enable GenAI and agentic AI platforms using Azure OpenAI, Amazon Bedrock, Anthropic Claude, LangChain, and similar tools.
- Implement CI/CD pipelines using Azure DevOps or AWS CodePipeline.
- Ensure security, observability, and compliance across ML and GenAI ecosystems.
- Manage infrastructure automation via Terraform, Bicep, CloudFormation, or similar IaC tools.
- Collaborate with data science and engineering teams to optimize ML workflows, data pipelines, and API integrations.
- Implement monitoring and alerting using Grafana, Prometheus, Azure Monitor, and Application Insights.
- Oversee networking, identity management, and role-based access controls (IAM, RBAC) across clouds.
- Support model lifecycle management: drift monitoring, retraining, technical evaluation, and business validation.
Technical Skills & Expertise
Cloud & MLOps Platforms
- Azure: Azure ML, Azure AI Services, Azure OpenAI, Azure Kubernetes Service (AKS), Databricks, Azure AI Search, Azure Blob Storage, Cosmos DB, Azure SQL, Azure Functions, Azure Event Hubs, Azure Resource Manager (ARM), Bicep.
- AWS: SageMaker, Bedrock, Lambda, DynamoDB, S3, RDS, Redshift, ECR, CloudFormation, CDK, KMS, EventBridge, Step Functions.
AI/ML & Programming
- Hands-on experience in Python, with exposure to TensorFlow, PyTorch, and scikit-learn.
- Understanding of LLM tokenization, prompt injection risks, jailbreak prevention, and AI safety techniques.
- Familiarity with LangChain, LlamaCloud, AI Foundry, and related frameworks.
- Experience in model monitoring, retraining, and evaluation workflows.
DevOps & Infrastructure
- Expertise in CI/CD pipelines, containerization (Docker, Kubernetes), and infrastructure automation.
- Strong in governance, audit logging, and security policies (Azure Policy, AWS SCPs, IAM).
- Deep understanding of networking: DNS, load balancers, VNets/VPCs, and VPNs.
- Skilled in IaC tools: Terraform, Bicep, ARM, and CloudFormation.
Monitoring & Observability
- Experience with Grafana, Prometheus, Application Insights, Log Analytics workspaces, and Azure Monitor.
Security & Access Management
- Understanding of Microsoft AD, least-privilege principles, IAM, and RBAC.
Testing & Automation
- Familiarity with unit testing and integration testing in CI/CD workflows (preferably Azure DevOps).
Good to Have
- Experience with Azure Bot Framework, M365 Copilot, and APIM.
- Exposure to code assistants such as GitHub Copilot, Cursor, and Claude Code.
- Knowledge of the Boto3 SDK (AWS SDK for Python) and TypeScript for IaC.
Preferred Background
- Strong background in cloud infrastructure engineering and machine learning operations.
- Proven ability to lead cross-functional teams and implement AI governance at scale.
- Excellent problem-solving, communication, and documentation skills.