About the Role
We are looking for a Systems or Solutions Architect with deep expertise in networking, infrastructure-as-a-service (IaaS), and cloud-scale system design to help architect and optimize AI/ML infrastructure.
The ideal candidate combines strong fundamentals in cloud architecture (AWS or equivalent), networking, compute, and storage, with hands-on experience in Kubernetes, observability, and automation.
You’ll design scalable systems that support large AI workloads, enabling efficient training, inference, and data pipelines across distributed environments.
Key Responsibilities
- Architect and scale AI/ML infrastructure across public cloud (AWS / Azure / GCP) and hybrid environments.
- Design and optimize compute, storage, and network topologies for distributed training and inference clusters.
- Build and manage containerized environments using Kubernetes, Docker, and Helm.
- Develop automation frameworks for provisioning, scaling, and monitoring infrastructure using Python, Go, and IaC (Terraform / CloudFormation).
- Partner with data science and MLOps teams to align infrastructure with AI workload requirements (GPU/CPU scaling, caching, throughput, latency).
- Implement observability, logging, and tracing using Prometheus, Grafana, CloudWatch, or OpenTelemetry.
- Drive networking automation (BGP, routing, load balancing, VPNs, service meshes) using software-defined networking (SDN) and modern APIs.
- Lead performance, reliability, and cost-optimization efforts for AI training and inference pipelines.
- Collaborate cross-functionally with product, platform, and operations teams to ensure secure, performant, and resilient infrastructure.
Required Qualifications
- Knowledge of AI/ML infrastructure patterns, including distributed training, inference pipelines, and GPU orchestration.
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
- 10+ years of experience in systems, infrastructure, or solutions architecture roles.
- Deep understanding of:
  - Cloud architecture: AWS (preferred), Azure, or GCP
  - Networking: VPC, Transit Gateway, DNS, routing, peering, load balancing, VPN
  - Compute and storage: EC2, ECS/EKS, S3, EBS, EFS, FSx, caching systems
  - Core infrastructure: virtualization, containers, distributed systems, and OS-level tuning
- Proficiency in Linux systems engineering and scripting with Python and Bash.
- Experience with Kubernetes (EKS/GKE/AKS) for large-scale workload orchestration.
- Experience with Go (Golang) for infrastructure or network automation.
- Familiarity with Infrastructure-as-Code (IaC) tools like Terraform, Ansible, or CloudFormation.
- Experience implementing monitoring and observability systems (Prometheus, Grafana, ELK, Datadog, CloudWatch).
Preferred Qualifications
- Experience with DevOps and MLOps ecosystems (SageMaker, Kubeflow, MLflow, Airflow).
- AWS or equivalent cloud certifications, such as AWS Certified Solutions Architect - Professional or AWS Certified Advanced Networking - Specialty.
- Experience in performance benchmarking, security hardening, and cost optimization for compute-intensive workloads.
- Strong collaboration skills and ability to communicate complex infrastructure concepts clearly.