About the Role
We are looking for a Systems or Solutions Architect with deep expertise in networking, infrastructure-as-a-service (IaaS), and cloud-scale system design to help architect and optimize AI/ML infrastructure.
The ideal candidate combines strong fundamentals in cloud architecture (AWS or equivalent), networking, compute, and storage, with hands-on experience in Kubernetes, observability, and automation.
You’ll design scalable systems that support large AI workloads, enabling efficient training, inference, and data pipelines across distributed environments.
Key Responsibilities
- Architect and scale AI/ML infrastructure across public cloud (AWS / Azure / GCP) and hybrid environments.
 - Design and optimize compute, storage, and network topologies for distributed training and inference clusters.
 - Build and manage containerized environments using Kubernetes, Docker, and Helm.
 - Develop automation frameworks for provisioning, scaling, and monitoring infrastructure using Python, Go, and IaC (Terraform / CloudFormation).
 - Partner with data science and MLOps teams to align infrastructure with AI workload requirements (GPU/CPU scaling, caching, throughput, latency).
 - Implement observability, logging, and tracing using Prometheus, Grafana, CloudWatch, or OpenTelemetry.
 - Drive networking automation (BGP, routing, load balancing, VPNs, service meshes) using software-defined networking (SDN) and modern APIs.
 - Lead performance, reliability, and cost-optimization efforts for AI training and inference pipelines.
 - Collaborate cross-functionally with product, platform, and operations teams to ensure secure, performant, and resilient infrastructure.
 
Required Qualifications
- Knowledge of AI/ML infrastructure patterns, including distributed training, inference pipelines, and GPU orchestration.
 - Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
 - 10+ years of experience in systems, infrastructure, or solutions architecture roles.
 - Deep understanding of:
   - Cloud architecture: AWS (preferred), Azure, or GCP
   - Networking: VPC, Transit Gateway, DNS, routing, peering, load balancing, VPN
   - Compute and storage: EC2, ECS/EKS, S3, EBS, EFS, FSx, caching systems
   - Core infrastructure: virtualization, containers, distributed systems, and OS-level tuning
 - Proficiency in Linux systems engineering and scripting with Python and Bash.
 - Experience with Kubernetes (EKS/GKE/AKS) for large-scale workload orchestration.
 - Experience with Go (Golang) for infrastructure or network automation.
 - Familiarity with Infrastructure-as-Code (IaC) tools like Terraform, Ansible, or CloudFormation.
 - Experience implementing monitoring and observability systems (Prometheus, Grafana, ELK, Datadog, CloudWatch).
 
Preferred Qualifications
- Experience with DevOps and MLOps ecosystems (SageMaker, Kubeflow, MLflow, Airflow).
 - AWS or equivalent cloud certifications, such as AWS Solutions Architect Professional or AWS Advanced Networking Specialty.
 - Experience in performance benchmarking, security hardening, and cost optimization for compute-intensive workloads.
 - Strong collaboration skills and ability to communicate complex infrastructure concepts clearly.