Job description
About the Role
We are seeking a highly skilled IT Solutions & GPU Infrastructure Lead to take complete ownership of our GPU-based server infrastructure.
This role focuses on next-generation GPU systems used for AI/ML workloads, covering every aspect from data center colocation and setup to GPU slicing, MIG management, resource allocation, optimization, and compliance.
You will lead the end-to-end lifecycle of GPU infrastructure — ensuring all servers are optimized, secure, and production-ready for both internal and customer use.
Key Responsibilities
1.
Colocation & Infrastructure Setup
Take full ownership of GPU colocation and end-to-end infrastructure setup.
Coordinate with data centers for rack installation, power, and cooling.
Deploy and configure GPU-based servers for production readiness.
2.
GPU & AI/ML Infrastructure
Manage GPU slicing and MIG (Multi-Instance GPU) for multi-tenant workloads.
Install and maintain the NVIDIA software stack — CUDA, cuDNN, NCCL, and DCGM.
Optimize GPU infrastructure for AI/ML workloads (TensorFlow, PyTorch, RAPIDS).
Support multi-GPU scaling using NVLink and PCIe passthrough.
3.
Systems & Virtualization
Administer Linux-based environments (Ubuntu, CentOS, Rocky Linux) and other operating environments as needed.
Manage virtualization platforms such as VMware, KVM, or Proxmox with GPU passthrough.
Handle container orchestration with Docker and Kubernetes, including the NVIDIA GPU Operator.
Integrate high-performance storage (NFS, Ceph, SAN/NAS) for large-scale datasets.
4.
Monitoring & Performance Optimization
Monitor GPU and system performance using Prometheus, Grafana, NVIDIA DCGM, and nvidia-smi.
Proactively detect, analyze, and resolve GPU or system bottlenecks.
Optimize GPU nodes for training and inference performance.
Implement structured logging, alerts, and usage reporting.
Administer, manage, monitor, and maintain GPU infrastructure for AI workloads.
5.
Security & Compliance
Harden GPU servers for multi-tenant workloads.
Manage driver, firmware, and software license compliance.
Ensure infrastructure security and audit readiness with periodic patching and updates.
6.
Networking & High-Performance I/O
Configure and maintain high-speed network fabrics (InfiniBand, RDMA, RoCE).
Optimize low-latency interconnects for distributed GPU workloads.
Troubleshoot and enhance data transfer performance.
7.
Customer & Infrastructure Ownership
Serve as the primary contact for GPU resource allocation.
Provision GPU slices or MIG instances for internal and external teams.
Troubleshoot, document, and optimize workload performance.
Qualifications
Proven experience in data center server setup and colocation.
Deep expertise in GPU server administration (NVIDIA A100/H100 or equivalent).
Strong working knowledge of GPU slicing, MIG, CUDA, NCCL, and NVIDIA drivers.
Experience with Linux administration, virtualization (VMware/KVM/Proxmox), and containers (Docker/Kubernetes).
Hands-on experience with AI/ML frameworks such as TensorFlow and PyTorch.
Familiarity with monitoring tools (Prometheus, Grafana, DCGM).
Knowledge of storage systems (NFS, Ceph) and high-performance networking.
Strong vendor coordination and infrastructure management skills.
Why This Role Matters
This position owns the entire lifecycle of GPU-based infrastructure — from colocation to slicing, monitoring, and optimization.
You will build and maintain the backbone of our AI/ML infrastructure, ensuring that all systems are efficient, scalable, and production-grade.