About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile.
Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation.
With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.
About the Role:
As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team.
You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems.
You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.
What You’ll Do:
- Architect observability and incident response pipelines for LLM, API, and backend systems
- Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
- Lead high-severity incident response, root cause analysis, and system recovery
- Collaborate with AI, Platform, and Security teams to enforce operational guardrails
- Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
- Guide infrastructure tuning, capacity planning, and cost optimization
- Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
- Support AIOps, model observability, policy enforcement, and audit readiness
- Mentor senior SREs and foster a high-ownership, technical excellence culture
What You’ll Bring:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
- 7-12 years in SRE, infrastructure, or platform roles in distributed systems
- Strong experience in incident management, AI/ML observability, and performance engineering
- Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
- Proficiency in Python, Java, Bash/PowerShell, and YAML
- Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes
- Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, and MongoDB
- Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
- Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
- Background in compliance, governance, and enterprise risk management for AI systems
- Advanced debugging skills across data, infrastructure, networking, and app layers
- Leadership in chaos engineering, SLO-based operations, and system resilience
Must Have Skills:
- Application & Microservice: Java, Spring Boot, API & Service Design
- Any CI/CD Tools: GitLab Pipelines/Test Automation/GitHub Actions/Jenkins/CircleCI
- App Platform: Docker & Containers (Kubernetes)
- Any Databases: SQL & NoSQL (Cassandra/Oracle/Snowflake/MongoDB)
- Any Messaging: Kafka, RabbitMQ
- Any Observability/Monitoring: Splunk/Grafana/OpenTelemetry/ELK Stack/Datadog/New Relic/Prometheus
- Incident/Change/Problem Management
Nice To Have:
- Compliance-aligned continuity planning (PCI, SOX)
- Error-budget pacts with product/org leadership
- Executive Incident/Change/Problem/Risk reporting
- Observability cost vs coverage trade-offs
- Org-wide reliability governance strategy