- Own backend infrastructure for a multi-cluster, event-driven telemetry platform on AWS EKS, supporting life-critical data systems.
- Design and operate distributed services across Kubernetes, Kafka, Knative, and Keycloak, with responsibility for availability, latency, and reliability.
- Debug complex production issues across networking, databases, and event pipelines, including Cilium Cluster Mesh, PostgreSQL (CloudNativePG), and Knative Eventing.
- Led a cross-account AWS migration end-to-end: ~173 GB database recovery, DNS cutover, and infrastructure reconfiguration with zero data loss.
- Built a disaster recovery strategy covering RPO/RTO targets, backup systems, and failure modes for stateful services.
- Resolved GitOps bootstrap dependencies by introducing Terraform-based provisioning for reproducible infrastructure.
- Use AI-assisted engineering workflows to navigate large codebases, accelerate development, and ship reliable systems.
- Supported a production platform running on Kubernetes-based infrastructure, debugging issues across networking, PostgreSQL, Redis, and microservices.
- Built and maintained CI/CD pipelines and Helm-based deployment workflows for multi-region environments.
- Improved observability and incident response processes, working closely with engineering and product teams to resolve production issues.
- Designed and executed test plans for backend systems, validating PostgreSQL, Redis, and microservice integrations.
- Collaborated with engineering teams to identify defects and ensure production readiness of distributed systems.