ioaiaaii
Senior Site Reliability Engineer
Engineer with 10+ years of experience managing large-scale cloud infrastructures. Skilled in systems performance, automation, and observability. Hands-on in production support and on-call, with a strong track record in incident response and ensuring high availability for critical systems.
Experience: 11 years
Yearly salary: $110,000
Hourly rate: $100
Nationality: 🇬🇷 Greece
Residency: 🇬🇷 Greece
Experience
Senior Site Reliability Engineer
TileDB 2023 - 2024
-Owned reliability for TileDB Cloud, managing 5 EKS & RDS clusters, leading RCAs, and building team culture, backlog, and roadmap.
-Created a strategy to reduce toil, scale IaC, Kubernetes, RDS, and TileDB Cloud products, and introduce multi-cloud, release engineering, and SLO frameworks to improve system resilience and SRE culture.
-Redesigned 300+ Auto Scaling Groups into 5 Multi-AZ workload-dedicated groups, managing bursty CPU/GPU requests (0-50+ pods/sec), mitigating scaling issues and zone outages.
-Optimized Kubernetes deployments, increasing cluster utilization from ~20% to ~70%, cutting EC2 costs, and improving uptime.
-Executed EKS upgrades (1.21 to 1.26) in self-managed node groups, creating shell-scripted and documented procedures.
-Improved team-wide incident response and infrastructure observability, adding dashboards and actionable alerts, reducing alert noise by ~80%, and mentoring engineers in SRE.
-Orchestrated five microservice migrations, introducing API Gateway, profiling, container signal handling, and Helm helpers for maintainable configurations, improving product scalability and monitoring.
-Revamped IaC, improving codebase maintainability & documentation; cut AWS CloudTrail cost by ~50%.
Senior Site Reliability Engineer
Sitecore 2022 - 2022
-Collaborated with leadership to establish the SRE function; added product telemetry and monitoring.
Senior Site Reliability Engineer
FreeNow (formerly BEAT) 2019 - 2022
-Established SLO-driven observability with Thanos, Tempo, Loki, and Grafana across 6 Kubernetes clusters (500+ nodes, 70+ microservices), tracking thousands of requests per second across LATAM.
-Migrated production traffic of six Docker Registries and ChartMuseums to a centralized Harbor Registry, enhancing security and reducing image pull times and storage costs.
-Built two Go-based tools to orchestrate microservices CI/CD with multi-stage pipelines, integrating GitHub Actions, ArgoCD, and reusable workflows, boosting developer productivity and GitOps adoption.
-Implemented SSO provisioning for AWS with OneLogin, improving access management for 200+ users.
-Developed observability documentation and contributed to hiring and onboarding.
Systems Engineer
Upstream 2014 - 2019
-Managed on-prem hybrid infrastructure across 3 data centers, VMware vSphere, Kubernetes, and AWS, overseeing 300+ bare-metal servers, 5K+ VMs, Linux images, and authoritative DNS servers.
-Optimized resource utilization across hypervisors, executing capacity planning and scaling vCenters.
-Migrated production traffic to NGINX reverse proxy farms, improving security and performance.
Systems Engineer (Intern)
University of Macedonia 2013 - 2014
-Designed the University’s IaaS Cloud with Synnefo Cloud Stack (3 bare-metal servers, 1 SAN).
Skills
aws
devops
gcp
git
grafana
kubernetes
linux
reliability
system-engineer
english