ioaiaaii

Senior Site Reliability Engineer

Engineer with 10+ years of experience managing large-scale cloud infrastructures. Skilled in systems performance, automation, and observability. Hands-on in production support and on-call, with a strong track record in incident response and ensuring high availability for critical systems.

Experience: 11 years

Yearly salary: $110,000

Hourly rate: $100

Nationality: 🇬🇷 Greece

Residency: 🇬🇷 Greece

Experience

Senior Site Reliability Engineer

TileDB

2023 - 2024

-Owned reliability for TileDB Cloud, managing 5 EKS & RDS clusters, leading RCAs, and building team culture, backlog, and roadmap.

-Created a strategy to reduce Toil, scale IaC, Kubernetes, RDS, and TileDB Cloud products, and introduce multi-cloud, release engineering, and SLO frameworks to improve system resilience and SRE culture.

-Redesigned 300+ Auto Scaling Groups into 5 Multi-AZ workload-dedicated groups, managing bursty CPU/GPU requests (0-50+ pods/sec), mitigating scaling issues and zone outages.

-Optimized Kubernetes deployments, increasing cluster utilization from ~20% to ~70%, cutting EC2 costs, and improving uptime.

-Executed EKS upgrades (1.21 to 1.26) in self-managed node groups, creating shell-scripted and documented procedures.

-Improved team-wide incident response and infrastructure observability, adding dashboards and actionable alerts, reducing alert noise by ~80%, and mentoring engineers in SRE.

-Orchestrated five microservice migrations, introducing API Gateway, profiling, container signal handling, and Helm helpers for maintainable configurations, improving product scalability and monitoring.

-Revamped IaC, improving codebase maintainability & documentation; cut AWS CloudTrail cost by ~50%.

Senior Site Reliability Engineer

Sitecore

2022 - 2022

-Collaborated with leadership to establish the SRE function; added product telemetry and monitoring.

Senior Site Reliability Engineer

FreeNow (formerly BEAT)

2019 - 2022

-Established SLO-driven observability with Thanos, Tempo, Loki, and Grafana across 6 Kubernetes clusters (500+ nodes, 70+ microservices), tracking thousands of requests per second across LATAM.

-Migrated production traffic of six Docker Registries and ChartMuseums to a centralized Harbor Registry, enhancing security and reducing image pull times and storage costs.

-Built two Go-based tools to orchestrate microservices CI/CD with multi-stage pipelines, integrating GitHub Actions, ArgoCD, and reusable workflows, boosting developer productivity and GitOps adoption.

-Implemented SSO provisioning for AWS with OneLogin, improving access management for 200+ users.

-Developed Observability documentation and contributed to hiring and onboarding.

Systems Engineer

Upstream

2014 - 2019

-Managed on-prem hybrid infrastructure across 3 Data Centers, VMware vSphere, Kubernetes, and AWS, overseeing 300+ bare-metal servers, 5K+ VMs, Linux Images, and Authoritative DNS servers.

-Optimized resource utilization across hypervisors, executing capacity planning and scaling vCenters.

-Migrated production traffic to NGINX Reverse Proxy farms, improving security and performance.

Systems Engineer (Intern)

University of Macedonia

2013 - 2014

-Designed the University’s IaaS Cloud with Synnefo Cloud Stack (3 bare-metal servers, 1 SAN).

Skills

aws

devops

gcp

git

grafana

kubernetes

linux

reliability

system-engineer

english
