Senior Site Reliability Engineer (Heretic Stealth PortCo)
Responsibilities
- Build and extend tooling for end-to-end ML model deployment and lifecycle management
- Set up, configure, and connect cloud infrastructure services to serve as the foundation of our platform
- Automate deployment orchestration, building a fast and maintainable CI/CD pipeline for our web applications
- Set up real-time monitoring and alerting for all parts of the web platform, enabling engineering teams to respond quickly to incidents
- Build and maintain the analytics pipeline, connecting data sources to the data warehouse, then the data warehouse to the reporting platform and back to model training
- Collaborate with cross-functional teams to deploy and maintain AI models in production environments, ensuring scalability, reliability, efficiency and robustness
- Orchestrate model serving to accommodate our unique infrastructure in a scalable manner
- Configure and maintain Kubernetes clusters on Ubuntu
- Maintain backend planning and continuously optimize GPU capacity
Requirements
- Bachelor's or Master's degree in Computer Science, a related field, or equivalent work experience
- 5+ years of professional experience as a DevOps, TechOps, or Site Reliability Engineer
- Extensive experience setting up IaaS cloud platforms (GCP preferred)
- Experience scaling infrastructure for consumer-facing web applications
- Proven experience working with and scaling GPUs
- Proficiency in containerization technologies, especially Docker and Kubernetes
- Proficiency in Python and in writing scripts to automate pipelines and processes
- Extensive Linux troubleshooting experience
- Excellent problem-solving and analytical thinking skills, with strong attention to detail
- Effective verbal and written communication skills are a must
- Comfortable working in a dynamic, fast-paced, and collaborative environment
Nice to Haves
- Marketplace and/or e-commerce experience
- Experience deploying AI models in cloud-based environments (diffusion models preferred)
- Experience managing Triton inference servers
- Experience with popular machine learning libraries (e.g., TensorFlow, PyTorch, Spark)
When applying, mention the word CANDYSHOP to show you read the job post completely. This is a beta feature to avoid spam applicants. Companies can search for these words to find applicants who read this and confirm they are human.