Reliability Engineer

487 jobs found

web3.career is now part of the Bondex Ecosystem

Company | Location | Salary
Hyperbolic Labs | San Francisco, CA, United States | $103k - $120k
Zscaler | Remote | $119k - $170k
Zscaler | Remote | $115k - $165k
Copperco | Remote | $140k - $180k
Chainlink Labs | United States | $115k - $117k
Keyrock | Brussels, Belgium | $133k - $135k
Zora | Remote | $170k - $225k
asymmetric.re | Remote | $124k - $150k
Chainlink Labs | Argentina | $112k - $156k
Kraken | Remote | $88k - $101k
Zscaler | Remote | $161k - $230k
Layerzerolabs | Remote | $86k - $110k
Zinnia | Remote | $126k - $127k
Gsrmarkets | Remote | $80k - $100k
Douro Labs | North America | $112k - $156k

Hyperbolic Labs
$103k - $120k estimated
San Francisco, California, USA

Who We Are

Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open-Access AI Cloud. By aggregating computing resources across the globe, we offer an innovative GPU marketplace and AI inference service that promise affordability and accessibility for all. As pioneers at the intersection of AI and open-source technology, we believe in an open future where AI innovation is limited only by imagination, not by access to resources. We're looking for forward-thinking individuals who share our passion for making AI universally accessible, secure, and affordable. Join us in building a platform that empowers innovators everywhere to turn their visionary AI projects into reality.

As we prepare for growth after our Series A, our team — led by co-founders with PhDs in AI, Math, and Computer Science — is poised to redefine computing.

About the Role

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. As an aggregator of compute resources from hundreds of global suppliers, our SLOs, trust, and economic efficiency are product-critical. You'll be responsible for defining and maintaining service level objectives for job success rates, building robust incident response systems, managing capacity across our distributed GPU network, and implementing secure rollout and rollback mechanisms that keep our platform running smoothly 24/7.

In this role, you'll establish the reliability standards that define customer trust in our platform, design monitoring and alerting systems that provide deep visibility into our infrastructure, build automation for capacity management and resource allocation, lead incident response and post-mortem processes, and work closely with engineering teams to improve system resilience. You'll also focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers, implementing key management systems, and building compliance frameworks. This is a high-impact position where your work directly influences our ability to deliver on our promise of affordable, accessible AI compute at scale.

Who You Are

  • Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems

  • Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems

  • Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience

  • Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms

  • Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)

  • Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening

  • Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation

  • Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)

  • Excellent problem-solving skills with the ability to debug complex distributed-systems issues under pressure

  • Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines

Preferred Qualifications

  • Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale

  • Background in distributed systems, peer-to-peer networks, or decentralized infrastructure

  • Knowledge of multi-tenancy security patterns, container security, and runtime security tools

  • Experience with chaos engineering, fault injection, and resilience testing

  • Familiarity with cost optimization strategies for cloud infrastructure and GPU resources

  • Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)

  • Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups

  • Contributions to open-source reliability, observability, or security tools

Hyperbolic is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

What does a Reliability Engineer do?

A Reliability Engineer is responsible for ensuring the reliability and availability of an organization's systems and equipment.

They use engineering principles, statistical analysis, and data science to identify and mitigate risks, prevent failures, and optimize system performance.

Here are some of the typical tasks and responsibilities of a Reliability Engineer:

  1. Analyze data and perform statistical modeling: Reliability Engineers analyze data related to equipment performance, failure rates, and maintenance history to identify trends and patterns. They use statistical modeling to predict future failures and plan maintenance activities accordingly.
  2. Develop and implement reliability strategies: Reliability Engineers develop and implement strategies to improve the reliability and availability of equipment and systems. This may include performing root cause analysis, implementing preventive maintenance programs, and conducting failure mode and effects analysis (FMEA).
  3. Collaborate with other teams: Reliability Engineers collaborate with other teams such as operations, maintenance, and engineering to identify and address reliability issues. They may also work with suppliers to ensure the reliability of equipment and materials.
  4. Monitor and evaluate performance: Reliability Engineers monitor the performance of systems and equipment to identify areas for improvement. They use data to evaluate the effectiveness of reliability strategies and make adjustments as necessary.
  5. Provide technical support: Reliability Engineers provide technical support to other teams and stakeholders, answering questions and providing guidance on reliability-related issues.
  6. Continuously improve processes: Reliability Engineers are responsible for continuously improving reliability processes and methodologies. They stay up-to-date with the latest technologies and best practices in the field and identify opportunities for improvement.
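
As a rough illustration of the data analysis described above, the following minimal Python sketch derives three common reliability figures from a list of outage records: mean time between failures (MTBF), mean time to repair (MTTR), and availability. The `Outage` record, the `reliability_summary` function, the sample outages, and the 720-hour window are all hypothetical and chosen only for illustration; they do not come from any listing on this page.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One outage, in hours since the start of the observation window (hypothetical record)."""
    start_hour: float
    end_hour: float


def reliability_summary(outages, window_hours):
    """Compute MTBF, MTTR, and availability over an observation window."""
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = window_hours - downtime
    count = len(outages)
    return {
        # Average operating time between failures.
        "mtbf_hours": uptime / count if count else float("inf"),
        # Average time taken to restore service after a failure.
        "mttr_hours": downtime / count if count else 0.0,
        # Fraction of the window during which the system was up.
        "availability": uptime / window_hours,
    }


# Illustrative data: two outages in a 30-day (720-hour) window.
outages = [Outage(100.0, 100.5), Outage(400.0, 400.25)]
summary = reliability_summary(outages, window_hours=720.0)

# A 99.9% availability target allows 0.1% of the window as downtime.
target = 0.999
allowed_downtime_hours = (1 - target) * 720.0

print(summary)
print("meets target:", summary["availability"] >= target)
```

With these sample numbers the total downtime is 0.75 hours, slightly more than the roughly 0.72 hours that a 99.9% target would allow over 720 hours, so the target is missed. Real analyses would typically draw on monitoring or maintenance data and may use more detailed statistical models than these simple averages.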