Reliability Engineer

443 jobs found

Company          Location                            Salary
Blockdaemon      New York, NY, United States         $90k - $145k
Blockdaemon      India                               $90k - $145k
Scrollio         Remote                              $133k - $135k
Layerzerolabs    Remote                              $86k - $110k
Dfinity          Remote                              $122k - $141k
Seedify          Remote                              $36k - $48k
Gsrmarkets       Remote                              $80k - $100k
Chainlink Labs   San Francisco, CA, United States    $98k - $112k
Coinbase         Remote                              $186k - $218k
Zscaler          Remote                              $115k - $165k
Tenderly         Remote                              (not listed)
Avalabs          Remote                              $90k - $100k
CleanSpark       Las Vegas, NV, United States        $90k - $110k
Zamp             Bangalore, India                    $77k - $85k

Blockdaemon
$90k - $145k estimated
New York, NY, United States

Position Overview

As a Site Reliability Engineer (SRE), you will play a critical role supporting our Blockdaemon team by ensuring the reliability, scalability, and performance of our systems and services. You will collaborate closely with cross-functional teams to design, implement, and maintain robust and resilient infrastructure solutions in a Multi-Cloud environment.

The ideal candidate is passionate about automation, possesses strong analytical skills, and thrives in a fast-paced, dynamic environment.

Blockdaemon is a Blockchain Infrastructure Company operating in a multi-cloud configuration with a global footprint. This role calls for a candidate capable of supporting the systems and infrastructure stack across the major clouds: Google Cloud Platform (GCP), Amazon Web Services (AWS), and Azure.

Your Impact

  • System Architecture and Design: Collaborate with software engineering teams to design scalable, highly available, and resilient systems. Drive architectural improvements to enhance system reliability and performance.

  • Implement Infrastructure as Code to manage services and deployments in a multi-cloud, multi-project configuration.

  • Automation and Tooling: Develop automation tools and scripts to streamline deployment, monitoring, and incident response processes. Implement and maintain infrastructure as code frameworks.

  • Monitoring and Alerting: Configure and maintain monitoring systems to detect and mitigate potential issues proactively. Define alerting thresholds and response procedures to ensure timely incident resolution.

  • Incident Management: Respond to and resolve critical incidents, perform root cause analysis, and implement preventive measures to minimize the likelihood of recurrence. Participate in an on-call rotation to provide 24/7 support as needed.

  • Capacity Planning and Performance Optimization: Analyze system performance metrics, identify bottlenecks, and propose optimizations to improve resource utilization and efficiency.

  • Security and Compliance: Work closely with security teams to implement best practices for data protection, access control, and compliance with regulatory requirements. Conduct periodic security audits and vulnerability assessments.

  • Documentation and Knowledge Sharing: Document system configurations, procedures, and troubleshooting steps. Share knowledge and best practices with team members to foster a culture of continuous learning and improvement.

Role Requirements

Must Have:

  • Proven experience in an independent contributor role working with cloud platforms (GCP, AWS, Azure), Infrastructure-as-Code tooling (Terraform, Helm), and CI/CD orchestration platforms (GitLab CI, Argo CD, GitHub Actions, or similar GitOps workflows).

  • Excellent problem-solving skills and the ability to independently troubleshoot complex issues.

  • Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams.

  • Strong Architectural & Security Mindset.

Should Have:

  • Strong understanding of Linux/Unix systems administration and networking concepts.

  • Hands-on experience with configuring and running monitoring tools like Prometheus, Grafana, etc.

  • 5+ years of experience maintaining infrastructure-as-code on Google Cloud Platform, Amazon Web Services, and Azure.

  • Experience working in SOC 2 Type 1 and Type 2 certified companies.

Nice-to-Have:

  • Proficiency in scripting and programming languages such as Bash, Golang, Python, and TypeScript.

  • 2+ years of hands-on experience operating highly available Kubernetes clusters.

  • Experience being involved in incident management and resolution.

  • Experience with AI development tools and related security considerations.

  • Passion for the Blockchain Industry & Decentralised Systems.

  • Experience with Blockchain Infrastructure, either in a personal or professional capacity.

About Us:


We Power the Blockchain economy.


Blockdaemon powers the blockchain economy with its suite of industry-leading infrastructure solutions. We are a globally established, ISO-27001 certified partner with extensive protocol coverage, offering technical depth, industry-leading SLAs, 70+ global points of presence through 10+ cloud and bare metal providers, and 24/7 support for an unmatched institutional-grade experience. We provide integrated business solutions to exchanges, custodians, crypto platforms, financial institutions, and developers using our end-to-end suite of blockchain tools, including dedicated nodes, APIs, staking, liquid staking, MPC tech, and more. Blockdaemon provides its customers with the confidence to quickly and easily scale without compromising security or compliance.


We are a globally distributed team.


Blockdaemon is an Equal Opportunity Employer.

What does a Reliability Engineer do?

A Reliability Engineer is a professional who is responsible for ensuring the reliability and availability of systems and equipment in an organization.

They use their knowledge of engineering principles, statistical analysis, and data science to identify and mitigate risks, prevent failures, and optimize system performance.

Here are some of the typical tasks and responsibilities of a Reliability Engineer:

  1. Analyze data and perform statistical modeling: Reliability Engineers analyze data related to equipment performance, failure rates, and maintenance history to identify trends and patterns. They use statistical modeling to predict future failures and plan maintenance activities accordingly.
  2. Develop and implement reliability strategies: Reliability Engineers develop and implement strategies to improve the reliability and availability of equipment and systems. This may include performing root cause analysis, implementing preventive maintenance programs, and conducting failure mode and effects analysis (FMEA).
  3. Collaborate with other teams: Reliability Engineers collaborate with other teams such as operations, maintenance, and engineering to identify and address reliability issues. They may also work with suppliers to ensure the reliability of equipment and materials.
  4. Monitor and evaluate performance: Reliability Engineers monitor the performance of systems and equipment to identify areas for improvement. They use data to evaluate the effectiveness of reliability strategies and make adjustments as necessary.
  5. Provide technical support: Reliability Engineers provide technical support to other teams and stakeholders, answering questions and providing guidance on reliability-related issues.
  6. Continuously improve processes: Reliability Engineers are responsible for continuously improving reliability processes and methodologies. They stay up-to-date with the latest technologies and best practices in the field and identify opportunities for improvement.
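
As a rough illustration of the data analysis mentioned in item 1, the sketch below computes three common reliability figures from a failure log: mean time between failures (MTBF), mean time to repair (MTTR), and availability. The Outage record, the reliability_summary function, and the sample numbers are all hypothetical and chosen only for illustration; real teams typically pull this data from a monitoring or maintenance system and may define these metrics slightly differently.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One failure: when it started and how long the repair took (hours)."""
    start_hour: float
    repair_hours: float


def reliability_summary(outages: list[Outage], period_hours: float) -> dict[str, float]:
    """Estimate MTBF, MTTR and availability over one observation period."""
    if not outages:
        raise ValueError("need at least one outage to estimate MTBF/MTTR")
    downtime = sum(o.repair_hours for o in outages)
    uptime = period_hours - downtime
    mtbf = uptime / len(outages)    # mean operating time between failures
    mttr = downtime / len(outages)  # mean time spent repairing each failure
    availability = mtbf / (mtbf + mttr)
    return {"mtbf": mtbf, "mttr": mttr, "availability": availability}


# Hypothetical data: three outages in a 720-hour (30-day) month.
log = [Outage(100.0, 2.0), Outage(350.0, 1.0), Outage(600.0, 3.0)]
summary = reliability_summary(log, period_hours=720.0)
print(summary)
```

With these made-up numbers the system was down for 6 of 720 hours, giving an MTBF of 238 hours and an MTTR of 2 hours. Tracking how these figures move over time is one simple way to judge whether a reliability strategy is working.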