Reliability Engineer

502 jobs found

web3.career is now part of the Bondex Ecosystem

Job Position	Company	Posted	Location	Salary	Tags
Site Reliability Engineer Observability	Ripple	1d	New York, NY, United States	$160k - $200k	engineer reliability aws crypto
Site Reliability Engineer Observability	Ripple	1d	Chicago, IL, United States	$160k - $200k	engineer reliability aws crypto
Staff Site Reliability Engineer	Zscaler	2d	Remote	$115k - $165k	engineer reliability aws golang java
Principal Site Reliability Engineer	Zscaler	2d	Remote	$161k - $230k	engineer executive reliability aws golang
Site Reliability Engineer	Layerzerolabs	2d	Remote	$86k - $110k	engineer reliability blockchain golang kubernetes
Principal Site Reliability Engineer	Copperco	2d	Remote	$140k - $180k	engineer executive reliability aws blockchain
Senior Site Reliability Engineer	Bloxstaking	2d	Remote	$82k - $112k	engineer reliability senior blockchain ethereum
Senior Site Reliability Engineer CCIP	Chainlink Labs	2d	Charlotte, NC, United States	$110k - $112k	engineer reliability senior blockchain crypto
Platform Engineer Site Reliability Engineering	Bitso	9d	Latin America	$98k - $109k	engineer reliability
Senior Site Reliability Engineer	Chainalysis	14d	Tel Aviv, Israel	$88k - $150k	engineer reliability senior aws blockchain
Site Reliability Engineer Core Infrastructure	Kraken	16d	Paraguay	$80k - $101k	infrastructure engineer reliability aws crypto
DevOps Site Reliability Engineer	Okx	21d	Remote	$140k - $144k	devops engineer reliability aws blockchain
Site Reliability Engineer AI Agents	Kraken	21d	United States	$96k - $192k	ai engineer reliability aws crypto
Site Reliability Engineer	Alpaca	1mo	Remote	$119k - $135k	engineer reliability crypto kubernetes remote
Sr. Staff Site Reliability Engineer Federal Security Clearance	Zscaler	2mo	Remote	$140k - $200k	engineer reliability security senior aws

Site Reliability Engineer Observability

Ripple

$160k - $200k

New York, NY, United States

At Ripple, we’re building a world where value moves like information does today. It’s big, it’s bold, and we’re already doing it. Through our crypto solutions for financial institutions, businesses, governments and developers, we are improving the global financial system and creating greater economic fairness and opportunity for more people, in more places around the world. And we get to do the best work of our career and grow our skills surrounded by colleagues who have our backs.

If you’re ready to see your impact and unlock incredible career growth opportunities, join us, and build real world value.

Ripple Treasury, now a Ripple solution acquired in 2025, marks a significant expansion into the multi-trillion-dollar corporate finance arena. With more than 40 years of experience supporting some of the world’s largest and most sophisticated companies, Ripple Treasury integrates a treasury command center into Ripple’s technology stack—giving corporates the ability to move, manage, and optimize liquidity in real-time, across traditional and digital assets, under one expanded umbrella.

THE WORK:

This is an engineering-first role with a coaching dimension—not the other way around. You will spend the majority of your time doing hands-on observability and reliability engineering work: building instrumentation, designing alert configurations, authoring Terraform, and troubleshooting production systems. Alongside that, you will coach and consult with stream-aligned product teams, helping them build operational maturity over time.

You will join Ripple’s Technical Operations team and work across Azure (80%) and AWS (20%) environments supporting infrastructure that is predominantly Windows-based (80%), handling significant payment volume for enterprise treasury customers. The incident management program you will help build is early-stage—you will be establishing practices, not inheriting a mature playbook.

WHAT YOU’LL DO:

Observability Engineering
- Design and implement monitoring, alerting, and dashboards in New Relic (APM, Infrastructure, Logs, Synthetics) across Azure and AWS; write NRQL queries for troubleshooting, analysis, and reporting.
- Define and implement SLOs/SLIs and error budgets; coach teams on using them to balance feature velocity with reliability and communicate system health to stakeholders.
- Lead alert noise reduction and signal quality engineering—tune thresholds, eliminate false positives, and ensure every alert is actionable.
- Optimize observability costs through log ingestion management, pipeline rules, and New Relic configuration governance.
- Partner with engineering teams to improve observability maturity: structured logging, metrics instrumentation (RED/USE methods), distributed tracing, and effective dashboard patterns.
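As a rough illustration of the SLO/SLI and error-budget work described above, here is a minimal sketch of the usual arithmetic, assuming a request-based availability SLI. The function name, the 99.9% target, and the request counts are hypothetical and are not taken from the posting or from Ripple's systems.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize the SLI and error-budget consumption for one SLO window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Fraction of the budget already spent (above 1.0 means the SLO is breached).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    print(error_budget_report(0.999, 1_000_000, 250))
```

With these made-up numbers, about a quarter of the budget is spent. A team could use that figure to judge whether a risky release still fits within the window.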
Infrastructure & IaC
- Develop and maintain Terraform infrastructure as code for provisioning and managing monitoring resources, alert configurations, and observability infrastructure—this is a primary engineering responsibility, not an occasional task.
- Establish and enforce IaC governance standards for observability infrastructure across teams, providing a repeatable, auditable model for how monitoring resources are managed.
- Author and troubleshoot Azure DevOps pipelines; support teams with deployment visibility, change tracking, and release hygiene as it relates to production reliability.
Incident Management
- Administer and configure Incident.IO: alert routing, notification workflows, Slack and OpsGenie integration, and runbook management—operationalizing what exists today and expanding from there.
- Build out incident management foundations that are largely yours to establish: PIR/postmortem processes, on-call rotation design, escalation policies, incident severity classification, and response playbooks.
- Track and report on MTTR, MTTD, and incident frequency; identify trends and drive continuous improvement in partnership with engineering teams.
- Respond to and debrief on production incidents—providing real-time troubleshooting support and facilitating structured post-incident reviews.
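For the MTTR and MTTD reporting mentioned above, a minimal sketch might look like the following. Definitions vary between organizations. This version assumes MTTD is the mean time from incident start to detection and MTTR is the mean time from incident start to resolution. The `Incident` record and its field names are hypothetical, not an incident.io or New Relic schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps needed here."""
    started: datetime
    detected: datetime
    resolved: datetime


def mean_duration(durations: Iterable[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty collection of durations."""
    items = list(durations)
    if not items:
        raise ValueError("at least one incident is required")
    return sum(items, timedelta()) / len(items)


def mttd(incidents: Iterable[Incident]) -> timedelta:
    """Mean time to detect: detection time minus start time (assumed definition)."""
    return mean_duration(i.detected - i.started for i in incidents)


def mttr(incidents: Iterable[Incident]) -> timedelta:
    """Mean time to resolve: resolution time minus start time (assumed definition)."""
    return mean_duration(i.resolved - i.started for i in incidents)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Incident(datetime(2025, 1, 6, 10, 0), datetime(2025, 1, 6, 10, 5), datetime(2025, 1, 6, 11, 0)),
        Incident(datetime(2025, 1, 9, 12, 0), datetime(2025, 1, 9, 12, 15), datetime(2025, 1, 9, 12, 30)),
    ]
    print("MTTD:", mttd(sample))
    print("MTTR:", mttr(sample))
```

Tracking these values per month or per severity level, rather than as one global average, usually makes trends easier to spot.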
Cross-Functional Enablement
- Enable stream-aligned engineering teams to adopt improved observability and incident management practices through workshops, consultation, and hands-on guidance.
- Collaborate with the Subsystems Platform Team to translate common needs into self-service observability and incident management capabilities.
- Build lasting team competency through documentation, training materials, and knowledge-sharing sessions that outlast any individual engagement.

WHAT YOU'LL BRING:

Core SRE Experience
- 7+ years in Site Reliability Engineering, DevOps, or Platform Engineering with a strong focus on observability and production operations.
- Proven ability to deliver hands-on engineering work while coaching and mentoring teams—comfortable switching between builder and consultant modes.
- Experience working in Agile/Scrum environments and collaborating effectively with cross-functional teams.
Observability & Incident Management Expertise — Required
- Expert-level hands-on experience with New Relic (APM, Infrastructure, Logs, Synthetics, Alerts) and strong NRQL proficiency for troubleshooting and analysis.
- Deep understanding of structured logging, metrics collection (RED/USE methods), distributed tracing, and designing effective dashboards and alerts.
- Expertise defining and implementing SLOs/SLIs and error budgets for reliability management.
- Hands-on experience with incident management platforms (Incident.IO, PagerDuty, OpsGenie, or similar).
- Experience designing incident response workflows, on-call rotations, escalation policies, and facilitating post-incident reviews that drive actionable improvements.
- Demonstrated ability to troubleshoot complex production issues using observability data across distributed systems.
Infrastructure & Tools — Required
- Strong Terraform experience: developing and maintaining IaC for cloud infrastructure and monitoring resources; familiarity with IaC governance patterns.
- Proficiency with PowerShell scripting (required given the 80% Windows environment).
- Strong experience with Azure cloud (App Services, Virtual Machines, Azure SQL, networking, monitoring) and working knowledge of AWS.
- Experience with Azure DevOps for CI/CD pipeline authoring and troubleshooting.
- Experience with Octopus Deploy for deployment management and release orchestration.
- Comfort working across both Windows and Linux server environments.
- Familiarity with Slack for operational workflows, alert routing, and incident communication.
Desired / Additional
- Experience with alert noise reduction strategies and observability cost optimization (log ingestion, pipeline rules, cardinality management).
- Background facilitating chaos engineering, game day exercises, or failure injection to build team resilience.
- Knowledge of VM-hosted SQL Server monitoring and performance optimization.
- Familiarity with FinTech compliance requirements (SOC 2, ISO 27001) and audit evidence collection.
- Experience measuring and improving key reliability metrics (MTTR, MTTD, availability, error budgets) at an organizational level.
- Python or Bash scripting experience in addition to PowerShell.
- Familiarity with Jira for incident tracking and workflow automation.
Other common names for this role: Senior Site Reliability Engineer, Observability Engineer, Incident Management Engineer

For positions that will be based in NY, the annual salary range for this position is below. Actual salaries may vary based on numerous factors including, among other things, an individual applicant’s experience and qualifications for the position. This range does not include equity or additional compensation, such as bonuses or commissions.

NY Annual Base Salary Range

$160,000—$200,000 USD

WHO WE ARE:

Do Your Best Work

The opportunity to build in a fast-paced start-up environment with experienced industry leaders
A learning environment where you can dive deep into the latest technologies and make an impact. A professional development budget to support other modes of learning.
Thrive in an environment where no matter what race, ethnicity, gender, origin, or culture they identify with, every employee is a respected, valued, and empowered part of the team.
In-office collaboration for moments that matter is important to our culture, and we give managers and teams the flexibility to decide which 10+ days a month they come in.
Bi-weekly all-company meeting - business updates and ask me anything
We come together for moments that matter which include team offsites, team bonding activities, happy hours and more!

Take Control of Your Finances

Competitive salary, bonuses, and equity
Competitive benefits that cover physical and mental healthcare, retirement, family forming, and family support
Employee giving match
Mobile phone stipend

Take Care of Yourself

R&R days so you can rest and recharge
Generous wellness reimbursement and weekly onsite & virtual programming
Generous vacation policy - work with your manager to take time off when you need it
Industry-leading parental leave policies. Family planning benefits.
Catered lunches, fully-stocked kitchens with premium snacks/beverages, and plenty of fun events

Benefits listed above are for full-time employees.

Ripple is an Equal Opportunity Employer. We’re committed to building a diverse and inclusive team. We do not discriminate against qualified employees or applicants because of race, color, religion, gender identity, sex, sexual identity, pregnancy, national origin, ancestry, citizenship, age, marital status, physical disability, mental disability, medical condition, military status, or any other characteristic protected by local law or ordinance.

Please find our UK/EU Applicant Privacy Notice and our California Applicant Privacy Notice for reference.

What does a Reliability Engineer do?

A Reliability Engineer is a professional responsible for ensuring the reliability and availability of an organization's systems and equipment.

They use their knowledge of engineering principles, statistical analysis, and data science to identify and mitigate risks, prevent failures, and optimize system performance.

Here are some of the typical tasks and responsibilities of a Reliability Engineer:

- Analyze data and perform statistical modeling: Reliability Engineers analyze data related to equipment performance, failure rates, and maintenance history to identify trends and patterns. They use statistical modeling to predict future failures and plan maintenance activities accordingly.
- Develop and implement reliability strategies: Reliability Engineers develop and implement strategies to improve the reliability and availability of equipment and systems. This may include performing root cause analysis, implementing preventive maintenance programs, and conducting failure mode and effects analysis (FMEA).
- Collaborate with other teams: Reliability Engineers collaborate with other teams such as operations, maintenance, and engineering to identify and address reliability issues. They may also work with suppliers to ensure the reliability of equipment and materials.
- Monitor and evaluate performance: Reliability Engineers monitor the performance of systems and equipment to identify areas for improvement. They use data to evaluate the effectiveness of reliability strategies and make adjustments as necessary.
- Provide technical support: Reliability Engineers provide technical support to other teams and stakeholders, answering questions and providing guidance on reliability-related issues.
- Continuously improve processes: Reliability Engineers are responsible for continuously improving reliability processes and methodologies. They stay up-to-date with the latest technologies and best practices in the field and identify opportunities for improvement.
