Reliability Engineer

340 jobs found

Company | Location | Salary
Pagoda | EMEA | $91k - $100k
NEAR | London, United Kingdom | $91k - $100k
Aztec | Remote | $165k - $209k
Mythical East | Lisbon, Portugal | $80k - $100k
Startale | Remote | $103k - $117k
Osmosis | United States |
RockTree Capital | Remote | $70k - $120k
Circle | San Francisco, CA, United States | $147k - $195k
Circle | Washington, United States | $147k - $195k
Circle | San Francisco, CA, United States | $147k - $195k
Circle | London, United Kingdom | $103k - $117k
Circle | Los Angeles, CA, United States | $147k - $195k
Ava Labs | Seoul, South Korea | $99k - $124k
Token Metrics | Remote | $90k - $95k
Token Metrics | Remote | $90k - $95k
Token Metrics | Remote | $90k - $95k
Token Metrics | Remote | $90k - $95k
Token Metrics | Remote | $90k - $95k
Token Metrics | Islamabad, Pakistan | $90k - $95k
Token Metrics | Hyderabad, India | $90k - $95k
Token Metrics | Ho Chi Minh City, Vietnam | $90k - $95k
Token Metrics | Delhi, India | $90k - $95k
Token Metrics | Remote | $90k - $95k
Token Metrics | Bucharest, Romania | $90k - $95k
Token Metrics | Bengaluru, India | $90k - $95k

Pagoda
$91k - $100k est.

About Pagoda

Pagoda is a technology services firm dedicated to developing core components for the NEAR Ecosystem. We believe that re-inventing how software is made and distributed is our greatest opportunity to open economic access to those who are not fully integrated into the global economy. Our products empower people to find opportunity, invent new experiences, and collaborate. Let's build an Open Web world. A world where people control their assets, data, and power of governance.

About The Role

Pagoda is seeking a passionate and experienced Senior Site Reliability Engineer (SRE) to join our team in building a resilient and scalable infrastructure for the NEAR blockchain ecosystem. This is your chance to help us define the future of Web3 by creating robust, self-healing services that support the next generation of decentralized applications.

What You'll Be Doing

  • Collaborate with engineering to ensure seamless 24/7 service uptime, utilizing your expertise to develop self-healing, automated systems that proactively address potential issues. (Don't worry, we have a fair on-call rotation and compensate with time off!)
  • Define and monitor Service Level Objectives (SLOs) and mission-critical metrics, ensuring our systems meet the highest reliability standards.
  • Develop incident response playbooks and build robust monitoring and alerting capabilities to ensure swift and efficient issue resolution.
  • Work hand-in-hand with core blockchain, middleware, and applications teams to guarantee the security and high availability of our services.
  • Engage with our globally distributed team, participate in open-source projects, and connect with the vibrant NEAR community.

What We're Looking For

  • You can clearly articulate technical concepts to both technical and non-technical audiences.
  • You excel at collaborating within distributed teams and fostering a positive, inclusive environment.
  • You believe in the power of collaboration and shared knowledge.
  • Proficiency in Python, deep understanding of UNIX internals, and experience with cloud provisioning (Terraform, Packer, Ansible, and Docker), monitoring (Grafana, Prometheus, and Datadog), and CI/CD tools (GitHub Actions, Buildkite, and Jenkins).
  • You approach challenges with a proactive, solution-oriented mindset.
  • 7+ years of experience in Site Reliability, DevOps, or Platform Engineering, managing large-scale distributed systems.
  • Strong automation and tooling experience.
  • A Bachelor's degree in Computer Science or a related field is required.

We'd Love If You Have

  • Experience with Rust and/or Go, familiarity with multiple cloud providers (AWS, Azure, GCP), and knowledge of Kubernetes, Helm, and GitOps.
  • A basic understanding of blockchain technology will help you quickly grasp the unique challenges and opportunities in the Web3 space.

Here’s What Our Interview Process Looks Like

Our interviews take place via Zoom and typically consist of the following stages:

  • Recruiter Call
  • Hiring Manager Call
  • 1st Round
    • Coding Interview in Python or Go
    • DevOps Troubleshooting Interview

  • Final Round
    • Large System Design Interview
    • Pagoda Values Interview


Benefits & Perks

  • Encouraged 20 days of flexible PTO per year, plus your local holidays
  • Paid Holiday Week: the last week of the year
  • Paid Wellness Week: week of choice in July or August
  • 100% paid medical, dental, and vision, AD&D, and life insurance for US employees, including 85% coverage for dependents, and HSA + FSA options; for non-US employees, 100% paid private medical coverage available at the highest tiered plan
  • Access to licensed therapists and mental health resources through Spill, 100% confidential and paid by Pagoda; plus $75 monthly reimbursement for wellness
  • Generous parental leave options; all employees have access to $10,000 in fertility assistance through Carrot
  • For US employees, 401(k) retirement plan available (no match)
  • Annual company retreats and team offsites (2023 was in Spain; 2022 in Portugal)
  • $2,000 Continued Education Reimbursement
  • $2,000 Home Office Reimbursement
  • Co-working Space Reimbursement

Our Values at Pagoda

Our values express our company culture. Learn more on our careers page.

Pagoda is an Equal Employment Opportunity (EEO) employer and welcomes all qualified applicants. Applicants will receive fair and impartial consideration without regard to race, sex, color, religion, national origin, age, disability, veteran status, genetic data, or other legally protected status.

Global Data Privacy Notice for Job Candidates and Applicants

Information collected and processed as part of your Pagoda Careers profile, and any job applications you choose to submit, is subject to our Privacy Policy. By submitting your application, you are agreeing to our use and processing of your data as required.

What does a Reliability Engineer do?

A Reliability Engineer is a professional who is responsible for ensuring the reliability and availability of systems and equipment in an organization.

They use their knowledge of engineering principles, statistical analysis, and data science to identify and mitigate risks, prevent failures, and optimize system performance.

Here are some of the typical tasks and responsibilities of a Reliability Engineer:

  1. Analyze data and perform statistical modeling: Reliability Engineers analyze data related to equipment performance, failure rates, and maintenance history to identify trends and patterns. They use statistical modeling to predict future failures and plan maintenance activities accordingly.
  2. Develop and implement reliability strategies: Reliability Engineers develop and implement strategies to improve the reliability and availability of equipment and systems. This may include performing root cause analysis, implementing preventive maintenance programs, and conducting failure mode and effects analysis (FMEA).
  3. Collaborate with other teams: Reliability Engineers collaborate with other teams such as operations, maintenance, and engineering to identify and address reliability issues. They may also work with suppliers to ensure the reliability of equipment and materials.
  4. Monitor and evaluate performance: Reliability Engineers monitor the performance of systems and equipment to identify areas for improvement. They use data to evaluate the effectiveness of reliability strategies and make adjustments as necessary.
  5. Provide technical support: Reliability Engineers provide technical support to other teams and stakeholders, answering questions and providing guidance on reliability-related issues.
  6. Continuously improve processes: Reliability Engineers are responsible for continuously improving reliability processes and methodologies. They stay up-to-date with the latest technologies and best practices in the field and identify opportunities for improvement.
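
As a rough illustration of the statistical side of the work described above, the sketch below computes three common reliability figures from failure records: mean time between failures (MTBF), steady-state availability, and the probability of running a given time without failure. All input figures and function names are hypothetical, and the constant-failure-rate (exponential) model is an assumption made for simplicity, not a method taken from any listing on this page.

```python
import math


def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Mean time between failures: total operating time / observed failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return operating_hours / failures


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability: MTBF / (MTBF + mean time to repair)."""
    return mtbf / (mtbf + mttr)


def survival_probability(t_hours: float, mtbf: float) -> float:
    """Probability of no failure within t_hours, assuming a constant
    failure rate (exponential model): R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf)


# Hypothetical maintenance history: 8,000 operating hours, 4 failures,
# 5 hours to repair on average.
mtbf = mtbf_hours(operating_hours=8_000, failures=4)
avail = availability(mtbf, mttr=5.0)
p_500 = survival_probability(500, mtbf)

print(f"MTBF: {mtbf:.0f} h")
print(f"Availability: {avail:.4%}")
print(f"P(no failure in 500 h): {p_500:.3f}")
```

Real equipment often has failure rates that change with age, so a Weibull or similar model may fit better than the exponential assumption used here.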