USD 210,000-270,000/yr

Staff Site Reliability Engineer

Silicon Valley | Lead


Our Mission

Healthcare should work for patients, but it doesn’t. In their time of need, they call down outdated insurance directories. Then wait on hold. Then wait weeks for the privilege of a visit. Then wait in a room solely designed for waiting. Then wait for a surprise bill. In any other consumer industry, the companies delivering such a poor customer experience would not survive. But in healthcare, patients lack market power. Which means they are expected to accept the unacceptable.

Zocdoc’s mission is to give power to the patient. To do that, we’ve built the leading healthcare marketplace that makes it easy to find and book in-person or virtual care in all 50 states, across 200+ specialties and 12k+ insurance plans. By giving patients the ability to see and choose, we give them power. In doing so, we can make healthcare work like every other consumer sector, where businesses compete for customers, not the other way around. In time, this will drive quality up and prices down.

We’re 18 years old and the leader in our space, but we are still just getting started. If you like solving important, complex problems alongside deeply thoughtful, driven, and collaborative teammates, read on.

As a Staff Site Reliability Engineer (SRE) at Zocdoc, you will shape how we operate safe, observable, and scalable systems across the company. You’ll lead initiatives that improve incident response, define reliability patterns, and drive organization-wide operational excellence—helping us build systems that fail gracefully, recover quickly, and scale efficiently.

You won’t just respond to incidents—you’ll help design the systems, tools, and practices that teams rely on to avoid them. Your work will clarify ownership, improve on-call quality, and strengthen our observability posture. By embedding best practices into how we build and run services, you’ll enable every engineering team at Zocdoc to move faster, safer, and with greater confidence.

You’ll enjoy this role if you:

Stay composed and clear during incidents, and use them as catalysts for systemic improvement
Treat observability as a strategic capability that enables better decisions, not just better dashboards
Build scalable, default-safe patterns and tools that support resiliency and reliability
Build strong cross-functional relationships and navigate complex systems to drive scalable, reliable outcomes
Are endlessly curious—about how systems fail, how teams operate, and how to make both better
Share knowledge generously and help others build with confidence and operational rigor

Responsibilities

Participate in and influence high-impact incident response efforts, contributing calm decision-making and retrospective-driven learning
Define and evolve org-wide incident practices, retrospectives, and reliability tooling
Architect and evolve observability platforms that offer actionable insight into system health, business-critical paths, and failure modes
Lead the development of reliability and observability practices, including alerting hygiene, SLOs, and deployment safeguards
Guide teams in building resilient, fault-tolerant services through consultative design, operational reviews, and safety-focused defaults
Partner with Product, Platform, and Security teams to ensure new systems are operable and scalable from day one
Design and implement internal tools that improve deployment safety, incident coordination, and production readiness
Mentor engineers across teams in operational rigor, reliability principles, and system debugging

Qualifications

8+ years of experience operating and scaling production infrastructure in cloud-native environments
Deep expertise in incident response, debugging distributed systems, and driving reliability improvements
Strong working knowledge of observability stacks (metrics, logs, traces), alerting strategy, and SLO design
Experience implementing fault isolation, graceful degradation, and chaos engineering practices
Proficiency with infrastructure-as-code and config management (e.g., Terraform, CDK)
A proven ability to influence teams through standards, tooling, and culture—not just code
A growth mindset and strong communication skills for mentoring, influencing, and aligning across teams

What We Offer


✓ Flexible, hybrid work environment
✓ Unlimited vacation
✓ 100% paid employee health benefit options (including medical, dental, and vision)
✓ Commuter benefits
✓ 401(k) with employer-funded match
✓ Corporate wellness program with Wellhub
✓ Sabbatical leave (for employees with 5+ years of service)
✓ Competitive paid parental leave and fertility/family planning reimbursement
✓ Cell phone reimbursement
✓ Catered lunch every day, along with beverages and snacks
✓ Employee Resource Groups and ZocClubs to promote shared community and belonging
✓ Great Place to Work Certified

Listing Details

First seen: April 3, 2026
Last seen: April 27, 2026

Posting Health

Days active: 23
Repost count: 0
Trust Level: 41%
Scored at: April 27, 2026


Zocdoc

greenhouse

Zocdoc is an online medical care scheduling service that allows people to find and book in-person or telemedicine appointments for medical or dental care. It also functions as a physician and dentist rating and comparison database.

Employees: 750
Founded: 2007
Domain: zocdoc.com

