Staff Site Reliability Engineer
Our Mission
Healthcare should work for patients, but it doesn’t. In their time of need, they call down outdated insurance directories. Then wait on hold. Then wait weeks for the privilege of a visit. Then wait in a room solely designed for waiting. Then wait for a surprise bill. In any other consumer industry, the companies delivering such a poor customer experience would not survive. But in healthcare, patients lack market power. Which means they are expected to accept the unacceptable.
Zocdoc’s mission is to give power to the patient. To do that, we’ve built the leading healthcare marketplace that makes it easy to find and book in-person or virtual care in all 50 states, across 200+ specialties and 12k+ insurance plans. By giving patients the ability to see and choose, we give them power. In doing so, we can make healthcare work like every other consumer sector, where businesses compete for customers, not the other way around. In time, this will drive quality up and prices down.
We’re 18 years old and the leader in our space, but we are still just getting started. If you like solving important, complex problems alongside deeply thoughtful, driven, and collaborative teammates, read on.
As a Staff Site Reliability Engineer (SRE) at Zocdoc, you will shape how we operate safe, observable, and scalable systems across the company. You’ll lead initiatives that improve incident response, define reliability patterns, and drive organization-wide operational excellence—helping us build systems that fail gracefully, recover quickly, and scale efficiently.
You won’t just respond to incidents—you’ll help design the systems, tools, and practices that teams rely on to avoid them. Your work will clarify ownership, improve on-call quality, and strengthen our observability posture. By embedding best practices into how we build and run services, you’ll enable every engineering team at Zocdoc to move faster, safer, and with greater confidence.
You’ll enjoy this role if you:
- Stay composed and clear during incidents, and use them as catalysts for systemic improvement
- Treat observability as a strategic capability that enables better decisions, not just better dashboards
- Build scalable, default-safe patterns and tools that support resiliency and reliability
- Build strong cross-functional relationships and navigate complex systems to drive scalable, reliable outcomes
- Are endlessly curious—about how systems fail, how teams operate, and how to make both better
- Share knowledge generously and help others build with confidence and operational rigor
What you’ll do:
- Participate in and influence high-impact incident response efforts, contributing calm decision-making and retrospective-driven learning
- Define and evolve org-wide incident practices, retrospectives, and reliability tooling
- Architect and evolve observability platforms that offer actionable insight into system health, business-critical paths, and failure modes
- Lead the development of reliability and observability practices, including alerting hygiene, SLOs, and deployment safeguards
- Guide teams in building resilient, fault-tolerant services through consultative design, operational reviews, and safety-focused defaults
- Partner with Product, Platform, and Security teams to ensure new systems are operable and scalable from day one
- Design and implement internal tools that improve deployment safety, incident coordination, and production readiness
- Mentor engineers across teams in operational rigor, reliability principles, and system debugging
What you’ll bring:
- 8+ years of experience operating and scaling production infrastructure in cloud-native environments
- Deep expertise in incident response, debugging distributed systems, and driving reliability improvements
- Strong working knowledge of observability stacks (metrics, logs, traces), alerting strategy, and SLO design
- Experience implementing fault isolation, graceful degradation, and chaos engineering practices
- Proficiency with infrastructure-as-code and config management (e.g., Terraform, CDK, etc.)
- A proven ability to influence teams through standards, tooling, and culture—not just code
- A growth mindset and strong communication skills for mentoring, influencing, and aligning across teams
What We Offer
Listing Details
- First seen: April 3, 2026
- Last seen: April 27, 2026
Zocdoc is an online medical care scheduling service that allows people to find and book in-person or telemedicine appointments for medical or dental care. It also functions as a physician and dentist rating and comparison database.