Staff Site Reliability Engineer - Eng Lead

Responsibilities

Engage in and improve the lifecycle of services from conception to end-of-life, including system design reviews, capacity planning, and production readiness.

Define and implement standards and best practices for system architecture, service delivery, reliability, and automation. This includes defining and monitoring service health indicators (latency, traffic, error rates, and resource saturation) and service level objectives (SLOs), and using error budgets to guide operational and delivery decisions.

Support service, product, and engineering teams by providing common tooling and frameworks that increase availability and improve incident detection and response.

Improve system performance, availability, and efficiency through automation, process refinement, post-incident reviews, and in-depth configuration analysis.

Collaborate closely with engineering teams across the organization to deliver and operate reliable services.

Increase operational efficiency, effectiveness, and service quality by treating operational challenges as software engineering problems (reducing toil).

Guide junior team members and serve as a champion for Site Reliability Engineering best practices.

Actively participate in incident response, including on-call rotations and post-incident reviews, collaborating with engineering teams to restore service and reduce recurrence.

Partner with stakeholders to influence and help drive the best possible technical and business outcomes.

Qualifications

5+ years of hands-on experience in software engineering, systems engineering, or cloud-based environments.

5+ years of experience working with public cloud platforms (GCP preferred; AWS or Azure also considered).

5+ years of experience configuring, operating, and maintaining applications and/or systems infrastructure in a large-scale, customer-facing environment.

Demonstrated understanding of observability best practices, including metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing.

Experience coding in one or more higher-level programming languages (e.g., Python, Java, or C++).

Strong working knowledge of Linux systems, including troubleshooting, performance analysis, and scripting in production environments.

Experience with GitHub Actions and modern CI/CD practices.

Experience building operational dashboards and alerts using observability tools such as Splunk or Grafana.

Excellent communication and collaboration skills, with experience mentoring and guiding engineers.

Hands-on experience with cloud-native applications and containerization technologies (Kubernetes, containers).

Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).

Experience operating production workloads in Google Cloud Platform (GCP).

Solid grounding in at least two of the following areas: Computer Science fundamentals, Cloud Architecture, Security, or Network Design.