ukg

Staff Site Reliability Engineer- Eng

lead

Quick Summary

Overview

Engage in and improve the lifecycle of services from conception to end-of-life, including system design reviews, capacity planning, and production readiness.

Responsibilities

Engage in and improve the lifecycle of services from conception to end-of-life, including system design reviews, capacity planning, and production readiness.
Define and implement standards and best practices for system architecture, service delivery, reliability, and automation, including the definition and monitoring of service health indicators (latency, traffic, error rates, and resource saturation), service level objectives (SLOs), and the use of error budgets to guide operational and delivery decisions.
Support service, product, and engineering teams by providing common tooling and frameworks to increase availability and improve incident detection and response.
Improve system performance, availability, and efficiency through automation, process refinement, post-incident reviews, and in-depth configuration analysis.
Collaborate closely with engineering teams across the organization to deliver and operate reliable services.
Increase operational efficiency, effectiveness, and service quality by treating operational challenges as software engineering problems (reducing toil).
Guide junior team members and serve as a champion for Site Reliability Engineering best practices.
Actively participate in incident responses, including on-call rotations and post-incident reviews, collaborating with engineering teams to restore service and reduce recurrence.
Partner with stakeholders to influence and help drive the best possible technical and business outcomes.

Qualifications

5+ years of hands-on experience in software engineering, systems engineering, or cloud-based environments.
5+ years of experience working with public cloud platforms (e.g., GCP (preferred), AWS, or Azure).
5+ years of experience configuring, operating, and maintaining applications and/or systems infrastructure in a large-scale, customer-facing environment.
Demonstrated understanding of observability best practices, including metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing.
Experience coding in one or more higher-level programming languages (e.g., Python, Java, or C++).
Strong working knowledge of Linux systems, including troubleshooting, performance analysis, and scripting in production environments.
Experience with GitHub Actions and modern CI/CD practices.
Experience building operational dashboards and alerts using observability tools such as Splunk or Grafana.
Excellent communication and collaboration skills, with experience of mentoring and guiding engineers.
Hands-on experience with cloud-native applications and containerization technologies (Kubernetes, containers).
Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).
Experience operating production workloads in Google Cloud Platform (GCP).
Solid grounding in at least two of the following areas: Computer Science fundamentals, Cloud Architecture, Security, or Network Design.

Location & Eligibility

Where is the job
Location terms not specified

Listing Details

Posted
June 22, 2026
First seen
June 22, 2026
Last seen
June 22, 2026

Posting Health

Days active
0
Repost count
0
Trust Level
51%
Scored at
June 22, 2026

Signal breakdown

freshness, source trust, content trust, employer trust