Software Development Engineer

Quick Summary

Overview

Location Details: At GoDaddy the future of work looks different for each team. Some teams work in the office full-time, others have a hybrid arrangement (they work remotely some days and in the office some days) and some work entirely remotely.

Technical Tools

Ansible, Grafana, JavaScript, Prometheus, Python, CI/CD, Linux, performance optimization

This is a remote position, so you’ll work from home. You may occasionally visit a GoDaddy office to meet with your team for events or meetings.

GoDaddy is looking for a Software Development Engineer / Site Reliability Engineer to join our Monitoring and Observability team. In this hybrid SDE+SRE role, you'll design and build scalable software solutions while also owning the reliability, performance, and availability of systems serving millions of customers worldwide. You'll focus on developing high-quality applications and platforms that enable proactive monitoring, deep insights, and rapid troubleshooting — and you'll go a step further by operating those platforms, responding to incidents, and driving continuous reliability improvements across cloud and on-prem environments.

Responsibilities

Design, develop, and maintain scalable observability and monitoring platforms using Python and modern software engineering practices, including systems for metrics, logging, tracing, and visualization (e.g., the LGTM stack of Loki, Grafana, Tempo, and Mimir, plus Prometheus, Icinga 2, Site24x7, and BigPanda).
Build and enhance production-grade software services, APIs, and tooling that improve system visibility, reliability, and developer experience.
Collaborate with cross-functional teams to define requirements, architect solutions, and deliver robust, maintainable code.
Develop automation and self-service tools to streamline workflows and improve engineering productivity.
Implement and evolve infrastructure-as-code and configuration management using tools such as Terraform, Ansible, Puppet, or Chef.
Manage and troubleshoot containerized workloads across Docker, Kubernetes (including EKS), ECS, and Fargate, ensuring configuration consistency and operational reliability.
Contribute to system design, code reviews, testing strategies, and performance optimization for large-scale distributed systems.
Support and enhance CI/CD pipelines, ensuring efficient, high-quality software delivery.

Implement SLIs, SLOs, and error budgets to define and track service health and reliability targets, balancing reliability with feature velocity.
Build and maintain dashboards and alerting that provide actionable insights and minimize alert fatigue; champion SLO-based alerting and noise reduction.

Respond to automated alerts and production incidents, participating in on-call rotations supporting global operations.
Partner with engineering teams to resolve availability, performance, and security issues.
Lead blameless postmortems and root cause analysis (RCA), converting findings into durable fixes, runbooks, and repeatable automation.
Troubleshoot complex system issues using advanced diagnostics (e.g., strace, tcpdump, systemd) and partner with reliability and infrastructure teams to improve application resilience and performance.

Requirements

3+ years of professional experience in software development, building and delivering scalable, production-grade applications or platforms.
3+ years of experience with observability platforms (metrics, logging, tracing, and visualization).
3+ years of experience with event correlation or incident management platforms (e.g., BigPanda, Site24x7, ServiceNow, PagerDuty).
2+ years of hands-on incident response experience, including on-call participation and postmortem facilitation.
2+ years of professional experience with containerization and orchestration technologies in a production SRE context.
Strong programming experience in Python (and/or JavaScript, Go, or similar languages) with a focus on writing clean, maintainable, and testable code.
Experience designing and building distributed systems, APIs, or developer platforms.
Familiarity with observability concepts (metrics, logging, tracing) and tools such as OpenTelemetry (OTel), the LGTM stack, Prometheus, or similar.
Solid understanding of Linux/Unix environments and full-stack engineering, including debugging and performance optimization from an application perspective.
Experience with CI/CD pipelines, version control systems, and modern development workflows.
Exposure to containerization and orchestration technologies (e.g., Docker, Kubernetes).
Experience building internal tools, platforms, or services that improve developer productivity or system reliability.
Strong problem-solving skills and ability to debug complex issues in distributed systems.
Configuration management experience with tools such as Ansible, Puppet, Chef, or SaltStack.
Practical understanding of SLIs, SLOs, SLAs, and error budgets as reliability engineering concepts.
Experience writing and maintaining runbooks, SOPs, and operational documentation to ensure knowledge continuity.

Nice to Have

Experience building platforms or SDKs for observability or monitoring.
Deep, hands-on expertise with cloud platforms (AWS, Azure, GCP) and cloud-native application design.
Familiarity with infrastructure-as-code practices or DevOps tooling.
Experience with capacity planning, forecasting, and cost governance for large-scale cloud infrastructure.
Familiarity with compliance and audit-ready operations (e.g., PCI-DSS, WebTrust).
Passion for mentoring junior engineers and driving a culture of reliability and continuous improvement.