avalara

Lead Site Reliability Engineer

Romania, Remote, Lead
Engineering, DevOps Engineer


Responsibilities


As Avalara continues to scale its global SaaS platform and accelerate toward an AI-first operating model, we must fundamentally transform how reliability, deployment, and operational excellence are engineered.

This role exists to design and lead an enterprise-grade, AI-driven reliability ecosystem, enabling self-healing systems, intelligent observability, and low-risk deployment practices across multi-cloud environments. This includes modernizing RELE practices through automation, feature flag–driven deployments, and AI-powered operational workflows.

This is a high-impact individual contributor role responsible for reducing operational risk, eliminating manual toil, improving system resilience, and enabling faster, safer product delivery at scale.

This role strengthens Avalara’s reliability engineering and platform operations capability by introducing AI-driven, automation-first reliability practices.

This Lead Reliability Engineer will:

  • Improve platform stability and reduce incidents through AIOps, predictive monitoring, and self-healing systems
  • Accelerate deployment velocity and reduce risk through progressive delivery, feature flag strategies, and CI/CD optimization
  • Enhance customer experience by improving availability, performance, and recovery times
  • Increase operational efficiency by eliminating manual processes and introducing intelligent automation workflows
  • Advance Avalara’s AI-first strategy by embedding agentic AI into observability, incident response, and reliability engineering

Responsibilities


As a Bar Raiser, this role is expected to elevate the performance of the entire reliability engineering function:

  • Hold high standards for availability, reliability, automation, and operational excellence
  • Use metrics such as MTTR, SLI/SLO/SLA adherence, deployment success rate, and incident reduction to drive decisions
  • Simplify complex distributed systems into scalable, resilient, and automated platforms
  • Mentor engineers and raise technical rigor, automation maturity, and AI adoption
  • Challenge assumptions and drive measurable improvements in system reliability and deployment safety
  • Leave every system, process, and platform more resilient and scalable than before

This role does not just operate systems—it redefines how reliability is engineered at scale.

  • Own the end-to-end reliability strategy for distributed SaaS systems across multi-cloud environments
  • Design and implement AI-driven operations (AIOps) including anomaly detection, predictive failure analysis, and automated root cause identification
  • Build and scale observability platforms using Prometheus, Grafana, OpenTelemetry, and ML-based analytics
  • Architect self-healing systems and automation frameworks to eliminate manual operational toil
  • Lead modernization of deployment practices through feature flags, progressive delivery, and safe rollout strategies
  • Drive reliability improvements across Kubernetes-based container platforms
  • Own reliability of CI/CD pipelines and infrastructure as code (Terraform/Pulumi)
  • Design deployment strategies that reduce risk, including:
    • Feature flag–based releases
    • Canary and progressive rollout models
    • Automated rollback and kill-switch capabilities
  • Improve deployment observability and traceability across environments
  • Ensure high availability, scalability, and fault tolerance of production systems
  • Implement advanced monitoring, logging, and tracing systems across services
  • Integrate agentic AI workflows into incident detection, triage, and resolution
  • Build automation pipelines using Go, Python, and modern workflow tools
  • Enable AI-assisted observability, including:
    • Intelligent alerting
    • Automated diagnostics
    • Performance optimization insights
  • Drive adoption of automation-first and AI-first operational practices
  • Lead incident response and on-call readiness for production systems
  • Improve incident resolution time and system recovery through automation
  • Conduct post-incident reviews and implement systemic improvements
  • Communicate clearly with stakeholders and customers during incidents

Within the first 12 months, this role will have:

  • Reduced MTTR by 30–50% through automation and AI-driven diagnostics
  • Decreased production incidents and customer impact events
  • Implemented AI-driven observability and alerting systems across core platforms
  • Enabled feature flag–based deployment strategies across engineering teams
  • Delivered self-healing automation workflows that significantly reduce manual intervention
  • Increased deployment frequency with lower failure and rollback rates
  • Elevated team capability through mentorship, standards, and AI adoption

As an AI-first company, Avalara expects this role to embed AI into reliability engineering practices. This role will:

  • Design and implement AI-driven operational workflows for incident detection and resolution
  • Use AI to predict failures, analyze system behavior, and optimize performance
  • Build or integrate AI-powered observability assistants and diagnostics tools
  • Identify high-value AI use cases tied to reliability, efficiency, and customer impact
  • Apply AI responsibly with strong governance, security, and data considerations
  • Elevate AI adoption across teams by sharing best practices and driving measurable outcomes

This role must demonstrate applied AI impact, not just familiarity.

  • B.S. in Computer Science or Engineering
  • 10+ years of experience in SaaS, distributed systems, or reliability engineering
  • Strong programming experience in Go, Java, and Python
  • Deep expertise in observability tools (Prometheus, Grafana, OpenTelemetry, etc.)
  • Experience with multi-cloud platforms (AWS, GCP, Azure/OCI)
  • Strong knowledge of Kubernetes, Docker, and container orchestration
  • Advanced understanding of Linux systems, networking (TCP/IP, DNS), and cloud-native architecture
  • Experience with Infrastructure as Code and CI/CD pipelines
  • Familiarity with AI/ML-driven operations and automation workflows
  • Proven ability to operate as a self-starter and drive complex initiatives independently
  • Strong communication and documentation skills
  • Willingness to participate in on-call rotation for production systems

AI is embedded in our workflows, decision-making, and products. Success here requires embracing AI as an essential capability.

  • You’ll bring experience with AI and AI-related technologies and be ready to thrive here.

  • You’ll apply AI every day to business challenges - improving efficiency, contributing solutions, and driving results for your team, our company, and our customers.

  • You’ll grow with AI by staying curious about new trends and best practices, and by sharing what you learn so others can benefit too.

Total Rewards 

In addition to a great compensation package, paid time off, and paid parental leave, many Avalara employees are eligible for bonuses. 

Health & Wellness 
Benefits vary by location but generally include private medical, life, and disability insurance. 

Inclusive culture and diversity
Avalara strongly supports diversity, equity, and inclusion, and is committed to integrating them into our business practices and our organizational culture. We also have a total of 8 employee-run resource groups, each with senior leadership and exec sponsorship. 

Requirements


We’re defining the relationship between tax and tech.

We’ve already built an industry-leading cloud compliance platform, processing over 54 billion customer API calls and over 6.6 million tax returns a year. Our growth is real: we're a billion-dollar business, and we’re not slowing down until we’ve achieved our mission to be part of every transaction in the world.

We’re bright, innovative, and disruptive, like the orange we love to wear. It captures our quirky spirit and optimistic mindset. It shows off the culture we’ve designed, that empowers our people to win. We’ve been different from day one. Join us, and your career will be too.

Supporting diversity and inclusion is a cornerstone of our company — we don’t want people to fit into our culture, but to enrich it. All qualified candidates will receive consideration for employment without regard to race, color, creed, religion, age, gender, national orientation, disability, sexual orientation, US Veteran status, or any other factor protected by law. If you require any reasonable adjustments during the recruitment process, please let us know.

Location & Eligibility

Where is the job
Romania
Remote within one country
Who can apply
RO

Listing Details

Posted
May 12, 2026
First seen
May 12, 2026
Last seen
May 12, 2026
