Senior Software Engineer - Core Team

United States · Austin · Full-Time · Senior

Software Engineer · Software Engineering

About Userpilot

Userpilot is a leading product analytics and engagement platform. Hundreds of product teams use us to understand, segment, and activate their users in real time. Under the hood, that's a distributed Elixir/Phoenix backend sustaining hundreds of thousands of concurrent WebSocket connections, high-throughput Kafka event ingestion, ClickHouse analytics at scale, and always-on content delivery.

We move fast, we ship often, and we believe the best engineers care as much about how the whole system holds together as about the feature in front of them.

The Role

This is the most senior individual-contributor engineering role at Userpilot, and it is a different kind of role. Core Team engineers are the closest thing we have to software architects. They don't own a single feature area; they own how the system fits together, how it behaves under load, and how it recovers when something breaks.

They are a rare breed: equally at home in a Terraform module, an application lifecycle, a high-volume database query plan, and an architecture review. They set the technical direction the rest of engineering builds on, they are the first responders when production is on fire, and they design the guardrails that stop a class of problem from ever happening twice. Application squads move fast on features precisely because the Core Team keeps the ground underneath them solid.

And they do all of this in an AI-native way. Coding agents extend their reach across the stack, but the judgment about what is safe, what will scale, and what must never break stays with them.

Where You'll Have Impact

Technical direction and system design. Decide how non-trivial work should be built before a squad writes the first line. Write the ADRs, choose the patterns, and treat durability, extensibility, robustness, observability, and scalability as properties of the system rather than afterthoughts bolted on later.
Scale and reliability. Keep a distributed, real-time system healthy as traffic grows: event pipelines from Kafka into ClickHouse, real-time delivery over hundreds of thousands of connections, caching, backpressure, and the failure modes that only appear at scale or during a deploy.
Firefighting and incident response. Be the first call when production breaks. Diagnose under pressure, restore service, find the real root cause, and then turn that incident into a guardrail so the squads don't keep hitting it.
Infrastructure and foundations. Own infrastructure provisioning end to end: AWS (EKS, EC2, S3, RDS) and the Terraform and Kubernetes that tie it together. This is one of the things you do, not the whole job.
Enabling the squads. Raise the architectural bar across teams you don't manage. Review for architectural consistency, drive adoption of patterns that actually stick, and keep application engineers focused on shipping product.
Agentic engineering infrastructure. Make the system safe for a team that ships with AI agents: CI/CD quality gates every PR must pass regardless of author, AGENTS.md and runbooks that teach agents the topology and operational constraints, and Infrastructure as Code clean enough that an agent's change proposal is safe to reason about.

What You'll Do

Lead system design for cross-cutting and high-risk work, and write and shepherd ADRs the org actually follows.
Partner with application squads to turn product requirements into designs that hold up under load and over time, then get out of their way.
Own production reliability: monitoring, alerting, and on-call practices that surface real problems without drowning the team in noise (Grafana, Prometheus, CloudWatch).
Be first-in on incidents: run the diagnosis, coordinate the fix, write the postmortem, and ship the change that prevents a recurrence.
Design, provision, and operate infrastructure on AWS with Terraform and Kubernetes, with high availability and cost both in mind.
Build and improve CI/CD pipelines and validation gates that make every change trustworthy, whether a human or an agent wrote it.
Write the technical context (ADRs, runbooks, AGENTS.md) that makes the system understandable to new engineers and safe for AI tools.
Keep an eye on infrastructure cost and find the optimizations that actually matter.
Provide technical direction and mentorship across the engineering org.

What We're Looking For

Required

Senior experience designing and operating distributed systems in production, with a track record of being the person who owns how the whole system fits together.
Strong software-engineering and CS fundamentals (data structures, algorithms, system design). You can go deep in application and backend code, not just infrastructure.
Architectural judgment: you reason explicitly about durability, extensibility, robustness, observability, and scalability and the tradeoffs between them, and can write an ADR others can follow.
Distributed-systems instincts: you can break down a complex system to find its failure modes, bottlenecks, and the one change that actually moves the needle.
Calm, methodical incident response: you root-cause under pressure and instinctively turn an incident into prevention.
Hands-on infrastructure: AWS (EKS, EC2, S3, RDS) and the networking that connects them, production Kubernetes and Docker (operating clusters, not just deploying to them), and solid Terraform / Infrastructure as Code.
Observability in practice: Grafana, Prometheus, CloudWatch, and alerting that signals real problems.
Strong communication and influence: this role touches every team, and you drive adoption of patterns across people who don't report to you.
An AI-native workflow: you use AI coding agents (Claude Code, Cursor) as a real part of how you work, and you have a point of view on how to review and trust their output.

Bonus Points

Elixir, Erlang, or other BEAM systems (our backend runs on Elixir/Phoenix) and OTP patterns: supervision trees, GenServers, distribution.
Scaling highly available distributed systems in a fast-moving product environment.
Kafka, RabbitMQ, ClickHouse, Broadway, or similar high-throughput data tooling (we run both Kafka and RabbitMQ).
Building and operating CI/CD that supports high-frequency deployments.
Cloud cost optimization through caching, right-sizing, or more efficient data processing.
Experience as a tech lead, staff engineer, or architect setting direction for an engineering org.
A point of view on the trust model for automated and agent-generated change: automated PRs, agent-triggered deploys, and the gates that make them safe.
Interest in AI-powered observability: anomaly detection, automated runbook execution, or self-healing infrastructure.
Writing technical context documentation (runbooks, ADRs, AGENTS.md-style files) that makes systems understandable to the people and agents joining them.

Our Stack

Cloud: AWS (EKS, EC2, S3, RDS, CloudFront)
Orchestration: Kubernetes, Docker, Terraform
Backend: Elixir / Phoenix, OTP
Data: ClickHouse (analytics), MySQL (primary)
Messaging: Kafka, RabbitMQ, Broadway
Observability: Grafana, Prometheus, CloudWatch
CI/CD: GitHub Actions
AI: Claude Code / Cursor for agentic development; AGENTS.md, CLAUDE.md, and Infrastructure as Code as shared context for humans and agents alike

What “Agentic Engineering” Means Here

We are shifting toward spec-driven, AI-assisted development, and the Core Team is what makes that safe.

Every PR, human or agent, passes the same quality gates. Our CI/CD has to be reliable, fast, and unambiguous in its feedback, regardless of who (or what) wrote the change.
Agents need to understand where they're operating. We maintain AGENTS.md and operational context so an agent doesn't make a dangerous assumption about topology, service contracts, or operational constraints.
Infrastructure as Code is the single source of truth, for humans and for agents proposing changes. The cleaner and more expressive it is, the safer agent-assisted work becomes.
Agents do a lot of the typing; the Core Team owns the architecture, the judgment, and the boundaries that keep fast-moving, non-deterministic development from compounding into risk.

You don't need to have built agentic infrastructure before. But you should find the challenge genuinely interesting.

EEO Statement

Userpilot is an equal opportunity employer. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other characteristic protected by applicable law. All qualified applicants will receive consideration for employment.

Visa/Work Authorization

Applicants must be legally authorized to work in the United States. We are not able to sponsor or take over sponsorship of an employment visa at this time.